Introduction

Part 2 of the series. Part 1 covered the engine design, the FIX protocol, and the zero-allocation techniques in the hot path. This part covers the runtime layer: Java version selection, GC algorithm choice, JVM flag rationale, and benchmark results across HotSpot C2 and Graal JIT.


Why Java 25

Java 25 features that drove the version choice.

Compact Object Headers. Object headers shrink from 12 bytes to 8 bytes in production (opt-in via -XX:+UseCompactObjectHeaders in Java 25). For the engine’s object graph — a pool of 16 FixMessage instances, session state, and the encoder/decoder buffers — the absolute saving is small. The more meaningful effect is reduced cache-line pressure: 8-byte headers pack better into 64-byte cache lines, which is essential in tight loops over arrays of objects.

Generational ZGC as default. Generational ZGC became the default mode in Java 23, and the non-generational mode (together with the ZGenerational flag) was removed in Java 24 — so -XX:+UseZGC now gets generational ZGC without any additional flags. The sub-millisecond pause guarantee — regardless of heap size — is what makes ZGC viable for a latency-sensitive engine. G1’s stop-the-world pauses can reach low double-digit milliseconds under load; at 44K orders/second a 10ms pause stalls ~440 orders.

Virtual threads (stable since Java 21). The session loop and simulator accept loop run on virtual threads. Blocking socket.read() with SO_TIMEOUT=50ms parks the virtual thread and releases the carrier thread — no platform thread is held during the wait. This has no hot-path allocation impact but keeps resource consumption flat as session count scales.
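
A minimal sketch of that pattern, with a BlockingQueue poll standing in for the real socket.read() with SO_TIMEOUT=50ms (the thread name and message strings are illustrative, not the engine's actual API):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class VirtualSessionSketch {
    // poll(50 ms) plays the role of socket.read() with SO_TIMEOUT=50ms:
    // on a virtual thread the timed wait parks the virtual thread and
    // releases its carrier — no platform thread is held while idle.
    public static String readOnce(BlockingQueue<String> inbound) throws InterruptedException {
        BlockingQueue<String> result = new ArrayBlockingQueue<>(1);
        Thread session = Thread.ofVirtual().name("stress-session").start(() -> {
            try {
                String msg = inbound.poll(50, TimeUnit.MILLISECONDS);
                result.add(msg == null ? "timeout" : msg);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        session.join();
        return result.poll();
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> inbound = new LinkedBlockingQueue<>();
        inbound.put("ExecReport");
        System.out.println(readOnce(inbound)); // prints "ExecReport"
    }
}
```

The same loop on a platform thread would pin an OS thread for the full wait; on a virtual thread the carrier is free to run other sessions, which is why resource consumption stays flat as session count scales.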


GC Selection

Two GCs are used across the test suite, each for a different purpose.

Task                            GC          Heap     Rationale
Stress / throughput benchmarks  ZGC         2 GB     Sub-ms pauses; JVMCI support on GraalVM
OOM resistance test             EpsilonGC   1.5 GB   No collection — allocation is permanent
Latency tests                   ZGC         512 MB   Realistic working footprint

EpsilonGC as a correctness instrument. Epsilon never collects. Every heap allocation is a one-way ratchet — committed until process exit. This turns the OOM resistance test into a strict regression gate: if the client hot path allocates a single object per order, the 1.5 GB ceiling will be exhausted within minutes and the test fails with OutOfMemoryError. A passing run under EpsilonGC is a stronger statement than a ThreadMXBean snapshot — it proves zero allocation under sustained load at scale, not just at a point in time.
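
The exact flag set for the Epsilon runs is not listed in this post; a plausible invocation, assuming the 1.5 GB ceiling above (Epsilon is experimental, so it must be unlocked), would look like:

```shell
# Assumed flags for the OOM resistance run: EpsilonGC never collects,
# so the fixed 1.5 GB heap acts as a hard allocation budget.
java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC \
     -Xms1536m -Xmx1536m -XX:+AlwaysPreTouch \
     -cp <classpath> <test-main-class>
```

(Class path and entry point are placeholders; the point is the Epsilon and fixed-heap flags.)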

The contrast with ThreadMXBean.getThreadAllocatedBytes(): that API is updated at safepoints, not continuously. A sub-TLAB allocation might not be reported. EpsilonGC has no such blind spot.
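
For reference, a minimal sketch of the snapshot approach being contrasted here. The cast to com.sun.management.ThreadMXBean is the HotSpot-specific step that exposes getThreadAllocatedBytes:

```java
import java.lang.management.ManagementFactory;

public class AllocationProbe {
    private static final com.sun.management.ThreadMXBean TMX =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    // Approximate bytes allocated on the heap by the current thread so far.
    public static long allocatedSoFar() {
        return TMX.getThreadAllocatedBytes(Thread.currentThread().threadId());
    }

    public static void main(String[] args) {
        long before = allocatedSoFar();
        byte[] scratch = new byte[1 << 20];   // deliberately allocate ~1 MB
        long after = allocatedSoFar();
        System.out.println(scratch.length + " bytes requested, delta " + (after - before));
    }
}
```

A before/after pair of these snapshots, divided by order count, is exactly the per-order figure reported later in this post — useful, but still a point-in-time sample compared to Epsilon's process-lifetime ratchet.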


JVM Flags

--enable-preview
-Xms2g -Xmx2g
-XX:+UseZGC
-XX:+UseCompactObjectHeaders
-XX:+AlwaysPreTouch
-XX:ReservedCodeCacheSize=512m
-Xlog:gc*:stdout:time,uptime,level

-Xms2g -Xmx2g (fixed heap). Equal min and max eliminates heap resize pauses. ZGC still concurrently compacts and returns pages to the OS, but the committed size stays constant. Resize pauses are rare in practice but non-zero.

-XX:+UseCompactObjectHeaders. Opts into the 8-byte object headers described above; without the flag, headers stay at 12 bytes.

-XX:+AlwaysPreTouch. Pre-commits all heap pages at startup by touching every page in the committed region. Without it, first-touch page faults cause latency spikes during the early warmup phase as the OS maps physical memory on demand. With a 2 GB heap this adds seconds to startup — acceptable given that the benchmark warmup phase takes far longer.

-XX:ReservedCodeCacheSize=512m. The default 256 MB code cache can approach capacity under a 1M-order benchmark with both C1-compiled and C2-compiled methods resident. When the cache reaches capacity, the JIT stops compiling new methods and previously compiled methods may be deoptimised — throughput drops noticeably. 512 MB is generous enough to never fill during the test.

-Xlog:gc*:stdout:time,uptime,level. Unified JVM logging for GC events. Provides pause time, cause (allocation stall vs proactive), and heap occupancy at a glance. ZGC pauses logged here were consistently under 1ms throughout all benchmark runs.

Simulator isolation

The exchange simulator runs in a child JVM via ProcessBuilder. Its ExchangeOrder allocations (~368 bytes per order) are invisible to the client JVM. Without isolation, the OOM resistance test would exhaust the 1.5 GB ceiling in under two minutes from simulator-side allocations — a false positive that would mask client-side regressions.
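
A sketch of that isolation via ProcessBuilder, assuming the child inherits the parent's classpath (the main-class argument here is illustrative; the real simulator class name is not shown in this post):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SimulatorLauncher {
    // Launch a class in a child JVM so its allocations land in a separate heap.
    // jvmArgs lets the child carry its own GC/heap flags, independent of the parent.
    public static Process launch(String mainClassOrFlag, String... jvmArgs) throws IOException {
        List<String> cmd = new ArrayList<>();
        cmd.add(System.getProperty("java.home") + "/bin/java");
        cmd.addAll(List.of(jvmArgs));
        cmd.add("-cp");
        cmd.add(System.getProperty("java.class.path"));
        cmd.add(mainClassOrFlag);
        return new ProcessBuilder(cmd).inheritIO().start();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Smoke test: pass "--version" in place of a real simulator main class.
        Process child = launch("--version");
        System.out.println("child exit: " + child.waitFor());
    }
}
```

Because the child is a separate process, its ~368 bytes/order of ExchangeOrder allocations are invisible to the client JVM's heap accounting — which is the whole point for the Epsilon test.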


Benchmark Methodology

FixEngineStressTest runs 1,000,000 orders through the full client cycle:

  1. Warmup — 200,000 orders. C1 and C2 reach steady state. Not measured.
  2. Steady state — 800,000 orders. Throughput measured as orders / elapsed_ns * 1e9. Allocation on the sending thread sampled via com.sun.management.ThreadMXBean.getThreadAllocatedBytes(threadId) before and after, divided by order count.

The allocation assertion threshold is < 50 bytes/order (hard test failure). In practice, all JIT runs produce exactly 0 bytes/order on the sending thread.
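
Both steady-state metrics reduce to simple arithmetic; a sketch using the formulas quoted above:

```java
public class BenchmarkMath {
    // orders / elapsed_ns * 1e9, as in the steady-state throughput measurement.
    public static double ordersPerSecond(long orders, long elapsedNanos) {
        return orders / (double) elapsedNanos * 1e9;
    }

    // (after - before) ThreadMXBean allocation sample divided by order count.
    public static long bytesPerOrder(long allocatedBefore, long allocatedAfter, long orders) {
        return (allocatedAfter - allocatedBefore) / orders;
    }

    public static void main(String[] args) {
        // e.g. 800,000 steady-state orders in 14 s ≈ 57,143 orders/s
        System.out.printf("%.0f orders/s%n", ordersPerSecond(800_000, 14_000_000_000L));
        System.out.println(bytesPerOrder(5_000_000, 5_000_000, 800_000) + " bytes/order");
    }
}
```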

All tests run on the same development laptop over localhost. Results are single-run under normal multitasking load — representative of relative performance, not lab-grade reproducible in the strict sense.


HotSpot C2 vs Graal JIT

Both JVMs use the same flag set. GraalVM CE 25 substitutes the Graal compiler (via JVMCI) for HotSpot’s C2 as the top-tier JIT.

Four back-to-back sweeps — each sweep runs OpenJDK immediately followed by GraalVM in the same Gradle session to eliminate cross-session machine-state variance. One additional run was discarded (both runtimes degraded sharply — consistent with a transient load spike) and is not shown.

Sweep   OpenJDK 25 HotSpot C2   GraalVM CE 25 Graal JIT   GraalVM lead
1       57,226 /s               58,198 /s                 +1.7%
2       58,005 /s               59,303 /s                 +2.2%
3       57,684 /s               58,295 /s                 +1.1%
4       57,732 /s               57,810 /s                 +0.1%
range   57,226–58,005 /s        57,810–59,303 /s

Both JITs deliver ~57–59K orders/sec. Both produce 0 bytes/order on the sending thread — the zero-allocation design holds on both compilers.

GraalVM advantage: ~1–2% across clean sweeps — within measurement noise; the two JITs are effectively equivalent on this workload.

ZGC pauses were under 1ms on both JVMs throughout. No allocation stalls observed — consistent with 0 bytes/order.


OOM Resistance

oomResistanceTest runs EpsilonGC with a 1.5 GB ceiling. The test drives two phases:

Phase    Min orders   Min wall time   Purpose
Warmup   200 K        10 s            Drive HotSpot to C2-compile all hot paths
Steady   1 M          60 s            Assertion window — per-thread allocation measured

The test passed with no OutOfMemoryError over 1,093 seconds of steady state. Heap grew at ~3.1 MB/s — at 400s into the steady phase, ~1.24 GB was in use against the 1.5 GB ceiling.

Heap growth profile

The breakdown below shows where each byte comes from. The principle and source of the growth are the same across environments.

Heap used during the 60 s steady phase (EpsilonGC, 1.5 GB ceiling)

  MB
1200 |                                              ·
1100 |                                         ·
1000 |                                    ·
 900 |                               ·
 800 |                          ·
 700 |                     ·
 600 |                ·
 500 |           ·
 400 |      ·         ← ForkJoinPool JIT-compiler threads + Read-Poller
 300 |·
 200 |──────────────────────────────────────────────── sending thread: 0 bytes/order
     +-----------------------------------------------------------
     0s         15s         30s         45s         60s
t (s) Heap used
0 205 MB
5 282 MB
10 364 MB
15 448 MB
20 529 MB
25 615 MB
30 694 MB
35 778 MB
40 861 MB
45 942 MB
50 1,026 MB
55 1,107 MB
60 1,190 MB

The heap grows linearly throughout. It looks alarming — until you look at the per-thread allocation breakdown:

=== Per-Thread Allocation (steady phase, 3.55 M orders) ===
Thread                                   Total allocated    Per order
────────────────────────────────────────────────────────────────────
Test worker (sending thread)               190,096 bytes     0 bytes
stress-session (session loop)                    —           0 bytes
ForkJoinPool-1-worker-2  (JIT compiler)  309,705,864 bytes  87 bytes
ForkJoinPool-1-worker-3  (JIT compiler)  252,581,256 bytes  71 bytes
ForkJoinPool-1-worker-4  (JIT compiler)  230,350,128 bytes  64 bytes
Read-Poller (virtual-thread I/O)         192,424,176 bytes  54 bytes

The 985 MB of heap growth is entirely accounted for by HotSpot’s background JIT-compiler threads (ForkJoinPool workers) and the virtual-thread Read-Poller — none of it comes from the message-processing hot path. The three JIT threads together explain 792 MB; the Read-Poller explains 192 MB; rounding accounts for the rest.

The sending thread allocates 0 bytes per order across 3.55 million orders. This is the definitive proof.

Why the heap grows and what would happen with hot-path allocation

EpsilonGC accumulates all JIT profiling data, MethodData objects, and IR nodes permanently — these are a structural cost of running HotSpot’s tiered compiler for 70 seconds. They grow regardless of whether the hot path allocates or not.

What would differ is the slope. If the encoding and decoding path allocated even 200 bytes/order at 59,000 orders/second:

200 bytes × 59,000 orders/s = 11.8 MB/s extra

That would exhaust the remaining 310 MB of headroom within 26 seconds of steady-state, crashing the JVM well before the 60-second floor.
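
The same slope arithmetic as a sketch (310 MB of headroom here uses decimal megabytes, matching the figures above):

```java
public class HeadroomMath {
    // Seconds until an Epsilon heap ceiling is exhausted by extra per-order allocation.
    public static double secondsToExhaust(double headroomBytes, long bytesPerOrder, long ordersPerSec) {
        return headroomBytes / ((double) bytesPerOrder * ordersPerSec);
    }

    public static void main(String[] args) {
        // 200 bytes/order at 59,000 orders/s = 11.8 MB/s against 310 MB of headroom
        System.out.printf("%.1f s%n", secondsToExhaust(310e6, 200, 59_000));
    }
}
```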

The striking detail is not the total heap growth — it is what drives it. Three background JIT-compiler threads account for ~792 MB; the message-processing hot path accounts for exactly zero.

GraalVM CE 25 comparison

The same test runs against GraalVM CE 25 (Graal JIT instead of HotSpot C2):

                   OpenJDK 25 (C2)   GraalVM CE 25 (Graal)
Sending thread     0 bytes/order     0 bytes/order
Heap growth rate   ~3.1 MB/s         ~3.1 MB/s
Test result        PASS (1,093 s)    PASS

Both JITs confirm the same result: sending thread allocates 0 bytes/order.


Transport and Thread Model Comparison

The engine supports four combinations of transport and session-thread type. Each was benchmarked over five runs (60 s steady phase, ZGC, 2 GB, desktop Intel i7-13700 24 threads) to produce stable distributions:

Variant      Transport          Thread     Avg throughput
Baseline     Blocking socket    Virtual    ~57,400 /s
Option 1     Blocking socket    Platform   ~54,800 /s
Option 2     NIO non-blocking   Virtual    ~64,500 /s
Option 1+2   NIO non-blocking   Platform   ~64,200 /s

Two conclusions stand out.

Transport is the meaningful axis. Switching from blocking socket to NIO adds ~7,000–9,400 orders/second, whichever thread type is used. Switching thread type on the same transport adds at most ~2,600 orders/second. The transport choice dominates.

Virtual threads consistently outperform platform threads on blocking sockets (~5%). The five-run ranges do not overlap:

Run Virtual (blocking) Platform (blocking)
1 57,013 55,560
2 57,886 54,474
3 57,516 54,218
4 57,589 54,667
5 57,298 55,053
Range 57,013–57,886 54,218–55,560

The gap is architectural. Java 21 virtual threads transparently replace blocking socket I/O with the NIO Read-Poller internally — socket.read() parks the virtual thread and wakes it the instant data arrives. A platform thread genuinely blocks the OS thread on recv() and wakes only when the SO_TIMEOUT fires or data arrives. The session thread’s responsiveness to incoming execution reports determines how quickly backpressure on the test thread’s sends is relieved.

To explore whether the virtual/platform gap on blocking sockets could be closed by tuning SO_TIMEOUT, five values were tested in a single session (platform thread + blocking socket, all variants run back-to-back via throughputSoSweep so machine state is constant):

SO_TIMEOUT        Throughput   Notes
1 ms              45,986 /s    ~6% lower — SocketTimeoutException overhead
10 ms             48,665 /s    Within measurement noise
50 ms (default)   48,662 /s    Baseline; timeout fires rarely
100 ms            47,237 /s    Within measurement noise
200 ms            48,934 /s    Within measurement noise

10 ms through 200 ms are all within measurement noise (~47,200–48,900 /s). The platform thread blocks on recv() and is woken by data arriving — once data is flowing at ~49K orders/s the timeout fires essentially never, so its value above ~10 ms is irrelevant. Only at 1 ms does the timeout fire frequently enough (~1,000 times/s) that SocketTimeoutException construction and catch-block overhead meaningfully degrades throughput.
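
A self-contained sketch of that mechanism: on an idle loopback connection every wakeup is a timeout, so a 1 ms SO_TIMEOUT produces the SocketTimeoutException churn that drags the 1 ms row down (window length and names are illustrative):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SoTimeoutDemo {
    // Count SO_TIMEOUT fires on an idle loopback connection over a fixed window.
    public static int timeoutFires(int soTimeoutMs, long windowMs) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(InetAddress.getLoopbackAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            client.setSoTimeout(soTimeoutMs);
            InputStream in = client.getInputStream();
            byte[] buf = new byte[64];
            int fires = 0;
            long deadline = System.nanoTime() + windowMs * 1_000_000L;
            while (System.nanoTime() < deadline) {
                try {
                    if (in.read(buf) < 0) break;    // peer closed; it never sends here
                } catch (SocketTimeoutException expected) {
                    fires++;                         // one constructed exception per idle interval
                }
            }
            return fires;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("fires in 50 ms at SO_TIMEOUT=1ms: " + timeoutFires(1, 50));
    }
}
```

At ~49K orders/s data is essentially always waiting, so in the real benchmark the timeout path is cold for any value ≥ 10 ms — which is why those rows are indistinguishable.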

Note: these absolute figures are lower than the transport comparison table above because that table was collected in a different session under lighter system load. The relative ordering is what matters — running all variants in a single sweep eliminates session-to-session variance (which can be ±10–15% due to thermal state and background processes).

With NIO, the thread type becomes irrelevant: Option 2 and Option 1+2 are within 300 orders/second of each other across all runs — well within measurement noise.


Latency

Two measurements, both under ZGC with a 512 MB heap on loopback.

Pipeline latency — 50K orders, full throughput

All 50K orders are dispatched without waiting for responses. The throughput-optimised sending pattern.

Metric Value
p50 182 ms
p90 198 ms
p99 203 ms
max 204 ms

This is queue-depth latency, not wire round-trip time. At 44K orders/second with ~8K orders buffered in the OS TCP receive queue, the expected p50 is queue_depth / throughput_rate = 8,000 / 44,000 s ≈ 182 ms. The engine is not the bottleneck; the socket buffer is.
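
That estimate is just Little's law applied to the socket buffer; as a sketch:

```java
public class QueueDelay {
    // Expected queueing delay in ms: queue depth divided by service rate (Little's law).
    public static double expectedP50Ms(long queueDepthOrders, long ordersPerSec) {
        return queueDepthOrders * 1_000.0 / ordersPerSec;
    }

    public static void main(String[] args) {
        // ~8K orders queued at 44K orders/s ≈ 182 ms, matching the measured p50
        System.out.printf("%.0f ms%n", expectedP50Ms(8_000, 44_000));
    }
}
```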

Single-order RTT — one order in-flight, synchronous

Sends one order, blocks via LockSupport.park() until the ExecReport(New) callback fires, then sends the next. Zero pipeline depth. Three runs on the same development laptop under normal multitasking load.

Metric Run 1 Run 2 Run 3
min 19 µs 14 µs 13 µs
p50 39 µs 39 µs 40 µs
p90 48 µs 49 µs 51 µs
p99 116 µs 99 µs 91 µs
p99.9 394 µs 198 µs 267 µs
max 1.5 ms 0.4 ms 1.1 ms

p50 consistently 39–40 µs. p99 consistently sub-120 µs across all three runs. This measures the true wire round-trip: FIX encode → socket write → loopback TCP → simulator decode → ExecReport encode → loopback TCP → socket read → FIX decode → callback.

p99.9 under 400 µs on all runs confirms it is not systemic.

The pipeline/latency tradeoff

Mode                        p50        p99         What it measures
Pipelined (44K/sec)         182 ms     203 ms      Queue depth ÷ throughput rate
Synchronous (1 in-flight)   39–40 µs   91–116 µs   True wire round-trip

The ~4,500× p50 difference is the pipeline/latency tradeoff made concrete. A production system chooses a pipeline depth — or implements adaptive depth control.

Blocking I/O note. Both latency measurements use java.io.Socket with SO_TIMEOUT=50ms. In synchronous mode the session thread settles into socket.read(), and latency is dominated by the loopback round-trip and FIX processing, not the polling interval. The NIO transport benchmarked in the previous section produces higher throughput; its effect on single-order RTT has not been separately measured. io_uring, which batches I/O submissions instead of crossing the user/kernel boundary with a system call per operation, would lower the floor further but would not change the fundamental pipeline/synchronous distinction.


Part 3 covers GraalVM native image: what AOT compilation changes about the measurement, results, and a consolidated comparison of all runtime configurations.