Introduction
Part 2 of the series. Part 1 covered the engine design, the FIX protocol, and the zero-allocation techniques in the hot path. This part covers the runtime layer: Java version selection, GC algorithm choice, JVM flag rationale, and benchmark results across HotSpot C2 and Graal JIT.
Why Java 25
Three runtime features drove the choice of Java 25.
Compact Object Headers. Object headers shrink from 12 bytes to 8 bytes (opt-in via -XX:+UseCompactObjectHeaders, a production flag in Java 25). For the engine’s object graph — a pool of 16 FixMessage instances, session state, and the encoder/decoder buffers — the absolute saving is small. The more meaningful effect is reduced cache-line pressure: 8-byte headers pack better into 64-byte cache lines, which matters in tight loops over arrays of objects.
Generational ZGC as default. The ZGenerational flag was deprecated and removed as redundant in Java 24. From Java 24 onwards, -XX:+UseZGC gets generational ZGC without any additional flags. The sub-millisecond pause guarantee — regardless of heap size — is what makes ZGC viable for a latency-sensitive engine. G1’s stop-the-world pauses can reach low double-digit milliseconds under load; at 44K orders/second a 10ms pause stalls ~440 orders.
Virtual threads (stable since Java 21). The session loop and simulator accept loop run on virtual threads. Blocking socket.read() with SO_TIMEOUT=50ms parks the virtual thread and releases the carrier thread — no platform thread is held during the wait. This has no hot-path allocation impact but keeps resource consumption flat as session count scales.
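The pattern can be sketched in a few lines. This is an illustrative, self-contained example, not the engine's code: a loopback socket stands in for the exchange connection, and the virtual-thread read demonstrates the park-and-release behaviour described above.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class VirtualSessionSketch {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            // Stand-in for the exchange side: accept one connection, send one byte.
            Thread.ofVirtual().start(() -> {
                try (Socket peer = server.accept()) {
                    peer.getOutputStream().write(42);
                    peer.getOutputStream().flush();
                    Thread.sleep(200); // keep the connection open while the client reads
                } catch (Exception ignored) { }
            });

            try (Socket socket = new Socket("127.0.0.1", server.getLocalPort())) {
                socket.setSoTimeout(50); // SO_TIMEOUT: read() gives up after 50 ms of silence
                // Session loop on a virtual thread: the blocking read() parks the
                // virtual thread and releases its carrier thread while waiting.
                Thread session = Thread.ofVirtual().start(() -> {
                    try {
                        int b = socket.getInputStream().read();
                        System.out.println("received: " + b);
                    } catch (SocketTimeoutException e) {
                        System.out.println("timed out with no data");
                    } catch (IOException ignored) { }
                });
                session.join();
            }
        }
    }
}
```

While the virtual thread waits inside read(), its carrier platform thread is free to run other virtual threads — which is why resource consumption stays flat as session count grows.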
GC Selection
Two GCs are used across the test suite, each for a different purpose.
| Task | GC | Heap | Rationale |
|---|---|---|---|
| Stress / throughput benchmarks | ZGC | 2 GB | Sub-ms pauses; JVMCI support on GraalVM |
| OOM resistance test | EpsilonGC | 1.5 GB | No collection — allocation is permanent |
| Latency tests | ZGC | 512 MB | Realistic working footprint |
EpsilonGC as a correctness instrument. Epsilon never collects. Every heap allocation is a one-way ratchet — committed until process exit. This turns the OOM resistance test into a strict regression gate: if the client hot path allocates a single object per order, the 1.5 GB ceiling will be exhausted within minutes and the test fails with OutOfMemoryError. A passing run under EpsilonGC is a stronger statement than a ThreadMXBean snapshot — it proves zero allocation under sustained load at scale, not just at a point in time.
Contrast this with ThreadMXBean.getThreadAllocatedBytes(): that counter is updated at safepoints, not continuously, so a sub-TLAB allocation might go unreported. EpsilonGC has no such blind spot.
JVM Flags
--enable-preview
-Xms2g -Xmx2g
-XX:+UseZGC
-XX:+UseCompactObjectHeaders
-XX:+AlwaysPreTouch
-XX:ReservedCodeCacheSize=512m
-Xlog:gc*:stdout:time,uptime,level
-Xms2g -Xmx2g (fixed heap). Equal min and max eliminates heap resize pauses. ZGC still concurrently compacts and returns pages to the OS, but the committed size stays constant. Resize pauses are rare in practice but non-zero.
-XX:+UseCompactObjectHeaders. Enables the 8-byte object headers described above; without the flag, headers stay at 12 bytes.
-XX:+AlwaysPreTouch. Pre-commits all heap pages at startup by touching every page in the committed region. Without it, first-touch page faults cause latency spikes during the early warmup phase as the OS maps physical memory on demand. With a 2 GB heap this adds seconds to startup — acceptable given that the benchmark warmup phase takes far longer.
-XX:ReservedCodeCacheSize=512m. The default 256 MB code cache can approach capacity under a 1M-order benchmark with both C1-compiled and C2-compiled methods resident. When the cache reaches capacity, the JIT stops compiling new methods and previously compiled methods may be deoptimised — throughput drops noticeably. 512 MB is generous enough to never fill during the test.
-Xlog:gc*:stdout:time,uptime,level. Unified JVM logging for GC events. Provides pause time, cause (allocation stall vs proactive), and heap occupancy at a glance. ZGC pauses logged here were consistently under 1ms throughout all benchmark runs.
Simulator isolation
The exchange simulator runs in a child JVM via ProcessBuilder. Its ExchangeOrder allocations (~368 bytes per order) are invisible to the client JVM. Without isolation, the OOM resistance test would exhaust the 1.5 GB ceiling in under two minutes from simulator-side allocations — a false positive that would mask client-side regressions.
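The isolation mechanism can be sketched as follows. The launch() method is illustrative — the heap flags and the ExchangeSimulator main-class name are hypothetical, not taken from the project — while main() smoke-tests the same mechanism by spawning `java -version`:

```java
import java.io.File;
import java.util.List;

public class SimulatorLauncher {
    // Hypothetical launcher: the simulator's allocations land in the child
    // JVM's heap, invisible to the client JVM under test.
    public static Process launch(int port) throws Exception {
        String javaBin = new File(System.getProperty("java.home"), "bin/java").getPath();
        List<String> cmd = List.of(
                javaBin,
                "-Xms256m", "-Xmx256m",              // simulator heap, independent of the client's
                "-XX:+UseZGC",
                "-cp", System.getProperty("java.class.path"),
                "ExchangeSimulator",                  // hypothetical simulator main class
                String.valueOf(port));
        return new ProcessBuilder(cmd)
                .inheritIO()                          // surface simulator logs in the parent console
                .start();
    }

    public static void main(String[] args) throws Exception {
        // Smoke test of the mechanism: spawn a child JVM and wait for it.
        Process p = new ProcessBuilder(
                new File(System.getProperty("java.home"), "bin/java").getPath(), "-version")
                .inheritIO()
                .start();
        System.out.println("exit: " + p.waitFor());
    }
}
```

Because the child is a separate process, its GC, heap ceiling, and allocation behaviour are fully decoupled from the client JVM — which is what makes the EpsilonGC gate meaningful.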
Benchmark Methodology
FixEngineStressTest runs 1,000,000 orders through the full client cycle:
- Warmup — 200,000 orders. C1 and C2 reach steady state. Not measured.
- Steady state — 800,000 orders. Throughput measured as `orders / elapsed_ns * 1e9`. Allocation on the sending thread is sampled via `com.sun.management.ThreadMXBean.getThreadAllocatedBytes(threadId)` before and after the phase, and the delta is divided by the order count.
The allocation assertion threshold is < 50 bytes/order (hard test failure). In practice, all JIT runs produce exactly 0 bytes/order on the sending thread.
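The measurement harness can be sketched like this. The loop body is a no-op stand-in for the actual send path (the real test drives the engine); on HotSpot the platform ThreadMXBean can be cast to com.sun.management.ThreadMXBean to reach getThreadAllocatedBytes:

```java
import java.lang.management.ManagementFactory;

public class AllocationProbe {
    public static void main(String[] args) {
        // HotSpot-specific cast to reach the allocation counter.
        com.sun.management.ThreadMXBean mx =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long tid = Thread.currentThread().threadId();

        long orders = 1_000_000;
        long allocBefore = mx.getThreadAllocatedBytes(tid);
        long t0 = System.nanoTime();

        long checksum = 0;
        for (long i = 0; i < orders; i++) {
            checksum += i ^ (i << 7);   // allocation-free stand-in for the send path
        }

        long elapsedNs = System.nanoTime() - t0;
        long allocAfter = mx.getThreadAllocatedBytes(tid);

        // Same formulas as the methodology: orders / elapsed_ns * 1e9,
        // and allocation delta divided by order count.
        double ordersPerSec = orders / (double) elapsedNs * 1e9;
        long bytesPerOrder = (allocAfter - allocBefore) / orders;
        System.out.println("bytes/order: " + bytesPerOrder + " checksum=" + checksum);
        System.out.println("throughput positive: " + (ordersPerSec > 0));
    }
}
```

The printing happens after the second sample, so the probe itself does not contaminate the measurement window.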
All tests run on the same development laptop over localhost. Results are single-run under normal multitasking load — representative of relative performance, not lab-grade reproducible in the strict sense.
HotSpot C2 vs Graal JIT
Both JVMs use the same flag set. GraalVM CE 25 substitutes the Graal compiler (via JVMCI) for HotSpot’s C2 as the top-tier JIT.
Four back-to-back sweeps — each sweep runs OpenJDK immediately followed by GraalVM in the same Gradle session to eliminate cross-session machine-state variance. One additional run was discarded (both runtimes degraded sharply — consistent with a transient load spike) and is not shown.
| Sweep | OpenJDK 25 HotSpot C2 | GraalVM CE 25 Graal JIT | GraalVM lead |
|---|---|---|---|
| 1 | 57,226 /s | 58,198 /s | +1.7% |
| 2 | 58,005 /s | 59,303 /s | +2.2% |
| 3 | 57,684 /s | 58,295 /s | +1.1% |
| 4 | 57,732 /s | 57,810 /s | +0.1% |
| range | 57,226–58,005 /s | 57,810–59,303 /s | — |
Both JITs deliver ~57–59K orders/sec. Both produce 0 bytes/order on the sending thread — the zero-allocation design holds on both compilers.
GraalVM advantage: ~1–2% across clean sweeps — within measurement noise; the two JITs are effectively equivalent on this workload.
ZGC pauses were under 1ms on both JVMs throughout. No allocation stalls observed — consistent with 0 bytes/order.
OOM Resistance
oomResistanceTest runs EpsilonGC with a 1.5 GB ceiling. The test drives two phases:
| Phase | Min orders | Min wall time | Purpose |
|---|---|---|---|
| Warmup | 200 K | 10 s | Drive HotSpot to C2-compile all hot paths |
| Steady | 1 M | 60 s | Assertion window — per-thread allocation measured |
The test passed with no OutOfMemoryError over 1,093 seconds of steady state. Heap grew at ~3.1 MB/s — at 400s into the steady phase, ~1.24 GB was in use against the 1.5 GB ceiling.
Heap growth profile
The breakdown below shows where each byte comes from; the shape and source of the growth are the same across environments.
Heap used during the 60 s steady phase (EpsilonGC, 1.5 GB ceiling):

```
  MB
1200 |                                                ·
1100 |                                            ·
1000 |                                        ·
 900 |                                   ·
 800 |                              ·
 700 |                         ·
 600 |                     ·
 500 |                ·
 400 |           ·   ← ForkJoinPool JIT-compiler threads + Read-Poller
 300 |      ·
 200 |·──────────────────────────────────────────────── sending thread: 0 bytes/order
     +-------------------------------------------------
      0s         15s          30s          45s       60s
```
| t (s) | Heap used |
|---|---|
| 0 | 205 MB |
| 5 | 282 MB |
| 10 | 364 MB |
| 15 | 448 MB |
| 20 | 529 MB |
| 25 | 615 MB |
| 30 | 694 MB |
| 35 | 778 MB |
| 40 | 861 MB |
| 45 | 942 MB |
| 50 | 1,026 MB |
| 55 | 1,107 MB |
| 60 | 1,190 MB |
The heap grows linearly throughout. It looks alarming — until you look at the per-thread allocation breakdown:
```
=== Per-Thread Allocation (steady phase, 3.55 M orders) ===
Thread                                     Total allocated    Per order
────────────────────────────────────────────────────────────────────────
Test worker (sending thread)                 190,096 bytes      0 bytes
stress-session (session loop)                      —            0 bytes
ForkJoinPool-1-worker-2 (JIT compiler)   309,705,864 bytes     87 bytes
ForkJoinPool-1-worker-3 (JIT compiler)   252,581,256 bytes     71 bytes
ForkJoinPool-1-worker-4 (JIT compiler)   230,350,128 bytes     64 bytes
Read-Poller (virtual-thread I/O)         192,424,176 bytes     54 bytes
```
The 985 MB of heap growth is entirely accounted for by HotSpot’s background JIT-compiler threads (ForkJoinPool workers) and the virtual-thread Read-Poller — none of it comes from the message-processing hot path. The three JIT threads together explain 792 MB; the Read-Poller explains 192 MB; rounding accounts for the rest.
The sending thread allocates 0 bytes per order across 3.55 million orders. This is the definitive proof.
Why the heap grows and what would happen with hot-path allocation
EpsilonGC accumulates all JIT profiling data, MethodData objects, and IR nodes permanently — these are a structural cost of running HotSpot’s tiered compiler for 70 seconds. They grow regardless of whether the hot path allocates or not.
What would differ is the slope. If the encoding and decoding path allocated even 200 bytes/order at 59,000 orders/second:
200 bytes × 59,000 orders/s = 11.8 MB/s extra
That would exhaust the remaining 310 MB of headroom within 26 seconds of steady-state, crashing the JVM well before the 60-second floor.
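The slope arithmetic can be checked directly, using the figures from the text:

```java
import java.util.Locale;

public class HeadroomMath {
    public static void main(String[] args) {
        long bytesPerOrder = 200;                                  // hypothetical hot-path leak
        long ordersPerSec = 59_000;
        double extraMBPerSec = bytesPerOrder * ordersPerSec / 1e6; // 200 B × 59K/s = 11.8 MB/s
        double headroomMB = 310;                                    // 1,500 MB ceiling - 1,190 MB used
        double secondsToOOM = headroomMB / extraMBPerSec;           // headroom / extra rate
        System.out.printf(Locale.ROOT, "extra: %.1f MB/s, OOM in: %.0f s%n",
                extraMBPerSec, secondsToOOM);
    }
}
```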
The striking detail is not the total heap growth — it is what drives it. Three background JIT-compiler threads account for ~792 MB; the message-processing hot path accounts for exactly zero.
GraalVM CE 25 comparison
The same test runs against GraalVM CE 25 (Graal JIT instead of HotSpot C2):
| Metric | OpenJDK 25 (C2) | GraalVM CE 25 (Graal) |
|---|---|---|
| Sending thread | 0 bytes/order | 0 bytes/order |
| Heap growth rate | ~3.1 MB/s | ~3.1 MB/s |
| Test result | PASS (1,093 s) | PASS |
Both JITs confirm the same result: sending thread allocates 0 bytes/order.
Transport and Thread Model Comparison
The engine supports four combinations of transport and session-thread type. Each was benchmarked over five runs (60 s steady phase, ZGC, 2 GB, desktop Intel i7-13700 24 threads) to produce stable distributions:
| Variant | Transport | Thread | Avg throughput |
|---|---|---|---|
| Baseline | Blocking socket | Virtual | ~57,400 /s |
| Option 1 | Blocking socket | Platform | ~54,800 /s |
| Option 2 | NIO non-blocking | Virtual | ~64,500 /s |
| Option 1+2 | NIO non-blocking | Platform | ~64,200 /s |
Two conclusions stand out.
Transport is the meaningful axis. Switching from blocking socket to NIO adds ~7,000 orders/second regardless of thread type. Switching thread type with the same transport adds at most ~2,600 orders/second. The transport choice dominates.
Virtual threads consistently outperform platform threads on blocking sockets (~5%). The five-run ranges do not overlap:
| Run | Virtual (blocking) | Platform (blocking) |
|---|---|---|
| 1 | 57,013 | 55,560 |
| 2 | 57,886 | 54,474 |
| 3 | 57,516 | 54,218 |
| 4 | 57,589 | 54,667 |
| 5 | 57,298 | 55,053 |
| Range | 57,013–57,886 | 54,218–55,560 |
The gap is architectural. Java 21 virtual threads transparently replace blocking socket I/O with the NIO Read-Poller internally — socket.read() parks the virtual thread and wakes it the instant data arrives. A platform thread genuinely blocks the OS thread on recv() and wakes only when the SO_TIMEOUT fires or data arrives. The session thread’s responsiveness to incoming execution reports determines how quickly backpressure on the test thread’s sends is relieved.
To explore whether the virtual/platform gap on blocking sockets could be closed by tuning SO_TIMEOUT, five values were tested in a single session (platform thread + blocking socket, all variants run back-to-back via throughputSoSweep so machine state is constant):
| SO_TIMEOUT | Throughput | Notes |
|---|---|---|
| 1 ms | 45,986 /s | ~6% lower — SocketTimeoutException overhead |
| 10 ms | 48,665 /s | Within measurement noise |
| 50 ms (default) | 48,662 /s | Baseline; timeout fires rarely |
| 100 ms | 47,237 /s | Within measurement noise |
| 200 ms | 48,934 /s | Within measurement noise |
10 ms through 200 ms are all within measurement noise (~47,200–48,900 /s). The platform thread blocks on recv() and is woken by data arriving — once data is flowing at ~49K orders/s the timeout fires essentially never, so its value above ~10 ms is irrelevant. Only at 1 ms does the timeout fire frequently enough (~1,000 times/s) that SocketTimeoutException construction and catch-block overhead meaningfully degrades throughput.
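The timeout-firing behaviour is easy to reproduce over loopback. This illustrative snippet (not the benchmark harness) opens an idle connection with a 10 ms SO_TIMEOUT and counts how often read() throws SocketTimeoutException in 100 ms — each throw is an exception construction plus a catch, the overhead that degrades the 1 ms configuration:

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SoTimeoutDemo {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", server.getLocalPort());
             Socket peer = server.accept()) {

            client.setSoTimeout(10);                          // SO_TIMEOUT = 10 ms
            int timeouts = 0;
            long deadline = System.nanoTime() + 100_000_000L; // run for ~100 ms
            while (System.nanoTime() < deadline) {
                try {
                    client.getInputStream().read();           // peer sends nothing
                } catch (SocketTimeoutException e) {
                    timeouts++;                               // fires roughly every 10 ms
                }
            }
            System.out.println("timeouts in 100 ms: " + timeouts
                    + " (at least 5: " + (timeouts >= 5) + ")");
        }
    }
}
```

At 1 ms the same loop would throw ~1,000 times per second; once data flows continuously, read() returns with data and the timeout path is never taken at all.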
Note: these absolute figures are lower than the transport comparison table above because that table was collected in a different session under lighter system load. The relative ordering is what matters — running all variants in a single sweep eliminates session-to-session variance (which can be ±10–15% due to thermal state and background processes).
With NIO, the thread type becomes irrelevant: Option 2 and Option 1+2 are within 300 orders/second of each other across all runs — well within measurement noise.
Latency
Two measurements, both under ZGC with a 512 MB heap on loopback.
Pipeline latency — 50K orders, full throughput
All 50K orders are dispatched without waiting for responses; this is the throughput-optimised sending pattern.
| Metric | Value |
|---|---|
| p50 | 182 ms |
| p90 | 198 ms |
| p99 | 203 ms |
| max | 204 ms |
This is queue-depth latency, not wire round-trip time. At 44K orders/second with ~8K orders buffered in the OS TCP receive queue, expected p50 ≈ queue_depth / throughput_rate = 8,000 / 44,000 ≈ 182 ms. The engine is not the bottleneck; the socket buffer is.
Single-order RTT — one order in-flight, synchronous
Sends one order, blocks via LockSupport.park() until the ExecReport(New) callback fires, then sends the next. Zero pipeline depth. Three runs on the same development laptop under normal multitasking load.
| Metric | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| min | 19 µs | 14 µs | 13 µs |
| p50 | 39 µs | 39 µs | 40 µs |
| p90 | 48 µs | 49 µs | 51 µs |
| p99 | 116 µs | 99 µs | 91 µs |
| p99.9 | 394 µs | 198 µs | 267 µs |
| max | 1.5 ms | 0.4 ms | 1.1 ms |
p50 consistently 39–40 µs. p99 consistently sub-120 µs across all three runs. This measures the true wire round-trip: FIX encode → socket write → loopback TCP → simulator decode → ExecReport encode → loopback TCP → socket read → FIX decode → callback.
p99.9 under 400 µs on all runs confirms the max outliers are not systemic.
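The park/unpark handoff at the core of the synchronous harness can be sketched without any sockets. This illustrative version (names and queue are stand-ins, not the engine's code) uses a blocking queue in place of the loopback connection and measures the handoff round trip with the same park-until-callback discipline:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.locks.LockSupport;

public class SyncRttSketch {
    public static void main(String[] args) throws Exception {
        BlockingQueue<Long> wire = new ArrayBlockingQueue<>(1); // stand-in for the socket
        Thread sender = Thread.currentThread();
        final int orders = 10_000;

        // Stand-in session thread: for each "order", deliver the callback
        // by unparking the blocked sender (the ExecReport(New) path).
        Thread session = new Thread(() -> {
            try {
                for (int i = 0; i < orders; i++) {
                    wire.take();                 // "decode" the incoming order
                    LockSupport.unpark(sender);  // callback fires: release the sender
                }
            } catch (InterruptedException ignored) { }
        });
        session.start();

        long[] rttNs = new long[orders];
        for (int i = 0; i < orders; i++) {
            long t0 = System.nanoTime();
            wire.put((long) i);                  // "send" one order, zero pipeline depth
            LockSupport.park();                  // block until the callback unparks us
            rttNs[i] = System.nanoTime() - t0;
        }
        session.join();

        java.util.Arrays.sort(rttNs);
        System.out.println("p50: " + rttNs[orders / 2] + " ns"
                + ", p99: " + rttNs[(int) (orders * 0.99)] + " ns");
    }
}
```

The real harness pays this handoff cost plus encode/decode and two loopback TCP hops, which is why its p50 lands in the tens of microseconds rather than the sub-microsecond range a bare thread handoff achieves.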
The pipeline/latency tradeoff
| Mode | p50 | p99 | What it measures |
|---|---|---|---|
| Pipelined (44K/sec) | 182 ms | 203 ms | Queue depth ÷ throughput rate |
| Synchronous (1 in-flight) | 39–40 µs | 91–116 µs | True wire round-trip |
The ~4,500× p50 difference is the pipeline/latency tradeoff made concrete. A production system chooses a pipeline depth — or implements adaptive depth control.
Blocking I/O note. Both latency measurements use java.io.Socket with SO_TIMEOUT=50ms. In synchronous mode the session thread settles into socket.read(), with actual latency dominated by the loopback round-trip and FIX processing, not the polling interval. The NIO transport benchmarked in the previous section produces higher throughput; its effect on single-order RTT latency has not been separately measured. io_uring, which batches submissions and completions instead of crossing the user/kernel boundary with a system call per I/O operation, would lower the floor further but would not change the fundamental pipeline/synchronous distinction.
Part 3 covers GraalVM native image: what AOT compilation changes about the measurement, results, and a consolidated comparison of all runtime configurations.