Introduction

Part 3 of the series. Part 2 covered JVM selection, GC tuning, and JIT benchmark results. This part covers GraalVM native image: what changes when the JIT is removed entirely, how we adapted the measurement methodology for the constraints of SubstrateVM (the runtime that native image embeds), and how AOT-compiled throughput compares to the JIT configurations from Part 2.


What Native Image Answers

The JIT benchmarks in Part 2 measure peak throughput - the throughput the JIT reaches after a 200K-order warmup phase has given C2 or Graal enough profiling data to compile the hot path. That is a meaningful number, but it answers a specific question: what is the ceiling of JIT-optimised performance?

The GraalVM native image (native-image) answers a different question: what does steady-state throughput look like with

  • no JIT warmup,
  • no JIT compiler threads consuming CPU, and
  • a minimal runtime footprint?

native-image performs ahead-of-time (AOT) compilation: the entire application is analysed statically and compiled to a native executable at build time. The resulting binary embeds a minimal runtime (SubstrateVM) that includes a GC and thread scheduler but no JIT compiler. There is no bytecode at runtime. There is no class loading after startup.
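One consequence of the closed-world, build-time analysis is worth making concrete: dynamic class loading that works transparently on HotSpot fails in a native binary unless the class was registered in reachability metadata at build time. A minimal sketch (com.example.Plugin is a hypothetical class name):

```java
// Illustrative only: Class.forName with a dynamic name works on HotSpot
// if the class is on the classpath, but fails under SubstrateVM unless
// the class was registered in reachability metadata at image build time -
// there is no bytecode left to load after the build.
public final class NoDynamicLoading {
    public static void main(String[] args) throws Exception {
        Class<?> plugin = Class.forName("com.example.Plugin"); // hypothetical class
        System.out.println(plugin.getName());
    }
}
```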

SubstrateVM Constraints

SubstrateVM is not HotSpot. Several things that work transparently on HotSpot require explicit handling:

ZGC unavailability is the most significant constraint. ZGC is the benchmark GC for the JIT runs, but native image does not support it. For the native image throughput baseline we use Serial GC - a stop-the-world collector with a very different pause and throughput profile. The throughput comparison between Graal JIT (ZGC) and native image (Serial GC) therefore conflates runtime type and GC algorithm; this is an inherent limitation of the platform.

ThreadMXBean unavailability means we cannot measure per-thread allocation directly. Instead we use an Epsilon GC build: no collection, so total heap growth over the steady-state phase divided by order count gives an all-thread allocation upper bound. This is not the same measurement as the JIT’s per-sending-thread figure.
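The probe itself is small. A minimal sketch of the heap-delta technique, assuming a hypothetical runSteadyState helper standing in for the benchmark's send loop:

```java
// Sketch of the Epsilon heap-delta allocation bound. runSteadyState and
// ORDER_COUNT are hypothetical stand-ins for the real harness. With
// Epsilon GC nothing is ever collected, so used-heap growth across the
// phase is an upper bound on total allocation by ALL threads.
public final class EpsilonAllocProbe {

    private static final long ORDER_COUNT = 800_000; // steady-state phase size

    public static void main(String[] args) throws Exception {
        Runtime rt = Runtime.getRuntime();

        long usedBefore = rt.totalMemory() - rt.freeMemory();
        runSteadyState(ORDER_COUNT);            // send the steady-state orders
        long usedAfter = rt.totalMemory() - rt.freeMemory();

        System.out.printf("all-thread alloc upper bound: %d bytes/order%n",
                (usedAfter - usedBefore) / ORDER_COUNT);
    }

    private static void runSteadyState(long orders) {
        // placeholder for the benchmark's send loop
    }
}
```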

java.class.path is empty in a native binary - there is no classpath at runtime. The simulator child JVM needs an explicit classpath passed via environment variable (SIMULATOR_CLASSPATH). NativeStressMain reads this and passes it to ProcessBuilder.
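The handoff is small; a sketch of how it might look (the main-class name sim.SimulatorMain and the wrapper class are illustrative - only SIMULATOR_CLASSPATH and the ProcessBuilder handoff come from the harness):

```java
// A native binary has no java.class.path, so the simulator child JVM's
// classpath arrives via the SIMULATOR_CLASSPATH environment variable.
import java.io.IOException;

public final class LaunchSimulator {
    public static Process launch() throws IOException {
        String classpath = System.getenv("SIMULATOR_CLASSPATH");
        if (classpath == null || classpath.isBlank()) {
            throw new IllegalStateException(
                    "SIMULATOR_CLASSPATH must be set: a native binary has no classpath to inherit");
        }
        return new ProcessBuilder("java", "-cp", classpath, "sim.SimulatorMain")
                .inheritIO()   // surface simulator output in the parent's console
                .start();
    }
}
```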


Results (summary)

|                              | OpenJDK 25 C2 | GraalVM CE 25 JIT | Native Serial GC | Native Epsilon GC |
|------------------------------|---------------|-------------------|------------------|-------------------|
| Throughput                   | 57,476/sec    | 58,844/sec        | 61,728/sec       | 47,512/sec        |
| Run-to-run CV                | 0.56%         | 0.59%             | 0.20%            | -                 |
| Alloc/order (sending thread) | 0 bytes       | 0 bytes           | n/a†             | 510 bytes‡        |
| Heap at logon                | 107 MB        | 235 MB            | 1 MB             | 1 MB              |
| Heap post-warmup             | 107 MB        | 235 MB            | 1 MB             | 2,605 MB§         |
| Binary / image size          | -             | -                 | 27.7 MB          | 24.8 MB           |
| Time to first order          | ~3–5 s        | ~3–5 s            | <100 ms          | <100 ms           |

†Serial GC collects; heap-delta is not a useful allocation proxy.

‡All-threads upper bound via Epsilon heap-delta. Not comparable to the JIT’s per-sending-thread ThreadMXBean figure.

§Epsilon never collects. The 1.5 GB heap fills over the 800K-order steady-state phase at ~24 MB/s total-process allocation rate.

In-session 3-runtime five-run sweep (2026-05-04)

allRuntimesSweep runs OpenJDK ×5 → GraalVM JIT ×5 → Native Serial ×5 - fifteen runs back-to-back in a single Gradle invocation. All runs share the same machine state, which eliminates the 10–15% cross-session drift that contaminates runtime comparisons drawn from separate sessions. Baseline transport variant (blocking socket + virtual thread). JIT runs use ZGC with a 2 GB heap; native runs use Serial GC with a 512 MB heap.
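The shape of the sweep, as a sketch - binary paths, the main-class name, and the orchestration-in-Java framing are illustrative stand-ins for the actual Gradle task:

```java
// Illustrative shape of the in-session sweep. Running all fifteen runs
// strictly sequentially in one invocation keeps machine state constant
// across runtimes, so the per-runtime means are directly comparable.
import java.util.List;

public final class RuntimesSweep {

    record RuntimeConfig(String label, List<String> command) {}

    public static void main(String[] args) throws Exception {
        List<RuntimeConfig> runtimes = List.of(
                new RuntimeConfig("OpenJDK 25 C2", List.of(
                        "/opt/openjdk-25/bin/java", "-XX:+UseZGC", "-Xmx2g",
                        "-cp", "build/libs/app.jar", "StressMain")),
                new RuntimeConfig("GraalVM CE 25 JIT", List.of(
                        "/opt/graalvm-25/bin/java", "-XX:+UseZGC", "-Xmx2g",
                        "-cp", "build/libs/app.jar", "StressMain")),
                new RuntimeConfig("Native Serial GC", List.of(
                        "./native-stress", "-Xmx512m")));

        for (RuntimeConfig rc : runtimes) {          // OpenJDK → Graal JIT → native
            for (int run = 1; run <= 5; run++) {     // five runs per runtime
                Process p = new ProcessBuilder(rc.command()).inheritIO().start();
                int exit = p.waitFor();              // back-to-back, same machine state
                System.out.printf("%s run %d exited %d%n", rc.label(), run, exit);
            }
        }
    }
}
```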

| Run   | OpenJDK 25 C2 | GraalVM CE 25 JIT | Native Serial GC |
|-------|---------------|-------------------|------------------|
| 1     | 57,680/sec    | 58,817/sec        | 61,826/sec       |
| 2     | 57,825/sec    | 58,693/sec        | 61,815/sec       |
| 3     | 57,552/sec    | 59,413/sec        | 61,519/sec       |
| 4     | 57,315/sec    | 58,829/sec        | 61,727/sec       |
| 5     | 57,008/sec    | 58,468/sec        | 61,753/sec       |
| Mean  | 57,476/sec    | 58,844/sec        | 61,728/sec       |
| Range | 1.42%         | 1.61%             | 0.50%            |
| CV    | 0.56%         | 0.59%             | 0.20%            |

The native binary’s CV (coefficient of variation: standard deviation / mean × 100%) of 0.20% is roughly a third of either JIT runtime’s. AOT compilation eliminates JIT-warmup variance entirely; Serial GC’s deterministic stop-the-world pauses also produce tighter run-to-run timing than ZGC’s concurrent cycles for this workload.
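For reference, the CV arithmetic - a sketch assuming sample standard deviation, which reproduces the table's figures from the run columns above:

```java
// Coefficient of variation: sample standard deviation / mean × 100%.
// Plugging in the five Native Serial GC runs reproduces the 0.20% figure.
public final class Cv {
    public static void main(String[] args) {
        double[] runs = {61_826, 61_815, 61_519, 61_727, 61_753};

        double mean = 0;
        for (double r : runs) mean += r;
        mean /= runs.length;                                  // 61,728/sec

        double sumSq = 0;
        for (double r : runs) sumSq += (r - mean) * (r - mean);
        double stdDev = Math.sqrt(sumSq / (runs.length - 1)); // sample std dev

        System.out.printf("CV = %.2f%%%n", stdDev / mean * 100); // → 0.20%
    }
}
```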


Analysis

Throughput: native image exceeds GraalVM JIT, though by less than cross-session figures suggested.

The in-session sweep gives definitive pairwise deltas without machine-state contamination:

| Comparison                 | Δ                     |
|----------------------------|-----------------------|
| GraalVM JIT vs OpenJDK C2  | +1,368/sec (+2.38%)   |
| Native vs GraalVM JIT      | +2,884/sec (+4.90%)   |
| Native vs OpenJDK C2       | +4,252/sec (+7.40%)   |

The in-session methodology ensures the figures are comparable: the native lead is real but modest at +5–7%, and the GraalVM JIT advantage over OpenJDK is real but modest at +2.4%. With CVs of 0.2–0.6% and deltas of 2.4–7.4%, every pairwise difference is well outside both runtimes’ noise bands.

Compiling ahead of time normally costs peak performance relative to a JIT, which can optimise against runtime profiles. Here native is above JIT, which is unusual and reflects the workload’s character: a few hundred bytes of ASCII encode, a blocking write, a blocking read, a checksum loop. There is minimal virtual dispatch for the JIT to devirtualise speculatively, and minimal allocation for the JIT to specialise around. Profile-guided optimisation adds relatively little over static AOT compilation for this workload, and the Graal AOT compiler applies the same aggressive inlining it uses in JIT mode, without the runtime overhead of profiling and deoptimisation infrastructure.

The native binary also wins on predictability. With a CV of 0.20% - about a third of either JIT runtime’s - its run-to-run distribution is far tighter. For a system whose SLA is expressed in terms of worst-case throughput (rather than mean), a 3× reduction in measured variance is a structural win independent of the absolute throughput advantage.

Heap footprint: the native image’s clearest structural win.

1 MB at logon vs 107 MB (OpenJDK) and 235 MB (GraalVM JIT). A JVM materialises the class library’s objects and static state at startup; native image bakes that pre-initialised state into the binary’s read-only image heap, which does not count against the live heap and is shared across processes by the OS page cache. For a system that runs many instances (containers, lambdas), that ~100× reduction in per-instance heap means proportionally more instances fit within the same memory budget.

Allocation: 510 bytes/order is an upper bound, not a regression.

At 47,512 orders/sec × 510 bytes ≈ 24 MB/s total-process allocation. The JIT build’s 0 bytes/order covered only the sending thread via ThreadMXBean; the native image figure covers all threads: sending thread, session thread, SubstrateVM internal threads. The most likely source is SubstrateVM’s virtual-thread machinery, which is less mature than HotSpot’s. The FIX encode/decode path on the sending thread itself contributes zero - consistent with every JIT measurement.

Cold start: no JIT ramp.

JIT builds need 3–5 seconds before C2 or Graal has compiled the hot path. The native binary is at its full throughput from the first order. For exchange connections that reconnect after market close, or for container deployments where instances are frequently recycled, the absence of a warmup tax is operationally meaningful.


Consolidated Comparison

All four runtime configurations on the same hardware and test:

| Runtime                    | GC              | Throughput | Heap at logon | Cold start |
|----------------------------|-----------------|------------|---------------|------------|
| OpenJDK 25 HotSpot C2      | ZGC, 2 GB       | 57,476/sec | 107 MB        | ~3–5 s     |
| GraalVM CE 25 Graal JIT    | ZGC, 2 GB       | 58,844/sec | 235 MB        | ~3–5 s     |
| GraalVM Native (Serial GC) | Serial, 512 MB  | 61,728/sec | 1 MB          | <100 ms    |
| GraalVM Native (Epsilon GC)| Epsilon, 1.5 GB | 47,512/sec | 1 MB          | <100 ms    |

For a deployment where peak throughput is the only criterion: native image Serial GC, which leads on this workload.

For a deployment where startup latency, memory density, and peak throughput all matter: native image Serial GC again - it wins on all three axes here, an outcome that is workload-specific and shouldn’t be generalised.

For a deployment where ecosystem maturity and tooling support matter more than the last 5–7% of throughput: HotSpot C2 remains the conservative baseline. GraalVM JIT sits between the two - modestly faster than C2 (+2.4%), with the same JVM tooling.


What’s Next

The current peak sits at ~61K orders/second on desktop hardware. The next phase is about exploring where that ceiling actually is - how close a zero-GC, zero-allocation Java implementation can get to six-figure throughput before hitting the limits of what a general-purpose runtime and TCP stack can offer, and where dedicated infrastructure would need to take over.