Introduction
After reading that a Claude agent team capably built a C compiler, I had to try building a zero-GC Java FIX engine myself. The result is a high-performance, low-latency FixEngine written in pure Java, with no garbage collection on the hot path. I will probably make the code available on GitHub at some point.
It took a day to get a working prototype that could execute a simple Kraken FIX order flow.
The Golden Rule: Don’t Create Rubbish You Have to Clean Up
A FIX engine is not all that interesting in itself: it simply translates messages between two parties. The technical aspects of how it does this, however, are fascinating. The main challenge is speed: the faster it can process messages, the better. In financial trading, every millisecond counts.
The single biggest enemy of speed in Java programs is the garbage collector (GC) — Java’s automatic memory cleaner. When your program creates lots of temporary objects (like strings, arrays, or other data), Java has to pause everything periodically to clean them up. These pauses are unpredictable and can last milliseconds — which in trading is an eternity.
The entire design of this engine is built around one rule: never create temporary objects when handling a message. Everything is pre-prepared.
What Was Built
The implementation proved that the full encode → send → receive → decode → callback cycle can run with no heap allocation on the sending thread in steady state, while sustaining ~44–52K orders/second.
In this post, we’ll cover the main components of the engine and how they achieve zero GC:
- FixDecoder — streaming message decoder with a flyweight over an 8KB receive buffer
- FixEncoder — stateless message encoder writing into a pre-allocated 4KB buffer
- FixSession — single-threaded FIX 4.4 session (Logon through ACTIVE, heartbeats, TestRequest/Response, SequenceReset)
- PlainFixTransport / TcpFixTransport — plain TCP and TLS 1.3 transports
The unifying design goal is deterministic low latency: not merely fast on average, but fast with a tight tail distribution. A GC pause of even 1 ms is unacceptable in a market-making or order-routing context. Every design decision below flows from eliminating three root causes of latency jitter:
- JVM heap allocation on the hot path — triggers GC cycles
- Lock contention — causes thread suspension and OS scheduling delay
- Unnecessary memory copies — wastes memory bandwidth and pollutes CPU caches
The FIX Protocol — What We’re Encoding
FIX 4.4 is a plain-text, tag=value protocol over TCP. Every message has the form:
8=FIX.4.4|9=<bodyLen>|35=<msgType>|34=<seqNum>|49=<sender>|56=<target>|52=<time>|...|10=<checksum>|
Fields (tags) are |-delimited (SOH, 0x01). Tags 8, 9, and 10 are the envelope; everything between 35 and checksum is the body.
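For completeness, the tag 10 value is the unsigned sum of every byte before the checksum field, modulo 256, rendered as exactly three ASCII digits. A minimal sketch (my illustration, not the engine's code):

```java
// Sketch of FIX CheckSum (tag 10): the unsigned byte sum, modulo 256, of
// every byte from "8=" up to and including the SOH preceding "10=".
public class FixChecksum {
    public static int compute(byte[] msg, int offset, int length) {
        int sum = 0;
        for (int i = offset; i < offset + length; i++) {
            sum += msg[i] & 0xFF; // treat bytes as unsigned
        }
        return sum & 0xFF; // modulo 256, rendered as three ASCII digits
    }
}
```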
We implemented:
| Direction | Messages |
|---|---|
| Outbound | Logon, Heartbeat, TestRequest, NewOrderSingle, OrderCancelRequest |
| Inbound | Logon, Heartbeat, TestRequest, SequenceReset, ExecutionReport |
FixDecoder — Zero-GC Incremental Parsing
Receive Buffer and Partial-Read Handling
The decoder owns byte[] recvBuf (8,192 bytes). The transport writes directly into this buffer at recvBuf[dataEnd]. Two cursors manage state:
recvBuf:
[0 .. dataStart-1] ← already-consumed bytes (logically free)
[dataStart .. dataEnd-1] ← buffered bytes awaiting complete message
[dataEnd .. capacity-1] ← free space for next read
Partial FIX messages (where a TCP segment boundary falls mid-message) are handled naturally: dataEnd advances after each read, and decode() only advances dataStart when a complete, validated message is extracted.
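A minimal sketch of this cursor discipline (names taken from the description above; the real decoder also validates BodyLength and CheckSum before consuming):

```java
// Sketch (assumed names) of the decoder's read/decode cycle: the transport
// appends bytes at dataEnd, and dataStart advances only once a complete
// message, terminated by the tag-10 field, is buffered.
public class RecvCursorDemo {
    byte[] recvBuf = new byte[8192];
    int dataStart = 0; // first unconsumed byte
    int dataEnd = 0;   // one past the last buffered byte

    // Transport calls this after read(2) deposited n bytes at recvBuf[dataEnd].
    void onBytesReceived(int n) { dataEnd += n; }

    // Returns the length of the next complete message, or -1 if we must wait
    // for more bytes (a TCP segment boundary fell mid-message).
    int nextMessageLength() {
        // A complete message ends with "10=NNN<SOH>"; scan for that trailer.
        for (int i = dataStart; i + 6 < dataEnd; i++) {
            if (recvBuf[i] == '1' && recvBuf[i + 1] == '0' && recvBuf[i + 2] == '='
                    && (i == dataStart || recvBuf[i - 1] == 0x01)
                    && recvBuf[i + 6] == 0x01) {
                return (i + 7) - dataStart;
            }
        }
        return -1; // partial message stays buffered; dataStart is untouched
    }

    void consume(int len) { dataStart += len; }
}
```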
Flyweight Message Wrapping (Zero-Copy)
A FixMessage does not own its bytes. Instead, it holds:
byte[] buf; // reference to decoder's recvBuf
int[] tagNums; // parallel array: FIX tag numbers
int[] valOffsets; // parallel array: byte offset of each tag's value
int[] valLengths; // parallel array: byte length of each tag's value
int tagCount;
getValueOffset(tag) is a linear scan over tagNums[] — O(n) with n ≤ 64. A HashMap<Integer,Integer> would be O(1) average but with substantial constant factors: boxing, hash computation, collision resolution, and pointer indirection across heap objects. For 10–20 tags per message, linear scan over a contiguous int[] is faster due to data locality and cache line efficiency.
Cache locality example: tagNums[0..63] (256 bytes) fits in 4 cache lines. A HashMap with 64 entries would touch at minimum: the HashMap header, the table[] array pointer, up to 64 Entry objects each on a separate cache line. On a cache-cold decode, the array scan wins decisively.
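The lookup itself can be sketched like this (field names as above; indexOf is my illustrative name for the scan behind getValueOffset):

```java
// Sketch (assumed names) of the flyweight lookup: a linear scan over the
// parallel tag arrays, returning an index into valOffsets/valLengths.
public class FlyweightScan {
    byte[] buf;            // points into the decoder's recvBuf, never copied
    int[] tagNums = new int[64];
    int[] valOffsets = new int[64];
    int[] valLengths = new int[64];
    int tagCount;

    // O(n) over a contiguous int[]: for n <= 64 this touches at most four
    // cache lines, beating a HashMap's boxing and pointer chasing.
    int indexOf(int tag) {
        for (int i = 0; i < tagCount; i++) {
            if (tagNums[i] == tag) return i;
        }
        return -1; // tag absent from this message
    }
}
```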
Hot-Field Caching
During addTag(), if the tag number is 35 (MsgType) or 34 (SeqNum), the value is parsed eagerly and stored in cachedMsgType and cachedSeqNum. Subsequent accesses (getMsgType(), getSeqNum()) are O(1) field reads — no array scan.
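A sketch of what addTag() plausibly does for the hot tags (the parse shown is the standard allocation-free ASCII-decimal loop; the engine's exact code may differ):

```java
// Sketch (assumed names) of hot-field caching: tags 35 and 34 are parsed
// eagerly while their bytes are still hot in cache, so getSeqNum() later is
// a plain field read instead of an array scan plus re-parse.
public class HotFieldCache {
    static final int TAG_MSG_TYPE = 35, TAG_SEQ_NUM = 34;
    byte[] buf;             // the decoder's recvBuf
    byte cachedMsgType;     // first byte of the tag-35 value
    int cachedSeqNum = -1;

    void addTag(int tag, int valOffset, int valLength) {
        // ... store tag/offset/length into the parallel arrays ...
        if (tag == TAG_MSG_TYPE) {
            cachedMsgType = buf[valOffset];
        } else if (tag == TAG_SEQ_NUM) {
            int n = 0; // allocation-free ASCII-decimal parse
            for (int i = valOffset; i < valOffset + valLength; i++) {
                n = n * 10 + (buf[i] - '0');
            }
            cachedSeqNum = n;
        }
    }

    int getSeqNum() { return cachedSeqNum; } // O(1), no scan
}
```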
Object Pool
FixMessagePool maintains an array of 16 pre-allocated FixMessage instances. Acquire/release are stack operations on a simple counter:
acquire() → return available > 0 ? pool[--available] : null
release(msg) → pool[available++] = msg
No synchronisation (single-threaded), no allocation, O(1). The null return on exhaustion is a backpressure signal — the decoder stops consuming from the receive buffer if the application hasn’t released messages fast enough.
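In Java the pool might look like this (a sketch, with Object standing in for FixMessage to keep it self-contained):

```java
// Sketch of the single-threaded pool: acquire/release are counter bumps on
// a pre-filled array, and the null return on exhaustion doubles as the
// backpressure signal to stop draining the receive buffer.
public class FixMessagePool {
    private final Object[] pool;   // FixMessage[] in the engine
    private int available;

    public FixMessagePool(int size) {
        pool = new Object[size];
        for (int i = 0; i < size; i++) pool[i] = new Object(); // pre-allocate once
        available = size;
    }

    public Object acquire() {
        return available > 0 ? pool[--available] : null; // null == backpressure
    }

    public void release(Object msg) {
        pool[available++] = msg; // single-owner use: no bounds check, no lock
    }
}
```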
Lazy Buffer Compaction
When dataStart > recvBuf.length / 2, the decoder compacts:
System.arraycopy(recvBuf, dataStart, recvBuf, 0, dataEnd - dataStart);
dataEnd -= dataStart;
dataStart = 0;
System.arraycopy maps to an architecture-specific memory copy intrinsic (typically memmove) — the fastest available copy path on the JVM. Compacting only at the 50% threshold amortises the copy cost over many messages.
FixEncoder — Zero-GC Serialisation
Pre-allocated Send Buffer
The encoder owns a single byte[] sendBuf (4,096 bytes) passed in at construction time. All messages are serialised into this buffer in-place. The instance is not thread-safe by design — it is owned by the single session thread — so no synchronisation is needed.
The alternative, constructing a ByteArrayOutputStream or StringBuilder per message, would generate one allocation per field, per tag delimiter, and per numeric conversion. At 50,000 messages/second those allocations create sustained GC pressure.
Header Back-Fill
FIX requires tag 9 (BodyLength) to precede the body, but BodyLength is not known until the body is serialised. The encoder resolves this with a two-region layout:
sendBuf:
[0 .. HEADER_RESERVE-1] ← 20 bytes reserved for header back-fill
[HEADER_RESERVE .. pos] ← body written here first
[pos .. pos+7] ← checksum (tag 10) appended last
After the body is complete, pos - HEADER_RESERVE gives the exact body length. The header (8=FIX.4.4\x01, 9=NNN\x01, 35=X\x01) is then written right-to-left from HEADER_RESERVE - 1 backwards into the reserved region. A small 8-byte scratch buffer on the instance holds the intermediate byte representation of the length while computing the number of decimal digits needed.
This approach achieves single-pass serialisation with zero copies. The alternative — two passes or a temporary buffer — would double memory bandwidth usage.
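A sketch of the back-fill step under the layout above (checksum append omitted; the tag-35 part of the real header is elided here for brevity):

```java
// Sketch (assumed names) of header back-fill: the body is written starting
// at HEADER_RESERVE, then "8=FIX.4.4|9=NNN|" is written right-to-left into
// the reserved region once the body length is known.
public class BackFillDemo {
    static final int HEADER_RESERVE = 20;
    final byte[] sendBuf = new byte[4096];

    // Called after the body is complete; returns the offset where the
    // finished message starts within sendBuf.
    int finish(int bodyEnd) {
        int bodyLen = bodyEnd - HEADER_RESERVE;
        int p = HEADER_RESERVE;
        sendBuf[--p] = 0x01;                   // SOH terminating tag 9
        for (int v = bodyLen; ; v /= 10) {     // digits, right-to-left
            sendBuf[--p] = (byte) ('0' + v % 10);
            if (v < 10) break;
        }
        byte[] head = "8=FIX.4.4\u00019=".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        p -= head.length;
        System.arraycopy(head, 0, sendBuf, p, head.length);
        return p; // message occupies sendBuf[p .. bodyEnd-1]
    }
}
```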
Allocation-Free Integer and Timestamp Encoding
AsciiUtil.writeInt(buf, offset, value) encodes an int directly to ASCII decimal bytes:
count digits → write right-to-left using DIGITS[] lookup table
The DIGITS array ({'0','1',...,'9'}) fits in a single 64-byte cache line. Digit extraction uses integer division and modulo — no heap allocation, no Integer.toString(), no String.getBytes().
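A sketch of writeInt along these lines (the engine's exact signature is assumed; negative values omitted for brevity):

```java
// Sketch of allocation-free integer encoding: count the digits, then write
// right-to-left from a lookup table. No Integer.toString(), no String.
public class AsciiIntDemo {
    private static final byte[] DIGITS =
            "0123456789".getBytes(java.nio.charset.StandardCharsets.US_ASCII);

    // Writes the decimal representation of a non-negative value at offset;
    // returns the number of bytes written.
    static int writeInt(byte[] buf, int offset, int value) {
        int len = 1;
        for (int v = value; v >= 10; v /= 10) len++;   // digit count
        for (int i = offset + len - 1; i >= offset; i--) {
            buf[i] = DIGITS[value % 10];
            value /= 10;
        }
        return len;
    }
}
```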
Timestamp encoding (AsciiUtil.writeTimestamp) uses the Hinnant civil-from-days algorithm: given an epoch-millisecond value, it decomposes year, month, and day using pure integer arithmetic (shifted-month Gregorian calendar). Hours, minutes, seconds, and milliseconds are derived via successive modulo/division. The result is a 21-byte fixed-format ASCII timestamp written directly to the buffer.
Equivalent code using java.time.Instant and DateTimeFormatter would allocate: a ZonedDateTime, at minimum one StringBuilder inside the formatter, and the final String. The Hinnant approach allocates nothing.
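For reference, the civil-from-days decomposition can be written like this (a sketch of Hinnant's published algorithm; the engine's code may differ in detail):

```java
// Sketch: Howard Hinnant's civil_from_days decomposition producing the
// 21-byte FIX UTCTimestamp yyyyMMdd-HH:mm:ss.SSS with no java.time objects.
public class TimestampDemo {
    static void writeTimestamp(byte[] buf, int off, long epochMillis) {
        long days = Math.floorDiv(epochMillis, 86_400_000L);
        int msOfDay = (int) Math.floorMod(epochMillis, 86_400_000L);
        // civil_from_days: shifted-month Gregorian, pure integer arithmetic
        long z = days + 719468;
        long era = Math.floorDiv(z, 146097L);
        int doe = (int) (z - era * 146097);                       // [0, 146096]
        int yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
        int doy = doe - (365 * yoe + yoe / 4 - yoe / 100);        // [0, 365]
        int mp = (5 * doy + 2) / 153;                             // [0, 11]
        int d = doy - (153 * mp + 2) / 5 + 1;                     // [1, 31]
        int m = mp < 10 ? mp + 3 : mp - 9;                        // [1, 12]
        int y = (int) (yoe + era * 400) + (m <= 2 ? 1 : 0);
        int hh = msOfDay / 3_600_000, mm = msOfDay / 60_000 % 60;
        int ss = msOfDay / 1000 % 60, ms = msOfDay % 1000;
        write(buf, off, y, 4);       write(buf, off + 4, m, 2);
        write(buf, off + 6, d, 2);   buf[off + 8] = '-';
        write(buf, off + 9, hh, 2);  buf[off + 11] = ':';
        write(buf, off + 12, mm, 2); buf[off + 14] = ':';
        write(buf, off + 15, ss, 2); buf[off + 17] = '.';
        write(buf, off + 18, ms, 3);
    }

    // Fixed-width, zero-padded field, written right-to-left.
    private static void write(byte[] buf, int off, int v, int width) {
        for (int i = off + width - 1; i >= off; i--, v /= 10) {
            buf[i] = (byte) ('0' + v % 10);
        }
    }
}
```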
FixSession — Single-Threaded State Machine
Threading Model
FixSession runs entirely on one virtual thread (Java 21 Thread.ofVirtual().start()). Virtual threads are cheap to create (no OS thread per session), but more importantly the single-thread discipline means:
- No locks on the hot path. outSeqNum, inSeqNum, lastSendTimeNanos, and lastRecvTimeNanos are plain int/long fields with no volatile or AtomicLong synchronisation barrier overhead.
- No cache-line invalidation storms. Shared mutable state accessed by multiple threads requires cache coherency traffic between CPU cores (MESI protocol). With one thread, all these fields stay in the L1/L2 cache of the single core running the session.
An aside: I am not sure why Claude chose a virtual thread here; it should stick with a regular OS thread, pinned and dedicated to a core. The JVM's virtual thread scheduler is not designed for low-latency, high-throughput workloads, and may introduce scheduling jitter and context-switch overhead. A pinned OS thread ensures consistent CPU cache locality and predictable execution.
volatile SessionState state and AtomicBoolean running are the only cross-thread fields, used only for lifecycle management from an external control thread — not in the main receive/send loop.
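The timer logic this enables can be sketched as plain field comparisons (assumed names, with the interval fixed at 30 seconds for illustration); no ScheduledExecutorService, which would allocate task objects and add scheduler jitter:

```java
// Sketch (assumed names) of the session loop's heartbeat timers: plain long
// fields compared against a nanosecond clock, polled from the single thread.
public class HeartbeatTimers {
    static final long HEARTBEAT_INTERVAL_NANOS = 30_000_000_000L; // 30 s
    long lastSendTimeNanos;
    long lastRecvTimeNanos;

    // Nothing sent for a full interval: emit a Heartbeat.
    boolean shouldSendHeartbeat(long nowNanos) {
        return nowNanos - lastSendTimeNanos >= HEARTBEAT_INTERVAL_NANOS;
    }

    // Peer silent for 1.5x the interval: probe with a TestRequest.
    boolean shouldSendTestRequest(long nowNanos) {
        return nowNanos - lastRecvTimeNanos >= HEARTBEAT_INTERVAL_NANOS * 3 / 2;
    }
}
```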
TcpFixTransport — NIO + TLS with Direct Buffers
Direct ByteBuffer Allocation
appSendBuf = ByteBuffer.allocateDirect(appBufSize);
netSendBuf = ByteBuffer.allocateDirect(netBufSize);
netRecvBuf = ByteBuffer.allocateDirect(netBufSize);
appRecvBuf = ByteBuffer.allocateDirect(appBufSize);
allocateDirect allocates off-heap memory via malloc (or mmap). Two key properties:
- Not subject to GC pauses. The GC never stops the world to scan or relocate this memory.
- Zero-copy to kernel. When SocketChannel.write(netSendBuf) is called, the JVM can pass the native pointer directly to the kernel's socket write path.
NIO SocketChannel
SocketChannel provides the java.nio Channel abstraction over a TCP socket. Blocking mode is used because this session runs on a dedicated thread, making a Selector loop unnecessary. channel.read(netRecvBuf) is a direct syscall (read(2)) with no intermediate copies.
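A sketch of the read path (typed against ReadableByteChannel so it can be exercised without a socket; in the engine the same call runs against the SocketChannel, and field names are assumed):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Sketch of the blocking read loop: one syscall fills the direct buffer,
// then the bytes are drained into the decoder's recvBuf.
public class TransportRead {
    final ByteBuffer netRecvBuf = ByteBuffer.allocateDirect(65536); // off-heap

    // Reads once, copies into dst at dstOffset, and returns the byte count
    // (or -1 on EOF). dst would be the decoder's recvBuf at recvBuf[dataEnd].
    int readInto(ReadableByteChannel ch, byte[] dst, int dstOffset) throws IOException {
        netRecvBuf.clear();
        int n = ch.read(netRecvBuf);       // blocking read(2), no heap allocation
        if (n <= 0) return n;
        netRecvBuf.flip();
        netRecvBuf.get(dst, dstOffset, n); // single copy out of the direct buffer
        return n;
    }
}
```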
PlainFixTransport — TCP_NODELAY
socket.setTcpNoDelay(true);
Nagle’s algorithm coalesces small writes into larger TCP segments to improve throughput. For FIX messages (typically 100–500 bytes), its interaction with delayed ACKs can add up to 200 ms of latency (the delayed-ACK timeout). TCP_NODELAY disables Nagle, ensuring each write() produces an immediate segment transmission. This is effectively mandatory for any latency-sensitive FIX implementation.
SO_SNDBUF and SO_RCVBUF are left at OS defaults. For a latency-sensitive single-connection client, the default kernel buffer sizes (typically 128 KB–4 MB) are more than adequate; tuning these would only matter for high-throughput streaming where kernel buffer sizing affects TCP window size and throughput.
Summary
Claude coordinates low-latency optimisations across the stack with impressive consistency. The engine achieves low and deterministic latency through five interlocking mechanisms:
- Zero allocation on the hot path — pre-allocated byte[] buffers, object pools, and allocation-free utility methods eliminate GC pressure entirely.
- Zero unnecessary copies — flyweight FixMessage, direct ByteBuffer send, and back-fill encoding mean bytes are written once and sent directly to the kernel.
- Zero lock contention — single-threaded session ownership means no synchronisation primitives are needed on any field touched in the main loop.
- Minimal syscall overhead — TCP_NODELAY, direct buffers for zero-copy kernel send, and NIO channels all reduce the cost of each I/O operation.
- Cache-friendly data structures — contiguous int[] tag arrays, DIGITS lookup tables, and pre-encoded config bytes exploit CPU spatial and temporal locality.