Introduction
After reading that a Claude agent team capably built a C compiler, I had to try building a zero-GC Java FIX engine myself. The result is a high-performance, low-latency FixEngine written in pure Java, with no garbage collection on the hot path. I will probably make the code available on GitHub at some point.
It took a day to get a working prototype that could execute a simple Kraken FIX order flow.
The Golden Rule: Don’t Create Rubbish You Have to Clean Up
A FIX engine is not all that interesting in itself: it simply translates messages between two parties. The technical aspects of how it does this, however, are fascinating. The main challenge is speed: the faster it can process messages, the better. In financial trading, every millisecond counts.
The single biggest enemy of speed in Java programs is the garbage collector (GC) — Java’s automatic memory cleaner. When your program creates lots of temporary objects (like strings, arrays, or other data), Java has to pause everything periodically to clean them up. These pauses are unpredictable and can last milliseconds — which in trading is an eternity.
The entire design of this engine is built around one rule: never create temporary objects when handling a message. Everything is pre-prepared.
What Was Built
The implementation proved that the full encode → send → receive → decode → callback cycle can run with no heap allocation on the sending thread in steady state, while sustaining ~44–52K orders/second.
In this post, we’ll cover the main components of the engine and how they achieve zero GC:
- FixDecoder — streaming message decoder with a flyweight over an 8KB receive buffer
- FixEncoder — stateless message encoder writing into a pre-allocated 4KB buffer
- FixSession — single-threaded FIX 4.4 session (Logon through ACTIVE, heartbeats, TestRequest/Response, SequenceReset)
- PlainFixTransport / TcpFixTransport — plain TCP and TLS 1.3 transports
The unifying design goal is deterministic low latency: not merely fast on average, but fast with a tight tail distribution. A GC pause of even 1 ms is unacceptable in a market-making or order-routing context. Every design decision below flows from eliminating three root causes of latency jitter:
- JVM heap allocation on the hot path — triggers GC cycles
- Lock contention — causes thread suspension and OS scheduling delay
- Unnecessary memory copies — wastes memory bandwidth and pollutes CPU caches
The FIX Protocol — What We’re Encoding
FIX 4.4 is a plain-text, tag=value protocol over TCP. Every message has the form:
8=FIX.4.4|9=<bodyLen>|35=<msgType>|34=<seqNum>|49=<sender>|56=<target>|52=<time>|...|10=<checksum>|
Fields (tags) are |-delimited (SOH, 0x01). Tags 8, 9, and 10 are the envelope; everything between 35 and checksum is the body.
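For completeness, the tag 10 value is the unsigned sum of every byte before the checksum field, modulo 256, rendered as exactly three ASCII digits. A minimal sketch (my illustration, not the engine's code):

```java
// Sketch of FIX CheckSum (tag 10): the unsigned byte sum, modulo 256, of
// every byte from "8=" up to and including the SOH preceding "10=".
public class FixChecksum {
    public static int compute(byte[] msg, int offset, int length) {
        int sum = 0;
        for (int i = offset; i < offset + length; i++) {
            sum += msg[i] & 0xFF; // treat bytes as unsigned
        }
        return sum & 0xFF; // modulo 256, rendered as three ASCII digits
    }
}
```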
We implemented:
| Direction | Messages |
|---|---|
| Outbound | Logon, Heartbeat, TestRequest, NewOrderSingle, OrderCancelRequest |
| Inbound | Logon, Heartbeat, TestRequest, SequenceReset, ExecutionReport |
FixDecoder — Zero-GC Incremental Parsing
Receive Buffer and Partial-Read Handling
The decoder owns byte[] recvBuf (8,192 bytes). The transport writes directly into this buffer at recvBuf[dataEnd]. Two cursors manage state:
recvBuf:
[0 .. dataStart-1] ← already-consumed bytes (logically free)
[dataStart .. dataEnd-1] ← buffered bytes awaiting complete message
[dataEnd .. capacity-1] ← free space for next read
Partial FIX messages (where a TCP segment boundary falls mid-message) are handled naturally: dataEnd advances after each read, and decode() only advances dataStart when a complete, validated message is extracted.
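A minimal sketch of this cursor discipline (names taken from the description above; the real decoder also validates BodyLength and CheckSum before consuming):

```java
// Sketch (assumed names) of the decoder's read/decode cycle: the transport
// appends bytes at dataEnd, and dataStart advances only once a complete
// message, terminated by the tag-10 field, is buffered.
public class RecvCursorDemo {
    byte[] recvBuf = new byte[8192];
    int dataStart = 0; // first unconsumed byte
    int dataEnd = 0;   // one past the last buffered byte

    // Transport calls this after read(2) deposited n bytes at recvBuf[dataEnd].
    void onBytesReceived(int n) { dataEnd += n; }

    // Returns the length of the next complete message, or -1 if we must wait
    // for more bytes (a TCP segment boundary fell mid-message).
    int nextMessageLength() {
        // A complete message ends with "10=NNN<SOH>"; scan for that trailer.
        for (int i = dataStart; i + 6 < dataEnd; i++) {
            if (recvBuf[i] == '1' && recvBuf[i + 1] == '0' && recvBuf[i + 2] == '='
                    && (i == dataStart || recvBuf[i - 1] == 0x01)
                    && recvBuf[i + 6] == 0x01) {
                return (i + 7) - dataStart;
            }
        }
        return -1; // partial message stays buffered; dataStart is untouched
    }

    void consume(int len) { dataStart += len; }
}
```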
Flyweight Message Wrapping (Zero-Copy)
A FixMessage does not own its bytes. Instead, it holds:
byte[] buf; // reference to decoder's recvBuf
int[] tagNums; // parallel array: FIX tag numbers
int[] valOffsets; // parallel array: byte offset of each tag's value
int[] valLengths; // parallel array: byte length of each tag's value
int tagCount;
getValueOffset(tag) is a linear scan over tagNums[] — O(n) with n ≤ 64. A HashMap<Integer,Integer> would be O(1) average but with substantial constant factors: boxing, hash computation, collision resolution, and pointer indirection across heap objects. For 10–20 tags per message, linear scan over a contiguous int[] is faster due to data locality and cache line efficiency.
Cache locality example: tagNums[0..63] (256 bytes) fits in 4 cache lines. A HashMap with 64 entries would touch at minimum: the HashMap header, the table[] array pointer, up to 64 Entry objects each on a separate cache line. On a cache-cold decode, the array scan wins decisively.
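The lookup itself can be sketched like this (field names as above; indexOf is my illustrative name for the scan behind getValueOffset):

```java
// Sketch (assumed names) of the flyweight lookup: a linear scan over the
// parallel tag arrays, returning an index into valOffsets/valLengths.
public class FlyweightScan {
    byte[] buf;            // points into the decoder's recvBuf, never copied
    int[] tagNums = new int[64];
    int[] valOffsets = new int[64];
    int[] valLengths = new int[64];
    int tagCount;

    // O(n) over a contiguous int[]: for n <= 64 this touches at most four
    // cache lines, beating a HashMap's boxing and pointer chasing.
    int indexOf(int tag) {
        for (int i = 0; i < tagCount; i++) {
            if (tagNums[i] == tag) return i;
        }
        return -1; // tag absent from this message
    }
}
```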
Hot-Field Caching
During addTag(), if the tag number is 35 (MsgType) or 34 (SeqNum), the value is parsed eagerly and stored in cachedMsgType and cachedSeqNum. Subsequent accesses (getMsgType(), getSeqNum()) are O(1) field reads — no array scan.
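A sketch of what addTag() plausibly does for the hot tags (the parse shown is the standard allocation-free ASCII-decimal loop; the engine's exact code may differ):

```java
// Sketch (assumed names) of hot-field caching: tags 35 and 34 are parsed
// eagerly while their bytes are still hot in cache, so getSeqNum() later is
// a plain field read instead of an array scan plus re-parse.
public class HotFieldCache {
    static final int TAG_MSG_TYPE = 35, TAG_SEQ_NUM = 34;
    byte[] buf;             // the decoder's recvBuf
    byte cachedMsgType;     // first byte of the tag-35 value
    int cachedSeqNum = -1;

    void addTag(int tag, int valOffset, int valLength) {
        // ... store tag/offset/length into the parallel arrays ...
        if (tag == TAG_MSG_TYPE) {
            cachedMsgType = buf[valOffset];
        } else if (tag == TAG_SEQ_NUM) {
            int n = 0; // allocation-free ASCII-decimal parse
            for (int i = valOffset; i < valOffset + valLength; i++) {
                n = n * 10 + (buf[i] - '0');
            }
            cachedSeqNum = n;
        }
    }

    int getSeqNum() { return cachedSeqNum; } // O(1), no scan
}
```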
Object Pool
FixMessagePool maintains an array of 16 pre-allocated FixMessage instances. Acquire/release are stack operations on a simple counter:
acquire() → return available > 0 ? pool[--available] : null
release(msg) → pool[available++] = msg
No synchronisation (single-threaded), no allocation, O(1). The null return on exhaustion is a backpressure signal — the decoder stops consuming from the receive buffer if the application hasn’t released messages fast enough.
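In Java the pool might look like this (a sketch, with Object standing in for FixMessage to keep it self-contained):

```java
// Sketch of the single-threaded pool: acquire/release are counter bumps on
// a pre-filled array, and the null return on exhaustion doubles as the
// backpressure signal to stop draining the receive buffer.
public class FixMessagePool {
    private final Object[] pool;   // FixMessage[] in the engine
    private int available;

    public FixMessagePool(int size) {
        pool = new Object[size];
        for (int i = 0; i < size; i++) pool[i] = new Object(); // pre-allocate once
        available = size;
    }

    public Object acquire() {
        return available > 0 ? pool[--available] : null; // null == backpressure
    }

    public void release(Object msg) {
        pool[available++] = msg; // single-owner use: no bounds check, no lock
    }
}
```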
Lazy Buffer Compaction
When dataStart > recvBuf.length / 2, the decoder compacts:
System.arraycopy(recvBuf, dataStart, recvBuf, 0, dataEnd - dataStart);
dataEnd -= dataStart;
dataStart = 0;
System.arraycopy maps to an architecture-specific memory copy intrinsic (typically memmove) — the fastest available copy path on the JVM. Compacting only at the 50% threshold amortises the copy cost over many messages.
FixEncoder — Zero-GC Serialisation
Pre-allocated Send Buffer
The encoder owns a single byte[] sendBuf (4,096 bytes) passed in at construction time. All messages are serialised into this buffer in-place. The instance is not thread-safe by design — it is owned by the single session thread — so no synchronisation is needed.
The alternative, constructing a ByteArrayOutputStream or StringBuilder per message, would generate one allocation per field, per tag delimiter, and per numeric conversion. At 50,000 messages/second those allocations create sustained GC pressure.
Header Back-Fill
FIX requires tag 9 (BodyLength) to precede the body, but BodyLength is not known until the body is serialised. The encoder resolves this with a two-region layout:
sendBuf:
[0 .. HEADER_RESERVE-1] ← 20 bytes reserved for header back-fill
[HEADER_RESERVE .. pos] ← body written here first
[pos .. pos+7] ← checksum (tag 10) appended last
After the body is complete, pos - HEADER_RESERVE gives the exact body length. The header (8=FIX.4.4\x01, 9=NNN\x01, 35=X\x01) is then written right-to-left from HEADER_RESERVE - 1 backwards into the reserved region. A small 8-byte scratch buffer on the instance holds the intermediate byte representation of the length while computing the number of decimal digits needed.
This approach achieves single-pass serialisation with zero copies. The alternative — two passes or a temporary buffer — would double memory bandwidth usage.
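A sketch of the back-fill step under the layout above (checksum append omitted; the tag-35 part of the real header is elided here for brevity):

```java
// Sketch (assumed names) of header back-fill: the body is written starting
// at HEADER_RESERVE, then "8=FIX.4.4|9=NNN|" is written right-to-left into
// the reserved region once the body length is known.
public class BackFillDemo {
    static final int HEADER_RESERVE = 20;
    final byte[] sendBuf = new byte[4096];

    // Called after the body is complete; returns the offset where the
    // finished message starts within sendBuf.
    int finish(int bodyEnd) {
        int bodyLen = bodyEnd - HEADER_RESERVE;
        int p = HEADER_RESERVE;
        sendBuf[--p] = 0x01;                   // SOH terminating tag 9
        for (int v = bodyLen; ; v /= 10) {     // digits, right-to-left
            sendBuf[--p] = (byte) ('0' + v % 10);
            if (v < 10) break;
        }
        byte[] head = "8=FIX.4.4\u00019=".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        p -= head.length;
        System.arraycopy(head, 0, sendBuf, p, head.length);
        return p; // message occupies sendBuf[p .. bodyEnd-1]
    }
}
```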
Allocation-Free Integer and Timestamp Encoding
AsciiUtil.writeInt(buf, offset, value) encodes an int directly to ASCII decimal bytes:
count digits → write right-to-left using DIGITS[] lookup table
The DIGITS array ({'0','1',...,'9'}) fits in a single 64-byte cache line. Digit extraction uses integer division and modulo — no heap allocation, no Integer.toString(), no String.getBytes().
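A sketch of writeInt along these lines (the engine's exact signature is assumed; negative values omitted for brevity):

```java
// Sketch of allocation-free integer encoding: count the digits, then write
// right-to-left from a lookup table. No Integer.toString(), no String.
public class AsciiIntDemo {
    private static final byte[] DIGITS =
            "0123456789".getBytes(java.nio.charset.StandardCharsets.US_ASCII);

    // Writes the decimal representation of a non-negative value at offset;
    // returns the number of bytes written.
    static int writeInt(byte[] buf, int offset, int value) {
        int len = 1;
        for (int v = value; v >= 10; v /= 10) len++;   // digit count
        for (int i = offset + len - 1; i >= offset; i--) {
            buf[i] = DIGITS[value % 10];
            value /= 10;
        }
        return len;
    }
}
```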
Timestamp encoding (AsciiUtil.writeTimestamp) uses the Hinnant civil-from-days algorithm: given an epoch-millisecond value, it decomposes year, month, and day using pure integer arithmetic (shifted-month Gregorian calendar). Hours, minutes, seconds, and milliseconds are derived via successive modulo/division. The result is a 21-byte fixed-format ASCII timestamp written directly to the buffer.
Equivalent code using java.time.Instant and DateTimeFormatter would allocate: a ZonedDateTime, at minimum one StringBuilder inside the formatter, and the final String. The Hinnant approach allocates nothing.
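For reference, the civil-from-days decomposition can be written like this (a sketch of Hinnant's published algorithm; the engine's code may differ in detail):

```java
// Sketch: Howard Hinnant's civil_from_days decomposition producing the
// 21-byte FIX UTCTimestamp yyyyMMdd-HH:mm:ss.SSS with no java.time objects.
public class TimestampDemo {
    static void writeTimestamp(byte[] buf, int off, long epochMillis) {
        long days = Math.floorDiv(epochMillis, 86_400_000L);
        int msOfDay = (int) Math.floorMod(epochMillis, 86_400_000L);
        // civil_from_days: shifted-month Gregorian, pure integer arithmetic
        long z = days + 719468;
        long era = Math.floorDiv(z, 146097L);
        int doe = (int) (z - era * 146097);                       // [0, 146096]
        int yoe = (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;
        int doy = doe - (365 * yoe + yoe / 4 - yoe / 100);        // [0, 365]
        int mp = (5 * doy + 2) / 153;                             // [0, 11]
        int d = doy - (153 * mp + 2) / 5 + 1;                     // [1, 31]
        int m = mp < 10 ? mp + 3 : mp - 9;                        // [1, 12]
        int y = (int) (yoe + era * 400) + (m <= 2 ? 1 : 0);
        int hh = msOfDay / 3_600_000, mm = msOfDay / 60_000 % 60;
        int ss = msOfDay / 1000 % 60, ms = msOfDay % 1000;
        write(buf, off, y, 4);       write(buf, off + 4, m, 2);
        write(buf, off + 6, d, 2);   buf[off + 8] = '-';
        write(buf, off + 9, hh, 2);  buf[off + 11] = ':';
        write(buf, off + 12, mm, 2); buf[off + 14] = ':';
        write(buf, off + 15, ss, 2); buf[off + 17] = '.';
        write(buf, off + 18, ms, 3);
    }

    // Fixed-width, zero-padded field, written right-to-left.
    private static void write(byte[] buf, int off, int v, int width) {
        for (int i = off + width - 1; i >= off; i--, v /= 10) {
            buf[i] = (byte) ('0' + v % 10);
        }
    }
}
```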
FixSession — Single-Threaded State Machine
Threading Model
FixSession runs entirely on one virtual thread (Java 21 Thread.ofVirtual().start()). Virtual threads are cheap to create (no OS thread per session), but more importantly the single-thread discipline means:
- No locks on the hot path. outSeqNum, inSeqNum, lastSendTimeNanos, and lastRecvTimeNanos are plain int/long fields with no volatile or AtomicLong synchronisation barrier overhead.
- No cache-line invalidation storms. Shared mutable state accessed by multiple threads requires cache coherency traffic between CPU cores (MESI protocol). With one thread, all these fields stay in the L1/L2 cache of the single core running the session.
An aside: I am not sure why Claude chose a virtual thread here; it should stick with a regular OS thread, pinned and dedicated to a core. The JVM's virtual thread scheduler is not designed for low-latency, high-throughput workloads, and may introduce scheduling jitter and context-switch overhead. A pinned OS thread ensures consistent CPU cache locality and predictable execution.
volatile SessionState state and AtomicBoolean running are the only cross-thread fields, used only for lifecycle management from an external control thread — not in the main receive/send loop.
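The timer logic this enables can be sketched as plain field comparisons (assumed names, with the interval fixed at 30 seconds for illustration); no ScheduledExecutorService, which would allocate task objects and add scheduler jitter:

```java
// Sketch (assumed names) of the session loop's heartbeat timers: plain long
// fields compared against a nanosecond clock, polled from the single thread.
public class HeartbeatTimers {
    static final long HEARTBEAT_INTERVAL_NANOS = 30_000_000_000L; // 30 s
    long lastSendTimeNanos;
    long lastRecvTimeNanos;

    // Nothing sent for a full interval: emit a Heartbeat.
    boolean shouldSendHeartbeat(long nowNanos) {
        return nowNanos - lastSendTimeNanos >= HEARTBEAT_INTERVAL_NANOS;
    }

    // Peer silent for 1.5x the interval: probe with a TestRequest.
    boolean shouldSendTestRequest(long nowNanos) {
        return nowNanos - lastRecvTimeNanos >= HEARTBEAT_INTERVAL_NANOS * 3 / 2;
    }
}
```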
TcpFixTransport — NIO + TLS with Direct Buffers
Direct ByteBuffer Allocation
appSendBuf = ByteBuffer.allocateDirect(appBufSize);
netSendBuf = ByteBuffer.allocateDirect(netBufSize);
netRecvBuf = ByteBuffer.allocateDirect(netBufSize);
appRecvBuf = ByteBuffer.allocateDirect(appBufSize);
allocateDirect allocates off-heap memory via malloc (or mmap). Two key properties:
- Not subject to GC pauses. The GC never stops the world to scan or relocate this memory.
- Zero-copy to kernel. When SocketChannel.write(netSendBuf) is called, the JVM can pass the native pointer directly to the kernel's socket write path.
NIO SocketChannel
SocketChannel provides the java.nio Channel abstraction over a TCP socket. Blocking mode is used because this session runs on a dedicated thread, making a Selector loop unnecessary. channel.read(netRecvBuf) is a direct syscall (read(2)) with no intermediate copies.
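A sketch of the read path (typed against ReadableByteChannel so it can be exercised without a socket; in the engine the same call runs against the SocketChannel, and field names are assumed):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Sketch of the blocking read loop: one syscall fills the direct buffer,
// then the bytes are drained into the decoder's recvBuf.
public class TransportRead {
    final ByteBuffer netRecvBuf = ByteBuffer.allocateDirect(65536); // off-heap

    // Reads once, copies into dst at dstOffset, and returns the byte count
    // (or -1 on EOF). dst would be the decoder's recvBuf at recvBuf[dataEnd].
    int readInto(ReadableByteChannel ch, byte[] dst, int dstOffset) throws IOException {
        netRecvBuf.clear();
        int n = ch.read(netRecvBuf);       // blocking read(2), no heap allocation
        if (n <= 0) return n;
        netRecvBuf.flip();
        netRecvBuf.get(dst, dstOffset, n); // single copy out of the direct buffer
        return n;
    }
}
```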
PlainFixTransport — TCP_NODELAY
socket.setTcpNoDelay(true);
Nagle’s algorithm coalesces small writes into larger TCP segments to improve throughput. For FIX messages (typically 100–500 bytes), its interaction with delayed ACKs can add up to 200 ms of latency (the delayed-ACK timeout). TCP_NODELAY disables Nagle, ensuring each write() produces an immediate segment transmission. This is effectively mandatory for any latency-sensitive FIX implementation.
SO_SNDBUF and SO_RCVBUF are left at OS defaults. For a latency-sensitive single-connection client, the default kernel buffer sizes (typically 128 KB–4 MB) are more than adequate; tuning these would only matter for high-throughput streaming where kernel buffer sizing affects TCP window size and throughput.
Summary
Claude coordinates low-latency optimisations across the stack with impressive consistency. The engine achieves low and deterministic latency through five interlocking mechanisms:
- Zero allocation on the hot path — pre-allocated byte[] buffers, object pools, and allocation-free utility methods eliminate GC pressure entirely.
- Zero unnecessary copies — flyweight FixMessage, direct ByteBuffer send, and back-fill encoding mean bytes are written once and sent directly to the kernel.
- Zero lock contention — single-threaded session ownership means no synchronisation primitives are needed on any field touched in the main loop.
- Minimal syscall overhead — TCP_NODELAY, direct buffers for zero-copy kernel send, and NIO channels all reduce the cost of each I/O operation.
- Cache-friendly data structures — contiguous int[] tag arrays, DIGITS lookup tables, and pre-encoded config bytes exploit CPU spatial and temporal locality.