Garbage Collectors: Shenandoah, ZGC, etc.

Introduction

The Garbage Collector (GC) is an important part of the Java Virtual Machine (JVM). It manages an application’s memory allocation, identifies memory that is no longer used and collects it for re-use.

There are currently five GC implementations available in Java 15 (tested here against an Azul Zulu build):

Serial GC
Parallel GC
G1 GC
ZGC
Shenandoah GC
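
Each collector is selected with a HotSpot flag at start-up: -XX:+UseSerialGC, -XX:+UseParallelGC, -XX:+UseG1GC, -XX:+UseZGC or -XX:+UseShenandoahGC (not every JDK build includes every collector). To confirm which one a run actually picked up, something like the following sketch can help. It only uses the standard java.lang.management API; the class name ActiveCollectors is my own and is not part of the benchmark code.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the benchmark project.
final class ActiveCollectors {

    // Returns the names of the GC beans registered by the running JVM.
    static List<String> names() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(bean.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Example: java -XX:+UseZGC ActiveCollectors.java
        System.out.println(names());
    }
}
```

The bean names differ per collector and per JDK build, so treat the output as a sanity check rather than a stable identifier.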

Let’s review these through a benchmark to get some practical insight into their high-level performance characteristics.

Collectors

Serial

The serial collector uses a single thread to handle GC. It is suited to applications with a small footprint, i.e. few threads and little memory (up to about 100 MB).

Parallel

The parallel collector (also known as the throughput collector) is like the serial collector except that it has multiple threads to handle GC. Consider this collector if throughput is the priority, above responsiveness.

G1

The G1 (Garbage First) collector is a mostly concurrent collector, i.e. most of the GC steps run concurrently with the application, but not all. Its performance characteristics are a compromise between latency and throughput.

Z

The Z Garbage Collector (ZGC) is a low-pause garbage collector: it does most of its work concurrently and aims never to stop application threads for more than a few milliseconds. Its goals are the same as Shenandoah's, but it achieves them using coloured references and remapping.

Shenandoah

The Shenandoah Garbage Collector is a low-pause-time garbage collector; it achieves this by doing more GC steps concurrently. As with ZGC, concurrent copy and compact stages give even shorter pauses, but its implementation is different: it is built around forwarding pointers (the Brooks pointer approach) rather than coloured references.

A good presentation on Shenandoah GC.

Benchmarks

Allocate

Allocate a byte array of some size repeatedly.
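
To make the workload concrete, here is a minimal sketch of what this benchmark does. It is not the original harness: it uses a plain timing loop, and the class name, iteration counts and the reading of "4 kB" as 4,000 bytes are all my assumptions.

```java
// Illustrative sketch of the 'Allocate' workload; names and counts are assumptions.
final class AllocateSketch {

    // Written to a static field so the JIT cannot optimise the allocation away.
    static byte[] sink;

    // Allocates `size` bytes `iterations` times and returns the
    // average time per allocation in microseconds.
    static double averageMicros(int size, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink = new byte[size];
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000.0 / iterations;
    }

    public static void main(String[] args) {
        int[] sizes = {4_000, 40_000, 400_000, 4_000_000};
        for (int size : sizes) {
            averageMicros(size, 1_000); // warm-up pass, result discarded
            System.out.printf("%d bytes: %.3f us/op%n", size, averageMicros(size, 10_000));
        }
    }
}
```

A hand-rolled loop like this is far less rigorous than a proper benchmark harness, so its numbers should only be read as rough indications.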

Allocate while 50% occupied

Allocate byte arrays that occupy 50% of the heap, then allocate a byte array of some size repeatedly. Originally I configured this test as “while 70% occupied”; however, the Shenandoah benchmark would run out of Java heap space. This could be a result of the overhead of its use of Brooks pointers, something to revisit later perhaps.
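
The set-up step could look roughly like the sketch below. It assumes the occupied share is measured against Runtime.maxMemory() and built from 64 KiB chunks; both are my assumptions, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the "occupy part of the heap first" set-up step.
final class OccupiedHeapSketch {

    // Retains byte arrays totalling at most `fraction` of `heapBytes`.
    // The caller must keep the returned list reachable for as long as
    // the heap should stay occupied.
    static List<byte[]> occupy(long heapBytes, double fraction, int chunkSize) {
        long target = (long) (heapBytes * fraction);
        List<byte[]> retained = new ArrayList<>();
        for (long held = 0; held + chunkSize <= target; held += chunkSize) {
            retained.add(new byte[chunkSize]);
        }
        return retained;
    }

    public static void main(String[] args) {
        // Best run with an explicit, modest heap, e.g. -Xmx512m.
        long heap = Runtime.getRuntime().maxMemory();
        List<byte[]> live = occupy(heap, 0.5, 64 * 1024);
        // The repeated allocation loop would run here, while `live` is still in use.
        System.out.printf("Holding %d chunks in a %d MiB heap%n", live.size(), heap >> 20);
    }
}
```

Small chunks are used for the retained data so that the set-up itself does not land in G1's humongous-object path.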

Allocate 60%

Allocate a number of byte arrays of some size so that they occupy 60% of the heap, do this repeatedly.
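
One operation of this benchmark can be sketched as follows: fill roughly 60% of the heap with chunks, then drop them all at once. Again this is an illustration under my own assumptions (heap size taken from Runtime.maxMemory(), hypothetical names), not the original code.

```java
// Illustrative sketch of one 'Allocate 60%' operation.
final class AllocateSixtySketch {

    // Allocates enough `chunkSize` arrays to cover 60% of `heapBytes`
    // and returns how many were allocated. Once the method returns,
    // nothing references the chunks, so they all become garbage.
    static int fillAndRelease(long heapBytes, int chunkSize) {
        int chunks = (int) (heapBytes / 10 * 6 / chunkSize);
        byte[][] held = new byte[chunks][];
        for (int i = 0; i < chunks; i++) {
            held[i] = new byte[chunkSize];
        }
        return held.length;
    }

    public static void main(String[] args) {
        // Best run with an explicit, modest heap, e.g. -Xmx512m.
        long heap = Runtime.getRuntime().maxMemory();
        for (int op = 0; op < 3; op++) {
            long start = System.nanoTime();
            int chunks = fillAndRelease(heap, 40_000);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("op %d: %d chunks in %d us%n", op, chunks, micros);
        }
    }
}
```

Because every operation leaves a large amount of garbage behind, the collector has to reclaim it before the next operation can succeed, which is what makes this workload GC-heavy.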

Results

Throughput: Average time per operation.

Allocate / Allocate while 50% occupied:

In general, G1GC performs the best in the ‘Allocate’ and ‘Allocate while 50% occupied’ benchmarks as long as the objects are not humongous in size (in these tests, only the 4 MB objects are).
When allocating 4 MB objects, ZGC and Shenandoah perform better.
Note that, in general ZGC and Shenandoah performance is comparable with G1GC.

Allocate 60%:

This test generates a lot of garbage, and here ZGC and Shenandoah perform far better.
It’s interesting to see that in humongous-object territory, G1GC’s performance is comparable to ZGC and Shenandoah.
Note the change in scale: these operations take milliseconds, versus microseconds for the two benchmarks above.

Latency: 99.9th percentile operation time.
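
For readers unfamiliar with the metric, a 99.9th percentile can be computed from raw per-operation timings with the nearest-rank method, as in this sketch. The benchmark tooling may well use a different percentile definition; the class and method names here are my own.

```java
import java.util.Arrays;

// Illustrative nearest-rank percentile, expressed in per-mille
// (999 = 99.9th percentile) to keep the arithmetic in integers.
final class PercentileSketch {

    // Returns the smallest sample such that at least perMille/1000
    // of all samples are less than or equal to it.
    static long perMille(long[] samples, int perMille) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        // Integer ceiling of (n * perMille / 1000).
        long rank = ((long) sorted.length * perMille + 999) / 1000;
        return sorted[(int) Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        long[] samples = new long[1_000];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = i + 1; // pretend timings of 1..1000 microseconds
        }
        System.out.println(perMille(samples, 999)); // prints 999
    }
}
```

With 1,000 samples the 99.9th percentile is the second-slowest operation, so a single long GC pause in a run is enough to move it.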

Allocate:

G1GC appears to have the better 99.9th percentile latency in the ‘Allocate’ benchmark but, as expected, struggles with 4 MB objects.
Shenandoah and ZGC are generally better than the serial and parallel collectors except at the 4 MB object range.

Allocate while 50% occupied:

In general, ZGC and Shenandoah perform the best at 4 kB and 40 kB and remain competitive at 400 kB, although Parallel GC did particularly well at 400 kB, where it had the lowest latency of all.
Interestingly, at 4 MB ZGC really struggles with these large objects, something for a later post perhaps.

Allocate 60%:

ZGC does the best in all the tests in this benchmark.
This benchmark, which focuses on generating a lot of garbage before releasing it, shows how much better the newer collectors perform.

Closing

From these benchmarks:

For throughput, G1 and Shenandoah GC are the best collectors.
For latency-sensitive applications, ZGC is the best collector.

Obviously, each application has its own usage profile and should be benchmarked to identify the most suitable garbage collector.

Appendix

Throughput

Benchmark (microseconds)	Serial	Parallel	G1GC	ZGC	Shenandoah
Allocate 4kB	0.339	0.334	0.236	0.243	0.24
Allocate 40kB	2.506	2.506	1.816	1.905	1.865
Allocate 400kB	26.194	24.941	17.834	18.184	17.963
Allocate 4MB	240.664	241.458	187.925	180.764	182.533

Benchmark (microseconds)	Serial	Parallel	G1GC	ZGC	Shenandoah
50% Occupied / Allocate 4kB	0.341	0.355	0.234	0.244	0.244
50% Occupied / Allocate 40kB	2.606	2.644	2.014	2.109	1.948
50% Occupied / Allocate 400kB	25.269	24.681	18.706	20.408	18.541
50% Occupied / Allocate 4MB	244.635	245.725	311.757	193.762	185.755

Benchmark (microseconds)	Serial	Parallel	G1GC	ZGC	Shenandoah
Allocate 60% in 4kB chunks	744.214	818.615	501.143	161.28	155.462
Allocate 60% in 40kB chunks	534.277	706.001	436.163	117.387	112.27
Allocate 60% in 400kB chunks	514.895	536.347	438.745	111.01	108.313
Allocate 60% in 4MB chunks	500.248	549.07	132.809	111.872	114.179

Latency

Benchmark (microseconds)	Serial	Parallel	G1GC	ZGC	Shenandoah
Allocate 4kB	10.496	10.288	2.3	3.4	4.065
Allocate 40kB	22.199	20.672	12.992	14.896	14.496
Allocate 400kB	213.012	154.108	51.072	125.696	59.004
Allocate 4MB	574.464	591.872	1564.672	583.57	542.593

Benchmark (microseconds)	Serial	Parallel	G1GC	ZGC	Shenandoah
50% Occupied / Allocate 4kB	10.4	10.288	6.6	2.3	4.896
50% Occupied / Allocate 40kB	21.792	21.472	16.672	14.992	13.6
50% Occupied / Allocate 400kB	151.718	89.856	148.992	137.216	123.15
50% Occupied / Allocate 4MB	1046.528	2179.949	1499.136	4426.924	689.152

Benchmark (microseconds)	Serial	Parallel	G1GC	ZGC	Shenandoah
Allocate 60% in 4kB chunks	985.661	1009.779	563.085	189.006	409.993
Allocate 60% in 40kB chunks	726.663	819.986	488.636	145.752	150.209
Allocate 60% in 400kB chunks	716.177	896.532	467.141	127.14	173.277
Allocate 60% in 4MB chunks	711.983	891.29	152.83	126.878	314.049