This is the final chapter in our trilogy exploring deflate performance. In this post I will give a detailed description of how you can get close to linear time compression for gzip/zip and standard deflate, running at approximately 150MB/s per core.
- Prelude: Optimized gzip/zip packages
- Part 1: High throughput performance
- Part 2: Web server performance
Edit: Some nice people at Reddit kindly pointed out that I used the term “constant time” in the wrong context. I have adjusted the article to use the correct term, “linear time”.
Update (18 Sept 2015): Level 1 compression has been updated to a faster version, so the difference to deflate level 1 has gotten smaller. The benchmark sheet has been updated, so some of these numbers may be inaccurate.
I have added a special compression mode, named ConstantCompression, to my compression library, which allows near linear time compression. This is done by completely disabling matching of previous data, and instead only reducing the number of bits used to represent each character.
This means that frequently used characters, like ‘e’ and ‘ ’ (space) in text, are represented with the fewest bits, and rare characters like ‘¤’ take more bits to represent. For more information see Wikipedia or this nice video.
Since this type of compression has very little variance across inputs, the compression speed is mostly unaffected by the input data, and is usually more than 150MB/s for a single core.
The downside is that the compression ratio is usually considerably worse than even the fastest conventional compression. The compression ratio can never be better than 8:1 (12.5%), since even the most frequent byte must still be represented by at least one bit.
So the linear time compression can be used as a “better than nothing” mode, where you cannot risk the encoder slowing down on some content. For comparison, the compressed size of the “Twain” text is 233460 bytes (+29% vs. level 1) and encode speed is 144MB/s (4.5x the speed of level 1). So in this case you trade a 29% size increase for a 4.5x speedup.
For a more detailed insight into the performance characteristics, see my performance test spreadsheet. I have added performance measurements for the linear time compression, and put them in separate charts. Note that the speedup factor is compared to standard library level 1, since that is the closest comparison.
An important property is that the output is fully compatible with deflate/gzip/zip decompression, so there is no need for a special decoder.
When is it useful?
You will see the biggest difference on data that is medium-to-hard to compress. Attempting to compress data that is already compressed will give a 15x speedup, with performance above 300MB/s. This makes it pretty safe to enable this compression even if some of your data is already compressed. On data that is compressible, but averages around 50% compression, you should expect about a 5x speedup. The final size will almost always be bigger than level 1, but the tradeoff may be fine in these cases:
- When the alternative is no compression because of CPU usage.
- When the compressibility of your content varies a lot. For instance Docker/OS image downloads, backup data, etc.
- Streaming of data, like map-reduce data, where the slight compression will help you keep the IO down.
- Transfer of data between servers in the same datacenter.
If you have more use cases feel free to add them in the comments below.
How does it work?
When this compression scheme is enabled, we disable the part of deflate that is usually the most time-expensive: matching the current input with previous input. In deflate you can specify that “the next 10 characters should be copied from the previous output 500 bytes back”. This way you can represent many characters as two numbers: the length of the match and the offset.
The existing matching uses hashing to greatly reduce the number of candidate matches to search, but checking the matches and finding their length will always take some time. The linear time mode completely disables the hashing and match search.
What we are left with is the entropy coder of the deflate algorithm. There is a little assembly to create byte distribution histograms faster, but other than that it uses simplified versions of the existing Go library for the best possible speed. This feature is also present in zlib, where it is referred to as “Huffman Only”, and it is in my opinion underused compared to how useful it is.
The input is split into blocks between 32 and 64 kilobytes in size, so Huffman tables are built for local content. The blocks are also independent, which makes the speedup using pgzip close to linear to your core count.
I write “close to linear time” because you can expect small variations in speed. If your content is hard to compress, the output will be bigger, and therefore take longer to write, but the variation should be within 10-20%. On incompressible content it will operate faster, since it is able to skip blocks and store them without compression.
If you would like to look at alternatives to this, have a look at the Entropy Coder Benchmark. The zlib Huffman-only mode is listed as “zlibh”.