I have worked on optimizing the standard library deflate function, and I am happy to announce revised gzip/zip packages that are about 30-50% faster on x64, with slightly improved compression.
All packages are drop-in replacements for the standard libraries, so you can use them by simply changing imports.
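As a rough illustration of what "drop-in" means, here is a minimal round-trip sketch written against the standard library `compress/gzip`. The helper names `compress` and `decompress` are made up for this example. Assuming the revised package exposes the same API, the import line would be the only thing to change; the replacement import path is not spelled out here, so it is left as a comment.

```go
package main

import (
	"bytes"
	"compress/gzip" // assumption: swap this line for the optimized package's import path
	"fmt"
	"io"
)

// compress gzips data at the given level and returns the compressed bytes.
func compress(data []byte, level int) ([]byte, error) {
	var buf bytes.Buffer
	zw, err := gzip.NewWriterLevel(&buf, level)
	if err != nil {
		return nil, err
	}
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	// Close flushes pending data and writes the gzip footer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompress reverses compress.
func decompress(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	in := bytes.Repeat([]byte(`{"key":"value"},`), 1000)
	out, err := compress(in, gzip.DefaultCompression)
	if err != nil {
		panic(err)
	}
	back, err := decompress(out)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes -> %d bytes, round trip ok: %v\n",
		len(in), len(out), bytes.Equal(in, back))
}
```

Nothing below the import block refers to which implementation is in use, which is the point of a drop-in replacement.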
The largest speedups mimic the work done for the Cloudflare zlib optimizations, but with a few Go-specific optimizations.
The biggest gains are on machines with SSE 4.2 instructions, which are available on Intel Nehalem (2009) and AMD Bulldozer (2012) and later. The optimizations are:
- Minimum matches are 4 bytes. This leads to fewer searches and better compression.
- Stronger hash (iSCSI CRC32) for matches on x64 with SSE 4.2 support. This leads to fewer hash collisions.
- Literal byte matching using SSE 4.2 for faster long-match comparisons.
- Bulk hashing on matches.
- Much faster dictionary indexing with NewWriterDict()/Reset().
- CRC32 optimized for 10x speedup on SSE 4.2. Available separately.
- Make the bit coder faster by assuming we are on a 64-bit CPU.
- Remove some branches by splitting the main deflate loop.
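To make the first two items concrete, here is a small sketch of 4-byte minimum matches combined with a CRC32-C hash. It is not the package's actual code: it uses the portable `hash/crc32` Castagnoli table (the same polynomial as iSCSI and the SSE 4.2 CRC32 instruction) rather than assembly, and `hashBits`, `hash4`, and `matchLen` are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

const (
	minMatch = 4  // shortest match worth considering, as in the list above
	hashBits = 15 // hypothetical hash table size of 2^15 entries
)

// CRC32-C (Castagnoli) is the polynomial used by iSCSI.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// hash4 maps the first 4 bytes of b to a hash table index.
func hash4(b []byte) uint32 {
	return crc32.Checksum(b[:minMatch], castagnoli) & (1<<hashBits - 1)
}

// matchLen returns how many leading bytes a and b have in common.
// This is the plain byte-by-byte version of the comparison.
func matchLen(a, b []byte) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	data := []byte("abcdefgh--abcdefgh")
	table := make([]int, 1<<hashBits) // stores position+1; 0 means empty

	for i := 0; i+minMatch <= len(data); i++ {
		h := hash4(data[i:])
		if prev := table[h] - 1; prev >= 0 {
			// A hash hit is only a candidate; verify it and measure its length.
			if n := matchLen(data[prev:], data[i:]); n >= minMatch {
				fmt.Printf("match at %d: offset %d, length %d\n", i, i-prev, n)
				break
			}
		}
		table[h] = i + 1
	}
}
```

Hashing 4 bytes instead of 3 means each candidate is more specific, so fewer positions have to be verified with `matchLen`.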
The real speedup depends a lot on your data. Some data types will see larger speedup than others. To give a real-world impression, here is the speed of compressing a 2.3MB JSON file:
benchmark           old ns/op     new ns/op     delta
BenchmarkGzipL1     95035436      71914113      -24.33%
BenchmarkGzipL2     100665758     74774276      -25.72%
BenchmarkGzipL3     111666387     80764620      -27.67%
BenchmarkGzipL4     141848114     101145785     -28.69%
BenchmarkGzipL5     185630618     127187274     -31.48%
BenchmarkGzipL6     207511870     137047840     -33.96%
BenchmarkGzipL7     265115163     183970522     -30.61%
BenchmarkGzipL8     454926020     348619940     -23.37%
BenchmarkGzipL9     488327935     377671600     -22.66%

benchmark           old MB/s     new MB/s     speedup
BenchmarkGzipL1     52.21        69.00        1.32x
BenchmarkGzipL2     49.29        66.36        1.35x
BenchmarkGzipL3     44.43        61.43        1.38x
BenchmarkGzipL4     34.98        49.06        1.40x
BenchmarkGzipL5     26.73        39.01        1.46x
BenchmarkGzipL6     23.91        36.20        1.51x
BenchmarkGzipL7     18.72        26.97        1.44x
BenchmarkGzipL8     10.91        14.23        1.30x
BenchmarkGzipL9     10.16        13.14        1.29x
At the default compression level (6), that is about 1.5 times the throughput of the standard library. The other levels see between roughly 1.3 and 1.45 times the throughput.
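For anyone reading the two tables side by side, the "delta" and "speedup" columns are two views of the same measurement. The short sketch below recomputes both from the ns/op figures for BenchmarkGzipL6; the function names `delta` and `speedup` are just labels for this example.

```go
package main

import "fmt"

// delta returns the relative change in ns/op as a percentage.
// A negative value means the new code takes less time per operation.
func delta(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

// speedup returns how many times more throughput the new code delivers.
// Throughput is inversely proportional to time per operation.
func speedup(oldNs, newNs float64) float64 {
	return oldNs / newNs
}

func main() {
	// ns/op for BenchmarkGzipL6, taken from the benchmark table.
	oldNs, newNs := 207511870.0, 137047840.0
	fmt.Printf("delta: %.2f%%, speedup: %.2fx\n",
		delta(oldNs, newNs), speedup(oldNs, newNs))
	// prints "delta: -33.96%, speedup: 1.51x"
}
```

Note that a 34% reduction in time corresponds to a 51% increase in throughput, which is why the two columns show different-looking numbers.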
Furthermore, “pgzip” (multi-CPU gzip for longer streams) has also been updated to the new deflate/crc32, so if you update the repo you will also get a “free” speed boost there. See pgzip home.
I will probably tidy up the crc32 optimization and submit it to Go to be part of the standard library.