Optimized gzip/zip packages, 30-50% faster

I have worked on optimizing the standard library deflate function, and I am happy to announce revised gzip/zip packages that are about 30-50% faster on x64, with slightly improved compression.

Project: https://github.com/klauspost/compress

All packages are drop-in replacements for the standard libraries, so you can use them by simply changing imports.

Compressed car by César; Musée d’Art Contemporain, Marseille. Photo (CC) by marcovdz

The biggest speedups mimic the work done for the Cloudflare zlib optimizations, with a few Go-specific optimizations added.

The biggest gains are on machines with SSE 4.2 instructions, which are available from Intel Nehalem (2009) and AMD Bulldozer (2012) onward. The optimizations are:

  • Minimum matches are 4 bytes. This leads to fewer searches and better compression.
  • Stronger hash (iSCSI CRC32) for matches on x64 with SSE 4.2 support. This leads to fewer hash collisions.
  • Literal byte matching using SSE 4.2 for faster long-match comparisons.
  • Bulk hashing on matches.
  • Much faster dictionary indexing with NewWriterDict()/Reset().
  • CRC32 optimized for 10x speedup on SSE 4.2. Available separately.
  • Make the bit coder faster by assuming we are on a 64-bit CPU.
  • Remove some branches by splitting the main deflate loop.
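To make the 4-byte minimum match and the CRC32-based hash more concrete, here is a rough sketch using only the standard library. It is not the package's actual implementation: `hashBits`, `hash4` and `matchLen` are hypothetical names, the table size is arbitrary, and the byte-by-byte comparison stands in for the SSE 4.2 version.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// hashBits is an illustrative hash table size (2^15 entries),
// not the value used by the real package.
const hashBits = 15

// castagnoli is the table for the CRC32-C (iSCSI) polynomial.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// hash4 hashes the first 4 bytes of b into a hashBits-wide table index.
// b must hold at least 4 bytes.
func hash4(b []byte) uint32 {
	return crc32.Checksum(b[:4], castagnoli) >> (32 - hashBits)
}

// matchLen returns how many leading bytes a and b have in common.
// This plain loop is the part an SSE 4.2 comparison would speed up.
func matchLen(a, b []byte) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	data := []byte("abcdefgh-abcdefgh")
	// Positions 0 and 9 start with the same 4 bytes, so they land in
	// the same hash bucket and become match candidates.
	fmt.Println(hash4(data[0:]) == hash4(data[9:]))
	// The candidate is then extended to find the full match length.
	fmt.Println(matchLen(data[0:], data[9:]))
}
```

Hashing 4 bytes instead of 3 means a hash hit already implies a longer minimum match, so fewer candidates need to be checked.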

 

The real speedup depends a lot on your data. Some data types will see a larger speedup than others. To give a real-world impression, here is the speed of compressing a 2.3MB JSON file:

benchmark           old ns/op     new ns/op     delta
BenchmarkGzipL1     95035436      71914113      -24.33%
BenchmarkGzipL2     100665758     74774276      -25.72%
BenchmarkGzipL3     111666387     80764620      -27.67%
BenchmarkGzipL4     141848114     101145785     -28.69%
BenchmarkGzipL5     185630618     127187274     -31.48%
BenchmarkGzipL6     207511870     137047840     -33.96%
BenchmarkGzipL7     265115163     183970522     -30.61%
BenchmarkGzipL8     454926020     348619940     -23.37%
BenchmarkGzipL9     488327935     377671600     -22.66%

benchmark           old MB/s     new MB/s     speedup
BenchmarkGzipL1     52.21        69.00        1.32x
BenchmarkGzipL2     49.29        66.36        1.35x
BenchmarkGzipL3     44.43        61.43        1.38x
BenchmarkGzipL4     34.98        49.06        1.40x
BenchmarkGzipL5     26.73        39.01        1.46x
BenchmarkGzipL6     23.91        36.20        1.51x
BenchmarkGzipL7     18.72        26.97        1.44x
BenchmarkGzipL8     10.91        14.23        1.30x
BenchmarkGzipL9     10.16        13.14        1.29x
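The two tables express the same measurements in different ways. As a small sketch (the function names are made up for this example), the delta and speedup columns can be derived from the ns/op values like this:

```go
package main

import "fmt"

// speedup returns how many times faster the new run is (old time / new time).
func speedup(oldNs, newNs float64) float64 {
	return oldNs / newNs
}

// deltaPercent returns the relative change in time per operation, in percent.
// Negative values mean the new code needs less time.
func deltaPercent(oldNs, newNs float64) float64 {
	return (newNs - oldNs) / oldNs * 100
}

func main() {
	// ns/op values for BenchmarkGzipL1, taken from the first table.
	oldNs, newNs := 95035436.0, 71914113.0
	fmt.Printf("delta %.2f%%, speedup %.2fx\n",
		deltaPercent(oldNs, newNs), speedup(oldNs, newNs))
}
```

Note that a 24% reduction in time per operation corresponds to a 32% increase in throughput, which is why the two columns look different for the same row.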

At the default compression level (6), throughput is about 1.5 times that of the standard library, while the highest compression levels (8 and 9) gain about 1.3 times.

Furthermore, “pgzip” (multi-CPU gzip for longer streams) has also been updated to the new deflate/crc32, so if you update the repo you will also get a “free” speed boost there. See the pgzip home page.

I will probably tidy up the crc32 optimization and submit it to Go to be part of the standard library.