The point here is to show that both Huffman and range decoding are vectorizable and competitive with rANS in software.
What I care about is fast entropy decoding. Even scalar encoding of any of these techniques is already fast enough for most use cases. For the distribution use case, you encode once on some server(s) and then distribute to millions of users.
I've constructed fast SSE 4.1 and AVX-2 vectorized decoders for range coding and Huffman coding. The SSE 4.1 vectorized 24-bit range decoder (using just 16 streams for simplicity) is here on github. This code can be ported to AVX-2, or adapted to Huffman decoding, in a day or two. The Steam Hardware Survey shows that 99.58% of surveyed users (as of 3/23) support SSE 4.1, so it seems pointless to focus on or benchmark scalar decoders anymore.
There's nothing particularly special or exotic about these implementations. The range coder is LZMA's binary range coder (which is also used by LZHAM), adapted to handle 8-bit symbols with 12-bit probabilities and 24-bit registers. It appears quite similar to a range coder released by Dmitry Subbotin, except it handles carries differently. The Huffman decoder uses length-limited codes constructed with the Package-Merge algorithm.
The tricky renormalization step (where each vector lane is fed more bytes from the source streams) is basically the same for rANS, Huffman, and range decoding. Range decoding needs an integer division, which is scary at first, but the CPU engineers have solved that problem (if all you need is 24 bits). Also, each decoder uses some sort of table to accelerate decoding.
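To make that concrete, here's a minimal scalar sketch of one decode step for a 24-bit range decoder with 12-bit probabilities, in the spirit of the decoder described above (the names range_decoder, cum and slot_to_sym, and the exact init/renorm details, are illustrative assumptions, not the actual github code):

    #include <cstdint>

    // Illustrative constants: 24-bit registers, 12-bit probabilities.
    constexpr uint32_t kProbBits  = 12;
    constexpr uint32_t kProbScale = 1u << kProbBits;  // 4096
    constexpr uint32_t kRenormMin = 1u << 16;         // renorm keeps range within 24 bits

    struct range_decoder {
        uint32_t range, code;   // both kept within 24 bits
        const uint8_t* buf;     // source stream for this lane

        void init(const uint8_t* p) {
            buf = p; range = 0x00FFFFFF; code = 0;
            for (int i = 0; i < 3; i++) code = (code << 8) | *buf++;
        }

        // Decode one 8-bit symbol. cum[] holds 257 cumulative 12-bit frequencies
        // (cum[s]..cum[s+1] is symbol s's slice); slot_to_sym[] has 4096 entries
        // mapping a probability slot to a symbol -- the table that accelerates decoding.
        int decode(const uint32_t* cum, const uint8_t* slot_to_sym) {
            range >>= kProbBits;             // scale range down to probability units
            uint32_t slot = code / range;    // the integer division
            if (slot >= kProbScale) slot = kProbScale - 1;
            int sym = slot_to_sym[slot];

            code  -= cum[sym] * range;
            range *= cum[sym + 1] - cum[sym];

            while (range < kRenormMin) {     // renormalization: pull in more source bytes
                code  = (code << 8) | *buf++;
                range <<= 8;
            }
            return sym;
        }
    };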
The vectorized versions are surprisingly similar once implemented. They do some math, gather from a table, write the output symbols, do some more math, then fetch bytes from the source buffer, renormalize, and distribute those source bytes to each lane in parallel. Once you've implemented one well, you've more or less got the others.
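To give a flavor of the "gather from a table" step, here's a minimal AVX-2 sketch (illustrative only; the SSE 4.1 decoders have to emulate gathers with scalar loads, since the gather instructions only arrived with AVX-2, and the table layout here is just a placeholder):

    #include <immintrin.h>
    #include <cstdint>

    // Eight decoder lanes each hold a table index in one 32-bit element of 'slots'.
    // One AVX-2 gather fetches all eight 32-bit table entries at once; each entry
    // would pack whatever the decoder needs (symbol, frequency, code length, etc.).
    static inline __m256i gather_decode_entries(const uint32_t* table, __m256i slots)
    {
        // scale = 4: the indices are in units of 32-bit table entries
        return _mm256_i32gather_epi32(reinterpret_cast<const int*>(table), slots, 4);
    }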
Here are some benchmarks on the same machine (a Dell Ice Lake laptop, Core i7 1065G7) on Calgary Corpus book1 (768,771 bytes). All coders use 8-bit symbols.
SSE 4.1 Huffman/Range Decoding
- Huffman: 32 streams, 1,122 MiB/sec., 438,602 bytes
- Range: 64 streams, 24-bit: 738 MiB/sec., 436,349 bytes (32 streams=436,266 bytes)
AVX-2 Huffman/Range Decoding
- Huffman: 32 streams, 1,811 MiB/sec., 438,602 bytes
- Range: 64 streams, 24-bit: 1,223 MiB/sec., 436,349 bytes
Notes: The range coder uses 12-bit probabilities. The Huffman coder was limited to a max code size of 13 bits. The same encoded stream is compatible with the SSE 4.1 and AVX-2 decoders. The compressed bytes statistic doesn't include the probability or Huffman code length tables. I'm not a vector coding expert, so I'm sure in the right hands these decoders could be made even faster.
Collet's Scalar Huffman/FSE Decoding
Using his fast scalar Huffman and FSE coders (FSE -b -h, or FSE -b):
- Huffman: 840 MiB/sec., 439,150 bytes
- FSE: 365 MiB/sec., 437,232 bytes
Also see FPC on github by Konstantinos Agiannis, which achieves ~814 MiB/sec. on a Core i5 for scalar length-limited Huffman decoding.
Giesen's Vectorized rANS Decoding
Using exam_simd_sse41, run on the same machine under WSL and compiled with clang v10:
- rANS: 213.9 MiB/s, 435,604 bytes
- interleaved rANS: 324.7 MiB/s, 435,606 bytes
- SIMD rANS: 556.5 MiB/s, 435,626 bytes
It's easy to construct a Huffman decoding table that supplies 2-3 symbols per decode step. (My Huffman benchmarks don't demonstrate this ability, but they could if I limited the max Huffman code size to 6 or 7 bits, which is practical for some problems.) AFAIK this ability has not been demonstrated with rANS. Until it is, Huffman isn't going anywhere. Considering rANS is patented, it's just not worth the risk and trouble for questionable decoding gains when usable alternatives exist. I'll revisit this in 20+ years (the lifetime of patents in the US).
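For reference, here's a hedged sketch of what a multi-symbol decode table entry could look like with codes limited to 6 bits (the huff_entry2 layout and the LSB-first bit packing are assumptions for illustration, not necessarily what my decoders do):

    #include <cstdint>

    // With codes limited to 6 bits, a 12-bit peek at the stream always covers at
    // least two complete codewords, so each table entry can return up to 2 symbols
    // plus the total number of bits they consume.
    struct huff_entry2 {
        uint8_t syms[2];     // decoded symbols
        uint8_t num_syms;    // how many of syms[] are valid (2 on the fast path)
        uint8_t total_bits;  // bits consumed from the stream
    };

    // Decode up to 2 symbols, assuming LSB-first bit packing and a bit buffer
    // holding at least 12 valid bits. 'table' has 1 << 12 entries.
    static inline int decode2(const huff_entry2* table, uint64_t& bit_buf,
                              uint32_t& bits_left, uint8_t* out)
    {
        const huff_entry2& e = table[bit_buf & 0xFFF];
        out[0] = e.syms[0];
        out[1] = e.syms[1];
        bit_buf   >>= e.total_bits;
        bits_left  -= e.total_bits;
        return e.num_syms;
    }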