The claimed rANS decoding advantages are being exaggerated

To put it simply:

SSE 4.1, on an Intel Ice Lake CPU:

SIMD Range: 64 streams: 738 MiB/sec., 436,349 bytes

SIMD Range: 16 streams: 619.1 MiB/sec.

SIMD rANS: 8 streams: 556.5 MiB/s, 435,626 bytes

SIMD Huffman: 32 streams: 1.1 GiB/sec.

The vectorized 24-bit Pavlov/Subbotin range coder is here.

The rANS decoder I'm comparing against appears to be well implemented. (If I could use rANS, which I can't because it's patented, I would start with this code.) It could be unrolled more to 16 (vs. 8) or more streams, and that's going to boost decoding perf. some. Let's say ~25% faster, so ~695 MiB/sec. Range is still comparable, and Huffman is of course faster. The benchmark data is showing it's pointless to risk infringement.

Also, if you show me probably any vectorized rANS decoder, many of the vectorization techniques also easily apply to SIMD range and Huffman decoders too. They each have to do many of the same things, and this helps equalize the perceived performance advantages of rANS decoding.

Ultimately working around range coding's division requirement is not the biggest problem. It's the gathers and keeping all the lanes normalized and fed efficiently - which are the same problems the other decoders face. Once you solve these problems for one entropy coder you can apply them to the others. Many of the implementation differences of the actual entropy coding technique melt away once vectorized.

Note encoding is a different beast. I've been unable to see a way to easily interleave the output bytes from X parallel range coders. The SSE 4.1 range sample code linked to above has to swizzle the bytes after encoding into a single stream, however there is no special side band or signaling information needed (just like with rANS). (This is just like my LZHAM codec, which has been interleaving arithmetic and Huffman coded streams with no extra signaling or side band data for many years now - including before FSE/ANS was a thing.) With rANS encoding interleaving the outputs seems very easy. With more research into other entropy coders perhaps it's possible to eliminate this advantage too.

A straightforward AVX-2 port of this SSE 4.1 range coder is ~65% faster on Ice Lake, putting it roughly in the same ballpark as the AVX-2 rANS decoders I've seen benchmarked. This makes sense because many of the bottlenecks are similar and have little to do with the actual entropy coding technique.

(Note I could upload my SSE 4.1 Huffman decoder to github, but honestly 1.1 GiB/sec. is not a big deal. Fast scalar Huffman on my machine gets ~800 MiB/sec. SSE 4.1 lacks easy variable per-lane bit shifting which hurts Huffman a lot. Straightforward AVX-2 Huffman decoding with an 8-bit alphabet gets ~1.8 GiB/sec. on this machine.)

The claimed rANS decoding advantages are being exaggerated

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112