Richard Geldreich's Blog

Updated bc7enc_rdo with improved smooth block handling


The command line tool now detects extremely smooth blocks and encodes them with a significantly higher MSE scale factor. It computes a per-block mask image, filters it, then supplies an array of per-block MSE scale factors to the ERT. -zu disables this. 

The end result is significantly fewer artifacts in regions containing very smooth blocks (think gradients). This does hurt rate-distortion performance.
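Roughly, the idea looks like this (a hedged sketch, not the tool's actual code - the function name, smoothness threshold, and scale values are hypothetical):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Compute per-block MSE scale factors from a simple smoothness test. Very
    // smooth 4x4 blocks get a much larger scale, so the ERT is far more
    // conservative about distorting them. A real implementation would also
    // filter/dilate the resulting mask before use.
    std::vector<float> compute_block_mse_scales(
        const uint8_t* pixels, int width, int height, int channels,
        float base_scale, float smooth_scale)
    {
        const int blocks_x = width / 4, blocks_y = height / 4;
        std::vector<float> scales(blocks_x * blocks_y, base_scale);

        for (int by = 0; by < blocks_y; by++)
            for (int bx = 0; bx < blocks_x; bx++)
            {
                // Luma standard deviation of this 4x4 block.
                float sum = 0.0f, sum2 = 0.0f;
                for (int y = 0; y < 4; y++)
                    for (int x = 0; x < 4; x++)
                    {
                        const uint8_t* p = &pixels[((by * 4 + y) * width + (bx * 4 + x)) * channels];
                        const float l = 0.299f * p[0] + 0.587f * p[1] + 0.114f * p[2];
                        sum += l; sum2 += l * l;
                    }
                const float mean = sum / 16.0f;
                const float var = std::max(0.0f, sum2 / 16.0f - mean * mean);

                if (std::sqrt(var) < 2.0f)   // "extremely smooth" threshold (hypothetical)
                    scales[by * blocks_x + bx] = smooth_scale;
            }
        return scales;
    }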




(The second image was resampled to 1/4th res for blogger.)


bc7e.ispc integrated into bc7enc_rdo

bc7e.ispc is a very powerful/fast 8 mode encoder. It supports the entire BC7 format, unlike bc7enc's default encoder. It's 2-3x faster than ispc_texcomp at the same average quality. Now that it's been combined with bc7enc_rdo you can do optional RDO BC7 encoding using this encoder, assuming you can tolerate the slower encode times.

This is now checked into the bc7enc_rdo repo.

Command: bc7enc xmen_1024.png -u6 -U -z1.0 -zc4096 -zm

4.53 bits/texel (Deflate), 37.696 dB PSNR

BC7 mode histogram:
0: 3753
1: 15475
2: 1029
3: 6803
4: 985
5: 2173
6: 35318
7: 0



First-ever RDO ASTC encodings

Here are my first-ever RDO LDR ASTC 4x4 encodings. Perhaps they are the first ever for the ASTC texture format: 

Non-RDO:
5.951 bits/texel, 45.1 dB, 75773 PSNR/bpt 


RDO:
4.286 bpt, 38.9 dB, 90752 PSNR/bpt


Biased difference:

I used astcenc to generate a .astc file, loaded it into memory, then used the code in ert.cpp/.h with a custom callback that decodes ASTC blocks. All the magic is in the ERT. Here's a match injection histogram - this works: 1477,466,284,382,265,398,199,109,110,87,82,105,193,3843

Another encode at lambda=.5:




These RDO ASTC encodes do not have any ultra-smooth block handling, because this is just something I put together in 15 minutes. If you look at the planet you can see artifacts that are worse than they should be.

Next are larger blocks.

First RDO LDR ASTC 6x6 encodings

This is 6x6 block size, using the ERT in bc7enc_rdo:

Left=Non-RDO, 37.3 dB, 2.933 bits/texel (Deflate) 
Right=RDO lambda=.5, 36.557 dB, 2.399 bpt





Using more aggressive ERT settings, but the same lambda: 




Average rate-distortion curves for bc7enc_rdo


bc7enc_rdo is now a library that's utilized by the command line tool, which is far simpler now. This makes it trivial to call multiple times to generate large .CSV files.

If you can only choose one set of settings for bc7enc_rdo, choose "-zn -U -u6". (I've set the default BC7 encoding level to 6, not sure that's checked in yet.) I'll be making bc7e.ispc the new default on my next checkin - it's clearly better.

All other settings were the tool's defaults (linear metrics, window size=128 bytes).



My intuition was to limit the BC7 modes, bias the modes/weights/p-bits/etc. That works and is super fast to encode (if you can't afford any RDO post-processing at all), but the end result is lower quality across most of the usable range of bitrates. Just use bc7e.ispc.

128 vs. 1024 window size:



One block window size, one match per block:



Lena is Retired

As an open source author, I will not assist with or waste time implementing support for any new image/video/GPU texture file format that is not fuzzed, or that uses the "test" image "lena" (or "lenna") for development, testing, statistical analysis, or optimization purposes. All test images must be legal, with clear copyright attribution. This is one of my freedoms as an open source author.

The model herself has asked the public to lose (i.e. delete, remove, or stop using) the image:

Technically, this archaic image is also useless for testing purposes. No single image, nor even 5 or 100 images, is enough for testing new image/texture/video codecs. We use thousands of textures and images for testing and optimization purposes. There is no longer any need to focus intensely on a single image during research or while building new software. Unlike the 70's/80's, we all now have easy access to millions of far better images on the internet, many of them acquired using modern digital photography equipment. On a modern machine using OpenCL we can compress a directory of over six thousand .PNG images/textures in a little over 6 minutes.

This was posted publicly on Twitter by Charles Poynton, PhD (HDTV/UHDTV) after I announced this stance, when a reader pointed out that this nearly 50 year old drum scan of a 70's halftoned porn mag picture was being used in a project's "test corpus":


And this was posted on Twitter in public by Chris Green ("Half-Life 2" rendering lead):


PS - It's indicative of how warped, certifiable, or completely out of touch many men working in software and the video game industry are that I have received threats (including death threats) over this stance.

LZ_XOR

If LZ compression is compilation, what other instructions are useful to add? I've spent the last several years, off and on, trying to figure this out.

LZSS has LIT [byte] and COPY [length, distance]. LZ_XOR is like LZSS but with one new instruction the decompressor can execute. This variant adds XOR [length, distance, 1 or more bytes].

XOR (or ADD) gives the system partial matches, which is particularly useful on GPU texture data, log files, object/executable files, bitmaps, and audio files. In LZ, the COPY instruction must refer to dictionary positions that perfectly match the input file bytes (otherwise, the decompressor will copy the wrong bytes!). LZ_XOR can use any match distance/length pair, even referring to bytes in the dictionary which don't perfectly match (or match at all!) the bytes to compress, although not every distance/length pair will result in an efficient partial match. The compiler's goal is to find those XOR's that result in partial matches which are efficient to code.
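A minimal sketch of the three-instruction virtual machine the decompressor executes (the instruction representation here is illustrative; the real control stream encoding isn't shown in this post):

    #include <cstddef>
    #include <cstdint>

    // The three LZ_XOR instructions. LIT and COPY are the usual LZSS ops; XOR is
    // the new partial-match op, which re-uses dictionary bytes and patches them
    // with entropy coded XOR bytes.
    enum class Op { LIT, COPY, XOR_OP };

    struct Inst {
        Op op;
        uint32_t len = 0, dist = 0;          // COPY/XOR: match length and distance
        const uint8_t* xor_bytes = nullptr;  // XOR: one patch byte per output byte
        uint8_t lit = 0;                     // LIT: the literal byte
    };

    void execute(uint8_t* dst, size_t& ofs, const Inst& inst)
    {
        switch (inst.op)
        {
        case Op::LIT:      // LIT [byte]
            dst[ofs++] = inst.lit;
            break;
        case Op::COPY:     // COPY [length, distance] - must match the input exactly
            for (uint32_t i = 0; i < inst.len; i++, ofs++)
                dst[ofs] = dst[ofs - inst.dist];
            break;
        case Op::XOR_OP:   // XOR [length, distance, bytes] - any distance/length works
            for (uint32_t i = 0; i < inst.len; i++, ofs++)
                dst[ofs] = dst[ofs - inst.dist] ^ inst.xor_bytes[i];
            break;
        }
    }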

An LZ_XOR compiler ("compressor" in LZ parlance) can optimize for the minimum total Hamming distance between the lookahead buffer and the previously encoded/emitted bytes in the sliding dictionary. Or, the compressor can first optimize for minimal Hamming distance in one pass, then for minimal bit prices in a second optimal parsing (optimizing compiler) pass.

LZ_XOR is surprisingly strong and flexible:

1. You can constrain the XOR instruction in the compiler (parser) to only use a limited # of unique symbols. 

This property is valuable when implementing shuffle-based Huffman decoding in AVX2/AVX-512, which is limited to only 16 or 32 unique symbols.

2. XOR is a superset of plain LZ: With 8-bit symbols you don't need LIT's or COPY instructions at all ("pure" LZ_XOR, or XOR-only). This simplifies the decompressor (eliminating unpredictable jumps), and simplifies the instruction stream so less signaling needs to be encoded. Everything is an XOR so there's no need to tell the decompressor that the instruction is a LIT or COPY.

3. Most of the usual tricks used to increase LZSS's ratio by other codecs (more contexts, LZMA-like state machines+contexts, REP matches, circular match history buffers, fast entropy coding, larger dictionaries, etc.) are compatible with LZ_XOR too. I've implemented and tested several LZ_XOR variants with GPU parsing that are LZ4-like and Zstd-like, along with ideas from LZMA/LZHAM. 

4. A file compressed with LZ_XOR will have significantly fewer overall instructions to execute vs. LZSS to decompress the file (roughly 30-60% fewer). This results in a decompressor which spends more time copying or XOR'ing bytes, and less time unpacking and interpreting the compressed instruction stream.

5. Another strong variant on some files (audio, bitmaps) is LZ_ADD. Instead of optimizing for minimum total Hamming distance, you optimize for minimum total absolute error.

6. LZ_XOR is strong on incompressible files (with very few matches), such as GPU texture data. These are the files that LZ4 fails to compress well.

7. LZ_XOR is like LZMA's or LZHAM's LAM's ("Literals after Matches") taken to the next level. 

Example disassembly of a simple LZ_XOR system with only LIT and XOR constrained to only use 32 unique XOR bytes (no copies - they are implemented as XOR's with all-0 XOR bytes):


Another example that uses XOR, COPY, and LITS, along with REP0/1 distances (from LZMA), and a significantly more complex control/instruction stream:


Here's "Alice in Wonderland" compressed with plain LZSS on the left, and LZ_ADD on the right. Notice how much faster it makes progress through the file vs. LZSS:


Compressing the 25 char ASCII string "executable files go here". First run uses LZ-style LIT+COPY plus a new instruction, ADD. Second run just uses LIT+COPY:


XOR-only decompression can be done in two simple steps, using a single output buffer: first entropy decode the XOR bytes into a buffer the size of the output block. This can be done at ~1 GB/sec. using a fast order-0 Huffman or FSE decoder, such as this one. The bulk of the XOR byte values will be 0. The frequency histogram will have one enormous spike with a quick falloff.

Next, execute the XOR control stream and XOR the bytes in the sliding dictionary with the "patch" bytes in the lookahead buffer. (The "patch" bytes are in the brackets in the above disassemblies. In XOR-only there's guaranteed to be a single XOR byte for every byte in the block.) This step is roughly 1-1.5 GiB/sec. It's in-place so it's very cache friendly.
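A minimal sketch of the two steps, assuming hypothetical helpers for the entropy decoder and the byte-wise control stream (these are placeholders, not the actual codec's API):

    #include <cstddef>
    #include <cstdint>

    // Hypothetical placeholders, not the actual codec's API: an order-0
    // Huffman/FSE block decoder and a reader for the (length, distance) control stream.
    void entropy_decode_block(const uint8_t* src, size_t src_size, uint8_t* dst, size_t dst_size);
    struct ControlStreamReader {
        ControlStreamReader(const uint8_t* p, size_t n);
        void next(uint32_t& len, uint32_t& dist);
    };

    void decompress_xor_only(const uint8_t* comp, size_t comp_size, uint8_t* out, size_t out_size)
    {
        // Step 1: entropy decode one XOR "patch" byte per output byte, directly
        // into the output buffer. Most of these bytes are 0.
        entropy_decode_block(comp, comp_size, out, out_size);

        // Step 2: execute the XOR control stream in place. Every output byte is
        // covered by exactly one XOR, so this is a single linear pass.
        ControlStreamReader ctrl(comp, comp_size);
        size_t ofs = 0;
        while (ofs < out_size)
        {
            uint32_t len, dist;
            ctrl.next(len, dist);
            for (uint32_t i = 0; i < len; i++, ofs++)
                out[ofs] ^= out[ofs - dist];   // patch byte ^= dictionary byte
        }
    }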

LZ_XOR places very heavy demands on the compressor's parser to find the partial matches with minimal Hamming distance (or emitted bits), making it a good fit for GPU parsing. My experimental compiler currently evaluates every single possible way to code the lookahead bytes against all the bytes in the sliding dictionary. It uses Dijkstra's shortest path algorithm, where each edge is an instruction, the costs are bit prices, and the nodes are lookahead byte positions. There are numerous heuristics and search optimizations that can be done to speed up partial matching. This is the Approximate String Matching Problem.

LZ_XOR gives an RDO GPU texture compressor a lot more freedom to distort a texture block's encoded bytes to increase ratio. With plain LZ/LZSS-style systems, your primary tool to increase the compression ratio (and trade off texture quality) is to increase the number and length of the LZ matches in the encoded texture data. With LZ_XOR you can replace bytes with other bytes which minimize the Hamming distance between the block you're encoding and the already coded blocks in the sliding dictionary. This is a more flexible way of increasing distortion without slamming in large 3-4 byte matches.

While building the above experimental codecs, I also used the GPU to compute Hamming Correlation Matrix visualizations. This is for the first 494 bytes of alice29.txt (Alice in Wonderland). Each row represents how much the 18 byte pattern starting at that file offset correlates with all the previous byte patterns. White=low Hamming distance, black=high distance.
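A sketch of how one of these visualizations can be computed (using scalar popcount here rather than the GPU):

    #include <bit>       // std::popcount (C++20)
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // For each file offset, measure how well the pattern_len-byte pattern starting
    // there matches every earlier pattern. 1.0 = identical (white), 0.0 = maximally
    // different (black).
    std::vector<float> hamming_correlation_matrix(const uint8_t* data, size_t size, size_t pattern_len = 18)
    {
        const size_t n = size - pattern_len;
        std::vector<float> m(n * n, 0.0f);
        const float max_dist = float(pattern_len * 8);

        for (size_t row = 1; row < n; row++)        // pattern starting at 'row'
            for (size_t col = 0; col < row; col++)  // vs. every earlier pattern
            {
                uint32_t dist = 0;
                for (size_t i = 0; i < pattern_len; i++)
                    dist += std::popcount(uint32_t(data[row + i] ^ data[col + i]));
                m[row * n + col] = 1.0f - float(dist) / max_dist;
            }
        return m;
    }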



Other LZ_XOR variants


Other LZ instruction set variants that can utilize XOR:

1. ROLZ_XOR: Reduced-Offset LZ (or here) that uses XOR partial matches. Simplifies the search, but the decompressor has to keep the offsets table up to date which costs extra memory and time. Can issue just XOR's, or LIT/XOR, or LIT/COPY/XOR.

I implemented a ROLZ variant of LZHAM a few years back but the decompressor overhead wasn't acceptable to me (otherwise it worked surprisingly well).

2. LZRW_XOR: A variant of LZRW (or here) that issues XOR or XOR+LIT instructions instead of matches. The compression search problem is simplified, although it's still more complex vs. plain LZRW. XOR is more flexible than COPY (any distance can be utilized), so this should have higher gain vs. plain LZRW with a large enough hash table.



LZ_XOR on enwik8

First results of LZ_XOR on enwik8 (a common 100MB test file of Wikipedia data, higher ratio is smaller file):

LZ4:    58.09% 2.369 GiB/sec decompress
Zstd:   69.03%  .639 GiB/sec
LZ_XOR: 63.08% 1.204 GiB/sec (36,922,969 bytes)
lzham_codec_devel: 74.93% .205 GiB/sec

In this run LZ_XOR used a 128KB dictionary and I limited its parsing depth to speed up the encode.

Options:

LZ4_HC: level LZ4HC_CLEVEL_MAX
Zstd: level 11
lzham_codec_devel: defaults
LZ_XOR: 128KB, GPU assisted, exhaustive partial match searching disabled

Some LZ_XOR statistics:

Total LIT: 263887
Total XOR: 2380411
Total COPY: 9544076

Total used distance history: 1238007

Total LIT bytes: 755210
Total XOR bytes: 15157561
Total COPY bytes: 84087229

Total instructions: 12188374

One nice property of highly constrained length limited prefix codes

I realized earlier that I don't need AVX2 for really fast Huffman (really length limited prefix) decoding. Using a 6-bit max Huffman code size, a 12-bit decode table (16KB - which fits in the cache) is guaranteed to always give you 2 symbols per lookup. With "normal" fast Huffman decoding using a 12-bit table/max code size, you only get 1 or 2. Big difference.
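A sketch of the idea, assuming a slow reference decoder for building the table (the helper below is a placeholder, and this is not my actual decoder):

    #include <cstdint>
    #include <vector>

    // Hypothetical slow reference decoder (placeholder): decodes one symbol from
    // the MSB end of 'bits', returning the symbol and its code length (<= 6 bits).
    void decode_one_slow(uint32_t bits, uint8_t& sym, uint8_t& len);

    struct PairEntry { uint8_t sym0, sym1, total_bits, pad; }; // 4 bytes * 4096 entries = 16KB

    // With a 6-bit maximum code length, any 12-bit window always contains at least
    // two complete codewords, so every table entry can store exactly 2 symbols.
    std::vector<PairEntry> build_pair_table()
    {
        std::vector<PairEntry> table(1u << 12);
        for (uint32_t i = 0; i < (1u << 12); i++)
        {
            const uint32_t top = i << 20;        // place the 12 bits at the MSB end
            uint8_t s0, l0, s1, l1;
            decode_one_slow(top, s0, l0);        // first codeword
            decode_one_slow(top << l0, s1, l1);  // second codeword (always fully present)
            table[i] = { s0, s1, uint8_t(l0 + l1), 0 };
        }
        return table;
    }
    // Decode loop: peek 12 bits, emit entry.sym0 and entry.sym1, consume
    // entry.total_bits, refill - always 2 symbols per lookup, no conditionals.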

This means fewer conditionals during decoding, and higher baseline throughput. The downside is you have to figure out how to exploit a constrained decoder like this, but I've already found one way (LZ_XOR). LZ_XOR is magical and fits a length limited prefix decoder design like this well because during your partial matching searches you can just skip any non-codable partial matches (i.e. ones that use XOR delta bytes you can't code).

LZ_XOR has so much flexibility that it can just toss uncodable partial matches and it still has plenty of usable things it can code. LZ_XOR's delta bytes are limited to 32 unique decodable symbols - which turns out to be plenty.

The compiler's instruction selector can shorten a long XOR match to make it codable using other instructions.

At 6-bits the algorithm you use to length limit the Huffman codes really matters. The one I'm using now (from miniz) is dead simple and not very efficient. I need Package-Merge.

Also, Universal codes can be decoded without using lookups (to decode the index anyway) using CLZ instructions. Unfortunately, doing CLZ with anything less than AVX-512 seems to require exploiting the FP instructions. In my tests with LZ_XOR, various modified types of Universal codes are only 1-4% less efficient than Huffman.
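For example, Elias gamma (one flavor of Universal code) can be decoded from an MSB-first bit buffer with a single leading-zero count (a scalar sketch; a vectorized version would use the FP-exponent trick mentioned above):

    #include <bit>       // std::countl_zero (C++20)
    #include <cstdint>

    // Decode one Elias gamma value from the MSB end of a 64-bit bit buffer.
    // The buffer must hold at least 32 valid bits at its top.
    inline uint32_t decode_gamma(uint64_t& bitbuf, int& bits_left)
    {
        const int zeros = std::countl_zero(bitbuf);   // CLZ gives the prefix length
        const int code_len = 2 * zeros + 1;           // z zeros + a (z+1)-bit binary part
        const uint32_t value = uint32_t(bitbuf >> (64 - code_len)); // leading zeros contribute nothing
        bitbuf <<= code_len;
        bits_left -= code_len;
        return value;                                 // decoded integer >= 1
    }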

LZ_XOR on canterbury corpus


LZ_XOR 128KB dictionary, AVX2, BMI1, mid-level CPU parsing, Ice Lake CPU (Core i7 1065G7 @ 1.3GHz, Dell Laptop).


Only the XOR bytes are entropy coded; everything else (the control stream, the usually rare literal runs) is sent byte-wise. It uses 6-bit length-limited prefix codes in 16 streams, AVX2 gathers, and shuffle-based LUT's. I also have a two-gather version (one gather to get the bits, another to do the Huffman lookups) that decodes 2 symbols per gather, which is slightly faster (2.2 GiB/sec. vs. 1.9 GiB/sec.) but only on large buffers. I posted a pic of the cppspmd_fast inner loop on my Twitter.

BMI1 made very little if any difference that I could detect.

The compressor isn't optimized yet. It's like 100KB/sec. and all on one thread. That's next. LZ_XOR trades off strong parsing in the encoder for fewer instructions to execute in the decompressor, longer XOR matches, and usually rare (.5-1%) literals.

Fast AVX2 PNG writer

Lagrangian RDO PNG


Turns out PNG is very amenable to RDO optimization approaches, but few have really tried.

This is something I've been wanting to try for a while. This experiment only injects 3 pixel matches into the PNG Paeth (#4) predictor bytes. It uses an accurate Deflate bitprice model which is computed by first compressing the image to 24bpp PNG using predictor #4, then the 3 pixel matches are inserted.
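The core decision is the usual Lagrangian test, sketched below (the structure and names are hypothetical, not rdopng's internals):

    #include <cstdint>

    struct Candidate {
        float match_bits;    // estimated Deflate cost of the injected 3-pixel match
        float literal_bits;  // estimated Deflate cost of the original Paeth residual bytes
        float distortion;    // perceptual error introduced by the replacement
    };

    // Accept the replacement only if it lowers the Lagrangian cost J = D + lambda * R.
    inline bool accept_replacement(const Candidate& c, float lambda)
    {
        const float j_original = 0.0f         + lambda * c.literal_bits; // no added distortion
        const float j_replaced = c.distortion + lambda * c.match_bits;
        return j_replaced < j_original;
    }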

Original PNG (oxipng): 16.06bpp


Lagrangian RDO preprocess on Paeth bytes+oxipng: 6.17bpp 28.141 dB


Note that blogger seems to losslessly recompress these PNG's (why?), but if you run oxipng on them you get the expected size:


EA/Microsoft Neural Network GPU Texture Compression Patents

Both Microsoft and EA have patented various ways of bolting on neural networks to GPU (ASTC/BC7/BC6h) texture compressors, in order to accelerate determining the compression params (mode, partition etc.):

EA: https://patents.google.com/patent/US10930020B2/en

Microsoft: https://patents.google.com/patent/US10504248B2/en

What this boils down to: techniques like TexNN are potentially patented. This work was done in 2017/2018 but it wasn't public until 2019:
https://cs.unc.edu/~psrihariv/publication/texnn/

Over a decade ago, I and others researched bolting real-time GPU texture compressors onto the output of existing lossy compressors (like JPEG etc.). The results suffer because you're dealing with 2 generations of lossy artifacts. Also, existing image compressors aren't designed with normal maps etc. in mind.

If you're working on a platform with limited computing resources (the web, mobile), or limited to no SIMD (web), real-time recompression has additional energy and resource constraints.

Both my "crunch" library and Basis Universal bypass the need for fast real-time texture compression entirely. They compress directly in the GPU texture domain. This approach is used by Unity and many AAA video games:
https://www.mobygames.com/person/190072/richard-geldreich/

I've been researching the next series of codecs for Basis Universal. This is why I wrote RDO PNG, QOI and LZ4, and bc7enc_rdo.

I am strongly anti-software patent. All the EA/MS patents will do is push developers towards open solutions, which will likely be better in the long run anyway. Their patents add nothing and were obvious. These corporations incentivize their developers to patent everything they can get their hands on (via bonuses etc.), which ultimately results in exploiting the system by patenting trivial or obvious ideas. Ultimately this slows innovation and encourages patent wars, which is bad for everybody.

It's possible to speed up a BC7 encoder without using neural networks. An encoder can first try the first 16 partition patterns, and find which is best. The results can be used to predict which more complex patterns are likely to improve the results. See the table and code here - this works:
https://github.com/richgel999/bc7enc_rdo/blame/master/bc7enc.cpp#L1714

It's potentially possible to use a 3 or 4 level hierarchy to determine the BC7 partition pattern. bc7enc.cpp only uses 2 levels. This would reduce the # of partition patterns to examine to only a handful.

To cut down the # of BC7 modes to check, you can first rule out mode 7 because it's only useful for blocks containing alpha. Then try mode 6. If it's good enough, stop. Then try mode 1. If that's good enough, stop, etc. Only a subset of blocks need to use the complex multiple subset modes. In many cases the 3 subset modes can be ignored entirely with little noticeable impact on the results. The component "rotation" feature is usually low value.
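A hedged sketch of that mode-culling order (not bc7enc's actual code; the per-mode helpers, the ordering past mode 1, and the "good enough" threshold are hypothetical):

    #include <cstdint>

    struct Results { uint64_t best_err; /* best packed block, mode, etc. */ };

    // Hypothetical per-mode trial helpers (placeholders, not bc7enc functions).
    void try_mode0(const uint8_t* block, Results& r); void try_mode1(const uint8_t* block, Results& r);
    void try_mode2(const uint8_t* block, Results& r); void try_mode3(const uint8_t* block, Results& r);
    void try_mode6(const uint8_t* block, Results& r); void try_mode7(const uint8_t* block, Results& r);

    void encode_block_fast(const uint8_t* block_rgba, bool has_alpha, uint64_t good_enough_err, Results& r)
    {
        if (has_alpha)
            try_mode7(block_rgba, r);    // mode 7 only pays off on blocks with alpha

        try_mode6(block_rgba, r);        // strong single-subset mode - try it first
        if (r.best_err <= good_enough_err) return;

        try_mode1(block_rgba, r);        // 2-subset modes next
        if (r.best_err <= good_enough_err) return;
        try_mode3(block_rgba, r);
        if (r.best_err <= good_enough_err) return;

        // Only a minority of blocks reach the expensive 3-subset modes, and in
        // many cases they can be skipped entirely with little noticeable impact.
        try_mode0(block_rgba, r);
        try_mode2(block_rgba, r);
    }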

These optimizations cause divergence in SIMD encoders, unfortunately. The Neural Network encoders also suffer from the same problem. Neural Network encoders also must be trained, and if the texture to compress doesn't resemble the training data they could have severe and unpredictable quality cliffs. 

TexNN was first published here: 

The Dark Horse of the Image Codec World: Near-Lossless Image Formats Using Ultra-Fast LZ Codecs


I think simple ultra-high speed lossy (or near-lossless) image codecs, built from the new generation of fast LZ codecs, are going to become more relevant in the future.

Computing bottlenecks change over time. As disk space, disk bandwidth, and internet bandwidth increase, older image codecs that squeeze every last bit out of the resulting file become less valuable for many use cases. Eventually, websites or products using noticeably lossy compressed images will become less attractive. The bandwidth savings from overly lossy image codecs will become meaningless, and the CPU/user time and battery or grid energy spent on complex decompression steps will be wasted.

Eventually, much simpler codecs with lower (weaker) compression ratios that introduce less distortion, but have blazingly fast decompression rates are going to become more common. This core concept motivates this work.

One way to construct a simple lossless or lossy image codec with fast decompression is to combine a custom encoding tool with the popular lossless LZ4 codec. LZ4's compression and decompression is extremely fast, and the library is reliable, updated often, extensively fuzz tested, and very simple to use. 

To make it lossy, the encoder needs to precondition the image data so that when it's subsequently compressed by LZ4, the proportion of 4+ byte matches vs. literals is increased compared to the original image data. I've been constructing LZ preconditioners, and building new LZ codecs that lend themselves to this preconditioning step, over the past year.

Such a codec will not be able to compete against JPEG, WebP, JPEG 2000, etc. for perceived quality per bit. However, it'll be extremely fast to decode, very simple, and will likely not bloat executables because the LZ4 library is already present in many codebases. Using LZ4 introduces no new security risks.

This LZ preconditioning step must be done in a way that minimizes obvious visible artifacts, as perceived by the Human Visual System (HVS). This tradeoff, of increasing distortion to reduce the bitrate, is a classic application of rate-distortion theory. This is well-known in video coding, and now in GPU texture encoding (which I introduced in 2009 with my "crunch" compression library).

The rdopng tool on github supports creating lossy LZ4 compressed RGB/RGBA images using a simple rate-distortion model. (Lossless is next, but I wanted to solve the harder problem first.) During the preconditioning step, the LZ4 rate in bits is approximated using a sliding dictionary and a match finder. For each potential match replacement which would introduce distortion into the lookahead buffer, the preconditioner approximates the introduced visual error by computing color error distances in a scaled Oklab perceptual colorspace. (Oklab is one of the most powerful colorspaces I've used for this sort of work. There are better colorspaces for compression, but Oklab is simple to use and well-documented.)
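A sketch of that kind of perceptual error term: convert sRGB to Oklab and take a (scaled) squared distance. The conversion coefficients below are from Björn Ottosson's published Oklab reference code; the channel weights are hypothetical, not rdopng's actual values.

    #include <cmath>

    struct Oklab { float L, a, b; };

    // Inputs are linear-light sRGB in [0,1].
    static Oklab srgb_to_oklab(float r, float g, float b)
    {
        const float l = 0.4122214708f * r + 0.5363325363f * g + 0.0514459929f * b;
        const float m = 0.2119034982f * r + 0.6806995451f * g + 0.1073969566f * b;
        const float s = 0.0883024619f * r + 0.2817188376f * g + 0.6299787005f * b;
        const float l_ = cbrtf(l), m_ = cbrtf(m), s_ = cbrtf(s);
        return { 0.2104542553f * l_ + 0.7936177850f * m_ - 0.0040720468f * s_,
                 1.9779984951f * l_ - 2.4285922050f * m_ + 0.4505937099f * s_,
                 0.0259040371f * l_ + 0.7827717662f * m_ - 0.8086757660f * s_ };
    }

    // Scaled squared Oklab distance between two colors (weights are hypothetical).
    static float oklab_error(const Oklab& x, const Oklab& y)
    {
        const float wL = 1.0f, wa = 1.0f, wb = 1.0f;
        const float dL = x.L - y.L, da = x.a - y.a, db = x.b - y.b;
        return wL * dL * dL + wa * da * da + wb * db * db;
    }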

Perceptually, distortions introduced into regions of images surrounded by high frequency details are less noticeable vs. regions containing smooth or gradient features. Before preconditioning, the encoder computes two error scaling masks which indicate which areas of the image contain large or small gradients/smooth regions. These scaling masks suppress introducing distortions (by using longer or more matches) if doing so would be too noticeable to the HVS. This step has a large impact on bitrate and can be improved.

To speed up encoding, the preconditioner only examines a window region above and to the left of the lookahead buffer. LZ4's unfortunate minimum match size of 4 bytes complicates encoding of 24bpp RGB images. Encoding is not very fast due to this search process, but it's possible to thread it by working on different regions of the image in parallel. The encoder is a proof of principle and testing ground; it's not as fast as it could be, but it works.

The encoder also supports angular error metrics for encoding tangent space normal maps.

LZ4I images are trivial to decode in any language. An LZ4I image consists of a simple header followed by the LZ4 compressed 24bpp RGB (R first) or RGBA pixel data:
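For illustration, a hypothetical decode sketch using LZ4_decompress_safe(). The header layout below is made up to show the shape of the format; the actual LZ4I header fields are defined in the rdopng repo and may differ.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>
    #include "lz4.h"   // LZ4_decompress_safe()

    // Illustrative header only - the real LZ4I field names/order may differ.
    struct LZ4IHeaderExample {
        uint32_t magic;
        uint32_t width, height;
        uint32_t num_channels;     // 3 = RGB (R first), 4 = RGBA
        uint32_t compressed_size;  // size of the LZ4 payload that follows
    };

    bool decode_lz4i_example(const uint8_t* file_data, size_t file_size, std::vector<uint8_t>& pixels)
    {
        if (file_size < sizeof(LZ4IHeaderExample)) return false;
        LZ4IHeaderExample hdr;
        std::memcpy(&hdr, file_data, sizeof(hdr));

        const size_t raw_size = size_t(hdr.width) * hdr.height * hdr.num_channels;
        pixels.resize(raw_size);

        const int res = LZ4_decompress_safe(
            reinterpret_cast<const char*>(file_data + sizeof(hdr)),
            reinterpret_cast<char*>(pixels.data()),
            int(hdr.compressed_size), int(raw_size));
        return res == int(raw_size);
    }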


Some example encodings:

Lossless original PNG image, 19.577 bits/pixel PNG, or 20.519 bits/pixel LZ4I:



Lossy LZ4I: 42.630 709 Y dB,  12.985 bits/pixel, 1.1 gigapixels/sec. decompression rate using LZ4_decompress_safe() (on a mobile Core i7 1065G7 at 1.3GHz base clock):

Biased delta image:

Greyscale histogram of biased delta image:



A more lossy LZ4I encoding, 38.019 709 Y dB, 8.319 bits/pixel:


Biased delta image:

Greyscale delta image histogram:


Lossless original: 14.779 bits/pixel PNG, 16.551 bits/pixel LZ4I:


Lossy LZ4I: 45.758 709 Y dB, 7.433 bits/pixel (less than half the size vs. lossless LZ4I!):


Biased delta image:


rdopng also supports lossy QOI and PNG encoding. QOI is particularly attractive for lossy compression because the encoding search space is tiny; however, decompression is slower. Lossy QOI encoding is extremely fast vs. PNG/LZ4.

It's also possible to construct specialized preconditioners for other LZ codecs, such as LZMA, Brotli, Zstandard, or LZSSE. Note the LZ4 preconditioner demonstrated here is universal (i.e. it's compatible with any LZ codec) because it just introduces more or longer matches, but it doesn't exploit the individual LZ commands supported by each codec.

LZSSE is particularly attractive as a preconditioning target because it's 30-40% faster than LZ4 and has a higher ratio. This is next. A format that uses tiled decompression and multiple threads is also a no-brainer. Ultimately I think LZ or QOI-like variants will be very strong contenders in the future.



Faster LZ is not the answer to 150-250+ GB video game downloads

When the JPEG folks were working on image compression, they didn't create a better or faster LZ. Instead they developed new approaches. I see games growing >150GB and then graphs like this, and it's obvious the game devs are going in the wrong direction:

https://aras-p.info/blog/2023/01/31/Float-Compression-2-Oodleflate/

(Note these benchmarks are great and extremely useful.)

Separating out the texture encoding stage from the lossless stage is a compromise. I first did this in my "crunch" library around 15 years ago. It was called "RDO mode". You can swizzle the ASTC/BC1-7 bits before LZ, and precondition them, and that'll help, but the two steps are still disconnected. Instead combine the texture and compression steps (like crunch's .CRN mode - shipped by Unity for BC1-5.) 

Alternatively: defer computing the GPU texture data until right before it's actually needed and cache it. Ship the texture signal data using existing image compression technology, which at this point is quite advanced. For normal maps, customize or tune existing tech to handle them without introducing excessive angular distortion. I think both ideas are workable.

Also, these LZ codecs are too fast. They are designed for fast loading and streaming off SSD's. Who cares about saving off a few hundred ms (or a second) when it takes hours or days to download the product onto the SSD?

Somebody could develop a 10x faster Oodle (or an Oodle that compresses 1-2% better) and we're still going to wait many hours or days to actually use the product. And then there's the constant updates. This approach doesn't scale.

It's fun and sexy to work on faster LZ, but the real problem (and value add) doesn't call for better or more lossless tech. This is a distraction. If trends continue, downloads and updates will be measured in fractions of a week, or more than a week.


Vectorized interleaved Range Coding using SSE 4.1

To avoid the current (and upcoming) ANS/rANS entropy coding patent minefield, we're using vectorized Range Coding instead. Here's a 24-bit SSE 4.1 example using 16 interleaved streams. This example decoder gets 550-700 megabytes/sec. with 8-bit alphabets on the various Intel/AMD CPU's I've tried:


More on the rANS patent situation (from early 2022):

This decoder design is practical on any CPU or GPU that supports fast hardware integer or float division. It explicitly uses 24-bit registers to sidestep issues with float divides. I've put much less work into optimizing the encoder; the key step (the post-encode byte swizzle) is the next bottleneck to address.
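To make the idea concrete, here's a minimal scalar sketch of one decode step in that style (not the vectorized code on github; the table plumbing is simplified and the probabilities are assumed to sum to 1 << 12):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    struct RangeDecoder24 {
        uint32_t code = 0, range = 0xFFFFFF;   // 24-bit state registers
        const uint8_t* buf = nullptr; size_t ofs = 0;

        void init(const uint8_t* p) { buf = p; code = 0; for (int i = 0; i < 3; i++) code = (code << 8) | buf[ofs++]; }

        uint32_t decode(const uint16_t* cum_freq, const uint16_t* freq, const uint8_t* freq_to_sym)
        {
            const uint32_t r = range >> 12;                  // range / total (total = 4096)
            // Both operands fit in a float's 24-bit mantissa, so a truncated single
            // precision divide can stand in for the integer divide - this is what
            // makes the design friendly to vectorization.
            uint32_t f = (uint32_t)((float)code / (float)r);
            f = std::min(f, 4095u);

            const uint32_t sym = freq_to_sym[f];             // table lookup: scaled freq -> symbol
            code -= cum_freq[sym] * r;
            range = freq[sym] * r;

            while (range < (1u << 16)) {                     // renormalize back up to 24 bits
                code = ((code << 8) | buf[ofs++]) & 0xFFFFFF;
                range <<= 8;
            }
            return sym;
        }
    };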

Comparing Vectorized Huffman and Range Decoding vs. rANS (Or: rANS entropy coding is overrated)

The point here is to show that Huffman and range decoding are both vectorizable and competitive vs. rANS in software.

What I care about is fast entropy decoding. Even scalar encoding of any of these techniques is already fast enough for most use cases. For the distribution use case, you encode once on some server(s) and then distribute to millions of users.

I've constructed fast SSE 4.1 and AVX-2 vectorized decoders for range coding and Huffman coding. The SSE 4.1 vectorized 24-bit range decoder (using just 16 streams for simplicity) is here on github. This code is easily ported to AVX-2 and Huffman in a day or two. The Steam Hardware Survey shows that 99.58% of surveyed users (as of 3/23) support SSE 4.1, so it seems pointless to focus on or benchmark scalar decoders any more. 

There's nothing particularly special or exotic about these implementations. The range coder is LZMA's binary range coder (which is also used by LZHAM), which I adapted to handle 8-bit symbols with 12-bit probabilities and 24-bit registers. It appears quite similar to a range coder released by Dmitry Subbotin, except it handles carries differently. The Huffman codes are length-limited using the Package-Merge algorithm.

The tricky renormalization step (where each vector lane is fed with more bytes from the source streams) is basically the same between rANS, Huffman and range decoding. Range decoding needs integer division which is scary at first, but the CPU engineers have solved that problem (if all you need is 24-bits). Also, each decoder uses some sort of table to accelerate decoding. 

The vectorized versions are surprisingly similar once implemented. They do some math, do a gather from a table, write the output symbols, do some more math, then fetch some bytes from the source buffer, renormalize, and distribute these source bytes to each lane in parallel. Once you've implemented one well, you've more or less got the others.

Here are some benchmarks on the same machine (a Dell Ice Lake laptop, Core i7 1065G7) on Calgary Corpus book1 (768,771 bytes). All coders use 8-bit symbols. 

SSE 4.1 Huffman/Range Decoding

  • Huffman: 32 streams, 1,122 MiB/sec., 438,602 bytes
  • Range: 64 streams, 24-bit: 738 MiB/sec., 436,349 bytes (32 streams=436,266 bytes)


AVX-2 Huffman/Range Decoding

  • Huffman: 32 streams, 1,811 MiB/sec., 438,602 bytes
  • Range: 64 streams, 24-bit: 1,223 MiB/sec., 436,349 bytes

Notes: The range coder uses 12 bit probabilities. The Huffman coder was limited to a max code size of 13 bits. The same encoded stream is compatible with SSE 4.1 and AVX-2 decoders. The compressed bytes statistic doesn't include the probability or Huffman code length tables. I'm not a vector coding expert, so I'm sure in the right hands these decoders could be made even faster.


Collet's Scalar Huffman/FSE Decoding

Using his fast scalar Huffman and FSE encoders (FSE -b -h, or FSE -b):
  • Huffman: 840 MiB/sec., 439,150 bytes
  • FSE: 365 MiB/sec., 437,232 bytes
Also see FPC on github by Konstantinos Agiannis, which gets ~814 MiB/sec. on a Core i5 for scalar length-limited Huffman decoding.

Giesen's Vectorized rANS Decoding

Using exam_simd_sse41, running on the same machine using WSL and compiled with clang v10:

  • rANS: 213.9 MiB/s, 435,604 bytes
  • interleaved rANS: 324.7 MiB/s, 435,606 bytes
  • SIMD rANS: 556.5 MiB/s, 435,626 bytes

Under Windows 10 all code was built with MSVC 2019 x64.

It's easy to construct a Huffman decoding table that supplies 2-3 symbols per decode step.  (My Huffman benchmarks don't demonstrate this ability, but they could if I limited the max Huffman code size to 6 or 7 bits, which is practical for some problems.) AFAIK this ability has not been demonstrated with rANS. Until it is, Huffman isn't going anywhere. Considering rANS is patented, it's just not worth the risk and trouble for questionable decoding gains when usable alternatives exist. I'll revisit this in 20+ years (the lifetime of patents in the US).

Somewhere Range Coding went off the rails

I've lost track of the number of Range Coders I've seen floating around for the past ~25 years. Subbotin's coders released in the late 90's/early 2000's seem to be very influential. Each implementation has different tradeoffs that greatly impact a vectorized implementation. There is no single definitive range coder design, and I've seen 3 very different ways to handle carries. I learned a lot implementing my first vectorized range decoder.

Most of the tradeoffs boil down to how carries are handled or how many state registers are used by the decoder. You can delay the carries during encoding (Subbotin's earliest "carry-less" or delayed carry encoder that I can find, which is very clever), or fixup the already encoded bytes by propagating the carries backwards in memory during encoding (LZMA's binary range coder, which is similar to Subbotin's), or move the carry logic into the decoder (another by Subbotin called the "Russian People's Range Coder"). There are also decoder variants that track low/code/range vs. just code/range (LZMA's and Subbotin's earliest).

Importantly, range coders using this renormalization method during decoding are unnecessarily complex:

https://github.com/jkbonfield/rans_static/blob/master/arith_static.c#L222



Here's a different Subbotin variant that also uses overly complex and obfuscated renormalization logic during decoding. (Please just stop guys - this is the wrong direction.)

Why is all that messy conditional logic in the decoder renormalization step? It appears to be carry related, but the carries should have been handled during encoding in some way. None of it is necessary to create a working and correct range coder.

Instead do this, from Subbotin's first delayed carry range coder. This is easily vectorizable and he only uses 2 variables (code and range). This is quite similar to LZMA's and LZHAM's, except it doesn't need to propagate carries back in the buffer containing the output bytes.

https://gist.github.com/richgel999/d522e318bf3ad67e019eabc13137350a
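The decoder-side renormalization that falls out of this two-variable design is just a refill loop (a paraphrased sketch, not the gist verbatim):

    #include <cstdint>

    // Two-variable (code, range) decoder renormalization: no carry handling at
    // all, just refill a byte whenever the range gets too small.
    inline void renormalize(uint32_t& code, uint32_t& range, const uint8_t*& in)
    {
        while (range < (1u << 24)) {
            code = (code << 8) | *in++;
            range <<= 8;
        }
    }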




Another important note: If you switch to 24-bits vs. 32-bits, you don't need to use integer division. (Note division isn't needed for a binary range coder. It's also not strictly needed for a range coder that only supports small alphabets. The division enables using a fast decoding table for large alphabets.) Instead you can use single precision division and truncate the results. This option is crucial for vectorization on modern CPU's. Modern vectorized single precision division is fairly fast now. Or said in a different way, it's not so slow that it's a show stopper. 


In vectorized implementations of both Huffman and range decoding, most of the instructions are actually not related to the actual entropy coding method. Instead they are common things like fetching from the decoder's lookup table (using some sort of manual or explicit gather), and fetching source bytes and distributing them to the lanes during renormalization. The FP division is a small part of the whole process. 

What seems to slow down vectorized range decoding is the more complex renormalization vs. Huffman, which in my implementation can fetch between [0,2] bytes. By comparison, my vectorized Huffman decoders fetch either 0 or 2 bytes per step.

All of these details are starting to matter now that rANS coding is falling under patent protection. Range coding is a ~44-year-old technique.

LZ_XOR/LZ_ADD progress


I'm tired of all the endless LZ clones, so I'm trying something different.

I now have two prototype LZ_XOR/ADD lossless codecs. In this design a new fundamental instruction is added to the usual LZ virtual machine, either XOR or ADD. Currently the specific instruction added is decided at the file level. (From now on I'm just going to say XOR, but I really mean XOR or ADD.)

These new instructions are like the usual LZ matches, except XOR's are followed by a list of entropy coded byte values that are XOR'd with the string bytes matched in the sliding dictionary. On certain types of content these new ops are a win (sometimes a big win), but I'm still benchmarking it.

The tradeoff is an expensive fuzzy search problem. Also, with this design you're on your own - because there's nobody to copy ideas from. 

One prototype is byte oriented and is somewhat fast to decompress (>1 GiB/sec.), the other is like LZMA and uses a bitwise range coder. Fuzzy matching is difficult but I've made a lot of headway. It's no longer a terrifying search problem, now it's just scary.






The ratio of XOR's vs. literals or COPY ops highly depends on the source data. On plain text XOR's are weak and not worth the trouble. They're extremely strong on audio and image data, and they excel on binary or structured content. 

With the LZMA-like codec LZ_XOR instructions using mostly 0 delta bytes can become so cheap to code they can be preferred over COPY's, which is at first surprising to see. It can be cheaper to extend an LZ_XOR with some more delta bytes vs. truncating one and starting a COPY instruction. On some repetitive log files nearly all emitted instructions are long XOR's.

Overall this appears to be a net win, assuming you can optimize the parsing. GPU parsing is probably required to pull this off, which I'm steadily moving towards.