Channel: Richard Geldreich's Blog

BC1/BC7/ASTC encoding notes

There are probably only a dozen or so developers interested in this level of detail about high quality texture encoding, but here you go. I've been spending a lot of time exploring fast BC1 encoding again, which triggered this blog post. Here's where I'm currently at:


ispc_texcomp and libsquish both use SIMD instructions, while the others are scalar.

This is BC1's "Pareto Frontier", which is a key concept used in lossless compression benchmarking. (There are 2 BC1 codecs missing: the ones in NVidia Texture Tools and AMD Compressonator. I don't think either would change this graph in a fundamental way, but I'll add them.) This applies to BC7/UASTC/ASTC (RGB/RGBA Direct CEM's) too, because the same core algorithms are used on each subset. Any basic improvements made here will benefit all similar endpoint-centric texture formats. So this frontier matters to us a great deal. (Note that GPU encoders may be much faster in an absolute sense, but they all boil down to using the same basic algorithms which is what we're really interested in here.)

For BC1 encoding (and this applies to BC7/ASTC and GPU encoders too), here are the key things you should do to best balance quality vs. perf that I've learned:

1. Do PCA, find 2 colors in block furthest apart along this axis. Use these colors as initial endpoints.

Note it's slightly better (and faster/simpler) to use 2 colors from the block as initial endpoints - not the versions projected along the axis. (The "stb_dxt" approach.)

Interestingly, the PCA step can be approximated. This all-integer approximation is surprisingly effective, except on crazy outlier images like frymire (where it still performs admirably well, especially with 2 least squares passes). You can also specialize the common grayscale case (see the same "alternate" encoder function).
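A minimal sketch of step 1, assuming floating-point RGB pixels. The names `Color3`, `principal_axis` and `initial_endpoints` are made up for illustration and are not taken from any of the encoders mentioned here. It builds the covariance matrix, runs a few power iterations, and returns the indices of the two block colors furthest apart along the axis (actual block colors, not their projections).

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

struct Color3 { float r, g, b; };              // hypothetical pixel type
struct EndpointIndices { std::size_t lo, hi; };

// Principal axis of the block's colors: covariance matrix followed by a few
// power iterations. Returns a unit vector; its sign is arbitrary.
inline std::array<float, 3> principal_axis(const Color3* px, std::size_t n, int iters = 4) {
    assert(px != nullptr && n > 0);
    float mean[3] = {0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < n; ++i) {
        mean[0] += px[i].r; mean[1] += px[i].g; mean[2] += px[i].b;
    }
    for (float& m : mean) m /= static_cast<float>(n);

    float c[3][3] = {};                        // 3x3 covariance (unnormalized)
    for (std::size_t i = 0; i < n; ++i) {
        const float d[3] = {px[i].r - mean[0], px[i].g - mean[1], px[i].b - mean[2]};
        for (int a = 0; a < 3; ++a)
            for (int b = 0; b < 3; ++b) c[a][b] += d[a] * d[b];
    }

    // Seed with the covariance column of the highest-variance channel
    // (an assumption of this sketch; other seeds are possible).
    int k = 0;
    for (int a = 1; a < 3; ++a) if (c[a][a] > c[k][k]) k = a;
    std::array<float, 3> v = {c[0][k], c[1][k], c[2][k]};

    for (int it = 0; ; ++it) {
        const float len = std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
        if (len < 1e-12f) return {0.57735f, 0.57735f, 0.57735f};  // solid block: any axis works
        for (float& x : v) x /= len;
        if (it == iters) return v;
        v = {c[0][0] * v[0] + c[0][1] * v[1] + c[0][2] * v[2],
             c[1][0] * v[0] + c[1][1] * v[1] + c[1][2] * v[2],
             c[2][0] * v[0] + c[2][1] * v[1] + c[2][2] * v[2]};
    }
}

// Indices of the two block colors with the smallest and largest projection
// onto the principal axis. These become the initial endpoints.
inline EndpointIndices initial_endpoints(const Color3* px, std::size_t n) {
    const std::array<float, 3> axis = principal_axis(px, n);
    EndpointIndices e = {0, 0};
    float dmin = INFINITY, dmax = -INFINITY;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = px[i].r * axis[0] + px[i].g * axis[1] + px[i].b * axis[2];
        if (d < dmin) { dmin = d; e.lo = i; }
        if (d > dmax) { dmax = d; e.hi = i; }
    }
    return e;
}
```

Because the axis sign is arbitrary, which of the two colors ends up as `lo` or `hi` is not meaningful at this stage.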

2. Compute initial selectors using these endpoints.

There are numerous approaches, but the one I like best computes a trial selector index using a scaled dot product that results in a selector index from [0,N-1], clamps this to [1,N-1], then computes the errors of using colors[trial_s-1] and colors[trial_s] and chooses the best. This avoids having to check every block color (a big win for 16/32 color BC7/UASTC blocks).

This is like stb_dxt's method, except we compute the actual colorspace error to two trial colors nearest the projection.
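Here is one way the selector search in step 2 might look. It is a sketch under two assumptions: the palette is stored in linear order along the endpoint line (`palette[0]` is one endpoint, `palette[n-1]` the other, so BC1's hardware selector order would need a remap afterwards), and the trial index is taken as floor of the scaled projection plus one, which is my reading of the description above. `pick_selector` and `Color3` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Color3 { float r, g, b; };   // hypothetical pixel type

inline float dist2(const Color3& a, const Color3& b) {
    const float dr = a.r - b.r, dg = a.g - b.g, db = a.b - b.b;
    return dr * dr + dg * dg + db * db;
}

// Returns the index (in linear palette order) of the best of the two palette
// colors that bracket the pixel's projection onto the endpoint line.
inline int pick_selector(const Color3& p, const Color3* palette, int n) {
    assert(palette != nullptr && n >= 2);
    const Color3& lo = palette[0];
    const Color3& hi = palette[n - 1];
    const float ax = hi.r - lo.r, ay = hi.g - lo.g, az = hi.b - lo.b;
    const float len2 = ax * ax + ay * ay + az * az;
    if (len2 <= 0.0f) return 0;     // degenerate palette: all entries identical

    // Scaled dot product: position of p along the line, in palette steps.
    const float t = ((p.r - lo.r) * ax + (p.g - lo.g) * ay + (p.b - lo.b) * az) / len2;
    float f = std::floor(t * static_cast<float>(n - 1)) + 1.0f;
    f = std::min(std::max(f, 1.0f), static_cast<float>(n - 1));   // clamp to [1, n-1]
    const int s = static_cast<int>(f);

    // Only two real colorspace error evaluations, however large the palette.
    return dist2(p, palette[s - 1]) <= dist2(p, palette[s]) ? s - 1 : s;
}
```

The cost per pixel stays at two distance evaluations whether the palette has 4 entries or 32, which is where the win for larger BC7/UASTC palettes comes from.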

3. Least Squares (LS) using these selectors. If LS fails, try the block's average color using optimal solid-color tables. Find optimal selectors.

There are many LS methods, see this code for two different approaches. One uses an incremental PCA approach that works well in 4D, the other computes covariance then uses 3-8 power iterations.
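For readers who haven't seen the least squares step written out, here is a sketch of the usual 2x2 normal-equations solve for the endpoints, given fixed selectors. It assumes selectors are in linear order (0 selects `lo`, `levels - 1` selects `hi`); the function name and signature are mine. A `false` return is the "LS fails" case from step 3, which happens when every pixel uses the same selector and the system is singular.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct Color3 { float r, g, b; };   // hypothetical pixel type

// Solves for floating-point endpoints lo/hi minimizing
//   sum_i | (1 - w_i) * lo + w_i * hi - px[i] |^2,  w_i = sel[i] / (levels - 1).
// Returns false if the system is singular.
inline bool least_squares_endpoints(const Color3* px, const int* sel, std::size_t n,
                                    int levels, Color3& lo, Color3& hi) {
    assert(px != nullptr && sel != nullptr && levels >= 2);
    double aa = 0.0, ab = 0.0, bb = 0.0;
    double ap[3] = {0.0, 0.0, 0.0}, bp[3] = {0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < n; ++i) {
        const double b = static_cast<double>(sel[i]) / static_cast<double>(levels - 1);
        const double a = 1.0 - b;
        const double c[3] = {px[i].r, px[i].g, px[i].b};
        aa += a * a; ab += a * b; bb += b * b;
        for (int k = 0; k < 3; ++k) { ap[k] += a * c[k]; bp[k] += b * c[k]; }
    }
    const double det = aa * bb - ab * ab;
    if (std::fabs(det) < 1e-9) return false;

    // Inverse of [[aa, ab], [ab, bb]] applied to (ap, bp), per channel.
    double l[3], h[3];
    for (int k = 0; k < 3; ++k) {
        l[k] = (bb * ap[k] - ab * bp[k]) / det;
        h[k] = (aa * bp[k] - ab * ap[k]) / det;
    }
    lo = {static_cast<float>(l[0]), static_cast<float>(l[1]), static_cast<float>(l[2])};
    hi = {static_cast<float>(h[0]), static_cast<float>(h[1]), static_cast<float>(h[2])};
    return true;
}
```

The results are unclamped floats, so they still need to be clamped and quantized to the format's endpoint precision.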

4. Optional: Try LS one more time (that's what STB_DXT_HIGHQUAL does). Good win.

Importantly, try LS a second time even if the first time failed and you chose endpoints from your optimal solid color tables. This is a small win.
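The "optimal solid-color tables" mentioned in steps 3 and 4 can be sketched roughly as follows. For every 8-bit target value the table stores the pair of quantized endpoint components whose 2/3:1/3 interpolant lands closest to it. This assumes ideal interpolation and bit-replication expansion; real encoders may add a penalty for widely separated endpoints to account for how hardware approximates the interpolation, which this sketch leaves out. All names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstdlib>

struct EndpointPair { std::uint8_t lo, hi; };   // quantized endpoint components

// Expand a `bits`-wide component to 8 bits by bit replication.
inline int expand_to_8(int q, int bits) {
    assert(bits >= 4 && bits <= 8);
    return (q << (8 - bits)) | (q >> (2 * bits - 8));
}

// table[v] = quantized (lo, hi) whose interpolant (2*lo8 + hi8) / 3 is closest to v.
inline std::array<EndpointPair, 256> build_single_color_table(int bits) {
    std::array<EndpointPair, 256> table{};
    const int levels = 1 << bits;
    for (int target = 0; target < 256; ++target) {
        int best_err = 3 * 255 + 1;             // larger than any possible error
        for (int lo = 0; lo < levels; ++lo) {
            for (int hi = 0; hi < levels; ++hi) {
                const int a = expand_to_8(lo, bits), b = expand_to_8(hi, bits);
                // Compare in units of 1/3 so everything stays in integers.
                const int err = std::abs((2 * a + b) - 3 * target);
                if (err < best_err) {
                    best_err = err;
                    table[target] = {static_cast<std::uint8_t>(lo),
                                     static_cast<std::uint8_t>(hi)};
                }
            }
        }
    }
    return table;
}
```

One table per component width is enough: for BC1 that would be a 5-bit table for red and blue and a 6-bit table for green, each looked up independently with the block's average color.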

5. For higher quality, carefully vary the selectors and try least squares:
- Try encoding the block's average color using optimal single-color tables. (This is a surprising win for formats with low bit endpoint components.)
- Try incrementing all the minimum selectors+LS.
- Try decrementing all the maximum selectors+LS.
- Try both incrementing the min and decrementing the max selectors+LS.
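The three selector variations in the list above could be generated along these lines. This is a sketch of one interpretation: "minimum" and "maximum" are taken to mean the smallest and largest selector values actually present in the block, with selectors in linear order. The `Vary` enum and `vary_selectors` are illustrative names. Each variant would then be fed to the least squares step and kept only if the resulting block error drops.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class Vary { IncMin, DecMax, Both };

// Returns a modified copy of `sel` (linear order, values in [0, levels-1]).
inline std::vector<int> vary_selectors(const std::vector<int>& sel, int levels, Vary mode) {
    assert(levels >= 2 && !sel.empty());
    const int mn = *std::min_element(sel.begin(), sel.end());
    const int mx = *std::max_element(sel.begin(), sel.end());
    std::vector<int> out = sel;
    for (int& s : out) {
        if (mode != Vary::DecMax && s == mn && s < levels - 1) {
            s += 1;     // push the lowest selectors up one step
        } else if (mode != Vary::IncMin && s == mx && s > 0) {
            s -= 1;     // pull the highest selectors down one step
        }
    }
    return out;
}
```

For a 4-level block with selectors `{0, 1, 2, 3}`, the three modes give `{1, 1, 2, 3}`, `{0, 1, 2, 2}` and `{1, 1, 2, 2}`.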

These selector manipulations are big wins. Others are possible. One uses a precomputed table-driven approach of best unique total orderings to try given the current total ordering. (I've been tweeting about this over the past couple of days. It's a big win.) This exploits the property of some BC1 selector total orderings being used much more often than others.

More info here.

The other tries scaling the selectors to better exploit endpoint interpolation (tiny/marginal win).

Notes:
Least squares gives you floating point endpoints, which must be quantized to 5 or 6 bit components for BC1 (and similar for BC7/ASTC). To do this correctly, use Castano's optimal rounding method: https://gist.github.com/castano/c92c7626f288f9e99e158520b14a61cf

The optimal rounding method applies to all the formats.
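To illustrate why plain rounding isn't enough: the decoder expands a 5- or 6-bit component back to 8 bits by bit replication, and those expanded values are not evenly spaced, so the quantized value nearest in the scaled domain is not always the one that reconstructs closest. The sketch below is not Castano's code (the linked gist is the reference). It gets to the same goal by directly comparing reconstruction error for the neighboring candidates. Function names are mine, and the input is assumed to be in 8-bit scale [0, 255].

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Expand a `bits`-wide component to 8 bits by bit replication.
inline int expand_to_8(int q, int bits) {
    assert(bits >= 4 && bits <= 8);
    return (q << (8 - bits)) | (q >> (2 * bits - 8));
}

// Quantize x (8-bit scale) to `bits` bits so that the value the decoder
// reconstructs is as close to x as possible.
inline int quantize_optimal(float x, int bits) {
    const int qmax = (1 << bits) - 1;
    x = std::min(std::max(x, 0.0f), 255.0f);
    const int q0 = static_cast<int>(x * static_cast<float>(qmax) / 255.0f);  // truncated guess

    int best = q0;
    float best_err = 1e30f;
    for (int q = std::max(q0 - 1, 0); q <= std::min(q0 + 1, qmax); ++q) {
        const float err = std::fabs(static_cast<float>(expand_to_8(q, bits)) - x);
        if (err < best_err) { best_err = err; best = q; }
    }
    return best;
}
```

As an example, for a 5-bit component and x = 12.2, scaling gives 12.2 * 31 / 255 ≈ 1.48, which naive rounding sends to 1 (reconstructs as 8, error 4.2). The value 2 reconstructs as 16 with error 3.8, so that is what the function returns.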

Also, you need to carefully tie break between selectors that result in the same encoding error: https://twitter.com/richgel999/status/1243894923000254466

Why? Because how you break ties subtly interacts with the following least squares pass. (I called this "improved rounding" for some reason in my post-quarantined state.)

It's possible for both endpoints to quantize into a single colorspace voxel, and your encoder unnecessarily loses "freedom". We deal with this in our UASTC/BC7 encoders by manually "pulling" the endpoints apart. It's a tricky problem that needs more attention.

Note if you're doing BC7 you must implement p-bits correctly or you're totally wasting the format's potential:
https://richg42.blogspot.com/2018/04/proper-pbit-computation-in-bc7-texture.html

Most of the above applies to the basic "RGB/RGBA Direct" modes in UASTC/ASTC too.

stb_dxt's BC1 encoder is, as far as I can tell, Pareto optimal for scalar BC1 encoding (once you add Castano's optimal endpoint rounding, fix its selector tie breaking, and the precision of the axis vector used in its selector determination step). If you can improve the quality of scalar stb_dxt BC1 without slowing it down, it's likely to be an important change that will benefit all the endpoint-centric texture formats.

For reference, here's Simon Brown's "DXT Compression Techniques" blog post and link to libsquish:
http://sjbrown.co.uk/2006/01/19/dxt-compression-techniques/
https://github.com/svn2github/libsquish

I've examined all of the available GPU/CPU encoders I can find for endpoint-centric formats, and the above algorithms are the most competitive I know about (highest quality per unit of CPU time). Generally, for every .25-.5 dB you can push a SIMD encoder "up", the faster it can be made to go for the same average quality (as quality and perf. are interrelated).

More on how BC1 is approximated by actual GPU's (BC3-5 are too):
https://twitter.com/richgel999/status/1244638912401809409
https://twitter.com/richgel999/status/1244657623695339520
http://www.ludicon.com/castano/blog/2009/03/gpu-dxt-decompression/


