
LZHAM and crunch are now Public Domain software


How to benchmark or use the UASTC encoder in the Basis Universal library

UASTC is a subset of LDR ASTC with a fixed 4x4 block size, always 8bpp, and very high quality. If your engine/product/benchmark supports BC7 or LDR ASTC 4x4, trying out our UASTC encoder/transcoder (without using .basis or .KTX2 at all) is pretty simple:

Compile/link in the Basis Universal encoder and transcoder .cpp files (or put them into libs). Call basisu_encoder_init() at startup.

To encode 4x4 blocks to the 8bpp UASTC format, call encode_uastc():
https://github.com/BinomialLLC/basis_universal/blob/master/encoder/basisu_uastc_enc.h

To decode UASTC blocks to raw 32bpp pixels, call

bool unpack_uastc(const uastc_block& blk, color32* pPixels, bool srgb);

Set the "srgb" flag to always false right now, because that's what the UASTC encoder assumes it will be set to. (We're fixing this for the Feb. release.)

Or you can call transcode_uastc_to_bc7() or transcode_uastc_to_astc(), then unpack those blocks yourself (ASTC will always be equal or higher quality than BC7 because UASTC is a pure subset of LDR 4x4 ASTC):

https://github.com/BinomialLLC/basis_universal/blob/master/transcoder/basisu_transcoder_uastc.h

There's an optional RDO post processor in there too that you can call on arrays of UASTC blocks, but it's pretty basic right now. See uastc_rdo().

The advantage of UASTC is that you can transcode it at run-time to basically any texture format. There are very high quality transcoders to BC1-5, ETC1/2, BC7, etc. It even supports PVRTC1. The disadvantages are a slight drop in quality vs. the best BC7/ASTC encoders (but not much) and slower encoding. We even throw in a free RDO encoder (as a simple post processor) for UASTC.

RDO BC1-BC7 progress

I've been making progress on my first RDO BC7 encoder. I started working on RDO BC1-7 years ago, but I put this work on hold to open source Basis Universal. BasisU was way more important from a business perspective. (Games are fun and all, but the game business doesn't pay and web and mapping are where the eyeballs are at.)

RDO BC1-5 are done and already checked into the bc7enc_rdo repo. The test app in this repo only currently supports RDO BC1 and BC4, but I'll add in BC3/5 very soon (they are just trivial variations of BC1/4). I'm hoping the KTX2 guys will add this encoder to their repo, so I don't have to create yet another command line tool that supports mipmaps, reading/writing DDS/KTX, etc. RDO BC1-5 are implemented as post-processors, so they are compatible with any other non-RDO BC1-5 encoder. 

For my first RDO BC7 encoder, I've modified bc7enc's BC7 encoder (which purposely only supports 4 modes: 1/5/6/7) to support optional per-mode error weights, and 6-bit endpoint components with fixed 0/1 p-bits in mode 6. These two simple changes immediately reduce LZ compressed file sizes by around 5-10% with Deflate, with no perf. impact. I may support doing something like this for the other modes. I also implemented Castano's optimal endpoint rounding method, because why not.

The next step is creating a post-processor that accepts an array of encoded BC7 blocks, and modifies them for higher quality per compressed bit by increasing LZ matches. The post-processor function will support all the modes, although I'm testing primarily with bc7enc at the moment. Merging selector bits with previously encoded blocks is the simplest thing to do, which I just got working for any mode. 

I'm using the usual Lagrangian multiplier method (j=D+l*R, where D=MSE, R=predicted bits, l=lambda). Here's a good but dense article on rate distortion methods and theory: Rate-distortion methods for image and video compression, by Ortega and Ramchandran (1998). I first read this years ago while working on Halo Wars 1's texture compression system, which was like crunch's .CRN mode but simpler. None of this stuff is new; the image and video folks have been doing it for decades.

I first implemented the Lagrangian multiplier method in 2017, as a postprocess on top of crunch's BC1 RDO mode, which we sent to a few companies. The Lagrangian multiplier method itself is easy, but estimating LZ bitrate and especially handling smooth blocks is tricky. The current smooth block method I'm using computes the maximum standard deviation of any component in each block, and from that scalar it computes a per-block MSE error scale. This artificially amplifies computed errors on smooth blocks, which is a hack, but it does seem to work. This hurts R-D performance, but something must be done or smooth blocks turn to shit.
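Here's a minimal sketch of that smooth-block hack. The parameter defaults are made up (the real knobs are the "max smooth block MSE scale" and "max std dev" values shown in the results below): compute the max per-component standard deviation over the 4x4 block, then linearly map low values to a large MSE scale.

// Illustrative only - function names and default values are placeholders, not the shipping code.
#include <algorithm>
#include <cmath>
#include <cstdint>

struct rgba8 { uint8_t c[4]; };

// Max standard deviation of any color component over a 4x4 block.
static float max_component_std_dev(const rgba8 pixels[16], uint32_t num_comps = 3)
{
    float max_sd = 0.0f;
    for (uint32_t i = 0; i < num_comps; i++)
    {
        float sum = 0.0f, sum2 = 0.0f;
        for (uint32_t p = 0; p < 16; p++)
        {
            float v = pixels[p].c[i];
            sum += v; sum2 += v * v;
        }
        float mean = sum / 16.0f;
        float var = std::max(0.0f, sum2 / 16.0f - mean * mean);
        max_sd = std::max(max_sd, std::sqrt(var));
    }
    return max_sd;
}

// Smooth blocks (low max std dev) get their MSE artificially amplified, so the
// Lagrangian cost j = MSE * scale + bits * lambda is less willing to distort them.
static float smooth_block_mse_scale(float max_std_dev,
                                    float max_smooth_std_dev = 18.0f,
                                    float max_mse_scale = 10.0f)
{
    if (max_std_dev >= max_smooth_std_dev)
        return 1.0f;                                     // noisy block: plain MSE
    float t = max_std_dev / max_smooth_std_dev;          // 0 = flat, 1 = at the threshold
    return 1.0f + (1.0f - t) * (max_mse_scale - 1.0f);   // flat blocks get the full scale
}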

Some RDO BC1 and BC7 examples on kodim23, which has lots of smooth blocks:

RDO BC1: 8KB dictionary, lambda=1.0, max smooth block MSE scale=10.2, max std dev=18.0, linear metrics
38.763 RGB dB, 2.72 bits/texel (Deflate, miniz max compression)


RDO BC7 modes 1+6: 8KB dictionary, lambda=1.0, max smooth block MSE scale=19.2, max std dev=18.0, -u4, linear metrics
42.659 RGB dB, 5.05 bits/texel (Deflate, miniz max compression)
Mode 1: 1920 blocks
Mode 6: 22656 blocks



RDO BC7 modes 1+6: 8KB dictionary, lambda=2.0, max smooth block MSE scale=20.4, max std dev=18.0, -u4, linear metrics
40.876 RGB dB, 4.41 bits/texel (Deflate, miniz max compression)
Mode 1: 1920 blocks
Mode 6: 22656 blocks


To get an idea how bad things can get if you don't do anything to handle smooth blocks, here's BC7 modes 1+6: lambda=1.0, no smooth block error scaling (max MSE scale=1.0):
38.469 RGB dB, 3.49 bits/texel


I'm showing kodim23 because how you handle smooth blocks in this method is paramount. 92% of kodim23's blocks are treated as smooth blocks (because the block's max component standard deviation is <= 18). This means that most of the MSE errors being computed and plugged into the Lagrangian calculation are being artificially scaled up. There must be a better way, but at least it's simple. (By comparison, crunch's clusterization-based method didn't do anything special for smooth blocks - it just worked.)

I'm still tuning how smooth blocks are handled. Being too conservative with smooth blocks can cause very noticeable block artifacts at higher lambdas. Being too liberal with smooth blocks causes the R-D efficiency (quality per LZ bit) to go down:


Here are some R-D curves from my early BC1 RDO results. rgbcx.h's RDO BC1 is clearly beating crunch's 13 year old RDO BC1 implementation, achieving higher quality per LZ compressed bit. crunch is based on endpoint/selector clusterization+refinement with no direct awareness of what LZ is going to do with the output data, so this isn't surprising.


(These images are way too big, but I'm too tired to resize them.)

For some historical background: the crunch library (which also supports RDO BC1-5) has always computed and displayed LZMA statistics on the compressed texture's output data. (As a side note, there's no point using Unity's crunch repo for RDO BC1-5 - they didn't optimize RDO, just .CRN.) The entire goal of crunch, from the beginning, was RDO BC1-5; it achieved this indirectly by causing many blocks, especially nearby ones, to use the same endpoint/selector bits, increasing the ratio of LZ matches vs. literals. I remember being very excited by RDO texture encoders in 2009, because I realized how useful they would be to video game developers. For a fun but practical Windows demo I wrote years ago of crunch's RDO encoder (written using managed C++ of all things), check out ddsexport.

Anyhow, the next step is to further enhance RDO BC7 opaque, then dive into alpha. I'll be open sourcing the RDO BC7 postprocessor within a week. After this I'm going to write a second stronger version.

I suspect everybody will switch to RDO texture encoders at some point. Selector RDO with optional endpoint refinement is very easy to do on all the GPU texture formats, even PVRTC1.

I recently went back and updated the UASTC RDO encoder to use the same basic options (lambda and smooth block settings) as my RDO BC1-7 encoders. The original UASTC RDO encoder controlled quality vs. bitrate in a different way. These changes will be in basisu v1.13, which should be released on github hopefully by next week (once we get approval from the company we're working with).

More RDO BC7 encoder progress

I finally sat down and added a simple LZ4-like simulator to my in-progress, soon-to-be-open-source RDO BC7 encoder. You can add blocks in, query it to find the longest/nearest match, and give it some bytes and ask it how many bits it would take to code (up to 128 for BC7, less if it finds matches). It's definitely the right path forward for RDO encoders. For BC7 modes 1 and 6, it looks like it's accurate vs. Deflate to within around 1.5-7%. It predicts on the high side vs. Deflate, because it doesn't have a Huffman model. Mode 1's predictions tend to be more accurate, I think because that mode's encoded endpoints are nicely aligned on byte boundaries.

With BC7 RDO encoding, you really need an LZ simulator of some sort. Or you need decent approximations. Once you can simulate how many bits a block compresses to, you can then have the encoder try replacing byte aligned sequences within each block (with sequences that appear in previous blocks). This is the key magic that makes this method work so well. You need to "talk" to the LZ compressor in the primary language it understands: 2+ or 3+ length byte matches.
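To make that concrete, here's a toy version of the idea (not my actual simulator - the per-match/per-literal costs and minimum match length are placeholders): greedily cover the 16 block bytes with 3+ byte matches found in the previously emitted bytes, and charge a flat cost per match vs. per literal.

// Toy LZ bit estimator - the constants are placeholders, not my encoder's model.
#include <algorithm>
#include <cstdint>
#include <vector>

// Estimate the bits needed to code 'blk' (16 bytes for BC7) given the bytes already
// emitted ('window'). Greedy: at each position take the longest match of length >= 3
// found anywhere in the window, otherwise emit a literal.
static uint32_t estimate_block_bits(const uint8_t blk[16], const std::vector<uint8_t>& window,
                                    uint32_t literal_bits = 8, uint32_t match_bits = 20)
{
    uint32_t total_bits = 0;
    uint32_t i = 0;
    while (i < 16)
    {
        uint32_t best_len = 0;
        for (size_t j = 0; j + 3 <= window.size(); j++)
        {
            uint32_t len = 0;
            while ((i + len < 16) && (j + len < window.size()) && (window[j + len] == blk[i + len]))
                len++;
            best_len = std::max(best_len, len);
        }
        if (best_len >= 3)
        {
            total_bits += match_bits;   // flat cost per match (no Huffman model, like my simulator)
            i += best_len;
        }
        else
        {
            total_bits += literal_bits; // flat cost per literal byte
            i++;
        }
    }
    return total_bits;
}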

For example, with mode 6, the selectors are 4 bits per texel and are aligned at the end of the block, so each byte holds 2 texels' worth of selectors. If your p-bits are always [0,1] (mine are in RDO mode), then it's easy to substitute various regions of bytes from previously encoded mode 6 blocks and see what LZ does.

This is pretty awesome because it allows the encoder to escape from being forced to always use an entire previous block's selectors, greatly reducing block artifacts.
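A sketch of what one substitution trial looks like (decode_bc7(), compute_mse() and estimate_lz_bits() below are placeholders for routines the encoder already has, not a real API): copy a byte run from a previously encoded block into the current block, decode it, and score it with the Lagrangian.

// Illustrative trial; the three declared helpers are placeholders for the encoder's
// existing decode, MSE, and LZ bit-estimation routines.
#include <cstdint>
#include <cstring>

struct bc7_block { uint8_t bytes[16]; };

void  decode_bc7(const bc7_block& blk, uint8_t out_rgba[16 * 4]);        // placeholder
float compute_mse(const uint8_t a[16 * 4], const uint8_t b[16 * 4]);     // placeholder
float estimate_lz_bits(const bc7_block& blk);                            // placeholder (LZ simulator)

// Try replacing 'len' bytes at 'dst_ofs' with the same bytes from a previous block
// (caller ensures dst_ofs + len <= 16). Returns the trial's Lagrangian cost
// j = MSE * smooth_scale + bits * lambda; lower j wins.
static float try_selector_substitution(const bc7_block& cur, const bc7_block& prev,
                                       uint32_t dst_ofs, uint32_t len,
                                       const uint8_t orig_pixels[16 * 4],
                                       float smooth_scale, float lambda)
{
    bc7_block trial = cur;
    std::memcpy(&trial.bytes[dst_ofs], &prev.bytes[dst_ofs], len);   // byte-aligned swap

    uint8_t unpacked[16 * 4];
    decode_bc7(trial, unpacked);                        // decode the modified block
    float mse  = compute_mse(orig_pixels, unpacked);    // distortion of the trial
    float bits = estimate_lz_bits(trial);               // LZ simulator's bit estimate

    return mse * smooth_scale + bits * lambda;
}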

In one experiment, around 40% of the blocks that got selector byte substitutions from previous blocks are from plugging in 3 or 4 byte matches and evaluating the Lagrangian.

40% is ridiculously high - which means this technique works well. It'll work with BC1 too. The downside (as usual) is encoding performance.

I've implemented byte replacement trials for 3-8 byte matches. All are heavily used, especially 7 and 8 byte matches. I may try other combinations, like trying two 3 byte matches with 2 literals, etc. You can also do byte replacement in two passes, by trying 3 or 4 byte sequences from 2 previously encoded blocks.

Making this go fast will be a perf. optimization challenge. I'm convinced that you need to do something like this; otherwise you're always stuck replacing an entire block's worth of selectors, which can be way uglier.

Example encodings (non-RDO modes 1+6 is 42.253 dB, 6.84 bits/texel):

- RDO BC7 mode 1+6, lambda .1, 8KB max search distance, match replacements taken from up to 2 previous blocks
41.765 RGB dB, 6.13 bits/texel (Deflate - miniz library max compression)



- RDO BC7 mode 1+6, lambda .25, 8KB max search distance, match replacements taken from up to 2 previous blocks
41.496 RGB dB, 5.78 bits/texel (Deflate - miniz library max compression)



- RDO BC7 mode 1+6, lambda .5, 8KB max search distance, match replacements taken from up to 2 previous blocks
40.830 RGB dB, 5.36 bits/texel (Deflate - miniz library max compression)



- RDO BC7 mode 1+6, lambda 1.0, 4KB max search distance
39.507 RGB dB, 4.97 bits/texel (Deflate - miniz library max compression)



Mode 6 byte replacement histogram (lengths of matches, in bytes):
Length (bytes):  3     4     5     6     7     8
Count:           5000  3688  3833  3975  4632  14752

- RDO BC7 mode 1+6, lambda 3.0, 2KB max search distance
36.161 dB, 4.59 bits/texel




- RDO BC7 mode 1+6, lambda 4.0, 2KB max search distance
35.035 dB, 4.47 bits/texel



- RDO BC7 mode 1+6, lambda 5.0, 4KB max search distance, match replacements taken from up to 2 previous blocks
33.760 dB, 3.96 bits/texel



- RDO BC7 mode 1+6, lambda 8.0, 4KB max search distance, match replacements taken from up to 2 previous blocks
32.072 dB, 3.47 bits/texel



- RDO BC7 mode 1+6, lambda 10.0, 4KB max search distance, match replacements taken from up to 2 previous blocks
31.318 dB, 3.32 bits/texel


- RDO BC7 mode 1+6, lambda 10.0, 8KB max search distance, match replacements taken from up to 2 previous blocks
31.279 dB, 3.21 bits/texel

- RDO BC7 mode 1+6, lambda 12.0, 8KB max search distance, match replacements taken from up to 2 previous blocks
30.675 dB, 3.07 bits/texel


- RDO BC7 mode 1+6, lambda 20.0, 8KB max search distance, match replacements taken from up to 2 previous blocks
29.179 dB, 2.68 bits/texel



- Non-RDO mode 1+6 (bc7enc level 4)
42.253 dB, 6.84 bits/texel:



- Original 1024x1024 image:



Simple and fast ways to reduce BC7's output entropy (and increase LZ matches)


It's relatively easy to reduce the output entropy of BC7 by around 5-10% without slowing down encoding (it can even speed it up). I'll be adding this stuff to the bc7e ispc encoder soon. I've been testing these tricks in bc7enc_rdo (a sketch of the first two follows the list):

- Weight the mode errors: For example weight mode 1 and 6's errors way lower than the other modes. This shifts the encoder to use modes 1 and 6 more often, which reduces the output data's entropy. This requires the other modes to make a truly significant difference in reducing distortion before the encoder switches to using them.

- Biased p-bits: When deciding which p-bits to use (0 vs. 1), weight the error from using p-bit 1 slightly lower (or the opposite). This will cause the encoder to favor one of the p-bits more than the other, reducing the block output data's entropy.

- Partition pattern weighting: Weight the error from using the lower frequency partitions [0,15] or [0,33] slightly lower vs. the other patterns. This reduces the output entropy of the first or second byte of BC7 modes with partitions.

- Quantize mode 6's endpoints and force its p-bits to [0,1]: Mode 6 uses 7-bit endpoint components. Use 6-bits instead, with fixed [0,1] p-bits. You'll need to do this in combination with reducing mode 6's error weight, or a multi-mode encoder won't use mode 6 as much. 

- Don't use mode 4/5 component rotations, or the index flag. 

In practice these features aren't particularly useful, and they just increase the output entropy. The component rotation feature can also cause odd looking color artifacts.

- Don't use mode 0,2,3, possibly 4: These modes are less useful, at least on albedo/specular/etc. maps, sRGB content, and photos/images. Almost all BC7 encoders, including ispc_texcomp's, can't even handle mode 0 correctly anyway.

Mode 4 is useful on decorrelated alpha. If your content doesn't have much of that, just always use mode 5.
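Here's a sketch of how the first two tricks (per-mode error weights and biased p-bits) might hook into an existing encoder's candidate selection. The weight values are arbitrary examples, not bc7e's or bc7enc's:

// Illustrative candidate scoring - the weights below are made-up examples.
#include <cfloat>
#include <cstdint>

struct bc7_candidate
{
    uint32_t mode;    // 0-7
    uint32_t pbits;   // packed p-bits for this candidate
    float    err;     // raw squared error from the trial encoding
};

// Per-mode error weights: modes 1 and 6 get the lowest weight, so the other modes
// must reduce distortion noticeably before the encoder switches to them.
static const float g_mode_weights[8] = { 1.3f, 1.0f, 1.3f, 1.3f, 1.3f, 1.15f, 1.0f, 1.3f };

static float weighted_err(const bc7_candidate& c)
{
    float e = c.err * g_mode_weights[c.mode];
    // Biased p-bits: nudge the encoder toward one p-bit pattern ([0,1] here) so the
    // corresponding output bits become more predictable.
    if (c.pbits != 1)   // 1 == the favored pattern in this sketch
        e *= 1.02f;
    return e;
}

// Pick the candidate with the lowest weighted error.
static int pick_candidate(const bc7_candidate* cands, int num)
{
    int best = -1;
    float best_e = FLT_MAX;
    for (int i = 0; i < num; i++)
    {
        float e = weighted_err(cands[i]);
        if (e < best_e) { best_e = e; best = i; }
    }
    return best;
}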


The two types of RDO BC7 encoders

There are two main high-level categories of RDO BC7 encoders:
1. The first type is optimized for highest PSNR per LZ compressed bit, but these encoders are significantly slower than ispc_texcomp/bc7e.

2. The second type is optimized for highest PSNR per LZ compressed bit per unit of encoding time. They are the same speed as, or almost as fast as, ispc_texcomp/bc7e. Some may even be faster than non-RDO encoders because they entirely ignore less useful modes (like mode 0).

To optimize for PSNR per LZ compressed bit, you can create the usual rate distortion graph (bitrate on X, quality on Y), then choose the encoder with the highest PSNR at specific bitrates (the highest/leftmost curve) that meets your encoder performance needs.

Other thoughts:
- When comparing category 2 encoders, encoding time is nearly everything.

- Category 2 encoders don't need to win against category 1 encoders. They compete against non-RDO encoders. Given two encoders, one category 2 RDO and the other non-RDO, if all other things are equal the RDO encoder will win.

- Modifying bc7e to place it into category #2 will be easy.

- Category 1 is PSNR/bitrate (where bitrate is in LZ bits/texel). Or SSIM/bitrate, but I've found SSIM to be nearly useless for texture encoding.

- Category 2 is (PSNR/bitrate)/encode_time (where encode_time is in seconds).

BC7 DDS file entropy visualization

RDO GPU texture encoders increase the number/density of LZ matches in the encoded output texture. Here's a file entropy visualization of kodim18.dds. The left image was non-RDO encoded; the right image was encoded with lambda=4.0 and a max backwards scan of 2048 bytes.

Non-RDO:



RDO:

Non-RDO, one byte matches removed:



RDO, one byte matches removed:



fv docs:
"The output is fv.bmp with the given size in pixels, which visually
displays where matching substrings of various lengths and offsets are
found. A pixel at x, y is (black, red, green, blue) if the last matching
substring of length (1, 2, 4, 8) at x occurred y bytes ago. x and y
are scaled so that the image dimensions match the file length.
The y axis is scaled log base 10."
Tool source: http://www.mattmahoney.net/dc/fv.cpp

Lagrangian multiplier based RDO encoding early outs


Some minor observations about Lagrangian multiplier based RDO (with BC7 RDO+Deflate or LZ4):

We're optimizing to find the lowest t (sometimes called j), given many hundreds/thousands of ways of encoding a BC7 block:

float t = trial_mse * smooth_block_error_scale + trial_lz_bits * lambda;

For each trial block, we compute its MSE and estimate its LZ bits using a simple Deflate/LZ4-like model.

If we already have a potential solution for a block (the best found so far), given the trial block's MSE and the current best_t we can compute how many bits (maximum) a new trial encoding would take to be an improvement. If the number of computed threshold bits is ridiculous (like negative, or just impossible to achieve with Deflate on a 128-bit block input), we can immediately throw out that trial block:

threshold_trial_lz_bits = (best_t - trial_mse * smooth_block_error_scale) / lambda

Same for MSE: if we already have a solution, we can compute the MSE threshold where it's impossible for a trial to be an improvement:

threshold_trial_mse = (best_t - (trial_lz_bits * lambda)) / smooth_block_error_scale

This seems less valuable because running the LZ simulator to compute trial_lz_bits is likely more expensive than computing a trial block's MSE. We could plug in a lowest possible estimate for trial_lz_bits and use that as a threshold MSE. Another interesting thing: trials are very likely to have an MSE >= the MSE of the best encoding found so far for a block.

Using simple formulas like this results in large perf. improvements (~2x).
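In code form, the bit-count early out looks roughly like this (a sketch, not the exact bc7enc_rdo inner loop; compute_trial_mse() and estimate_lz_bits() stand in for whatever the encoder already has):

// Sketch of the trial loop with the early out described above.
#include <cfloat>

float compute_trial_mse(int trial_index);   // placeholder: MSE of a candidate encoding
float estimate_lz_bits(int trial_index);    // placeholder: the (more expensive) LZ simulator

// Returns the index of the best trial under t = mse*scale + bits*lambda, skipping the
// LZ simulation whenever a trial can't possibly beat the current best.
int pick_best_trial(int num_trials, float smooth_block_error_scale, float lambda)
{
    float best_t = FLT_MAX;
    int best_trial = -1;

    for (int i = 0; i < num_trials; i++)
    {
        const float trial_mse = compute_trial_mse(i);

        // Max LZ bits this trial could cost and still be an improvement.
        const float threshold_lz_bits = (best_t - trial_mse * smooth_block_error_scale) / lambda;
        if (threshold_lz_bits <= 0.0f)
            continue;   // impossible - skip the expensive LZ bit estimate entirely

        const float trial_lz_bits = estimate_lz_bits(i);
        const float t = trial_mse * smooth_block_error_scale + trial_lz_bits * lambda;
        if (t < best_t) { best_t = t; best_trial = i; }
    }
    return best_trial;
}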


BC7 RDO rate distortion curves


I've been tuning the fixed Deflate model in bc7enc_rdo. In this test I varied the # of literal bits from 8 to 14. Higher values push the system to prefer matches vs. literals.

The orange line was yesterday's encoder; all other lines are for today's encoder. Today's encoder has several improvements, such as lazy parsing and mode 6 endpoint match trials.


(I know this graph is going to be difficult to read on blogger - Google updated it and now images suck. You used to be able to click on images and get a full-res view.)

More RDO BC7 progress


I've optimized bc7enc_rdo's RDO BC7 encoder a bunch over the past few days. I've also added multithreading via an OpenMP parallel for, which really helps.
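The multithreading is just a parallel for over independent groups of blocks - something along these lines (a sketch, not the actual bc7enc_rdo code; rdo_process_block_range() is a placeholder for the per-group work):

// Sketch of the OpenMP multithreading: split the image's blocks into independent
// contiguous groups and RDO post-process each group on its own thread.
#include <cstdint>

struct bc7_block { uint8_t bytes[16]; };

void rdo_process_block_range(bc7_block* blocks, int first, int count);   // placeholder

void rdo_process_all(bc7_block* blocks, int total_blocks, int blocks_per_group)
{
    const int num_groups = (total_blocks + blocks_per_group - 1) / blocks_per_group;

    #pragma omp parallel for
    for (int g = 0; g < num_groups; g++)
    {
        const int first = g * blocks_per_group;
        const int count = ((first + blocks_per_group) <= total_blocks) ? blocks_per_group
                                                                       : (total_blocks - first);
        rdo_process_block_range(blocks, first, count);
    }
}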

RDO BC7+Deflate (4KB replacement window size)

33.551 RGB dB PSNR, 3.75 bits/texel


One could argue that at these low PSNR's you should just use BC1, but about 10% of the blocks in this RDO BC7 encoding use mode 1 (2 subsets). BC1 will be more blocky even at a similar PSNR.

31.319 dB, 3.25 bits/texel:



Low bitrate RDO BC7 with lzham_devel

RDO BC7+Deflate could also be described as "BC7 encoding with Deflate in-loop".

Using the lzham_codec_devel repo (which is now perfectly stable; I just haven't updated the readme, kind of on purpose), this mode 1+6 RDO BC7 .DDS file compressed to 2.87 bits/texel. LZMA gets 2.74 bits/texel.

Around 10% of the blocks use mode 1, the rest mode 6. I need to add an LZMA/LZHAM model to bc7enc_rdo, which should be fairly easy (add len2 matches, add a rep match model, use a larger dictionary - and then let the optimal parsers in lzham/lzma figure it out).

Commands:

bc7enc -zc32768 -u4 -o xmen_1024.png -z6.0

lzhamtest_x64.exe -x16 -h4 -e -o c xmen_1024.dds 1.lzham

There are some issues with this encoding, but it's great progress.



More RDO BC7 encoding - new algorithm

I sat down and implemented another RDO BC7 algorithm, using what I learned from the previous one. Amazingly it's beating the way more complex one, except perhaps at really high quality levels (really low lambdas). Very surprising! The source is here, and the post-processing function (the entropy reduction transform, in function bc7enc_reduce_entropy()) is here.

The latest bc7enc_rdo repo is here.

I expected it to perform worse, yet it's blowing the more complex one away. The new algorithm is compatible with all the BC7 modes, too. The previous one was mostly hardwired for the main modes (mostly 1/6).


The new algorithm is much stronger:

RDO BC7 new algorithm - lambda 1.0, 4KB window size 
bc7enc -o -u4 -zc4096 J:\dev\test_images\xmen_1024.png -e -E -z1.0

37.15 dB, 3.97 bits/texel (Deflate)



RDO BC7 new algorithm - lambda 3.0, 4KB window size 
bc7enc -o -u4 -zc4096 J:\dev\test_images\xmen_1024.png -e -E -z3.0

32.071 dB, 3.12 bits/texel (Deflate)



The new algorithm degrades way more gracefully:

lambda=4.0
30.812 dB, 2.94 bits/texel


lambda=5.0
29.883 dB, 2.69 bits/texel (Deflate)


lambda=5.0, window size 8KB
29.826 dB, 2.59 bits/texel (Deflate)




bc7enc_rdo repo updated

RDO texture encoding notes

A few things I've learned about RDO texture encoders:

- If you've spent a lot of time working on lowest distortion based texture encoders, your instincts will probably lead you astray once you start working on rate distortion encoders. Distortion can paradoxically increase on a single test even when the rate distortion behavior has improved overall.

- Always plot your results in 2D (rate vs. distortion) - don't focus so much on distortion. 

As a quick check of compressor efficiency, compute and display PSNR/bits_per_texel * scale, or SSIM/bits_per_texel * scale (where scale is something like 10,000 - it's just for readability). The higher this index, the more efficient the compressor. A minimal miniz-based sketch of this check appears at the end of this post.

Compute accurate bits_per_texel by actually compressing your output using a real LZ compressor with correct settings. Use the actual LZ compressor you're shipping the data with.

- Make sure your PSNR, RMSE, MSE, SSIM, etc. calculations are correct and accurate. ALWAYS compare against an independent 3rd party implementation that is known to be correct/trusted. Write your input and output to .PNG/.TGA/.BMP or whatever and use an external 3rd party image comparison tool as a sanity check.

Otherwise you've possibly messed it up and are in the weeds. 

One option is ImageMagick.
Here's how to calculate PSNR, and here's some sample code.

- RDO texture encoding+Deflate is basically all about increasing matches above all else. Even adding a single match to a block can be a huge win in a rate distortion sense.

- It's not necessary to worry about how blocks are packed, which modes are supported, or byte alignment. Just focus on byte matches and literals/match estimates. 

- Avoid copying around bits. That increases the overall block entropy. Always copy full bytes. 

- For more gains you can copy bytes from one offset in a block to another offset. This is way slower to encode but does compress better. I removed this option from bc7enc_rdo because it was so much slower.

- You don't need a huge window to get large gains. Even 64-512 byte windows are fine. 

- You don't need an accurate LZ simulator to make a workable high quality encoder. 

Although, I needed one to figure all this out. 

- Use an already working RDO encoder as a baseline (even a shitty one). Plot its average R-D curve across a range of settings/images. Go from there.

- By default, a high quality texture encoding will consist of mostly literals. Just focus on inserting a single match into each block from one of the previously encoded blocks. Use the Lagrangian multiplier method (j=MSE*smooth_block_scale+bits*lambda) to pick the best one.

- Use Matt Mahoney's "fv" tool to visualize the entropy of your encoded output data:
http://www.mattmahoney.net/dc/fv.cpp

- You can copy a full block (which is like VQ) or partial byte sequences from one block to another. It's possible for a match to partially cross endpoints and selectors. Just decode the block, calculate MSE, estimate bits, and evaluate the Lagrangian formula.

- Plot rate distortion curves (PSNR or SSIM vs. bits/texel) for various lambdas and encoder settings. Focus on increasing the PSNR per bit (move the curve up and left).

- You must do something about smooth/flat blocks. Their MSE's are too low relative to the visual impact they have when they get distorted. One solution is to compute the max std dev. of any component and use a linear function of that to scale block/trial MSE.

- Before developing anything more complex than the technique used in bc7enc_rdo (the byte-wise ERT), get this technique working and tuned first. You'll be surprised how challenging it can be to actually improve it.

- Nobody will trust or listen to you when you claim your encoder is better in some way, even if you show them graphs. There are just too many ways to either mess up or bias a benchmark. You need a trusted 3rd party to independently benchmark and validate your encoder vs. other encoders.

The people at Unity have been filling this role recently. (Which makes sense because they integrate a lot of texture encoders into Unity.)
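Here's the minimal miniz-based sketch of the compressibility check mentioned near the top of this post. The 10,000 scale factor is arbitrary, and miniz's zlib-style mz_compressBound()/mz_compress2() API is assumed since that's the compressor the tool reports:

// Compute Deflate bits/texel and the PSNR-per-bit index described above.
#include <cstdint>
#include <vector>
#include "miniz.h"

struct rd_stats { double bits_per_texel; double psnr_per_bit_index; };

rd_stats compute_rd_stats(const uint8_t* encoded_blocks, size_t encoded_size,
                          uint32_t width, uint32_t height, double psnr_db)
{
    // Deflate-compress the raw encoded block data at max compression.
    mz_ulong comp_size = mz_compressBound((mz_ulong)encoded_size);
    std::vector<uint8_t> comp(comp_size);
    mz_compress2(comp.data(), &comp_size, encoded_blocks, (mz_ulong)encoded_size, MZ_UBER_COMPRESSION);

    rd_stats s;
    s.bits_per_texel = (double)comp_size * 8.0 / ((double)width * height);
    s.psnr_per_bit_index = psnr_db / s.bits_per_texel * 10000.0;   // scale is just for readability
    return s;
}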

Graphing length of introduced matches in the BC7 ERT

I'm starting to graph what's going on with this awesome little lossy BC7 block data transform (in bc7enc_rdo). Let's look at some match length histograms:


The window size was only 128 bytes (8 BC7 blocks). 3 bytes is the minimum Deflate match length. 16 byte matches replicate entire BC7 blocks. Not sure why there's a noticeable peak at 10 bytes.

Entire block replacements are super valuable at these lambdas. The ERT in bc7enc_rdo weights matches of any sort way more than literals. If some nearby previous block is good enough it makes perfect sense to use it.

One thing I think could easily be added to the transform: if there's a match at the end of the previous block, try to continue/extend it by weighting the bytes following the copied bytes in the window a little more cheaply (to coax the transform towards extending the match).


bc7enc_rdo encoding examples


Compress kodim08.png to kodim08.dds (with no mips) using two BC7 modes (1+6):

Highest Quality Mode (uses Modes 1+6)

This mode is like ispc_texcomp or bc7e's BC7 compressor. bc7enc_rdo currently only uses modes 1/6 on opaque blocks, and modes 5/6/7 on alpha blocks. 

bc7enc.exe -o -u4 kodim08.png

...
BC7 mode histogram:
0: 0
1: 8703
2: 0
3: 0
4: 0
5: 0
6: 15873
7: 0

Pre-RDO output data size: 393216, LZ (Deflate) compressed file size: 378097, 7.69 bits/texel
Processing time: 0.113000 secs
...
Output data size: 393216, LZ (Deflate) compressed file size: 378097, 7.69 bits/texel
Wrote DDS file kodim08.dds
Luma Max error: 13 RMSE: 1.279412 PSNR 45.991 dB, PSNR per bits/texel: 59787.033452
RGB Max error: 37 RMSE: 2.065000 PSNR 41.832 dB, PSNR per bits/texel: 54381.448887
RGBA Max error: 37 RMSE: 1.805041 PSNR 43.001 dB, PSNR per bits/texel: 55900.685976

So the output .DDS file compressed to 7.69 bits/texel using miniz (stock non-optimal parsing Deflate, so a few percent worse vs. zopfli or 7za's Deflate). The RGB PSNR was 41.8 and the RGBA PSNR was 43 dB. It used mode 1 around half as much as mode 6.

Notice the pre-RDO compressed size is equal to the output's compressed size (7.69 bits/texel). There was no RDO, or anything in particular done to reduce the encoded output data's entropy. The output is mostly Huffman compressed because Deflate can't find many 3+ byte matches, so the output is quite close to 8 bits/texel. It's basically noise to Deflate or most other LZ's.

Reduced Entropy Mode (-e option)


This mode is as fast as before. It only causes the encoder to weight modes, p-bits, etc. differently so the output data is naturally more compressible by entropy/LZ coders:

bc7enc -o -u4 -zc2048 kodim08.png -e

BC7 mode histogram:
0: 0
1: 3385
2: 0
3: 0
4: 0
5: 0
6: 21191
7: 0
Pre-RDO output data size: 393216, LZ (Deflate) compressed file size: 352693, 7.18 bits/texel
Processing time: 0.116000 secs
Output data size: 393216, LZ (Deflate) compressed file size: 352693, 7.18 bits/texel
Wrote DDS file kodim08.dds
Luma  Max error:  18 RMSE: 1.368621 PSNR 45.405 dB, PSNR per bits/texel: 63277.507753
RGB   Max error:  48 RMSE: 2.456375 PSNR 40.325 dB, PSNR per bits/texel: 56197.596592
RGBA  Max error:  48 RMSE: 2.152539 PSNR 41.472 dB, PSNR per bits/texel: 57795.900335

The RGB PSNR dropped by 1.5 dB (from 41.8 dB to 40.3 dB - so less signal and more distortion), however the compressibility went up. The output is now 7.18 bits/texel instead of the previous 7.69! Notice also that the "PSNR per bits/texel" value (the compressibility index I use to monitor the encoder's effectiveness) for RGB is now 56197 vs. the previous 54381.


Rate Distortion Optimization with the Entropy Reduction Transform (-e -z#)


Now let's enable all the tools the encoder has to reduce the encoded output data's entropy. This mode is slower, but it's trivially threadable, and you can scale down the amount of total compute by reducing the window size using "-zc#":

bc7enc -o -u4 -zc2048 kodim08.png -e -z.5

BC7 mode histogram:
0: 0
1: 4028
2: 0
3: 0
4: 0
5: 0
6: 20548
7: 0
Pre-RDO output data size: 393216, LZ (Deflate) compressed file size: 354192, 7.21 bits/texel
rdo_total_threads: 40
Using an automatically computed smooth block error scale of 19.375000
lambda: 0.500000
Lookback window size: 2048
Max allowed RMS increase ratio: 10.000000
Max smooth block std dev: 18.000000
Smooth block max MSE scale: 19.375000
Total modified blocks: 21589 87.85%
Total RDO postprocess time: 2.765000 secs
Processing time: 2.846000 secs
Output data size: 393216, LZ (Deflate) compressed file size: 316364, 6.44 bits/texel
Wrote DDS file kodim08.dds
Luma  Max error:  41 RMSE: 2.749131 PSNR 39.347 dB, PSNR per bits/texel: 61131.435585
RGB   Max error:  48 RMSE: 3.286210 PSNR 37.797 dB, PSNR per bits/texel: 58723.280910
RGBA  Max error:  48 RMSE: 2.861897 PSNR 38.998 dB, PSNR per bits/texel: 60588.948928

First, I set the window size the compressor uses to insert byte sequences from previously encoded blocks into each output block to 2KB to increase compression, using "-zc2048". The default is only 256 bytes, which is way faster (.42 seconds vs. 2.92 on my system).

Notice the RGB PSNR has dropped to 37.8 dB, however the compressed file is now only 6.44 bits/texel. The compressibility index (PSNR per bits/texel) is 58723. This is significantly higher than the previous two encodes, so the encoder has been able to squeeze more signal into the output bits (once they are LZ compressed).

The -z option directly sets lambda, which controls the rate distortion tradeoff. The higher this value, the more likely the encoder is to substitute a block with a previous block's bytes (either entirely or partially), which increases distortion but reduces entropy.

RDO compression using MSE as the internal error metric is difficult on smooth or flat regions. The RDO encoder tries to automatically scale up the computed MSE's of smooth blocks (using a simple linear function of each block's color channel maximum standard deviation), but the settings are conservative. You'll notice a message like this printed when you use -z:

Using an automatically computed smooth block error scale of 19.375000

By default the command line tool tries to compute a max smooth block factor based on the supplied lambda setting. There is no single calculation/set of settings that works perfectly on all input textures, but the formula in the code works OK for most textures at low-ish lambdas. (For an example of a difficult texture the current formulas/settings don't handle so well, try encoding kodim03 at lambdas 1-3.) I tried to tune smooth block handling so that at lambdas at or near 1 it looks OK on textures with smooth gradients, skies, etc.

You can use the -zb# option to manually set a max smooth block scale factor to a higher value. -zb30-100 works well. You'll need to experiment. -zb1.0 disables all smooth block handling, so only MSE is plugged into the lambda calculation.

Entropy Reduction Transform on BC1 texture data


Just got it working for BC1. Took about 15 minutes of copying & pasting the BC7 ERT, then modifying it to decode BC1 instead of BC7 blocks and have it ignore the decoded alpha. The ERT function is like 250 lines of code, and for BC1 it would be easily vectorizable (way easier than BC7 because decoding BC1 is easy).

This implementation differs from the BC7 ERT in one simple way: The bytes copied from previously encoded blocks are allowed to be moved around within the current block. This is slower to encode, but gives the encoder more freedom. I'm going to ship both options (move vs. nomove).
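A sketch of that extra freedom (placeholder helpers again, not the shipping ert.cpp): the byte run taken from a previous block can be pasted at any destination offset within the current 8-byte BC1 block, and each placement is scored with the usual Lagrangian.

// Sketch of the "move" variant for BC1. decode_bc1(), compute_mse() and estimate_lz_bits()
// are placeholders for the encoder's existing routines; caller ensures src_ofs + len <= 8.
#include <cfloat>
#include <cstdint>
#include <cstring>

struct bc1_block { uint8_t bytes[8]; };

void  decode_bc1(const bc1_block& blk, uint8_t out_rgba[16 * 4]);        // placeholder
float compute_mse(const uint8_t a[16 * 4], const uint8_t b[16 * 4]);     // placeholder
float estimate_lz_bits(const bc1_block& blk);                            // placeholder

// Try pasting 'len' bytes from a previous block at every destination offset in the
// current block ("move" variant). Returns the best Lagrangian cost found and the
// corresponding trial block in best_blk.
float try_moved_byte_run(const bc1_block& cur, const bc1_block& prev,
                         uint32_t src_ofs, uint32_t len,
                         const uint8_t orig_pixels[16 * 4],
                         float smooth_scale, float lambda, bc1_block& best_blk)
{
    float best_t = FLT_MAX;
    for (uint32_t dst_ofs = 0; dst_ofs + len <= 8; dst_ofs++)   // any destination offset
    {
        bc1_block trial = cur;
        std::memcpy(&trial.bytes[dst_ofs], &prev.bytes[src_ofs], len);

        uint8_t unpacked[16 * 4];
        decode_bc1(trial, unpacked);
        float t = compute_mse(orig_pixels, unpacked) * smooth_scale + estimate_lz_bits(trial) * lambda;
        if (t < best_t) { best_t = t; best_blk = trial; }
    }
    return best_t;
}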

Here's a 2.02 bits/texel (Deflate) encode (lambda=1.0), 34.426 RGB dB. Normal BC1 (rgbcx.cpp level 18) is 3.00 bits/texel, 35.742 dB. Normal BC1 level 2 (which should be very close to stb_dxt) gets 3.01 bits/texel and 35.086 dB, so if you're willing to lose a little quality you can get large savings.

I'll have this checked in tomorrow after more benchmarking and smooth block tuning.

I've been thinking about a simple/elegant universal rate distortion optimizing transform for GPU texture data for the past year, since working on UASTC and BC1 RDO. It's nice to see this working so well on two different GPU texture formats. ETC1-2, PVRTC1, LDR/HDR ASTC, and BC6H are coming.



1.77 bits/texel, 32.891 dB (-L18 -b -z4.0 -zb17.0 -zc2048):



1.59 bits/texel, 30.652 dB (-zc2048 -L18 -b -z8.0 -zb30.0):



bc7enc_rdo now supports RDO for all BC1-7 texture formats


It now fully supports RDO BC1-7:

 https://github.com/richgel999/bc7enc_rdo

I've also been cleaning up the tool and tuning all of the defaults. Note that if you build with MSVC you get OpenMP, which results in significantly faster compression. Currently the Linux/OSX builds don't get OpenMP.

I decided to unify all the RDO BC1-7 encoders so they use a single universal entropy reduction transform function in ert.cpp/.h. I had specialized RDO encoders for arrays of BC1 and BC4 blocks (which I checked into the repo previously), which may perform better, but they were a lot more code to maintain. I removed them.

Weighted/biased BC7 encoding for reduced output data entropy (with no slowdowns)


Previous BC7 encoders optimize for maximum quality and entirely ignore (or rather externalize) the compressibility of the data they output. Their encoded output is usually incompressible noise to LZ coders.

It's easy to modify existing encoders to favor specific BC7 modes, p-bits, or partition patterns. You can also set some modes to always use specific p-bits, or disable the index flag/component rotation features, and/or quantize mode 6's endpoints more coarsely during encoding. 

These changes result in less entropy in the output data, which indirectly increases LZ matches and boosts the effectiveness of entropy coding. More details here. You can't expect much from this method (I've seen 5-10% reductions in compressed output using Deflate), but it's basically "free", meaning it doesn't slow down encoding at all. It may even speed it up.

Quick test using the bc7enc_rdo tool:

Mode 1+6: 45.295 dB, 7.41 bits/texel (Deflate), .109 secs

Command: "bc7enc kodim23.png"

BC7 mode histogram:
1: 8736
6: 15840



Mode 1+6 reduced entropy mode: 43.479 dB RGB PSNR, 6.77 bits/texel (Deflate), .107 secs

Command: "bc7enc kodim23.png -e"

BC7 mode histogram:
1: 1970
6: 22606




Difference image (biased by 128,128,128) and grayscale histogram:




Dirac video codec authors on Rate-Distortion Optimization

"This description makes RDO sound like a science: in fact it isn't and the reader will be pleased to learn that there is plenty of scope for engineering ad-hoc-ery of all kinds. This is because there are some practical problems in applying the procedure:"

http://dirac.sourceforge.net/documentation/algorithm/algorithm/rdo.htm

"Perceptual fudge factors are therefore necessary in RDO in all types of coders.""There may be no common measure of distortion. For example: quantising a high-frequency subband is less visually objectionable than quantising a low-frequency subband, in general. So there is no direct comparison with the significance of the distortion produced in one subband with that produced in another. This can be overcome by perceptual weighting.."

In other words: yea it's a bit of a hack.

This is what I've found with RDO texture encoding. If you use the commonly talked about formula (j=D+l*R, optimize for min j) it's totally unusable (you'll get horrible distortion on flat/smooth blocks, which is like 80% of the blocks on some images/textures). Stock MSE doesn't work. You need something else. 

Adding a linear scale to MSE kinda works (that's what bc7enc_rdo does) but you need ridiculous scales for some textures, which ruins R-D performance.

So if you take two RDO texture encoders, benchmark them, and look at just their PSNR's, you are possibly fooling yourself and others. One encoder with higher PSNR (and better R-D performance) may visually look worse than the other. It's part art, not all science.

With bc7enc_rdo, I wanted to open source *something* usable for most textures with out of the box settings, even though I knew that its smooth block handling needed work. Textures with skies like kodim03 are challenging to compress without manually increasing the smooth block factor. kodim23 is less challenging because its background has some noise.

Releasing something open source with decent performance that works OK on most textures is more important than perfection. 
