Channel: Richard Geldreich's Blog

Basis universal GPU texture format examples

The .basis format is a lossy texture compression format, roughly comparable to JPEG in size, but designed specifically for GPU texture data. The format's main feature is that it can be efficiently transcoded to any other GPU texture format. We've written transcoders for BC1-5, ETC1, PVRTC, and BC7 so far; ASTC and ETC2 are coming. Transcoding complexity is similar to crunch's (my older open source texture compression tech for BC1-5). This is the first system to support a universal GPU texture format that is usable on the Web.

Excluding PVRTC, there are no complex pixel-level operations needed during transcoding. The transcoder's inner loop works at the block level (or 2x2 macroblock level) and involves simple operations (Huffman decoding, table lookups, endpoint/selector translation using small precomputed lookup tables). The PVRTC transcoder requires two passes and a temporary buffer to hold block endpoints. The PVRTC transcoder in this system is faster and simpler than any real-time PVRTC encoder I'm aware of.

I resized the kodim images to 512x512 (using a gamma-correct windowed sinc filter) so they can be transcoded to PVRTC, which only supports power-of-2 texture dimensions. Resizing these textures to 512x512 actually makes them more difficult to compress, because the details are spatially denser and artifacts stand out more.

The current single-threaded transcode times (for kodim01) on my 3.3GHz Xeon were:

ETC1 transcode time: 1.199494 ms
DXT1 transcode time: 2.198336 ms
BC7 transcode time: 2.801654 ms
DXT5A transcode time: 2.361919 ms
PVRTC1_4 transcode time: 2.756762 ms

These timings will get better as I optimize the transcoders. crunch's transcoding speed is roughly similar to .basis ETC1's. The transcoder is usable on the Web by cross-compiling it to JavaScript or WebAssembly.

For transparency support we internally store two texture slices, one for RGB and another for alpha. For ETC1 with alpha the user can either transcode to a texture twice as wide or high or use two separate textures. We support BC3-5 directly. For BC7, we currently only support opaque mode 6, but mode 4 or 5 is coming. For PVRTC we only support opaque 4bpp, but we know how to add alpha. PVRTC 2bpp opaque/alpha is also doable. ETC2 alpha block support will be easy.

These images are at quality level 255, around 1.5-2 bits/texel. The biggest quality constraint right now is the ETC1S format that these "baseline" .basis files use internally. Our plan is to add some optional extra texture data to the files to upgrade the quality for BC7/ASTC.

Some notable properties of this system:
  • This system is intended for Web use. It's a delicate balance between transcode times, quality, GPU format support, and encoder complexity. The transcode step must be much faster than just using JPEG (or WebP, etc.) followed by a real-time GPU texture encoder, or it's not valuable. This system is basically an existence proof that it's possible to build a universal GPU texture compression system. Long term, much higher quality solutions are possible.
  • This format trades off quality to gain access to all GPUs and APIs without having to store/distribute multiple files or encode multiple times. Quality isn't that great, but it's very usable on photos, satellite photography, and rasterized map images. For some games it may not be acceptable, which is fine. The largest users of crunch aren't games at all.
  • The internal baseline format uses a subset of ETC1 (ETC1S), so transcoding to ETC1 is fastest. ETC1 is the highest quality format, followed by BC7, BC1, then PVRTC. The difference between ETC1 and BC1 is .1-.4 dB Y PSNR (slightly better for BC7).
  • This system isn't full ETC1 quality because it disables 2x4/4x2 subblocks. We lose a few dB vs. optimal ETC1 due to this limitation, but we gain the ability to easily transcode to any other 4x4 block-based format. In a rate-distortion sense (using PSNR or SSIM), full ETC1 support rarely makes sense in our testing anyway (i.e. there are almost always better uses of those bits vs. supporting flips or 4:4:4 individual-mode colors).
  • Using ETC1S as an internal format allowed us to reuse our existing ETC1 encoder. It also allows others to easily build their own universal format encoders by re-purposing their existing ETC1 solutions. There's a large amount of value in a system that will work on any GPU or API, and we can improve the quality over time by extending it.
  • The PVRTC endpoint interpolation actually smooths out the ETC1 artifacts in a "nice" looking way. The PVRTC artifacts are definitely worse than any other format. The .basis->PVRTC transcoder favors speed and reliable behavior, not PSNR/SSIM. There are noticeable banding and low-pass artifacts. (Honestly, PVRTC is an unforgiving format and I'm surprised it looks as good as it does!) It should be possible to add dithering or smarter endpoint selection, but that would substantially slow transcoding down.
  • We have a customer using .basis ETC1 with normal maps on mobile, so I know they are usable. I doubt PVRTC would work well with normal maps, but the other formats should be usable.
  • Grayscale conversion is easy: we just convert the G channel. For DXT5/BC3 or BC7 we call the transcoder twice (once for RGB then again for alpha).
  • Newer iOS devices support ETC1 in hardware, but WebGL on iOS doesn't expose this format on these devices so we must support PVRTC. We weren't originally going to support PVRTC, but we had to due to this issue.
  • The PVRTC transcoder and decoder assume wrap addressing is enabled, for compatibility with PVRTexTool. This can be disabled (and you can use clamp addressing when fetching from the shader). This can sometimes cause issues on the very edges of the texture (see the bottom of the xmen image, or the bottom of the hramp at the end).
  • Look at these images on your phone or tablet before making any quality judgments. On an iPhone even really low .basis quality levels can look surprisingly good. Vector quantization + GPU block compression artifacts are just different from JPEG's artifacts.
[Image comparisons for each test image: Original, BC7, DXT1, DXT5A, ETC1, and PVRTC transcodes.]


xmen_1024 encoded to .basis at various bitrates

Here's xmen_1024 compressed at various bitrates to .basis. I show the output of two transcodes: ETC1 (which is the highest quality format in baseline .basis) and PVRTC (with clamp addressing).

Original:

Optimal ETC1 is 38.896 Y PSNR

Q 16: .717 bits/texel, ETC1: 28.473 Y PSNR

ETC1:
PVRTC:

Q 64: .905 bits/texel, ETC1: 30.361 Y PSNR

ETC1:

PVRTC:

Q 128: 1.064 bits/texel, ETC1: 32.026 Y PSNR

ETC1:

PVRTC:

Q 192: 1.208 bits/texel, ETC1: 30.379 Y PSNR

ETC1:

 PVRTC:

Q 255: 1.362 bits/texel, ETC1: 34.630 Y PSNR

ETC1:
 PVRTC:

Better PVRTC encoding

I'm taking a quick break from RDO BC7. I've been working on it for too long and I need to mix things up.

I've been experimenting with high-quality PVRTC encoding for several years, off and on. I've finally found an algorithm that is simple and fast, that in most cases beats PVRTexTool's approach. (PVRTexTool is the "standard" high-quality production encoder for PVRTC. To my knowledge it's the best available.) In the cases I can find where PVRTexTool does better, the quality delta is low (<.5 dB).

I know PVRTC is doomed long term (ASTC is far better), but it's still pervasive on iOS devices.

Useful references:
http://roartindon.blogspot.com/2014/08/pvr-texture-compression-exploration.html
http://jcgt.org/published/0003/04/07/paper-lowres.pdf

It's a three phase algorithm:
1. Compute endpoints using van Waveren's approximation: For each 4x4 block compute the RGB(A) bounds of that block. Set the low endpoint to the floor() of the bounds (with correct rounding to 554), and set the high endpoint to the ceil() of the bounds (again with correct rounding to 555).

An alternative is to use Intensity Dilation (see the link to the paper), which may lead to better results. But this is far simpler and it's what Lim successfully uses in his real-time encoder.
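As a concrete sketch of step 1 (names like `quant_floor`, `quant_ceil`, and `compute_block_endpoints` are mine, purely for illustration): compute each 4x4 block's RGB bounds, floor the low bound when quantizing to the 554 low endpoint, and ceil the high bound when quantizing to the 555 high endpoint:

```cpp
#include <algorithm>
#include <cstdint>

// Quantize an 8-bit component to 'bits' bits, rounding down (for the low endpoint).
static inline uint32_t quant_floor(uint32_t v8, uint32_t bits)
{
    return (v8 * ((1u << bits) - 1)) / 255;       // floor
}

// Quantize an 8-bit component to 'bits' bits, rounding up (for the high endpoint).
static inline uint32_t quant_ceil(uint32_t v8, uint32_t bits)
{
    return (v8 * ((1u << bits) - 1) + 254) / 255; // ceil
}

struct endpoint_pair { uint32_t lo[3]; uint32_t hi[3]; };

// pixels: 16 RGB triples for one 4x4 block.
static endpoint_pair compute_block_endpoints(const uint8_t pixels[16][3])
{
    uint8_t mn[3] = { 255, 255, 255 }, mx[3] = { 0, 0, 0 };
    for (int i = 0; i < 16; i++)
        for (int c = 0; c < 3; c++)
        {
            mn[c] = std::min(mn[c], pixels[i][c]);
            mx[c] = std::max(mx[c], pixels[i][c]);
        }

    endpoint_pair e;
    for (int c = 0; c < 3; c++)
    {
        const uint32_t bits_lo = (c == 2) ? 4 : 5; // low endpoint is RGB 554
        e.lo[c] = quant_floor(mn[c], bits_lo);
        e.hi[c] = quant_ceil(mx[c], 5);            // high endpoint is RGB 555
    }
    return e;
}
```

Flooring the low endpoint and ceiling the high one guarantees the decoded endpoints still bracket the block's colors after quantization.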

2. Now go and select the optimal modulation values for each pixel using these endpoints (factoring in the PVRTC endpoint interpolation, of course).

The results at this point are usually a little better than PVRTexTool in "Lower" quality, at least visually. The results so far should be equivalent or slightly better than Lim's encoder (depending on how much you approximate the modulation value search).

Interestingly, the results up to this point are acceptable for some use cases already. The output is too banded and high contrast areas will be smeared out, but the distortion introduced up to this point is predictable and stable.

3. For each block in raster order: Now use 1x1 block least squares optimization (using normal equations) separately on each component to solve for the best low/high endpoints to use for each block. A single block impacts 7x7 pixels (or 3x3 blocks) in PVRTC 4bpp mode.

The surrounding endpoints, modulation values, and output pixels are constants, and the only unknowns are the endpoints, so this is fairly straightforward. This is just like how it's done in BC1 and BC7 encoders, except we're dealing with larger matrices (7x7 instead of 4x4) and we need to carefully factor in the endpoint interpolation.

For solving, the equation is Ax=b, where A is a 49x2 matrix (7x7 pixels = 49 rows), x is a 2x1 matrix (the low and high endpoint values we're solving for), and b is a 49x1 matrix containing the desired output values (which are the desired RGB output pixel values minus the interpolated and weighted contribution from the surrounding constant endpoints). The A matrix contains the per-pixel modulation weights multiplied by the amount the endpoint influences the result (factoring in endpoint interpolation).

After you've done 1x1 least squares on each component, the results are rounded to 554/555. Then you find the optimal modulation values for the affected 7x7 block of pixels, and only accept the results if the overall error has been reduced.

You can "twiddle" the modulation values in various ways before doing the least squares calculations, just like BC1/BC7 encoders do. I've tried incrementing the lowest modulation value and/or decrementing the higher modulation value, and seeing if the results are any better. This works well.

Step 3 can be repeated multiple times to improve quality more. 3-5 refinement iterations seems to be enough. You can vary the block processing order for slightly higher quality.

There are definitely many other improvements, but this is the basic idea. Each step is simple, and all steps are vectorizable and threadable.

PVRTexTool uses 2x2 SVD, as far as I know, but this seems unnecessary, and seems to lead to noticeable stipple-like artifacts being introduced in many cases. (Check out the car door below.) Also, PVRTexTool's handling of gradients seems questionable (perhaps endpoint rounding issues?).

Quick example encodings:

Original:


PVRTexTool 4.19.0, "Very High Quality":
RGB Average Error: PSNR: 36.245, SSIM: 0.979442
Luma Error: PSNR: 36.841, SSIM: 0.984710


New encoder (using perceptual colorspace metrics, so it's trying to optimize for lower luma error):
RGB Average Error: PSNR: 36.728, SSIM: 0.976032
Luma Error: PSNR: 37.827, SSIM: 0.990251


Original:


PVRTexTool Very High:
RGB Average Error: PSNR: 41.809, SSIM: 0.993144
Luma Error: PSNR: 41.943, SSIM: 0.993875


New encoder (perceptual mode):
RGB Average Error: PSNR: 41.730, SSIM: 0.991800
Luma Error: PSNR: 43.419, SSIM: 0.997416


Original:

PVRTexTool high quality:
RGB Average Error: PSNR: 27.640, SSIM: 0.954125
Luma Error: PSNR: 30.292, SSIM: 0.964433


New encoder (RGB metrics):
RGB Average Error: PSNR: 29.523, SSIM: 0.957067
Luma Error: PSNR: 32.702, SSIM: 0.974145


How to improve crunch's codebook generators

While writing Basis ETC I sat down and started to study the codebook generation process I used on crunch. crunch would create candidate representational vectors (for endpoints or selectors), clusterize these candidates (using top-down clusterization), assign blocks to the closest codebook entry, and then go and compute the best DXT1 endpoint or selectors to use for each cluster. That's basically it. Figuring out how to do this well on DXT1 took a lot of experimentation, so I didn't have the energy to go and improve it.

Here are the visualizations:



After studying the clusterizations visualized as massive PNG files I saw a lot of nonsensical things. The algorithm worked, but sometimes clusters would be surprisingly large (in 6D for endpoints or 16D space for selectors), leading to unrelated blocks being lumped into the same cluster.

To fix this, I started using Lloyd's algorithm at a higher level, so the codebook could be refined over several iterations:

1. Create candidate codebook (like crunch)
2. Reassign each input block to the best codebook entry (by trying them all and computing the error of each), creating a new clusterization.
3. Compute new codebook entries (by optimizing the endpoints or selecting the best selectors to use for each cluster factoring in the block endpoints).
4. Repeat steps 2-3 X times. Each iteration will lower the overall error.

You also need to insert steps to identify redundant codebook entries and delete them. If the codebook becomes too small, you can find the cluster with the worst error and split it into two or more clusters.

Also, whenever you decide to use a different endpoint or selector to code a block, you've changed the clusterization used and you should recompute the codebook (factoring in the actual clusterization). Optimizations like selector RDO change the final clusterization.

ETC1S texture format encoding and how it's transcoded to BC1

I developed the ETC1S encoding method back in late 2016, and we talked about it publicly in our CppCon '16 presentation. It's good to see that this encoding is working well in crunch too (better bitrate for near-equal error). There are kodim statistics in Alexander's checkin notes:

https://github.com/Unity-Technologies/crunch/commit/660322d3a611782202202ac00109fbd1a10d7602

I described the format details and asked Alexander to support ETC1S so we could add universal support to crunch.

Anyhow, ETC1S is great because it enables simplified transcoding to BC1 using a couple of small lookup tables (one for the 5-bit DXT1 components, and the other for 6). You can precompute the best DXT1 component low/high endpoints to use for each possibility of used ETC1S selectors (or low/high selector "ranges") and ways of remapping the ETC1S selectors to DXT1 selectors. The method I came up with supports a strong subset of these possible mappings (6 low/high selector ranges and 10 selector remappings).

So the basic idea of this transcoder design is that we figure out the near-optimal DXT1 low/high endpoints to use for an ETC1S block, then just translate the ETC1S selectors through a remapping table. We don't need to do any expensive R,G,B vector calculations here, just simple math on endpoint components and selectors. To find the best endpoints, we need the ETC1S base color (5,5,5), the intensity table index (3 bits), and the used selector range (because ETC1/ETC1S heavily depends on endpoint extrapolation to reduce overall error; for example, sometimes the encoder will only use a single selector in the "middle" of the intensity range).

First, here are the most used selector ranges used by the transcoder:
{ 0, 3 },
{ 1, 3 },
{ 0, 2 },
{ 1, 2 },
{ 2, 3 },
{ 0, 1 },

And here are the selector remapping tables:
{ 0, 0, 1, 1 },
{ 0, 0, 1, 2 },
{ 0, 0, 1, 3 },
{ 0, 0, 2, 3 },
{ 0, 1, 1, 1 },
{ 0, 1, 2, 2 },
{ 0, 1, 2, 3 },
{ 0, 2, 3, 3 },
{ 1, 2, 2, 2 },
{ 1, 2, 3, 3 },

So what does this stuff mean? In the first table, the first entry is { 0, 3 }. This index is used for blocks that use all 4 selectors. The 2nd one is for blocks that only use selectors 1-3, etc. We could support all possible ways that the 4 selectors could be used, but you reach a point of diminishing returns.

The second table is used to translate ETC1S selectors to DXT1 selectors. Again, we could support all possible ways of remapping selectors, but only a few are needed in practice.

So to translate an ETC1S block to BC1/DXT1:

- Scan the ETC1S selectors (which range from 0-3) to identify their low/high range, and map this to the best entry in the first table. This is the selector range table index, from 0-5.
(For crunch/basis this is precomputed for each selector codebook entry, so we don't need to do it for each block.)

- Now we have a selector range (0-5), three ETC1S base color components (5 bits each) and an ETC1S intensity table index (3 bits). We have a set of 10 precomputed tables (for each supported way of remapping the selectors from ETC1S->DXT1) for each selector_range/basecolor/inten_table possibility (6*32*8*10=15360 total tables).

- Each table entry holds DXT1 low/high endpoint values (either 5 or 6 bits) and an error value. But each entry covers only a single component, so we scan the 10 entries (one per supported way of remapping the selectors from ETC1S->DXT1) across all three components, sum the total R+G+B error for each remapping, and pick the remapping that minimizes the overall error. (We can only pick one remapping, because there's a single selector per pixel, and the best remapping for R may not be the best for G or B.)

In code:

// Get the best selector range table entry to use for the ETC1S block:
const uint selector_range_table = g_etc1_to_dxt1_selector_range_index[low_selector][high_selector];

// Now get pointers to the precomputed tables for each component:
//[32][8][RANGES][MAPPING]
const etc1_to_dxt1_56_solution *pTable_r = &g_etc1_to_dxt_5[(inten_table * 32 + base_color.r) * (NUM_ETC1_TO_DXT1_SELECTOR_RANGES * NUM_ETC1_TO_DXT1_SELECTOR_MAPPINGS) + selector_range_table * NUM_ETC1_TO_DXT1_SELECTOR_MAPPINGS];
const etc1_to_dxt1_56_solution *pTable_g = &g_etc1_to_dxt_6[(inten_table * 32 + base_color.g) * (NUM_ETC1_TO_DXT1_SELECTOR_RANGES * NUM_ETC1_TO_DXT1_SELECTOR_MAPPINGS) + selector_range_table * NUM_ETC1_TO_DXT1_SELECTOR_MAPPINGS];
const etc1_to_dxt1_56_solution *pTable_b = &g_etc1_to_dxt_5[(inten_table * 32 + base_color.b) * (NUM_ETC1_TO_DXT1_SELECTOR_RANGES * NUM_ETC1_TO_DXT1_SELECTOR_MAPPINGS) + selector_range_table * NUM_ETC1_TO_DXT1_SELECTOR_MAPPINGS];

// Scan to find the best remapping table (from 10) to use:
uint best_err = UINT_MAX;
uint best_mapping = 0;

CRND_ASSERT(NUM_ETC1_TO_DXT1_SELECTOR_MAPPINGS == 10);
#define DO_ITER(m) { uint total_err = pTable_r[m].m_err + pTable_g[m].m_err + pTable_b[m].m_err; if (total_err < best_err) { best_err = total_err; best_mapping = m; } }
DO_ITER(0); DO_ITER(1); DO_ITER(2); DO_ITER(3); DO_ITER(4);
DO_ITER(5); DO_ITER(6); DO_ITER(7); DO_ITER(8); DO_ITER(9);
#undef DO_ITER

// Now create the DXT1 endpoints
uint l = dxt1_block::pack_unscaled_color(pTable_r[best_mapping].m_lo, pTable_g[best_mapping].m_lo, pTable_b[best_mapping].m_lo);
uint h = dxt1_block::pack_unscaled_color(pTable_r[best_mapping].m_hi, pTable_g[best_mapping].m_hi, pTable_b[best_mapping].m_hi);

// pSelector_xlat is used to translate the ETC1S selectors to DXT1 selectors
const uint8 *pSelectors_xlat = &g_etc1_to_dxt1_selector_mappings1[best_mapping][0];

if (l < h)
{
std::swap(l, h);
pSelectors_xlat = &g_etc1_to_dxt1_selector_mappings2[best_mapping][0];
}

pDst_block->set_low_color(static_cast<uint16>(l));
pDst_block->set_high_color(static_cast<uint16>(h));

// Now use pSelectors_xlat[] to translate the selectors and we're done

If the block only uses a single selector, it's a fixed color block and you can use a separate set of precomputed tables (like stb_dxt uses) to convert it to the optimal DXT1 color.
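For reference, such a single-color table can be built like this (a sketch assuming ideal (2a+b)/3 interpolation for DXT1 selector 2; a production table like stb_dxt's also models the hardware's exact rounding, and the table names here are mine):

```cpp
#include <cstdint>
#include <cstdlib>

// For each 8-bit value, the 5-bit low/high pair whose 1/3 interpolant lands closest.
static uint8_t g_match5_lo[256], g_match5_hi[256];

// Expand a 5-bit component to 8 bits (replicate the top bits).
static inline int expand5(int v) { return (v << 3) | (v >> 2); }

static void init_single_color_table5()
{
    for (int c = 0; c < 256; c++)
    {
        int best_err = 256, best_lo = 0, best_hi = 0;
        for (int lo = 0; lo < 32; lo++)
            for (int hi = 0; hi < 32; hi++)
            {
                // Interpolated color for selector 2: (2*lo + hi) / 3
                const int v = (2 * expand5(lo) + expand5(hi)) / 3;
                const int err = abs(v - c);
                if (err < best_err) { best_err = err; best_lo = lo; best_hi = hi; }
            }
        g_match5_lo[c] = (uint8_t)best_lo;
        g_match5_hi[c] = (uint8_t)best_hi;
    }
}
```

The 6-bit (green) table is built the same way with a 6-bit expand. At transcode time a fixed-color block then becomes three table lookups and a selector fill.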

So that's it. It's a fast and simple process to convert ETC1S->DXT1. The results look very good, and are within a fraction of a dB between ETC1S and BC1. You can also use this process to convert ETC1S->BC7, etc.

Once you understand this process, almost everything else falls into place for the universal format. ETC1S->BC1 and ETC1S->PVRTC are the key transcoders, and all other formats use these basic ideas.

There are surely other "base" formats we could choose. I chose ETC1S because I already had a strong encoder for this format and because it's transcodable to BC1.

You can see the actual code here, in function convert_etc1_to_dxt1().

It's possible to add BC7-style pbits to ETC1S (1 or 3) to improve quality. Transcoders can decide to use these pbits, or not.

Lookup table based real-time PVRTC encoding

I've found a table-based method of improving the output from a real-time PVRTC encoder. Fast real-time encoders first find the RGB(A) bounds of each 4x4 block to determine the block endpoints, then they evaluate the interpolated endpoints at each pixel to determine the modulation values which minimize the encoded error. This works okay, but the results are barely acceptable in practice due to banding artifacts on smooth features.

One way to improve the output of this process is to precompute, for all [0,255] 8-bit component values, the best PVRTC low/high endpoints to use to encode that value assuming the modulation values in the 7x7 pixel region are either all-1 or 2 (or all 0, 1, 2, or 3):

// Tables containing the 5-bit/5-bit L/H endpoints to use for each 8-bit value      
static uint g_pvrtc_opt55_e1[256];
static uint g_pvrtc_opt55_e2[256];

// Tables containing the 5-bit/4-bit L/H endpoints to use for each 8-bit value     
static uint g_pvrtc_opt54_e1[256];
static uint g_pvrtc_opt54_e2[256];

const int T = 120;

for (uint c = 0; c < 256; c++)
{
    uint best_err1 = UINT_MAX;
    uint best_l1 = 0, best_h1 = 0;
    uint best_err2 = UINT_MAX;
    uint best_l2 = 0, best_h2 = 0;

    for (uint l = 0; l < 32; l++)
    {
        const int lv = (l << 3) | (l >> 2);

        for (uint h = 0; h < 32; h++)
        {
            const int hv = (h << 3) | (h >> 2);

            if (lv > hv)
                continue;

            int delta = hv - lv;
            // Avoid endpoints that are too far apart to reduce artifacts
            if (delta > T)
                continue;

            uint e1 = (lv * 5 + hv * 3) / 8;

            int diff1 = math::iabs(c - e1);
            if (diff1 < best_err1)
            {
                best_err1 = diff1;
                best_l1 = l;
                best_h1 = h;
            }

            uint e2 = (lv * 3 + hv * 5) / 8;
            int diff2 = math::iabs(c - e2);
            if (diff2 < best_err2)
            {
                best_err2 = diff2;
                best_l2 = l;
                best_h2 = h;
            }
        }
    }

    g_pvrtc_opt55_e1[c] = best_l1 | (best_h1 << 8);
    g_pvrtc_opt55_e2[c] = best_l2 | (best_h2 << 8);
}

// 5-bit/4-bit loop is similar

Now that you have these tables, you can loop through all the 4x4 pixel blocks in the PVRTC texture and compute the 7x7 average RGB color surrounding each block (it's 7x7 pixels because you want the average of all colors influenced by each block's endpoint accounting for bilinear endpoint interpolation). You can look up the optimal endpoints to use for each component, set the block's endpoints to those trial endpoints, find the best modulation values for the impacted 7x7 pixels, and see if the error is reduced or not. The overall error is reduced on smooth blocks very often. You can try this process several times for each block using different precomputed tables.

For even more quality, you can also use precomputed tables for modulation values 0 and 3. You can also use two dimensional tables [256][256] that have the optimal endpoints to use for two colors, then quantize each 7x7 pixel area to 2 colors (using a few Lloyd algorithm iterations) and try those endpoints too. 2D tables result in higher quality high contrast transitions.

Here's some pseudocode showing how to use the tables for a single modulation value (you can apply this process multiple times for the other tables):

// Compute average color of all pixels influenced by this endpoint
vec4F c_avg(0);

for (int y = 0; y < 7; y++)
{
    const uint py = wrap_or_clamp_y(by * 4 + y - 1);
    for (uint x = 0; x < 7; x++)
    {
        const uint px = wrap_or_clamp_x(bx * 4 + x - 1);

        const color_quad_u8 &c = orig_img(px, py);

        c_avg[0] += c[0];
        c_avg[1] += c[1];
        c_avg[2] += c[2];
        c_avg[3] += c[3];
    }
}

// Save the 3x3 block neighborhood surrounding the current block
for (int y = -1; y <= 1; y++)
{
    for (int x = -1; x <= 1; x++)
    {
        const uint block_x = wrap_or_clamp_block_x(bx + x);
        const uint block_y = wrap_or_clamp_block_y(by + y);
        cur_blocks[x + 1][y + 1] = m_blocks(block_x, block_y);
    }
}

// Compute the rounded 8-bit average color
// (c_avg holds the sum of the 7x7 pixels around the block, so divide by 49 first)
c_avg *= (1.0f / 49.0f);
c_avg += vec4F(.5f);
color_quad_u8 color_avg((int)c_avg[0], (int)c_avg[1], (int)c_avg[2], (int)c_avg[3]);

// Lookup the optimal PVRTC endpoints to use given this average color,
// assuming the modulation values will be all-1
color_quad_u8 l0(0), h0(0);
l0[0] = g_pvrtc_opt55_e1[color_avg[0]] & 0xFF;
h0[0] = g_pvrtc_opt55_e1[color_avg[0]] >> 8;

l0[1] = g_pvrtc_opt55_e1[color_avg[1]] & 0xFF;
h0[1] = g_pvrtc_opt55_e1[color_avg[1]] >> 8;

l0[2] = g_pvrtc_opt54_e1[color_avg[2]] & 0xFF;
h0[2] = g_pvrtc_opt54_e1[color_avg[2]] >> 8;

// Set the block's endpoints and evaluate the error of the 7x7 neighborhood (also choosing new modulation values!)
m_blocks(bx, by).set_opaque_endpoint_raw(0, l0);
m_blocks(bx, by).set_opaque_endpoint_raw(1, h0);

uint64 e1_err = remap_pixels_influenced_by_endpoint(bx, by, orig_img, perceptual, alpha_is_significant);
if (e1_err > current_best_err)
{
    // Error got worse, so restore the blocks
    for (int y = -1; y <= 1; y++)
    {
        for (int x = -1; x <= 1; x++)
        {
            const uint block_x = wrap_or_clamp_block_x(bx + x);
            const uint block_y = wrap_or_clamp_block_y(by + y);

            m_blocks(block_x, block_y) = cur_blocks[x + 1][y + 1];
        }
    }
}

Here's an example for kodim03 (cropped to 1k square due to PVRTC limitations). This image only uses 2 precomputed tables for modulation values 1 and 2 (because it's real-time):

Original:


Before table-based optimization:
RGB Average Error: Max:  86, Mean: 1.156, MSE: 9.024, RMSE: 3.004, PSNR: 38.577


Endpoint and modulation data:





After:
RGB Average Error: Max:  79, Mean: 0.971, MSE: 6.694, RMSE: 2.587, PSNR: 39.874



Endpoint and modulation data:





The 2D table version looks better on high contrast transitions, but needs more memory. Using 4 1D tables followed by a single 2D lookup results in the best quality.

The lookup table example code above assumes the high endpoints will usually be >= the low endpoints. Whatever algorithm you use to create the endpoints in the first pass needs to be compatible with your lookup tables, or you'll lose quality.


Real-time PVRTC encoding for a universal GPU texture format system

Here's one way to support PVRTC in a universal GPU texture format system that transcodes from a block based format like ETC1S.

First, study this PVRTC code:
https://bitbucket.org/jthlim/pvrtccompressor/src/default/PvrTcEncoder.cpp

Unfortunately, this library has several key bugs, but its core texture encoding approach is sound for real-time use.

Don't use its decompressor: it's not bit accurate vs. the GPU and doesn't unpack alpha properly. Use this "official" decoder as a reference instead:

https://github.com/google/swiftshader/blob/master/third_party/PowerVR_SDK/Tools/PVRTDecompress.h

Function EncodeRgb4Bpp() has two passes:

1. The first pass computes RGB(A) bounding boxes for each 4x4 block: 

    for(int y = 0; y < blocks; ++y)
    {
        for(int x = 0; x < blocks; ++x)
        {
            ColorRgbBoundingBox cbb;
            CalculateBoundingBox(cbb, bitmap, x, y);
            PvrTcPacket* packet = packets + GetMortonNumber(x, y);
            packet->usePunchthroughAlpha = 0;
            packet->SetColorA(cbb.min);
            packet->SetColorB(cbb.max);
        }
    }
Most importantly, SetColorA() must floor and SetColorB() must ceil. Note that the alpha version of the code in this library (function EncodeRgba4Bpp()) is very wrong: it assumes alpha 7=255, which is incorrect (it's actually (7*2)*255/15 or 238). 
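A quick sketch of that alpha expansion (following the decode math quoted above; the function name is mine):

```cpp
#include <cstdint>

// 3-bit PVRTC endpoint alpha -> 8 bits: shift to 4 bits (low bit zero),
// then scale 4 bits to 8 bits. Note alpha 7 maps to 238, not 255.
static inline uint32_t pvrtc_alpha3_to_8(uint32_t a3)
{
    const uint32_t a4 = a3 << 1;  // 3-bit -> 4-bit
    return (a4 * 255) / 15;       // 4-bit -> 8-bit (same as a4 * 17)
}
```

So an encoder that treats endpoint alpha 7 as fully opaque will systematically decode too dark in the alpha channel.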

This pass can be done while decoding ETC1S blocks during transcoding. The endpoint/modulation values need to be saved to a temporary buffer.

It's possible to swap the low and high endpoints and get an encoding that results in less error (I believe because the endpoint encoding precision of blue isn't symmetrical - it's 4/5 not 5/5), but you have to encode the image twice so it doesn't seem worth the trouble.

2. Now that the per-block endpoints are computed, you can compute the per-pixel modulation values. This function is quite optimizable without requiring vector code (which doesn't work on the Web yet):

for(int y = 0; y < blocks; ++y)
{
    for(int x = 0; x < blocks; ++x)
    {
        const unsigned char (*factor)[4] = PvrTcPacket::BILINEAR_FACTORS;
        const ColorRgba<unsigned char>* data = bitmap.GetData() + y * 4 * size + x * 4;
        uint32_t modulationData = 0;

        for(int py = 0; py < 4; ++py)
        {
            const int yOffset = (py < 2) ? -1 : 0;
            const int y0 = (y + yOffset) & blockMask;
            const int y1 = (y0 + 1) & blockMask;

            for(int px = 0; px < 4; ++px)
            {
                const int xOffset = (px < 2) ? -1 : 0;
                const int x0 = (x + xOffset) & blockMask;
                const int x1 = (x0 + 1) & blockMask;

                const PvrTcPacket* p0 = packets + GetMortonNumber(x0, y0);
                const PvrTcPacket* p1 = packets + GetMortonNumber(x1, y0);
                const PvrTcPacket* p2 = packets + GetMortonNumber(x0, y1);
                const PvrTcPacket* p3 = packets + GetMortonNumber(x1, y1);

                ColorRgb<int> ca = p0->GetColorRgbA() * (*factor)[0] +
                                   p1->GetColorRgbA() * (*factor)[1] +
                                   p2->GetColorRgbA() * (*factor)[2] +
                                   p3->GetColorRgbA() * (*factor)[3];
                ColorRgb<int> cb = p0->GetColorRgbB() * (*factor)[0] +
                                   p1->GetColorRgbB() * (*factor)[1] +
                                   p2->GetColorRgbB() * (*factor)[2] +
                                   p3->GetColorRgbB() * (*factor)[3];

                const ColorRgb<unsigned char>& pixel = data[py * size + px];
                ColorRgb<int> d = cb - ca;
                ColorRgb<int> p{pixel.r * 16, pixel.g * 16, pixel.b * 16};
                ColorRgb<int> v = p - ca;

                // PVRTC uses weightings of 0, 3/8, 5/8 and 1
                // The boundaries for these are 3/16, 1/2 (=8/16), 13/16
                int projection = (v % d) * 16;
                int lengthSquared = d % d;
                if(projection > 3 * lengthSquared) modulationData++;
                if(projection > 8 * lengthSquared) modulationData++;
                if(projection > 13 * lengthSquared) modulationData++;
                modulationData = BitUtility::RotateRight(modulationData, 2);

                factor++;
            }
        }

        PvrTcPacket* packet = packets + GetMortonNumber(x, y);
        packet->modulationData = modulationData;
    }
}

The code above interpolates the endpoints in full RGB(A) space, which isn't necessary. You can sum each channel into a single value (like Luma, but just R+G+B), interpolate that instead (much faster in scalar code), then decide which modulation values to use in 1D space. Also, you can unroll the innermost px/py loops using macros or whatever.
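The 1D trick can be sketched as follows. This is a hypothetical illustration (the function name and scalar layout are mine, not the library's): collapse each endpoint and pixel to a single R+G+B sum, then make the same three boundary comparisons as the 3D version.

```cpp
#include <cassert>

// Hypothetical sketch: pick the PVRTC modulation index (0..3, mapping to
// weights 0, 3/8, 5/8, 1) by projecting a scalar pixel value onto the
// scalar ca->cb axis. Same 3/16, 8/16, 13/16 boundaries as the RGB version.
static int pick_modulation_1d(int pixel_sum, int ca_sum, int cb_sum)
{
    const int d = cb_sum - ca_sum;
    const int v = pixel_sum - ca_sum;
    const int projection = v * d * 16;   // scalar "dot product", scaled by 16
    const int lengthSquared = d * d;

    int m = 0;
    if (projection > 3 * lengthSquared) m++;
    if (projection > 8 * lengthSquared) m++;
    if (projection > 13 * lengthSquared) m++;
    return m;
}
```

The two multiplies and three compares per pixel are cheap in scalar code, which is the point for asm.js/WebAssembly targets.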

Encoding from ETC1S simplifies things somewhat because, for each block, you can precompute the R+G+B values to use for each of the 4 possible input selectors.
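A minimal sketch of that precomputation, assuming the standard ETC1 modifier table and selectors in sorted (low-to-high) order; the names and the exact selector ordering are illustrative, not taken from the Basis encoder:

```cpp
#include <algorithm>
#include <cassert>

// Standard ETC1 intensity modifier table, one row per 3-bit table index,
// deltas listed in ascending order.
static const int g_etc1_modifiers[8][4] = {
    {   -8,  -2,  2,   8 }, {  -17,  -5,  5,  17 }, {  -29,  -9,  9,  29 },
    {  -42, -13, 13,  42 }, {  -60, -18, 18,  60 }, {  -80, -24, 24,  80 },
    { -106, -33, 33, 106 }, { -183, -47, 47, 183 }
};

// Per-block precompute: the scalar R+G+B value each of the 4 selectors
// decodes to, with per-channel clamping exactly as an ETC1 decoder does it.
static void precompute_selector_sums(int r, int g, int b, int table, int sums[4])
{
    for (int s = 0; s < 4; ++s)
    {
        const int d = g_etc1_modifiers[table][s];
        const int cr = std::min(255, std::max(0, r + d));
        const int cg = std::min(255, std::max(0, g + d));
        const int cb = std::min(255, std::max(0, b + d));
        sums[s] = cr + cg + cb;
    }
}
```

With these four sums cached, the per-pixel work in the modulation pass reduces to a table lookup per selector.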

That's basically it. If you combine this post with my previous one, you've got a nice real-time PVRTC encoder usable in WebAssembly/asm.js (i.e. it doesn't need vector ops to be fast). Quality is surprisingly good for a real-time encoder, especially if you add the optional 3rd pass described in my other post.

Alpha is tougher to handle, but the basic concepts are the same.

The encoder in this library doesn't support punch-through alpha, which is quite valuable and easy to encode in my testing. 

PVRTC encoding examples

This is "testpat.png", which I got somewhere on the web. It's a surprisingly tricky image to encode to PVRTC. The gradients, various patterns, the transitions between these regions and even the constant-color areas are hard to handle in PVRTC. (Sorry, there is lena in there. I will change this to something else eventually.)

Note my encoder used clamp addressing for both encoding and decoding but PVRTexTool used wrap (not that it matters with this image). Here's the .pvr file for testpat.

Original

BC1: 47.991 Y PSNR

PVRTexTool "Best Quality": 41.943 Y PSNR

Experimental encoder (bounding box, precomputed tables, 1x1 block LS): 44.914 Y PSNR:
Here's delorean (resampled to .25 original size):

Original

BC1: 43.293 Y PSNR, .997308 Y SSIM

PVRTexTool "Best Quality": 40.440 Y PSNR, .996007 Y SSIM

Experimental encoder: 42.891 Y PSNR, .997021 Y SSIM
Interestingly, on delorean you can see that PVRTC's handling of smooth gradients is clearly superior vs. BC1 with a strong encoder.

Here's xmen_1024:

Original

BC1: 37.757 Y PSNR, .984543 Y SSIM

PVRTexTool "Best Quality": 36.762 Y PSNR, .976023 Y SSIM

Experimental encoder: 37.314 Y PSNR, .9812 Y SSIM

"Y" is REC 709 Luma, SSIM was computed using OpenCV.


Basis status update

I sent this as a reply to someone by email, but it makes a good blog post too. Here's what Basis does right now (i.e. this is what we ship for OSX/Windows/Linux):

1. RDO BC1-5: Like crunch's, slower but higher quality/smaller files (supports up to 32K codebooks, LZ-specific RDO optimizations - crunch is limited to only 8K codebooks, no LZ RDO)

This competes against crunch's original BC1-5 RDO solution, which is extremely fast (I wrote it for max speed) but lower quality for the same bitrate. The decrease in bitrate for same quality completely depends on the content and the LZ codec you use, but it can be as high as 20% according to one large customer. On the other hand, for some textures it'll only be a few percent.

crunch's RDO is limited to 8K codebooks, so Basis can be used where crunch cannot due to quality concerns.

Some teams prefer fast encoding at lower quality, and some prefer higher quality (especially on normal maps) at lower speed. We basically gave away the lower quality option in crunch.

2. RDO ETC1: Up to 32K codebooks, no LZ-specific RDO optimizations yet.
Crunch doesn't support ETC1 RDO.
You could compress to ETC1 .CRN, then unpack that to .KTX, to give you a "poor man's" equivalent to direct ETC1 RDO, but you'll still be limited to 8K codebooks max (which is too low quality for many normal maps and some albedo textures).

3. .basis: universal (supports transcoding to BC1-5, BC7, PVRTC1 4bpp opaque, ETC1, more formats on the way)
crunch doesn't support this.
We provide all of the C++ decoder/transcoder source code, which builds using emscripten.

.basis started as a custom ETC1 system we wrote for Netflix, then I figured out how to make it universal. Note that I recently open sourced the key ETC1S->BC1 transcoding technique in crunch publicly (to help the Khronos universal GPU texture effort along by giving them the key info they needed to implement their own solution):

4. Non-RDO BC7: superior to ispc_texcomp's. Written in ispc.

I'm currently working on RDO BC7 and better support for PVRTC. We are building a portfolio of encoders for all the formats, as fast as we can. We're going to keep adding encoders over the next few years.

Our intention is not to compete against crunch (that's commercial suicide). I put a ton of value into crunch, and after Alexander optimized .CRN more its value went through the roof. A bunch of large teams are using it on commercial products because it works so well.


A little ETC1S history

I've been talking about ETC1S for several years. I removed some of my earlier posts (to prevent others from stealing our work - which does happen) but they are here:
https://web.archive.org/web/20160913170247/http://richg42.blogspot.com/

We also covered our work with ETC1S and a universal texture format at CppCon 2016:

Just in case there's any confusion, we shipped our first ETC1S encoder to Netflix early last year, and developed all the universal stuff from 2016-early 2018.

This is why we're working on Basis.

Types of tech companies

Some tech company types I've encountered in the past. A company can be a blend of multiple types.

1. Famous companies 
Massive subsidies from digital distribution
Let them call you - *always*. Can be dehumanizing and super abusive.
Results don't really matter. Soviet-like purges. Insane, power mad staff.

2. Megacorps
Massive dilbert-like internal politics between and within groups. Can be decent if you find the right group that you fit into.
Results only matter due to internal politics and constant reorgs/layoffs, not due to any intrinsic need for profits.
Great for those who want a somewhat stable lifestyle, if you can tolerate the politics and culture.
Workers turn the megacorp into their corporate tribe and absolutely obsess over internal politics at all times.
Can appear absolutely batshit insane from the outside.
May have illegally colluded with other megacorps to not hire each other's employees to keep wages and horizontal movement between firms down.

3. Satellite Companies
Firm acquired by large megacorp
The former insiders/founders become mid-level management at the megacorp.
Firm will try to keep its identity and culture but usually fails.
Resistance is futile: Either the firm ultimately becomes fully absorbed into the megacorp or it will be shut down.
Don't join until you figure out what the megacorp actually thinks of the acquired company and its mid-level managers.
The former owners will be super tight.

4. Well-established, well-respected firms
Products are legendary and set the bar.
There are two types of employees: The "old guard" and everyone else. If you're not on the inside you are outside.
Can be a good choice if you can fit in and get shit done.
Don't expect to become an insider anytime soon if ever.
Rare

5. Traditional Silicon Valley-style startups
Founders can appear absolutely batshit on social media, during public presentations etc.
For the gamblers. The earlier you get in, the higher the probability you'll get good stock.
Founders and investors get in bed with each other.
Non-savvy founders get eventually pushed out or lose power.
These startups come and go constantly, so if you work for one that almost inevitably goes bust just move your desk one block away.
Talk to ex-employees before joining. If they had to sign NDAs or got threatened if they talked, avoid.

6. Self-funded startups
Formed by a small, passionate group of insiders wanting to recapture past glories or just be independent.
Can be good, but don't expect it to last when the insiders break up.
Founders can be super passionate about their project and will continue investing in it even after it becomes obvious to everyone else that it's never going to make a dime.
These startups have lifespans of a few years or so unless they have a big success.

7. Scrappy contracting outfits beholden to a single publisher
Publisher is abusive. Company ignores it because it has no choice.
You will be treated like dogs. Crunch is expected.
Founders think they are making good money, but because they go for long periods without any income while still working, they actually aren't.
Company has zero leverage with its publisher because it doesn't have any alternatives.
Can be OK if you work there hourly, but avoid full-time contracts because you will be crunched to death and treated badly.
Darth Vader-like publisher will break all the rules, recruit your best staff, make changes to your team or contract, etc. at will - because it can.

8. Firms working with multiple publishers
Company keeps multiple products in the pipeline with multiple publishers.
Firm lies to each publisher about who is working on what.
May have secret passion-projects in the background covertly funded with publisher funds.
Fragile. If one team fails, the company is in trouble and expect layoffs. If two or more teams fail, the company is toast.
Always talk to all teams in the firm to build a picture of how healthy each product is.
Can be great places to work as long as you realize it probably won't last.

9. Phoenix-like companies started after mass layoffs in small cities
Company formed after mass layoff trauma.
Two groups: Insiders and Outsiders. Insiders are *tight*. Outsiders will never become insiders - new insiders will be brought in.
Eventual Buyout Mentality: You will be constantly told that the company will eventually be sold and you'll become rich off your stock - just like last time.
Local shadowy investors prop the company up during hard times.
Publisher(s) are kept at arm's length and generally aren't allowed to interact with employees - always through managers.
Stock is absolutely, totally worthless unless the Insiders love you during a buyout.
Unstable until established. Buyout may never actually happen.
Small-town environment may make the company somewhat shady. Horizontal movement between tech companies in the same small town is virtually impossible due to illegal collusion between companies to not compete over employees.
The company actually exists to make the insiders wealthy and to give the upper management a decent lifestyle. Everyone else is expendable.

10. Companies formed by a single god-like owner
No "Insiders": There's the dictator-like owner, upper management, and everyone else.
Always meet and interview with the owner first. Avoid if they give you the creeps or a bad vibe.
Company is an expression of the owner's weird development philosophies.
Best avoided if in a small city/town.
Check the background of the owner and figure out where their funding came from. If they are scam artists, have lots of ex-employees suing them, or have otherwise shady backgrounds, obviously avoid.

11. Engine Companies
At the center of a large ecosystem of content creators
At War with competitive engine companies, which the company absolutely hates.
Funded with large amounts of investor capital and through support contracts with large firms.
Massive, sprawling corporation consisting of multiple smaller firms spread over the entire globe.
The engine company workers actually wind up secretly hating the developers who use their engine.
Joining as a single developer gets you nowhere. It's best to be acquired by the firm as a small group and given your own office. The company actively looks for these small groups to hire/acquire.
Can be a good gig in the right city but don't expect to get anywhere. It's just a job on a large sprawling piece of engine software nobody fully understands anymore.
Company can employ talent virtually anywhere on the globe.
Workers generally treated like crap. Contractors (especially in Eastern Europe or Russia) are massively underpaid and undervalued.
Company has a firm process and procedure for doing things and that's it.
Upper management layer is cult-like and very tight.
Each office has its own strange small town-esque politics and culture.

Existing BC7 codecs don't handle textures with decorrelated alpha channels well

Developers aren't getting the alpha quality they could be getting if they had better BC7 codecs. I noticed while working on our new non-RDO BC7 codec that existing BC7 codecs don't handle textures with decorrelated alpha signals well. They wind up trashing the alpha channel when the A signal doesn't resemble the signal in RGB. I didn't have time to investigate the issue until now. I'm guessing most developers either don't care, or they use simple (correlated) alpha channels, or multiple textures.

Some codecs allow the user to specify individual RGBA channel weightings. (ispc_texcomp isn't one of them.) This doesn't work well in practice, and users will rarely fiddle with the weightings anyway. You have to weight A so highly that RGB gets trashed.

Here's an example using a well-known CPU BC7 codec:

RGB image: kodim18
Alpha image: kodim17

Encoded using Intel's ispc_texcomp to BC7 profile alpha_slow:

RGB Average Error: Max:  40, Mean: 1.868, MSE: 7.456, RMSE: 2.731, PSNR: 39.406
Luma        Error: Max:  26, Mean: 1.334, MSE: 3.754, RMSE: 1.938, PSNR: 42.386
Alpha       Error: Max:  36, Mean: 1.932, MSE: 7.572, RMSE: 2.752, PSNR: 39.339

Encoded RGB:

Encoded A:

Experimental RDO BC7 codec (quantization disabled) with support for decorrelated alpha. Uses only modes 4, 6, and 7:

M4: 72.432457%, M6: 17.028809%, M7: 10.538737%
RGB Average Error: Max:  65, Mean: 2.031, MSE: 8.871, RMSE: 2.978, PSNR: 38.651
Luma        Error: Max:  34, Mean: 1.502, MSE: 4.887, RMSE: 2.211, PSNR: 41.241
Alpha       Error: Max:  29, Mean: 1.601, MSE: 5.703, RMSE: 2.388, PSNR: 40.570

Encoded RGB:

Encoded A:

Zoomed in comparison:



This experimental codec separately measures per-block RGB average and alpha PSNR. It prefers mode 4, and switches to modes 6 or 7 using this logic:

const float M7_RGB_THRESHOLD = 1.0f;
const float M7_A_THRESHOLD = 40.0f;
const float M7_A_DERATING = 12.0f;

const float M6_RGB_THRESHOLD = 1.0f;
const float M6_A_THRESHOLD = 40.0f;
const float M6_A_DERATING = 7.0f;

if ((m6_rgb_psnr > (math::maximum(m4_rgb_psnr, m7_rgb_psnr) + M6_RGB_THRESHOLD)) &&
    (m6_a_psnr > M6_A_THRESHOLD) &&
    (m6_a_psnr > (math::maximum(m4_a_psnr, m7_a_psnr) - M6_A_DERATING)))
{
    block_modes[block_index] = 6;
}
else if ((m7_rgb_psnr > (m4_rgb_psnr + M7_RGB_THRESHOLD)) &&
         (m7_a_psnr > (m4_a_psnr - M7_A_DERATING)) &&
         (m7_a_psnr > M7_A_THRESHOLD))
{
    block_modes[block_index] = 7;
}
else
{
    block_modes[block_index] = 4;
}

What this basically does: we only use mode 6 or 7 when the RGB PSNR is greater than mode 4's RGB PSNR plus a threshold (1 dB). But we only do this if we don't degrade alpha quality too much (either 12 dB or 7 dB), and if alpha quality is above a minimum threshold (40 dB).

PSNR doesn't really capture the extra distortion being introduced here. It can help to load the alpha images into Photoshop or whatever and compare them closely as layers.

RGB PSNR went down a little with my experimental RDO BC7 codec, but alpha went up. Visually, alpha is greatly improved. My ispc non-RDO BC7 code currently has the same issue as Intel's ispc_texcomp with decorrelated alpha.

Direct conversion of ETC1 to DXT1 texture data (originally published 9/11/16)

In this experiment, I limited my ETC1 encoder to only use a subset of the full format: differential mode, no flipping, with the diff color always set to (0,0,0). So all we use in the ETC1 format is the 5:5:5 base color, the 3-bit intensity table index, and the 16 2-bit selectors. This is the same subset used in this post on ETC1 endpoint clusterization.

This limits the ETC1 encoder to only utilizing 4 colors per block, just like DXT1. These 4 colors are on a line parallel to the grayscale axis. Fully lossless conversion (of this ETC1 subset format) to DXT1 is not possible in all cases, but it may be possible to do a "good enough" conversion.

The ETC1->DXT1 conversion step uses a precomputed 18-bit lookup table (5*3+3 bits) to accelerate the conversion of the ETC1 base color, intensity table index, and selectors to DXT1 low/high color endpoints and selectors. Each table entry contains the best DXT1 low/high color endpoints to use, along with a 4 entry table specifying which DXT1 selector to use for each ETC1 selector. I used crunch's DXT1 endpoint optimizer to build this table.
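To make the mechanism concrete, here's a hypothetical sketch of what such a table entry and its 18-bit index could look like. The struct layout and names are mine, not the actual implementation's; the real table was generated offline with crunch's DXT1 endpoint optimizer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical entry of the ETC1->DXT1 conversion table: the best DXT1
// endpoints for one (base color, intensity index) pair, plus a 4-entry
// remap from ETC1 selectors to DXT1 selectors.
struct Etc1ToDxt1Entry
{
    uint16_t dxt1_lo;          // DXT1 low color endpoint (5:6:5)
    uint16_t dxt1_hi;          // DXT1 high color endpoint (5:6:5)
    uint8_t  selector_map[4];  // ETC1 selector -> DXT1 selector
};

// 18-bit table index: 15 bits of 5:5:5 base color + 3 bits of intensity
// table index (the "5*3+3 bits" mentioned above).
static inline uint32_t make_table_index(uint32_t r5, uint32_t g5, uint32_t b5, uint32_t intens3)
{
    return (((r5 << 10) | (g5 << 5) | b5) << 3) | intens3;
}
```

Conversion of a block is then one table fetch plus 16 selector remaps, which is why the transcode step is so cheap.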

ETC1 (subset):

Error: Max:  80, Mean: 3.802, MSE: 30.247, RMSE: 5.500, PSNR: 33.324



Converted directly to DXT1 using the lookup table approach, then decoded (in software using crnlib):

Error: Max:  73, Mean: 3.966, MSE: 32.873, RMSE: 5.733, PSNR: 32.962


Delta image:


Grayscale delta histogram:


There are some block artifacts to work on, but this is great progress for 1 hour of work. (Honestly, I would have been pretty worried if there weren't any artifacts to figure out on my first test!)

These results are extremely promising. The next step is to work on the artifacts and do more testing. If this conversion step can be made to work well enough it means that a lossy "universal crunch" format that can be quickly and efficiently transcoded to either DXT1 or ETC1 is actually possible.

Direct conversion of ETC1 to DXT1 texture data: 2nd experiment (originally published 9/06/16)

I lowered the ETC1 encoder's quality setting, so it doesn't try varying the block color so much during endpoint optimization. The DXT1 artifacts in my first experiment are definitely improved, although the overall quality is reduced. I also enabled usage of 3-color DXT1 blocks (although that was very minor).

Perhaps the right solution (that preserves quality but avoids the artifacts) is to add ETC1->DXT1 error evaluation to the ETC1 encoder, so it's aware of how much DXT1 error each ETC1 trial block color has.

ETC1 (subset):


Error: Max: 101, Mean: 4.036, MSE: 34.999, RMSE: 5.916, PSNR: 32.690

Converted directly to DXT1 using an 18-bit lookup table:


Error: Max: 107, Mean: 4.239, MSE: 38.930, RMSE: 6.239, PSNR: 32.228

Another ETC1:


Error: Max: 121, Mean: 4.220, MSE: 45.108, RMSE: 6.716, PSNR: 31.588

DXT1:


Error: Max: 117, Mean: 4.403, MSE: 48.206, RMSE: 6.943, PSNR: 31.300

Direct conversion of ETC1 to DXT1 texture data: 3rd experiment (originally published 9/7/16)

I've changed the lookup table used to convert to DXT1. Each cell in the 256K entry table (32*32*32*8, for each 5:5:5 base color and 3-bit intensity table entry in my ETC1 subset format) now contains 10 entries, to account for each combination of actually used ETC1 selector ranges in a block:

 { 0, 0 },
 { 1, 1 },
 { 2, 2 },
 { 3, 3 },
 { 0, 3 },
 { 1, 3 },
 { 2, 3 },
 { 0, 2 },
 { 0, 1 },
 { 1, 2 }

The first 4 entries here account for blocks that get encoded into a single color. The next entry accounts for blocks which use all selectors, then { 1, 3 } accounts for blocks which only use selectors 1,2,3, etc.

So for example, when converting from ETC1, if only selector 2 was actually used in a block, the ETC1->DXT1 converter uses a set of DXT1 low/high colors optimized for that particular use case. If all selectors were used, it uses entry #4, etc. The downsides to this technique are the extra CPU expense in the ETC1->DXT1 converter to determine the range of used selectors, and the extra memory to hold a larger table.
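A sketch of how the converter might map a block's used-selector range to one of the 10 sub-table entries (the function name and the linear search are illustrative; a real implementation would likely use a small 4x4 lookup table):

```cpp
#include <cassert>
#include <cstdint>

// Maps a block's (lowest, highest) used ETC1 selectors to the sub-table
// index, in the same order as the 10-entry list above.
static int selector_range_index(int low, int high)
{
    static const int8_t ranges[10][2] = {
        { 0, 0 }, { 1, 1 }, { 2, 2 }, { 3, 3 }, { 0, 3 },
        { 1, 3 }, { 2, 3 }, { 0, 2 }, { 0, 1 }, { 1, 2 }
    };
    for (int i = 0; i < 10; ++i)
        if (ranges[i][0] == low && ranges[i][1] == high)
            return i;
    return -1; // not a valid (low <= high) selector range
}
```

Scanning the block's 16 selectors for their min/max gives (low, high), which is the extra per-block CPU cost this technique adds.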

Note the ETC1 encoder is still not aware at all that its output will also be DXT1 coded. That's the next experiment. I don't think using this larger lookup table is necessary; a smaller table should hopefully be OK if the ETC1 subset encoder is aware of the DXT1 artifacts it's introducing in each trial. Another idea is to use a simple table most of the time, and only access the larger/deeper conversion table on blocks which use the brighter ETC1 intensity table indices (the ones with more error, like 5-7).

ETC1 (subset):


Error: Max:  80, Mean: 3.802, MSE: 30.247, RMSE: 5.500, PSNR: 33.324

ETC1 texture directly converted to DXT1:


Error: Max:  73, Mean: 3.939, MSE: 32.218, RMSE: 5.676, PSNR: 33.050

I experimented with allowing the DXT1 optimizer (used to build the lookup table) to use 3-color blocks. This is actually a big deal for this use case, because the transparent selector's color is black (0,0,0). ETC1's saturation to 0 or 255 after adding the intensity table values creates "strange" block colors (away from the block's colorspace line), and this trick allows the DXT1 optimizer to work around that issue better. I'm not using this trick above, though.

I started seriously looking at the BC7 texture format's details today. It's complex, but nowhere near as complex as ASTC. I'm very tempted to try converting my ETC1 subset to that format next.

Also, if you're wondering why I'm working on this stuff: I want to write one .CRN-like encoder that supports efficient transcoding into as many GPU formats as possible. It's a lot of work to write these encoders, and the idea of that work's value getting amplified across a huge range of platforms and devices is very appealing. A universal format's quality won't be the best, but it may be practical to add a losslessly encoded "fixup" chunk to the end of the universal file. This could improve quality for a specific GPU format. 

Direct conversion of ETC1 to DXT1 texture data: 4th experiment (originally published 9/11/16)

In this experiment, I've worked on reducing the size of the lookup table used to quickly convert a subset of ETC1 texture data (using only a single 5:5:5 base color, one 3-bit intensity table index, and 2-bit selectors) directly to DXT1 texture data. Now the ETC1 encoder is able to simultaneously optimize for both formats, and due to this I can reduce the size of the conversion table. To accomplish this, I've modified the ETC1 base color/intensity optimizer function so it also factors in the DXT1 block encoding error into each trial's computed ETC1 error.

The overall trial error reported back to the encoder in this experiment was etc_error*16+dxt_error. The ETC1->DXT1 lookup table is now 3.75MB, with precomputed DXT1 low/high endpoints for three used selector ranges: 0-3, 0-2, 1-3. My previous experiment had 10 precomputed ranges, which seemed impractically large. I'm unsure which set of ranges is really needed or optimal yet. Even just one (0-3) seems to work OK, but with more artifacts on very high contrast blocks.
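Expressed as code, the combined metric is trivial (a sketch; the 16x weight is the one described above, and the function name is mine):

```cpp
#include <cassert>
#include <cstdint>

// Combined per-trial error: native ETC1 error weighted 16x so ETC1 quality
// still dominates, plus the DXT1 error of the converted block so the
// optimizer avoids endpoints that convert badly.
static inline uint64_t combined_trial_error(uint64_t etc_error, uint64_t dxt_error)
{
    return etc_error * 16 + dxt_error;
}
```

The optimizer simply keeps the trial with the lowest combined error, so the DXT1 term acts as a tie-breaker and a penalty on conversion-hostile endpoints.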

Anyhow, here's kodim18.

ETC1 subset:


Max:  80, Mean: 3.809, MSE: 30.663, RMSE: 5.537, PSNR: 33.265

DXT1:


Max:  76, Mean: 3.952, MSE: 32.806, RMSE: 5.728, PSNR: 32.971

ETC1 block selector range usage histogram:
0-3: 19161
1-3: 3012
0-2: 2403

Universal texture compression: 5th experiment (originally published 9/15/16)


I outlined a plan for my next texture compression experiment in a previous post, here. I modified my ETC1 packer so it accepts an optional parameter which forces the encoder to use a set of predetermined selectors, instead of allowing it to use whatever selectors it likes.

The idea is, I can take an ETC1 texture using a subset of the full-format (no flips and only a single base color/intensity index - basically a single partition/single subset format using BC7 terminology) and "upgrade" it to higher quality without modifying the selector indices. I think this is one critical step to making a practical universal texture format that supports both DXT1 and ETC1.

Turns out, this idea works better than I thought it would. The ETC1 subset encoding gets 33.265 dB, while the "upgraded" version (using the same selectors as the subset encoding) gets 34.315 dB, a big gain. (Which isn't surprising, because the ETC1 subset encoding doesn't take full advantage of the format.) The nearly-optimal ETC1 encoding gets 35.475 dB, so there is still some quality left on the table here.

The ETC1 subset to DXT1 converted texture is 32.971 dB. I'm not worried about having the best DXT1 quality, because I'm going to support ASTC and BC7 too and (at the minimum) they can be directly converted from the "upgraded" ETC1 encoding that this experiment is about.

I need to think about the next step from here. I now know I can build a crunch-like format that supports DXT1, ETC1, and ATC. These experiments have opened up a bunch of interesting product and open source library ideas. Proving that BC7 support is also practical to add should be easy. ASTC is so darned complex that I'm hesitant to do it for "fun".

1. ETC1 (subset):

Max:  80, Mean: 3.809, MSE: 30.663, RMSE: 5.537, PSNR: 33.265

Its selectors:


2. ETC1 (full format, constrained selectors) - optimizer was constrained to always use the subset encoding's selectors:


Max:  85, Mean: 3.435, MSE: 24.076, RMSE: 4.907, PSNR: 34.315

Its selectors (should be the same as #1's):


Biased delta between the ETC1 subset and ETC1 full encoding with constrained selectors - so we can see what pixels have benefited from the "upgrade" pass:


3. ETC1 (full format, unconstrained selectors) - packed using an improved version of rg_etc1 in highest quality mode:


Max:  80, Mean: 3.007, MSE: 18.432, RMSE: 4.293, PSNR: 35.475

Delta between the best ETC1 encoding (#3) and the ETC1 encoding using constrained selectors (#2):

Few more random thoughts on a "universal" GPU texture format (originally published 9/9/16)

In my experiments, a simple but usable subset of ETC1 can be easily converted to DXT1, BC7, and ATC. And after studying the standard, it very much looks like the full ETC1 format can be converted into BC7 with very little loss. (And when I say "converted", I mean using very little CPU, just basically some table lookup operations over the endpoint and selector entries.)

ASTC seems to be (at first glance) around as powerful as BC7, so converting the full ETC1 format to ASTC with very little loss should be possible. (Unfortunately ASTC is so dense and complex that I don't have time to determine this for sure yet.)

So I'm pretty confident now that a universal format could be compatible with ASTC, BC7, DXT1, ETC1, and ATC. The only other major format that I can't fit into this scheme easily is my old nemesis, PVRTC.

Obviously this format won't look as good compared to a dedicated, single format encoder's output. So what? There are many valuable use cases that don't require super high quality levels. This scheme purposely trades off a drop in quality for interchange and distribution.

Additionally, with a crunch-style encoding method, only the endpoint (and possibly the selector) codebook entries (of which there are usually only hundreds, possibly up to a few thousand in a single texture) would need to be converted to the target format. So the GPU format conversion step doesn't actually need to be insanely fast.

Another idea is to just unify ASTC and BC7, two very high quality formats. The drop in quality due to unification would be relatively much less significant with this combination. (But how valuable is this combo?)

More universal GPU texture format stuff (originally published 9/9/16)

Some BC7 format references:
https://msdn.microsoft.com/en-us/library/hh308954(v=vs.85).aspx
https://msdn.microsoft.com/en-us/library/hh308953.aspx

Source to CPU and shader BC7 (and other format) encoders/decoders:
https://github.com/Microsoft/DirectXTex

Khronos texture format references, including BC6H and BC7:
https://www.khronos.org/registry/dataformat/specs/1.1/dataformat.1.1.pdf

It may be possible to add ETC1-style subblocks into a universal GPU texture format, in a way that can be compressed efficiently and still converted on the fly to DXT1. Converting full ETC1 (with subblocks and per-subblock base colors) directly to BC7 at high quality looks easy because of BC7's partition table support. BC7 tables 0 and 13 (in 2 subset mode) perfectly match the ETC1 subblock orientations.
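For reference, here are those two 2-subset partition patterns from the standard BC7 partition table, with a tiny helper (mine, for illustration) computing the corresponding ETC1 subblock assignment:

```cpp
#include <cassert>

// BC7 2-subset partition patterns 0 and 13 (16 pixels, row-major).
// Pattern 0 splits the 4x4 block into left/right 2x4 halves; pattern 13
// splits it into top/bottom 4x2 halves.
static const int g_bc7_partition2_0[16] = {
    0,0,1,1, 0,0,1,1, 0,0,1,1, 0,0,1,1
};
static const int g_bc7_partition2_13[16] = {
    0,0,0,0, 0,0,0,0, 1,1,1,1, 1,1,1,1
};

// ETC1 subblock of pixel (x, y): non-flipped blocks use two 2x4 left/right
// subblocks, flipped blocks use two 4x2 top/bottom subblocks.
static int etc1_subblock(int x, int y, bool flipped)
{
    return flipped ? (y >= 2 ? 1 : 0) : (x >= 2 ? 1 : 0);
}
```

Since the subset assignments line up pixel-for-pixel, each ETC1 subblock's base color/intensity can feed one BC7 subset's endpoints directly.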

Any DX11 class or better GPU supports BC7, so on these GPU's the preferred output format can be BC7. DXT1 can be viewed as a legacy lower quality fallback for older GPU's.

Also, I limited the per-block (or per-subblock) base colors to 5:5:5 to simplify the experiments in my previous posts. Maybe storing 5:5:5 (for ETC1/DXT1) with 1-3 bit per-component deltas could improve the output for BC7/ASTC.

Also, one idea for alpha channel support in a universal GPU format: Store a 2nd ETC1 texture, containing the alpha channel. There's nothing to do when converting to ETC1, because using two ETC1 textures for color+alpha is a common pattern. (And, this eats two samplers, which sucks.)

When converting to DXT5's alpha block (DXT5A blocks - and yes I know there are BCx format equivalents but I'm using crnlib terms here), just use another ETC1 block color/intensity selector index to DXT5A mapping table. This table will be optimized for grayscale conversion. BC7 has very flexible alpha support so it should be a straightforward conversion.

The final thing to figure out is ASTC, but OMG that format looks daunting. Reminds me of MPEG/JPEG specs.