Richard Geldreich's Blog

Few notes about the previous post

This rant is mostly directed at the commenters who claimed I hobbled the open source codecs (including my own!) by not selecting the "proper" settings:

Please look closely at the red dots. Those represent Kraken. Now, this is a log10/log2 graph (log10 on the throughput axis). Kraken's decompressor is almost an order of magnitude faster than Brotli's; just from eyeballing the graph, it's around 5-8x faster. No amount of tweaking Brotli's settings is going to speed it up this much. Sorry everyone. I benchmarked Brotli at settings 0-10 overnight (11 is just too slow), and I'll post the results tomorrow, just to be sure.

There is only a single executable file. The codecs are statically linked into this executable. All open source codecs were compiled with Visual Studio 2015 with optimizations enabled. They all use the same exact compiler settings. I'll update the previous post tomorrow with the specific settings.

I'm not releasing my data corpus; Squeeze Chart doesn't release its corpus either. This is to prevent codec authors from tweaking their algorithms to perform well on a specific corpus while neglecting general purpose performance. It's just a large mix of data I found over time that was useful for developing and testing LZHAM. I didn't develop this corpus with any specific goals in mind; it just happens to be useful as a compressor benchmark. (The reasoning goes: if it was good enough to tune LZHAM, it should be good enough for newer codecs.)


Brotli levels 0-10 vs. Oodle Kraken

For codec version info, compiler settings, etc. see this previous post.

This graph demonstrates that varying Brotli's compression level from 0 to 10 noticeably impacts its decompression throughput. (Level 11 is just too slow to complete the benchmark overnight.) As I expected, at none of these settings is it able to compete against Kraken.




Interestingly, it appears that at Brotli's lowest settings (0 and 1) it outputs compressed data that is extremely (and surprisingly) slow to decode. (I've highlighted these settings in yellow and green below.) I'm not sure if this is intentional or not, but with this kind of large slowdown I would avoid these Brotli settings (and use something like zlib or LZ4 instead if you need that much throughput).



Level  Compressed Size (bytes)
 0    2144016081
 1    2020173184
 2    1963448673
 3    1945877537
 4    1905601392
 5    1829657573
 6    1803865722
 7    1772564848
 8    1756332118
 9    1746959367
10    1671777094

Original (uncompressed): 5374152762 bytes

Good article: Why software patents are evil

ETC1 block color clusterization experiment


Intro


ETC1 is a well thought out, elegant little GPU format. In my experience a few years ago writing a production quality block ETC1 encoder, I found it to be far less fiddly than DXT1. Both use 64 bits to represent a 4x4 texel block, or 4 bits per texel.

I've been very curious how hard it would be to add ETC1/2 support to crunch. Also, many people have asked about ETC1 support, which is guaranteed to be available on OpenGL ES 2.0 compatible Android devices. crunch currently only supports the DXT1/5/N (3DC) texture formats. crunch's higher level classes are highly specific to the DXT formats, so adding a new format is not trivial.

One of the trickier (and key) problems in adding a new GPU format to crunch is figuring out how to group blocks (using some form of cluster analysis) so they can share the same endpoints. GPU formats like DXT1 and ETC1 are riddled with block artifacts, and bad groupings can greatly amplify them. crunch for DXT has an endpoint clusterization algorithm that was refined over many tens of thousands of real-life game textures and satellite photography. I've just begun experimenting with ETC1, and so far I'm very impressed with how well behaved and versatile it is.

Note this experiment was conducted in a new data compression codebase I've been building, which is much larger than crunch's.

ETC1 Texture Compression


Unlike DXT1, which only supports 3 or 4 unique block colors, the ETC1 format supports up to 8 unique block colors. It divides the block into two 4x2 or two 2x4 pixel "subblocks". A single "flip" bit controls whether the subblocks are oriented horizontally or vertically. Each subblock has 4 colors, for 8 total.

The 4 subblock colors are created by taking the subblock's base color and adding each of the 4 signed grayscale offsets from an intensity table. Each subblock has a 3-bit index that selects which intensity table to apply. The intensity tables are constant and part of the spec.

To encode the two block colors, ETC1 supports two modes: an "individual" mode, where each color is encoded to 4:4:4, or a "differential" mode, where the first color is 5:5:5 and the second color is a two's complement encoded 3:3:3 delta relative to the first. The delta is applied before the base color is expanded to 8 bits.
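
To make the differential mode concrete, here is a minimal sketch (names are mine, not from rg_etc1 or crnlib) of expanding the two base colors in differential mode, assuming the encoder keeps each component sum within the valid 0-31 range:

    #include <cstdint>

    // Expand a 5-bit component to 8 bits by replicating its high bits.
    static uint8_t expand5(int v) { return (uint8_t)((v << 3) | (v >> 2)); }

    // r5/g5/b5: subblock 0's 5:5:5 base color (each 0-31)
    // dr3/dg3/db3: subblock 1's signed 3-bit deltas (each -4..3); the sums must stay in 0-31
    void decode_differential_base_colors(int r5, int g5, int b5,
                                         int dr3, int dg3, int db3,
                                         uint8_t c0[3], uint8_t c1[3])
    {
        c0[0] = expand5(r5);       c0[1] = expand5(g5);       c0[2] = expand5(b5);
        c1[0] = expand5(r5 + dr3); c1[1] = expand5(g5 + dg3); c1[2] = expand5(b5 + db3);
    }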

From an encoding perspective, individual mode is most useful when the two subblocks have wildly different colors (favoring color diversity vs. encoding precision), and delta mode is most useful when encoding precision is more useful than diversity.

Each pixel is represented using a 2-bit selector, just like DXT1, except in ETC1 the color selected depends on which subblock the pixel lies within.
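
Putting that together, here's a rough sketch of decoding a single texel of a subblock. The modifier values are the standard ETC1 intensity tables from the spec; for simplicity the 2-bit selector is treated here as a logical index from the most negative to the most positive modifier, which glosses over the spec's actual selector bit ordering.

    #include <cstdint>

    static const int g_etc1_modifiers[8][4] = {
        { -8,   -2,   2,   8   }, { -17,  -5,   5,   17  },
        { -29,  -9,   9,   29  }, { -42,  -13,  13,  42  },
        { -60,  -18,  18,  60  }, { -80,  -24,  24,  80  },
        { -106, -33,  33,  106 }, { -183, -47,  47,  183 }
    };

    static uint8_t clamp255(int v) { return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v)); }

    struct Color8 { uint8_t r, g, b; };

    // base_r/g/b: the subblock's base color already expanded to 8 bits
    // table: 3-bit intensity table index (0-7), selector: logical selector (0-3)
    Color8 decode_etc1_texel(int base_r, int base_g, int base_b, int table, int selector)
    {
        const int mod = g_etc1_modifiers[table][selector];
        return Color8{ clamp255(base_r + mod), clamp255(base_g + mod), clamp255(base_b + mod) };
    }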

So that's ETC1 in a nutshell. In practice, from what I remember its quality is a little lower than DXT1, but not by much. Its artifacts look more pleasant to me than DXT1's (obviously subjective). Each ETC1 block is represented by 2 colorspace lines that are always parallel to the grayscale axis. By comparison, with DXT1, there's only a single line, but it can be in any direction, and perhaps that gives it a slight advantage.

ETC1 Endpoint Clusterization


The goal here is to figure out how to reduce the total number of unique endpoints (or block colors and intensity table indices) in an ETC1 encoded image without murdering the quality. This is just an early experiment, so let's simplify the ETC1 format itself to keep things manageable. This experiment always uses differential block color mode, with the delta color set to (0,0,0). So each subblock is represented using the same 5:5:5 color and the same intensity table. The flip bit is always false. Obviously, this is going to lower quality, but let's see what happens. Note this simplified format is still 100% compatible with existing ETC1 decoders; we're just limiting ourselves to a simpler subset.

Here's the original image (kodim18 - because I remember this image being a pain to handle well in crunch for DXT1):


Here's the image encoded using high quality ETC1 compression (using rg_etc1, slow mode, perceptual colorspace metrics):


Delta:

Grayscale delta histogram:


Error: Max:  56, Mean: 2.827, MSE: 16.106, RMSE: 4.013, PSNR: 36.061

So the ETC1 encoding that takes advantage of all ETC1 features is 36.061 dB.

Here's the encoding using just diff mode, no flipping, with a (0,0,0) delta color:


Delta:


Grayscale delta histogram:


Max:  74, Mean: 3.638, MSE: 27.869, RMSE: 5.279, PSNR: 33.680

So we've lost 2.38 dB by limiting ourselves to this simpler subset of ETC1. The reduction in quality is obviously visible, but by no means fatal for the purposes of this quick experiment. 

In this experiment, each ETC1 block only contains 4 unique colors (or a single colorspace line, with "low" and "high" endpoints and 2 intermediate colors). Here's a visualization of the "low" and "high" endpoints in this image:



Now let's clusterize these block color endpoints, using 6D tree structured VQ (vector quantization) to perform the clusterization. The output of this step consists of a series of clusters, and each cluster contains one or more block indices. The idea is, blocks with similar endpoint vectors will be placed into the same cluster. This is similar to the process used by crunch for DXT1. It's much like generating an RGB color palette from an array of image colors, except we're dealing with 6D vectors instead of 3D color vectors, and instead of using the output palette directly all we really care about is how the input vectors are grouped.
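
As a rough sketch of the grouping step, here plain k-means over the 6D endpoint vectors stands in for the tree structured VQ described above (all names are illustrative):

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    typedef std::array<float, 6> Vec6;   // low.r, low.g, low.b, high.r, high.g, high.b

    static float dist2(const Vec6& a, const Vec6& b)
    {
        float d = 0.0f;
        for (int i = 0; i < 6; i++) { const float t = a[i] - b[i]; d += t * t; }
        return d;
    }

    // Returns, for each block, the index of the cluster it was assigned to.
    std::vector<uint32_t> cluster_endpoints(const std::vector<Vec6>& blocks,
                                            uint32_t num_clusters, uint32_t iters = 16)
    {
        // Crude seeding: spread the initial centroids evenly across the input order.
        std::vector<Vec6> centroids(num_clusters);
        for (uint32_t i = 0; i < num_clusters; i++)
            centroids[i] = blocks[(size_t)i * blocks.size() / num_clusters];

        std::vector<uint32_t> assignment(blocks.size(), 0);
        for (uint32_t iter = 0; iter < iters; iter++)
        {
            // Assign each block's endpoint vector to the nearest centroid.
            for (size_t b = 0; b < blocks.size(); b++)
            {
                float best = dist2(blocks[b], centroids[0]);
                uint32_t best_i = 0;
                for (uint32_t c = 1; c < num_clusters; c++)
                {
                    const float d = dist2(blocks[b], centroids[c]);
                    if (d < best) { best = d; best_i = c; }
                }
                assignment[b] = best_i;
            }

            // Move each centroid to the mean of its members.
            std::vector<Vec6> sums(num_clusters, Vec6{});
            std::vector<uint32_t> counts(num_clusters, 0);
            for (size_t b = 0; b < blocks.size(); b++)
            {
                for (int i = 0; i < 6; i++) sums[assignment[b]][i] += blocks[b][i];
                counts[assignment[b]]++;
            }
            for (uint32_t c = 0; c < num_clusters; c++)
                if (counts[c])
                    for (int i = 0; i < 6; i++) centroids[c][i] = sums[c][i] / counts[c];
        }
        return assignment;
    }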

Here's a visualization of the cluster endpoint centroid vectors after generating 32 clusters:



Once we have the image organized into block clusters containing similar endpoints, we use an internal helper class within rg_etc1 to find the near-optimal 5:5:5 endpoint and intensity table to represent all the pixels within each cluster. We can now create an ETC1-compatible texture by processing each block cluster and selecting the optimal selectors to use for each pixel.

Let's see what this texture looks like, and the PSNR, after limiting the number of unique endpoints.

ETC1 (subset) with 64 unique endpoints:



Error: Max: 110, Mean: 5.865, MSE: 70.233, RMSE: 8.380, PSNR: 29.665


ETC1 (subset) 256 unique endpoints:



Error: Max:  93, Mean: 4.624, MSE: 45.889, RMSE: 6.774, PSNR: 31.514


ETC1 (subset) 512 unique endpoints:



Error: Max:  87, Mean: 4.225, MSE: 38.411, RMSE: 6.198, PSNR: 32.286

ETC1 (subset) 1024 unique endpoints:



Error: Max:  87, Mean: 3.911, MSE: 32.967, RMSE: 5.742, PSNR: 32.950

ETC1 (subset) 4096 unique endpoints:



Error: Max:  87, Mean: 3.642, MSE: 28.037, RMSE: 5.295, PSNR: 33.654


Next Steps


This experiment shows one way to clusterize the endpoint optimization process in a limited subset of the ETC1 format. This first step must be mastered before crunch for ETC1 can be written.

The clusterization step outlined here isn't aware of flipping, or that each block can have 2 block colors, and we haven't even looked at the selectors yet. A production encoder will need to support more features of the ETC1 format. Note that crunch for DXT1 doesn't support 3 color blocks and works just fine, so it's possible we don't need to support every encoding feature.

Some next steps:

- Figure out how to best clusterize the full format. Expand the format subset to include two block colors, flipping, and both encodings. Is 6D clusterization good enough, or is 12D needed?
- Selector clusterization 
- ETC1 specific refinement stages: refine endpoints based off the clusterized endpoints, then refine the clusterized endpoints based off the clusterized selectors, possibly repeat.
- crunch-style tiling ("macroblocking") will most likely be needed to get bitrate down to JPEG+real-time encoding competitive levels.
- ETC2 support

(Currently, I'm conducting these experiments in my spare time, in between VR and optimization contracts. If you're really interested in accelerating development of crunch for a specific GPU format please contact info@binomial.info.)

Visualizing ETC1 texture compression

The ETC1 format consists of two block colors, two intensity table indices, two mode bits ("diff" and "flip"), and 16 2-bit selectors. Here are some simple visualizations of what this encoded data looks like.
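
As a logical view of that data (a hypothetical struct, not the packed 64-bit wire format):

    #include <cstdint>

    struct Etc1BlockLogical
    {
        uint8_t base_color0[3];    // 4:4:4 (individual mode) or 5:5:5 (differential mode)
        int8_t  base_color1[3];    // 4:4:4, or a signed 3:3:3 delta in differential mode
        uint8_t intensity_index0;  // 3-bit intensity table index for subblock 0
        uint8_t intensity_index1;  // 3-bit intensity table index for subblock 1
        bool    diff_mode;         // "diff" bit: differential vs. individual base colors
        bool    flip;              // "flip" bit: 4x2 vs. 2x4 subblock orientation
        uint8_t selectors[16];     // one 2-bit selector per texel
    };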

The original image (kodim14):


The ETC1 encoded image (using rg_etc1 in slow mode - modified to use perceptual colorspace metrics):


Error: Max:  63, Mean: 2.896, MSE: 19.284, RMSE: 4.391, PSNR: 35.279

Here's the selector image (the 2-bit selectors have been scaled up to 0-255):


Subblock 0's color, expanded to 8,8,8:



Subblock 0's intensity, scaled from 0-7 to 0-255:



Subblock 1's color, expanded to 8,8,8:


Subblock 1's intensity, scaled from 0-7 to 0-255:


The "diff" mode bits (white=differential mode, black=individual mode):


The "flip" mode bits (white=flipped):


More thoughts on a universal GPU texture interchange format

Just some random thoughts:

I still think the idea of a universal GPU texture compression standard is fascinating and useful. Something that can be efficiently transcoded to 2 or more major vendor formats, without sacrificing too much along the quality or compression ratio axes. Developers could just encode to this standard interchange format and ship to a large range of devices without worrying about whether GPU Y supports arcane texture format Z. (This isn't my idea, it's from Won Chun at RAD.)

Imagine, for example, a format that can be efficiently transcoded to ASTC, with an alternate mode in the transcoder that outputs BC7 as a fallback. Interestingly, imagine if this GPU texture interchange format looked a bit better (and/or transcoded more quickly) when transcoded into one of the GPU formats versus the other. This situation seems very possible in some of the designs of a universal format I've been thinking about.

Now imagine, in a few years' time, a large set of universal GPU textures gets used and stored by developers, and distributed into the wild on the web. Graphics or rendering code samples even start getting distributed using this interchange format. A situation like this would apply pressure to the GPU vendor with the inferior format to either dump it or create a newer format more compatible with efficient transcoding.

To put it simply, a universal format could help fix this mess of GPU texture formats we have today.

Direct conversion of ETC1 to DXT1 texture data

In this experiment, I limited my ETC1 encoder to only use a subset of the full format: differential mode, no flipping, with the diff color always set to (0,0,0). So all we use in the ETC1 format is the 5:5:5 base color, the 3-bit intensity table index, and the 16 2-bit selectors. This is the same subset used in this post on ETC1 endpoint clusterization.

This limits the ETC1 encoder to only utilizing 4 colors per block, just like DXT1. These 4 colors are on a line parallel to the grayscale axis. Fully lossless conversion (of this ETC1 subset format) to DXT1 is not possible in all cases, but it may be possible to do a "good enough" conversion.

The ETC1->DXT1 conversion step uses a precomputed 18-bit lookup table (5*3+3 bits) to accelerate the conversion of the ETC1 base color, intensity table index, and selectors to DXT1 low/high color endpoints and selectors. Each table entry contains the best DXT1 low/high color endpoints to use, along with a 4 entry table specifying which DXT1 selector to use for each ETC1 selector. I used crunch's DXT1 endpoint optimizer to build this table.
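
Here's a hedged sketch of what that table-driven conversion could look like (the struct layout and names are mine, not crunch's):

    #include <cstdint>

    struct Etc1ToDxt1Entry
    {
        uint16_t dxt_low;          // best DXT1 "low" color endpoint (5:6:5)
        uint16_t dxt_high;         // best DXT1 "high" color endpoint (5:6:5)
        uint8_t  selector_map[4];  // ETC1 selector (0-3) -> DXT1 selector (0-3)
    };

    // table: 1 << 18 entries (15-bit 5:5:5 base color + 3-bit intensity index), built offline
    void convert_block_to_dxt1(const Etc1ToDxt1Entry* table,
                               uint32_t base555, uint32_t intensity,
                               const uint8_t etc1_selectors[16],
                               uint16_t& dxt_low, uint16_t& dxt_high,
                               uint8_t dxt_selectors[16])
    {
        const Etc1ToDxt1Entry& e = table[(base555 << 3) | intensity];
        dxt_low = e.dxt_low;
        dxt_high = e.dxt_high;
        for (int i = 0; i < 16; i++)
            dxt_selectors[i] = e.selector_map[etc1_selectors[i]];
    }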

ETC1 (subset):

Error: Max:  80, Mean: 3.802, MSE: 30.247, RMSE: 5.500, PSNR: 33.324



Converted directly to DXT1 using the lookup table approach, then decoded (in software using crnlib):

Error: Max:  73, Mean: 3.966, MSE: 32.873, RMSE: 5.733, PSNR: 32.962


Delta image:


Grayscale delta histogram:


There are some block artifacts to work on, but this is great progress for 1 hour of work. (Honestly, I would have been pretty worried if there weren't any artifacts to figure out on my first test!)

These results are extremely promising. The next step is to work on the artifacts and do more testing. If this conversion step can be made to work well enough it means that a lossy "universal crunch" format that can be quickly and efficiently transcoded to either DXT1 or ETC1 is actually possible.

Direct conversion of ETC1 to DXT1 texture data: 2nd experiment

I lowered the ETC1 encoder's quality setting, so it doesn't try varying the block color so much during endpoint optimization. The DXT1 artifacts in my first experiment are definitely improved, although the overall quality is reduced. I also enabled usage of 3-color DXT1 blocks (although that was very minor).

Perhaps the right solution (one that preserves quality but avoids the artifacts) is to add an ETC1->DXT1 error evaluator to the ETC1 encoder, so it's aware of how much DXT1 error each ETC1 trial block color introduces.

ETC1 (subset):


Error: Max: 101, Mean: 4.036, MSE: 34.999, RMSE: 5.916, PSNR: 32.690

Converted directly to DXT1 using an 18-bit lookup table:


Error: Max: 107, Mean: 4.239, MSE: 38.930, RMSE: 6.239, PSNR: 32.228

Another ETC1:


Error: Max: 121, Mean: 4.220, MSE: 45.108, RMSE: 6.716, PSNR: 31.588

DXT1:


Error: Max: 117, Mean: 4.403, MSE: 48.206, RMSE: 6.943, PSNR: 31.300


ETC1->DXT1 encoding table error visualization

Here are two visualizations of the overall DXT1 encoding error due to using this table, assuming each selector is used equally (which is not always true). This is the lookup table referred to in my previous post.

Each small 32x32 pixel tile in this image visualizes an R,G slice of the 3D lattice; there are 32 tiles for B (left to right), and 8 rows overall. The first row of tiles is for ETC1 intensity table 0, the second for table 1, and so on.

First visualization, where the max error in each individual tile is scaled to white:


Second visualization, showing the max overall encoding error relative to all tiles:


Hmm - the last row (representing ETC1 intensity table 7) is approximated the worst in DXT1.


Direct conversion of ETC1 to DXT1 texture data: 3rd experiment

I've changed the lookup table used to convert to DXT1. Each cell in the 256K entry table (32*32*32*8, for each 5:5:5 base color and 3-bit intensity table entry in my ETC1 subset format) now contains 10 entries, to account for each combination of actually used ETC1 selector ranges in a block:

 { 0, 0 },
 { 1, 1 },
 { 2, 2 },
 { 3, 3 },
 { 0, 3 },
 { 1, 3 },
 { 2, 3 },
 { 0, 2 },
 { 0, 1 },
 { 1, 2 }

The first 4 entries here account for blocks that get encoded into a single color. The next entry accounts for blocks which use all selectors, then { 1, 3 } accounts for blocks which only use selectors 1,2,3, etc.

So for example, when converting from ETC1, if only selector 2 was actually used in a block, the ETC1->DXT1 converter uses a set of DXT1 low/high colors optimized for that particular use case. If all selectors were used, it uses entry #4, etc. The downsides to this technique are the extra CPU expense in the ETC1->DXT1 converter to determine the range of used selectors, and the extra memory to hold a larger table.
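
For illustration, here's a small sketch (names are mine) of classifying a block's used selector range before indexing into the deeper table:

    #include <algorithm>
    #include <cstdint>

    // Mirrors the 10 (low, high) selector range pairs listed above.
    static const uint8_t g_selector_ranges[10][2] = {
        { 0, 0 }, { 1, 1 }, { 2, 2 }, { 3, 3 }, { 0, 3 },
        { 1, 3 }, { 2, 3 }, { 0, 2 }, { 0, 1 }, { 1, 2 }
    };

    // Returns the index (0-9) of the selector range actually used by the block.
    int find_selector_range(const uint8_t etc1_selectors[16])
    {
        uint8_t lo = 3, hi = 0;
        for (int i = 0; i < 16; i++)
        {
            lo = std::min(lo, etc1_selectors[i]);
            hi = std::max(hi, etc1_selectors[i]);
        }
        for (int r = 0; r < 10; r++)
            if ((g_selector_ranges[r][0] == lo) && (g_selector_ranges[r][1] == hi))
                return r;
        return 4; // { 0, 3 } covers everything, so this fallback is never hit in practice
    }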

Note the ETC1 encoder is still not aware at all that its output will also be DXT1 coded. That's the next experiment. I don't think using this larger lookup table is necessary; a smaller table should hopefully be OK if the ETC1 subset encoder is aware of the DXT1 artifacts it's introducing in each trial. Another idea is to use a simple table most of the time, and only access the larger/deeper conversion table on blocks which use the brighter ETC1 intensity table indices (the ones with more error, like 5-7).

ETC1 (subset):


Error: Max:  80, Mean: 3.802, MSE: 30.247, RMSE: 5.500, PSNR: 33.324

ETC1 texture directly converted to DXT1:


Error: Max:  73, Mean: 3.939, MSE: 32.218, RMSE: 5.676, PSNR: 33.050

I experimented with allowing the DXT1 optimizer (used to build the lookup table) to use 3-color blocks. This is actually a big deal for this use case, because the transparent selector's color is black (0,0,0). ETC1's saturation to 0 or 255 after adding the intensity table values creates "strange" block colors (away from the block's colorspace line), and this trick allows the DXT1 optimizer to work around that issue better. I'm not using this trick above, though.

I started seriously looking at the BC7 texture format's details today. It's complex, but nowhere near as complex as ASTC. I'm very tempted to try converting my ETC1 subset to that format next.

Also, if you're wondering why I'm working on this stuff: I want to write one .CRN-like encoder that supports efficient transcoding into as many GPU formats as possible. It's a lot of work to write these encoders, and the idea of that work's value getting amplified across a huge range of platforms and devices is very appealing. A universal format's quality won't be the best, but it may be practical to add a losslessly encoded "fixup" chunk to the end of the universal file. This could improve quality for a specific GPU format. 

More universal GPU texture format stuff

Some BC7 format references:
https://msdn.microsoft.com/en-us/library/hh308954(v=vs.85).aspx
https://msdn.microsoft.com/en-us/library/hh308953.aspx

Source to CPU and shader BC7 (and other format) encoders/decoders:
https://github.com/Microsoft/DirectXTex

Khronos texture format references, including BC6H and BC7:
https://www.khronos.org/registry/dataformat/specs/1.1/dataformat.1.1.pdf

It may be possible to add ETC1-style subblocks into a universal GPU texture format, in a way that can be compressed efficiently and still converted on the fly to DXT1. Converting full ETC1 (with subblocks and per-subblock base colors) directly to BC7 at high quality looks easy because of BC7's partition table support. BC7 tables 0 and 13 (in 2 subset mode) perfectly match the ETC1 subblock orientations.

Any DX11 class or better GPU supports BC7, so on these GPUs the preferred output format can be BC7. DXT1 can be viewed as a legacy, lower quality fallback for older GPUs.

Also, I limited the per-block (or per-subblock) base colors to 5:5:5 to simplify the experiments in my previous posts. Maybe storing 5:5:5 (for ETC1/DXT1) with 1-3 bit per-component deltas could improve the output for BC7/ASTC.

Also, one idea for alpha channel support in a universal GPU format: Store a 2nd ETC1 texture, containing the alpha channel. There's nothing to do when converting to ETC1, because using two ETC1 textures for color+alpha is a common pattern. (And, this eats two samplers, which sucks.)

When converting to DXT5's alpha block (DXT5A blocks - and yes, I know there are BCx format equivalents, but I'm using crnlib terms here), just use another lookup table mapping the ETC1 block color/intensity index/selectors to DXT5A data. This table will be optimized for grayscale conversion. BC7 has very flexible alpha support, so it should be a straightforward conversion.

The final thing to figure out is ASTC, but OMG that format looks daunting. Reminds me of MPEG/JPEG specs.

Some memories

I remember a few years ago at one company, I was explaining and showing one of my early graphics API tracing/replaying demos (on a really cool 1st person game made by some company in Europe) to a couple "senior" engineers there. I described my plan and showed them the demo.

Both of them said it wasn't interesting, and implied I should stop now and not show what I was working on to the public.

Thanks to these two engineers, I knew for sure I had something valuable! And it turned out, this tool (and tools like it) was very useful and valuable to developers. I later showed this tool to the public and received amazingly positive feedback.

I had learned from many previous experiences that, at this particular company, resistance to new ideas was usually a sign. The harder they resisted, the more useful and interesting the technology probably was. The company had horribly stagnated, and the engineers there were, as a group, optimizing for yearly stack ranking slots (and their bonuses) and not for the actual needs of the company.

Few more random thoughts on a "universal" GPU texture format

In my experiments, a simple but usable subset of ETC1 can be easily converted to DXT1, BC7, and ATC. And after studying the standard, it very much looks like the full ETC1 format can be converted into BC7 with very little loss. (And when I say "converted", I mean using very little CPU, just basically some table lookup operations over the endpoint and selector entries.)

ASTC seems to be (at first glance) around as powerful as BC7, so converting the full ETC1 format to ASTC with very little loss should be possible. (Unfortunately ASTC is so dense and complex that I don't have time to determine this for sure yet.)

So I'm pretty confident now that a universal format could be compatible with ASTC, BC7, DXT1, ETC1, and ATC. The only other major format that I can't fit into this scheme easily is my old nemesis, PVRTC.

Obviously this format won't look as good compared to a dedicated, single format encoder's output. So what? There are many valuable use cases that don't require super high quality levels. This scheme purposely trades off a drop in quality for interchange and distribution.

Additionally, with a crunch-style encoding method, only the endpoint (and possibly the selector) codebook entries (of which there are usually only hundreds, possibly up to a few thousand in a single texture) would need to be converted to the target format. So the GPU format conversion step doesn't actually need to be insanely fast.

Another idea is to just unify ASTC and BC7, two very high quality formats. The drop in quality due to unification would be relatively much less significant with this combination. (But how valuable is this combo?)

Hierarchical clustering

One of the key algorithms in crunch is determining how to group together block endpoints into clusters. Crunch uses a bottom up clustering approach at the 8x8 pixel (or 2x2 DXTn block) "macroblock" level, then it switches to top down. The top down method is extremely sensitive to the vectors chosen to represent each block during the clusterization step. The algorithm crunch uses to compute representative vectors (used only during clusterization) was refined and tweaked over time. Badly chosen representative vectors cause the clustering step to produce crappy clusters (i.e. nasty artifacts).

Anyhow, an alternative approach would be entirely bottom up. I think this method could require less tweaking. Some reading:

https://en.wikipedia.org/wiki/Hierarchical_clustering

https://onlinecourses.science.psu.edu/stat505/node/143

Also Google "agglomerative hierarchical clustering". Here's a YouTube video describing it.
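
For reference, here's a naive sketch of bottom-up (agglomerative) clustering with centroid linkage over the same kind of 6D endpoint vectors; this O(n^3)-ish version only illustrates the idea and would be far too slow for real textures without a smarter neighbor search:

    #include <array>
    #include <cstdint>
    #include <limits>
    #include <vector>

    typedef std::array<float, 6> Vec6;

    static float dist2(const Vec6& a, const Vec6& b)
    {
        float d = 0.0f;
        for (int i = 0; i < 6; i++) { const float t = a[i] - b[i]; d += t * t; }
        return d;
    }

    struct Cluster { Vec6 centroid; std::vector<uint32_t> members; };

    std::vector<Cluster> agglomerate(const std::vector<Vec6>& blocks, size_t target_clusters)
    {
        // Start with one cluster per block.
        std::vector<Cluster> clusters(blocks.size());
        for (uint32_t i = 0; i < blocks.size(); i++)
            clusters[i] = Cluster{ blocks[i], { i } };

        while ((clusters.size() > target_clusters) && (clusters.size() > 1))
        {
            // Find the closest pair of cluster centroids.
            size_t bi = 0, bj = 1;
            float best = std::numeric_limits<float>::max();
            for (size_t i = 0; i < clusters.size(); i++)
                for (size_t j = i + 1; j < clusters.size(); j++)
                {
                    const float d = dist2(clusters[i].centroid, clusters[j].centroid);
                    if (d < best) { best = d; bi = i; bj = j; }
                }

            // Merge cluster bj into bi; the merged centroid is the weighted mean.
            Cluster& a = clusters[bi];
            Cluster& b = clusters[bj];
            const float wa = (float)a.members.size(), wb = (float)b.members.size();
            for (int k = 0; k < 6; k++)
                a.centroid[k] = (a.centroid[k] * wa + b.centroid[k] * wb) / (wa + wb);
            a.members.insert(a.members.end(), b.members.begin(), b.members.end());
            clusters.erase(clusters.begin() + bj);
        }
        return clusters;
    }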

Direct conversion of ETC1 to DXT1 texture data: 4th experiment

In this experiment, I've worked on reducing the size of the lookup table used to quickly convert a subset of ETC1 texture data (using only a single 5:5:5 base color, one 3-bit intensity table index, and 2-bit selectors) directly to DXT1 texture data. Now the ETC1 encoder is able to simultaneously optimize for both formats, and due to this I can reduce the size of the conversion table. To accomplish this, I've modified the ETC1 base color/intensity optimizer function so it also factors in the DXT1 block encoding error into each trial's computed ETC1 error.

The overall trial error reported back to the encoder in this experiment was etc_error*16+dxt_error. The ETC1->DXT1 lookup table is now 3.75MB, with precomputed DXT1 low/high endpoints for three used selector ranges: 0-3, 0-2, 1-3. My previous experiment had 10 precomputed ranges, which seemed impractically large. I'm unsure which set of ranges is really needed or optimal yet. Even just one (0-3) seems to work OK, but with more artifacts on very high contrast blocks.

Anyhow, here's kodim18.

ETC1 subset:


Max:  80, Mean: 3.809, MSE: 30.663, RMSE: 5.537, PSNR: 33.265

DXT1:


Max:  76, Mean: 3.952, MSE: 32.806, RMSE: 5.728, PSNR: 32.971

ETC1 block selector range usage histogram:
0-3: 19161
1-3: 3012
0-2: 2403


Idea for next texture compression experiment

Right now, I've got a GPU texture in a simple ETC1 subset that is easily converted to most other GPU formats:

Base color: 15-bits, 5:5:5 RGB
Intensity table index: 3-bits
Selectors: 2-bits/texel

Most importantly, this is a "single subset" encoding, using BC7 terminology. BC7 supports between 1-3 subsets per block. A subset is just a colorspace line represented by two R,G,B endpoint colors.

This format is easily converted to DXT1 using a table lookup. It's also the "base" of the universal GPU texture format I've been thinking about, because it's the data needed for DXT1 support. The next step is to experiment with attempting to refine this base data to better take advantage of the full ETC1 specification. So let's try adding two subsets to each block, with two partitions (again using BC7 terminology), top/bottom or left/right, which are supported by both ETC1 and BC7.

For example, we can code this base color, then delta code the 2 subset colors relative to this base. We'll also add a couple more intensity indices, which can be delta coded against the base index. Another bit can indicate which ETC1 block color encoding "mode" should be used (individual 4:4:4 4:4:4 or differential 5:5:5 3:3:3) to represent the subset colors in the output block.

In DXT1 mode, we can ignore this extra delta coded data and just convert the basic (single subset) base format. In ETC1/BC7/ASTC modes, we can use the extra information to support 2 subsets and 2 partitions.

Currently, the idea is to share the same selector indices between the single subset (DXT1) and two subset (BC7/ASTC/full ETC1) encodings. This will constrain how well this idea works, but I think it's worth trying out.

To add more quality to the 2 subset mode, we can delta code (maybe with some fancy per-pixel prediction) another array of selectors in some way. We can also add support for more partitions (derived from BC7's or ASTC's), too.
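
To make the proposal a bit more concrete, here's a very rough sketch of the per-block data the two layers might carry. Everything here beyond what's stated above (field names, exact widths, signedness) is an assumption, purely for illustration:

    #include <cstdint>

    // Base layer: the "single subset" ETC1-subset data every target format can use.
    struct UniversalBaseBlock
    {
        uint16_t base_color_555;    // 5:5:5 RGB base color (15 bits used)
        uint8_t  intensity_index;   // 3-bit ETC1 intensity table index
        uint32_t selectors;         // 16 x 2-bit selectors, shared with the upgraded modes
    };

    // Optional "upgrade" layer for full ETC1/BC7/ASTC output (hypothetical field widths).
    struct UniversalUpgradeBlock
    {
        uint8_t partition;          // 0 = top/bottom, 1 = left/right
        uint8_t color_mode;         // 0 = differential 5:5:5 3:3:3, 1 = individual 4:4:4 4:4:4
        int8_t  delta_color[2][3];  // per-subset color deltas, coded relative to the base color
        int8_t  delta_intensity[2]; // per-subset intensity index deltas vs. the base index
    };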

etcpak

etcpak is a very fast, but low quality ETC1 (and a little bit of ETC2) compressor:

https://bitbucket.org/wolfpld/etcpak/wiki/Home

It's the fastest open source ETC1 encoder that I'm aware of.

Notice the lack of any PSNR/MSE/SSIM statistics anywhere (that I can see). Also, the developer doesn't seem to get that the other tools/libraries he compares his stuff against were optimized for quality, not raw speed. In particular, rg_etc1 (and crunch's ETC1 support) was tuned to compete against the reference encoder along both the quality and perf. axes.

Anyhow, there are some interesting things to learn from etcpak:

  • Best quality doesn't always matter. It obviously depends on your use case. If you have 10 gigs of textures to compress then iteration speed can be very important.
  • The value spectrum spans from highest quality/slow encode (to ship final assets) to crap quality/fast as hell encode (favoring iteration speed). 
  • Visually, the ETC1/2 formats are nicely forgiving. Even a low quality ETC1 encoder produces decent enough looking output for many use cases.

ETC1 principal axis optimization

One potential (probably minor) optimization to ETC1 encoding: determine the principal axis of the entire texture, rotate the texture's RGB pixels (by treating them as 3D vectors) so this axis is aligned along the grayscale axis, then compress the texture as usual. The pixel shader can undo the rotation using a trivial handful of instructions.

ETC1 uses colorspace lines constrained to be parallel to the grayscale axis, which this optimization exploits.
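
Here's a rough sketch of the first half of that idea: estimating the texture's principal color axis with a few power iterations on the 3x3 RGB covariance matrix. Building the rotation that maps this axis onto the grayscale axis (and undoing it in the pixel shader) is left out.

    #include <cmath>
    #include <vector>

    struct Vec3 { float x, y, z; };

    Vec3 principal_axis(const std::vector<Vec3>& pixels)
    {
        if (pixels.empty())
            return Vec3{ 0.57735f, 0.57735f, 0.57735f }; // fall back to the grayscale axis

        // Mean color.
        Vec3 mean = { 0.0f, 0.0f, 0.0f };
        for (const Vec3& p : pixels) { mean.x += p.x; mean.y += p.y; mean.z += p.z; }
        const float inv_n = 1.0f / (float)pixels.size();
        mean.x *= inv_n; mean.y *= inv_n; mean.z *= inv_n;

        // 3x3 covariance matrix (symmetric; upper triangle: xx, xy, xz, yy, yz, zz).
        double c[6] = { 0, 0, 0, 0, 0, 0 };
        for (const Vec3& p : pixels)
        {
            const double x = p.x - mean.x, y = p.y - mean.y, z = p.z - mean.z;
            c[0] += x * x; c[1] += x * y; c[2] += x * z;
            c[3] += y * y; c[4] += y * z; c[5] += z * z;
        }

        // Power iteration, seeded with the grayscale axis.
        Vec3 v = { 0.57735f, 0.57735f, 0.57735f };
        for (int i = 0; i < 8; i++)
        {
            Vec3 w;
            w.x = (float)(c[0] * v.x + c[1] * v.y + c[2] * v.z);
            w.y = (float)(c[1] * v.x + c[3] * v.y + c[4] * v.z);
            w.z = (float)(c[2] * v.x + c[4] * v.y + c[5] * v.z);
            const float len = std::sqrt(w.x * w.x + w.y * w.y + w.z * w.z);
            if (len < 1e-12f) break; // degenerate (e.g. solid color) texture
            v.x = w.x / len; v.y = w.y / len; v.z = w.z / len;
        }
        return v;
    }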

Universal texture compression: 5th experiment

I outlined a plan for my next texture compression experiment in a previous post, here. I modified my ETC1 packer so it accepts an optional parameter which forces the encoder to use a set of predetermined selectors, instead of allowing it to use whatever selectors it likes.

The idea is, I can take an ETC1 texture using a subset of the full-format (no flips and only a single base color/intensity index - basically a single partition/single subset format using BC7 terminology) and "upgrade" it to higher quality without modifying the selector indices. I think this is one critical step to making a practical universal texture format that supports both DXT1 and ETC1.

Turns out, this idea works better than I thought it would. The ETC1 subset encoding gets 33.265 dB, while the "upgraded" version (using the same selectors as the subset encoding) gets 34.315 dB, a big gain. (Which isn't surprising, because the ETC1 subset encoding doesn't take full advantage of the format.) The nearly-optimal ETC1 encoding gets 35.475 dB, so there is still some quality left on the table here.

The ETC1 subset to DXT1 converted texture is 32.971 dB. I'm not worried about having the best DXT1 quality, because I'm going to support ASTC and BC7 too and (at the minimum) they can be directly converted from the "upgraded" ETC1 encoding that this experiment is about.

I need to think about the next step from here. I now know I can build a crunch-like format that supports DXT1, ETC1, and ATC. These experiments have opened up a bunch of interesting product and open source library ideas. Proving that BC7 support is also practical to add should be easy. ASTC is so darned complex that I'm hesitant to do it for "fun".

1. ETC1 (subset):


Max:  80, Mean: 3.809, MSE: 30.663, RMSE: 5.537, PSNR: 33.265

Its selectors:


2. ETC1 (full format, constrained selectors) - optimizer was constrained to always use the subset encoding's selectors:


Max:  85, Mean: 3.435, MSE: 24.076, RMSE: 4.907, PSNR: 34.315

Its selectors (should be the same as #1's):


Biased delta between the ETC1 subset and ETC1 full encoding with constrained selectors - so we can see what pixels have benefited from the "upgrade" pass:


3. ETC1 (full format, unconstrained selectors) - packed using an improved version of rg_etc1 in highest quality mode:


Max:  80, Mean: 3.007, MSE: 18.432, RMSE: 4.293, PSNR: 35.475

Delta between the best ETC1 encoding (#3) and the ETC1 encoding using constrained selectors (#2):


Google's new ETC2 codec looks awesome
