libsquish (a popular DXT encoding library) internally uses a total-ordering-based method to find high-quality DXT endpoints. The same method can be applied to ETC1 encoding: using the equations in the remarks in rg_etc1's optimizer, solve for the optimal subblock color given each possible selector distribution in the total ordering, along with the current best intensity index and subblock color.
I don't actually compute the total ordering. Instead, I iterate over all selector distributions present in the total ordering, because the actual per-pixel selector values don't matter to the solver. A hash table prevents the optimizer from evaluating a trial solution more than once.
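A minimal sketch of that enumeration, under several assumptions that are not spelled out above: an 8-pixel subblock with 4 selectors, a closed-form solve of "average color minus average selector delta" (my reading of the rg_etc1 remarks, ignoring clamping), and simple rounding to 5 bits per channel. The function names are hypothetical, and a Python set stands in for the hash table.

```python
# ETC1 intensity modifier tables (one row per intensity index).
ETC1_MODIFIERS = [
    (-8, -2, 2, 8), (-17, -5, 5, 17), (-29, -9, 9, 29), (-42, -13, 13, 42),
    (-60, -18, 18, 60), (-80, -24, 24, 80), (-106, -33, 33, 106),
    (-183, -47, 47, 183),
]

def selector_distributions(num_pixels=8, num_selectors=4):
    """Yield every tuple of per-selector pixel counts summing to num_pixels."""
    def rec(remaining, slots):
        if slots == 1:
            yield (remaining,)
            return
        for n in range(remaining + 1):
            for rest in rec(remaining - n, slots - 1):
                yield (n,) + rest
    yield from rec(num_pixels, num_selectors)

def candidate_base_colors(avg_color, table_index, limit=31):
    """Yield each distinct quantized trial base color exactly once.

    avg_color is the subblock's mean (r, g, b) in 0..255. Only the counts
    matter here, not which pixel uses which selector.
    """
    modifiers = ETC1_MODIFIERS[table_index]
    seen = set()  # stands in for the hash table of already-tried solutions
    for counts in selector_distributions():
        total = sum(counts)
        avg_delta = sum(n * m for n, m in zip(counts, modifiers)) / total
        # Assumed solve: base = avg_color - avg_delta, then quantize and clamp.
        trial = tuple(
            min(limit, max(0, int((c - avg_delta) * limit / 255.0 + 0.5)))
            for c in avg_color
        )
        if trial in seen:
            continue
        seen.add(trial)
        yield trial
```

Each trial color would then be scored against the real pixels by picking the best selector per pixel, and the lowest-error trial kept. With these assumptions there are only 165 distributions per intensity table, and the set usually collapses many of them to the same quantized color.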
Single-threaded results:
perceptual: 0 etc2: 0 rec709: 1
Source filename: kodak\kodim03.png 768x512
--- basislib Quality: 4
basislib time: 5.644
basislib ETC image Error: Max: 70, Mean: 1.952, MSE: 8.220, RMSE: 2.867, PSNR: 38.982, SSIM: 0.964853
--- etc2comp effort: 10
etc2comp time: 75.792851
etc2comp Error: Max: 75, Mean: 1.925, MSE: 8.009, RMSE: 2.830, PSNR: 39.095, SSIM: 0.965339
--- etcpak time: 0.006
etcpak Error: Max: 80, Mean: 2.492, MSE: 12.757, RMSE: 3.572, PSNR: 37.073, SSIM: 0.944697
--- ispc_etc time: 1.021655
ispc_etc1 Error: Max: 75, Mean: 1.965, MSE: 8.280, RMSE: 2.877, PSNR: 38.951, SSIM: 0.963916
After enabling multithreading (40 threads) in those encoders that support it:
J:\dev\basislib1\bin>texexp kodak\kodim03.png
perceptual: 0 etc2: 0 rec709: 1
Source filename: kodak\kodim03.png 768x512
--- basislib Quality: 4
basislib pack time: 0.266
basislib ETC image Error: Max: 70, Mean: 1.952, MSE: 8.220, RMSE: 2.867, PSNR: 38.982, SSIM: 0.964853
--- etc2comp effort: 10
etc2comp time: 3.608819
etc2comp Error: Max: 75, Mean: 1.925, MSE: 8.009, RMSE: 2.830, PSNR: 39.095, SSIM: 0.965339
--- etcpak time: 0.006
etcpak Error: Max: 80, Mean: 2.492, MSE: 12.757, RMSE: 3.572, PSNR: 37.073, SSIM: 0.944697
--- ispc_etc time: 1.054324
ispc_etc1 Error: Max: 75, Mean: 1.965, MSE: 8.280, RMSE: 2.877, PSNR: 38.951, SSIM: 0.963916
Intel is doing some kind of amazing SIMD dark magic in ispc_etc1. The ETC1 cluster fit method is around 10-27x faster than rg_etc1 (which uses my previous method, a hybrid of a 3D neighborhood search with iterative base color refinement) and etc2comp (effort 100) in ETC1 mode. RGB Avg. PSNR is usually within ~0.1 dB of Intel.
I'm so tempted to update rg_etc1 with this algorithm, if only I had the time.