Just got it working for BC1. It took about 15 minutes of copying & pasting the BC7 ERT, then modifying it to decode BC1 instead of BC7 blocks and to ignore the decoded alpha. The ERT function is roughly 250 lines of code, and for BC1 it would be easily vectorizable (far easier than for BC7, because decoding BC1 is simple).
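For reference, BC1 decoding really is tiny. Here's a minimal sketch (not the ERT's actual decoder; real decoders differ slightly in interpolation rounding, which I've idealized here):

```cpp
#include <cstdint>
#include <cstring>

// Expand a packed 5:6:5 color to 8-bit RGB.
static void unpack_565(uint16_t c, uint8_t rgb[3])
{
    rgb[0] = (uint8_t)((((c >> 11) & 31) * 255 + 15) / 31);
    rgb[1] = (uint8_t)((((c >> 5) & 63) * 255 + 31) / 63);
    rgb[2] = (uint8_t)(((c & 31) * 255 + 15) / 31);
}

// Decode one 8-byte BC1 block into 16 RGB pixels, ignoring alpha (the
// 3-color mode's "transparent black" just becomes black here). It's all
// straight-line integer math, which is why a SIMD version decoding
// several blocks at once is straightforward.
void decode_bc1_block(const uint8_t block[8], uint8_t pixels[16][3])
{
    const uint16_t c0 = (uint16_t)(block[0] | (block[1] << 8));
    const uint16_t c1 = (uint16_t)(block[2] | (block[3] << 8));

    uint8_t pal[4][3];
    unpack_565(c0, pal[0]);
    unpack_565(c1, pal[1]);

    for (int i = 0; i < 3; i++)
    {
        if (c0 > c1)
        {
            // 4-color mode: interpolants at 1/3 and 2/3 between the
            // endpoints (rounding idealized).
            pal[2][i] = (uint8_t)((2 * pal[0][i] + pal[1][i] + 1) / 3);
            pal[3][i] = (uint8_t)((pal[0][i] + 2 * pal[1][i] + 1) / 3);
        }
        else
        {
            // 3-color mode: midpoint, plus transparent black.
            pal[2][i] = (uint8_t)((pal[0][i] + pal[1][i] + 1) / 2);
            pal[3][i] = 0;
        }
    }

    // 32 selector bits, 2 per pixel, pixel 0 in the low bits of byte 4.
    uint32_t sels = (uint32_t)block[4] | ((uint32_t)block[5] << 8)
                  | ((uint32_t)block[6] << 16) | ((uint32_t)block[7] << 24);
    for (int i = 0; i < 16; i++, sels >>= 2)
        memcpy(pixels[i], pal[sels & 3], 3);
}
```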
This implementation differs from the BC7 ERT in one simple way: the bytes copied from previously encoded blocks are allowed to move around within the current block. This is slower to encode but gives the encoder more freedom. I'm going to ship both options (move vs. nomove).
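Roughly, the per-block search looks like this. To be clear, this is a simplified sketch, not the shipping ~250-line function: the distortion metric, the toy rate model, and my reading of "nomove" as keeping the copied run at its source alignment are all stand-ins:

```cpp
#include <cstdint>
#include <cstring>

// The decoder sketched above.
void decode_bc1_block(const uint8_t block[8], uint8_t pixels[16][3]);

// Assumed distortion metric: mean squared RGB error vs. the originals.
static float compute_mse(const uint8_t a[16][3], const uint8_t b[16][3])
{
    float e = 0.0f;
    for (int i = 0; i < 16; i++)
        for (int c = 0; c < 3; c++)
        {
            float d = (float)a[i][c] - (float)b[i][c];
            e += d * d;
        }
    return e / 48.0f;
}

// Assumed toy rate model: literal bytes cost ~8 bits each, while a
// match of any length costs a flat ~16 bits (offset + length), which
// is roughly how Deflate-style coders behave.
static float estimate_block_bits(size_t match_len)
{
    return (float)(8 - match_len) * 8.0f + (match_len ? 16.0f : 0.0f);
}

// One ERT pass over a single 8-byte BC1 block: try substituting byte
// runs taken from previously encoded blocks and keep whichever trial
// minimizes the Lagrangian cost D + lambda * R. With allow_move the
// copied run may land at any destination offset inside the block;
// without it, the run stays at its source alignment.
void ert_block_pass(uint8_t* blocks, size_t cur_block,
                    const uint8_t orig_pixels[16][3],
                    float lambda, size_t window_blocks, bool allow_move)
{
    uint8_t* cur = blocks + cur_block * 8;

    uint8_t dec[16][3];
    decode_bc1_block(cur, dec);
    float best_cost = compute_mse(dec, orig_pixels)
                    + lambda * estimate_block_bits(0);

    uint8_t best[8];
    memcpy(best, cur, 8);

    // Match window: bytes of the previously encoded blocks only.
    size_t first = (cur_block > window_blocks) ? cur_block - window_blocks : 0;
    const uint8_t* win = blocks + first * 8;
    size_t win_size = (cur_block - first) * 8;

    for (size_t src = 0; src < win_size; src++)
        for (size_t len = 3; len <= 8 && src + len <= win_size; len++)
        {
            size_t dst_lo = allow_move ? 0 : src % 8;
            size_t dst_hi = allow_move ? 8 - len : src % 8;
            for (size_t dst = dst_lo; dst <= dst_hi && dst + len <= 8; dst++)
            {
                uint8_t trial[8];
                memcpy(trial, cur, 8);
                memcpy(trial + dst, win + src, len);

                decode_bc1_block(trial, dec);
                float cost = compute_mse(dec, orig_pixels)
                           + lambda * estimate_block_bits(len);
                if (cost < best_cost)
                {
                    best_cost = cost;
                    memcpy(best, trial, 8);
                }
            }
        }

    memcpy(cur, best, 8);
}
```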
Here's an encode at lambda=1.0: 2.02 bits/texel (Deflate), 34.426 RGB dB. For comparison, normal BC1 (rgbcx.cpp level 18) is 3.00 bits/texel, 35.742 dB, and normal BC1 level 2 (which should be very close to stb_dxt) gets 3.01 bits/texel, 35.086 dB. So if you're willing to lose a little quality, you can get large savings.
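For context, lambda here is the usual Lagrangian rate-distortion weight: assuming the standard formulation, a trial block is accepted when it lowers the combined cost

```latex
J = D + \lambda R
```

where D is the block's distortion vs. the original pixels and R is its estimated coded size in bits. At lambda=1.0 the trade works out to giving up about 1.32 dB vs. level 18 for a roughly 33% smaller file (2.02 vs. 3.00 bits/texel).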
I'll have this checked in tomorrow after more benchmarking and smooth block tuning.
I've been thinking about a simple, elegant, universal rate-distortion optimizing transform for GPU texture data for the past year, since working on UASTC and BC1 RDO. It's nice to see it working so well on two different GPU texture formats. ETC1/2, PVRTC1, LDR/HDR ASTC, and BC6H are coming.
And a more aggressive encode: 1.77 bits/texel, 32.891 dB (-L18 -b -z4.0 -zb17.0 -zc2048):