I've optimized bc7enc_rdo's RDO BC7 encoder a bunch over the past few days. I've also added multithreading via an OpenMP parallel for, which really helps.
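Here is a minimal sketch of an OpenMP `parallel for` over independent 4x4 blocks. `RgbaBlock`, `Bc7Block`, `encode_block_placeholder` and `encode_all_blocks` are hypothetical names, not bc7enc_rdo's API. The sketch assumes each block can be encoded independently. An RDO pass that matches against previously emitted blocks inside a sliding window has cross-block dependencies, so this is not a description of how bc7enc_rdo actually splits up its work.

```cpp
#include <array>
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <vector>

// One 4x4 block of RGBA8 texels (16 texels * 4 bytes).
using RgbaBlock = std::array<uint8_t, 64>;

// An encoded BC7 block is always 128 bits (16 bytes).
struct Bc7Block {
    std::array<uint8_t, 16> bytes{};
};

// HYPOTHETICAL placeholder: a real encoder would search BC7 modes,
// partitions and endpoints here. This just copies each texel's red
// channel so the parallel driver below can be exercised.
Bc7Block encode_block_placeholder(const RgbaBlock& src) {
    Bc7Block out;
    for (std::size_t t = 0; t < 16; ++t) out.bytes[t] = src[t * 4];
    return out;
}

// Encodes every block independently. With OpenMP enabled (e.g. -fopenmp)
// the iterations are split across threads; without it the pragma is
// ignored and the loop runs serially with identical results.
std::vector<Bc7Block> encode_all_blocks(const std::vector<RgbaBlock>& blocks) {
    assert(blocks.size() <= static_cast<std::size_t>(INT_MAX));
    std::vector<Bc7Block> out(blocks.size());
    const int n = static_cast<int>(blocks.size());
    // Dynamic scheduling helps if per-block cost varies
    // (e.g. an encoder that early-outs on solid-color blocks).
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; ++i) {
        out[i] = encode_block_placeholder(blocks[i]);
    }
    return out;
}
```

Each thread writes only to its own pre-sized slot in `out`, so no locking is needed and the output order matches the input order regardless of thread count.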
RDO BC7+Deflate (4KB replacement window size)
33.551 dB RGB PSNR, 3.75 bits/texel
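The two numbers in these captions can be computed along the following lines. This assumes "RGB PSNR" means the MSE pooled over the R, G and B samples (alpha ignored) against a peak value of 255, and that "bits/texel" means the Deflate-compressed size in bits divided by the texel count. Both are common conventions, but bc7enc_rdo's exact computation may differ, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// PSNR over the RGB channels of two interleaved RGBA8 images; alpha is ignored.
// Returns +infinity when the images are identical (MSE == 0).
double rgb_psnr(const std::vector<uint8_t>& a, const std::vector<uint8_t>& b) {
    assert(a.size() == b.size() && a.size() % 4 == 0);
    double sse = 0.0;
    for (std::size_t i = 0; i < a.size(); i += 4) {
        for (std::size_t c = 0; c < 3; ++c) {  // R, G, B only
            const double d = double(a[i + c]) - double(b[i + c]);
            sse += d * d;
        }
    }
    const std::size_t samples = (a.size() / 4) * 3;
    if (samples == 0 || sse == 0.0) return std::numeric_limits<double>::infinity();
    const double mse = sse / double(samples);
    return 10.0 * std::log10(255.0 * 255.0 / mse);
}

// Compressed size in bits divided by the number of texels.
double bits_per_texel(std::size_t compressed_bytes, std::size_t width, std::size_t height) {
    return 8.0 * double(compressed_bytes) / double(width * height);
}
```

For reference, raw BC7 is 128 bits per 16 texels, i.e. 8 bits/texel, so 3.75 bits/texel means Deflate shrank the RDO-encoded BC7 data to about 47% of its original size (3.75 / 8 ≈ 0.47).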
One could argue that at PSNRs this low you should just use BC1, but about 10% of the blocks in this RDO BC7 encoding use mode 1 (2 subsets). BC1 can only represent a single pair of color endpoints per 4x4 block, so it will look blockier even at a similar PSNR.
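The blockiness difference follows from the per-block bit budgets. BC1 spends 64 bits on one pair of RGB565 endpoints shared by all 16 texels. BC7 mode 1 spends 128 bits on two independent endpoint pairs, one per subset, with the texel-to-subset assignment chosen from 64 partition patterns:

$$
\text{BC1: } \underbrace{2 \cdot 16}_{\text{2 RGB565 endpoints}} + \underbrace{16 \cdot 2}_{\text{2-bit indices}} = 64 \text{ bits}
$$

$$
\text{BC7 mode 1: } \underbrace{2}_{\text{mode}} + \underbrace{6}_{\text{partition}} + \underbrace{2 \cdot 2 \cdot 3 \cdot 6}_{\text{4 RGB666 endpoints}} + \underbrace{2}_{\text{shared p-bits}} + \underbrace{16 \cdot 3 - 2}_{\text{3-bit indices}} = 128 \text{ bits}
$$

A mode 1 block can therefore cover an edge between two unrelated colors without forcing both onto a single line in color space, which is exactly the case where BC1 produces visible block artifacts.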
31.319 dB, 3.25 bits/texel: