Some minor observations about Lagrangian multiplier-based RDO (with BC7 RDO+Deflate or LZ4):
We're optimizing to find the lowest t (sometimes called j), given the many hundreds/thousands of ways of encoding a BC7 block:
float t = trial_mse * smooth_block_error_scale + trial_lz_bits * lambda;
For each trial block, we compute its MSE and estimate its LZ bits using a simple Deflate/LZ4-like model.
If we already have a potential solution for a block (the best found so far), then given the trial block's MSE and the current best_t we can compute the maximum number of bits a new trial encoding could take and still be an improvement. If that threshold bit count is ridiculous (negative, or simply impossible for Deflate to achieve on a 128-bit block input), we can immediately throw out the trial block:
threshold_trial_lz_bits = (best_t - trial_mse * smooth_block_error_scale) / lambda
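
Here's a minimal sketch of how that early-out might look in the trial loop. The names (Trial, pick_best_trial, estimate_lz_bits, MIN_LZ_BITS_128BIT_BLOCK) and the lower-bound constant are assumptions for illustration, not the actual encoder's code:

#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical trial record; a real encoder would also carry the packed BC7 bits.
struct Trial { float mse; };

// Assumed lower bound on what Deflate/LZ4 could ever output for a 128-bit block
// (the exact value depends on the bit model used).
const float MIN_LZ_BITS_128BIT_BLOCK = 24.0f;

size_t pick_best_trial(const std::vector<Trial>& trials,
    const std::function<float(const Trial&)>& estimate_lz_bits,
    float smooth_block_error_scale, float lambda)
{
    float best_t = std::numeric_limits<float>::max();
    size_t best_index = 0;

    for (size_t i = 0; i < trials.size(); ++i)
    {
        const float trial_mse = trials[i].mse;

        // Max LZ bits this trial could cost and still beat best_t.
        const float threshold_trial_lz_bits =
            (best_t - trial_mse * smooth_block_error_scale) / lambda;

        // Early out: skip the (relatively expensive) LZ bit estimate if the
        // threshold is negative or impossible to hit.
        if (threshold_trial_lz_bits < MIN_LZ_BITS_128BIT_BLOCK)
            continue;

        const float trial_lz_bits = estimate_lz_bits(trials[i]);
        const float t = trial_mse * smooth_block_error_scale + trial_lz_bits * lambda;
        if (t < best_t)
        {
            best_t = t;
            best_index = i;
        }
    }

    return best_index;
}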
Same for MSE: if we already have a solution, we can compute the MSE threshold above which it's impossible for a trial to be an improvement:
threshold_trial_mse = (best_t - (trial_lz_bits * lambda)) / smooth_block_error_scale
This seems less valuable because running the LZ simulator to compute trial_lz_bits is likely more expensive than computing a trial block's MSE. We could plug in the lowest possible estimate for trial_lz_bits and use the result as a threshold MSE. Another interesting thing about this: trials are very likely to have an MSE >= that of the best encoding found so far for a block.
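
Continuing the sketch above, one way this could look: plug the assumed MIN_LZ_BITS_128BIT_BLOCK lower bound into the formula to get an MSE ceiling, and reject any trial whose MSE alone already exceeds it, before the LZ bit estimate:

// Inside the trial loop, before calling estimate_lz_bits():
const float threshold_trial_mse =
    (best_t - MIN_LZ_BITS_128BIT_BLOCK * lambda) / smooth_block_error_scale;

// No compression result, however good, can rescue a trial whose MSE is
// already above this ceiling.
if (trial_mse > threshold_trial_mse)
    continue;

Since this threshold depends only on best_t (and constants), it can be recomputed only when best_t improves rather than once per trial.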
Using simple formulas like these results in large perf. improvements (~2x).