bsdiff is a simple delta compression algorithm, and it performs well compared to its open and closed source competitors. (Note that bsdiff doesn't scale to large files due to its memory requirements, but that's somewhat fixable by breaking the problem up into blocks.) It also beats LZHAM in static dictionary mode by a large margin, and I want to understand why.
It conceptually works in three high-level phases:
1. Build suffix array of original file
2. Preprocess the new file, using the suffix array to find exact or approximate matches against the original file. This results in a series of 24-byte patch control commands, followed by a "diff" block (bytes added to various regions from the original file) and an "extra" block (unmatched literal data).
3. Separately compress this data using bzip2, with a 512KB block size.
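The first two phases can be sketched in a few lines of Python. This is a naive illustration, not bsdiff's actual implementation (the real thing uses Larsson-Sadakane qsufsort and also extends matches backwards and approximately); the helper names are my own:

```python
def build_suffix_array(data: bytes):
    """Phase 1: suffix array of the original file.
    Naive O(n^2 log n) construction -- fine for a demo, not for 50 MB inputs."""
    return sorted(range(len(data)), key=lambda i: data[i:])

def longest_match(old: bytes, sa, pattern: bytes):
    """Phase 2 core: binary-search the suffix array for the longest prefix of
    `pattern` that occurs in `old`.  Returns (offset_in_old, match_length)."""
    lo, hi = 0, len(sa)
    while lo < hi:                        # insertion point for pattern
        mid = (lo + hi) // 2
        if old[sa[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid

    def common_prefix(i):
        suf = old[sa[i]:]
        n = 0
        while n < len(suf) and n < len(pattern) and suf[n] == pattern[n]:
            n += 1
        return n

    # The best match is adjacent to the insertion point.
    best = (0, 0)
    for i in (lo - 1, lo):
        if 0 <= i < len(sa):
            n = common_prefix(i)
            if n > best[1]:
                best = (sa[i], n)
    return best
```

The encoder walks the new file, querying `longest_match` at each position to decide how to split the output into matched (diff) and unmatched (extra) regions.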
As a quick experiment, switching step 3 to LZMA or LZHAM results in easy gains (no surprise as bzip2 is pretty dated):
Original file: Unity510.exe 52,680,480
New file: Unity512.exe 52,740,568
bsdiff preprocess+lzma: 3,012,581
bsdiff preprocess+lzham_devel: 3,152,188
bsdiff preprocess+bzip2: 3,810,343
lzham_devel (in delta mode, seed dict=Unity510.exe): 4,831,025
The bsdiff patch control blocks consist of a series of Add/Copy/Skip triples (x, y, z). Each command consists of three 8-byte values:
- Add x bytes from the original file to x bytes from the diff block, write results to output
(Literally - add each individual byte.)
- Copy y bytes from the extra block to output
- Skip forwards or backwards z bytes in the original file
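Each of those 8-byte values is stored little-endian in sign-and-magnitude form, with the sign carried in the high bit of the final byte (this mirrors bsdiff's offtin routine; the function name here is mine):

```python
def read_ctrl_value(buf: bytes) -> int:
    """Decode one 8-byte bsdiff control value: little-endian magnitude,
    sign in the high bit of the last byte (sign-and-magnitude,
    not two's complement)."""
    y = int.from_bytes(buf[:8], "little")
    if y & (1 << 63):
        y = -(y ^ (1 << 63))   # clear the sign bit, negate the magnitude
    return y
```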
Most of the output consists of diff bytes in this test:
Commands: 33,111 (794,664 bytes)
Diff bytes: 51,023,396
Extra bytes: 1,717,172
The bsdiff control block data can be viewed as a sort of program that outputs the new file, given the old file and a list of "addcopyskip x,y,z" instructions along with three automatically updated machine "registers": the offsets into the old file data, the diff bytes block, and the extra bytes blocks. bsdiff then compresses this program and the two data blocks (diff+extra) individually with bzip2.
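That "program" view is easy to make concrete. Here is a minimal sketch of the decoder loop, assuming the three streams have already been decompressed and the control triples parsed into Python ints (so it skips the bzip2 and 8-byte decoding layers):

```python
def apply_patch(old: bytes, ctrl, diff: bytes, extra: bytes) -> bytes:
    """Sketch of the bsdiff decoder as a tiny VM.  `ctrl` is a list of
    (x, y, z) triples; the three 'registers' are the cursors into the
    old file, the diff block, and the extra block."""
    out = bytearray()
    old_pos = diff_pos = extra_pos = 0
    for x, y, z in ctrl:
        # Add: byte-wise sum of x old-file bytes and x diff bytes (mod 256)
        for i in range(x):
            out.append((old[old_pos + i] + diff[diff_pos + i]) & 0xFF)
        old_pos += x
        diff_pos += x
        # Copy: y literal bytes from the extra block
        out += extra[extra_pos:extra_pos + y]
        extra_pos += y
        # Skip: move the old-file cursor forwards or backwards by z
        old_pos += z
    return bytes(out)
```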
I think there are several key reasons why bsdiff works well relative to LZHAM with a static dictionary (and LZMA too, except it doesn't have explicit support for static dictionaries):
- LZHAM has no explicit way of referencing the original file (the seed dictionary). It must encode high match distances to reach back into the seed dictionary, beyond the already decoded data. This is expensive, and grows increasingly so as the dictionary grows larger.
- bsdiff is able to move its original file pointer either forwards or backwards, by encoding the distance to travel and a sign bit. LZHAM can only use previous match distances (from the 4-entry LRU match history buffer), or it must spend many bits coding absolute match distances.
- LZHAM doesn't have delta matches (bsdiff's add operation) to efficiently encode the "second order" changes discussed in Colin Percival's thesis. The closest it gets is its special handling of single-byte LAMs (literals after matches): it XORs the mismatch byte against the lookahead byte and codes the result with Huffman coding. But immediately after a LAM it always falls back to plain literals or exact matches.
- The bsdiff approach is able to apply the full bzip2 machinery to the diff bytes. In this file, most diff bytes are 0 with sprinklings of 2 or 3 byte deltas. Patterns are readily apparent in the delta data.
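A quick synthetic demonstration of why this matters: two large buffers that differ by small scattered edits yield a byte-wise delta that is almost entirely zeros, and bzip2 compresses that delta dramatically better than the raw data. (The buffers and edit pattern here are made up for illustration; they are not the Unity binaries from the test above.)

```python
import bz2
import random

random.seed(0)
old = bytes(random.randrange(256) for _ in range(1 << 16))

# "New version": the same data with a small edit every 512 bytes.
new = bytearray(old)
for pos in range(0, len(new), 512):
    new[pos] = (new[pos] + 3) & 0xFF

# Byte-wise delta (mod 256) -- what bsdiff's diff block effectively contains.
delta = bytes((n - o) & 0xFF for n, o in zip(new, old))

print(delta.count(0) / len(delta))      # ~0.998: almost all zeros
print(len(bz2.compress(bytes(new))))    # roughly input-sized: random data is incompressible
print(len(bz2.compress(delta)))         # tiny by comparison
```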
After some thought, one way to enhance LZHAM with stronger support for delta compression: support multibyte LAM's (perhaps by outputting the LAM length with arithmetic coding when in LAM mode), and maybe add the ability to encode REP matches with delta distances.
Or, use LZHAM or LZMA as-is and do the delta compression step as a preprocess, just as bsdiff does with bzip2.