Improved tiny block performance (i.e. calling lzham_decompress_memory() on an 88 byte string) by 3.3x, by profiling the decompression of 400k blocks and focusing on the hotspots.
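For context, the kind of loop I'm timing looks roughly like this (a minimal sketch against lzham.h's in-memory API, not my actual test harness; the 88-byte payload is an arbitrary placeholder and only the obviously-needed params fields are filled in):

```cpp
#include <cstdio>
#include <cstring>
#include <ctime>
#include "lzham.h"

int main()
{
    // 88-byte test payload (contents are arbitrary placeholders).
    lzham_uint8 src[88];
    for (size_t i = 0; i < sizeof(src); i++)
        src[i] = (lzham_uint8)(i * 7);

    lzham_compress_params cp;
    memset(&cp, 0, sizeof(cp));
    cp.m_struct_size = sizeof(cp);
    cp.m_dict_size_log2 = LZHAM_MIN_DICT_SIZE_LOG2;
    cp.m_level = LZHAM_COMP_LEVEL_UBER;

    lzham_uint8 comp[512];
    size_t comp_len = sizeof(comp);
    if (lzham_compress_memory(&cp, comp, &comp_len, src, sizeof(src), NULL) != LZHAM_COMP_STATUS_SUCCESS)
        return 1;

    lzham_decompress_params dp;
    memset(&dp, 0, sizeof(dp));
    dp.m_struct_size = sizeof(dp);
    dp.m_dict_size_log2 = LZHAM_MIN_DICT_SIZE_LOG2;

    // Decompress the same tiny block 400k times and time it.
    lzham_uint8 dst[88];
    clock_t start = clock();
    for (int i = 0; i < 400000; i++)
    {
        size_t dst_len = sizeof(dst);
        lzham_decompress_memory(&dp, dst, &dst_len, comp, comp_len, NULL);
    }
    printf("400k tiny-block decompressions: %.2f secs\n",
           (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}
```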
The top bottleneck was initializing the default Huffman tables, as I expected. At decompression startup every symbol has a frequency of 1, so the min/max code sizes are either equal (when the table's num_syms is a power of 2) or differ by only 1 (for non-pow2 tables). That made this case easy to special-case throughout the Huffman decoder's table construction code.
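To see why this case is trivial: with n equal-frequency symbols the optimal code lengths can be written down in closed form, with no heap or sorting needed. Here's a sketch (init_uniform_code_sizes is a hypothetical helper illustrating the idea, not LZHAM's actual table code; which symbols receive the shorter codes is just a canonical-ordering convention):

```cpp
#include <cstdint>
#include <vector>

// At decompressor startup every symbol has frequency 1, so with n
// symbols the code lengths are all L = log2(n) when n is a power of
// two; otherwise they are a mix of L = floor(log2(n)) and L+1 chosen
// to satisfy the Kraft equality.
static void init_uniform_code_sizes(uint32_t num_syms, std::vector<uint8_t> &code_sizes)
{
    code_sizes.resize(num_syms);

    // len = floor(log2(num_syms))
    uint32_t len = 0;
    while ((1u << (len + 1)) <= num_syms)
        len++;

    if ((1u << len) == num_syms)
    {
        // Power-of-2 table: every code is exactly len bits.
        for (uint32_t i = 0; i < num_syms; i++)
            code_sizes[i] = (uint8_t)len;
    }
    else
    {
        // Non-pow2 table: the Kraft equality
        //   num_short/2^len + num_long/2^(len+1) == 1
        // gives (2^(len+1) - num_syms) codes of len bits, rest len+1.
        uint32_t num_short = (1u << (len + 1)) - num_syms;
        for (uint32_t i = 0; i < num_syms; i++)
            code_sizes[i] = (uint8_t)((i < num_short) ? len : len + 1);
    }
}
```

For example, an 88-symbol table gets 40 codes of 6 bits and 48 codes of 7 bits: 40/64 + 48/128 = 1.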
The next major bottleneck was a series of calls to malloc()/free() (up to a total of 34K worth, excluding the dictionary when using unbuffered decompression). I fixed this by adding a malloc_context parameter to every object or container that allocates/frees memory (which was a big pain), then allowing the user to optionally specify a fixed-size memory arena when they create the malloc context. The allocator functions in lzham_mem.cpp then try to allocate from this arena, which is treated as a simple stack. Only the decompressor uses an arena, because its allocation patterns are very simple.
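The idea is roughly this (a hypothetical sketch, not the actual lzham_mem.cpp code; the real version also has to deal with realloc, alignment requirements, size tracking, and so on):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// A malloc context that carves allocations out of a caller-supplied
// fixed-size arena, treated as a simple stack, and falls back to
// malloc() when the arena is absent or exhausted.
struct malloc_context
{
    uint8_t *m_arena;      // caller-supplied buffer (may be NULL)
    size_t   m_arena_size;
    size_t   m_arena_ofs;  // current top of the stack
};

static void *ctx_alloc(malloc_context *ctx, size_t size)
{
    // Keep allocations pointer-aligned.
    size = (size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);

    if ((ctx->m_arena) && ((ctx->m_arena_ofs + size) <= ctx->m_arena_size))
    {
        void *p = ctx->m_arena + ctx->m_arena_ofs;
        ctx->m_arena_ofs += size;
        return p;
    }
    return malloc(size); // arena full (or absent): fall back to the heap
}

static void ctx_free(malloc_context *ctx, void *p)
{
    // Arena blocks are reclaimed en masse when the context is reset,
    // so only heap fallbacks need a real free().
    uint8_t *q = (uint8_t *)p;
    if ((ctx->m_arena) && (q >= ctx->m_arena) && (q < ctx->m_arena + ctx->m_arena_size))
        return;
    free(p);
}
```

This works for the decompressor because its allocations are few, small, and effectively live for the lifetime of the decompress call, so a bump-the-stack-and-reset scheme is all it needs.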
I won't be pushing these changes up until I've done a lot more testing. I should probably make a branch.