This is a compression algorithm focused on finding long-distance matches.
It is based on lz4 and uses nearly the same block format (github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md). The offset is encoded in four bytes rather than the two used by lz4, to accommodate longer match distances. The block format is described in ldm.h.
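To illustrate why the wider offset field matters, here is a minimal sketch of storing and reading a four-byte offset. The helper names `write_offset32` and `read_offset32` are hypothetical, and the little-endian byte order is an assumption carried over from lz4; ldm.h remains the authoritative description of the format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store a match offset in four bytes.
 * Little-endian order is an assumption (lz4 stores its two-byte
 * offset little-endian); check ldm.h for the actual layout. */
static size_t write_offset32(uint8_t *dst, uint32_t offset) {
    dst[0] = (uint8_t)(offset & 0xFF);
    dst[1] = (uint8_t)((offset >> 8) & 0xFF);
    dst[2] = (uint8_t)((offset >> 16) & 0xFF);
    dst[3] = (uint8_t)((offset >> 24) & 0xFF);
    return 4; /* number of bytes written */
}

/* Hypothetical helper: read back a four-byte little-endian offset. */
static uint32_t read_offset32(const uint8_t *src) {
    return (uint32_t)src[0]
         | ((uint32_t)src[1] << 8)
         | ((uint32_t)src[2] << 16)
         | ((uint32_t)src[3] << 24);
}
```

A two-byte field cannot represent offsets above 65535, whereas a four-byte field covers any offset up to 2^32 - 1, which is more than enough for a window of 1 << 28 bytes.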
Build
Run make.
Compressing a file
ldm <filename>
Decompression and verification can be enabled by defining DECOMPRESS_AND_VERIFY in main.c.
The output file names are as follows:
- <filename>.ldm: compressed file
- <filename>.ldm.dec: decompressed file
Parameters
Several parameters can be tuned, either in ldm.h or, if ldm_params.h is included, in ldm_params.h (for easier configuration).
The parameters are as follows and must all be defined:
- LDM_MEMORY_USAGE: the memory usage of the underlying hash table in bytes.
- HASH_BUCKET_SIZE_LOG: the log size of each bucket in the hash table (used in collision resolution).
- LDM_LAG: the lag (in bytes) in inserting entries into the hash table.
- LDM_WINDOW_SIZE_LOG: the log maximum window size when searching for matches.
- LDM_MIN_MATCH_LENGTH: the minimum match length.
- INSERT_BY_TAG: insert entries into the hash table as a function of the hash. This increases speed by reducing the number of hash table lookups and match comparisons. Certain hashes will never be inserted.
- USE_CHECKSUM: store a checksum with the hash table entries for faster comparison. This halves the number of entries the hash table can contain.
The optional parameter HASH_ONLY_EVERY_LOG is the log inverse frequency of insertion into the hash table. That is, an entry is inserted approximately every 1 << HASH_ONLY_EVERY_LOG times. If this parameter is not defined, the value is computed as a function of the window size and memory usage to approximate an even coverage of the window.
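As a rough illustration of how such a default could be derived, the sketch below computes an insertion frequency that spreads the available table entries evenly over the window, and shows one way a hash could gate insertion. Everything here is an assumption rather than the actual code in ldm.h: the function names are hypothetical, the memory usage is treated as a log2 value, and the entry size is passed in as a log2 number of bytes (for example 3 for an 8-byte entry).

```c
#include <stdint.h>

/* Hypothetical: choose the insertion frequency so that the table's
 * entries cover the window evenly.
 *   entries = 2^(memory_usage_log - entry_size_log)
 *   insert once every (window size / entries) positions. */
static unsigned default_hash_only_every_log(unsigned window_size_log,
                                            unsigned memory_usage_log,
                                            unsigned entry_size_log) {
    unsigned entries_log = memory_usage_log - entry_size_log;
    /* If the table can hold the whole window, insert at every position. */
    return window_size_log > entries_log ? window_size_log - entries_log : 0;
}

/* Hypothetical: insert only when the low every_log bits of the hash are
 * all set, which selects roughly one position in 1 << every_log.
 * Assumes every_log < 32. A real implementation would likely use bits
 * that are not also used to pick the bucket. */
static int should_insert(uint32_t hash, unsigned every_log) {
    uint32_t mask = ((uint32_t)1 << every_log) - 1;
    return (hash & mask) == mask;
}
```

Under these assumptions, a window log of 28, a memory usage log of 23, and 8-byte entries give 28 - (23 - 3) = 8, that is, about one insertion every 256 positions.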
Benchmark
Below is a comparison of various compression methods on a tar of four versions of llvm (versions 3.9.0, 3.9.1, 4.0.0, 4.0.1) with a total size of 727900160 B.
| Method | Size (B) | Ratio |
|---|---|---|
| lrzip -p 32 -n -w 1 | 369968714 | 1.97 |
| ldm | 209391361 | 3.48 |
| lz4 | 189954338 | 3.83 |
| lrzip -p 32 -l -w 1 | 163940343 | 4.44 |
| zstd -1 | 126080293 | 5.77 |
| lrzip -p 32 -n | 124821009 | 5.83 |
| lrzip -p 32 -n -w 1 & zstd -1 | 120317909 | 6.05 |
| zstd -3 -o | 115290952 | 6.31 |
| lrzip -p 32 -g -L 9 -w 1 | 107168979 | 6.79 |
| zstd -6 -o | 102772098 | 7.08 |
| zstd -T16 -9 | 98040470 | 7.42 |
| lrzip -p 32 -n -w 1 & zstd -T32 -19 | 88050289 | 8.27 |
| zstd -T32 -19 | 83626098 | 8.70 |
| lrzip -p 32 -n & zstd -1 | 36335117 | 20.03 |
| ldm & zstd -6 | 32856232 | 22.15 |
| lrzip -p 32 -g -L 9 | 32243594 | 22.58 |
| lrzip -p 32 -n & zstd -6 | 30954572 | 23.52 |
| lrzip -p 32 -n & zstd -T32 -19 | 26472064 | 27.50 |
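The Ratio column is the original size divided by the compressed size. A minimal sketch of that arithmetic follows; the function name `compression_ratio` is hypothetical and not part of ldm.

```c
#include <stdint.h>

/* Hypothetical helper: compression ratio as original / compressed size.
 * The caller must pass a non-zero compressed size. */
static double compression_ratio(uint64_t original_size,
                                uint64_t compressed_size) {
    return (double)original_size / (double)compressed_size;
}
```

For the ldm row, `compression_ratio(727900160, 209391361)` is approximately 3.476, which rounds to the 3.48 shown in the table.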
The method marked ldm was run with the following parameters:
| Parameter | Value |
|---|---|
| LDM_MEMORY_USAGE | 23 |
| HASH_BUCKET_SIZE_LOG | 3 |
| LDM_LAG | 0 |
| LDM_WINDOW_SIZE_LOG | 28 |
| LDM_MIN_MATCH_LENGTH | 64 |
| INSERT_BY_TAG | 1 |
| USE_CHECKSUM | 1 |
The compression speed was 220.5 MB/s.
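For reference, the benchmark configuration could be written as a set of defines along the following lines. This is a sketch only: the values are copied from the parameter table in this section, but the exact form each option is expected to take (for instance, whether INSERT_BY_TAG and USE_CHECKSUM are meant to be defined as 0 or 1) is an assumption and should be checked against ldm.h.

```c
/* Sketch of an ldm_params.h holding the benchmark configuration.
 * Values come from the parameter table in this README; the 0/1 style
 * for the boolean-like options is an assumption. */
#define LDM_MEMORY_USAGE      23
#define HASH_BUCKET_SIZE_LOG  3
#define LDM_LAG               0
#define LDM_WINDOW_SIZE_LOG   28
#define LDM_MIN_MATCH_LENGTH  64
#define INSERT_BY_TAG         1
#define USE_CHECKSUM          1
```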
Parameter selection
Below is a brief discussion of the effects of the parameters on the speed and compression ratio.
Speed
A major speed bottleneck is finding candidate matches and comparing them to check whether they meet the minimum match length. Generally:
- The fewer matches found (or the lower the percentage of the literals matched), the slower the algorithm runs.
- Increasing HASH_ONLY_EVERY_LOG results in fewer inserts and, if INSERT_BY_TAG is set, fewer lookups in the table. This has a large effect on speed, as well as compression ratio.
- If HASH_ONLY_EVERY_LOG is not set, its value is calculated based on LDM_WINDOW_SIZE_LOG and LDM_MEMORY_USAGE. Increasing LDM_WINDOW_SIZE_LOG has the effect of increasing HASH_ONLY_EVERY_LOG and increasing LDM_MEMORY_USAGE decreases HASH_ONLY_EVERY_LOG.
- USE_CHECKSUM generally improves speed with hash table lookups.
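To make the role of the checksum concrete, the sketch below scans one hash bucket and runs the expensive byte-by-byte comparison only for entries whose stored checksum equals the checksum at the current position. It is an illustration under stated assumptions, not the code in ldm.h: the `entry_t` layout, the names `count_match` and `find_match`, and the constants are all hypothetical, and empty slots are not treated specially.

```c
#include <stddef.h>
#include <stdint.h>

#define BUCKET_SIZE_LOG   3                       /* hypothetical, mirrors HASH_BUCKET_SIZE_LOG */
#define BUCKET_SIZE       (1u << BUCKET_SIZE_LOG)
#define MIN_MATCH_LENGTH  64                      /* hypothetical, mirrors LDM_MIN_MATCH_LENGTH */

/* Hypothetical hash table entry: a position in the input plus a checksum. */
typedef struct {
    uint32_t offset;    /* candidate position, relative to the input base */
    uint32_t checksum;  /* checksum of the bytes hashed at that position */
} entry_t;

/* Count how many bytes match between an earlier position a and the
 * current position b, stopping at the end of the input. Requires a < b. */
static size_t count_match(const uint8_t *a, const uint8_t *b,
                          const uint8_t *end) {
    size_t n = 0;
    while (b + n < end && a[n] == b[n]) n++;
    return n;
}

/* Return the first entry in the bucket that yields a long enough match,
 * or NULL. The cheap checksum test filters out most candidates before
 * any bytes are compared. */
static const entry_t *find_match(const entry_t *bucket, const uint8_t *base,
                                 const uint8_t *cur, const uint8_t *end,
                                 uint32_t checksum, size_t *match_len) {
    for (unsigned i = 0; i < BUCKET_SIZE; i++) {
        if (bucket[i].checksum != checksum) continue;  /* cheap reject */
        const uint8_t *cand = base + bucket[i].offset;
        if (cand >= cur) continue;                     /* only look backwards */
        size_t len = count_match(cand, cur, end);
        if (len >= MIN_MATCH_LENGTH) {
            *match_len = len;
            return &bucket[i];
        }
    }
    return NULL;
}
```

The trade-off noted in the parameter list applies here: storing the checksum doubles the entry size in this layout, so the same amount of memory holds half as many entries.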
Compression ratio
The compression ratio is highly correlated with the coverage of matches. As a long distance matcher, the algorithm was designed to "optimize" for long distance matches outside the zstd compression window. The compression ratio after recompressing the output of the long-distance matcher with zstd was a more important signal in development than the raw compression ratio itself.
Generally, increasing LDM_MEMORY_USAGE will improve the compression ratio. However, when using the default computed value of HASH_ONLY_EVERY_LOG, this increases the frequency of insertion and lookup in the table and thus may result in a decrease in speed.
Below is a table showing the speed and compression ratio when compressing the llvm tar (as described above) using different settings for LDM_MEMORY_USAGE. The other parameters were the same as used in the benchmark above.
| LDM_MEMORY_USAGE | Ratio | Speed (MB/s) | Ratio after zstd -6 |
|---|---|---|---|
| 18 | 1.85 | 232.4 | 10.92 |
| 21 | 2.79 | 233.9 | 15.92 |
| 23 | 3.48 | 220.5 | 18.29 |
| 25 | 4.56 | 140.8 | 19.21 |
Compression statistics
Compression statistics (and the configuration) can be enabled/disabled via COMPUTE_STATS and OUTPUT_CONFIGURATION in ldm.h.