* Compressor saves the most recently used Huffman table and reuses it
if it produces better results; a sketch of this selection logic is
given after the benchmark table below.
* I attempted to preserve the CPU usage profile and intentionally left
all of the existing heuristics in place, so there is only a speed
difference from the second block onward.
When compressing large enough blocks (say >= 4 KiB) there is
no significant difference in compression speed.
Dictionary compression of a single block is the same speed for blocks
with <= 1 KiB of literals, and beyond that the difference is not
very significant.
* On the synthetic data, with blocks 10 KB or smaller, most blocks
can't reuse the previous table because the previous block is missing
at least one symbol that the current block contains.
Once blocks reach about 12 KB, most previous blocks have Huffman
tables that are valid for the current block, and the compression
ratio and decompression speed jump.
* On silesia, blocks as small as 4 KB can frequently reuse the
previous Huffman table (it is valid about 85% of the time), but reuse
isn't as profitable there, so the previous table is only actually
chosen about 3% of the time.
* Microbenchmarks show that `HUF_validateCTable()` takes ~55 ns
and `HUF_estimateCompressedSize()` takes ~35 ns.
They are decently well optimized; the first versions took 90 ns
and 120 ns respectively. `HUF_validateCTable()` could be twice as
fast if we cast the `HUF_CElt*` to a `U32*` and compare to 0.
However, `U32` has an alignment of 4 instead of 2, so I think that
might be undefined behavior.
* I've run `zstreamtest` compiled normally, with UASAN, and with MSAN
for 4 hours each.
The worst case for the speed difference is a bunch of small blocks
in the same frame. I modified `bench.c` to compress the input in a
single frame but with blocks of the given block size, set by `-B`.
Benchmarks on level 1:
| Program   | Block size (bytes) | Corpus    | Ratio | Compression MB/s | Decompression MB/s |
|-----------|--------------------|-----------|-------|------------------|--------------------|
| zstd.base | 256 | synthetic | 2.364 | 110.0 | 297.0 |
| zstd | 256 | synthetic | 2.367 | 108.9 | 297.0 |
| zstd.base | 256 | silesia | 2.204 | 93.8 | 415.7 |
| zstd | 256 | silesia | 2.204 | 93.4 | 415.7 |
| zstd.base | 512 | synthetic | 2.594 | 144.2 | 420.0 |
| zstd | 512 | synthetic | 2.599 | 141.5 | 425.7 |
| zstd.base | 512 | silesia | 2.358 | 118.4 | 432.6 |
| zstd | 512 | silesia | 2.358 | 119.8 | 432.6 |
| zstd.base | 1024 | synthetic | 2.790 | 192.3 | 594.1 |
| zstd | 1024 | synthetic | 2.794 | 192.3 | 600.0 |
| zstd.base | 1024 | silesia | 2.524 | 148.2 | 464.2 |
| zstd | 1024 | silesia | 2.525 | 148.2 | 467.6 |
| zstd.base | 4096 | synthetic | 3.023 | 300.0 | 1000.0 |
| zstd | 4096 | synthetic | 3.024 | 300.0 | 1010.1 |
| zstd.base | 4096 | silesia | 2.779 | 223.1 | 623.5 |
| zstd | 4096 | silesia | 2.779 | 223.1 | 636.0 |
| zstd.base | 16384 | synthetic | 3.131 | 350.0 | 1150.1 |
| zstd | 16384 | synthetic | 3.152 | 350.0 | 1630.3 |
| zstd.base | 16384 | silesia | 2.871 | 296.5 | 883.3 |
| zstd | 16384 | silesia | 2.872 | 294.4 | 898.3 |
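
For reference, here is a minimal, self-contained sketch of the selection
logic described in the first bullet. The helper names here are hypothetical
stand-ins for zstd's internal `HUF_validateCTable()` and
`HUF_estimateCompressedSize()`; the real compressor's cost accounting
(table header size, thresholds) differs.

```c
#include <stddef.h>

#define MAX_SYMBOL 255

/* A previous table is only reusable if every symbol present in the
 * current block already has a code assigned (nonzero code length). */
static int table_is_valid(const unsigned char bits[MAX_SYMBOL + 1],
                          const unsigned count[MAX_SYMBOL + 1])
{
    int s;
    for (s = 0; s <= MAX_SYMBOL; ++s)
        if (count[s] != 0 && bits[s] == 0) return 0;
    return 1;
}

/* Estimated payload size in bytes when coding the block with the given
 * code lengths (ignores the cost of transmitting the table itself). */
static size_t estimate_size(const unsigned char bits[MAX_SYMBOL + 1],
                            const unsigned count[MAX_SYMBOL + 1])
{
    size_t totalBits = 0;
    int s;
    for (s = 0; s <= MAX_SYMBOL; ++s)
        totalBits += (size_t)count[s] * bits[s];
    return (totalBits + 7) / 8;
}

/* Reuse the previous table when it is valid for the current histogram
 * and no larger than building and transmitting a fresh one. */
static int should_reuse_prev_table(const unsigned char prevBits[MAX_SYMBOL + 1],
                                   const unsigned char newBits[MAX_SYMBOL + 1],
                                   const unsigned count[MAX_SYMBOL + 1],
                                   size_t newTableHeaderSize)
{
    if (!table_is_valid(prevBits, count)) return 0;
    return estimate_size(prevBits, count)
        <= estimate_size(newBits, count) + newTableHeaderSize;
}
```
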
When ZSTD_decompressStream() detects that there is enough space in dst
to complete decompression in a single pass, it delegates to
ZSTD_decompress(), for an extra ~5% speed boost.
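
For context, here is a minimal usage sketch (not part of the change) of the
case that triggers this fast path: the caller passes the whole compressed
frame and a dst buffer at least as large as the decoded content, so a single
call can complete the frame.

```c
#include <zstd.h>

/* Returns the number of decompressed bytes, or an error code.
 * Assumes src holds one complete frame and dstCapacity is at least
 * the frame's decompressed size. */
static size_t decompress_in_one_pass(void* dst, size_t dstCapacity,
                                     const void* src, size_t srcSize)
{
    ZSTD_DStream* const zds = ZSTD_createDStream();
    ZSTD_inBuffer  in  = { src, srcSize, 0 };
    ZSTD_outBuffer out = { dst, dstCapacity, 0 };
    size_t hint;
    if (zds == NULL) return (size_t)(-1);
    ZSTD_initDStream(zds);
    /* With the whole frame in `in` and enough room in `out`, one call is
     * enough (return value 0 means the frame is finished); this is the
     * situation where the internal delegation to ZSTD_decompress() applies. */
    hint = ZSTD_decompressStream(zds, &out, &in);
    ZSTD_freeDStream(zds);
    return ZSTD_isError(hint) ? hint : out.pos;
}
```
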
There used to be a (very small) chance that loading a prefix from a
previous segment would be confused with a real zstd dictionary.
For that to happen, the prefix needs to start with the same value as
the dictionary magic number.
That's 1 chance in 4 billion if all values have equal probability.
In fact, since some values are more common (0x00000000 for example)
and others are less common, and the dictionary magic was selected to be
one of the less common ones, the probability is likely even lower.
Anyway, this risk is now down to zero
thanks to a new CCtx parameter: ZSTD_p_forceRawDict.
Current parameter policy: the parameter "sticks" to its CCtx,
so any dictionary loaded after ZSTD_p_forceRawDict is set
will be loaded in "raw" ("content only") mode,
even if the CCtx is re-used multiple times with multiple different
dictionaries.
It's up to the user to reset this value if later loads should behave
differently.
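
As a rough illustration of that policy, the sketch below assumes the
experimental `ZSTD_setCCtxParameter()` entry point of that era together with
`ZSTD_compress_usingDict()`; treat the setter name as an assumption, since
advanced parameters moved to `ZSTD_CCtx_setParameter()` in later releases.

```c
#include <zstd.h>

/* Sketch only: ZSTD_setCCtxParameter() / ZSTD_p_forceRawDict belong to the
 * experimental API of that era and may be absent in other versions. */
static size_t compress_with_raw_prefix(ZSTD_CCtx* cctx,
                                       void* dst, size_t dstCapacity,
                                       const void* src, size_t srcSize,
                                       const void* prefix, size_t prefixSize)
{
    /* Once set, the flag sticks to the CCtx: every subsequent dictionary
     * load is treated as raw content, even if the prefix happens to start
     * with the dictionary magic number. */
    ZSTD_setCCtxParameter(cctx, ZSTD_p_forceRawDict, 1);
    /* It is up to the caller to reset the flag to 0 if later dictionary
     * loads on this CCtx should be interpreted normally again. */
    return ZSTD_compress_usingDict(cctx, dst, dstCapacity, src, srcSize,
                                   prefix, prefixSize, 1 /* compression level */);
}
```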