credit to oss-fuzz
This issue could happen when using the new Sequence Compression API in Explicit Delimiter Mode
with a too small dstCapacity.
In which case, there was one place where the buffer size wasn't checked.
discovered by oss-fuzz
It's a bug in the test itself :
ZSTD_compressBound() as an upper bound of the compress size
only works for data compressed "normally".
But in situations where many flushes are forcefully introduced,
this creates many more blocks,
each of which has a potential to increase the size by 3 bytes.
In extreme cases (lots of small incompressible blocks), the expansion can go beyond ZSTD_compressBound().
This situation is similar when using the CompressSequences() API
with Explicit Block Delimiters.
In which case, each explicit block acts like a deliberate flush.
When employed by a fuzzer, it's possible to generate scenarios like the one described above,
with tons of incompressible blocks of small sizes,
thus going beyond ZSTD_compressBound().
fix : when using Explicit Block Delimiters, use a larger bound, to account for this scenario.
credit to oss-fuzz
In rare circumstances, the block-splitter might cut a block at the exact beginning of a repcode.
In which case, since litlength=0, if the repcode expected 1+ literals in front, its signification changes.
This scenario is controlled in ZSTD_seqStore_resolveOffCodes(),
and the repcode is transformed into a raw offset when its new meaning is incorrect.
In more complex scenarios, the previous block might be emitted as uncompressed after all,
thus modifying the expected repcode history.
In the case discovered by oss-fuzz, the first block is emitted as uncompressed,
so the repcode history remains at default values: 1,4,8.
But since the starting repcode is repcode3, and the literal length is == 0,
its meaning is : = repcode1 - 1.
Since repcode1==1, it results in an offset value of 0, which is invalid.
So that's what the `assert()` was verifying : the result of the repcode translation should be a valid offset.
But actually, it doesn't matter, because this result will then be compared to reality,
and since it's an invalid offset, it will necessarily be discarded if incorrect,
then the repcode will be replaced by a raw offset.
So the `assert()` is not useful.
Furthermore, it's incorrect, because it assumes this situation cannot happen, but it does, as described in above scenario.
which properly represents the maximum bit size of compressed literals (11) as defined in the specification.
To be preferred from HUF_TABLELOG_DEFAULT which represents the same value but by accident.
Name selected to keep the same convention as existing width definitions,
MLFSELog, LLFSELog and OffFSELog.
In rare cases, the default huffman depth selector is a bit too harsh,
requiring brutal adaptations to the tree,
resulting is some loss of compression ratio.
This new heuristic avoids the worse cases, favoring compression ratio.
As an example, compression of a specific distribution of 771 literals
is now improved to 441 bytes, from 601 bytes before.
oss-fuzz uncovered a scenario where we're evaluating the cost of litLength = 131072,
which can't be represented in the zstd format, so we accessed 1 beyond LL_bits.
Fix the issue by making it cost 1 bit more than litLength = 131071.
There are still follow ups:
1. This happened because literals_cost[0] = 0, so the optimal parser chose 36 literals
over a match. Should we bound literals_cost[literal] > 0, unless the block truly only
has one literal value?
2. When no matches are found, the cost model isn't updated. In this case no matches were
found for an entire block. So the literals cost model wasn't updated at all. That made
the optimal parser think literals_cost[0] = 0, where it is actually quite high, since
the block was entirely random noise.
Credit to OSS-Fuzz.
When re-using a compression state, across multiple successive compressions,
the state should minimize the amount of allocation and initialization required.
This mostly matters in situations where initialization is an overwhelming task
compared to compression itself.
This can happen when the amount to compress is small,
while the compression state was given the impression that it would be much larger,
aka, streaming mode without providing a srcSize hint.
This lean-initialization optimization was broken in 980f3bbf8354edec0ad32b4430800f330185de6a .
This commit fixes it, making this scenario once again on par with v1.4.9.
Note that this does not completely fix#2966,
since another heavy initialization, specific to row mode,
is also happening (and was not present in v1.4.9).
This will be fixed in a separate commit.