clarify zstd specification for Huffman blocks

Following detailed comments from @dweiller in #3508.
This commit is contained in:
Yann Collet 2023-02-18 18:16:00 -08:00
parent 4ebaf36582
commit 832f559b0b

View File

@ -16,7 +16,7 @@ Distribution of this document is unlimited.
### Version
0.3.7 (2020-12-09)
0.3.8 (2023-02-18)
Introduction
@ -470,6 +470,7 @@ This field uses 2 lowest bits of first byte, describing 4 different block types
repeated `Regenerated_Size` times.
- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
starting with a Huffman tree description.
In this mode, there are at least 2 different literals represented in the Huffman tree description.
See details below.
- `Treeless_Literals_Block` - This is a Huffman-compressed block,
using Huffman tree _from previous Huffman-compressed literals block_.
@ -566,6 +567,7 @@ or from a dictionary.
### `Huffman_Tree_Description`
This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
The tree describes the weights of all literals symbols that can be present in the literals block, at least 2 and up to 256.
The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
The size of `Huffman_Tree_Description` is determined during decoding process,
it must be used to determine where streams begin.
@ -1197,7 +1199,7 @@ Huffman Coding
--------------
Zstandard Huffman-coded streams are read backwards,
similar to the FSE bitstreams.
Therefore, to find the start of the bitstream, it is therefore to
Therefore, to find the start of the bitstream, it is required to
know the offset of the last byte of the Huffman-coded stream.
After writing the last bit containing information, the compressor
@ -1239,9 +1241,15 @@ Transformation from `Weight` to `Number_of_Bits` follows this formula :
```
Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
```
The last symbol's `Weight` is deduced from previously decoded ones,
by completing to the nearest power of 2.
This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
When a literal value is not present, it receives a `Weight` of 0.
The least frequent symbol receives a `Weight` of 1.
Consequently, the `Weight` 1 is necessarily present.
The most frequent symbol receives a `Weight` anywhere between 1 and 11 (max).
The last symbol's `Weight` is deduced from previously retrieved Weights,
by completing to the nearest power of 2. It's necessarily non 0.
If it's not possible to reach a clean power of 2 with a single `Weight` value,
the Huffman Tree Description is considered invalid.
This final power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
`Max_Number_of_Bits` must be <= 11,
otherwise the representation is considered corrupted.
@ -1254,7 +1262,7 @@ Let's presume the following Huffman tree must be described :
The tree depth is 4, since its longest elements uses 4 bits
(longest elements are the one with smallest frequency).
Value `5` will not be listed, as it can be determined from values for 0-4,
Literal value `5` will not be listed, as it can be determined from previous values 0-4,
nor will values above `5` as they are all 0.
Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
Weight formula is :
@ -1274,7 +1282,7 @@ The `Weight` of `5` can be determined by advancing to the next power of 2.
The sum of `2^(Weight-1)` (excluding 0's) is :
`8 + 4 + 2 + 0 + 1 = 15`.
Nearest larger power of 2 value is 16.
Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 16-15 = 1`.
Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`.
#### Huffman Tree header
@ -1683,6 +1691,7 @@ or at least provide a meaningful error code explaining for which reason it canno
Version changes
---------------
- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.
- 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878
- 0.3.6 : clarifications for Dictionary_ID
- 0.3.5 : clarifications for Block_Maximum_Size