clarify zstd specification for Huffman blocks

Following detailed comments from @dweiller in #3508.
Yann Collet 2023-02-18 18:16:00 -08:00
parent 4ebaf36582
commit 832f559b0b

@ -16,7 +16,7 @@ Distribution of this document is unlimited.
### Version ### Version
-0.3.7 (2020-12-09)
+0.3.8 (2023-02-18)
Introduction Introduction
@ -470,6 +470,7 @@ This field uses 2 lowest bits of first byte, describing 4 different block types
repeated `Regenerated_Size` times. repeated `Regenerated_Size` times.
- `Compressed_Literals_Block` - This is a standard Huffman-compressed block, - `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
starting with a Huffman tree description. starting with a Huffman tree description.
In this mode, there are at least 2 different literals represented in the Huffman tree description.
See details below. See details below.
- `Treeless_Literals_Block` - This is a Huffman-compressed block, - `Treeless_Literals_Block` - This is a Huffman-compressed block,
using Huffman tree _from previous Huffman-compressed literals block_. using Huffman tree _from previous Huffman-compressed literals block_.
@ -566,6 +567,7 @@ or from a dictionary.
### `Huffman_Tree_Description` ### `Huffman_Tree_Description`
This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`). This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
The tree describes the weights of all literals symbols that can be present in the literals block, at least 2 and up to 256.
The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description). The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
The size of `Huffman_Tree_Description` is determined during decoding process, The size of `Huffman_Tree_Description` is determined during decoding process,
it must be used to determine where streams begin. it must be used to determine where streams begin.
@ -1197,7 +1199,7 @@ Huffman Coding
-------------- --------------
Zstandard Huffman-coded streams are read backwards, Zstandard Huffman-coded streams are read backwards,
similar to the FSE bitstreams. similar to the FSE bitstreams.
-Therefore, to find the start of the bitstream, it is therefore to
+Therefore, to find the start of the bitstream, it is required to
know the offset of the last byte of the Huffman-coded stream. know the offset of the last byte of the Huffman-coded stream.
After writing the last bit containing information, the compressor After writing the last bit containing information, the compressor
@ -1239,9 +1241,15 @@ Transformation from `Weight` to `Number_of_Bits` follows this formula :
``` ```
Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0 Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
``` ```
-The last symbol's `Weight` is deduced from previously decoded ones,
-by completing to the nearest power of 2.
-This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
+When a literal value is not present, it receives a `Weight` of 0.
+The least frequent symbol receives a `Weight` of 1.
+Consequently, the `Weight` 1 is necessarily present.
+The most frequent symbol receives a `Weight` anywhere between 1 and 11 (max).
+The last symbol's `Weight` is deduced from previously retrieved Weights,
+by completing to the nearest power of 2. It's necessarily non 0.
+If it's not possible to reach a clean power of 2 with a single `Weight` value,
+the Huffman Tree Description is considered invalid.
+This final power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
`Max_Number_of_Bits` must be <= 11, `Max_Number_of_Bits` must be <= 11,
otherwise the representation is considered corrupted. otherwise the representation is considered corrupted.
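
The rule added in this hunk can be sketched as follows. This is a minimal illustration, not the reference decoder: the function name `deduce_last_weight` is hypothetical, and rejecting an all-zero weight list is an assumption drawn from the statement that `Weight` 1 is necessarily present.

```python
def deduce_last_weight(weights):
    """Return (last_weight, max_number_of_bits) from the explicitly listed weights.

    `weights` holds the Weight of every literal except the last one,
    which is never transmitted and has to be deduced.
    """
    # Each non-zero Weight w contributes 2^(w-1); Weight 0 contributes nothing.
    total = sum(1 << (w - 1) for w in weights if w > 0)
    if total == 0:
        # Assumption: no usable tree can be built from zero weights only.
        raise ValueError("no non-zero Weight listed")

    # Next power of 2 strictly above the sum, so the deduced Weight is non 0.
    max_number_of_bits = total.bit_length()
    remainder = (1 << max_number_of_bits) - total

    # The gap must be closable by a single Weight, i.e. be a power of 2.
    if remainder & (remainder - 1):
        raise ValueError("cannot reach a power of 2 with a single Weight")
    if max_number_of_bits > 11:
        raise ValueError("Max_Number_of_Bits exceeds 11")

    last_weight = remainder.bit_length()  # log_2(remainder) + 1
    return last_weight, max_number_of_bits
```

For example, listed weights `[3, 1]` sum to `4 + 1 = 5`; the next power of 2 is 8, and the gap of 3 is not a power of 2, so the description is rejected.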
@ -1254,7 +1262,7 @@ Let's presume the following Huffman tree must be described :
The tree depth is 4, since its longest elements uses 4 bits The tree depth is 4, since its longest elements uses 4 bits
(longest elements are the one with smallest frequency). (longest elements are the one with smallest frequency).
-Value `5` will not be listed, as it can be determined from values for 0-4,
+Literal value `5` will not be listed, as it can be determined from previous values 0-4,
nor will values above `5` as they are all 0. nor will values above `5` as they are all 0.
Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`. Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
Weight formula is : Weight formula is :
@ -1274,7 +1282,7 @@ The `Weight` of `5` can be determined by advancing to the next power of 2.
The sum of `2^(Weight-1)` (excluding 0's) is : The sum of `2^(Weight-1)` (excluding 0's) is :
`8 + 4 + 2 + 0 + 1 = 15`. `8 + 4 + 2 + 0 + 1 = 15`.
Nearest larger power of 2 value is 16. Nearest larger power of 2 value is 16.
-Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 16-15 = 1`.
+Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`.
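
As a cross-check of this example, the `Number_of_Bits` formula can be applied to all six literals. This is a small sketch: the listed weights `4, 3, 2, 0, 1` are inferred here from the terms of the sum `8 + 4 + 2 + 0 + 1` via `2^(Weight-1)`, and the helper name `number_of_bits` is hypothetical.

```python
from fractions import Fraction


def number_of_bits(weight, max_number_of_bits):
    """Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0"""
    return max_number_of_bits + 1 - weight if weight else 0


MAX_NUMBER_OF_BITS = 4
# Weights of literals 0-4 (inferred from the sum), followed by the deduced Weight[5] = 1.
weights = [4, 3, 2, 0, 1, 1]

bits = [number_of_bits(w, MAX_NUMBER_OF_BITS) for w in weights]
print(bits)  # [1, 2, 3, 0, 4, 4]

# A complete prefix code satisfies the Kraft equality: sum of 2^-bits over used symbols is 1.
kraft = sum(Fraction(1, 1 << b) for b in bits if b > 0)
print(kraft)  # 1
```

The longest codes use 4 bits, matching the stated tree depth, and literal `3` gets no code because its `Weight` is 0.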
#### Huffman Tree header #### Huffman Tree header
@ -1683,6 +1691,7 @@ or at least provide a meaningful error code explaining for which reason it canno
Version changes Version changes
--------------- ---------------
- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.
- 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878 - 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878
- 0.3.6 : clarifications for Dictionary_ID - 0.3.6 : clarifications for Dictionary_ID
- 0.3.5 : clarifications for Block_Maximum_Size - 0.3.5 : clarifications for Block_Maximum_Size