Merge pull request #4164 from facebook/spec_043

spec update: huffman prefix code paragraph
This commit is contained in:
Yann Collet 2024-10-10 16:56:02 -07:00 committed by GitHub
commit 7ba43091b8
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

View File

@ -16,7 +16,7 @@ Distribution of this document is unlimited.
### Version ### Version
0.4.2 (2024-10-02) 0.4.3 (2024-10-07)
Introduction Introduction
@ -1270,13 +1270,13 @@ This specification limits maximum code length to 11 bits.
#### Representation #### Representation
All literal values from zero (included) to last present one (excluded) All literal symbols from zero (included) to last present one (excluded)
are represented by `Weight` with values from `0` to `Max_Number_of_Bits`. are represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
Transformation from `Weight` to `Number_of_Bits` follows this formula : Transformation from `Weight` to `Number_of_Bits` follows this formula :
``` ```
Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0 Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
``` ```
When a literal value is not present, it receives a `Weight` of 0. When a literal symbol is not present, it receives a `Weight` of 0.
The least frequent symbol receives a `Weight` of 1. The least frequent symbol receives a `Weight` of 1.
If no literal has a `Weight` of 1, then the data is considered corrupted. If no literal has a `Weight` of 1, then the data is considered corrupted.
If there are not at least two literals with non-zero `Weight`, then the data If there are not at least two literals with non-zero `Weight`, then the data
@ -1293,33 +1293,38 @@ otherwise the representation is considered corrupted.
__Example__ : __Example__ :
Let's presume the following Huffman tree must be described : Let's presume the following Huffman tree must be described :
| literal value | 0 | 1 | 2 | 3 | 4 | 5 | | literal symbol | A | B | C | D | E | F |
| ---------------- | --- | --- | --- | --- | --- | --- | | ---------------- | --- | --- | --- | --- | --- | --- |
| `Number_of_Bits` | 1 | 2 | 3 | 0 | 4 | 4 | | `Number_of_Bits` | 1 | 2 | 3 | 0 | 4 | 4 |
The tree depth is 4, since its longest elements uses 4 bits The tree depth is 4, since its longest elements uses 4 bits
(longest elements are the one with smallest frequency). (longest elements are the ones with smallest frequency).
Literal value `5` will not be listed, as it can be determined from previous values 0-4,
nor will values above `5` as they are all 0. All symbols will now receive a `Weight` instead of `Number_of_Bits`.
Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
Weight formula is : Weight formula is :
``` ```
Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0 Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
``` ```
It gives the following series of weights : It gives the following series of Weights :
| literal value | 0 | 1 | 2 | 3 | 4 | | literal symbol | A | B | C | D | E | F |
| ------------- | --- | --- | --- | --- | --- | | -------------- | --- | --- | --- | --- | --- | --- |
| `Weight` | 4 | 3 | 2 | 0 | 1 | | `Weight` | 4 | 3 | 2 | 0 | 1 | 1 |
This list will be sent to the decoder, with the following modifications:
- `F` will not be listed, because it can be determined from previous symbols
- nor will symbols above `F` as they are all 0
- on the other hand, all symbols before `A`, starting with `\0`, will be listed, with a Weight of 0.
The decoder will do the inverse operation : The decoder will do the inverse operation :
having collected weights of literal symbols from `0` to `4`, having collected weights of literal symbols from `A` to `E`,
it knows the last literal, `5`, is present with a non-zero `Weight`. it knows the last literal, `F`, is present with a non-zero `Weight`.
The `Weight` of `5` can be determined by advancing to the next power of 2. The `Weight` of `F` can be determined by advancing to the next power of 2.
The sum of `2^(Weight-1)` (excluding 0's) is : The sum of `2^(Weight-1)` (excluding 0's) is :
`8 + 4 + 2 + 0 + 1 = 15`. `8 + 4 + 2 + 0 + 1 = 15`.
Nearest larger power of 2 value is 16. Nearest larger power of 2 value is 16.
Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`. Therefore, `Max_Number_of_Bits = log2(16) = 4` and `Weight[F] = log_2(16 - 15) + 1 = 1`.
#### Huffman Tree header #### Huffman Tree header
@ -1359,7 +1364,7 @@ sharing a single distribution table.
To decode an FSE bitstream, it is necessary to know its compressed size. To decode an FSE bitstream, it is necessary to know its compressed size.
Compressed size is provided by `headerByte`. Compressed size is provided by `headerByte`.
It's also necessary to know its _maximum possible_ decompressed size, It's also necessary to know its _maximum possible_ decompressed size,
which is `255`, since literal values span from `0` to `255`, which is `255`, since literal symbols span from `0` to `255`,
and last symbol's `Weight` is not represented. and last symbol's `Weight` is not represented.
An FSE bitstream starts by a header, describing probabilities distribution. An FSE bitstream starts by a header, describing probabilities distribution.
@ -1395,26 +1400,28 @@ It is possible to transform weights into `Number_of_Bits`, using this formula:
``` ```
Number_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0 Number_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0
``` ```
Symbols are sorted by `Weight`. In order to determine which prefix code is assigned to each Symbol,
Within same `Weight`, symbols keep natural sequential order. Symbols are first sorted by `Weight`, then by natural sequential order.
Symbols with a `Weight` of zero are removed. Symbols with a `Weight` of zero are removed.
Then, starting from lowest `Weight`, prefix codes are distributed in sequential order. Then, starting from lowest `Weight` (hence highest `Number_of_Bits`),
prefix codes are assigned in ascending order.
__Example__ : __Example__ :
Let's presume the following list of weights has been decoded : Let's assume the following list of weights has been decoded:
| Literal | 0 | 1 | 2 | 3 | 4 | 5 | | Literal | A | B | C | D | E | F |
| -------- | --- | --- | --- | --- | --- | --- | | -------- | --- | --- | --- | --- | --- | --- |
| `Weight` | 4 | 3 | 2 | 0 | 1 | 1 | | `Weight` | 4 | 3 | 2 | 0 | 1 | 1 |
Sorted by weight and then natural sequential order, Sorted by weight and then natural sequential order,
it gives the following distribution : it gives the following prefix codes distribution:
| Literal | 3 | 4 | 5 | 2 | 1 | 0 | | Literal | D | E | F | C | B | A |
| ---------------- | --- | --- | --- | --- | --- | ---- | | ---------------- | --- | ---- | ---- | ---- | ---- | ---- |
| `Weight` | 0 | 1 | 1 | 2 | 3 | 4 | | `Weight` | 0 | 1 | 1 | 2 | 3 | 4 |
| `Number_of_Bits` | 0 | 4 | 4 | 3 | 2 | 1 | | `Number_of_Bits` | 0 | 4 | 4 | 3 | 2 | 1 |
| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 | | prefix code | N/A | 0000 | 0001 | 001 | 01 | 1 |
| ascending order | N/A | 0000 | 0001 | 001x | 01xx | 1xxx |
### Huffman-coded Streams ### Huffman-coded Streams
@ -1437,10 +1444,10 @@ it's possible to read the bitstream in a __little-endian__ fashion,
keeping track of already used bits. Since the bitstream is encoded in reverse keeping track of already used bits. Since the bitstream is encoded in reverse
order, starting from the end read symbols in forward order. order, starting from the end read symbols in forward order.
For example, if the literal sequence "0145" was encoded using above prefix code, For example, if the literal sequence `ABEF` was encoded using above prefix code,
it would be encoded (in reverse order) as: it would be encoded (in reverse order) as:
|Symbol | 5 | 4 | 1 | 0 | Padding | |Symbol | F | E | B | A | Padding |
|--------|------|------|----|---|---------| |--------|------|------|----|---|---------|
|Encoding|`0000`|`0001`|`01`|`1`| `00001` | |Encoding|`0000`|`0001`|`01`|`1`| `00001` |
@ -1735,6 +1742,7 @@ or at least provide a meaningful error code explaining for which reason it canno
Version changes Version changes
--------------- ---------------
- 0.4.3 : clarifications for Huffman prefix code assignment example
- 0.4.2 : refactor FSE table construction process, inspired by Donald Pian - 0.4.2 : refactor FSE table construction process, inspired by Donald Pian
- 0.4.1 : clarifications on a few error scenarios, by Eric Lasota - 0.4.1 : clarifications on a few error scenarios, by Eric Lasota
- 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov - 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov