mirror of https://github.com/facebook/zstd.git (synced 2025-10-26 00:02:22 -04:00)

update Zstandard format specification
answering a few questions from IETF RFC Discuss stage.

parent ac4f7ead3b
commit a4c9c4defe
				| @ -16,7 +16,7 @@ Distribution of this document is unlimited. | |||||||
| 
 | 
 | ||||||
| ### Version | ### Version | ||||||
| 
 | 
 | ||||||
| 0.2.7 (30/04/18) | 0.2.8 (30/05/18) | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| Introduction | Introduction | ||||||
| @ -27,6 +27,8 @@ that is independent of CPU type, operating system, | |||||||
| file system and character set, suitable for | file system and character set, suitable for | ||||||
| file compression, pipe and streaming compression, | file compression, pipe and streaming compression, | ||||||
| using the [Zstandard algorithm](http://www.zstandard.org). | using the [Zstandard algorithm](http://www.zstandard.org). | ||||||
|  | The text of the specification assumes a basic background in programming | ||||||
|  | at the level of bits and other primitive data representations. | ||||||
| 
 | 
 | ||||||
| The data can be produced or consumed, | The data can be produced or consumed, | ||||||
| even for an arbitrarily long sequentially presented input data stream, | even for an arbitrarily long sequentially presented input data stream, | ||||||
| @ -39,11 +41,6 @@ for detection of data corruption. | |||||||
| The data format defined by this specification | The data format defined by this specification | ||||||
| does not attempt to allow random access to compressed data. | does not attempt to allow random access to compressed data. | ||||||
| 
 | 
 | ||||||
| This specification is intended for use by implementers of software |  | ||||||
| to compress data into Zstandard format and/or decompress data from Zstandard format. |  | ||||||
| The text of the specification assumes a basic background in programming |  | ||||||
| at the level of bits and other primitive data representations. |  | ||||||
| 
 |  | ||||||
| Unless otherwise indicated below, | Unless otherwise indicated below, | ||||||
| a compliant compressor must produce data sets | a compliant compressor must produce data sets | ||||||
| that conform to the specifications presented here. | that conform to the specifications presented here. | ||||||
| @ -57,6 +54,16 @@ Whenever it does not support a parameter defined in the compressed stream, | |||||||
| it must produce a non-ambiguous error code and associated error message | it must produce a non-ambiguous error code and associated error message | ||||||
| explaining which parameter is unsupported. | explaining which parameter is unsupported. | ||||||
| 
 | 
 | ||||||
|  | This specification is intended for use by implementers of software | ||||||
|  | to compress data into Zstandard format and/or decompress data from Zstandard format. | ||||||
|  | The Zstandard format is supported by an open source reference implementation, | ||||||
|  | which also contains some useful validation tool, | ||||||
|  | such as `decodeCorpus`, which generate random valid frames, | ||||||
|  | that a compliant decoder should be able to decode, | ||||||
|  | or provide a meaningful error code explaining why it cannot. | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
|  | 
 | ||||||
| ### Overall conventions | ### Overall conventions | ||||||
| In this document: | In this document: | ||||||
| - square brackets i.e. `[` and `]` are used to indicate optional fields or parameters. | - square brackets i.e. `[` and `]` are used to indicate optional fields or parameters. | ||||||
| @ -92,14 +99,14 @@ Overview | |||||||
| Frames | Frames | ||||||
| ------ | ------ | ||||||
| Zstandard compressed data is made of one or more __frames__. | Zstandard compressed data is made of one or more __frames__. | ||||||
| Each frame is independent and can be decompressed indepedently of other frames. | Each frame is independent and can be decompressed independently of other frames. | ||||||
| The decompressed content of multiple concatenated frames is the concatenation of | The decompressed content of multiple concatenated frames is the concatenation of | ||||||
| each frame decompressed content. | each frame decompressed content. | ||||||
| 
 | 
 | ||||||
| There are two frame formats defined by Zstandard: | There are two frame formats defined by Zstandard: | ||||||
|   Zstandard frames and Skippable frames. |   Zstandard frames and Skippable frames. | ||||||
| Zstandard frames contain compressed data, while | Zstandard frames contain compressed data, while | ||||||
| skippable frames contain no data and can be used for metadata. | skippable frames contain custom user metadata. | ||||||
| 
 | 
 | ||||||
| ## Zstandard frames | ## Zstandard frames | ||||||
| The structure of a single Zstandard frame is following: | The structure of a single Zstandard frame is following: | ||||||
| @ -630,15 +637,8 @@ They follow the same enumeration : | |||||||
| - `Predefined_Mode` : A predefined FSE distribution table is used, defined in | - `Predefined_Mode` : A predefined FSE distribution table is used, defined in | ||||||
|           [default distributions](#default-distributions). |           [default distributions](#default-distributions). | ||||||
|           No distribution table will be present. |           No distribution table will be present. | ||||||
| - `RLE_Mode` : The table description consists of a single byte. | - `RLE_Mode` : The table description consists of a single byte, which contain symbol's value. | ||||||
|           This code will be repeated for all sequences. |           This symbol will be used for all sequences. | ||||||
| - `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again, |  | ||||||
|           or if this is the first block, table in the dictionary will be used |  | ||||||
|           No distribution table will be present. |  | ||||||
|           Note that this includes `RLE_mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated. |  | ||||||
|           Note that this also includes `Predefined_Mode`. |  | ||||||
|           If this mode is used without any previous sequence table in the frame |  | ||||||
|           (or [dictionary](#dictionary-format)) to repeat, this should be treated as corruption. |  | ||||||
| - `FSE_Compressed_Mode` : standard FSE compression. | - `FSE_Compressed_Mode` : standard FSE compression. | ||||||
|           A distribution table will be present. |           A distribution table will be present. | ||||||
|           The format of this distribution table is described in [FSE Table Description](#fse-table-description). |           The format of this distribution table is described in [FSE Table Description](#fse-table-description). | ||||||
| @ -646,6 +646,13 @@ They follow the same enumeration : | |||||||
|           and the maximum accuracy log for the offsets table is 8. |           and the maximum accuracy log for the offsets table is 8. | ||||||
|           `FSE_Compressed_Mode` must not be used when only one symbol is present, |           `FSE_Compressed_Mode` must not be used when only one symbol is present, | ||||||
|           `RLE_Mode` should be used instead (although any other mode will work). |           `RLE_Mode` should be used instead (although any other mode will work). | ||||||
|  | - `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again, | ||||||
|  |           or if this is the first block, table in the dictionary will be used. | ||||||
|  |           Note that this includes `RLE_mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated. | ||||||
|  |           It also includes `Predefined_Mode`, in which case `Repeat_Mode` will have same outcome as `Predefined_Mode`. | ||||||
|  |           No distribution table will be present. | ||||||
|  |           If this mode is used without any previous sequence table in the frame | ||||||
|  |           (nor [dictionary](#dictionary-format)) to repeat, this should be treated as corruption. | ||||||
| 
 | 
 | ||||||
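The mode rules in this hunk can be read as a small piece of per-table state carried from block to block. The sketch below is one possible reading, not the reference implementation: `TableTracker`, `select` and the opaque table values are hypothetical names, and the numeric mode values follow the enumeration order of the full specification, which this hunk does not show. One tracker would be kept per table kind (literals lengths, offsets, match lengths), and `select` would only be called for a `Compressed_Block` with `Number_of_Sequences > 0`.

```python
from enum import Enum


class Mode(Enum):
    # Numeric values assumed from the enumeration order in the full specification.
    PREDEFINED = 0
    RLE = 1
    FSE_COMPRESSED = 2
    REPEAT = 3


class TableTracker:
    """Hypothetical helper: remembers the last table used for one table kind."""

    def __init__(self, predefined_table, dictionary_table=None):
        self.predefined_table = predefined_table
        # Before the first block, only a dictionary can provide a table to repeat.
        self.previous = dictionary_table

    def select(self, mode, described=None):
        """Return the table to use for this block and remember it.

        `described` is the RLE symbol for RLE_Mode, or the already decoded
        distribution table for FSE_Compressed_Mode.
        """
        if mode is Mode.PREDEFINED:
            table = self.predefined_table
        elif mode is Mode.RLE:
            table = ("rle", described)
        elif mode is Mode.FSE_COMPRESSED:
            table = described
        else:  # Mode.REPEAT
            if self.previous is None:
                # No previous table in the frame and no dictionary: corruption.
                raise ValueError("Repeat_Mode without a previous table")
            table = self.previous
        self.previous = table
        return table
```

With this reading, `Repeat_Mode` after `RLE_Mode` yields the same single symbol again, and `Repeat_Mode` after `Predefined_Mode` yields the predefined table, as the text states.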
| #### The codes for literals lengths, match lengths, and offsets. | #### The codes for literals lengths, match lengths, and offsets. | ||||||
| 
 | 
 | ||||||
| @ -903,16 +910,22 @@ Skippable Frames | |||||||
| |:--------------:|:------------:|:-----------:| | |:--------------:|:------------:|:-----------:| | ||||||
| |   4 bytes      |  4 bytes     |   n bytes   | | |   4 bytes      |  4 bytes     |   n bytes   | | ||||||
| 
 | 
 | ||||||
| Skippable frames allow the insertion of user-defined data | Skippable frames allow the insertion of user-defined metadata | ||||||
| into a flow of concatenated frames. | into a flow of concatenated frames. | ||||||
| Its design is pretty straightforward, |  | ||||||
| with the sole objective to allow the decoder to quickly skip |  | ||||||
| over user-defined data and continue decoding. |  | ||||||
| 
 | 
 | ||||||
| Skippable frames defined in this specification are compatible with [LZ4] ones. | Skippable frames defined in this specification are compatible with [LZ4] ones. | ||||||
| 
 | 
 | ||||||
| [LZ4]:http://www.lz4.org | [LZ4]:http://www.lz4.org | ||||||
| 
 | 
 | ||||||
|  | From a compliant decoder perspective, skippable frames need just be skipped, | ||||||
|  | and their content ignored, resuming decoding after the skippable frame. | ||||||
|  | 
 | ||||||
|  | It can be noted that a skippable frame | ||||||
|  | can be used to watermark a stream of concatenated frames | ||||||
|  | embedding any kind of tracking information (even just an UUID). | ||||||
|  | User wary of such usage should scan the stream of concatenated frames | ||||||
|  | in an attempt to detect such frame for analysis or removal. | ||||||
|  | 
 | ||||||
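As an illustration of "skippable frames need just be skipped", here is a minimal sketch of a helper that steps over one skippable frame. It assumes the field definitions from the full specification, which are only partly visible in this hunk: `Magic_Number` is any value from 0x184D2A50 to 0x184D2A5F, and `Frame_Size` is a 4-byte little-endian unsigned integer giving the size of `User_Data`. The function name `skip_skippable_frame` is hypothetical.

```python
import struct

# Range of skippable-frame magic numbers (0x184D2A5?), per the full specification.
SKIPPABLE_MAGIC_MIN = 0x184D2A50
SKIPPABLE_MAGIC_MAX = 0x184D2A5F


def skip_skippable_frame(buf, pos=0):
    """Return the offset just past the skippable frame starting at `pos`.

    Returns None when the bytes at `pos` do not start a skippable frame.
    Raises ValueError when the frame announces more user data than is present.
    """
    if len(buf) - pos < 8:
        return None
    # Magic_Number and Frame_Size are both 4 bytes, little-endian.
    magic, frame_size = struct.unpack_from("<II", buf, pos)
    if not SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MAX:
        return None
    end = pos + 8 + frame_size
    if end > len(buf):
        raise ValueError("truncated skippable frame")
    return end
```

A tool scanning for watermark-style metadata could call the same helper at each frame boundary and inspect `buf[pos + 8:end]` instead of ignoring it.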
| __`Magic_Number`__ | __`Magic_Number`__ | ||||||
| 
 | 
 | ||||||
| 4 Bytes, __little-endian__ format. | 4 Bytes, __little-endian__ format. | ||||||
| @ -1196,14 +1209,15 @@ which describes how to decode the list of weights. | |||||||
|   the top four bits and the second taking the bottom four (e.g. the following |   the top four bits and the second taking the bottom four (e.g. the following | ||||||
|   operations could be used to read the weights: |   operations could be used to read the weights: | ||||||
|   `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.). |   `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.). | ||||||
|   The full representation occupies `((Number_of_Symbols+1)/2)` bytes, |   The full representation occupies `Ceiling(Number_of_Symbols/2)` bytes, | ||||||
|   meaning it uses a last full byte even if `Number_of_Symbols` is odd. |   meaning it uses only full bytes even if `Number_of_Symbols` is odd. | ||||||
|   `Number_of_Symbols = headerByte - 127`. |   `Number_of_Symbols = headerByte - 127`. | ||||||
|   Note that maximum `Number_of_Symbols` is 255-127 = 128. |   Note that maximum `Number_of_Symbols` is 255-127 = 128. | ||||||
|   A larger series must necessarily use FSE compression. |   If any present literal has a value > 128, raw header mode is not possible. | ||||||
|  |   It's necessary to use FSE compression. | ||||||
| 
 | 
 | ||||||
| - if `headerByte` < 128 : | - if `headerByte` < 128 : | ||||||
|   the series of weights is compressed by FSE. |   the series of weights is compressed using FSE. | ||||||
|   The length of the FSE-compressed series is equal to `headerByte` (0-127). |   The length of the FSE-compressed series is equal to `headerByte` (0-127). | ||||||
| 
 | 
 | ||||||
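The direct representation described in this hunk (`headerByte` >= 128) can be sketched as follows. This is an illustrative reading rather than the reference decoder: the function name `read_direct_weights` and the input layout (a buffer starting at `headerByte`) are assumptions made for the example.

```python
def read_direct_weights(data):
    """Decode weights stored as raw 4-bit values after `headerByte`.

    `data` is assumed to start at `headerByte`; only the direct
    representation (headerByte >= 128) is handled here.
    """
    header_byte = data[0]
    if header_byte < 128:
        raise ValueError("weights are FSE-compressed, not stored directly")
    number_of_symbols = header_byte - 127
    # Ceiling(Number_of_Symbols / 2): only full bytes are used.
    nb_bytes = (number_of_symbols + 1) // 2
    payload = data[1:1 + nb_bytes]
    if len(payload) < nb_bytes:
        raise ValueError("truncated weight description")
    weights = []
    for byte in payload:
        weights.append(byte >> 4)    # first weight: top four bits
        weights.append(byte & 0x0F)  # second weight: bottom four bits
    # When Number_of_Symbols is odd, the final low nibble is padding.
    return weights[:number_of_symbols]
```

For instance, a `headerByte` of 132 announces 5 weights, which occupy 3 bytes; the low nibble of the third byte is then unused.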
| ##### Finite State Entropy (FSE) compression of Huffman weights | ##### Finite State Entropy (FSE) compression of Huffman weights | ||||||
| @ -1235,18 +1249,19 @@ The number of symbols to decode is determined | |||||||
| by tracking bitStream overflow condition: | by tracking bitStream overflow condition: | ||||||
| If updating state after decoding a symbol would require more bits than | If updating state after decoding a symbol would require more bits than | ||||||
| remain in the stream, it is assumed that extra bits are 0.  Then, | remain in the stream, it is assumed that extra bits are 0.  Then, | ||||||
| the symbols for each of the final states are decoded and the process is complete. | symbols for each of the final states are decoded and the process is complete. | ||||||
| 
 | 
 | ||||||
| ##### Conversion from weights to Huffman prefix codes | ##### Conversion from weights to Huffman prefix codes | ||||||
| 
 | 
 | ||||||
| All present symbols shall now have a `Weight` value. | All present symbols shall now have a `Weight` value. | ||||||
| It is possible to transform weights into Number_of_Bits, using this formula: | It is possible to transform weights into` Number_of_Bits`, using this formula: | ||||||
| ``` | ``` | ||||||
| Number_of_Bits = Number_of_Bits ? Max_Number_of_Bits + 1 - Weight : 0 | Number_of_Bits = Weight ? Max_Number_of_Bits + 1 - Weight : 0 | ||||||
| ``` | ``` | ||||||
| Symbols are sorted by `Weight`. Within same `Weight`, symbols keep natural order. | Symbols are sorted by `Weight`. | ||||||
|  | Within same `Weight`, symbols keep natural sequential order. | ||||||
| Symbols with a `Weight` of zero are removed. | Symbols with a `Weight` of zero are removed. | ||||||
| Then, starting from lowest weight, prefix codes are distributed in order. | Then, starting from lowest weight, prefix codes are distributed in sequential order. | ||||||
| 
 | 
 | ||||||
| __Example__ : | __Example__ : | ||||||
| Let's presume the following list of weights has been decoded : | Let's presume the following list of weights has been decoded : | ||||||
| @ -1255,7 +1270,7 @@ Let's presume the following list of weights has been decoded : | |||||||
| | -------- | --- | --- | --- | --- | --- | --- | | | -------- | --- | --- | --- | --- | --- | --- | | ||||||
| | `Weight` |  4  |  3  |  2  |  0  |  1  |  1  | | | `Weight` |  4  |  3  |  2  |  0  |  1  |  1  | | ||||||
| 
 | 
 | ||||||
| Sorted by weight and then natural order, | Sorted by weight and then natural sequential order, | ||||||
| it gives the following distribution : | it gives the following distribution : | ||||||
| 
 | 
 | ||||||
| | Literal          |  3  |  4  |  5  |  2  |  1  |   0  | | | Literal          |  3  |  4  |  5  |  2  |  1  |   0  | | ||||||
| @ -1265,6 +1280,7 @@ it gives the following distribution : | |||||||
| | prefix codes     | N/A | 0000| 0001| 001 | 01  |   1  | | | prefix codes     | N/A | 0000| 0001| 001 | 01  |   1  | | ||||||
| 
 | 
 | ||||||
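The conversion above can be checked with a short sketch. It assumes `Max_Number_of_Bits` is already known and passed in as a parameter, and it interprets "prefix codes are distributed in sequential order" as walking a counter through the code space, where a symbol of weight `w` takes `2^(w-1)` slots. That interpretation is an assumption, but it reproduces the table of the example. The function name `huffman_prefix_codes` is hypothetical.

```python
def huffman_prefix_codes(weights, max_number_of_bits):
    """Return (number_of_bits, codes) for a list of weights indexed by literal.

    `codes` maps each literal with a non-zero weight to its prefix code,
    written as a string of '0' and '1'.
    """
    # Number_of_Bits = Weight ? Max_Number_of_Bits + 1 - Weight : 0
    number_of_bits = [max_number_of_bits + 1 - w if w else 0 for w in weights]

    # Drop zero weights, sort by weight, then by natural sequential order.
    present = [s for s, w in enumerate(weights) if w]
    present.sort(key=lambda s: (weights[s], s))

    codes = {}
    position = 0  # position inside a code space of 2**max_number_of_bits slots
    for symbol in present:
        w = weights[symbol]
        code = position >> (w - 1)
        codes[symbol] = format(code, "0{}b".format(number_of_bits[symbol]))
        position += 1 << (w - 1)
    return number_of_bits, codes


# Weights from the example, with Max_Number_of_Bits = 4.
bits, codes = huffman_prefix_codes([4, 3, 2, 0, 1, 1], 4)
print(bits)   # [1, 2, 3, 0, 4, 4]
print(codes)  # {4: '0000', 5: '0001', 2: '001', 1: '01', 0: '1'}
```

Literal 3 has a weight of zero, so it receives no code, matching the `N/A` entry in the table.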
| ### Huffman-coded Streams | ### Huffman-coded Streams | ||||||
|  | 
 | ||||||
| Given a Huffman decoding table, | Given a Huffman decoding table, | ||||||
| it's possible to decode a Huffman-coded stream. | it's possible to decode a Huffman-coded stream. | ||||||
| 
 | 
 | ||||||
| @ -1554,6 +1570,7 @@ to crosscheck that an implementation build its decoding tables correctly. | |||||||
| 
 | 
 | ||||||
| Version changes | Version changes | ||||||
| --------------- | --------------- | ||||||
|  | - 0.2.8 : clarifications for IETF RFC discuss | ||||||
| - 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell | - 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell | ||||||
| - 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz | - 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz | ||||||
| - 0.2.5 : minor typos and clarifications | - 0.2.5 : minor typos and clarifications | ||||||
|  | |||||||