729 Commits

Author SHA1 Message Date
Yann Collet
34f3a0ab11
Merge pull request #4413 from arpadpanyik-arm/huf_decode2x
AArch64: Enhance struct access in Huffman decode 2X
2025-07-23 15:03:37 -08:00
Arpad Panyik
a28e8182b1 AArch64: Improve ZSTD_decodeSequence performance
LLVM's alias-analysis sometimes fails to see that a static-array member
of a struct cannot alias other members. This patch:

- Reduces array accesses via struct indirection to aid load/store alias
  analysis under Clang.
- Converts dynamic array indexing into conditional-move arithmetic,
  eliminating branches and extra loads/stores on out-of-order CPUs.
- Reloads the bitstream only when match-length bits are consumed
  (assuming each reload only needs to happen once per match-length
  read), improving branch-prediction rates.
- Removes the UNLIKELY() hint, which recent compilers already handle
  well without cost.
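
The second bullet can be illustrated with a minimal sketch. This is not the code from the patch: `SeqState`, `select_indexed` and `select_cmov` are hypothetical names, and whether the ternaries actually become conditional selects (CSEL on AArch64) depends on the compiler and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state: three candidate values held in a static-array member. */
typedef struct { size_t prevOffset[3]; } SeqState;

/* Baseline: dynamic indexing, i.e. a load whose address depends on idx. */
static size_t select_indexed(const SeqState *s, unsigned idx)
{
    assert(idx < 3);
    return s->prevOffset[idx];
}

/* Alternative: read every candidate into a local, then choose with
 * ternaries that a compiler may lower to conditional-select instructions. */
static size_t select_cmov(const SeqState *s, unsigned idx)
{
    const size_t o0 = s->prevOffset[0];
    const size_t o1 = s->prevOffset[1];
    const size_t o2 = s->prevOffset[2];
    size_t r;
    assert(idx < 3);
    r = (idx == 1) ? o1 : o0;
    r = (idx == 2) ? o2 : r;
    return r;
}
```

Both functions return the same value for every valid index; only the shape of the generated code is expected to differ.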

Decompression uplifts on a Neoverse V2 system, using Zstd-1.5.8
compiled with "-O3 -march=armv8.2-a+sve2":

                 Clang-19  Clang-20   Clang-*    GCC-14    GCC-15
 1#silesia.tar:  +11.556%  +16.203%   +0.240%   +2.216%   +7.891%
 2#silesia.tar:  +15.493%  +21.140%   -0.041%   +2.850%   +9.926%
 3#silesia.tar:  +16.887%  +22.570%   -0.183%   +3.056%  +10.660%
 4#silesia.tar:  +17.785%  +23.315%   -0.262%   +3.343%  +11.187%
 5#silesia.tar:  +18.125%  +24.175%   -0.466%   +3.350%  +11.228%
 6#silesia.tar:  +17.607%  +23.339%   -0.591%   +3.175%  +10.851%
 7#silesia.tar:  +17.463%  +22.837%   -0.486%   +3.292%  +10.868%

* Requires Clang-21 support from LLVM commit hash
  `a53003fe23cb6c871e72d70ff2d3a075a7490da2`
  (Clang-21 hasn’t been released as of this writing)

Co-authored by:
 David Sherwood, David.Sherwood@arm.com
 Ola Liljedahl, Ola.Liljedahl@arm.com
2025-06-24 12:22:23 +00:00
Arpad Panyik
bd38fc2c5f AArch64: Enhance struct access in Huffman decode 2X
In the multi-stream multi-symbol Huffman decoder GCC generates
suboptimal code - emitting more loads for HUF_DEltX2 struct member
accesses. Forcing it to use 32-bit loads and bit arithmetic to extract
the necessary parts (UBFX) improves the overall decode speed.
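
A rough sketch of the idea, not the patch itself: the 4-byte table entry is read once as a 32-bit word and its fields are extracted with shifts and masks. The struct layout below (16-bit sequence, 8-bit bit count, 8-bit length) and all helper names are assumptions made for illustration, and the field positions assume a little-endian target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed 4-byte entry layout, for illustration only. */
typedef struct { uint16_t sequence; uint8_t nbBits; uint8_t length; } DEltX2;

/* One 32-bit load instead of three narrower member loads. */
static uint32_t load_entry(const DEltX2 *e)
{
    uint32_t w;
    assert(sizeof(*e) == sizeof(w));
    memcpy(&w, e, sizeof(w));
    return w;
}

/* Field extraction by bit arithmetic (little-endian positions). */
static uint32_t entry_sequence(uint32_t w) { return w & 0xFFFFu; }
static uint32_t entry_nbBits(uint32_t w)   { return (w >> 16) & 0xFFu; }
static uint32_t entry_length(uint32_t w)   { return w >> 24; }
```

On AArch64 the shift-and-mask extractions are the kind of expression a compiler can map to a single UBFX each.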

Also avoid integer type conversions in the symbol decodes, which
leads to better instruction selection in table lookup accesses.

On AArch64 the decoder no longer runs into register-pressure limits,
so we can simplify the hot path and improve throughput.

Decompression uplifts on a Neoverse V2 system, using Zstd-1.5.8
compiled with "-O3 -march=armv8.2-a+sve2":

                 Clang-20   Clang-*    GCC-13    GCC-14    GCC-15
 1#silesia.tar:   +0.820%   +1.365%   +2.480%   +1.348%   +0.987%
 2#silesia.tar:   +0.426%   +0.784%   +1.218%   +0.665%   +0.554%
 3#silesia.tar:   +0.112%   +0.389%   +0.508%   +0.188%   +0.261%

* Requires Clang-21 support from LLVM commit hash
  `a53003fe23cb6c871e72d70ff2d3a075a7490da2`
  (Clang-21 hasn’t been released as of this writing)
2025-06-23 14:16:25 +00:00
Michael Kolupaev
a480191f9e Fix Darwin build of huf_decompress_amd64.S 2025-06-08 05:07:09 +00:00
Michael Kolupaev
80cac404c7 Add unwind information in huf_decompress_amd64.S 2025-06-08 05:07:09 +00:00
Nick Terrell
d5b84f5a27 [zstd] Backport D49756856 2025-03-05 10:35:01 -05:00
Yann Collet
dca9791862 fixed minor C++ compat warnings 2025-02-26 14:30:29 -08:00
Yann Collet
db2d205ada fixed -Wconversion for lib/decompress/zstd_decompress_block.c 2025-02-26 10:01:05 -08:00
Pavel P
59afb28c97 Remove unused ZSTD_decompressSequences_t typedef 2025-01-24 02:13:20 +02:00
Pavel P
1204626138 Check DYNAMIC_BMI2 instead of DYNAMIC_BMI2 != 0
`#if DYNAMIC_BMI2` is consistent with the rest of the code.

 + use spaces instead of tabs
2025-01-23 23:59:38 +02:00
Yann Collet
167b00495d
Merge pull request #4246 from pps83/dev-asmx64-win
[asm] Enable x86_64 asm for windows builds
2025-01-18 20:03:16 -08:00
Yann Collet
e8de8085f4 minor: assert that state is not null
replaces #4016
2025-01-18 13:08:04 -08:00
Pavel P
46e17b805b [asm] Enable x86_64 asm for windows builds 2025-01-18 05:33:08 +02:00
Yann Collet
04a2a0219c update type names
naming convention: Type names should start with a Capital letter (after the prefix)
2024-12-29 14:25:33 -08:00
Yann Collet
a2ff6ea784 improve ZSTD_getFrameHeader on skippable frames
now reports:
- the header size
- the magic variant (within @dictID field)
2024-12-29 12:26:04 -08:00
Yann Collet
477a01067f codemod: symbolEncodingType_e -> SymbolEncodingType_e 2024-12-20 10:36:56 -08:00
Yann Collet
31d48e9ffa fixing minor formatting issue in 32-bit mode with logs enabled 2024-10-23 11:50:56 -07:00
Dimitri Papadopoulos
44e83e9180
Fix typos not found by codespell 2024-06-20 20:16:25 +02:00
Elliot Gorokhovsky
741b87bbe1
Fuzzing and bugfixes for magicless-format decoding (#3976)
* fuzzing and bugfixes for magicless format

* reset dctx before each decompression

* do not memcmp empty buffers

* nit: decompressor errata
2024-03-20 19:22:34 -04:00
Elliot Gorokhovsky
7d970bd83c
Implement one-shot fallback for magicless format (#3971) 2024-03-18 10:55:53 -04:00
Elliot Gorokhovsky
559762da12
Remove duplicate and incorrect docs in zstd_decompress.c (#3967) 2024-03-14 15:55:01 -04:00
Nick Terrell
ff0afbad58 [asm][aarch64] Mark that BTI and PAC are supported
Mark that `huf_decompress_amd64.S` supports BTI and PAC, which it trivially does because it is empty for aarch64.

The issue only requested BTI markings, but it also makes sense to mark PAC, which is the only other feature.

Also add a test for this mode to the ARM64 QEMU test. Before this PR it warns on `huf_decompress_amd64.S`; after it, it doesn't.

Fixes Issue #3841.
2024-03-13 16:15:51 -04:00
Elliot Gorokhovsky
f65b9e27ce
Exercise ZSTD_findDecompressedSize() in the simple decompression fuzzer (#3959)
* Improve decompression fuzzer

* Fix legacy frame header fuzzer crash, add unit test
2024-03-12 17:07:06 -04:00
Yann Collet
a9fb8d4c41 new method to deal with offset==0
in this new method, when an `offset==0` is detected,
it's converted into (size_t)(-1), instead of 1.

The logic is that (size_t)(-1) is effectively an extremely large positive number,
which will not pass the offset distance test at next stage (`execSequence()`).
Checked the source code, and offset is always checked (as it should),
using a formula which is not vulnerable to arithmetic overflow:
```
RETURN_ERROR_IF(sequence.offset > (size_t)(oLitEnd - virtualStart),
```

The benefit is that such a case (offset==0) is always detected as corrupted data
as opposed to relying on the checksum to detect the error.
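
A minimal sketch of why the mapping works, using hypothetical helper names rather than the decoder's real functions: once 0 becomes `(size_t)(-1)`, any bounds check of the form `offset > available` rejects it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: map a decoded offset of 0 to the largest size_t. */
static size_t sanitize_offset(size_t offset)
{
    return offset ? offset : (size_t)(-1);
}

/* Returns 1 when the offset reaches back no further than `available`
 * bytes of already-decoded data, 0 otherwise. */
static int offset_is_valid(size_t offset, size_t available)
{
    return sanitize_offset(offset) <= available;
}
```

With 100 bytes available, offsets 1 through 100 pass, while 0 and 101 are both rejected by the same comparison.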
2024-03-08 15:26:06 -08:00
Yann Collet
8689633fdf
Merge pull request #3840 from aimuz/fix-reserved
lib/decompress: check for reserved bit corruption in zstd
2024-03-05 13:40:12 -08:00
Yann Collet
f77f634d41 update API documentation 2024-02-24 01:28:17 -08:00
Yann Collet
4b51526412 fix partial block uncompressed 2024-02-24 01:24:58 -08:00
Yann Collet
4683667785 refactor optimal parser
store stretches as intermediate solution instead of sequences.
makes it possible to link a solution to a predecessor.
2024-01-31 02:51:46 -08:00
aimuz
468bb17378
lib/decompress: check for reserved bit corruption in zstd
The patch adds a validation in ZSTD_decodeSeqHeaders to ensure that the
last field, which is reserved, is all-zeroes. This prevents
potential corruption from going undetected.

Fixes an issue where corrupted input could lead to undefined behavior
due to improper validation of reserved bits.
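
A sketch of such a check, assuming the bit layout given in the Zstandard format specification for the symbol compression modes byte (three 2-bit modes in the high bits, 2 reserved bits at the bottom). `SeqModes` and `parse_modes` are hypothetical names, not the library's.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three 2-bit compression modes. */
typedef struct { unsigned llMode, ofMode, mlMode; } SeqModes;

/* Returns 0 on success, -1 if the reserved low 2 bits are not zero. */
static int parse_modes(uint8_t b, SeqModes *out)
{
    if (b & 0x03) return -1;        /* reserved field must be all-zeroes */
    out->llMode = (unsigned)b >> 6; /* literal lengths mode, bits 7-6 */
    out->ofMode = (b >> 4) & 3u;    /* offsets mode, bits 5-4 */
    out->mlMode = (b >> 2) & 3u;    /* match lengths mode, bits 3-2 */
    return 0;
}
```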

Signed-off-by: aimuz <mr.imuz@gmail.com>
2023-11-28 21:04:37 +08:00
Nick Terrell
8193250615 Modernize macros to use do { } while (0)
This PR introduces no functional changes. It attempts to change all
macros currently using `{ }` or some variant of that to
`do { } while (0)`, and introduces trailing `;` where necessary.
There were no bugs found during this migration.
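
For readers unfamiliar with the idiom, a small sketch with a hypothetical macro (not one taken from zstd): wrapping the body in `do { } while (0)` makes the macro behave as a single statement that needs its trailing `;`, so it composes with `if`/`else`.

```c
#include <assert.h>

/* Hypothetical multi-statement macro wrapped in do { } while (0). */
#define SWAP_INT(a, b) do { int tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/* Returns 1 if the pair was swapped, 0 if it was already ordered. */
static int sort_pair(int *lo, int *hi)
{
    if (*lo > *hi)
        SWAP_INT(*lo, *hi);  /* the ';' ends the statement cleanly */
    else
        return 0;
    return 1;
}
```

Had `SWAP_INT` been defined with plain `{ }`, the `;` after the macro call would terminate the `if` statement and the `else` would fail to compile.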

The bug in Visual Studio's warning on this has been fixed since VS2015.
Additionally, we have several instances of `do { } while (0)` which have
been present for several releases, so we don't have to worry about
breaking people's builds.

Fixes Issue #3830.
2023-11-21 20:05:17 -05:00
Nick Terrell
dd4de1dd7a [huf] Fix null pointer addition
`HUF_DecompressFastArgs_init()` was adding 0 to NULL. Fix it by exiting
early for empty outputs. This is no change in behavior, because the
function was already exiting 0 in this case, just slightly later.
2023-11-20 17:13:01 -05:00
Nick Terrell
5ab78c0418 [huf] Improve fast C & ASM performance on small data
* Rename `ilimit` to `ilowest` and set it equal to `src` instead of
  `src + 6 + 8`. This is safe because the fast decoding loops guarantee
  to never read below `ilowest` already. This allows the fast decoder to
  run for at least two more iterations, because it consumes at most 7
  bytes per iteration.
* Continue the fast loop all the way until the number of safe iterations
  is 0. Initially, I thought that when it got towards the end, computing
  how many iterations are safe might become expensive. But it ends up
  being slower to have to decode each of the 4 streams individually,
  which makes sense.

This drastically speeds up the Huffman decoder on the `github` dataset
for the issue raised in #3762, measured with `zstd -b1e1r github/`.

| Decoder  | Speed before | Speed after |
|----------|--------------|-------------|
| Fallback | 477 MB/s     | 477 MB/s    |
| Fast C   | 384 MB/s     | 492 MB/s    |
| Assembly | 385 MB/s     | 501 MB/s    |

We can also look at the speed delta for different block sizes of silesia
using `zstd -b1e1r silesia.tar -B#`.

| Decoder  | -B1K ∆ | -B2K ∆ | -B4K ∆ | -B8K ∆ | -B16K ∆ | -B32K ∆ | -B64K ∆ | -B128K ∆ |
|----------|--------|--------|--------|--------|---------|---------|---------|----------|
| Fast C   | +11.2% | +8.2%  | +6.1%  | +4.4%  | +2.7%   | +1.5%   | +0.6%   | +0.2%    |
| Assembly | +12.5% | +9.0%  | +6.2%  | +3.6%  | +1.5%   | +0.7%   | +0.2%   | +0.03%   |
2023-11-20 17:13:01 -05:00
Nick Terrell
c7269add7e [huf] Improve fast huffman decoding speed in linux kernel
gcc in the linux kernel was not unrolling the inner loops of the Huffman
decoder, which was destroying decoding performance. The compiler was
generating crazy code with all sorts of branches. I suspect because of
Spectre mitigations, but I'm not certain. Once the loops were manually
unrolled, performance was restored.

Additionally, when gcc couldn't prove that the variable left shift in
the 4X2 decode loop wasn't greater than 63, it inserted checks to verify
it. To fix this, mask `entry.nbBits & 0x3F`, which allows gcc to elide
this check. This is a no-op, because `entry.nbBits` is guaranteed to be
less than 64.
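
The masking trick can be sketched as follows. `consume_bits` is a hypothetical helper, not the decoder's code: the mask tells the compiler the shift count is in [0, 63], and it leaves the result unchanged for any count already below 64.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: shift a 64-bit container left by nbBits.
 * The `& 0x3F` bounds the count to 0..63 so no range check is needed. */
static uint64_t consume_bits(uint64_t bits, unsigned nbBits)
{
    return bits << (nbBits & 0x3F);
}
```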

Lastly, introduce the `HUF_DISABLE_FAST_DECODE` macro to disable the
fast C loops for Issue #3762. So if even after this change, there is a
performance regression, users can opt-out at compile time.
2023-11-20 14:56:46 -05:00
Yann Collet
c1e588fcb4
Merge pull request #3771 from DimitriPapadopoulos/codespell
Fix new typos found by codespell
2023-10-07 19:29:41 -07:00
Nick Terrell
43118da8a7 Stop suppressing pointer-overflow UBSAN errors
* Remove all pointer-overflow suppressions from our UBSAN builds/tests.
* Add `ZSTD_ALLOW_POINTER_OVERFLOW_ATTR` macro to suppress
  pointer-overflow at a per-function level. This is a superior approach
  because it also applies to users who build zstd with UBSAN.
* Add `ZSTD_wrappedPtr{Diff,Add,Sub}()` that use these suppressions.
  The end goal is to only tag these functions with
  `ZSTD_ALLOW_POINTER_OVERFLOW_ATTR`. But we can start by annotating functions
  that rely on pointer overflow, and gradually transition to using
  these.
* Add `ZSTD_maybeNullPtrAdd()` to simplify pointer addition when the
  pointer may be `NULL`.
* Fix all the fuzzer issues that came up. I'm sure there will be a lot
  more, but these are the ones that came up within a few minutes of
  running the fuzzers, and while running GitHub CI.
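
As an illustration of the last helper in the list, here is a sketch of a NULL-tolerant pointer addition. The name and signature are hypothetical and may not match the library's `ZSTD_maybeNullPtrAdd()`; the point is only that the addition is skipped when there is nothing to add, so `NULL + 0` is never evaluated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: only perform pointer arithmetic when the
 * increment is positive, so a NULL pointer with add == 0 is returned
 * untouched instead of going through `NULL + 0`. */
static void *maybe_null_ptr_add(void *ptr, ptrdiff_t add)
{
    return add > 0 ? (void *)((char *)ptr + add) : ptr;
}
```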
2023-09-28 17:35:05 -04:00
Nick Terrell
3daed7017a Revert "Work around nullptr-with-nonzero-offset warning"
This reverts commit c27fa399042f466080e79bb4fd8a4871bc0bcf28.
2023-09-28 17:35:05 -04:00
Dimitri Papadopoulos
fe34776c20
Fix new typos found by codespell 2023-09-23 18:56:01 +02:00
Nick Terrell
cdceb0fce5 Improve macro guards for ZSTD_assertValidSequence
Refine the macro guards to define the functions exactly when they are
needed.

This fixes the chromium build with zstd.

Thanks to @GregTho for reporting!
2023-09-22 16:36:14 -04:00
Nick Terrell
c27fa39904 Work around nullptr-with-nonzero-offset warning
See comment.
2023-08-25 13:20:59 -04:00
Yann Collet
c123e69ad0 fixed static analyzer false positive regarding @sequence initialization
make a mock initialization to please the tool
2023-06-16 16:24:48 -07:00
Yann Collet
c60dcedcc9 adapted long decoder to new decodeSequences
removed older decodeSequences
2023-06-16 15:52:00 -07:00
Yann Collet
33fca19dd4 changed ZSTD_decompressSequences_bodySplitLitBuffer() decoding loop
to behave more like the regular decoding loop.
2023-06-16 15:32:07 -07:00
Yann Collet
84e898a76c removed _old variant from splitLit 2023-06-16 14:42:28 -07:00
Yann Collet
02134fad12 changed (partially) the decodeSequences flow logic
this allows detecting overflow events without a checksum.
2023-06-16 11:57:12 -07:00
Yann Collet
b46236278a detect extraneous bytes in the Sequences section
when nbSeq == 0.

Reported by @ip7z
2023-06-13 11:43:45 -07:00
Yann Collet
3732a08f5b fixed decoder behavior when nbSeqs==0 is encoded using 2 bytes
The sequence section starts with a number, which tells how many sequences are present in the section.
If this number is 0, the section automatically ends.

The number 0 can be represented using the 1-byte or the 2-byte format.
That's because the 2-byte format fully overlaps the 1-byte format.

However, when 0 is represented using the 2-bytes format,
the decoder was expecting the sequence section to continue,
and was looking for FSE tables, which is incorrect.

Fixed this behavior in both the reference decoder and the educational decoder.

In practice, this behavior never happens,
because the encoder will always select the 1-byte format to represent 0,
since this is more efficient.

Completed the fix with a new golden sample for tests,
a clarification of the specification,
and a decoder errata paragraph.
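
A sketch of the header parsing at issue, following the sequence-count encoding described in the Zstandard format specification. `read_nb_seq` is a hypothetical function, not the decoder's. The key point is that "no sequences" is decided from the decoded value, not from which encoding produced it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes the number of sequences from the start of a sequences section.
 * Returns the number of header bytes consumed, or 0 if src is too short. */
static size_t read_nb_seq(const uint8_t *src, size_t srcSize, unsigned *nbSeq)
{
    if (srcSize < 1) return 0;
    if (src[0] < 128) {                    /* 1-byte format */
        *nbSeq = src[0];
        return 1;
    }
    if (src[0] < 255) {                    /* 2-byte format */
        if (srcSize < 2) return 0;
        *nbSeq = ((unsigned)(src[0] - 128) << 8) + src[1];
        return 2;
    }
    if (srcSize < 3) return 0;             /* 3-byte format */
    *nbSeq = (unsigned)src[1] + ((unsigned)src[2] << 8) + 0x7F00u;
    return 3;
}
```

Both `{0x00}` and `{0x80, 0x00}` decode to zero sequences; in either case a caller should treat the section as ended after the consumed bytes and not look for FSE tables.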
2023-06-05 16:03:00 -07:00
Nick Terrell
61efb2a047 Add ZSTD_d_maxBlockSize parameter
Reduces memory when blocks are guaranteed to be smaller than allowed by
the format. This is useful for streaming compression in conjunction with
ZSTD_c_maxBlockSize.

This PR saves 2 * (formatMaxBlockSize - paramMaxBlockSize) when streaming.
Once it is rebased on top of PR #3616 it will save
3 * (formatMaxBlockSize - paramMaxBlockSize).
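
To make the formula concrete, a small arithmetic sketch with a hypothetical helper. The 128 KB and 4 KB figures in the note below are example inputs chosen for illustration, not measurements from the PR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes saved when the block size is capped at
 * paramMax instead of formatMax, across `buffers` block-sized buffers. */
static size_t streaming_savings(size_t formatMax, size_t paramMax,
                                size_t buffers)
{
    return paramMax < formatMax ? buffers * (formatMax - paramMax) : 0;
}
```

For example, with a 128 KB format maximum and a 4 KB parameter maximum, a factor of 2 gives 2 * 124 KB = 248 KB, and a factor of 3 gives 372 KB.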
2023-04-17 22:06:44 -07:00
Nick Terrell
0abf2baef9 Reduce streaming decompression memory by 128KB
The split literals buffer patch increased streaming decompression memory
by 64KB (shrunk lit buffer from 128KB to 64KB, and added 128KB). This
patch removes the added 128KB buffer, because it isn't necessary.

The buffer was there because the literals compression code didn't know
the true `blockSizeMax` of the frame, and always put split literals so
they ended 128KB - 32 from the beginning of the block. Instead, we can
pass down the true `blockSizeMax` and ensure that the split literals
end up at `blockSizeMax - 32` from the beginning of the block. We
already reserve a full `blockSizeMax` bytes in streaming mode, so we
won't be overwriting the extDict window.
2023-04-17 16:31:02 -07:00
Yann Collet
e4120c5513 fixing potential over-reads
detected by @terrelln,
these issues could be triggered in specific scenarios
namely decompression of certain invalid magic-less frames,
or requested properties from certain invalid skippable frames.
2023-04-03 16:52:32 -07:00
daniellerozenblit
fcaf06ddb4
Check that dest is valid for decompression (#3555)
* add check for valid dest buffer and fuzz on random dest ptr when malloc 0

* add uptrval to linux-kernel

* remove bin files

* get rid of uptrval

* restrict max pointer value check to platforms where sizeof(size_t) == sizeof(void*)
2023-03-31 23:00:55 -07:00