4840 Commits

Author SHA1 Message Date
Arpad Panyik
bd38fc2c5f AArch64: Enhance struct access in Huffman decode 2X
In the multi-stream multi-symbol Huffman decoder GCC generates
suboptimal code - emitting more loads for HUF_DEltX2 struct member
accesses. Forcing it to use 32-bit loads and bit arithmetic to extract
the necessary parts (UBFX) improves the overall decode speed.

Also avoid integer type conversions in the symbol decodes, which
leads to better instruction selection in table lookup accesses.
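
The single-load idea above can be sketched in portable C as follows. This is a minimal illustration, not the code from the commit: the struct layout (16-bit sequence, 8-bit bit count, 8-bit length), the `DEltX2_sketch` name and the `entry_*` helpers are assumptions, and the field extraction assumes a little-endian target. On AArch64 a compiler may turn the shift-and-mask forms into UBFX.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed 4-byte table entry, modelled on the HUF_DEltX2 struct named in
 * the commit message. The exact field layout is an assumption. */
typedef struct {
    uint16_t sequence;  /* up to two decoded bytes */
    uint8_t  nbBits;    /* bits consumed from the bitstream */
    uint8_t  length;    /* number of decoded symbols */
} DEltX2_sketch;

/* Read the whole entry with one 32-bit load instead of three member loads.
 * memcpy is the portable way to express an unaligned-safe word load. */
static uint32_t entry_load32(const DEltX2_sketch* e)
{
    uint32_t v;
    assert(sizeof *e == sizeof v);
    memcpy(&v, e, sizeof v);
    return v;
}

/* Extract the fields with bit arithmetic (little-endian layout assumed). */
static uint32_t entry_sequence(uint32_t v) { return v & 0xFFFFu; }
static uint32_t entry_nbBits(uint32_t v)   { return (v >> 16) & 0xFFu; }
static uint32_t entry_length(uint32_t v)   { return v >> 24; }
```

A decoder loop would call `entry_load32()` once per table lookup and derive all three fields from the returned word.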

On AArch64 the decoder no longer runs into register-pressure limits,
so we can simplify the hot path and improve throughput.

Decompression uplifts on a Neoverse V2 system, using Zstd-1.5.8
compiled with "-O3 -march=armv8.2-a+sve2":

                 Clang-20   Clang-*    GCC-13    GCC-14    GCC-15
 1#silesia.tar:   +0.820%   +1.365%   +2.480%   +1.348%   +0.987%
 2#silesia.tar:   +0.426%   +0.784%   +1.218%   +0.665%   +0.554%
 3#silesia.tar:   +0.112%   +0.389%   +0.508%   +0.188%   +0.261%

* Requires Clang-21 support from LLVM commit hash
  `a53003fe23cb6c871e72d70ff2d3a075a7490da2`
  (Clang-21 hasn’t been released as of this writing)
2025-06-23 14:16:25 +00:00
Yann Collet
7eefc22169
Merge pull request #4367 from ClickHouse/cfi
Add unwind information in huf_decompress_amd64.S
2025-06-19 23:41:38 -07:00
Arpad Panyik
7e4937bc75 AArch64: Add SVE2 implementation of histogram computation
The existing scalar implementation uses a 4-way pipelined histogram
calculation which is very efficient on out-of-order CPUs. However,
this can be further accelerated using the SVE2 HISTSEG instructions -
which compute a histogram for 16 byte chunks in a vector register.

On a system with 128-bit vectors (VL128) we need 16 HISTSEG executions
to compute the histogram for the whole symbol space (0..255) of 16
bytes of input. However, we can only accumulate 15 such 16-byte strips
before a possible overflow (15 x 16 = 240 <= 255), so we need to extend
and save the 8-bit histogram accumulators to 16-bit after every 240-byte
chunk of input. To store all of them in registers we would need 32
128-bit registers. Longer SVE2 vectors could help here, if such machines
become available.
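
The overflow arithmetic above can be modelled with a scalar sketch: 8-bit bins are safe for at most 15 strips of 16 bytes, after which they are widened into 16-bit bins. This is not the SVE2 code from the commit; `hist_narrow_then_widen` and the macro names are hypothetical, and the plain loop stands in for the HISTSEG accumulation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIP_BYTES      16  /* one 128-bit vector of input */
#define STRIPS_PER_FLUSH 15  /* 15 * 16 = 240 <= 255: a uint8_t bin cannot wrap */
#define FLUSH_BYTES      (STRIP_BYTES * STRIPS_PER_FLUSH)

/* Scalar model: accumulate into 8-bit bins, widen to 16-bit every 240 bytes.
 * The caller must ensure no symbol occurs more than 65535 times in total. */
static void hist_narrow_then_widen(const uint8_t* src, size_t n,
                                   uint16_t wide[256])
{
    uint8_t narrow[256];
    size_t pos = 0;
    assert(src != NULL || n == 0);
    memset(wide, 0, 256 * sizeof wide[0]);
    while (pos < n) {
        size_t const chunk = (n - pos < FLUSH_BYTES) ? n - pos : FLUSH_BYTES;
        size_t i;
        memset(narrow, 0, sizeof narrow);
        /* At most 240 increments can land in a single bin here. */
        for (i = 0; i < chunk; i++) narrow[src[pos + i]]++;
        /* Extend and save the 8-bit accumulators into the 16-bit ones. */
        for (i = 0; i < 256; i++) wide[i] = (uint16_t)(wide[i] + narrow[i]);
        pos += chunk;
    }
}
```

Flushing after 16 strips instead of 15 would allow a bin to reach 256 and wrap to 0 when the input is a run of one repeated byte.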

The maximum input block size in Zstd is 128 KiB, so 16-bit accumulators
would not be enough in the worst case. However, an LZ pass precedes the
histogram calculation, so it is impossible (my assumption) to overflow
the 16-bit accumulators.

The symbol distribution is also not uniform: the lower values are more
common, so we used a 3-pass algorithm to prevent stack spilling. In the
first pass we only compute histograms for 64 symbols (4-way SIMD) while
also computing the maximum symbol value. If we have symbol values
of 64 or larger we start the second pass to compute the next 96 elements
of the histogram. The final pass calculates the remaining part of the
histogram (256 symbols in total) if needed. This split of histogram
generation gave the best overall results for performance.
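
The pass structure can be illustrated with a scalar model. The range boundaries (0..63, 64..159, 160..255) are inferred from the 64 + 96 + remainder split described above and are an assumption, as are the `count_range` and `hist_three_pass` names; the real implementation uses SVE2 vectors rather than these loops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count only the symbols that fall in [lo, hi). */
static void count_range(const uint8_t* src, size_t n,
                        unsigned lo, unsigned hi, uint32_t hist[256])
{
    size_t i;
    for (i = 0; i < n; i++) {
        unsigned const s = src[i];
        if (s >= lo && s < hi) hist[s]++;
    }
}

/* Builds the full histogram and returns how many passes were needed. */
static int hist_three_pass(const uint8_t* src, size_t n, uint32_t hist[256])
{
    unsigned maxSym = 0;
    size_t i;
    assert(src != NULL || n == 0);
    memset(hist, 0, 256 * sizeof hist[0]);
    /* Pass 1: symbols 0..63, while tracking the maximum symbol value. */
    for (i = 0; i < n; i++) {
        unsigned const s = src[i];
        if (s > maxSym) maxSym = s;
        if (s < 64) hist[s]++;
    }
    if (maxSym < 64) return 1;
    /* Pass 2: the next 96 symbols, 64..159. */
    count_range(src, n, 64, 160, hist);
    if (maxSym < 160) return 2;
    /* Pass 3: the remaining symbols, 160..255. */
    count_range(src, n, 160, 256, hist);
    return 3;
}
```

Inputs dominated by small symbol values finish after the first pass, which is the case the split is meant to favour.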

This implementation is the best performing of a number of different
cache blocking schemes tested.

Compression uplifts on a Neoverse V2 system, using Zstd-1.5.8
(e26dde3d) as a baseline, compiled with "-O3 -march=armv8.2-a+sve2":

                 Clang-20    GCC-14
 1#silesia.tar:   +6.173%   +5.987%
 2#silesia.tar:   +5.200%   +5.011%
 3#silesia.tar:   +4.332%   +5.031%
 4#silesia.tar:   +2.789%   +3.064%
 5#silesia.tar:   +2.028%   +1.838%
 6#silesia.tar:   +1.562%   +1.340%
 7#silesia.tar:   +1.160%   +0.959%
2025-06-11 12:14:22 +00:00
Michael Kolupaev
a480191f9e Fix Darwin build of huf_decompress_amd64.S 2025-06-08 05:07:09 +00:00
Michael Kolupaev
80cac404c7 Add unwind information in huf_decompress_amd64.S 2025-06-08 05:07:09 +00:00
李子建
d95123f2e6 Improve speed of ZSTD_compressSequencesAndLiterals() using RVV 2025-06-02 17:21:02 +08:00
Nobuhiro Iwamatsu
2d224dc745 Add License variable to pkg-config file
The pkg-config file format has a License variable that lets you declare the
license of the software. This sets License to 'BSD-3-Clause OR GPL-2.0-only'.

Ref: https://github.com/pkgconf/pkgconf/blob/master/man/pc.5#L116
Signed-off-by: Nobuhiro Iwamatsu <iwamatsu@nigauri.org>
2025-05-06 12:16:28 -07:00
Etienne Cordonnier
8929d3b09f Fix duplicate LC_RPATH error on MacOS
After the update to MacOS 15.4, the dynamic loader dyld treats duplicated LC_RPATH as an error.
The `FLAGS` variable already contains `LDFLAGS`, thus using both `FLAGS` and `LDFLAGS`
duplicates all `LDFLAGS`, including `-Wl,rpath` parameters.

The duplicate LC_RPATH causes errors like this:

```
dyld[29361]: Library not loaded: @loader_path/../lib/libzstd.1.dylib
      Referenced from: <7131C877-3CF0-33AC-AA05-257BA4FDD770> /Users/foobar/...
      Reason: tried: '/Users/foobar/..../lib/libzstd.1.dylib' (duplicate LC_RPATH '/usr/mypath.../lib')
```

Closes https://github.com/facebook/zstd/issues/4369

Signed-off-by: Etienne Cordonnier <ecordonnier@snap.com>
2025-04-18 15:59:06 +02:00
Yann Collet
2fec3989c1 add an assert
to help static analyzers understand there is no overflow risk there.
2025-03-22 18:23:31 -07:00
Z. Liu
cd8ca9d92e lib/zstd.h: move pragma before static
Otherwise the dev-python/zstandard build fails when compiling with
clang, as reported at https://bugs.gentoo.org/950259

The root cause is pycparser, which has remained unfixed since it was
reported 2.5 years ago :(

Signed-off-by: Z. Liu <zhixu.liu@gmail.com>
2025-03-20 03:40:42 +00:00
Yann Collet
4d53e27144 removed OpenBSD specificity 2025-03-12 09:55:14 -07:00
Yann Collet
ddcb41a282 updated documentation 2025-03-11 14:10:35 -07:00
Yann Collet
a9b8fef2e8 add support for C11 Annex K qsort_s()
standard defined re-entrant variant of qsort().
Unfortunately, Annex K is optional.
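
Because Annex K is optional, code that wants `qsort_s()` has to detect it and fall back otherwise. The sketch below shows one way to do that with the standard `__STDC_LIB_EXT1__` feature macro. It is not the selection logic from this commit: `sort_ctx`, `sort_ints` and the global-pointer fallback are hypothetical, and the fallback path is not re-entrant.

```c
#define __STDC_WANT_LIB_EXT1__ 1  /* request Annex K; must precede the includes */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { int descending; } sort_ctx;  /* hypothetical sort context */

static int cmp_core(const int* a, const int* b, const sort_ctx* ctx)
{
    int const r = (*a > *b) - (*a < *b);
    return ctx->descending ? -r : r;
}

#ifdef __STDC_LIB_EXT1__
/* Annex K is available: the context travels through qsort_s() itself. */
static int cmp_s(const void* a, const void* b, void* ctx)
{
    return cmp_core(a, b, ctx);
}
static void sort_ints(int* v, size_t n, sort_ctx* ctx)
{
    errno_t const err = qsort_s(v, n, sizeof v[0], cmp_s, ctx);
    assert(err == 0);
    (void)err;
}
#else
/* Fallback: plain qsort() with the context in a file-scope pointer. */
static const sort_ctx* g_ctx;
static int cmp_c90(const void* a, const void* b)
{
    return cmp_core(a, b, g_ctx);
}
static void sort_ints(int* v, size_t n, sort_ctx* ctx)
{
    g_ctx = ctx;
    qsort(v, n, sizeof v[0], cmp_c90);
}
#endif
```

Many common C libraries do not define `__STDC_LIB_EXT1__`, so in practice the fallback branch is often the one that gets compiled.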
2025-03-11 14:10:35 -07:00
Yann Collet
dcf675886b re-design qsort() selection in cover
centralizes auto detection tests,
then distributes the outcome to all the places where it's active.
2025-03-11 14:10:35 -07:00
Yann Collet
51b6e79f65 fix #4312
and upgraded the test so that it would fail, both at compile time and at run time, without the fix
2025-03-11 14:10:35 -07:00
Nick Terrell
68dfd14a8c [linux] Opt out of row based match finder for the kernel
The row based match finder is slower without SIMD. We used to detect the
presence of SIMD to set the lower bound to 17, but that breaks
determinism. Instead, specifically opt into that lower bound for the
kernel, because it is one of the rare cases that doesn't have SIMD support.
2025-03-11 16:18:59 -04:00
Yann Collet
2ff87aefac fix FreeBSD
use an alias instead of a function

also: added more traces and updated version nb to v1.5.8
2025-03-10 19:04:41 -07:00
Nick Terrell
0de4991942 Add a method for checking if ZSTD was compiled with flags that impact determinism 2025-03-07 10:31:19 -05:00
Nick Terrell
190a620974 [zstd] Remove global variables in dictBuilder
D50949782 fixed a race condition updating `g_displayLevel` by disabling display.
Instead of disabling display, delete the global variable and always "capture" a local `displayLevel` variable.
This also fixes `DISPLAYUPDATE()` by requiring the user to pass in the last update time as the first parameter.
2025-03-05 10:35:01 -05:00
Nick Terrell
d5b84f5a27 [zstd] Backport D49756856 2025-03-05 10:35:01 -05:00
Yann Collet
4e1723a7e4 fixed the script so that it fails when a copy fails
and also: fix the list of files, as `zdict.h` was incorrectly set.
2025-02-27 16:18:44 -08:00
Yann Collet
7340657c6f update build_package.bat by using a subroutine 2025-02-27 16:18:44 -08:00
Yann Collet
e94e09dd7b ensure that a copy error results in the task failing clearly
error code != 0, red status

checked by intentionally inserting an error in another run
2025-02-27 16:18:44 -08:00
Yann Collet
a1a5154b69
Merge pull request #4312 from Cyan4973/musl_ci
introduce ZSTD_USE_C90_QSORT
2025-02-27 14:27:21 -08:00
Yann Collet
22b2fd2517
Merge pull request #4317 from hirohira9119/fix-function-signature
Fix function signature mismatch for ZSTD_convertBlockSequences
2025-02-27 13:03:03 -08:00
Yann Collet
d6fbaaac99
Merge pull request #4320 from sebres/patch-1
build_package.bat: fix path to zstd_errors.h, avoid silently ignoring errors if the build failed
2025-02-26 15:15:03 -08:00
Yann Collet
dca9791862 fixed minor C++ compat warnings 2025-02-26 14:30:29 -08:00
Sergey G. Brester
f0d3173203
build_package.bat: don't swallow copy error(s), exit with an error if something failed 2025-02-26 20:02:48 +01:00
Sergey G. Brester
97bc43cc68
build_package.bat: fix path to zstd_errors.h (it is in lib not in lib/common)
closes gh-4318
2025-02-26 19:27:44 +01:00
Yann Collet
db2d205ada fixed -Wconversion for lib/decompress/zstd_decompress_block.c 2025-02-26 10:01:05 -08:00
Yann Collet
2413f17322 fixed -Wconversion for cover.c 2025-02-26 08:33:01 -08:00
Yann Collet
8ffa27d93b fixed -Wconversion for divsufsort.c 2025-02-26 08:12:11 -08:00
Yann Collet
e635221f1b fixed -Wconversion for zdict 2025-02-26 08:07:51 -08:00
Yann Collet
30281d889f fix conversion warning 2025-02-26 07:41:34 -08:00
hirohira
2840631dc1 Fix function signature mismatch for ZSTD_convertBlockSequences 2025-02-26 08:23:48 +09:00
Yann Collet
fd5498a179 document ZSTD_USE_C90_QSORT 2025-02-21 12:48:26 -08:00
Yann Collet
ebfa660b82 introduce ZSTD_USE_C90_QSORT 2025-02-21 11:36:30 -08:00
Yann Collet
d2c562b803 update hrlog comment 2025-02-10 10:48:56 -08:00
Yann Collet
67fad95f79 derive hashratelog from hashlog when only hashlog is set 2025-02-10 10:46:37 -08:00
Yann Collet
09d7e34ed8 adjust mml 2025-02-10 10:46:37 -08:00
Yann Collet
d5e4698267 fix boundary condition 2025-02-10 10:46:37 -08:00
Yann Collet
72406b71c3 update hrlog rule to favor compression ratio a bit more at low levels 2025-02-10 10:46:37 -08:00
Yann Collet
f26cc54f37 dynamic bucket sizes 2025-02-10 10:46:37 -08:00
Yann Collet
4609a40b89 dynamically adjust hratelog and ldmml based on strategy 2025-02-10 10:46:37 -08:00
Yann Collet
23e5f80390 Revert "pass dictionary loading method as parameter"
This reverts commit 821fc567f93a415e9fbe856271ccd452ee7acf07.
2025-02-05 18:47:26 -08:00
Yann Collet
c7cd7dc04b better MT fluidity
--patch-from no longer blocked on first job dictionary loading
2025-02-05 18:42:00 -08:00
Yann Collet
f11bd19c7f ensure cdict is properly reset to NULL 2025-02-05 18:42:00 -08:00
Yann Collet
7406d2b6eb skips the need to create a temporary cdict for --patch-from
thus saving a bit of memory and a little bit of cpu time
2025-02-05 18:42:00 -08:00
Yann Collet
220abe6da8 reduced memory usage
by not duplicating in memory
a dictionary that was passed by reference.
2025-02-05 18:42:00 -08:00
Yann Collet
85a44b233a always free .cdictLocal 2025-02-05 18:41:59 -08:00