In the multi-stream multi-symbol Huffman decoder, GCC generates
suboptimal code, emitting more loads than necessary for HUF_DEltX2
struct member accesses. Forcing it to use 32-bit loads and bit arithmetic
to extract the necessary parts (UBFX) improves the overall decode speed.
Also avoid integer type conversions in the symbol decodes, which
leads to better instruction selection in table lookup accesses.
On AArch64 the decoder no longer runs into register-pressure limits,
so we can simplify the hot path and improve throughput.
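
The 32-bit load idea can be sketched roughly as follows. This is an illustration only, not the code from this change: `DEltX2` is a stand-in that assumes the same 4-byte layout as HUF_DEltX2, the helper names are hypothetical, and the field extraction assumes a little-endian target.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for a 4-byte decoding-table entry (layout assumed, not copied). */
typedef struct {
    uint16_t sequence;  /* up to two decoded symbols */
    uint8_t  nbBits;    /* number of bits consumed from the bitstream */
    uint8_t  length;    /* number of symbols stored in `sequence` */
} DEltX2;

/* Read the whole entry with one 32-bit access instead of three narrow loads.
 * memcpy of a fixed 4-byte size is typically lowered to a single load. */
static uint32_t load_entry(const DEltX2* table, size_t idx)
{
    uint32_t v;
    memcpy(&v, &table[idx], sizeof(v));
    return v;
}

/* Little-endian field extraction with shifts and masks; on AArch64 each of
 * these is a candidate for a single bitfield-extract instruction. */
static uint32_t entry_sequence(uint32_t e) { return e & 0xFFFFu; }
static uint32_t entry_nbBits(uint32_t e)   { return (e >> 16) & 0xFFu; }
static uint32_t entry_length(uint32_t e)   { return e >> 24; }
```

Keeping the extracted fields as `uint32_t` also avoids the narrowing and widening conversions mentioned above.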
Decompression uplifts on a Neoverse V2 system, using Zstd-1.5.8
compiled with "-O3 -march=armv8.2-a+sve2":
                Clang-20  Clang-*   GCC-13    GCC-14    GCC-15
1#silesia.tar:  +0.820%   +1.365%   +2.480%   +1.348%   +0.987%
2#silesia.tar:  +0.426%   +0.784%   +1.218%   +0.665%   +0.554%
3#silesia.tar:  +0.112%   +0.389%   +0.508%   +0.188%   +0.261%
* Requires Clang-21 support from LLVM commit hash
`a53003fe23cb6c871e72d70ff2d3a075a7490da2`
(Clang-21 hasn’t been released as of this writing)
The existing scalar implementation uses a 4-way pipelined histogram
calculation which is very efficient on out-of-order CPUs. However,
this can be further accelerated using the SVE2 HISTSEG instructions -
which compute a histogram for 16 byte chunks in a vector register.
On a system with 128-bit vectors (VL128) we need 16 HISTSEG executions
to compute the histogram for the whole symbol space (0..255) of 16
bytes of input. However, a strip can add up to 16 to a single 8-bit
accumulator, so we can only accumulate 15 such 16-byte strips
(15 * 16 = 240 <= 255) before a possible overflow. We therefore need to
widen the 8-bit histogram accumulators to 16 bits and save them after
every 240 bytes of input. To keep all 256 16-bit accumulators in
registers we would need 32 128-bit registers. Longer SVE2 vectors could
help here, if such machines become available.
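
A scalar model of this accumulate-then-widen scheme is sketched below. It is not the SVE2 implementation: plain byte counters stand in for the HISTSEG accumulators, and the function and macro names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIP_BYTES      16   /* bytes covered by one 128-bit vector */
#define STRIPS_PER_FLUSH 15   /* 15 * 16 = 240 <= 255: a uint8_t cannot wrap */
#define FLUSH_BYTES      (STRIP_BYTES * STRIPS_PER_FLUSH)

/* Count bytes with 8-bit accumulators, widening into 16-bit totals after
 * every 240 input bytes. The caller must guarantee that no symbol occurs
 * more than 65535 times. */
static void histogram_blocked(const uint8_t* src, size_t size,
                              uint16_t count[256])
{
    uint8_t narrow[256];
    size_t pos = 0;
    memset(count, 0, 256 * sizeof(count[0]));
    while (pos < size) {
        size_t const left  = size - pos;
        size_t const chunk = (left < FLUSH_BYTES) ? left : FLUSH_BYTES;
        size_t i;
        memset(narrow, 0, sizeof(narrow));
        /* At most 240 increments land on any single 8-bit counter. */
        for (i = 0; i < chunk; i++) narrow[src[pos + i]]++;
        /* Widen and fold into the 16-bit totals. */
        for (i = 0; i < 256; i++) count[i] = (uint16_t)(count[i] + narrow[i]);
        pos += chunk;
    }
}
```

Even an input made of a single repeated byte stays within the 8-bit range between two flushes, which is the property the 240-byte limit is meant to guarantee.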
The maximum input block size in Zstd is 128 KiB, so in the worst case
16-bit accumulators would not be enough. However, an LZ pass precedes
the histogram calculation, so it should be impossible (my assumption)
to overflow the 16-bit accumulators.
The symbol distribution is also not uniform: the lower values are more
common. We therefore use a 3-pass algorithm to prevent stack spilling.
In the first pass we only compute histograms for the first 64 symbols
(4-way SIMD) while also computing the maximum symbol value. If we see
symbol values of 64 or larger, we start the second pass to compute the
next 96 elements of the histogram. The final pass calculates the
remaining part of the histogram (256 symbols in total) if needed. This
split of histogram generation gave the best overall performance.
This implementation is the best performing of a number of different
cache blocking schemes tested.
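
The control flow of the three passes can be modelled in scalar C as below. This sketch only illustrates the 64 / 96 / 96 split and the early exits. It does not use SVE2, only the first pass's maximum is used for the exits, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count only the symbols in [lo, hi) and return the largest byte seen. */
static unsigned count_range(const uint8_t* src, size_t size,
                            unsigned lo, unsigned hi, uint32_t count[256])
{
    unsigned maxSym = 0;
    size_t i;
    for (i = 0; i < size; i++) {
        unsigned const s = src[i];
        if (s > maxSym) maxSym = s;
        if (s >= lo && s < hi) count[s]++;
    }
    return maxSym;
}

/* Fill `count` and return how many passes over the input were needed. */
static int histogram_3pass(const uint8_t* src, size_t size,
                           uint32_t count[256])
{
    unsigned maxSym;
    memset(count, 0, 256 * sizeof(count[0]));
    maxSym = count_range(src, size, 0, 64, count);   /* symbols 0..63 */
    if (maxSym < 64) return 1;
    count_range(src, size, 64, 160, count);          /* next 96 symbols */
    if (maxSym < 160) return 2;
    count_range(src, size, 160, 256, count);         /* remaining 96 symbols */
    return 3;
}
```

Inputs dominated by small symbol values finish after the first pass, so only a quarter of the accumulators ever need to be live at once.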
Compression uplifts on a Neoverse V2 system, using Zstd-1.5.8
(e26dde3d) as a baseline, compiled with "-O3 -march=armv8.2-a+sve2":
                Clang-20  GCC-14
1#silesia.tar:  +6.173%   +5.987%
2#silesia.tar:  +5.200%   +5.011%
3#silesia.tar:  +4.332%   +5.031%
4#silesia.tar:  +2.789%   +3.064%
5#silesia.tar:  +2.028%   +1.838%
6#silesia.tar:  +1.562%   +1.340%
7#silesia.tar:  +1.160%   +0.959%
The pkg-config file has a License variable that lets you declare the license of
the software. This sets License to 'BSD-3-Clause OR GPL-2.0-only'.
Ref: https://github.com/pkgconf/pkgconf/blob/master/man/pc.5#L116
Signed-off-by: Nobuhiro Iwamatsu <iwamatsu@nigauri.org>
After the update to macOS 15.4, the dynamic loader dyld treats duplicated LC_RPATH as an error.
The `FLAGS` variable already contains `LDFLAGS`, thus using both `FLAGS` and `LDFLAGS`
duplicates all `LDFLAGS`, including `-Wl,-rpath` parameters.
The duplicate LC_RPATH causes this kind of error:
```
dyld[29361]: Library not loaded: @loader_path/../lib/libzstd.1.dylib
Referenced from: <7131C877-3CF0-33AC-AA05-257BA4FDD770> /Users/foobar/...
Reason: tried: '/Users/foobar/..../lib/libzstd.1.dylib' (duplicate LC_RPATH '/usr/mypath.../lib')
```
Closes https://github.com/facebook/zstd/issues/4369
Signed-off-by: Etienne Cordonnier <ecordonnier@snap.com>
Otherwise the dev-python/zstandard build fails when compiling with
clang, as reported at https://bugs.gentoo.org/950259
The root cause is in pycparser, which has remained unfixed since it was
reported 2.5 years ago :(
Signed-off-by: Z. Liu <zhixu.liu@gmail.com>
The row-based match finder is slower without SIMD. We used to detect the
presence of SIMD to set the lower bound to 17, but that breaks
determinism. Instead, specifically opt into it for the kernel, because
it is one of the rare cases that doesn't have SIMD support.
D50949782 fixed a race condition updating `g_displayLevel` by disabling display.
Instead of disabling display, delete the global variable and always "capture" a local `displayLevel` variable.
This also fixes `DISPLAYUPDATE()` by requiring the user to pass in the last update time as the first parameter.
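
A minimal sketch of the pattern, assuming macros that read a `displayLevel` variable from the enclosing scope. The macro bodies, the refresh interval and `process_items()` are illustrative and are not copied from this change:

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical refresh interval: at most about six updates per second. */
#define REFRESH_RATE ((clock_t)(CLOCKS_PER_SEC / 6))

#define DISPLAY(...) fprintf(stderr, __VA_ARGS__)

/* Both macros "capture" a variable named `displayLevel` from the caller's
 * scope, so no global is shared between threads. */
#define DISPLAYLEVEL(l, ...)                                          \
    do { if (displayLevel >= (l)) DISPLAY(__VA_ARGS__); } while (0)

/* The caller owns the timestamp and passes it as the first argument. */
#define DISPLAYUPDATE(lastUpdateTime, l, ...)                         \
    do {                                                              \
        if (displayLevel >= (l)) {                                    \
            clock_t const now_ = clock();                             \
            if ((now_ - (lastUpdateTime) > REFRESH_RATE)              \
                || (displayLevel >= 4)) {                             \
                (lastUpdateTime) = now_;                              \
                DISPLAY(__VA_ARGS__);                                 \
            }                                                         \
        }                                                             \
    } while (0)

/* Example worker: all display state lives in locals and parameters. */
static unsigned process_items(unsigned nbItems, int displayLevel)
{
    clock_t lastUpdateTime = 0;
    unsigned done = 0;
    while (done < nbItems) {
        done++;
        DISPLAYUPDATE(lastUpdateTime, 2, "\r%u / %u", done, nbItems);
    }
    DISPLAYLEVEL(2, "\rprocessed %u items\n", done);
    return done;
}
```

Because each caller keeps its own `displayLevel` and `lastUpdateTime`, two threads running `process_items()` concurrently do not write to any shared display state.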