* perf improvements for zstd decode

tldr: 7.5% average decode speedup on silesia corpus at compression levels 1-3 (sandy bridge)

Background: while investigating zstd perf differences between clang and gcc I noticed that even though gcc is vectorizing the loop in wildcopy, it was not being done as well as could be done by hand. The sites where wildcopy is invoked have an interesting distribution of lengths to be copied. The loop trip count is rarely above 1, yet long copies are common enough to make their performance important. The code in zstd_decompress.c to invoke wildcopy handles the latter well but the gcc autovectorizer introduces a needlessly expensive startup check for vectorization.

See how GCC autovectorizes the loop here: https://godbolt.org/z/apr0x0

Here is the code after this diff has been applied (left hand side is the good one, right is with vectorizer on): https://godbolt.org/z/OwO4F8

Note that autovectorization still does not do a good job on the optimized version, so it's turned off via attribute and flag. I found that neither attribute nor command-line flag was entirely successful in turning off vectorization, which is why both are used.
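For readers unfamiliar with wildcopy, here is a minimal sketch of the kind of loop the background describes. It assumes a fixed 8-byte step and a caller that guarantees slack past the end of both buffers. The name wildcopy_sketch and the step constant are illustrative only; this is not the code from the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative step size; an assumption of this sketch. */
#define WILDCOPY_SKETCH_STEP 8

/* Copies at least `length` bytes from src to dst in fixed 8-byte chunks.
 * It may write up to WILDCOPY_SKETCH_STEP-1 bytes past dst+length, so the
 * caller must provide that much slack in both buffers. The buffers must
 * not overlap. Returns the number of bytes actually written. */
static size_t wildcopy_sketch(unsigned char *dst, const unsigned char *src, size_t length)
{
    unsigned char *const start = dst;
    unsigned char *const end = dst + length;
    assert(dst != NULL && src != NULL);
    do {
        memcpy(dst, src, WILDCOPY_SKETCH_STEP);  /* fixed-size copy of one chunk */
        dst += WILDCOPY_SKETCH_STEP;
        src += WILDCOPY_SKETCH_STEP;
    } while (dst < end);  /* for short copies the loop exits after one pass */
    return (size_t)(dst - start);
}
```

With this sketch, a request for 10 bytes takes two passes and writes 16 bytes, which is why the caller needs slack after the destination.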
silesia benchmark data - second triad of each file is with the original code:

file  orig -> compressed (ratio), encode, decode, change

1#dickens 10192446 -> 4268865 (2.388), 198.9 MB/s, 709.6 MB/s
2#dickens 10192446 -> 3876126 (2.630), 128.7 MB/s, 552.5 MB/s
3#dickens 10192446 -> 3682956 (2.767), 104.6 MB/s, 537 MB/s
1#dickens 10192446 -> 4268865 (2.388), 195.4 MB/s, 659.5 MB/s, 7.60%
2#dickens 10192446 -> 3876126 (2.630), 127 MB/s, 516.3 MB/s, 7.01%
3#dickens 10192446 -> 3682956 (2.767), 105 MB/s, 479.5 MB/s, 11.99%

1#mozilla 51220480 -> 20117517 (2.546), 285.4 MB/s, 734.9 MB/s
2#mozilla 51220480 -> 19067018 (2.686), 220.8 MB/s, 686.3 MB/s
3#mozilla 51220480 -> 18508283 (2.767), 152.2 MB/s, 669.4 MB/s
1#mozilla 51220480 -> 20117517 (2.546), 283.4 MB/s, 697.9 MB/s, 5.30%
2#mozilla 51220480 -> 19067018 (2.686), 225.9 MB/s, 665 MB/s, 3.20%
3#mozilla 51220480 -> 18508283 (2.767), 154.5 MB/s, 640.6 MB/s, 4.50%

1#mr 9970564 -> 3840242 (2.596), 262.4 MB/s, 899.8 MB/s
2#mr 9970564 -> 3600976 (2.769), 181.2 MB/s, 717.9 MB/s
3#mr 9970564 -> 3563987 (2.798), 116.3 MB/s, 620 MB/s
1#mr 9970564 -> 3840242 (2.596), 253.2 MB/s, 827.3 MB/s, 8.76%
2#mr 9970564 -> 3600976 (2.769), 177.4 MB/s, 655.4 MB/s, 9.54%
3#mr 9970564 -> 3563987 (2.798), 111.2 MB/s, 564.2 MB/s, 9.89%

1#nci 33553445 -> 2849306 (11.78), 575.2 MB/s, 1335.8 MB/s
2#nci 33553445 -> 2890166 (11.61), 509.3 MB/s, 1238.1 MB/s
3#nci 33553445 -> 2857408 (11.74), 431 MB/s, 1210.7 MB/s
1#nci 33553445 -> 2849306 (11.78), 565.4 MB/s, 1220.2 MB/s, 9.47%
2#nci 33553445 -> 2890166 (11.61), 508.2 MB/s, 1128.4 MB/s, 9.72%
3#nci 33553445 -> 2857408 (11.74), 429.1 MB/s, 1097.7 MB/s, 10.29%

1#ooffice 6152192 -> 3590954 (1.713), 231.4 MB/s, 662.6 MB/s
2#ooffice 6152192 -> 3323931 (1.851), 162.8 MB/s, 592.6 MB/s
3#ooffice 6152192 -> 3145625 (1.956), 99.9 MB/s, 549.6 MB/s
1#ooffice 6152192 -> 3590954 (1.713), 224.7 MB/s, 624.2 MB/s, 6.15%
2#ooffice 6152192 -> 3323931 (1.851), 155 MB/s, 564.5 MB/s, 4.98%
3#ooffice 6152192 -> 3145625 (1.956), 101.1 MB/s, 521.2 MB/s, 5.45%

1#osdb 10085684 -> 3739042 (2.697), 271.9 MB/s, 876.4 MB/s
2#osdb 10085684 -> 3493875 (2.887), 208.2 MB/s, 857 MB/s
3#osdb 10085684 -> 3515831 (2.869), 135.3 MB/s, 805.4 MB/s
1#osdb 10085684 -> 3739042 (2.697), 257.4 MB/s, 793.8 MB/s, 10.41%
2#osdb 10085684 -> 3493875 (2.887), 209.7 MB/s, 776.1 MB/s, 10.42%
3#osdb 10085684 -> 3515831 (2.869), 130.6 MB/s, 727.7 MB/s, 10.68%

1#reymont 6627202 -> 2152771 (3.078), 198.9 MB/s, 696.2 MB/s
2#reymont 6627202 -> 2071140 (3.200), 170 MB/s, 595.2 MB/s
3#reymont 6627202 -> 1953597 (3.392), 128.5 MB/s, 609.7 MB/s
1#reymont 6627202 -> 2152771 (3.078), 199.6 MB/s, 655.2 MB/s, 6.26%
2#reymont 6627202 -> 2071140 (3.200), 168.2 MB/s, 554.4 MB/s, 7.36%
3#reymont 6627202 -> 1953597 (3.392), 128.7 MB/s, 557.4 MB/s, 9.38%

1#samba 21606400 -> 5510994 (3.921), 338.1 MB/s, 1066 MB/s
2#samba 21606400 -> 5240208 (4.123), 258.7 MB/s, 992.3 MB/s
3#samba 21606400 -> 5003358 (4.318), 200.2 MB/s, 991.1 MB/s
1#samba 21606400 -> 5510994 (3.921), 330.8 MB/s, 974 MB/s, 9.45%
2#samba 21606400 -> 5240208 (4.123), 257.9 MB/s, 919.4 MB/s, 7.93%
3#samba 21606400 -> 5003358 (4.318), 198.5 MB/s, 908.9 MB/s, 9.04%

1#sao 7251944 -> 6256401 (1.159), 194.6 MB/s, 602.2 MB/s
2#sao 7251944 -> 5808761 (1.248), 128.2 MB/s, 532.1 MB/s
3#sao 7251944 -> 5556318 (1.305), 73 MB/s, 509.4 MB/s
1#sao 7251944 -> 6256401 (1.159), 198.7 MB/s, 580.7 MB/s, 3.70%
2#sao 7251944 -> 5808761 (1.248), 129.1 MB/s, 502.7 MB/s, 5.85%
3#sao 7251944 -> 5556318 (1.305), 74.6 MB/s, 493.1 MB/s, 3.31%

1#webster 41458703 -> 13692222 (3.028), 222.3 MB/s, 752 MB/s
2#webster 41458703 -> 12842646 (3.228), 157.6 MB/s, 532.2 MB/s
3#webster 41458703 -> 12191964 (3.400), 124 MB/s, 468.5 MB/s
1#webster 41458703 -> 13692222 (3.028), 219.7 MB/s, 697 MB/s, 7.89%
2#webster 41458703 -> 12842646 (3.228), 153.9 MB/s, 495.4 MB/s, 7.43%
3#webster 41458703 -> 12191964 (3.400), 124.8 MB/s, 444.8 MB/s, 5.33%

1#xml 5345280 -> 696652 (7.673), 485 MB/s, 1333.9 MB/s
2#xml 5345280 -> 681492 (7.843), 405.2 MB/s, 1237.5 MB/s
3#xml 5345280 -> 639057 (8.364), 328.5 MB/s, 1281.3 MB/s
1#xml 5345280 -> 696652 (7.673), 473.1 MB/s, 1232.4 MB/s, 8.24%
2#xml 5345280 -> 681492 (7.843), 398.6 MB/s, 1145.9 MB/s, 7.99%
3#xml 5345280 -> 639057 (8.364), 327.1 MB/s, 1175 MB/s, 9.05%

1#x-ray 8474240 -> 6772557 (1.251), 521.3 MB/s, 762.6 MB/s
2#x-ray 8474240 -> 6684531 (1.268), 230.5 MB/s, 688.5 MB/s
3#x-ray 8474240 -> 6166679 (1.374), 68.7 MB/s, 478.8 MB/s
1#x-ray 8474240 -> 6772557 (1.251), 502.8 MB/s, 736.7 MB/s, 3.52%
2#x-ray 8474240 -> 6684531 (1.268), 224.4 MB/s, 662 MB/s, 4.00%
3#x-ray 8474240 -> 6166679 (1.374), 67.3 MB/s, 437.8 MB/s, 9.37%

average change: 7.51%

* makefile changed to only pass -fno-tree-vectorize to gcc
* Don't add "no-tree-vectorize" attribute on clang (which defines __GNUC__)
* fix for warning/error with subtraction of void* pointers
* fix c90 conformance issue - ISO C90 forbids mixed declarations and code
* Fix assert for negative diff, only when there is no overlap
* fix overflow revealed in fuzzing tests
* tweak for small speed increase

Zstandard library files
The lib directory is split into several sub-directories, in order to make it easier to select or exclude features.
Building
Makefile script is provided, supporting Makefile conventions,
including commands variables, staged install, directory variables and standard targets.
- make : generates both static and dynamic libraries
- make install : install libraries and headers in target system directories

libzstd default scope is pretty large, including compression, decompression, dictionary builder,
and support for decoding legacy formats >= v0.5.0.
The scope can be reduced on demand (see paragraph modular build).
Multithreading support
Multithreading is disabled by default when building with make.
Enabling multithreading requires 2 conditions :
- set build macro ZSTD_MULTITHREAD (-DZSTD_MULTITHREAD for gcc)
- for POSIX systems : compile with pthread (-pthread compilation flag for gcc)

Both conditions are automatically applied when invoking make lib-mt target.

When linking a POSIX program with a multithreaded version of libzstd,
note that it's necessary to request the -pthread flag during link stage.

Multithreading capabilities are exposed via the advanced API defined in lib/zstd.h.
API
Zstandard's stable API is exposed within lib/zstd.h.
Advanced API
Optional advanced features are exposed via :
- lib/common/zstd_errors.h : translates size_t function results into a ZSTD_ErrorCode, for accurate error handling.
- ZSTD_STATIC_LINKING_ONLY : if this macro is defined before including zstd.h, it unlocks access to the experimental API, exposed in the second part of zstd.h. All definitions in the experimental APIs are unstable, they may still change in the future, or even be removed. As a consequence, experimental definitions shall never be used with dynamic library ! Only static linking is allowed.
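The README does not spell out how a size_t result can carry an error. The sketch below shows one common convention, assumed here for illustration: an error code is negated into the top of the size_t range, and a sentinel marks the boundary between valid results and errors. Every name prefixed with demo_ is hypothetical, and the codes are not the values from zstd_errors.h. In real code, use ZSTD_isError() from zstd.h and ZSTD_getErrorCode() from zstd_errors.h.

```c
#include <stddef.h>

/* Hypothetical error codes, for illustration only. */
typedef enum {
    demo_error_no_error = 0,
    demo_error_generic = 1,
    demo_error_dst_too_small = 2,
    demo_error_max_code = 3  /* sentinel: never returned as an error */
} demo_error_code;

/* Encode an error as a very large size_t via unsigned wraparound. */
static size_t demo_make_error(demo_error_code code)
{
    return (size_t)0 - (size_t)code;
}

/* A result is an error when it lies above the encoded sentinel. */
static int demo_is_error(size_t result)
{
    return result > demo_make_error(demo_error_max_code);
}

/* Translate a size_t result back into an error code. */
static demo_error_code demo_get_error_code(size_t result)
{
    if (!demo_is_error(result)) return demo_error_no_error;
    return (demo_error_code)((size_t)0 - result);
}
```

The point of the translation step is that callers can switch on a named code instead of comparing raw size_t values.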
Modular build
It's possible to compile only a limited set of features within libzstd.
The file structure is designed to make this selection manually achievable for any build system :
- Directory lib/common is always required, for all variants.
- Compression source code lies in lib/compress
- Decompression source code lies in lib/decompress
- It's possible to include only compress or only decompress, they don't depend on each other.
- lib/dictBuilder : makes it possible to generate dictionaries from a set of samples. The API is exposed in lib/dictBuilder/zdict.h. This module depends on both lib/common and lib/compress.
- lib/legacy : makes it possible to decompress legacy zstd formats, starting from v0.1.0. This module depends on lib/common and lib/decompress. To enable this feature, define ZSTD_LEGACY_SUPPORT during compilation. Specifying a number limits versions supported to that version onward. For example, ZSTD_LEGACY_SUPPORT=2 means : "support legacy formats >= v0.2.0". Conversely, ZSTD_LEGACY_SUPPORT=0 means "do not support legacy formats". By default, this build macro is set as ZSTD_LEGACY_SUPPORT=5. Decoding supported legacy format is a transparent capability triggered within decompression functions. It's also allowed to invoke legacy API directly, exposed in lib/legacy/zstd_legacy.h. Each version does also provide its own set of advanced API. For example, advanced API for version v0.4 is exposed in lib/legacy/zstd_v04.h.
- While invoking make libzstd, it's possible to define build macros ZSTD_LIB_COMPRESSION, ZSTD_LIB_DECOMPRESSION, ZSTD_LIB_DICTBUILDER, and ZSTD_LIB_DEPRECATED as 0 to forgo compilation of the corresponding features. This will also disable compilation of all dependencies (eg. ZSTD_LIB_COMPRESSION=0 will also disable dictBuilder).
- There are some additional build macros that can be used to minify the decoder.
  Zstandard often has more than one implementation of a piece of functionality, where each implementation optimizes for different scenarios. For example, the Huffman decoder has complementary implementations that decode the stream one symbol at a time or two symbols at a time. Zstd normally includes both (and dispatches between them at runtime), but by defining HUF_FORCE_DECOMPRESS_X1 or HUF_FORCE_DECOMPRESS_X2, you can force the use of one or the other, avoiding compilation of the other. Similarly, ZSTD_FORCE_DECOMPRESS_SEQUENCES_SHORT and ZSTD_FORCE_DECOMPRESS_SEQUENCES_LONG force the compilation and use of only one or the other of two decompression implementations. The smallest binary is achieved by using HUF_FORCE_DECOMPRESS_X1 and ZSTD_FORCE_DECOMPRESS_SEQUENCES_SHORT.
  For squeezing the last ounce of size out, you can also define ZSTD_NO_INLINE, which disables inlining, and ZSTD_STRIP_ERROR_STRINGS, which removes the error messages that are otherwise returned by ZSTD_getErrorName.
- While invoking make libzstd, the build macro ZSTD_LEGACY_MULTITHREADED_API=1 will expose the deprecated ZSTDMT API exposed by zstdmt_compress.h in the shared library, which is now hidden by default.
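The "force one implementation" macros described in the list above follow a general pattern: compile both variants and dispatch at runtime by default, or compile only one when a build macro asks for it. A minimal sketch of that pattern follows. DEMO_FORCE_X1, DEMO_FORCE_X2 and the functions are hypothetical stand-ins, and the size threshold is an arbitrary placeholder, not the heuristic zstd uses.

```c
#include <stddef.h>

/* Hypothetical macros; they only mimic the shape of the technique. */
#if defined(DEMO_FORCE_X1) && defined(DEMO_FORCE_X2)
#  error "define at most one of DEMO_FORCE_X1 and DEMO_FORCE_X2"
#endif

#ifndef DEMO_FORCE_X2
/* Variant 1: compiled unless variant 2 is forced. Returns its own id. */
static int demo_decode_x1(size_t srcSize) { (void)srcSize; return 1; }
#endif

#ifndef DEMO_FORCE_X1
/* Variant 2: compiled unless variant 1 is forced. Returns its own id. */
static int demo_decode_x2(size_t srcSize) { (void)srcSize; return 2; }
#endif

/* Entry point: forced selection at build time, else runtime dispatch. */
static int demo_decode(size_t srcSize)
{
#if defined(DEMO_FORCE_X1)
    return demo_decode_x1(srcSize);
#elif defined(DEMO_FORCE_X2)
    return demo_decode_x2(srcSize);
#else
    /* Placeholder heuristic: small inputs use variant 1. */
    return (srcSize < 256) ? demo_decode_x1(srcSize) : demo_decode_x2(srcSize);
#endif
}
```

Built without either macro, both variants are compiled and demo_decode picks between them. Built with -DDEMO_FORCE_X1, the second variant is never compiled, which is where the binary size saving comes from.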
Windows : using MinGW+MSYS to create DLL
DLL can be created using MinGW+MSYS with the make libzstd command.
This command creates dll\libzstd.dll and the import library dll\libzstd.lib.
The import library is only required with Visual C++.
The header file zstd.h and the dynamic library dll\libzstd.dll are required to
compile a project using gcc/MinGW.
The dynamic library has to be added to linking options.
It means that if a project that uses ZSTD consists of a single test-dll.c
file it should be linked with dll\libzstd.dll. For example:
gcc $(CFLAGS) -Iinclude/ test-dll.c -o test-dll dll\libzstd.dll
The compiled executable will require ZSTD DLL which is available at dll\libzstd.dll.
Deprecated API
Obsolete API on their way out are stored in directory lib/deprecated.
At this stage, it contains older streaming prototypes, in lib/deprecated/zbuff.h.
These prototypes will be removed in some future version.
Consider migrating code towards supported streaming API exposed in zstd.h.
Miscellaneous
The other files are not source code. There are :
- BUCK : support for buck build system (https://buckbuild.com/)
- Makefile : make script to build and install zstd library (static and dynamic)
- README.md : this file
- dll/ : resources directory for Windows compilation
- libzstd.pc.in : script for pkg-config (used in make install)