this is meant to abstract the sumtype representation required
to transfer `offcode` to `ZSTD_storeSeq()`.
Unfortunately, the sumtype numeric representation is currently a leaky abstraction
that has permeated many other parts of the code,
especially within `zstd_lazy.c` and also within `zstd_opt.c` and `zstd_compress.c`.
While this PR does a good job of transferring a large number of call sites
to the new macros, there are still a few sites where this transformation is more complex,
or where the numeric representation itself is used "as is".
One of the problematic areas is the decision to use the numeric format of the sumtype
within the match finders of `zstd_lazy`.
This commit doesn't change the behavior: it only introduces and employs the macros,
and in the end the resulting code remains identical.
Ultimately, if the numeric representation of the sumtype can be completely abstracted
and no other part of the code depends on it,
it will be possible to move it towards something slightly more efficient.
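
As a minimal sketch of what such an abstraction can look like, assume a
representation in which small stored values denote repcodes and larger values
denote real offsets shifted by a constant. All `SKETCH_*` names and the exact
layout below are hypothetical and only illustrate the idea; they are not
claimed to be the macros introduced by this commit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names and layout, for illustration only. */
#define SKETCH_REP_NUM   3                      /* assumed number of repcodes */
#define SKETCH_REP_MOVE  (SKETCH_REP_NUM - 1)   /* shift applied to real offsets */

/* Encoding side: repcodes 1..3 map to 0..2, real offsets sit above them. */
#define SKETCH_STORE_REPCODE(r)  ((uint32_t)(r) - 1)
#define SKETCH_STORE_OFFSET(o)   ((uint32_t)(o) + SKETCH_REP_MOVE)

/* Decoding side: callers query the stored value through macros
 * instead of comparing it against magic numbers. */
#define SKETCH_IS_REPCODE(s)     ((s) <= SKETCH_REP_MOVE)
#define SKETCH_IS_OFFSET(s)      ((s) > SKETCH_REP_MOVE)
#define SKETCH_REPCODE(s)        ((s) + 1)
#define SKETCH_OFFSET(s)         ((s) - SKETCH_REP_MOVE)

/* Round-trip a real offset (>= 1) through the stored representation. */
static uint32_t sketch_roundtrip_offset(uint32_t offset)
{
    uint32_t const stored = SKETCH_STORE_OFFSET(offset);
    assert(SKETCH_IS_OFFSET(stored));
    return SKETCH_OFFSET(stored);
}
```

As long as every call site goes through macros of this kind, the numeric
layout can later be changed in one place.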
`CFLAGS=-O0 make`
will now use `-O0` instead of enforcing `-O3`
which used to be the behavior before introduction of `libzstd.mk`.
This should result in faster tests,
since a few tests depend on this capability for faster roundtrips.
since this is effectively what is stored in this field (== matchLength - MINMATCH).
This makes it clearer what needs to be done when reading from / writing to this field.
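
A small sketch of the read/write convention this implies. The struct, the
field name `baseLength`, the helper names, and the value 3 for the minimum
match length are assumptions made for illustration, not the actual
definitions.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MINMATCH 3   /* assumed minimum match length */

/* Hypothetical sequence record: the field holds matchLength - SKETCH_MINMATCH. */
typedef struct {
    uint32_t litLength;
    uint32_t baseLength;
} sketch_seq;

/* Writing: subtract the minimum before storing. */
static void sketch_setMatchLength(sketch_seq* seq, uint32_t matchLength)
{
    assert(matchLength >= SKETCH_MINMATCH);
    seq->baseLength = matchLength - SKETCH_MINMATCH;
}

/* Reading: add the minimum back to recover the real match length. */
static uint32_t sketch_getMatchLength(const sketch_seq* seq)
{
    return seq->baseLength + SKETCH_MINMATCH;
}
```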
Regression from commit a5f2c45528032ed20c33e0f8cd2c163a800a0017. It is
not possible to unconditionally add the asm sources, since not all
compilers understand the .s file extension.
Specifically for meson, only compilers inheriting from the GNU mixin
will allow a .s file at configure time.
zstd doesn't support asm for MSVC for the same basic reason; if/when
Windows asm support is added, it would involve preprocessing with nasm,
most likely.
the variable has very limited usage,
being used only once, at the beginning of the block, for prefetching,
hence the error had no impact on compression ratio.
This saves some 1.7 KB in the rodata section (x86_64, zstd tool),
while the assembler code stays the same except for
the type of a few load/extend instructions.
Should not have negative performance implications.
mostly for maintenance convenience.
Performance-wise, there is very little change:
slightly faster for slot 3 & 4,
neutral or very slightly negative for slot 5 & 6.
I couldn't find a good way to spread `ip0` and `ip1` apart when we accelerate
due to incompressible inputs. (The methods I tried slowed things down quite a
bit.)
Since we aren't splaying ip0 and ip1 apart (which would be like `0_1_2_3_`, as
opposed to the `01__23__` we were actually doing), it's a bit ambitious to
increment `step` by 2. Instead, let's increment it by 1, which has the benefit
of sliiightly improving compression. Speed remains pretty much unchanged.
The position updates are rewritten from `ip[N] = ip[N-1] + step` to be
`ip[N] = ip[N-2] + step`. This lets us only deal with the asymmetric spacing
of gaps at setup and then we only have to keep a single `step` variable.
This seems to work quite well on GCC and Clang!
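
To illustrate the index arithmetic only (not the real match-finder loop), here
is a sketch that generates positions with the rewritten rule
`ip[N] = ip[N-2] + step`. The function name and the use of plain indices
instead of pointers are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Fill pos[0..n) using pos[N] = pos[N-2] + step.
 * The first two entries are set explicitly: this is the only place
 * where the uneven gap between ip0 and ip1 has to be handled. */
static void sketch_fill_positions(size_t* pos, size_t n,
                                  size_t ip0, size_t ip1, size_t step)
{
    size_t i;
    if (n > 0) pos[0] = ip0;
    if (n > 1) pos[1] = ip1;
    for (i = 2; i < n; i++) {
        pos[i] = pos[i - 2] + step;   /* a single step variable suffices */
    }
}
```

Starting from adjacent positions 0 and 1 with `step = 4`, this produces pairs
of neighbouring positions separated by gaps, i.e. the `01__23__` shape
described above; with `step = 2` every position is visited.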