2448 Commits

Author SHA1 Message Date
W. Felix Handte
d18a405779 Refer to the Dictionary Match State In-Place (Sometimes) 2018-05-23 17:53:03 -04:00
Nick Terrell
c92dd11940 Error if reported size is too large in edge case 2018-05-23 14:47:20 -07:00
Nick Terrell
a97e9a627a [zstd] Fix decompression edge case
This edge case is only possible with the new optimal encoding selector,
since before zstd would always choose `set_basic` for small numbers of
sequences.

Fix `FSE_readNCount()` to support buffers < 4 bytes.

Credit to OSS-Fuzz
2018-05-23 12:16:00 -07:00
Nick Terrell
e3959d5eba Fixes 2018-05-22 16:06:33 -07:00
Yann Collet
7a8b3496b4 Merge branch 'dev' into staticDictCost 2018-05-22 15:10:05 -07:00
Yann Collet
a8ddf1d370 disable 2-passes strategy 2018-05-22 15:06:36 -07:00
Nick Terrell
49cf880513 Approximate FSE encoding costs for selection
Estimate the cost for using FSE modes `set_basic`, `set_compressed`, and
`set_repeat`, and select the one with the lowest cost.

* The cost of `set_basic` is computed using the cross-entropy cost
  function `ZSTD_crossEntropyCost()`, using the normalized default count
  and the count.
* The cost of `set_repeat` is computed using `FSE_bitCost()`. We check the
  previous table to see if it is able to represent the distribution.
* The cost of `set_compressed` is computed with the entropy cost function
  `ZSTD_entropyCost()`, together with the cost of writing the normalized
  count `ZSTD_NCountCost()`.
2018-05-22 14:33:22 -07:00
Yann Collet
27af35c110
Merge pull request #1143 from facebook/tableLevels
Update table of compression levels
2018-05-19 14:40:37 -07:00
Yann Collet
5381369cb1 Merge branch 'dev' into tableLevels 2018-05-18 18:23:27 -07:00
Yann Collet
b0b3fb517d updated compression levels for blocks of 256KB 2018-05-18 17:17:12 -07:00
Nick Terrell
7cbb8bbbbf [cover] Small compression ratio improvement
The cover algorithm selects one segment per epoch, and it selects the epoch
size such that `epochs * segmentSize ~= dictSize`. Selecting less epochs
gives the algorithm more candidates to choose from for each segment it
selects, and then it will loop back to the first epoch when it hits the
last one.

The trade off is that now it takes longer to select each segment, since it
has to look at more data before making a choice.

I benchmarked on the following data sets using this command:

```sh
$ZSTD -T0 -3 --train-cover=d=8,steps=256 $DIR -r -o dict && $ZSTD -3 -D dict -rc $DIR | wc -c
```

| Data set     | k (approx) |  Before  |  After   | % difference |
|--------------|------------|----------|----------|--------------|
| GitHub       | ~1000      |   738138 |   746610 |       +1.14% |
| hg-changelog | ~90        |  4295156 |  4285336 |       -0.23% |
| hg-commands  | ~500       |  1095580 |  1079814 |       -1.44% |
| hg-manifest  | ~400       | 16559892 | 16504346 |       -0.34% |

There is some noise in the measurements, since small changes to `k` can
have large differences, which is why I'm using `steps=256`, to try to
minimize the noise. However, the GitHub data set still has some noise.

If I run the GitHub data set on my Mac, which presumably lists directory
entries in a different order, so the dictionary builder sees the files in
a different order, or I use `steps=1024` I see these results.

| Run        | Before | After  | % difference |
|------------|--------|--------|--------------|
| steps=1024 | 738138 | 734470 |       -0.50% |
| MacBook    | 738451 | 737132 |       -0.18% |

Question: Should we expose this as a parameter? I don't think it is
necessary. Someone might want to turn it up to exchange a much longer
dictionary building time in exchange for a slightly better dictionary.
I tested `2`, `4`, and `16`, and `4` got most of the benefit of `16`
with a faster running time.
2018-05-18 16:15:27 -07:00
Yann Collet
5cbef6e094 Merge branch 'dev' into staticDictCost 2018-05-18 16:03:06 -07:00
Yann Collet
a95e9e80d1 adding some debug functions to observe statistics 2018-05-18 14:09:42 -07:00
fbrosson
291824f49d __builtin_prefetch did probably not exist before gcc 3.1. 2018-05-18 18:40:11 +00:00
fbrosson
16bb8f1f9e Drop colon in asm snippet to make old versions of gcc happy. 2018-05-18 17:05:36 +00:00
Yann Collet
af3da079d1 fixed minor conversion warning 2018-05-17 17:27:27 -07:00
Yann Collet
8572b4d09f fixed a pretty complex bug when combining ldm + btultra 2018-05-17 16:13:53 -07:00
Yann Collet
134388ba6b collect statistics for first block in ultra mode
this patch makes btultra do 2 passes on the first block,
the first one being dedicated to collecting statistics
so that the 2nd pass is more accurate.

It translates into a very small compression ratio gain :

enwik7, level 20:
blocks  4K : 2.142 -> 2.153
blocks 16K : 2.447 -> 2.457
blocks 64K : 2.716 -> 2.726

On the other hand, the cpu cost is doubled.

The trade off looks bad.
Though, that's ultimately a price to pay to reach better compression ratio.
So it's only enabled when setting btultra.
2018-05-17 12:24:30 -07:00
Yann Collet
a243020d37 slightly improved weight calculation
translating into a tiny compression ratio improvement
2018-05-17 11:19:44 -07:00
Yann Collet
63eeeaa1dd update table levels for blocks <= 16K
also : allow hlog to be slighly larger than windowlog,
as it's apparently good for both speed and compression ratio.
2018-05-16 16:13:37 -07:00
Yann Collet
18fc3d3cd5 introduced bit-fractional cost evaluation
this improves compression ratio by a *tiny* amount.
It also reduces speed by a small amount.

Consequently, bit-fractional evaluation is only turned on for btultra.
2018-05-16 14:53:35 -07:00
Yann Collet
9938b17d4c
Merge pull request #1135 from facebook/frameCSize
decompress: changed error code when input is too large
2018-05-15 11:02:53 -07:00
Nick Terrell
30d9c84b1a Fix failing Travis tests 2018-05-15 09:46:20 -07:00
Yann Collet
0b31304c8d Merge branch 'dev' into staticDictCost 2018-05-14 18:09:26 -07:00
Yann Collet
2c26df0e13 opt: removed static prices
after testing, it's actually always better to use dynamic prices
albeit initialised from dictionary.
2018-05-14 18:04:08 -07:00
Yann Collet
f372ffc64d
Merge pull request #1127 from facebook/staticDictCost
Improved optimal parser with dictionary
2018-05-14 17:45:50 -07:00
Yann Collet
d59cf02df0 decompress: changed error code when input is too large
ZSTD_decompress() can decompress multiple frames sent as a single input.
But the input size must be the exact sum of all compressed frames, no more.

In the case of a mistake on srcSize, being larger than required,
ZSTD_decompress() will try to decompress a new frame after current one, and fail.
As a consequence, it will issue an error code, ERROR(prefix_unknown).

While the error is technically correct
(the decoder could not recognise the header of _next_ frame),
it's confusing, as users will believe that the first header of the first frame is wrong,
which is not the case (it's correct).
It makes it more difficult to understand that the error is in the source size, which is too large.

This patch changes the error code provided in such a scenario.
If (at least) a first frame was successfully decoded,
and then following bytes are garbage values,
the decoder assumes the provided input size is wrong (too large),
and issue the error code ERROR(srcSize_wrong).
2018-05-14 15:32:28 -07:00
Yann Collet
c9227ee16b update table for 128 KB blocks 2018-05-13 17:15:07 -07:00
Yann Collet
b4250489cf update compression levels for large inputs 2018-05-13 01:53:38 -07:00
Yann Collet
761758982e replaced FSE_count by FSE_count_simple
to reduce usage of stack memory.

Also : tweaked a few comments, as suggested by @terrelln
2018-05-11 16:03:37 -07:00
Yann Collet
3193d692c2 minor patch, ensuring LIBDIR is created before installation
follow-up from #1123
2018-05-11 11:31:48 -07:00
Yann Collet
99ddca43a6 fixed wrong assertion
base can actually overflow
2018-05-10 19:48:09 -07:00
Yann Collet
0d7626672d fixed c++ conversion warning 2018-05-10 18:17:21 -07:00
Yann Collet
09d0fa29ee minor adjusting of weights 2018-05-10 18:13:48 -07:00
Yann Collet
1a26ec6e8d opt: init statistics from dictionary
instead of starting from fake "default" statistics.
2018-05-10 17:59:12 -07:00
Yann Collet
74b1c75d64 btopt : minor adjustment of update frequencies 2018-05-10 16:32:36 -07:00
Yann Collet
ac6105463a opt: minor improvements to log traces
slight improvement when using fractional-bit evaluation (opt:dictionay)
2018-05-09 15:46:11 -07:00
Yann Collet
c39061cb7b fixed declaration-after-statement warning 2018-05-09 12:07:25 -07:00
Yann Collet
4d5bd32a00 added traces to look at symbol costs
evaluation looks correct.
2018-05-09 12:00:12 -07:00
Yann Collet
c0da0f5e9e switchable bit-approximation / fractional-bit accuracy modes
also : makes it possible to select nb of fractional bits.
2018-05-09 10:48:09 -07:00
Yann Collet
ba2ad9b6b9 implemented fractional bit cost evaluation
for FSE symbols.

While it seems to work, the gains are negligible compared to rough maxNbBits evaluation.
There are even a few losses sometimes, that still need to be explained.
Furthermode, there are still cases where btlazy2 does a better job than btopt,
which seems rather strange too.
2018-05-08 17:43:13 -07:00
Yann Collet
1aff63b114 opt: shift all costs by 8 bits (* 256)
making it possible to represent fractional bit costs.
2018-05-08 16:19:04 -07:00
Yann Collet
6a3c34aa58 opt: estimate cost of both Hufman and FSE symbols
For FSE symbols : provide an upper bound,
in nb of bits,
since cost function is not able to store fractional bit costs.
2018-05-08 16:11:21 -07:00
Yann Collet
338f738c24 pass entropy tables to optimal parser
for proper estimation of symbol's weights
when using dictionary compression.

Note : using only huffman costs is not good enough,
presumably because sequence symbol costs are incorrect.
2018-05-08 15:37:06 -07:00
Yann Collet
a155061328 minor code refactor for readability
removed some useless operations from optimal parser
(should not change performance, too small a difference)
2018-05-08 12:32:44 -07:00
Baruch Siach
9a0643b633 lib/Makefile: create include directory before headers installation
Make sure that $(INCLUDEDIR) exists before copying the headers there.
Otherwise, the contest of header files is copied over
$(DESTDIR)$(INCLUDEDIR), making it a regular file.

While at it, remove $(DESTDIR)$(INCLUDEDIR) from the list of directories
to create in the install-pc target. The install-pc target does not need
this directory.
2018-05-08 20:59:44 +03:00
Yann Collet
ad4524d605 fix ZSTD_compressBlock() associated with CDict
reported by @let-def.

It's actually a bug in ZSTD_compressBegin_usingCDict()
which would pass a wrong pledgedSrcSize value (0 instead of ZSTD_CONTENTSIZE_UNKNOWN)
resulting in wrong window size, resulting in downsized seqStore,
resulting in segfault when writing into the seqStore later in the process.

Added a test in fuzzer to cover this use case (fails before the patch).
2018-05-07 12:54:13 -07:00
Peter Seiderer
64bfdca5b9 Split library install target into pc, static, shared and include only target
Signed-off-by: Peter Seiderer <ps.report@gmx.net>
2018-04-30 20:32:32 +02:00
Nick Terrell
ca77822ddf Fix parameter adjustment with dictionary
The new advanced API basically set `requestedParams = appliedParams` when
using a dictionary. This halted all parameter adjustment, which can hurt
compression ratio if, for example, the window log is small for the first
call, but the rest of the files are large.

This patch fixes the bug, and checks that the `requestedParams` don't change
in the new advanced API when using a dictionary, and generally in the fuzzer.
2018-04-25 16:32:29 -07:00
Yann Collet
12f60b8c98 clarified documentation related to refPrefix() 2018-04-25 10:17:06 -07:00