First results of my #compression algorithm benchmark run on a 72MB CSV file. It seems #zstd really has something for everybody, though it can't reach #xz's insane (but slow) compression ratios at maximum settings.
This chart includes multithreaded runs for #zstd.
Very interesting! 🧐
gitlab.com/nobodyinperson/comp…
#Python #matplotlib #Jupyter #JupyterLab
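For the curious: the gist of such a benchmark fits in a few lines of #Python. This is only a rough sketch that shells out to the stock CLI tools (the file name and the tool/level selection here are made up for illustration; it's not the actual script from the repo linked above):

```python
# Minimal benchmark sketch (hypothetical, not the actual repo script):
# time each compressor CLI on a file and record the compression ratio.
import shutil
import subprocess
import time
from pathlib import Path

INFILE = Path("measurements.csv")  # hypothetical ~72MB input file

# command -> suffix of the compressed output file; -T0 lets zstd use all cores
COMPRESSORS = {
    "gzip -9": ".gz",
    "zstd -1": ".zst",
    "zstd -19 -T0": ".zst",
    "xz -9": ".xz",
    "lz4 -1": ".lz4",
}

for cmd, suffix in COMPRESSORS.items():
    if shutil.which(cmd.split()[0]) is None:
        continue  # skip tools that aren't installed
    outfile = Path(str(INFILE) + suffix)
    start = time.perf_counter()
    # -k keeps the input file, -f overwrites a previous output
    subprocess.run([*cmd.split(), "-k", "-f", str(INFILE)], check=True)
    elapsed = time.perf_counter() - start
    ratio = INFILE.stat().st_size / outfile.stat().st_size
    print(f"{cmd:15s} {elapsed:8.2f}s  ratio {ratio:5.2f}")
```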
First, let's look only at the non-fancy options (no --fast or multithreading) and make log-log plots to better see what's happening in the 'clumps' of points (a rough plotting sketch follows the list). Points of interest for me:
- #gzip has a *really* low memory footprint across all compression levels
- #zstd clearly wins the decompression speed vs. compression ratio trade-off!
- #xz at higher levels is unrivalled in compression ratio
- #lz4's higher levels aren't worth it. #lz4 is also just plain fast.
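The plotting itself is plain #matplotlib. A minimal log-log scatter sketch with made-up numbers (not my measured values) looks roughly like this:

```python
# Rough log-log plot sketch (hypothetical numbers, not the benchmark results):
# compression ratio vs. decompression speed, one point per (tool, level).
import matplotlib.pyplot as plt

# hypothetical (decompression MB/s, compression ratio) points per tool
data = {
    "gzip": [(60, 3.5), (55, 4.2)],
    "zstd": [(800, 3.4), (750, 4.5)],
    "xz":   [(90, 5.0), (80, 6.5)],
    "lz4":  [(1500, 2.1), (1400, 2.4)],
}

fig, ax = plt.subplots()
for tool, points in data.items():
    speeds, ratios = zip(*points)
    ax.scatter(speeds, ratios, label=tool)
ax.set_xscale("log")  # log-log axes spread out the 'clumps' of points
ax.set_yscale("log")
ax.set_xlabel("decompression speed [MB/s]")
ax.set_ylabel("compression ratio")
ax.legend()
plt.show()
```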
Repeated the #compression #benchmark with the same file on a beefier machine (AMD Ryzen 9 5950X). The results are nearly identical, just faster overall.
This plot is also interesting:
- #gzip and #lz4 have fixed (!) and very low RAM usage across all levels, for both compression and decompression
- #xz RAM usage scales with the level, from a couple of MB at level 0 to nearly a GB at level 9
- #zstd RAM usage scales weirdly with the level, but not as extremely as #xz's
#Python #matplotlib
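In case anyone wants to reproduce the RAM numbers: one way is to run each tool under GNU time and parse its peak-RSS line. A rough sketch, assuming /usr/bin/time is GNU time (BSD/macOS time has no -v flag; file name is illustrative):

```python
# Measure a compressor's peak RAM usage via GNU time's verbose output.
import re
import subprocess

def max_rss_kb(cmd: list[str]) -> int:
    """Run cmd under GNU time -v and parse the peak resident set size."""
    result = subprocess.run(
        ["/usr/bin/time", "-v", *cmd],
        capture_output=True, text=True, check=True,
    )
    # GNU time prints e.g. "Maximum resident set size (kbytes): 94568"
    match = re.search(r"Maximum resident set size \(kbytes\): (\d+)",
                      result.stderr)
    return int(match.group(1))

for level in (0, 6, 9):
    kb = max_rss_kb(["xz", f"-{level}", "-k", "-f", "measurements.csv"])
    print(f"xz -{level}: peak RSS {kb / 1024:.1f} MB")
```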
My conclusion after all this is that I'll probably use #zstd level 1 (not the default level 3!) for #compression of my #CSV measurement data from now on:
- ultra-fast compression and decompression, on par with #lz4
- nearly as good a compression ratio as #gzip level 9
- negligible RAM usage
When I need ultra-small files though, e.g. for transfer over a slow connection, I'll keep using #xz level 9.
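For #Python folks, here's how that choice could look in code, assuming the third-party zstandard package (file names are illustrative):

```python
# Round-trip a CSV file at zstd level 1 using the 'zstandard' package.
import zstandard

# compress at level 1 (my pick above, instead of the default level 3)
cctx = zstandard.ZstdCompressor(level=1)
with open("measurements.csv", "rb") as src, \
     open("measurements.csv.zst", "wb") as dst:
    cctx.copy_stream(src, dst)

# decompress it again
dctx = zstandard.ZstdDecompressor()
with open("measurements.csv.zst", "rb") as src, \
     open("roundtrip.csv", "wb") as dst:
    dctx.copy_stream(src, dst)
```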