I’ve known about the LZMA compression algorithm for a little while, but I haven’t really played with it. So, to give it a quick try, I thought I would sic it on all the text files in my /etc directory. I’m using GNU tar to archive the files, and the maximum compression level of each algorithm to get the tightest squeeze on that archive:
$ sudo tar -cf etc.tar /etc
[sudo] password for aaron:
tar: Removing leading `/' from member names
$ time gzip -c9 etc.tar > etc.tar.gz
gzip -c9 etc.tar > etc.tar.gz  8.01s user 0.04s system 100% cpu 8.048 total
$ time bzip2 -c9 etc.tar > etc.tar.bz2
bzip2 -c9 etc.tar > etc.tar.bz2  8.12s user 0.04s system 99% cpu 8.170 total
$ time lzma -c9 etc.tar > etc.tar.lzma
lzma -c9 etc.tar > etc.tar.lzma  36.67s user 0.38s system 99% cpu 37.055 total
$ ls -lh etc.tar*
-rw-r--r-- 1 aaron aaron  37M 2008-12-14 13:52 etc.tar
-rw-r--r-- 1 aaron aaron 2.8M 2008-12-14 13:49 etc.tar.bz2
-rw-r--r-- 1 aaron aaron 4.0M 2008-12-14 13:47 etc.tar.gz
-rw-r--r-- 1 aaron aaron 1.5M 2008-12-14 13:50 etc.tar.lzma
As you can clearly see, when cranking up the compression on the TAR file, BZIP2 takes about as long as GZIP, while LZMA takes nearly 5 times as long to complete. The space saved for that time is significant, though: 1.5 MB versus 4 MB from GZIP. I’m not 100% convinced, however. Let’s sic it on some binary data. I have another TAR file, but this time with JPEGs and AVIs from my camera. Let’s see the results here (emphasis mine):
$ cd /media/NIKON/DCIM/103NIKON/
$ tar -cf ~/pics.tar *
$ cd
$ time gzip -c9 pics.tar > pics.tar.gz
gzip -c9 pics.tar > pics.tar.gz  7.18s user 0.22s system 85% cpu 8.690 total
$ time bzip2 -c9 pics.tar > pics.tar.bz2
bzip2 -c9 pics.tar > pics.tar.bz2  25.44s user 0.31s system 99% cpu 25.841 total
$ time lzma -c9 pics.tar > pics.tar.lzma
lzma -c9 pics.tar > pics.tar.lzma  68.49s user 0.82s system 99% cpu 1:09.46 total
$ ls -lh pics.tar*
-rw-r--r-- 1 aaron aaron 111M 2008-12-14 14:09 pics.tar
-rw-r--r-- 1 aaron aaron 108M 2008-12-14 14:05 pics.tar.bz2
-rw-r--r-- 1 aaron aaron 110M 2008-12-14 14:04 pics.tar.gz
-rw-r--r-- 1 aaron aaron 109M 2008-12-14 14:07 pics.tar.lzma
Yeah… LZMA isn’t giving me a lot here. In fact, I find it interesting that BZIP2 won in terms of the smallest size. Now, granted, I know that JPEG and AVI files are already compressed, so I’m not looking to gain a lot here. As already mentioned, this is mostly a quest of curiosity. Again, notice the times: over a minute to complete with LZMA, where GZIP only took 8 seconds. However, let’s see what this would do on a file of nothing but binary zeros. Pulling from /dev/zero, I can create a file of any arbitrary size. So, let’s create a 512 MB file, and sic the compression algorithms on it:
$ dd if=/dev/zero of=file.zero bs=512M count=1
1+0 records in
1+0 records out
536870912 bytes (537 MB) copied, 12.4654 s, 43.1 MB/s
$ time gzip -c9 file.zero > file.zero.gz
gzip -c9 file.zero > file.zero.gz  4.86s user 0.18s system 99% cpu 5.052 total
$ time bzip2 -c9 file.zero > file.zero.bz2
bzip2 -c9 file.zero > file.zero.bz2  11.35s user 0.24s system 100% cpu 11.586 total
$ time lzma -c9 file.zero > file.zero.lzma
lzma -c9 file.zero > file.zero.lzma  189.81s user 0.92s system 99% cpu 3:10.73 total
$ ls -lh file.zero*
-rw-r--r-- 1 aaron aaron 512M 2008-12-14 14:14 file.zero
-rw-r--r-- 1 aaron aaron  402 2008-12-14 14:23 file.zero.bz2
-rw-r--r-- 1 aaron aaron 509K 2008-12-14 14:23 file.zero.gz
-rw-r--r-- 1 aaron aaron  75K 2008-12-14 14:27 file.zero.lzma
Heh. All I can say is heh. BZIP2 again took the top prize for being the most compressed, getting 512 MB into a mere 402 bytes. And it only took about 6 extra seconds compared to GZIP. LZMA, while compressing fairly well, did miserably in the reported time. Three minutes?! On binary zeros?! What was it doing? Watching some YouTube while doing the compression?
All in all, I’m not impressed with LZMA. It’s a horrible performer, and only gives marginal results. It seems to do well on ASCII text, but fails miserably on binary files, where BZIP2 takes the clear win in compression. While it may pull out some impressive compression, the time it takes to perform isn’t worth it. BZIP2 is a much more capable algorithm, and although it’s a horrible performer too, it’s not nearly as bad as LZMA. I would make it worth my while to use BZIP2 whenever possible, reaching for GZIP only when time is the primary factor.
I would be interested in some other benchmarks on different data, if anyone has access to those. I think these results give us a good idea about LZMA, though: STEER CLEAR.
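For anyone who wants to repeat the comparison on their own data, a small helper along these lines might do. It is only a sketch under a few assumptions: the function name `bench` and the `.out` suffix are made up for illustration, timing is in whole seconds via `date +%s` (coarser than the shell's `time`), and `gzip`, `bzip2`, and `lzma` are assumed to accept the same `-c` and level flags used in the runs above. Tools that are not installed are simply skipped.

```shell
# bench FILE [LEVEL]
# Compress FILE with each installed tool at LEVEL (default 9) and print
# the tool name, elapsed whole seconds, and compressed size in bytes.
# Hypothetical helper: the name "bench" and the ".out" suffix are
# illustrative only.
bench() {
    file=$1
    level=${2:-9}
    for tool in gzip bzip2 lzma; do
        # Skip any compressor that is not installed.
        command -v "$tool" >/dev/null 2>&1 || continue
        out="$file.$tool.out"
        start=$(date +%s)
        "$tool" "-c$level" "$file" > "$out"
        end=$(date +%s)
        size=$(wc -c < "$out" | tr -d ' ')
        printf '%s\t%ss\t%s bytes\n' "$tool" "$((end - start))" "$size"
    done
}
```

Calling `bench etc.tar 9` would then print one line per tool. The original file is left in place, and the compressed copies stay next to it for inspection.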

{ 45 } Comments
What about DEcompression times? bzip for example, is slower than gzip compressing, but much better going the other way. I wonder how LZMA fares.
I wonder how fast LZMA is when de-compressing. There must be an advantage somewhere.
Heh
Here is a good one to mess with your head. With GNU tar >= 1.20, you can do something like tar --lzma -cf files.tar.lzma directory/. Also get ahold of 7zip and take a look at the various compression algorithms it supports.
For text files, the ppm (http://en.wikipedia.org/wiki/Prediction_by_Partial_Matching) compresses better than anything out there. We’ve seen multi-gigabyte firewall logs compress down to a hundred megabytes or so with it before. Download the 7zip package (sudo apt-get install p7zip-full) and put something like this in your production backup scripts as you might not want to kill I/O on your boxes:
ionice -c3 nice -n +20 /usr/bin/7za a -mx9 -m0=ppmd "${file}.7z" "$file"
Also note that lzma is optimized for fast decompression, not fast compression. Time the difference between the two and you’ll see.
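Timing the two directions separately could look something like the sketch below. The function name `roundtrip` and the `.tmp` suffix are hypothetical, the timing is again in whole seconds via `date +%s`, and the only assumption about the tool is that it understands `-c` (write to stdout) and `-d` (decompress), which `gzip`, `bzip2`, and `lzma` all do.

```shell
# roundtrip TOOL FILE
# Time compression and decompression of FILE separately with TOOL.
# Hypothetical helper: the name and the ".tmp" suffix are illustrative.
roundtrip() {
    tool=$1
    file=$2
    packed="$file.$tool.tmp"
    t0=$(date +%s)
    "$tool" -c "$file" > "$packed"
    t1=$(date +%s)
    # Decompress to /dev/null so disk writes do not skew the result.
    "$tool" -dc "$packed" > /dev/null
    t2=$(date +%s)
    printf '%s: compress %ss, decompress %ss\n' \
        "$tool" "$((t1 - t0))" "$((t2 - t1))"
    rm -f "$packed"
}
```

Running `roundtrip lzma etc.tar` next to `roundtrip bzip2 etc.tar` should make any asymmetry between the two directions visible, provided the input is large enough for whole seconds to matter.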
Forgot to mention one last thing, instead of using bzip2, use pbzip2, it is a multithreaded parallel bzip implementation that speeds up bzip considerably. Watch it or nice + ionice it though as it will chew through all available bandwidth if you let it.
Hello!
First of all, the selected tests are not that good, as zeros and multimedia files aren’t files one usually compresses. You could have compressed e.g. /usr/bin.
Then you should have stuck to the default compression level for a fair time comparison, which is -7 for lzma. From the manpage:
Have a nice day,
oleid
Who cares about compression time when you are using the “-9” flag? It’s a bit of an oxymoronic comparison here. I’d prefer to see the compression time and size when using the “quick” and “default” compression modes.
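A comparison across the quick, default, and maximum settings could be sketched as follows. The function name `levels` is hypothetical, and the sketch assumes the tool accepts `-1` and `-9` and falls back to its own default when no level is given. Which level counts as "default" differs per tool, so the script does not hard-code one.

```shell
# levels TOOL FILE
# Print elapsed whole seconds and compressed size for TOOL at -1,
# at the tool's own default level, and at -9.
# Hypothetical helper: the name "levels" is illustrative only.
levels() {
    tool=$1
    file=$2
    for level in -1 "" -9; do
        start=$(date +%s)
        # $level is deliberately unquoted: in the empty "default" case
        # it expands to nothing, so no level flag is passed at all.
        size=$("$tool" -c $level "$file" | wc -c | tr -d ' ')
        end=$(date +%s)
        printf '%s %s\t%ss\t%s bytes\n' \
            "$tool" "${level:-default}" "$((end - start))" "$size"
    done
}
```

For example, `levels lzma etc.tar` would print three lines, one per setting, without leaving any compressed files behind.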
Your “steer clear” verdict is useless for everyone, including yourself.
I have to agree with Pete.
These results were also surprising, as everything thrown by me at lzma gets smaller than gz/bzip2.
One thing that never left my mind was when some guy at Ubuntu suggested to change the packaging to use lzma instead of the usual suspects. This was said a couple of years ago.
I’ve found the exact opposite results in some data sets, but I may need to redo them if lzma wasn’t always used in 7z. As always, understanding how the compression works is key to understanding which is appropriate for the situation. lzma allows for HUGE dictionaries working across entire directories. This means if you have lots of similar files grouped together, it will take advantage of that.
ROMsets for example, often carry multiple versions of the same game with small localization patches. Large chunks of the binaries are identical, so lzma can effectively load the dictionary with 99 percent of the binary and describe the changes very efficiently. It does require that you organize your data in a way that exploits the similarities, or else the dictionary can’t exploit the frequency of massive common strings.
Paul Sladen has a nice in depth discussion of the things we’ve touched on as well.
I love it when people comment without using the gray matter between their ears. Pete, check out these default times (using -6):
Pretty consistent with differences in time with -9. Also, the differences in file size. Now, on my pictures, again, using the default -6:
What’s this? Consistent, both in file size, and duration. Ok. I’ll give you one more shot to redeem yourself. Let’s run against the binary file of zeros:
The only thing to take note of here is that the extra time these algorithms spend squeezing out every last little bit isn’t seen much with GZIP or BZIP2, but with LZMA, it’s taking FOREVER. Which should give you yet another reason to avoid LZMA. It just doesn’t fare well.
Next time, before placing a comment that doesn’t make you sound very intelligent, I’d recommend reading the docs, and understanding how things work.
LZMA fares well with compression sizes. That’s not disputed here. What is disputed, is the time it takes to get to that point.
Interesting post. I’ll comment there in a second, but I have a question for you: why is BZIP2 considerably smaller than LZMA on a binary file of zeros? Surely, according to your argument, it would compress that thing down to practically zip. There must be some data overhead in LZMA that doesn’t exist in BZIP2.
Decompression is coming up in the next post. I’ll send the exact files I just compressed through decompression, and see how each fares.
One thing you didn’t look at is resilience to corruption. Go ahead and corrupt one random byte of the compressed files and see if the decompressors even detect corruption, and if they do then how much of your data you can recover.
Common corruption is bytes changing to all 0 or all 0xff or the insertion of \r before \n.
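A single-byte corruption check along those lines might be sketched like this. The function name `corrupt_test` and the `.corrupt` suffix are hypothetical, and the sketch only covers detection (via the tool's `-t` test mode, which `gzip`, `bzip2`, and `lzma` provide), not how much data can be recovered afterwards. It changes one byte in the middle of the compressed stream to a different value rather than to a fixed 0 or 0xff, so the file is guaranteed to differ.

```shell
# corrupt_test TOOL FILE
# Compress FILE, change one byte in the middle of the compressed copy,
# then ask TOOL to verify the damaged stream with -t.
# Hypothetical helper: the name and the ".corrupt" suffix are illustrative.
corrupt_test() {
    tool=$1
    file=$2
    packed="$file.$tool.corrupt"
    "$tool" -c "$file" > "$packed"
    size=$(wc -c < "$packed" | tr -d ' ')
    offset=$((size / 2))
    # Read the byte at the offset as a decimal value ...
    old=$(dd if="$packed" bs=1 skip="$offset" count=1 2>/dev/null |
        od -An -tu1 | tr -d ' \n')
    # ... and write back a different value (conv=notrunc keeps the rest).
    new=$(( (old + 1) % 256 ))
    printf "\\$(printf '%03o' "$new")" |
        dd of="$packed" bs=1 seek="$offset" conv=notrunc 2>/dev/null
    # Feed the damaged stream on stdin so file suffixes do not matter.
    if "$tool" -t < "$packed" 2>/dev/null; then
        echo "$tool: corruption NOT detected"
    else
        echo "$tool: corruption detected"
    fi
    rm -f "$packed"
}
```

Running it once per tool, for example `corrupt_test gzip etc.tar`, gives a quick yes/no answer. As far as I know the legacy .lzma container carries no checksum of its own, so it would not be surprising if it missed damage that gzip or bzip2 report.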
Aaron, nice one but I think that LZMA/7z is optimised for fast decompression. So this should really go into your little test.
Another thing to look at is corruption of the archives, what use is a highly compressed archive if you can’t decompress it…
On your second test I don’t see that difference as much a significant advantage as you did. I am also surprised that the compressed file is actually smaller than the uncompressed one for those kinds of files.
Your last test is a scenario that would never happen and didn’t produce very useful results. All three formats reduced the file size by a factor of roughly 1,000 or more.
A more realistic scenario would be to try with a bunch of compiled executable binaries (so that they don’t contain only zeros). Also you should try the 7zip compressor.
7zip is a container, that contains multiple compression algorithms. It’s not comparing apples to apples.
Also, the test on /dev/zero was to see sheer speed on a file with completely identical data. Sure, it’s not “real world”, but I’m not after that. I’m after speed, and what would be faster than parsing exactly the same data? LZMA failed to impress.
Yes- decompression is coming in a separate post.
I haven’t looked at corruption. That would be interesting to see. I’ll see what I can come up with, and if it’s worthy to put in a separate post.
I cannot profess arcane knowledge of the lzma implementation, but my guess would be that the dictionary itself has some minimum size. Plus, bzip is designed for random binary files, so it has a lot of considerations that might not always make sense. One in particular seems to be RLE, which /dev/zero has in spades.
I would caution you against using edge cases like /dev/zero; do you regularly archive /dev/zero and friends?
Why a separate post, when you’ve already published the conclusion?
It would be interesting to know how well they perform with binary files (taking some stuff from /bin or /usr/bin, for instance)
Images and video files are not typical binary files for compression tests. Video files and JPEG images are incompressible because they are already compressed, and the only hope is to try to collate some headers or something.
To get a real and compressible set of binary files look no further than /usr/bin or /usr/lib
I am much more excited about the lzo compressor – it has the priority on speed.
When reading the lzma manpage one discovers the note, that -1 is faster and creates smaller files than bzip2.
A quick test compressing /usr/bin
(sorry, I don’t know how to insert tables here)
So you clearly see that if only size matters you should take lzma. If time matters you should take gzip. If time AND size matter, you should select lzma with the “-1” option.
LZMA is said to decompress very fast. It takes 17s to decompress the lzma archive (no matter what compression used), bzip2 takes 36s. Only gzip decompresses faster with 6s. If distributors could adopt lzma compression for packages one would have smaller download sizes and faster installation (compared to bzip2 compressed archives) as decompressing is faster and reading from dvd drives would be faster.
I rarely compress already-compressed or totally empty files. Do you?
In real life, lzma outperforms gzip/bzip2 in the vast majority of cases, be it text files, databases, executables (have you tried compressing ELF or PE executables?) and so on.
lzma is slower ? Of course it’s slower ! Better compression requires more processing power.
There is even better than lzma: paq8hp5 outperforms lzma, but with even longer compression times (See http://prize.hutter1.net/ )
That’s no secret: it’s a simple space/time trade-off (less space, more CPU time).
This is exactly like the transition from MPEG2 to MPEG4 (Xvid/Divx) to H.264. Better compression requiring more CPU power.
lzma is adequate for today’s computers and offers reasonable compression times with excellent compression ratios.
Yeah, Firefox should have “steered clear”, shouldn’t they? It was only thanks to lzma that they were able to take the FF2 installer below the magic 5 MB barrier, saving something like 2 MB if I remember correctly. A saving like that will surely have had a huge influence on its uptake. Frankly, “steer clear” is an absurd pronouncement to make based on such limited and unrealistic criteria.
Yes- archiving /dev/zero and friends isn’t all that practical. I wasn’t going after that. I wanted to see speed, and I figured what could be faster than a bunch of identical data? I was way wrong.
I’m willing to retract my conclusion, if I can find where LZMA shines.
Thanks for the insightful input. It’s comments like these that keep me blogging.
Tom- re-read the post. If time isn’t a factor, then great! It certainly gives us great compression ratios. However, if time is a factor, BZIP2 performs fairly well, GZIP better, and they give good, not great, but good ratios.
I’ll fix the comment for you, putting your data in a table.
I’m preparing another post, where we look at different types of compression, yet again, with LZMA. I’ll be throwing -1 after it this time, as I’ll be dealing with massively large amounts of binaries, and I just don’t have all day for LZMA to do the compression. We’ll see what the results are.
Keep an eye on a future post.
Yes, I’m aware of the fact that JPEGs and AVIs are already compressed, so I’m not expecting to see much in terms of compression. I was hoping to see something in the way of speed, however.
I’ll be outlining in a separate post more compression with LZMA. We’ll see more results then, and whether LZMA is truly the new hotness.
Yes. I am aware of the speed increases with -1. I was completely and totally after maximum compression, as my post outlines. I wanted to compare it to the others.
@Aaron: Actually, you are incorrect. 7zip is the reference implementation of LZMA. The fact that you aren’t using it says a lot about your testing methodology.
To quote your words verbatim, “Next time, before placing a comment that doesn’t make you sound very intelligent, I’d recommend reading the docs, and understanding how things work.”.
I’m wondering what lzma implementation you used.
From the Wikipedia link (and the Ubuntu packaged version), I suppose you used the 7-zip/lzma utils implementation.
But, following freshmeat news, I see a lot of updates on a utility called lzip that works kinda like gzip, but with an LZMA algorithm: http://www.nongnu.org/lzip/lzip.html
Maybe having a look at that implementation will give different results. (I should find time to do that)
I have used 7zip before. It’s a container. Straight from Wikipedia (emphasis mine):
Try again.
I used the vanilla LZMA package from the Ubuntu archives. It’s installed by default, as DPKG can now take advantage of it, as can RPM, GNU TAR, and others.
The file.zero case for LZMA is a bug in the encoder. It is only using 257 bytes per loop per backward reference, when it could easily use a single 512 MB repetition. This naturally increases the compression time a lot as well.
That makes a lot of sense, actually. Thank you. Also, I think you’re the first NetBSD commenter on my blog.
If it really is a bug, it is an extremely serious bug. A bug so serious that it should be reason not to use LZMA at all.
I compress lots of files that have big sections of nothing but zeros.
I agree that it is stupid. After reading more of the code, I might have been wrong on the format part, but the number of long repetitions is high on some of the data I care about. I haven’t had time to investigate and measure the required changes, though.
I should correct myself, bzip is not designed for “random” binary files. Rather, it gets used for that purpose because its crazy affine transformations and compression stack seem to help out, like RLE.
Interesting. There is a positive feel to it.
{ 2 } Trackbacks
[...] couple days ago, I covered the LZMA compression algorithm as it related to compression. Well, as pointed out in the comments, we need to see the other side [...]
[...] the inaugurial post in Aaron Toponce’s series on compression, two critical errors are made and highlighted by a [...]