Image of the glider from the Game of Life by John Conway
Skip to content

LZMA Part II- Decompression

A couple days ago, I covered the LZMA compression algorithm as it related to compression. Well, as pointed out in the comments, we need to see the other side of compressed data, and that is decompressing the data. I've kept all my data files, so, let's decompress them with GZIP, BZIP2 and LZMA, and see how everything fares out. Also, Justin Dugger pointed out that LZMA's strength, lies not in compressing and decompressing random data, but similar data between multiple files. So, that will be addressed in another post.

So, let's get started decompressing. Remember, from my previous post, I took a look at ASCII data, pictures and video, then finally similar data, using /dev/zero. Let's decompress the these files, timing them. The point of this post, and last, is not how well it compresses and decompresses, but how long it takes to perform the operation. Looking first at the ASCII data (emphasis mine):

$ time gunzip -c etc.tar.gz > etc.tar
gunzip -c etc.tar.gz > etc.tar  0.24s user 0.11s system 27% cpu 1.261 total
$ time bunzip2 -c etc.tar.bz2 > etc.tar
bunzip2 -c etc.tar.bz2 > etc.tar  0.96s user 0.12s system 53% cpu 2.022 total
$ time unlzma -c etc.tar.lzma > etc.tar
unlzma -c etc.tar.lzma > etc.tar  0.36s user 0.13s system 30% cpu 1.622 total

Impressive, on ASCII text, LZMA was quite unlike the results when compressing. Let's see how it does on random binary data:

$ time gunzip -c pics.tar.gz > pics.tar
gunzip -c pics.tar.gz > pics.tar  1.36s user 0.29s system 34% cpu 4.822 total
$ time bunzip2 -c pics.tar.bz2 > pics.tar
bunzip2 -c pics.tar.bz2 > pics.tar  11.83s user 0.42s system 79% cpu 15.430 total
$ time unlzma -c pics.tar.lzma > pics.tar
unlzma -c pics.tar.lzma > pics.tar  10.71s user 0.34s system 78% cpu 14.049 total

Still outperforming BZIP2, even if it wasn't as impressive as ASCII data. Now, the final run on our binary file of zeros:

$ time gunzip -c >
gunzip -c >  2.77s user 1.15s system 29% cpu 13.368 total
$ time bunzip2 -c >
bunzip2 -c >  2.36s user 1.66s system 25% cpu 15.542 total
$ time unlzma -c >
unlzma -c >  1.74s user 1.18s system 23% cpu 12.236 total

Wow. Absolutely impressive. I have to say, that I was not expecting these times going into it. LZMA absolutely sucks in relation to time when using maximum, but beats BZIP2 in every case with decompressing. Notice how much faster it was on the binary file of zeros. With max compression, it took nearly 3 minutes to complete, but decompressing, it took only 12 seconds. Overall, you're still better off with GZIP or BZIP2, but LZMA is holding its own with decompression.

Verdict on LZMA now? Still not my first choice, but maybe I was a bit harsh on it with the earlier post. Our benchmarks aren't over yet, though. Let's see how it compares with multiple files that have similar data between them. Supposedly, this is where LZMA shines.

{ 16 } Comments

  1. Paul | December 16, 2008 at 6:53 pm | Permalink

    Sounds like lzma would be a win for things that are compressed once and then distributed, mirrored and/or decompressed a lot (such as deb packages and perhaps more so for source packages given it's bigger advantage in text compression).

    Is there some explanation somewhere of LZMA's "multiple file" advantage?
    At face value it seems to me that in most cases on Linux the compression would be done on a single tar file. Is it that LZMA's bigger dictionary allows more patterns to be discovered throughout the tar-aggregated files?

  2. Aaron | December 16, 2008 at 9:16 pm | Permalink

    I don't know. The only advantage I can see is keeping a single dictionary on multiple files, where a great deal of the files are similar- something like SNES ROMs. I'll investigate this, and report in another post.

  3. sebsauvage | December 17, 2008 at 1:03 am | Permalink

    tar takes files as they come.

    7z format groups files by extension, yielding much better optimized dictionnaries, thus better compression.

    (I'm not talking about .tar.lzma, which is less optimized than .7z. Besides, .7z format support encryption and a few other features.)

  4. Paul | December 17, 2008 at 1:20 am | Permalink

    I see, so that advantage is the result of 7z rather than the LZMA algorithm itself?

    Presumably it would be possible to create tar compatible archives with files stored in a more intelligent order.

  5. Aaron | December 17, 2008 at 6:22 am | Permalink

    From what I understand of 7z, besides being a container that supports more than just LZMA, when using LZMA, it takes the --fast switch, whereas .tar.lzma is using -6. Tarballing the archive first, then runing 'lzma --fast foo.tar' should provide the same experience that .7z provides.

  6. Michael "Agree" H | December 16, 2008 at 7:07 pm | Permalink

    Paul: Agree on that. Slax uses LZMA for it's modules, which makes sense based on these benchmarks.

  7. seb | December 16, 2008 at 7:20 pm | Permalink

    thanks for the post!,
    will you address the archive corruption as well?

    another idea is now to rank the algorithms by time AND compression ratio for both compression and decompression.

  8. Aaron | December 16, 2008 at 9:14 pm | Permalink

    I don't know enough about archive corruption, but I'll look into it. That topic was also mentioned in comments on the previous post.

  9. Meneer R | December 16, 2008 at 7:28 pm | Permalink

    Testing compression algorithms with random data is not just silly; it is downright offensive.

    The theory of information tells us, that if algorithm X makes a specific subset of all possible data D smaller, than there must also be a subset which would increase in size.

    To put it a little more obvious: the core foundation of a good compression algorithm, is exactly the same as a good learning algorithm. It needs to have a BIAS.

    The bias is the sort of data it will make smaller, at the expense of data that does not fit that bias. Data that would grow because it contradicts the bias.

    So, why is random data such a BAD choice? Because, the perfect algorithm would make all my files really small. And the only sort of data I do have no use for would be the RANDOM data.

    Why do I say that? Because I can't do anything with that data. I can't in any way interpret it. If there was something valuable to be interpreted, it wouldn't be random.

    Worse: we have no real random data to use. All our so called random data is either generated with the intent of being random or sampled out of something we consider to be random.

    Your OS uses an algorithm to create random data. The perfect compression of that particular data is the actual 4 lines of C code that generate the data and the initial seed. We call it random not because it is random, but because we would need an extremely large sample size to find the correlation that would lead to the original mathematical function.

    Take static on a TV. I can have you watch a picture of static for an hour. The same exact still image. Then I would take that image away and show you three other images. You wouldn't be able to pick the one you stared at for an hour. Our brains too can't compress that data.

    So what we experience as random data, is data with a structure that is computationally complex. That is: it is nearly impossible to correlate the data.

    The perfect compression algorithm will GROW random data, because it random data should contradict the fine-tuned bias for usefull stuff.

    Yet, somehow, when somebody measures a compression algorithm using random data, they think it's a good thing it can compress that shit.

    To summerize:
    - 'random data' is subjective to what our brains or machienes find hard to data-mine.
    - 'compression' is about having a bias for one set of data, at the expensive of another set of data
    - the perfect compression algorithm will grow random data

  10. Aaron | December 16, 2008 at 9:12 pm | Permalink

    You're taking my use of the word "random" a bit too literally. However, you are correct in your dissertation on compressing random data. My use of the word is merely some random files of my hard drive- not that the bits are literally randomized.

  11. dudus | December 16, 2008 at 9:00 pm | Permalink

    I'm not a compression expert, but I know that 7zip is pretty effective when compressing many similar files. You should add it to your benchmarks in the next round.

    Also instead of compressing random data try to compress something more usefull like your /usr folder

  12. Aaron | December 16, 2008 at 9:13 pm | Permalink

    You, as well as many others on the previous post, are missing the point. It's not about how "useful" or "practical" this exercise is. It's merely a test for speed. Compressing and decompressing already compressed data shouldn't take three years and a day to complete. Same with compressing and decompressing zeros. We're after sheer speed here. That is the entire point of the exercise.

  13. Justin Dugger | December 16, 2008 at 10:23 pm | Permalink

    "Overall, you’re still better off with GZIP or BZIP2, but LZMA is holding its own with decompression."

    That depends on what you're using the archive for. For a Deb package, it might make sense to use LZMA compression and save bandwidth and decompression time (at the cost of RAM). For a compressed backup, where one hopefully never needs to decompress, maybe it's not a good idea. Clearly, there's a balance in here that a single variable cannot address.

    Which is why I don't think you can make blanket recommendations like "never use lzma", but rather simply publish results and speculate on how suitable is it for specific purposes.

  14. Aaron | December 17, 2008 at 6:20 am | Permalink

    Yeah. If you can't already tell, I'm coming to that conclusion. I still have a couple more posts to put to bed, then I think LZMA will have been analyzed in a decent manner.

  15. leidola | December 17, 2008 at 5:41 pm | Permalink

    >> “Overall, you’re still better off with GZIP or BZIP2, but LZMA is holding its own with decompression.”
    > I’m coming to that conclusion
    You came to that conclusion, but ignored in your previous post, that "lzma -1" compresses faster and yields smaller archives than bzip2. As I see it, you can either choose speed (gzip) or compression (lzma). The only reason bzip2 might be useful is for rescuing data from damaged archives, as its blocks blocks are independant and you loose at most 900k. I don't know anything about lzma, but I guess restoring broken files isn't that easy.

  16. Joerg Sonnenberger | December 19, 2008 at 6:10 am | Permalink

    (a) What are you using for decompression? E.g. lzma-4.62 is much faster than the lzma-utils-4.32.7 (have nothing else to compare against).

    (b) Where does the difference between user+sys and real time come from? It looks like you are more measuring efficency of the buffering than actual decompression timing. For more stable numbers it might help to always write the output to /dev/null.

Post a Comment

Your email is never published nor shared.