Comments on: LZMA Part II- Decompression
Linux. GNU. Freedom.
Sun, 13 May 2018 18:21:35 +0000

By: Joerg Sonnenberger Fri, 19 Dec 2008 13:10:23 +0000 (a) What are you using for decompression? E.g. lzma-4.62 is much faster than lzma-utils-4.32.7 (I have nothing else to compare against).

(b) Where does the difference between user+sys and real time come from? It looks like you are measuring the efficiency of the buffering more than the actual decompression time. For more stable numbers it might help to always write the output to /dev/null.
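A minimal sketch of that idea, assuming Python's standard `zlib`, `bz2` and `lzma` bindings as stand-ins for the command-line tools in the post: the decompressed output stays in memory and is thrown away, so disk buffering does not enter the measurement. The sample data and the helper name `time_decompress` are made up for illustration, and the absolute numbers are not comparable to the post's.

```python
import bz2
import lzma
import time
import zlib


def time_decompress(decompress, blob, repeats=3):
    """Return (best wall-clock seconds, decompressed size) over several runs."""
    best = float("inf")
    size = 0
    for _ in range(repeats):
        start = time.perf_counter()
        out = decompress(blob)  # output stays in memory and is discarded
        best = min(best, time.perf_counter() - start)
        size = len(out)
    return best, size


# Hypothetical sample input: mildly repetitive text, roughly 1 MB.
sample = b"".join(
    b"line %d: the quick brown fox jumps over the lazy dog\n" % i
    for i in range(20000)
)

codecs = {
    "deflate (zlib)": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "lzma (xz)": (lzma.compress, lzma.decompress),
}

results = {}
for name, (compress, decompress) in codecs.items():
    blob = compress(sample)
    seconds, size = time_decompress(decompress, blob)
    results[name] = (seconds, size)
    print(f"{name:15s} {len(blob):8d} bytes packed, {seconds:.4f} s to unpack")
```

Taking the best of several runs is one common way to damp noise from other processes; it is a choice made for this sketch, not something the post did.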

By: leidola Thu, 18 Dec 2008 00:41:24 +0000 >> “Overall, you’re still better off with GZIP or BZIP2, but LZMA is holding its own with decompression.”
> I’m coming to that conclusion
You came to that conclusion, but ignored in your previous post that "lzma -1" compresses faster and yields smaller archives than bzip2. As I see it, you can choose either speed (gzip) or compression (lzma). The only reason bzip2 might be useful is for rescuing data from damaged archives, as its blocks are independent and you lose at most 900k. I don't know anything about lzma, but I guess restoring broken files isn't that easy.

By: Aaron Wed, 17 Dec 2008 13:22:55 +0000 From what I understand of 7z, besides being a container that supports more than just LZMA, when using LZMA, it takes the --fast switch, whereas .tar.lzma is using -6. Tarballing the archive first, then running 'lzma --fast foo.tar' should provide the same experience that .7z provides.

By: Aaron Wed, 17 Dec 2008 13:20:28 +0000 Yeah. If you can't already tell, I'm coming to that conclusion. I still have a couple more posts to put to bed, then I think LZMA will have been analyzed in a decent manner.

By: Paul Wed, 17 Dec 2008 08:20:30 +0000 I see, so that advantage is the result of 7z rather than the LZMA algorithm itself?

Presumably it would be possible to create tar compatible archives with files stored in a more intelligent order.

By: sebsauvage Wed, 17 Dec 2008 08:03:58 +0000 tar takes files as they come.

The 7z format groups files by extension, yielding much better optimized dictionaries and thus better compression.

(I'm not talking about .tar.lzma, which is less optimized than .7z. Besides, the .7z format supports encryption and a few other features.)
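The effect of file order can be sketched with a toy example. This does not use 7z; it assumes Python's `zlib`, whose 32 KB window makes the effect easy to provoke, and the two "file types" are synthetic data invented for the illustration. Files of the same type share a 20 KB body, so a compressor can only reuse that body if the previous copy is still inside its window.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def blob(n):
    """Return n pseudo-random (practically incompressible) bytes."""
    return bytes(rng.getrandbits(8) for _ in range(n))


# Two hypothetical file types: same 20 KB body, different 16-byte header.
body_a, body_b = blob(20_000), blob(20_000)
files_a = [blob(16) + body_a for _ in range(3)]
files_b = [blob(16) + body_b for _ in range(3)]

# "Grouped by type" versus "as they come" (alternating types).
grouped = b"".join(files_a + files_b)
interleaved = b"".join(f for pair in zip(files_a, files_b) for f in pair)

grouped_size = len(zlib.compress(grouped))
interleaved_size = len(zlib.compress(interleaved))

print("grouped:    ", grouped_size)
print("interleaved:", interleaved_size)
```

In the grouped order the previous copy of a body is about 20 KB back, inside the window; in the interleaved order it is about 40 KB back, outside it, so nothing can be reused.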

By: Justin Dugger Wed, 17 Dec 2008 05:23:25 +0000 "Overall, you’re still better off with GZIP or BZIP2, but LZMA is holding its own with decompression."

That depends on what you're using the archive for. For a Deb package, it might make sense to use LZMA compression and save bandwidth and decompression time (at the cost of RAM). For a compressed backup, where one hopefully never needs to decompress, maybe it's not a good idea. Clearly, there's a balance in here that a single variable cannot address.

Which is why I don't think you can make blanket recommendations like "never use lzma", but rather simply publish results and speculate on how suitable it is for specific purposes.

By: Aaron Wed, 17 Dec 2008 04:16:50 +0000 I don't know. The only advantage I can see is keeping a single dictionary on multiple files, where a great deal of the files are similar- something like SNES ROMs. I'll investigate this, and report in another post.

By: Aaron Wed, 17 Dec 2008 04:14:36 +0000 I don't know enough about archive corruption, but I'll look into it. That topic was also mentioned in comments on the previous post.

By: Aaron Wed, 17 Dec 2008 04:13:38 +0000 You, as well as many others on the previous post, are missing the point. It's not about how "useful" or "practical" this exercise is. It's merely a test for speed. Compressing and decompressing already compressed data shouldn't take three years and a day to complete. Same with compressing and decompressing zeros. We're after sheer speed here. That is the entire point of the exercise.

By: Aaron Wed, 17 Dec 2008 04:12:06 +0000 You're taking my use of the word "random" a bit too literally. However, you are correct in your dissertation on compressing random data. By "random" I merely meant some random files off my hard drive, not that the bits are literally randomized.

By: dudus Wed, 17 Dec 2008 04:00:27 +0000 I'm not a compression expert, but I know that 7zip is pretty effective when compressing many similar files. You should add it to your benchmarks in the next round.

Also, instead of compressing random data, try to compress something more useful, like your /usr folder.

By: Meneer R Wed, 17 Dec 2008 02:28:30 +0000 Testing compression algorithms with random data is not just silly; it is downright offensive.

The theory of information tells us that if algorithm X makes a specific subset of all possible data D smaller, then there must also be a subset which would increase in size.

To put it a little more obviously: the core foundation of a good compression algorithm is exactly the same as that of a good learning algorithm. It needs to have a BIAS.

The bias is the sort of data it will make smaller, at the expense of data that does not fit that bias. Data that would grow because it contradicts the bias.
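The counting argument behind the claim at the start of this comment can be checked with a few lines of arithmetic. This is only a sketch, and the function names are made up: there are 2^n bit strings of exactly n bits, but only 2^n - 1 strings that are strictly shorter, so a lossless compressor cannot map every n-bit input to a shorter output.

```python
def count_strings(n_bits):
    """Number of distinct bit strings of exactly n_bits bits."""
    return 2 ** n_bits


def count_shorter(n_bits):
    """Number of distinct bit strings with fewer than n_bits bits (lengths 0..n_bits-1)."""
    return sum(2 ** k for k in range(n_bits))


for n in (1, 8, 16):
    inputs = count_strings(n)
    shorter = count_shorter(n)
    # There is always one input too many for the available shorter outputs.
    print(f"n={n:2d}: {inputs} inputs, {shorter} shorter outputs, shortfall {inputs - shorter}")
```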

So, why is random data such a BAD choice? Because, the perfect algorithm would make all my files really small. And the only sort of data I do have no use for would be the RANDOM data.

Why do I say that? Because I can't do anything with that data. I can't in any way interpret it. If there was something valuable to be interpreted, it wouldn't be random.

Worse: we have no real random data to use. All our so called random data is either generated with the intent of being random or sampled out of something we consider to be random.

Your OS uses an algorithm to create random data. The perfect compression of that particular data is the actual 4 lines of C code that generate the data and the initial seed. We call it random not because it is random, but because we would need an extremely large sample size to find the correlation that would lead to the original mathematical function.
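A small sketch of that point, assuming Python's `random` module (a Mersenne Twister) as a stand-in for whatever generator the OS uses, with a seed value and helper name invented for the example: the whole stream can be rebuilt from the seed and the generating function, so that pair is a complete description of the data no matter how long the stream is.

```python
import random


def pseudo_random_bytes(seed, n):
    """Regenerate the same n pseudo-random bytes from a seed."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


SEED = 2008   # hypothetical seed, chosen for illustration
N = 100_000   # length of the stream in bytes

stream = pseudo_random_bytes(SEED, N)
rebuilt = pseudo_random_bytes(SEED, N)

# The seed plus the function above reproduces all N bytes exactly.
print("identical:", stream == rebuilt)
print("stream length:", len(stream), "bytes; description: one integer plus a few lines of code")
```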

Take static on a TV. I can have you watch a picture of static for an hour. The same exact still image. Then I would take that image away and show you three other images. You wouldn't be able to pick the one you stared at for an hour. Our brains too can't compress that data.

So what we experience as random data, is data with a structure that is computationally complex. That is: it is nearly impossible to correlate the data.

The perfect compression algorithm will GROW random data, because random data should contradict the fine-tuned bias for useful stuff.

Yet, somehow, when somebody measures a compression algorithm using random data, they think it's a good thing it can compress that shit.

To summarize:
- 'random data' is subjective: it is whatever our brains or machines find hard to data-mine.
- 'compression' is about having a bias for one set of data, at the expense of another set of data
- the perfect compression algorithm will grow random data
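For what it's worth, the general-purpose compressors discussed in the post already behave this way. A minimal sketch, assuming Python's `zlib`, `bz2` and `lzma` modules and using `os.urandom` as the source of random bytes; the structured input is just a repeated line invented for the example:

```python
import bz2
import lzma
import os
import zlib

random_data = os.urandom(200_000)                # no structure to exploit
text_data = b"Linux. GNU. Freedom.\n" * 10_000   # highly structured

compressors = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}

sizes = {}
for name, compress in compressors.items():
    sizes[name] = (len(compress(random_data)), len(compress(text_data)))
    print(f"{name:5s} random: {len(random_data)} -> {sizes[name][0]}, "
          f"text: {len(text_data)} -> {sizes[name][1]}")
```

The random input comes out slightly larger than it went in (format headers plus per-block overhead), while the repetitive input shrinks to a tiny fraction of its size.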

By: seb Wed, 17 Dec 2008 02:20:07 +0000 thanks for the post!
will you address the archive corruption as well?

another idea is now to rank the algorithms by time AND compression ratio for both compression and decompression.

By: Michael "Agree" H Wed, 17 Dec 2008 02:07:35 +0000 Paul: Agree on that. Slax uses LZMA for its modules, which makes sense based on these benchmarks.

By: Paul Wed, 17 Dec 2008 01:53:34 +0000 Sounds like lzma would be a win for things that are compressed once and then distributed, mirrored and/or decompressed a lot (such as deb packages, and perhaps more so source packages, given its bigger advantage in text compression).

Is there some explanation somewhere of LZMA's "multiple file" advantage?
At face value it seems to me that in most cases on Linux the compression would be done on a single tar file. Is it that LZMA's bigger dictionary allows more patterns to be discovered throughout the tar-aggregated files?
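One way to probe the dictionary-size question, sketched under the assumption that Python's `zlib` (32 KB window) and `lzma` (multi-megabyte dictionary at its default preset) are fair stand-ins for gzip and LZMA: repeat a 100 KB incompressible chunk twice, so the only redundancy sits 100 KB back in the stream. This illustrates that dictionary size matters for far-apart repeats; it does not show that this is the whole explanation for the results in the post.

```python
import lzma
import random
import zlib

rng = random.Random(1)  # fixed seed so the sketch is repeatable
chunk = bytes(rng.getrandbits(8) for _ in range(100_000))

# The second half is an exact copy of the first, 100,000 bytes earlier.
data = chunk + chunk

zlib_size = len(zlib.compress(data))
lzma_size = len(lzma.compress(data))

print("input:", len(data))
print("zlib: ", zlib_size)   # the copy is outside the 32 KB window
print("lzma: ", lzma_size)   # the copy is inside the dictionary
```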