Sunday, July 11, 2010

Why you can't have infinite data compression.

I was leaving a comment on this blog post and it got kind of long so I thought I'd make it a post (and got a whole lot longer!).

Even if you can squeeze a quart into a pint pot...

Compression methods make lots of files smaller. Zip can shrink text files because they have lots of redundancy. In fact, most things created by humans have lots of redundancy in them. Repeating something in a different way helps us understand it more easily. Redundancy also defends against small errors. So if, in my first sentence, I had written "compresion", with just one "s" in the middle, you would still know that I probably meant "compression". Maybe I meant "comprehension", but from the title and the rest of the sentence you could rule that out.

It seems like maybe, if you were clever enough, you could somehow compress a file, then compress the compressed file, then do that again and again, and end up using lots less disk space (at the expense of doing lots of compressing and decompressing all the time).

... you can't squeeze 511 quarts into 255 pint pots.

To see why this cannot be done, no matter how clever you are, imagine you had a compressor that always made things smaller. Let's consider all possible strings with 8 or fewer bits. There are 511 of them (2^9 - 1, counting the empty string). If they all get smaller then they must all turn into strings of 7 or fewer bits. There are 255 (2^8 - 1) of those.
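
If you'd like to check that counting for yourself, here's a rough Python sketch that just lists the strings and counts them:

    from itertools import product

    # Every bit string of length 0 through 8 (the empty string included): the inputs.
    inputs = ["".join(bits) for n in range(9) for bits in product("01", repeat=n)]
    # Every bit string of length 0 through 7: the only possible "smaller" outputs.
    outputs = ["".join(bits) for n in range(8) for bits in product("01", repeat=n)]

    print(len(inputs))   # 511, which is 2**9 - 1
    print(len(outputs))  # 255, which is 2**8 - 1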

Now take all of these 7-or-fewer-bit strings and uncompress them. You have 255 inputs, so you get at most 255 outputs, but that's not enough: you started off by compressing 511 things and you only got about half of them back. Some of your input strings must have compressed down to the same output string. That's no good; you need to always get the original back. A compressor that doesn't give the original back is called "lossy", and while that can be great for pictures and movies, where the human eye discards lots of the fine detail anyway, it's not the kind of compression you need for legal documents or computer software.
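
Here's the same argument as a little check you can run against any candidate compressor. The "drop the last bit" compressor at the end is made up purely to show the check in action, and it gets caught straight away:

    from itertools import product

    # Every bit string of 1 to 8 bits: the inputs we want to shrink.
    inputs = ["".join(bits) for n in range(1, 9) for bits in product("01", repeat=n)]

    def find_collision(compress, inputs):
        """Find two inputs that the compressor squashes onto the same output."""
        seen = {}
        for s in inputs:
            out = compress(s)
            assert len(out) < len(s), "this compressor didn't shrink " + repr(s)
            if out in seen:
                return seen[out], s   # two inputs, one output: can't get both back
            seen[out] = s
        return None

    # A made-up "compressor" that drops the last bit. It shrinks everything,
    # and the check finds a collision immediately: "0" and "1" both become "".
    print(find_collision(lambda s: s[:-1], inputs))   # ('0', '1')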

What if I only compress the ones that get smaller?

It seems like you can still do better by only compressing the strings that get smaller and leaving the others uncompressed. The problem is that while the strings get smaller on average, you now have to remember which ones you compressed and which you didn't, so you have to add an extra bit to say whether it was compressed. There are 256 exactly-8-bit strings and only 255 strings of 7 or fewer bits, which means at least one 8-bit string is left as an 8-bit string. Now add the "was I compressed" bit and you get a 9-bit string. It got bigger! Actually it's a little more subtle than this, because you might say that for an 8-bit, uncompressed string you don't need the extra bit: it obviously wasn't compressed. But remember that some of the 7-bit strings weren't compressed either and so became 8-bit strings, and you have to be able to tell those apart from an 8-bit string that wasn't compressed. Either way, you have 511 inputs and you don't want to mix any up, so you need 511 distinct outputs, and in the best case that requires strings up to 8 bits long.
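
Here's that bookkeeping again as bare arithmetic, just to make the counting explicit:

    # Strings that are exactly 8 bits long.
    exactly_eight = 2 ** 8                    # 256
    # All the strictly shorter strings they could compress down to.
    shorter = sum(2 ** n for n in range(8))   # 255, i.e. 2**8 - 1

    # At least one 8-bit string has nowhere smaller to go...
    stuck = exactly_eight - shorter           # 1
    # ...and prepending the "was I compressed?" flag bit makes it 9 bits.
    print(stuck, "string(s) end up at", 8 + 1, "bits")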

Working at it from the ground up.

Another way to come at this is by thinking about what such a miracle compressor would do to short strings. First, consider the 1-bit strings. There are two of them, "0" and "1". Since this compressor must make everything shorter, it's stuck already: the only shorter string is the empty string, and there's only one of those, so at best one of the two can shrink. So let's make an exception for 1-bit strings and say that they always compress to 1-bit strings. So what about 2-bit strings? Well, if it turned any of those into 1-bit strings we'd have a problem, because the 1-bit outputs are already taken (by the 1-bit strings). So let's make an exception for the 2-bit strings as well and let them compress to 2-bit outputs. What about the 3-bit strings? Well, all the 2-bit outputs are already taken, so we'll have to make an exception for the 3-bit strings too... Eventually we'll make an exception for every input, no exceptions!
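
You can play that exceptions game mechanically. Here's a sketch that hands every input the shortest output still available, working from the short inputs up, and even this best-case assignment never shrinks anything:

    from itertools import product

    def strings_up_to(max_len):
        """Every bit string of length 0 through max_len, shortest first."""
        for n in range(max_len + 1):
            for bits in product("01", repeat=n):
                yield "".join(bits)

    # Give each input the shortest output that hasn't been claimed yet.
    used = set()
    assignment = {}
    for s in strings_up_to(8):
        for out in strings_up_to(8):
            if out not in used:
                used.add(out)
                assignment[s] = out
                break

    # Every input ends up with an output of exactly its own length.
    print(all(len(out) == len(s) for s, out in assignment.items()))   # True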

Back in the real world.

There appears to be a loophole in real-world computing: it seems you can always get a win-win by only compressing the files that compress well and applying the right method for each. So Zip your text files, PNG your images, and don't compress anything that doesn't compress well. You always save space or break even. This seems to contradict what I wrote above, but that's because it ignores the fact that filenames take up space. It only works because your computer has already allocated some number of bytes for the name you're going to give the file, so even if you give it a really short name, that space is used anyway. Also, if you already had a really long name, adding ".zip" on the end may not be possible, as it would exceed the size limit on filenames. Some filesystems don't have a limit and only use as much space as the length of the name, but then adding ".zip" takes up 4 more bytes.
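
For what it's worth, the "only compress what compresses well" trick looks something like this in Python (the file name is made up, and notice the scheme still has to spend something, here a ".z" suffix on the name, to remember which files it touched):

    import os
    import zlib

    def maybe_compress(path):
        """Compress a file only if that actually saves space."""
        with open(path, "rb") as f:
            data = f.read()
        squeezed = zlib.compress(data)
        if len(squeezed) < len(data):
            # The bookkeeping lives in the name: that's the hidden cost.
            with open(path + ".z", "wb") as f:
                f.write(squeezed)
            os.remove(path)

    # e.g. maybe_compress("notes.txt")  # a made-up file name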

Just being silly now.

If you use a filesystem that does not have a limit on the length of a file name, then the following compression method "works". Convert the file you want to compress to a number (that's essentially what your computer does when you save a file anyway; the numbers are big: this blog post has about 6,000 letters in it, so it would be saved as a number with about 14,000 digits). Now compress this number by subtracting 1 from it and sticking ".z" on the end of the file name. Now do that again and again and again. Eventually the number will be 0 and the file data will take no space on disk. When you want to read your file you just keep uncompressing it until all the ".z"s are gone from the file name. Great, now if only you could find a way to compress the filenames you'd be set...
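
Just to drive home how silly it is, here's the scheme as a toy sketch, run on a single byte, because a real file would need an astronomical number of steps (which is rather the point):

    def silly_compress(number, name):
        # "Compress" by subtracting one and recording the step in the name.
        return number - 1, name + ".z"

    def silly_decompress(number, name):
        while name.endswith(".z"):
            number, name = number + 1, name[:-2]
        return number, name

    # One byte of "file data" as a number. A real file would be a number with
    # thousands of digits, and this loop would never finish in your lifetime.
    n, name = ord("A"), "tiny.txt"            # 65
    while n > 0:
        n, name = silly_compress(n, name)

    print(n)                                  # 0 -- no data left to store!
    print(len(name))                          # 138 -- it all moved into the name
    print(silly_decompress(n, name))          # (65, 'tiny.txt')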

1 comment:

Troy said...

Maybe he was suggesting guessing? Or some kind of ridiculous procedural treeing with a good final hash you start with?

Doable on military grade equipment purpose built for the computational and memory requirements (such as on nuclear submarines for ULF RX), but it hardly counts as compression, being bounded branching and brute forcing...

Anyway, thought the comment might be enjoyed by anyone considering these ideas. Hashing is not compression.