Here’s a neat explanation of why there can never be a single algorithm that can compress all files, even if only by one bit each.
- Take the set of all files of 10 bits length. There are 1024 files. Call this Set 1024.
- Now take the set of all files of 9 bits length. It is one half the size of the previous set, with 512 files. Call this Set 512.
- Let’s take a compression algorithm and compress all the files in Set 1024. Each output file must be found in Set 512, as it is by definition compressed and of 9 bits or less.
- But note, that’s a mapping of a set of 1024 files to a set of 512 files. There must be at least one output file in Set 512 which is mapped from two or more input files in Set 1024. Call this File A. So, when we decompress File A from Set 512, which of the files in Set 1024 that it is mapped from should it return?
Edit:
There’s a wrinkle here that I didn’t appreciate when I first wrote this. I should have used the word string rather than file. The problem with using files is that in many cases the same compressed output can be stored differently, by using different length files. E.g. the output 000100101 could be stored as a 9 bit file, or 8 bit file, or 7 or 6 bit file! So the output set, when using files, has a total space of 2 + 4 + 8 + 16 + 32 + 64 + 128 + 256 + 512 = 1022, rather than 512. However, the basic explanation still holds, because 1022 < 1024, so some output files must map back to more than one input file.
January 9, 2010 at 1:19 pm
Nice explanation!
It might help to add that therefore for some input file, the output would be equivalent or larger to the input file…
January 9, 2010 at 6:43 pm
I’m trying to keep the explanation as short as possible – but maybe it could do with a few extra bytes
January 9, 2010 at 2:43 pm
Are you assuming the algorithm is not adaptive?
I think of compression as a “family” of algorithms depending on the data to be compressed.
For example, if I knew some of the 10bit words were more likely to occur (like the words “the” in English or the letter “e”), then my algorithm will pick a short encoding for that word.
Words like “anticonstitutional” that don’t occur that often will be mapped to longer words.
The benefits of compression only become apparent when you have large BLOCKS of text.
January 9, 2010 at 6:38 pm
Hi Ivan
no matter how adaptive or clever the algorithm is, it won’t be able to compress everything. Read my explanation another time, or try Matt’s article from the comments.
January 9, 2010 at 3:10 pm
Nice, a succinct explanation!
I wrote up a longer version for my students some years ago:
http://matt.might.net/articles/why-infinite-or-guaranteed-file-compression-is-impossible/
It uses a proof via the pigeonhole principle to show why it’s impossible in general.
January 9, 2010 at 6:36 pm
Hi Matt
your article brings to mind a PCW article (a uk computing magazine) around 20 years ago, where the journalist reported an amazing new compression algorithm that could keep compressing just about anything. It was in the same period that they reported on a new software product called The Last One (TLO) which was for developing business apps. The claim was that you would never need another programming tool again. Those were the days!
January 9, 2010 at 3:53 pm
Initially this sounded cool. But… is it possible to write a compression algorithm that does this? ie – takes any 10 bit string and compresses it to 9 bits exactly? I think your basic assumption must be flawed.
January 9, 2010 at 6:39 pm
I’m not sure I understand your point. My explanation aims to show that it is impossible to code an algorithm that will compress (and decompress correctly!) e.g. all 10 bit files.
January 9, 2010 at 4:24 pm
[...] get publicly humiliated (not until this post, but who is reading this anyway ). There are various proofs for why perfect compression cannot be [...]
January 9, 2010 at 6:55 pm
You can always compress a non-empty string: simply lop off the last bit. What you really mean is that no such *lossless* compression algorithm exists.
January 9, 2010 at 7:27 pm
Your logic is flawed. You claim to prove that “there can never be a single algorithm that can compress all files, even if only by one bit each”, but all you do prove is that there cannot be an algorithm that compresses all 10 bit files by exactly one bit. It is easy to extend your proof to one that proofs that there cannot be an algorithm that compresses all N bit files to N-1 bits, but that does not imply that there cannot be an algorithm that compresses all N bit files to fewer than N bits.
The proof of that requires one to notice that (including the zero-length file) there are 2^N-1 files of length up to N-1 bits, but 2^N of length N.
January 9, 2010 at 7:33 pm
When I first wrote this I used the word ‘proof’, but soon changed it to ‘explanation’.
Having said that, I agree with you.
January 9, 2010 at 10:58 pm
There are more than 2^n-1 files of length /up to/ n-1, aren’t there? The 1-bit file 1 is different than the 2-bit file 01; 2^n-1 is the number of files of length n-1.
January 9, 2010 at 11:03 pm
I should have used binary numbers or strings rather than files in the explanation, to avoid this issue!
January 10, 2010 at 4:19 am
magine that there is a algorithm=turing machine
that always decreases the length of its input
now apply that machine n+1 times to the input of the length n and imagine what will happen
January 10, 2010 at 8:00 am
So image you have 100 different files. You keep compressing each of them until they are only one bit each. You now have 100 files, each of which can have one of two values: 0, or 1. Now, decompress the two 1-bit output files.You have two output files. What happened to the other 98?
Basically, there is no algorithm that can compress all files, even if only by one bit.
January 10, 2010 at 9:44 am
i agree with you – it just seems that my argument is shorter and does not appeal to the determinacy of the decompression
May 26, 2010 at 7:22 pm
Except for, in the reality only a tiny part of your set1024 will have a meaning. Therefore, It must be possible to reencode the meaningful part of your set1024 into a set512.