Remember a while back, I wrote about a crackpot who pestered me both about
converting to Christianity, and his wonderful, miraculous compression system? He
claimed to be able to repeatedly compress any file, making it smaller each time.
Well, he’s back pestering me again. Repeatedly asking him to leave me alone,
shouting at him, etc., hasn’t worked. (His current claim is that he doesn’t know how
to delete me from his gmail contacts list.) So I’m resorting to another round of
public humiliation, which I hope will be informative and entertaining as well.
To remind you of the relevant background about compression: what compression does
is reduce the size of a string by eliminating redundancy in that string. A trivial example
of that is that most text strings use a very limited range of characters – a string
made up of pure ASCII text will typically need less that 7 bits per character – but the
common representation uses a minimum of 8 bits per character. The high bit is always zero. So
in a string like that, without considering anything else, 1/8th of the size of the document
is completely redundant.
Looked at in terms of information theory, for a given system, any string contains a particular
amount of information. That quantity of information is typically much smaller than any
common representation – the representation is full of padding and redundancy. So what compression
does is reduce the size of an input string to something closer to the minimum length required by
the information it contains.
But most strings are not compressible at all – which is really
easy to prove. A compression function must be invertible – that is, a given
function f can only be a compression function if there is a function f-1, such
that f-1(f(n)) = n.
Think of the strings of N bits. Suppose we want to compress them to one-half
of their original length. How many of those strings can be compressed that small?
There are 2n/2 strings of length n/2; and there are 2n uncompressed
strings. So, at best, 1/2n/2 strings are compressible to
a string of length n/2 – any more, and the compression function isn’t invertible – you can’t
uncompress!
Even a trivial compressor – one that removes exactly one bit from the input string – can’t compress half of its inputs. If it could, it wouldn’t be invertible. And
even that analysis has a problem: it assumes that all inputs have length N. If you were looking
at strings whose length is less than or equal to n, then it gets even worse.
The problem is very simple to see – the number of compressed strings is smaller than the number of input strings – but to uncompress, you need to know exactly which input string produced a compressed string.
My persistently dimwitted correspondent is apparently incapable of understanding this. In
his latest missive, he sent me a list of documents, which are the result of repeatedly
running bzip2 on an input document. In between runs of bzip2, he runs his own magic program,
which takes the uncompressible output from bzip2, and renders it compressible. According to him,
using his system, he can compress any input document down to about 12K. As a
“proof” of this, he provides a script that runs his program, and shows the file sizes. This
is pretty long, but it’s amusing. What he’s doing is iteratively running bzip2, then running his
magic program, and then running bzip2 again.
7160799 -rw-r--r-- 1 user wheel 415241 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 415241 Apr 14 10:24 in 7160797 -rw-r--r-- 1 user wheel 375688 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 340370 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 307278 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 277451 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 251925 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 228574 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 208202 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 189791 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 172300 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 155808 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 140951 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 128431 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 116992 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 107075 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 97124 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 89283 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 82238 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 74710 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 68219 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 63097 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 58134 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 53304 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 48433 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 45429 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 41524 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 38161 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 35385 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 33316 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 30773 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 29004 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 26714 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 25371 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 23605 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 22728 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 21723 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 20478 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 20092 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 19682 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 19247 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 18686 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 18102 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 17418 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 16543 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 15252 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 15123 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 14958 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 14779 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 14558 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 14313 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 14044 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 13705 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 13283 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12838 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12242 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12469 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11729 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11882 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12086 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12295 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11394 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11543 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11729 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11910 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12105 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12334 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11441 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11565 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11772 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12041 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12235 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12442 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11580 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11754 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11918 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12120 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12369 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11564 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11718 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11883 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12047 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12250 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12481 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11672 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11879 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12093 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12300 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11361 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11510 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11710 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11880 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12109 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12371 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11584 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11790 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12017 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12219 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12431 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11619 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11777 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11977 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12175 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12400 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11576 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11781 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12014 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12203 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12461 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11675 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11854 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12082 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12341 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11489 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11641 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11850 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12047 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12277 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12495 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11692 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11904 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12114 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12296 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11348 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11447 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11564 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11736 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11950 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12142 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12379 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11523 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11642 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11849 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12026 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12249 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12472 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11710 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11891 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12126 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12341 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11403 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11563 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11765 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12004 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12225 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12435 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11563 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11774 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11926 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12152 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12395 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11541 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11744 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11920 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12135 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12320 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11493 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11624 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11788 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11988 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12216 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12461 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11620 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11838 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12017 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12232 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12468 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11759 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11946 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12180 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12385 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11498 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11661 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11869 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12077 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12273 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12515 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11752 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11972 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12146 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12343 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11474 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11593 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11792 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12024 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12223 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12486 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11745 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11932 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12121 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12318 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11396 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11573 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11716 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11949 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12175 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12384 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11565 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11738 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11914 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12123 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12307 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11370 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11521 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11706 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11923 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12112 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12369 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11539 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11672 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11883 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12111 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12383 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11552 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11740 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11893 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12124 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12349 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11547 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11752 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11989 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12233 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12452 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11637 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11830 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12010 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12225 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12469 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11731 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 11912 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 12120 Apr 14 10:24 in 7160799 -rw-r--r-- 1 user wheel 12355 Apr 14 10:24 in 7160798 -rw-r--r-- 1 user wheel 11500 Apr 14 10:25 in 7160799 -rw-r--r-- 1 user wheel 11664 Apr 14 10:25 in 7160798 -rw-r--r-- 1 user wheel 11876 Apr 14 10:25 in 7160799 -rw-r--r-- 1 user wheel 12066 Apr 14 10:25 in 7160798 -rw-r--r-- 1 user wheel 12297 Apr 14 10:25 in 7160799 -rw-r--r-- 1 user wheel 11397 Apr 14 10:25 in 7160798 -rw-r--r-- 1 user wheel 11558 Apr 14 10:25 in 7160799 -rw-r--r-- 1 user wheel 11750 Apr 14 10:25 in 7160798 -rw-r--r-- 1 user wheel 11935 Apr 14 10:25 in 7160799 -rw-r--r-- 1 user wheel 12151 Apr 14 10:25 in 7160798 -rw-r--r-- 1 user wheel 12367 Apr 14 10:25 in 7160799 -rw-r--r-- 1 user wheel 11495 Apr 14 10:25 in 7160798 -rw-r--r-- 1 user wheel 11676 Apr 14 10:25 in 7160799 -rw-r--r-- 1 user wheel 11879 Apr 14 10:25 in 7160798 -rw-r--r-- 1 user wheel 12113 Apr 14 10:25 in 7160799 -rw-r--r-- 1 user wheel 12341 Apr 14 10:25 in 7160798 -rw-r--r-- 1 user wheel 11458 Apr 14 10:25 in 7160799 -rw-r--r-- 1 user wheel 11687 Apr 14 10:25 in
For any arbitrary input file, he can do the same process – so that for any
possible input, his process will, eventually, converge on a file size around 12K. Sounds
absolutely amazing, doesn’t it?
But you know what? I can do better. I have here a script which you can
run on any input, and which will render it recompressible – and it works until the file
reaches around 40 bytes. And it usually does it in very few steps. For example,
I took my local copy of plan9 from userspace – source code and binaries – and
tarred it up. The resulting file contains 149,852,160 bytes. Running gzip on it
shrinks it to 52,824,151. Pretty good compression – around 60%. Interestingly,
rerunning gzip once does actually compress it a tad more – reducing it to 52,563,667 bytes. Doing it again increases the size by a small amount – repeating just cycles aronud 52,600,000 bytes.
Now, I’ll run my magic program on it. The output of my magic program is exactly
the same as the length of its input – only now it’s compressible! So I’ll run gzip on
it again. The result is 29,918,882. Wow!
Now I’ll repeat that! 17013250! Again: 9674608. Again: 5502157. And again, and again, and again, until I get down to 43 bytes.
What’s the catch? It should be obvious – it’s exactly what I said when I talked about
the proof. I can’t uncompress. I can repeatedly remove information from the
file, making it possible to compress it more – but I can’t put that information back,
because it’s not in the compressed file.
And that’s exactly what the jerk did: he can compress, and compress, and compress to his hearts content. But he can’t uncompress.
For folks who are interested, here’s my magic recompressibleizer program:
#include int main(int argc, char** argv) { for (int c = getchar(); c != -1; c = getchar()) {} c = c & 170; putchar(c); } }
What does it do? It changes every other bit in the input file to 0. That removes a heck
of a lot of information – now every byte of the file contains 4 meaningful bits instead
of 8. So gzip can then compress it. And then I can run my program on it again. And again. And
again. Until it reduces the file to a basic minimum space overhead needed by
gzip.
Of course, that “recompressibleizer” script is silly. But the basic idea of it – reducing
the amount of information contained in a compressed string string in an irreversible way –
is what any program that produces iterative compressibility must do. You can’t
ever get around the basic facts of compressibility: most strings aren’t compressible. They
can’t be.
The simple fact, from the proof up towards the top of this post, is that the set of
strings of length N is larger than the set of strings of length less than N. You can’t
get around that.
Any idiot can produce a script which renders uncompressible documents compressible – if they don’t have to be able to uncompress them afterwards. If you want to be able to
uncompress them, then you’ve got a problem: you can’t reinsert the information you
extracted without knowing what it is. The only way of making this scheme work
at all is by adding information to the decompression program – making it larger. As I
explained last time, you can do that, and it can even have a very nice payoff
in the size of the compressed document. For example, using a pre-populated dictionary
of common n-grams in english text can have a great effect on the compressed size of english
documents. But your compressor and decompressor need to be considerably larger to store that dictionary, and they’ll perform worse than a standard LZW compressor on non-english text inputs. And it only works once – once you’ve used that trick to strip out the english-language pattern
data, using that trick again won’t work – the patterns that it takes advantage of
are gone.
So, Mark. Since you’re on the inside, give us the scoop on Google’s plans to speed up Python.
😉
Seems to me, then, is that this “Mr Wheel” needs to demonstrate both the compression and decompression of a document. If he can’t, he should apologize to you and leave you alone.
Re #1:
If I actually knew anything about it, I wouldn’t be able to tell it to you. But since I actually don’t know anything about it, I can safely say everything I know about it – which is absolutely nothing 🙂
Seriously, one of the things that makes Google such a fun place to work is that things are very bottom up, and structured around small, fairly autonomous teams. Since I’m currently working in data analysis for logs, I’m very far removed from any team that might be working on Python.
Re #2:
Wheel isn’t his name, it’s the name of one of the standard administrative groups under unix. My guess is that he’s running MacOS, where the name of the default group for
adminstrative users is “wheel”. (Most Unix variants have a “wheel” group, but it’s rarely a default, which I why I suppose MacOS.
Interesting.
So he’s smart enough to invent some world-shattering compression algorithm, but not smart enough to remove you from his google contact list?
Idiot savant?
Re #4: yeah, I figured as much, but I didn’t have anything else to call him, so…
By the way, I expect that he won’t give a demonstration. To prevent cheating, he has to allow you to perform either the compression or the decompression part yourself (or both), and for you to choose the file to compress. I predict that he won’t allow that, because he’ll claim that the risk is too great that you’ll steal his code (never mind the existence of NDAs).
Still, he would be wise to perform the experiment for himself, to prove to himself if his idea works or not.
Great fun! When I was eight I came up with a very simple scheme to divide any two numbers. I thought I was gonna be famous for that. The only problem, though, is that it never ever produced a single right result… I’m afraid this poor chap is somehow in the same position I was 30 years ago 🙂
Reminds me of my favorite compression algorithm:
Look at an arbitrary bitstring. The zeros are empty, and of course, contain no information, so you can remove them. So, now you have a string of just ones. (dividing the length in about half)
However, the only information in this string is the length, so calculate the length and represent it in binary. (now the string is ln the previous length)
But now we have zeros again, so we can iterate.
After just a few iterations, we’ll end up with just one bit: 1. Can’t beat that!
My favorite compression scheme was from a spoof post to one of the compression usenet groups about a decade ago (I can’t find it now or else I would link to it – it was brilliant).
It used a long sequence of operations to reduce the size of any sequence of bits. For example: discard the lowest 3 bits – this will make it impossible to decompress 7 out of 8 files but this is OK; 7 out of 8 people will use this program to store porn of some kind and will not complain if it doesn’t work.
Once you got down to a single bit, the operation was simple. If the bit is 1 then store a zero byte file, otherwise don’t store a zero byte file.
Compressing data down to zero bytes has many advantages – you could download HD movies over the internet without even being connected internet. The author pointed out that it didn’t correctly compress random data – but nobody wants to compress random data anyway.
Re: #1
You don’t need to be an insider to know about the Python project. It is an open source project, with one of the goals being to push as much of the work back out to the standard Python distribution as possible.
There’s more info here: http://code.google.com/p/unladen-swallow/wiki/ProjectPlan
This guy should take up James Randi’s million dollar challenge.
Kristian, then after compression, he’d end up with $1! w00t
-Rusty
It would be easier to just explain that compression algorithms never really compress anything. The word compression is the problem.
To be really simple just say it’s just about pointing from a big number (big file) to a smaller number (small file). The algorithm is just the decision over which numbers we like and which we don’t like, so that while the numbers we like can most often be given a smaller number to point to (like people in a crowd) the numbers we don’t like must (give room and move to) point even bigger numbers.
0 Don’t like -> 2
1 Don’t like -> 3
2 Like -> 0
3 Like -> 1
Uuuu! Compression! 0 is smaller than 2! Uuuu!
One might peruse from any infinite compression artist what is the number of bits after which The Miracle of Compression Emergence actually happens.
Universal Shrinkulator: Make a B&W graphic of the datafield, then png format. Compression! Draw randomly oriented lines (need only pi rotation – already down by 50%!) of random widths using unremarkable BASIC RND seeds. Any pixels under the line reverse color. Proceed until the entire field is an alternating checkerboard of black and white pixels. *Any* length file can be reversibly compressed to: Two seeds, number of cycles, 0 or 1 (first pixel color), field length and width.
Those sum to waaaay smaller than 12K. Optimizing the pixel recoloring algoritm is left as an exercise for the alert reader.
No they don’t. This is just an obfuscated version of the old “run a pseudorandom number generator until the sequence is replicated and output the position in the PRNG’s output.” It’s not hard to show that, even given a PRNG with a long enough period, the resulting count likely as not contains as many or more bits than the input does.
Re #15:
I’ll answer you the same way that I always answer the Cantor critics who claim to have an enumeration of the real numbers:
How you get around the proof?
If you have 2N strings of length less than N,
and you want to compress it by just one bit, you’re mapping
2N uncompressed values onto 2N-1
compressed values. So every compressed value has on average two uncompressed strings mapped to it. How do you distinguish them?
Brian and Andrew, those were both funny!
I can imagine ‘Mr. Wheel’ (let’s call him so for lack of other information) – believing himself to be a ‘misunderstood genius’, struggling to get the attention of people like you so that his excellent idea can come out and he can get the fame (and perhaps money) he
‘deserves’! Somewhat pathetic in a way…
Moral of the story: More people need to learn and keep in mind the Pigeonhole principle.
Two notes to the author:
1. The (or whatever library) has been stripped out in the program
2. In the second half of the article using the word ‘uncompress’ in two meanings is slightly confusing: “Any idiot can produce a script which renders uncompressible documents compressible – if they don’t have to be able to uncompress them afterwards”. The second ‘uncompress’ can better be called ‘decompress’ as you’ve done elsewhere in the article.
3. The Preview function fails silently if Javascript is not enabled – it gives some crypting error like ‘The requested document was not found’. I guessed Javascript was the problem only by looking at the URL which had something like ‘javascript-is-required’. I don’t know if this is a problem with ScienceBlogs and if so if you can do anything about it, just informing.
Err… The two notes became three after I tried the previewing functionality. Sorry for the silly mistake. 🙂
Mark, in case you’re unaware, Uncle Al is also a crank. A very imaginative crank, but a crank nonetheless. If you were to take his comments at face value, he has solved just about all of the world’s problems. His ability to engineer ad ignorantiam is unparalleled.
Let me see. If you can’t get the data back out of the file, then it’s pretty much worthless. So you can simply throw it away.
Tada. Compression down to 0 bytes.
Well,I dont know if this is the same person I met, say 15 years back (back than he was around 20 years old) where I first heard this ridicules solution.
This individual was speaking about this staff (hologram compression), obviously he was clueless.
That’s the not-so-fine line between compression and hashing; retaining information and losing it; reversible and irreversible.
Someone once wrote a program that would compress and decompress any file. It indeed managed to make every file smaller and to decompress it correctly afterwards. The trick was that it saved the extra data in the alternative data streams using a relatively unknown feature of the Windows file system, NTFS. Of course, what it did is that it used a design shortcoming of the file system to hide the data where it’s not readily visible (for example, the apparent size of the file with a megabyte-large alternative data stream can be just a few bytes). So the super-compressor “worked” in a typical test you would do (compress a file and then decompress it), but, of course, it’s impossible to decompress the file after it’s moved to another file system or transferred over the network. Also, while such ”compression” reduces the apparent size of the file, it doesn’t increase the free space on your disk. But it’s a clever hack, isn’t it? Took some time to figure it out.