As lots of you have heard, William Dembski and Robert Marks just had a
paper published in an IEEE journal. In the last couple of days, I’ve received about 30
copies of the paper in my email with requests to analyze it.
My biggest criticism of the paper is how utterly dull it is. It’s obvious
how they got it published – they removed anything that’s really interesting from it. It’s
a rehash of the stuff they’ve written before, stripped of any content that directly hints
at the anti-evolution part of their claims – which leaves a not-particularly-interesting
paper on search algorithms.
I’m not going to rehash my primary criticisms of Dembski’s approach here – I’ve done it lots of times before, most recently in this post, which critiques a very closely related paper by D&M. In fact, this paper
is really just an edited version of the one I critiqued in that post: it’s that paper with all of the intelligent-design speak removed.
What this paper does is take the basic idea of the NFL theorems and build on it in a
way that allows them to quantify success. If you recall, NFL says that, averaged over
all possible search landscapes, the performance of all search algorithms is
equivalent – and therefore, no search algorithm is generally better than pure random
search. But searches obviously work – we use search algorithms every day, and they
clearly work a whole lot better than random guessing.
Search algorithms work because they’re not intended to succeed in all
possible landscapes – they’re designed to work in specific kinds of landscapes.
A successful search algorithm is designed to operate in a
landscape with a particular kind of structure. The performance of a particular search is
tied to how well suited the search algorithm is to the landscape it's searching. A hill-climbing search algorithm will work well in a continuous landscape with well-defined hills. A partitioning algorithm will work well in a landscape that can be evenly partitioned in a well-defined way.
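To make that concrete, here's a minimal hill-climbing sketch in Python. The objective function and step size are purely illustrative choices of mine, not anything from the paper; the point is that the algorithm only beats random guessing because the landscape it's aimed at is smooth, with a single well-defined hill – exactly the structure it was built to exploit.

    import random

    def hill_climb(f, x0, step=0.1, iters=2000):
        # Greedy local search: propose a small random move, keep it only if it
        # improves f. This exploits smoothness; on a structureless landscape it
        # degenerates into something no better than blind sampling.
        x = x0
        for _ in range(iters):
            candidate = x + random.uniform(-step, step)
            if f(candidate) > f(x):
                x = candidate
        return x

    # A smooth landscape with one well-defined hill, peaked at x = 3.
    landscape = lambda x: -(x - 3.0) ** 2
    print(hill_climb(landscape, x0=random.uniform(-10.0, 10.0)))  # ends up near 3.0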
So how can you quantify the part of a search algorithm that allows it to
perform well in a particular kind of landscape? In terms of information theory, you can
look at a search algorithm and how it’s shaped for its search space, and describe how
much information is contained in the search algorithm about the structure of the
space it’s going to search.
What D&M do in this paper is work out a formalism for doing that – for quantifying the amount of information encoded in a search algorithm, and then show how it applies
to a series of different kinds of search algorithms.
None of this is surprising. Anyone who’s studied Kolmogorov-Chaitin
information theory will respond to this with “Yeah, so?”. After all, K-C theory describes
the information content of stuff in terms of the length of programs – the search programs
clearly contain information – they’re programs. And they clearly only search certain spaces, so some portion of the information in the program is an algorithm that
exploits the structure of the search space. This much is obvious. Being able to quantify
how much information is encoded in the search? If you could do it well,
it would be moderately interesting.
It’s a nice idea. So what’s wrong with the paper?
Personally, I come from a background where I look at these things in K-C terms. And
one of the major results in K-C theory is that you can’t really quantify
information very well. How much information is in a particular search? Damned hard to say
in a meaningful way. But D&M don’t bother to address that. They just use a very naive
quantification – the same logarithmic one that Dembski always uses. I dislike it
intensely, because it essentially asserts that any two strings with equal
numbers of bits have the same amount of information – which is just wrong.
(“00000000” and “01001101” have different amounts of information, according to information theory.) The reason that Dembski can quantify things specifically is really because
he’s handwaved that important distinction away – and so he can assert that a particular
search encodes a certain number of bits of information. What he actually means is
that a given search algorithm can be said to contain something equivalent to a string whose length is N bits – but he doesn’t say anything about what the actual true
information content of that string is.
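To see why that matters, here's a quick illustration – my own, not anything from the paper – using compressed length as a crude stand-in for K-C information content. A naive bit-count says the two inputs below carry exactly the same amount of information; a compressor, which gives a cheap upper bound on K-C complexity, immediately tells them apart.

    import os
    import zlib

    uniform = b"0" * 100000      # a hundred thousand copies of the same byte
    noisy = os.urandom(100000)   # a hundred thousand random bytes

    # Naive bit-counting says both are "100,000 bytes of information".
    print(len(zlib.compress(uniform)))   # a few hundred bytes
    print(len(zlib.compress(noisy)))     # slightly more than 100,000 bytes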
Second, he rehashes some old, sloppy examples. As has been discussed numerous times,
Dembski likes to harp on the “Methinks it is like a weasel” example used by Dawkins. But
he misrepresents the example. In Dawkins’ presentation of that example, he used a program
that searched for a string based on its closeness to the target string. It did
not lock positions in the string. But Dembski has always presented it as if it
locks. (In a locking version of this, when a character position is identified as correct,
it’s removed from the set of possible mutations – it’s locked, so that it will never
change from correct to incorrect. In Dawkins’ real example, a correct character position
would sometimes change to incorrect – if, in a given mutation round, one of the candidate
strings added two new correct character positions, but changed one formerly correct one to
something incorrect, that candidate would survive.) In this paper, one of the examples
that Dembski uses is the Weasel with locking. (And he does it without any reference to
Dawkins.) He uses it as an example of partitioned search. He’s absolutely correct that the
locking version of that is partitioned search. But frankly, it’s a lousy example
of partitioned search. The classic partitioned search algorithm is binary search. For more
complicated examples, people generally use a graph search. So why did he use the locking
weasel? I’m attributing motive here without proof, but based on his track record, he wants
to continue harping on the weasel, pretending that Dawkins’ example used locking, and he
wants to be able to say that his use of it was approved of by a set of peer reviewers.
If he claims that in the future, it will be a lie.
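For anyone who wants to see the difference for themselves, here's a minimal non-locking weasel in Python. The population size and mutation rate are arbitrary choices of mine, not anything from Dawkins; the point is just that no position is ever locked – a correct character can mutate back to an incorrect one, and the best candidate of each round survives anyway.

    import random
    import string

    TARGET = "METHINKS IT IS LIKE A WEASEL"
    ALPHABET = string.ascii_uppercase + " "

    def score(s):
        # Number of positions that match the target.
        return sum(a == b for a, b in zip(s, TARGET))

    def mutate(s, rate=0.05):
        # Every position is eligible for mutation - nothing is ever locked.
        return "".join(random.choice(ALPHABET) if random.random() < rate else c
                       for c in s)

    parent = "".join(random.choice(ALPHABET) for _ in TARGET)
    generation = 0
    while parent != TARGET:
        children = [mutate(parent) for _ in range(100)]
        # The best child of the round survives - even if one of its previously
        # correct letters has just mutated back to something incorrect.
        parent = max(children, key=score)
        generation += 1
    print(generation, parent)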
As for intelligent design? There's really nothing in this paper about it. I'm sure Dembski will run around bragging about how he got an ID paper published in a peer-reviewed journal. But arguing that this is an ID paper is really dishonest, because all of the ID-related content was stripped out. In other places, Dembski has used these quantification-based
arguments to claim that evolution can't possibly work. But this paper contains none of
that. It’s just a fairly drab paper about how to quantify the amount of information
in a search algorithm.
A final couple of snarky notes:
- This paper is published in "IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans". This is not exactly a top-tier journal. It's a journal of speculative papers. Out of curiosity, I looked through a handful of them. I'm not super impressed by any of the papers I looked at.
- As usual, Dembski is remarkably pompous. At the end of each paper in the journal is a short biography of the author. Dembski's is more than three times longer than the bios in any of the other papers I looked at. He just can't resist going on about how brilliant he is.
- On a related topic, at the beginning of the paper, in the author identification, Dembski is identified as a "Senior Member" of the IEEE. What's a senior member of the IEEE? Someone who wrote the IEEE a check to make them a senior member. It's a membership level based on dues – student member, standard member, senior member. Marks is an IEEE Fellow – that's a title that you need to earn based on your work.
Saw this on Pharyngula, glad to see you’ve looked at it.
Just a quick question about your information theory example: for what class of strings is it possible to compare information content? You say "one really can't quantify information very well", but then compare the information content of "00000000" and "01001101". Are certain strings quantifiable? If not, do certain strings have a partial ordering on information content?
I assume this is similar to your comment about “most numbers are unobservable” (ie, most integers, reals, etc cannot be represented finitely).
Looking forward to the book!
Note what WmAD claims, over at UD:
Impressive, no?
More seriously, I look on this as wMad using “information” in his usual sloppy way. “Active information” is a measure of how much better a search does on the landscape it’s defined on, as compared to the average over all landscapes. I guess this might be useful, as a summary of the effectiveness of the search, but it would have to be packaged better if it wasn’t to be just ignored.
Is “partitioning landscape” in the 5th paragraph supposed to read “partitioning algorithm”?
Dembski sez:
Truly, he is the Isaac Newton of Information Theory (TM), for his knowledge of compression algorithms is so great that he can pack 16 times more FAIL into a given sentence than the average mere mortal! Let’s see:
1. Critiquing a toy program cooked up to illustrate one aspect of evolution — the difference between mutation and mutation with selection — is a long way from critiquing our understanding of evolutionary processes.
2. A critique of A is not automatically an argument in support of B. What about possibilities C through Z?
3. “[I]nformation is not created by unguided evolutionary processes”. . . oh, mighty Zarquon. We know that mutations can increase information content, by any reasonable quantitative definition of “information”. Start with a population of N identical genes, all of them composed solely of A, and let random point mutations change A to C, G or T for several generations. The probability distribution over the space of possible sequences will have spread out, so the Shannon entropy of that distribution will have gone up; the nucleotide sequences in the final generation will take longer programs than “100 A’s in a row” to describe, so the K-C information content will be larger. This is without considering selection at all, or even other ways of altering genes besides point mutations (like gene duplication, which ups information content in an obvious way).
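(To put a number on that: here's a toy sketch – gene length, population size, and mutation rate are arbitrary choices of mine – showing the Shannon entropy of the sequence distribution before and after nothing but random point mutation.)

    import math
    import random
    from collections import Counter

    ALPHABET = "ACGT"
    population = ["A" * 20 for _ in range(500)]   # identical genes, all A's

    def entropy(seqs):
        # Shannon entropy (in bits) of the distribution over distinct sequences.
        n = len(seqs)
        return sum(-(c / n) * math.log2(c / n) for c in Counter(seqs).values())

    print(entropy(population))   # 0.0 - every gene is identical
    for _ in range(25):          # random point mutations, no selection at all
        population = ["".join(random.choice(ALPHABET) if random.random() < 0.01 else base
                              for base in gene)
                      for gene in population]
    print(entropy(population))   # several bits - the distribution has spread out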
Re. Sharkey (No. 1)
It is similar to the "most numbers are uncomputable" thing. Specifically, you can show that there is no algorithm that computes the Kolmogorov complexity of strings in general (Details here: http://en.wikipedia.org/wiki/Kolmogorov_complexity#Incomputability_of_Kolmogorov_complexity – the gist of it is similar to the Berry paradox: "Let n be the smallest positive integer that cannot be defined in fewer than twenty English words.")
That doesn’t mean you can’t make statements about the Kolmogorov complexity of strings. The easiest is to put an upper bound on it: since the complexity of a string is defined as the length of the shortest algorithm that generates it, the complexity can’t be any greater than any program that outputs the string that you happen to have (it might, however, be less).
In some cases, you might be able to prove exactly how much information is in a given string—but it would require examining all possible programs that could output the given string up to a certain program size. That’s not something you could do in any but the most special situations.
Comparing the complexity of “00000000” and “01001101” is just a fast “reminder” example, because strings that are just a repeat of a single character have extremely efficient representations (i.e. “print ‘0’ n times”), while strings made by tossing a coin should tend towards infinite complexity as the string becomes infinitely long (for properly defined notions of “should”).
“You say “one really can’t quantify information very well”, but then compare the information content of “00000000” and “01001101”. Are certain strings quantifiable?”
IIRC, the Kolmogorov-Chaitin definition of information in a string is something like “the smallest program required to produce that string”. The program to produce a repeating series of zeros is smaller than the program to produce a specific pattern of zeros and ones. So by the K-C definition the former contains less information.
This is quite different from the Shannon type information theory which deals with the uncertainty in signal transmission. Not sure if these two approaches are related or not.
From the conclusion:
From what I understand, (1) is just about impossible without going through the complete search space, as well as knowing in advance what you're looking for. (2) might be even harder. There are plenty more of these kinds of statements trying to dismiss or smear the field of evolutionary computation.
It also has completely random things like throwing out big numbers:
bits -> grams?
Dembski and Marks introduce “partitioned search” with a citation at the end of the sentence, reference number 12. Number 12 is Dawkins’ “The Blind Watchmaker”.
“”00000000” and “01001101” have different amounts of information, according to information theory.”
I do not have an Info Theory background, but I do have a lot of programming experience. Could someone explain this to me? It seems to me there are cases where these strings do in fact represent the exact same amount of information. I am confused as to how these can represent different amounts of information.
If there are only two possibilities of characters in those strings (0 and 1), and their position represents the state of something (and the representation is the same from both strings), then how can one string represent more information than the other?
Did the boundaries that I put on the meaning do that (2 states, state representation)?
Can someone show me a layman's example as to how these two strings represent different amounts of information?
p.s. i’m not a troll. I’m just curious.
Are you saying the dr dr Dembski is being dishonest?! My goodness, missus Pfefferharzen, methinks me’ll have the vapours!
You say that none of the papers impress you. That’s understandable, but just out of curiosity, were there any that impressed you less than Dumbski?
“Senior Member of the IEEE” (Ayeeeeeee!) = you get to bypass peer review just like the big boys in the AAAS (So there! We can too has be professional).
Dr dr Doctorsky is indeed the Isaac Newton of information theory – just like Isaac Newton was the Isaac Newton of alchemy and biblical studies. Personally, I love the fact that Newton was non-trinitarian – has anyone ever properly confronted dr dr Sweatersky with that?
An IEEE senior member is a little more than someone who just writes a check. It’s certainly not the most prestigious title but it is not as worthless as you imply. The IEEE guidelines for elevation to senior status are here:
http://www.ieee.org/web/membership/grade_elevation/grade_elevation.html#Senior%20Member%20Grade
(Please don’t take this as endorsement of Dembski.)
Information-wise, the 00000000 only contains a single significant digit of data. Adding ZEROs to the LEFT of a number in any standard number-system imparts no additional information. So (in decimal) 48 is the same as 000048, or 00000000048. The zeros to the left are assumed and intrinsic by convention.
So, if a number is 0, then it’s one digit. The number 000000 is still one digit because it’s really, genuinely and discretely just 0.
Otherwise, every value would be equal in size and information, and that size or information-measure would be infinite, since any value could have zeros added (prefixed) to the left without changing its value.
TechSkeptic:
IIUC, the idea is that “00000000” can be generated by a simpler “program” (print 0 8 times) than “01001101” (print 0, print 1, print 0 2 times, print 1 2 times, print 0, print 1). By that metric, the second string contains more information than the first.
Seems like any link one might draw between this paper and evolution-as-search-algorithm will be significantly weakened by the fact that the searched landscape changes with every iteration of the search. That dynamism has to impact any analysis of the algorithm.
Is the program language defined?
printf(“%08b”,0) isn’t much different from printf(“%08b”,0x4D) after compilation.
TechSkeptic@#9
When dealing with transmitting strings, one (and not the only) way of looking at information can be informally described as how “surprised” you are by the next element in the string (related to their relative probabilities).
So, if you receive a string where all elements are identical, then you aren’t really getting any information because you *know* what the next element is going to be. The most information is passed when both 1s and 0s are equiprobable.
Mark’s approach is slightly different to this, in that he was describing the ability to compress the string – for a string like “0000000000…” or “01010101010101” this can be compressed highly efficiently, while a truly random binary string cannot be compressed. Hence, the random string contains a lot more information.
Of course, “information” in the mathematical sense doesn’t equate in any way to USEFUL information: a random string the length of Shakespeare’s complete works would contain more information than the works themselves by any metric that I know of.
Be aware that there are lots of definitions of information around, and each of them makes different assumptions and is suited to different situations. In particular, be wary of reifying information – it's an abstract concept that's useful for analysing systems, but it's not a magical thing that, for example, floats around in DNA in some metaphysical aether.
Even then, it sounds like using bogosort as your standard comparison for all sorting algorithms.
I obviously took way too long writing my comment 😉
Reading Dembski's paper, he seems to have a fundamentally wrong view of search.
Or maybe it's me? I see two sources of "information" available to the search
program:
– the programmer adds some information (e.g. the knowledge that the space is
smooth with hills, or that there is some structure to the space, or that
certain parts of the space are less likely to contain the goal, etc.)
– the environment adds some information as the search progresses, in the form
of feedback (e.g. yes/no answers, hints, etc.)
But in Dembski's world, the hardness of search problems is always measured
against a random-query strategy, and it doesn't matter how much feedback the
environment provides after each trial. In Dembski's world, playing hide and seek
with my 3 year old is just as "hard" as playing with my 6 year old.
The 3 year old game goes like this:
Me: [loudly announces] Is he in the closet? [looks in closet]
3-year-old son: [loudly announces] No, he is under the bed! [giggles]
Me: [loudly announces] Is he in the dresser? [looks in dresser]
3-year-old son: [loudly announces] No, he is under the bed! [giggles some more]
The 6 year old son has figured out how to play, you know, the normal way.
Now, let's see. Under Dembski's definition both search games have the same
amount of "endogenous information", since a random strategy will do equally well
for both games. But according to Dembski's paper, the reason I can find the 3
year old so much faster is because I am so much smarter — I have been
programmed (by Jebus?) with “an enormous amount of active information”. But in
the 6 year old game, I’m just not as clever — presumably my programmer Jebus
didn’t give me that active information.
Re: “00000000” and “01001101”
My intention isn’t to pick apart this post, and I don’t know much K-C, but…
I’m just curious and interested.
Seems to me like "00000000" has less information than "01001101" only under
certain assumptions — it really depends on context. I buy the notion that
“information content of string X” corresponds to “length of shortest program
that can produce X”. But it all depends on your programming language.
E.g. In a sane language (say Python), producing a string of zeros is trivial —
very short programs. But in something like Brainf***, it might be “010001101”
that is the one-liner, and producing a string of zeros might actually require an
extraordinarily complex program.
Re #8:
Without going too far into detail:
Kolmogorov-Chaitin information theory deals with what they call algorithmic information. The basic idea of it is that you can take a universal computing device, and measure information content relative to that device. Most commonly, it’s described in terms of a simple turing machine.
The information content of a string is then the length of the shortest program for that turing machine that generates that string.
So we can say things like a string of all 0s has less information than a string made up of a random mix of zeros and ones: the program to generate a million zeros is going to be shorter than the program to generate a specific string of randomly mixed zeros and ones.
But for anything more than a trivial length string, it’s impossible to say for absolutely certain what the information content of a string is – because it’s impossible to determine the length of the shortest program that will compute that string. You can
say things like "The information content of this string is no more than X." But you can't say "The information content of this string is exactly X."
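A concrete way to see the upper-bound half of that: any program you can actually exhibit pins the information content down from above, relative to whatever language or machine you've fixed, plus a constant. For instance (a sketch, with the usual caveat that the exact count depends on the reference machine):

    # Any witness program gives an upper bound on the K-C information content
    # of its output, relative to the chosen language, plus a constant.
    witness = "print('0' * 1000000)"
    print(len(witness))   # 20 - a million zeros need at most ~20 characters
    # For a specific string of a million coin flips, the overwhelming majority
    # of such strings have no witness anywhere near that short - but proving
    # that a particular short program is the shortest is what you can't do.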
Would it be fair to say that a corollary to their proposal is that the brain needs no particular approach to finding information – a random memory search would be just as good as an algorithmic one?
The language doesn't matter once the strings get really long. Think of printing a string of 1,000,000 digits. Then one compiles down to something on the order of a couple of bytes, and the other takes up just about a megabyte.
What is it with Dembski and “Methinks it is like a weasel”? For crying out loud, it was just one, small, illustrative example in a book Richard published in the 1980’s! Yes, the 1980’s! And in that book, he pointed out all the limitations of the example and how it was just an extreme simplification to demonstrate one aspect of natural selection and random mutation.
Almost every single piece of muddled writing by Dembski I have had the patience to wade through has contained references to “The Blind Watchmaker.” I think Dembski’s obsession with this book (rather than any of Richard’s other books, even “The God Delusion”) may be due to the subtitle: “Why the evidence of evolution reveals a universe without design.”
It’s time to move on, Bill.
Oops that was RE#15.
Exactly – pounding on about an ancient and completely trivial example of selection when there are fantastic examples of mutation/selection within Artificial Life systems/genetic algorithms/genetic programming makes him look like an idiot.
Which shouldn’t be that much of a surprise, obviously.
@TechSkeptic and the various folks who attempted to answer:
I presume Mark is referring to Shannon entropy or a similar concept. I don’t understand it formally, but since I do a lot of work in compression I have a vague grasp of it. Pretty much all lossless compression schemes work by attempting to maximize the entropy in the compressed data.
If you understand Huffman coding then you can get a very rough understanding of entropy in information theory by thinking of it this way: If you pick the optimum Huffman table for a string and it barely compresses, then the string has a lot of entropy/information. If it compresses very well, then it doesn’t have much entropy/information.
Note that natural language is extremely low in information content. This makes sense when you think about it — natural language developed to facilitate communication over extremely noisy channels, i.e. human hearing, speech, comprehension, etc. Redundancy facilitates error-correction, which is important over a noisy channel, but it also means less entropy.
Which incidentally is one reason why you can compress the shit out of plain text 🙂
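If you want to play with that, here's a rough sketch of my own (a toy, not production compression code) that builds a Huffman code from a string's character frequencies and reports the average code length. Plain-ish English text comes in at roughly half of the raw 8 bits per character just from single-character frequencies, and dictionary-style compressors like gzip do far better still by exploiting longer-range structure.

    import heapq
    from collections import Counter

    def huffman_avg_bits(text):
        # Average bits per character of an optimal Huffman code built for `text`.
        freq = Counter(text)
        # Heap entries: (weight, unique tiebreaker, {char: code length so far}).
        heap = [(w, i, {ch: 0}) for i, (ch, w) in enumerate(freq.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            w1, _, left = heapq.heappop(heap)
            w2, _, right = heapq.heappop(heap)
            merged = {ch: depth + 1 for ch, depth in {**left, **right}.items()}
            heapq.heappush(heap, (w1 + w2, tie, merged))
            tie += 1
        lengths = heap[0][2]
        return sum(freq[ch] * lengths[ch] for ch in freq) / len(text)

    sample = "it was the best of times it was the worst of times " * 40
    print(huffman_avg_bits(sample))   # roughly 4 bits/char instead of 8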
Re #20:
As I said, K-C information theory describes the information content of a string relative to a particular universal programming device – most frequently, a basic universal Turing machine. You can get different metrics by using a different computing device – a minimal brainfuck program is obviously going to be a different length than a minimal universal Turing machine program, which is going to be different from a minimal Minsky machine program. But as long as you're willing to assume that you're talking about information using a computing machine which wasn't designed to be arbitrarily pathological, you can make general statements. The minimal length of a program for a particular string is going to be different on a Minsky machine than it is on a Turing machine – but the program to generate a string of 20 zeros will still be shorter than the program to generate a particular string of 20 random digits.
Thanks all who took the time to answer my question. I understand now.
Re #14:
Not really.
As I've mentioned before – any dynamic landscape can be modelled as a static landscape with extra dimensions. You might need to add a lot of extra dimensions, producing an incredibly complex landscape, but the basic concept of search still works.
Dumbski:
Wow. That’s a lot of fail right there. One would like to think that a *cough* professor would at least understand and recognize such a blatant false dichotomy when submitted by a student…..
Plus, in the case of evolution I’d be guessing that the rate of change of objective function landscape vs. generation time is generally quite low, meaning that hill-climbing is still effective, even if you’re climbing a hill that’s gradually changing shape.
Certainly, for an objective function that changed very quickly you’d probably end up with a situation very similar to NFL, in that your last measurements would relate bugger all information about the current state and so you may as well revert to random search.
To clarify the two binary strings thing:
Another way of thinking of K-C information measures is that the information in a string is equal to the shortest possible description of that string that doesn’t refer to anything else.
The shortest description of 00000000 is “8 0s”.
Can you imagine any description of “01001101” that will be shorter than that? (I can’t think of anything shorter than 01001101.)
@Mark #33
“77”? 😉
Re: 24 & 26
I’m just a humanities guy who linked here from Pharyngula. Besides your points – isn’t it odd in a “scientific” paper to build a thesis around a critique of a popular work (let alone a minor illustrative point therein) rather than a peer-reviewed scientific publication?
Seems to reduce the credibility of Dembski’s paper to me right off the bat, even if I know bubkes about info theory and evolutionary biology.
One of my research projects right now, actually, is investigating which approximations used in theoretical ecology break down when the fitness landscapes are dynamical. In the models we’re studying, the timescale for the fitness landscape is 200 generations or so: long enough that you’re definitely not just looking at pure noise, but short enough to be significant.
Mark C. Chu-Carroll: But for anything more than a trivial length string, it’s impossible to say for absolutely certain what the information content of a string is – because it’s impossible to determine the length of the shortest program that will compute that string.
In this case “impossible” more formally means “requiring general solution of the halting problem for the program class”.
This isn't quite the same as "impossible" in a philosophical sense of "yields a contradiction internal to the system". If you restrict the complexity of the computer creating the string to something more basic than a Turing Automaton, the corresponding halting problem may become Turing Computable. Correspondingly, if you have a hypercomputer lying about in your philosophical toys, UTMs become similarly tractable; however, no such entity has been identified in "real world" practice, there's little hope of obtaining one, and it would merely shift the question to hypercomputational information content.
In the case of finite strings and a UTM, it's trivial to find some program that generates the string, giving an upper bound to the number of states/symbols. This bound shifts the difficulty from finding a fully general ω-(state/symbol) bounded halting solution to finding some particular n-(state/symbol) bounded halting solution… which is finitely enumerable, and therefore for any particular n-bound has some Turing computable solution. This, however, is akin to saying that it is mathematically possible to find particular values for the Busy Beaver function; it gives no help in the general process of finding answers, beyond bare hope of recognizing an answer should it walk up and bite you.
So, while not “impossible”, it’s close enough for government work; your odds are better for talking your two favorite Hollywood star(let)s into a weekly ménage à trois.
Re #35:
It would if that were the main point of the paper – but it’s not.
As I said – as a math paper, it's about using information theory to quantify the information in a search system that allows it to succeed in certain kinds of fitness landscapes. The reference to Dawkins' weasel is a small section, which never directly critiques Dawkins; it just uses the weasel as an example of the technique.
I’m just guessing, but I think that for Dembski information is, or should be, linked to meaning. He would probably say that a string of DNA that encodes a gene contains more information than a random string of the same length. This is the only explanation I can think of for his denying that random mutations tend to increase the information content of the genome (he just ignores gene duplication and the like, of course).
That also explains why he uses the logarithmic measure for information content, rather than one based on K-C theory. Under the former, point mutations don’t alter the amount of information. In other words, the choice of quantification method is probably motivated by his ID background.
Now, if he really was the Newton of Information Theory, he would come up with an entirely new, semantic measure of information content.
#12, we are not talking about numbers but strings of characters. Then 0s matter. For instance, a string of 0s carries information in its length, but not much else.
#15: The programming language doesn’t matter in asymptotics. Changing programming language adds a constant to the KC complexity, which can be ignored when looking at strings of length n as n goes to infinity.
General reply: A more mathematically precise way to give an example of two strings with different KC complexities is to say that the complexity of a string of n 0s is O(log n), which is an upper bound you can see by writing a program… while the (expected) complexity of randomly chosen string of n bits is at least, say, n-1, a lower bound you can see by a counting argument.
In other words, to speak rigorously about KC complexity (indeed, since it is very hard to give lower bounds on the complexity of a *fixed* string), you have to talk about asymptotic behavior and/or strings chosen according to some distribution.
@34:
Assuming the smile means you know its directional arrow is reversed …
I’m glad you looked at it. As a member of IEEE, I felt dirty seeing Dumbski in an IEEE journal, but that’s not my field, and it would have taken me hours, primarily reading your previous refutations, to analyze it.
Re: 20, 25, 40
Obviously, the choice of computing model matters for the specific bit count you’re going to get, but it only matters up to a constant difference—similar to how the same temperature in Fahrenheit, Celsius, and Kelvin is different, but relationships between temperatures hold true.
The proof goes something like “You have a program P in language L that outputs string x. As a worst case, write an interpreter for language L in language M, bundle your interpreter with program P and run it to output x. Your new program’s length is the length of P plus the length of your interpreter.” [end{handwaving}]. See wikipedia: http://en.wikipedia.org/wiki/Kolmogorov_complexity#Definition
“The shortest description of 00000000 is “8 0s”.
Can you imagine any description of “01001101” that will be shorter than that? (I can’t think of anything shorter than 01001101.)”
4Dh
Nope – 4Dh isn't "01001101". It's an alternate representation of the string in hexadecimal form. To be a description of the string, you'd have to say "4Dh in binary". And even then, you're relying on the fact that "h" is a shorthand for "base 16" – so you're really saying "The base-16 number 4D as a binary string". Which is a whole lot longer than the 8-bit string itself.
Searching the comments, I find three uses of “Dumbski”, a really bad (and amusing) pun.
I salute you!
that is a lovely way to say it. I am not sure why so many people are going back and forth on K-C here. chaitin wrote numerous papers on his work, all now available in one volume: http://www.amazon.com/THINKING-ABOUT-GÖDEL-TURING-Complexity/dp/9812708960
“All that is correct is not new, and all that is new is not correct”
Because a comment thread about Kolmogorov-Chaitin complexity is more interesting than anything William Dembski has ever published.
MPL:
Yes.
E.g. In a sane language (say Python), producing a string of zeros is trivial — very short programs. But in something like Brainf***, it might be "010001101" that is the one-liner, and producing a string of zeros might actually require an extraordinarily complex program.
well, it turns out that what language you choose (Python or brainf*ck) to implement the program in only contributes an additive constant to the KC complexity, so asymptotic bounds will still hold. in fact, this constant can, with some effort, be evaluated to a determinate value!
all the details can be found on greg chaitin’s home page, he has all his books online and they are pretty smooth going. more exclamation points per page than any other mathematical writer!
"As I've mentioned before – any dynamic landscape can be modelled as a static landscape with extra dimensions. You might need to add a lot of extra dimensions, producing an incredibly complex landscape, but the basic concept of search still works."
Even when the searcher’s state is an input into the function that changes the state of the landscape? I wouldn’t have expected it to simplify so readily.
Don’t you have to state what the information is about? Couldn’t both strings then have no information about that something?
As one or two have pointed out, “Senior Member” of the IEEE is not based on dues. I’m not sure where you got that. It’s between a “Member” and “Fellow”. Every Fellow was once a Senior Member.
Second, I’ve read in the past that his logarithmic definition of information confuses you. I haven’t read any of Dembski’s work, but Shannon Information Theory, which is more prevalent in the field of communications, makes great use of this definition, and that’s probably where he got it.
From my sparse knowledge, K-C IT has a hard time assigning absolute values to the information of a string or whatnot, but Shannon Theory defines “mutual information” as a (logarithmic) function of the pdf of a signal and the pdf of the channel the signal travels though. But you guys all know much more about math than I do, I’m a comm guy and the logarithmic definition makes some sense to me.
Finally, I’ve been a (student) member of the IEEE for 7 years now and I’ve never heard of that journal. The IEEE is known for having large variation in standards amongst its journals, but this is all speculation and I know nothing about this journal or paper.
Hmm… With all this talk of Information Theory and the quantification of the information bits in strings, I’m curious if anyone here has come across the use of T-complexity as a deterministic method for quantifying the information content of finite-length strings.
http://tcode.auckland.ac.nz/~mark/Deterministic_Information_Theory.html
I’m hardly an expert in Information Theory – but I’ve always found this to be an interesting way of looking at it.
Curious to hear what other people think.
Re #53:
No. In information theory, information isn’t about anything.
There are two main branches of information theory: Shannon, and Kolmogorov-Chaitin.
In Shannon, the point of studying information is to understand how it can be communicated over a channel. An ultra-simplified idea of what it's used for is: Given a particular communication channel, it's got a bandwidth that determines how quickly you can transmit a certain amount of information over that channel. That's an absolute maximum. If there's noise on the channel, then the noise adds information; if the message you're trying to send plus the information added by the noise exceeds the channel's bandwidth, then you're going to lose part of the message. By understanding the properties of the channel and the kinds of noise that could be introduced, you can determine an effective maximum usable bandwidth
for the channel. If you think of it that way, the content of the message doesn’t matter.
Kolmogorov-Chaitin winds up with a definition that is effectively very similar – but which comes from a very different starting point. K-C came out of theoretical computer science – particularly the field of computational complexity. In K-C, the amount of information in a string is, basically, the length of the shortest complete, self-contained description of that string. Again, what the string means doesn't matter: the string is nothing but a sequence of characters – all that matters is how long the shortest description of it is.
Re #54: as I understand it, Dembski’s definition of “information” doesn’t match either the Shannon or K-C definition. Going back to Mark’s binary string, Dembski argues that “00000000” and “01001101” have exactly equivalent information; according to my (admittedly rusty) understanding of Shannon entropy, the second byte has more information–that is, it’s less predictable, it’s less redundant, and/or errors are much easier to spot.
well, it turns out that what language you choose (Python or brainf*ck) to implement the program in only contributes an additive constant to the KC complexity, so asymptotic bounds will still hold. in fact, this constant can, with some effort, be evaluated to a determinate value!
just wanted to follow myself up: this constant is the minimum program length for which the complexity is uncomputable in the language, so for programs shorter than the constant, you can say whether they are of minimum length or not, but for programs longer than the constant, it’s uncomputable!
pretty weird how it gives a definite threshold between computability and uncomputability just in program length.
In Shannon information theory, all point distributions, such as 00000000 and 00101100, have the same information (relative to the uniform distribution). This theory is what Dembski uses, and it is closely tied to the No Free Lunch theorems that he often distorts.
Mark: In the mainstream, “information theory” always refers to Shannon information theory (see e.g. the Wikipedia page). When talking about Kolmogorov complexity/information, please add “KC”. (Especially in the sentence ‘”00000000″ and “01001101” have different amounts of information, according to information theory’)
Without context, it’s not possible to decide about a difference in 00000000 and 01001100. It’s less predictable? Well, those strings are given in the past, there is nothing to predict.
You might see them as 76 and 0, or as 'L' and ''. In the first case a leading zero wouldn't add much information here and there, but in the second case it would be a big surprise to get a byte of 9 bits.
Maybe those zeros are binary answers to 8 questions, and then we have 8 answers here, and 8 answers there. Different answers, but who will decide without seeing the questions?
One is more random than the other? Well – assume a coin was thrown 8 times. The first might be a result of such an experiment, and the second too, and both are equally probable. More predictable? Which? Why?
Without context, there is no information in 01001100. And none in 0000:0000 too. It’s noise.
Re #60:
Wrong.
“Context”, as you’re using it, is just another way of saying “meaning”. Information theory doesn’t care about meaning.
Before you start talking about how that’s all wrong, you need to understand what information theory tries to do.
Take anything which is descriptive – whether it's the positions of atoms in a crystal, the number of photons being released by a star, the words of a book, the bits in a computer's memory. There's an essential underlying similarity between all of those – that is, they can be described. The whole concept of information is essentially based on that fundamental idea of describability.
Algorithmic information theory (which is the one that I’ve studied) takes that, and abstracts it. If all of those things are describable, then what is a description? What properties does it have? How can we quantify it?
What you called context is related to meaning – which information theory has deliberately abstracted away. But in addition to that, it’s got elements of description to it. If a string of bits consists of answers to a series of questions, then it’s not a complete description – to be a complete description (and thus something quantifiable by IT), it needs to include the questions.
In other words, if you want to quantify the information content of the answer to the question “Have you studied information theory”, then what you’re going to quantify isn’t just “yes”, it’s “Yes I have studied information theory”.
Okay, let’s call it meaning instead of context. But you like to call the description information, and claim the description to be of no meaning too?
I flipped a coin 10 times, and the result was 0100:1100. (Assume 1 means number, 0 means head).
I have to give you explicit information (the sentence above), and there is implicit information too, which we rely on sharing: what a coin is, what a number is, what a zero is, what a blog is, and so on.
Since I didn't study information theory, we have more work to do to achieve a common starting point. My question is: the description of a piece of information is itself information, and has a closely related meaning. Maybe you don't call it number and head in the US – I'm not a native English speaker; we call it 'Kopf oder Zahl' or 'Wappen oder Zahl' in Germany, the second meaning being emblem, crest, or number. Ah, now I found it, you call it: head and tail.
I don’t understand tail, which I know from ‘tail of a list’ for example, or the tail of a cat.
Using a dictionary, I can understand, that ‘head and tail’ means ‘Kopf oder Zahl’ in german, but of course I have to know what ‘Kopf oder Zahl’ is.
Information is normally transferred to someone else, and if you share knowledge and culture, you can transfer the information very easily, but if not, you will have to transfer more information. But there is no absolute value of how much information you need, independently of the receiver.
If you share an algorithm for data compression, you can reuse it for many pieces of information, provided your receiver knows it and is able to detect its usage. Using compression to transfer 0100:1100 in a different form here in the blog will not work, because the description of the compression of binary(76) or binary('L') is longer than the described data, but this will change if we have 8000 coin flips.
Look, you just admitted that you’ve never studied information theory, and you have no idea what it’s about. How on earth can you think that you can make a meaningful argument about a specialized field of mathematics that you haven’t bothered to learn anything about?
Well – I thought information theory is a science, where you have arguments, and proofs, which are communicable, not that you get it magically by studying.
If you can explain your theory, do so.
If you don't like to talk to people without an academic degree, I'm sorry I disturbed you.
Re #64:
You know, I could have sworn that most people don’t consider “studying” to be magic. Most people consider it the basic method of expending effort to learn about a topic.
I also could have sworn that most people understood that they’re not entitled to a free education in any topic that they want in any forum they want just because they demand it.
Like anything meaningful, you’re not going to learn everything you want to know about information theory in an afternoon of reading comments on a blog. You’re not going to learn to completely understand relativity in an afternoon on a blog if you’ve never studied math and physics. You’re not going to be able to learn to prepare fugu sushi in an afternoon on a blog if you’ve never learned anything about making sushi. You’re not going to be able to learn to make blown-glass champagne flutes in an afternoon if you’ve never studied glassblowing. And if you demanded that some random person who knows how to do any of those things spend their time teaching you, just because you think you’re entitled, what do you think they’d say?
Roughly the same thing I’m saying. It’s not my job to make up for your unwillingness to study the subject.
I studied math for years while I was in college and grad school. I spent a full semester on an introductory course in Kolmogorov complexity (which is based on Kolmogorov-Chaitin information theory), and then dozens of hours of my own time with Greg Chaitin’s books, in order to become basically literate in information theory. I’m far from an expert on the subject, but I’ve got at least a reasonable clue. It’s not a trivial topic. You’re not going to really understand it by reading a blog for an afternoon.
What you’re trying to do is to demand that I spend my time spoon-feeding you a subject that you could learn on your own if you were willing to expend the effort. There are plenty of papers, textbooks, and college courses available to teach you information theory if you really want to know. Hell, there are some introductory articles on it in the archives on this blog; there are other introductions scattered around the web.
Stefan, Mark's already told you the most important thing you need to know about information theory – that it's not about _meaning_, meaning is unquantifiable – and you're ignoring him. You might want to do a little more learning _on your own time_ now.
Just to illustrate the point, the ballad of Paul Revere involves a message being sent which contains _one bit_ of information: the lights in the tower are one if by land, two if by sea. That’s the information to make one binary choice: 1 bit. The _meaning_ of the message is a very different thing: one of its meanings _now_ is “the American Revolution was successful”; one of its meanings _then_, to a redcoat, was “Oh shit, we’ve been spotted”.
I made an argument that there are cases where 0000:0000 doesn't contain less information than 0100:1100, and you insisted that I shouldn't argue about questions I didn't study. You didn't say how successfully one should have studied the subject, so I thought just studying would magically make me a valuable participant in the discussion.
The question, whether both Strings contain the same amount of information or not seems to me a very basic question, which I might answer without having a diploma. I’m not completely foreign to mathematics, statistics and programming, but in my opinion one should argue on arguments, not on grades.
If we have the metainformation that the following message consists of 8 bits, and the convention to omit leading zeros, we can compress it to a single 0. But without the metainformation, a single 0 might mean any number of zeros, and therefore it looks questionable to me whether the second string contains more information or not. A message of a single 0 might be incomplete – without the convention to omit leading zeros, we just don't know.
Just from the fact that the zero is repeated, you can’t conclude it is redundant. Maybe the last bit is a checkbit – maybe not.
Whether these 8 bits of information show the state of 8 switches on a machine, or 8 events of coin flipping, or 8 answers to yes/no questions in a questionnaire is of no interest – I agree. But whether this is the whole message, and whether it is encoded or compressed, touches the question of the meaning of the information. In the real world, there is implicitly more information involved. Else, I could argue, you receive the first bit of 0100:1100 and, instead of waiting for the rest of the information, you could assume it to be the complete message (0000:0000).
I would still like to see a description of eight bits of zero which is shorter than a byte. Of course “8 0s” isn’t shorter – not on my machine.
You said "the program to generate a string of 20 zeros will still be shorter than the program to generate a particular string of 20 random digits". That's a contradictio in adjecto, because a random string of 20 digits could be 20 zeros, which cannot be longer than itself.
I don’t demand anybody to spend more than a few minutes on my post, just see it as invitation. If you don’t like it, ignore it.
Stefan, you are arguing that there are cases where 00000000 doesn't _mean_ less than 01001100. True. So what? But there are well-defined mathematical senses in which 00000000 does have less _information_ (the sum of −p ln p, for example?). Which is what Mark was saying. So….?
Which contradicts the assumption, that you can abstract away the context in which the information is used – no?
Stefan, for the Nth time: meaning is context-dependent. Mathematical information is much less so. Are we done?
Re #67:
That’s what you say, but you refuse to listen to simple answers to your questions, demanding deeper answers; and the deeper answers, you don’t understand.
You’re *still* arguing that “information” has to be interpreted in terms of meaning. Meaning isn’t quantifiable. Meaning is context dependent. Meaning is intrinsically fuzzy and abstract in terms of quantification. Information theory isn’t about meaning. Meaning is irrelevant. Information theory is about quantification.
There are two branches of information theory – and neither one is concerned with meaning. Shannon theory deals with communication over a channel – but communication is defined as nothing more than a sequence of bits. Kolmogorov-Chaitin theory deals with information even more abstractly, in terms of algorithms. But they're both focused on quantification – which works the same way regardless of what the stuff means. If you want meaning, don't waste time on information theory, because that's not what it's about. You can find mathematical work studying meaning in logic, denotational semantics, domain theory, representation theory, and things like that. But not information theory.
You keep harping on things like the idea of stripping leading zeroes from a number. That’s still meaning. You’re looking at a sequence of symbols,
interpreting them as representing a number, and then saying that "00000000" represents the same number as "0". But that's because you're applying meaning. "00000000" is a string of symbols taken from a particular alphabet. It's not the same as "0", or "0000000000000000000000". If you take the meaning of those as decimal numbers, they're the same. If you take the meaning of those as unary numbers, they're not the same. If you take the meaning of them as C-format 0-terminated character strings in computer memory, they're not the same. But that's all meaning.
I didn't say that. I said that you can't assume the zeros to be redundant without knowing their meaning.
Re #72:
Redundant is a term related to meaning. You can't assume a sequence of zeros can be compacted to one zero unless you know how to interpret the sequence – which requires meaning.
But in terms of information theory, you absolutely can say that a series of zeros has less information than a series of randomly mixed binary digits. The random stream isn’t compressible; any series of zeros is.
It’s easier to illustrate using a larger example. Imagine a sequence of 255 bytes of zeros. You can generate that with a program:
for i from 1 to 11111111 {
for i from 0 to 111 {
emit 0
}
}
But to emit a 2040-bit string of randomly mixed zeros and ones, in the most common case – that is, for fully 1/2 of such strings – a program that generates the string needs 2040 emit statements, one for each of the 2040 bits; you can't find any method for removing even one emit statement from it.
That's true regardless of whether you interpret the meaning of those strings as decimal numbers, or null-terminated UTF-8 strings, or alphabetic sequences. The meaning is completely independent of the information content. The information contains everything you need to reproduce the string, regardless of what it's going to mean. By nailing it down to a single meaning, you can choose an interpretation which allows you to recognize a shorter string with the same meaning – but it won't have the same information content, because information content is independent of meaning.
It's not just easier to illustrate with larger numbers – it's not possible to illustrate with small numbers, because the small number will always be shorter than a generating program.
If your inner ‘i’ doesn’t affect the outer ‘i’ I would expect your program to print 8×255 zeros, but I assume that that is an unimportant detail. It is pseudocode, isn’t it?
And you compare an algorithm to generate a specific string with an algorithm to generate unspecific strings? That's not fair. Your program will look nearly identical if you just print "1"s instead of "0"s, but it's not the same.
If you produce a sequence of 0101010… your algorithm needs only half as many loop iterations, but always calls
emit 0
emit 1
Of course it depends on predefined possibilities (for-loop, emit) which aren’t well defined.
If you look at this, which might look like a random pattern:
10111011011111000101010001001110011
you could assume it to need one emit statement per digit, but since you expect some argument you won't be surprised to hear that there is a simple algorithm to produce this pattern:
echo “scale=32;-3+(4*a(1))” | bc -l | tr [2-56-9] “00001111”
(4*a(1) creates PI, a:=atan)
Okay, what do I want to show: There are patterns which are obviously easy to produce, and other patterns which are easy to produce, but where we don't see it. I wouldn't recognize any PI-pattern like the one above, but it's clearly an easy-to-produce pattern.
If you would say that, because it is easy to produce, it therefore contains less information than – for example – 101001001100000110111110100101101 (which I produced with an easy algorithm too, but which isn't easily reproduced:
for i in {0..32} ; do echo -n $((RANDOM%2)) ; done
) – well, I agree the first one is easy to reproduce algorithmically, but I wouldn't accept that as a measurement of information. If it is already common sense in science – well, that's not my problem. I wouldn't accept that as a reasonable measurement.
But maybe there are a lot of useful cases for that measurement.
Conclusion: I can agree with you that programs that produce sequences of digits are short if there is just one repeated digit, and if the sequence is longer than a threshold that depends on the language used to produce it.
And I see that Kolmogorov complexity is a well-established measure, and that your claims are in conformance with that theory.
But I doubt that Kolmogorov complexity is an adequate measure of information. Depending on the specific Turing machine or programming language, the algorithm to produce a pattern might be more or less complex.
Talk about arrogance! You admit that you’ve never actually studied information theory. But based on a half-assed intuitive argument on a blog, you feel qualified to conclude that it’s all wrong and useless.
Some of the finest mathematicians in the world worked on information theory. Claude Shannon was one of the greatest scientists of the 20th century. Without him, we wouldn't have the network that we're using to communicate, or the computers that we use to read and write these messages. Much of that technology relies on his work in information theory.
Greg Chaitin is, in my opinion, the finest living mathematician in the world, and one of the greatest of the 20th century. I've had the incredible good fortune to meet him personally, when we both worked at IBM. He is, by far, the smartest person I've ever met. His work on information theory started when he was a teenager: he saw it as an implication of Gödel's work, and derived it from first principles himself, at the same time as Kolmogorov. His work on algorithmic information theory has been both profound and incredibly useful.
But of course, you, as some random twit on the internet, know better than Kolmogorov, or Chaitin, or Shannon. Your intuition tells you that it doesn't make sense, and that's enough.
Jackass.
Stefan, your criticism seems to be that information theory does not do what you think information theory should do. This, however, is not a problem with what information theory _does_ do.
And if Mark points out that _in general_ random-looking strings need a long algorithmic description, and you respond with _some_ strings that might have shorter descriptions, then you have not actually responded to the point. Notice, by the way, that your example using a call to random is a completely different kind of algorithm and, to be comparable to Mark's earlier pseudocode, would have to explicitly include the random function! – good luck getting _that_ shorter than "emit 0 n times".
You might, in short, want to get to grips with what information theory _does_ before declaring that it’s unsatisfactory to you; otherwise you’re rather in the position of declaring jet aircraft unsatisfactory for air travel because they don’t dance the merengue or taste of key lime pie.
@Stephen: The (maybe pseudo-)random string I presented was meant to be in contrast to the generated string made from the digits of PI. In contrast.
I only showed the way I produced it to underline that it is NOT easily reproducible. I said so.
Your criticism could have mentioned that you need a math library to generate atan(1) – that would be true, and it would fit my example – but as far as my shallow knowledge of information theory goes, you are intermixing Shannon's theory and Kolmogorov's. You would need a machine to generate some output, too, and who defines what is built into the machine and what is not?
Say you show me a specific sequence of ones and zeros (which might be produced by coin flipping – I was just too lazy to do that). Then I may produce a machine which has a few commands, and the shortest command is the one that produces exactly that sequence.
There wouldn’t be a similar command to produce other, similar sequences.
That’s the same situation as declaring 1111:1111 to contain less information than 0110:1111, just because the normal machines we deal with can produce the former more easily.
Look at 2396745. It doesn't look very redundant, does it? Not in decimal notation. In octal notation, it does. The octal representation is 1111:1111. Very redundant.
So the information you describe depends on the encoding, the numbering system? But you could invent special encoding systems just for a single chunk of information, just to make a single representation very short.
But isn't the price for simplifying the algorithm to produce a special pattern a more complex programming language with built-in shortcuts? Well – on the one side I like to agree, but on the other side, we move the problem from describing the complexity of some information to the field of describing the complexity of software, which is another piece of information, so isn't this process tautological?
@Mark: I never spoke about intuition, did I? Au contraire! From intuition I would agree, and claim 0110:1111 to contain more information than 1111:1111, but I doubt my intuition.
The merits of the famous scientists make me think I could be wrong, yes, but they don’t show me how.
Sorry – ^that's my post; I didn't intend to post anonymously.
Re #77:
You just don’t give up.
Do you think that people like Kolmogorov and Chaitin were too dim to think of things like that?
You’re insisting that without understanding the theory at all, you’re qualified to make judgements about it. You’re throwing out “problems” that even the most shallow and trivial understanding of the theory would show are nonsense.
And yes, that’s intuition. Intuition is playing games with “but it seems like this case is wrong”, without understanding enough to actually work through what your objection actually means in terms of the theory. It just seems to you that your objections are valid. You’ve got no actual basis for that, beyond your intuition for what this stuff is about.
To give you an example of how shallow your objection is: In K-C theory, you start with a minimal universal computing machine. You don’t get to choose a machine after seeing the string. You have one minimal universal machine. Every string you consider, you consider relative to that one, fixed, machine.
One of the classic examples in K-C theory is that if you have a string S, whose length is |S|, and you can compress it to a shorter string S', then the information content of S isn't just |S'|, the length of the compressed string – it's the length of the compressed string plus the length of an uncompressor.
Your objection is nothing but a form of that. A machine which can emit a specific string with a single instruction is really nothing but a machine with a weak compression system embedded in it. You don’t get to include a compression algorithm in your machine for free; and you don’t get to choose a machine after you see the string.
The information content of a string isn’t something that you get to chose a machine to compute. You’re always working in a framework where you’ve got a fixed, simple, minimal machine. In general, we don’t even bother to specify the machine in detail, because for most of the interesting results of information theory, it doesn’t matter.
For example, one of the classic, simple results of information theory is that most strings are not compressible. Why?
Consider the set of binary strings of length N. There are 2^N strings of length N. Now imagine that we've got a compression system. Any compression system consists of two functions, C (the compressor) and D (the decompressor), such that for all x, D(C(x)) = x – in other words, D = C^-1. For that basic identity to work, the compression function has to be a 1:1 function – if it's not, it doesn't have an inverse. So, how many strings of length N can be compressed by just one bit, to length N-1? 2^(N-1).
So given a set of fixed length strings, only half of them can be compressed by even one bit. Half can’t be compressed at all.
But I said that most strings are non-compressible, right?
In real compression functions, we don't look at fixed-length strings. A compression function typically works on all inputs. So you've got a set of strings of length less than or equal to N – that is, about 2^(N+1) strings. For strings of length N-1, only half can, in theory, be compressed by one bit to length N-2. Of the ones that can't be compressed, we can no longer use their representations as compressed forms of strings of length N. Continue that down the line. The end result is that the vast majority of possible strings cannot be compressed by even one bit.
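If you want to check the arithmetic behind that, the counting takes two lines – the exact value of N doesn't matter, only the fact that there aren't enough shorter strings to go around:

    # There are 2^N binary strings of length N, but only 2^N - 1 binary strings
    # of length strictly less than N available to serve as compressed forms.
    N = 20
    print(2 ** N)                          # 1048576 strings of length N
    print(sum(2 ** k for k in range(N)))   # 1048575 strings of length < N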
That's a good explanation, but I would expect it to apply to your for-loop example too: you don't know in advance which string you'll have to compress.
Stefan: today's xkcd is just for you (# 675). Enjoy.