Today’s contribution on category theory is going to be short and sweet. It’s an example of why we really care about [natural transformations][nt]. Remember the trouble we went through working up to define [cartesian categories and cartesian closed categories][ccc]?
As a reminder: a [functor][functor] is a structure-preserving mapping between categories (functors are the morphisms of the category of small categories); a natural transformation is a structure-preserving mapping between functors (and natural transformations are the morphisms of the category of functors).
Since a natural transformation can be viewed as a kind of arrow, we can take the definitions of iso-, epi-, and mono-morphisms and apply them to natural transformations, giving us *natural isomorphisms*, *natural monomorphisms*, and *natural epimorphisms*.
Expressed this way, a cartesian category is a category C where:
1. C contains a terminal object t; and
2. (∀ a,b ∈ Obj(C)), C contains a product object a×b; and
3. a *natural isomorphism* Δ, which, for every object x ∈ Obj(C), maps each pair of arrows (x → a, x → b) to an arrow (x → a×b).
What this really says is: if we look at categorical products, then for a cartesian category, there’s a way of understanding the product as a pairing structure over arrows: a structure-preserving transformation between pairs of arrows (one into a, one into b) and single arrows into the product a×b.
The cartesian closed category is just the same exact trick using the exponential: A CCC is a category C where:
1. C is a cartesian category, and
2. (∀ a,b ∈ Obj(C)), C contains an object b<sup>a</sup>, and a natural isomorphism Λ, where (∀ y ∈ Obj(C)) Λ : (y×a → b) → (y → b<sup>a</sup>).
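To see what those two isomorphisms buy you in a concrete setting, here’s a minimal Python sketch in the category of sets and functions. The names pair, unpair, curry, and uncurry are purely illustrative, not anything standard: Δ corresponds to pairing two arrows into an arrow to the product, and Λ corresponds to currying.

    # A rough illustration of the two natural isomorphisms, using Python
    # functions as the arrows of the category of sets.

    def pair(f, g):
        """Delta: given f : x -> a and g : x -> b, build an arrow x -> a*b."""
        return lambda x: (f(x), g(x))

    def unpair(h):
        """Inverse of Delta: split h : x -> a*b back into (f, g)."""
        return (lambda x: h(x)[0], lambda x: h(x)[1])

    def curry(f):
        """Lambda: turn f : y*a -> b into an arrow y -> b^a."""
        return lambda y: lambda a: f((y, a))

    def uncurry(g):
        """Inverse of Lambda: turn g : y -> b^a back into y*a -> b."""
        return lambda ya: g(ya[0])(ya[1])

    # Round-tripping spot-checks that the two directions are mutually inverse:
    add = lambda p: p[0] + p[1]               # an arrow y*a -> b
    assert uncurry(curry(add))((3, 4)) == add((3, 4)) == 7

The assert only checks one point, but the bijection between the two kinds of arrows is exactly what the natural isomorphisms assert.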
Look at these definitions; then go back and look at the old definitions that we used without the new constructions of the natural transformation. That will let you see what all the work to define natural transformations buys us. Category theory is all about structure; with categories, functors, and natural transformations, we have the ability to talk about extremely sophisticated structures and transformations using a really simple, clean abstraction.
[functor]: http://scienceblogs.com/goodmath/2006/06/more_category_theory_getting_i.php
[nt]: http://scienceblogs.com/goodmath/2006/06/category_theory_natural_transf.php
[ccc]: http://scienceblogs.com/goodmath/2006/06/categories_products_exponentia_1.php
Arrow Equality and Pullbacks
We’re almost at the end of this run of category definitions. We need to get to the point of talking about something called a *pullback*. A pullback is a way of describing a kind of equivalence of arrows, which gets used a lot in things like interesting natural transformations. But, before we get to pullbacks, it helps to understand the equalizer of a pair of morphisms, which is a weaker notion of arrow equivalence.
We’ll start with sets and functions again to get an intuition; and then we’ll work our way back to categories and categorical equalizers.
Suppose we have two functions mapping from members of set A to members of set B.
f, g : A → B
Suppose that they have a non-empty intersection: that is, that there is some set of values x ∈ A for which f(x) = g(x). The set of values C from A on which f and g return the same result (*agree*) is called the *equalizer* of f and g. Obviously, C is a subset of A.
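In code, the set-based equalizer is nothing more than a filtered subset. Here’s a minimal Python sketch for a finite A (the particular sets and functions are made up for illustration):

    # The equalizer of f, g : A -> B is the subset of A where they agree.
    A = {0, 1, 2, 3, 4, 5}
    f = lambda x: x % 3
    g = lambda x: x % 2

    C = {x for x in A if f(x) == g(x)}   # the equalizer: here {0, 1}
    i = lambda x: x                      # the inclusion arrow i : C -> A
    assert all(f(i(x)) == g(i(x)) for x in C)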
Now, let’s look at the category theoretic version of that. We have *objects* A and B.
We have two arrows f, g : A → B. This is the category analogue of the setup of sets and functions from above. To get to the equalizer, we need to add an object C which is a *subobject* of A (which corresponds to the subset of A on which f and g agree in the set model).
The equalizer of f and g is the pair of the object C, and an arrow i : C → A. (That is, the object and arrow that define C as a subobject of A.) This object and arrow must satisfy the following conditions:
1. f º i = g º i
2. (∀ j : D → A) f º j = g º j ⇒ (∃! k : D → C) i º k = j.

That second one is the mouthful. What it says is: if I have any arrow j from some other object D to A on which f and g agree (that is, f º j = g º j), then there is exactly *one* arrow k from D to C that composes with i to give j. In other words, (C, i) is a *selector* for the arrows on which f and g agree; the only way to get an arrow into A that composes equally with f and g is to factor it through (C, i). Or in diagram form, k in the following diagram is necessarily unique:
There are a couple of interesting properties of equalizers that are worth mentioning. The morphism in an equalizer is *always* a monic arrow (a monomorphism); and if it’s also epic (an epimorphism), then it must be iso (an isomorphism).
The pullback is very nearly the same construction as the equalizer we just looked at; except it’s abstracting one step further.
Suppose we have two arrows pointing to the same target, f : B → A and g : C → A. Then the pullback of f and g is the triple of an object and two arrows (B×<sub>A</sub>C, p : B×<sub>A</sub>C → B, q : B×<sub>A</sub>C → C). The elements of this triple must meet the following requirements:

1. f º p = g º q
2. (f º p) : B×<sub>A</sub>C → A
3. For every triple (D, h : D → B, k : D → C) with f º h = g º k, there is exactly one arrow u : D → B×<sub>A</sub>C where p º u = h, and q º u = k.
As happens so frequently in category theory, this is clearer using a diagram.
If you look at this, you should definitely be able to see how this corresponds to the categorical equalizer. If you’re careful and clever, you can also see the resemblance to categorical product (which is why we use the ×<sub>A</sub> syntax). It’s a general construction that says that f and g are equivalent with respect to the product-like object B×<sub>A</sub>C.
Here’s the neat thing. Work backwards through this abstraction process to figure out what this construction means if objects are sets and arrows are functions: what’s the pullback of f : B → A and g : C → A?

{ (x,y) ∈ B × C : f(x) = g(y) }
Right back where we started, almost. The pullback is an equalizer; working it back shows that.
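As a sanity check, here’s that set-theoretic pullback as a minimal Python sketch, with f : B → A and g : C → A as in the setup above (the particular sets and functions are made up for illustration):

    # The pullback of f : B -> A and g : C -> A in the category of sets.
    B = {1, 2, 3, 4}
    C = {10, 20, 30}
    f = lambda x: x % 2           # lands in A = {0, 1}
    g = lambda y: (y // 10) % 2   # also lands in A = {0, 1}

    pullback = {(x, y) for x in B for y in C if f(x) == g(y)}
    p = lambda e: e[0]            # projection to B
    q = lambda e: e[1]            # projection to C

    # The defining square commutes: f o p = g o q on every element.
    assert all(f(p(e)) == g(q(e)) for e in pullback)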
Categories and SubThings
What’s a subset? That’s easy: if we have two sets A and B, A is a subset of B if every member of A is also a member of B.
What’s a subgroup? If we have two groups A and B, and the values in group A are a subset of the values in group B (with A’s group operation agreeing with B’s on those values), then A is a subgroup of B.
For any kind of thing **X**, what does it mean to be a sub-X? Category theory gives us a way of answering that in a generic way. It’s a bit hard to grasp at first, so let’s start by looking at the basic construction in terms of sets and subsets.
The most generic way of defining subsets is using functions. Suppose we have a set, A. How can we define all of the subsets of A, *in terms of functions*?
We can take the set of all *injective* functions to A (an injective function from X to Y is a function that maps each member of X to a unique member of Y). Let’s call that set of injective functions **Inj**(A). Now, we can define equivalence classes over **Inj**(A), where two functions f : X → A and g : Y → A are equivalent if there is an isomorphism i between X and Y with g º i = f.

The domain of each function in one of the equivalence classes in **Inj**(A) is isomorphic to a subset of A. So each equivalence class of injective functions defines a subset of A.
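Here’s a small, purely illustrative Python sketch of that correspondence for finite sets: every injective function into A carves out a subset of A (its image), and two injections with isomorphic domains and the same image represent the same subset.

    A = {'a', 'b', 'c', 'd'}

    def is_injective(fn, domain):
        # A function is injective if no two domain elements share an image.
        return len({fn(x) for x in domain}) == len(domain)

    X = {1, 2}
    f = {1: 'a', 2: 'c'}.get         # f : X -> A
    Y = {'p', 'q'}
    g = {'p': 'c', 'q': 'a'}.get     # g : Y -> A

    assert is_injective(f, X) and is_injective(g, Y)
    # f and g have isomorphic domains and the same image, so they pick out
    # the same subset of A; they belong to the same equivalence class.
    assert {f(x) for x in X} == {g(y) for y in Y} == {'a', 'c'}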
We can generalize that function-based definition to categories, so that it can define a sub-object of any kind of object that can be represented in a category.
Before we jump in, let me review one important definition from before; the monomorphism, or monic arrow.
>A *monic arrow* is an arrow f : a → b such that (∀ g<sub>1</sub>, g<sub>2</sub> : x → a) f º g<sub>1</sub> = f º g<sub>2</sub> ⇒ g<sub>1</sub> = g<sub>2</sub>. (That is, if two arrows composed with f produce the same composite, then they must be the same arrow.)
The monic arrow is the category theoretic version of an injective function.
Suppose we have a category C, and an object a ∈ Obj(C).
If there are two monic arrows f : x → a and g : y → a, and there is an arrow h such that g º h = f, then we say f ≤ g (read “f factors through g”). Now, we can take that “≤” relation, and use it to define an equivalence relation on monic arrows: f ≡ g ⇔ f ≤ g ∧ g ≤ f.

What we wind up with using that equivalence relation is a set of equivalence classes of monomorphisms pointing at a. Each of those equivalence classes of morphisms defines a subobject of a. (Within an equivalence class, the sources of the arrows are all isomorphic, so they are equivalent with respect to this relation.) A subobject of a is the source of an arrow in one of those equivalence classes.
Categories: Products, Exponentials, and the Cartesian Closed Categories
Before I dive into the depths of today’s post, I want to clarify something. Last time, I defined categorical products. Alas, I neglected to mention one important point, which led to a bit of confusion in the comments, so I’ll restate the important omission here.
The definition of categorical product defines what the product looks like *if it’s in the category*. There is no requirement that a category include the products for all, or indeed for any, of its members. Categories are *not closed* with respect to categorical product.
That point leads up to the main topic of this post. There’s a special group of categories called the **Cartesian Closed Categories** that *are* closed with respect to product; and they’re a very important group of categories indeed.
However, before we can talk about the CCCs, we need to build up a bit more.
Cartesian Categories
——————–
A *cartesian category* C (note **not** cartesian *closed* category) is a category:
1. With a terminal object t, and
2. ∀ a, b ∈ Obj(C), the objects and arrows of the categorical product a×b are in C.
So, a cartesian category is a category closed with respect to product. Many of the common categories are cartesian: the category of sets, the category of enumerable sets, and so on. And of course, the meaning of the categorical product in Set? The cartesian product of sets.
Categorical Exponentials
————————
Like categorical product, the value of a categorical exponential is not *required* to be included in a category. Given two objects x and y from a category C, their *categorical exponential* x<sup>y</sup>, if it exists in the category, is defined by a set of values:

* An object x<sup>y</sup>,
* An arrow eval<sub>y,x</sub> : x<sup>y</sup>×y → x, called an *evaluation map*.
* ∀ z ∈ Obj(C), an operation Λ<sub>C</sub> : (z×y → x) → (z → x<sup>y</sup>). (That is, an operation mapping from arrows to arrows.)

These values must have the following properties:

1. ∀ f : z×y → x : eval<sub>y,x</sub> º (Λ<sub>C</sub>(f)×1<sub>y</sub>) = f
2. ∀ g : z → x<sup>y</sup> : Λ<sub>C</sub>(eval<sub>y,x</sub> º (g×1<sub>y</sub>)) = g
To make that a bit easier to understand, let’s turn it into a diagram.
You can also think of it as a generalization of a function space. x<sup>y</sup> is the set of all functions from y to x. The evaluation map is a simple description, in categorical terms, of an operation that applies a function from a to b (an arrow) to a value from a, resulting in a value from b.
(I added the following section after this was initially posted; a commenter asked a question, and I realized that I hadn’t explained enough here, so I’ve added the explanation.)
So what does the categorical exponential mean? I think it’s easiest to explain in terms of sets and functions first, and then just step it back to the more general case of objects and arrows.
If X and Y are sets, then X<sup>Y</sup> is the set of functions from Y to X.
Now, look at the diagram:
* The top part says, basically, that g is a function from Z to X<sup>Y</sup>: so g takes a member of Z, and uses it to select a function from Y to X.
* The vertical arrow says:
  * given the pair (z,y), f maps (z,y) to a value in X.
  * given a pair (z,y), we’re going through a function. It’s almost like currying:
    * The vertical arrow going down is basically taking f(z,y), and currying it to g(z)(y).
    * Per the top part of the diagram, g(z) selects a function from Y to X. (That is, a member of X<sup>Y</sup>.)
    * So, at the end of the vertical arrow, we have a pair (g(z), y).
* The “eval” arrow maps from the pair of a function and a value to the result of applying the function to the value.
Now – the abstraction step is actually kind of easy: all we’re doing is saying that there is a structure of mappings from object to object here. This particular structure has the essential properties of what it means to apply a function to a value. The internal values and precise meanings of the arrows connecting the values can end up being different things, but no matter what, it will come down to something very much like function application.
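Here’s that structure as a minimal Python sketch in the category of sets and functions. The names ev and curry are just for illustration: ev is the evaluation map, curry plays the role of Λ, and the assert spot-checks that the diagram commutes.

    def ev(pair):
        """The evaluation map: given (h, y) with h a function Y -> X, apply it."""
        h, y = pair
        return h(y)

    def curry(f):
        """Lambda: turn f : Z*Y -> X into g : Z -> (Y -> X)."""
        return lambda z: lambda y: f((z, y))

    f = lambda zy: zy[0] * 10 + zy[1]   # some function Z*Y -> X
    g = curry(f)

    # eval o (curry(f) x 1_Y) = f, spot-checked at (z, y) = (3, 7):
    z, y = 3, 7
    assert ev((g(z), y)) == f((z, y)) == 37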
Cartesian Closed Categories
—————————-
With exponentials and products, we’re ready for the cartesian closed categories (CCCs):
A Cartesian closed category is a category that is closed with respect to both products and exponentials.
Why do we care? Well, the CCCs are in a pretty deep sense equivalent to the simply typed lambda calculus. That means that the CCCs are deeply tied to the fundamental nature of computation. The *structure* of the CCCs – with its closure WRT product and exponential – is an expression of the basic capability of an effective computing system.
We’re getting close to being able to get to some really interesting things. Probably one more set of definitions; and then we’ll be able to do things like show a really cool version of the classic halting problem proof.
A Great Quote about Methods
On my way home from picking up my kids from school, I heard a story on NPR that included a line from one of the editors of the New England Journal of Medicine, which I thought was worth repeating here.
They were discussing an article in this month’s NEJM about [vioxx/rofecoxib][nejm]. The article is a correction to an earlier NEJM article that concluded that the cardiac risks of Vioxx were not really significant until around 18 months of continued use. With more data available, it appears that the 18-month threshold was just an artifact, not a real phenomenon, and the data appears to show that the cardiac risks of Vioxx start very quickly.
[nejm]: http://content.nejm.org/cgi/reprint/NEJMc066260v1.pdf
As usual for something like this, the authors and the corporate sponsors of the work are all consulted in the publication of a correction or retraction. But in this case, the drug maker objected to the correction, because they wanted the correction to include an additional analysis, which appears to show an 18-month threshold for cardiac risk.
The editor’s response, paraphrased (since I heard it on the radio and don’t have the exact quote):
“That’s not how you do science. You don’t get all of the data, and then
look for an analysis that produces the results you want. You agree on the
analysis you’re going to do *before* you start the study, and then you use
the analysis that you said you were going to use.”
Hey [Geiers][geier], you listening?
[geier]: http://goodmath.blogspot.com/2006/03/math-slop-autism-and-mercury.html
Validity of Mathematical Modeling
In the comments to one of my earlier [Dembski][demsbki-info-theory] posts, I was responding to a comment by a poster, and realized that what we were discussing was probably interesting enough to promote up to a front-page post.
A poster named [Garote][garote] said:
>I think PaulC said exactly what needs to be said, that grants us all
>hearty permission to ignore Dembski’s work. I think it bears repeating:
>
>”If I am betting on coin flips and notice any kind of algorithmic pattern then
>there are legitimate reasons to conclude that the person flipping coins is not
>playing fair. However, this method does not allow one to discriminate between
>the product of evolution and the product of “intelligent design” since neither
>of these processes in any way resemble random coin flips.”
>
>Dembski’s feverishly comparing apples to oranges, to prove that one is an
>artichoke.
Now, my contempt for Dembski is pretty clear. But this argument against him struck me as making a mistake – and it’s a mistake that strikes at something near-and-dear to my interests: mathematical modeling. Modeling can be sorely abused, [as I’ve discussed before][modeling]; but I don’t want to throw out the baby with the bathwater.
My whole viewpoint on mathematics is that much of its value comes from **abstraction**: the ability to look at something complex; to identify a set of relevant properties for understanding some feature of it; and to use that to create a simpler mathematical model of the complex system based on its fundamental principles. That kind of abstraction isn’t just useful for diddling around in the world of pure math, but for much of how math is used in science. There are a lot of interesting things that we can’t grasp in their entirety; but that we can understand by breaking them down through abstraction into a set of simpler facets that we can study.
I don’t want to throw away the validity of the concept of abstraction and mathematical modeling; but one way of reading that comment is as an argument against the concept of working with a simplified abstract model to understand something complex. (Note that I’m not saying that that’s what Garote meant; but I thought that focusing on this could produce an interesting discussion.)
In Dembski’s favor, I can see the “coin-flip” thing is an abstraction of a complex process to something simple that contains a relevant feature. The thing is, there are things in nature that produce highly random results; and there are other things that produce interesting patterns. Recognizing that there’s something different about processes that produce patterns could be an interesting thing; and recognizing the distinction between randomness and pattern is an interesting subject which probability theorists have spent time studying; it’s right at the intersection between probability theory and information theory, and it’s produced some interesting insights about what information is and what it means.
The problem WRT Dembski isn’t that he’s reducing a problem to a mathematical model through abstraction and then analyzing the model. The problem is that when you build a mathematical model, you *need to make sure that your abstraction captures all of the relevant features of what you’re modeling*. And when you draw conclusions based on the analysis of the mathematical model, *you need to make sure that your abstraction isn’t the source of the conclusion*. Both of those are different ways of stating the fundamental rule of mathematical modeling:
**Mathematical models must be validated against the thing that they model.**
Dembski’s reduction of the process of analyzing certain features of the universe to a metaphor involving recognition of patterns in coin flips is fine with me. The problem is that he creates a model that *omits important elements* of the real world, and then insists that the *conclusions drawn from his abstract model are applicable to the real world*. But in fact, his conclusions are the result of properties of his model, *not* of the real world. The validation of his model fails: the specific conclusions are the result of eliminating things from his model that might permit a different conclusion.
To continue with the example of the coin flip metaphor: if you’re studying patterns, you could abstract the problem of observing patterns to one of observing flipping coins in an idealized universe, where the coin is perfectly even, and nothing about the surface that it lands on can affect the outcome of a flip. That might be a useful model for trying to understand the difference between random sequences of flips, and patterns in sequences.
If you then try to use that model to determine the *cause* of a pattern, you’ll conclude that the only possible cause of a pattern *is the actions of the coin flipper*. If you’ve abstracted away everything that could influence the outcome of the coin-flip except the influence of the coin-flipper, then *in your model*, concluding that the flipper is the only possible source of a pattern could be a reasonable conclusion.
That doesn’t mean that you can then say that in the real world, the *only possible cause* of a pattern in a sequence of coin-flips is some action taken by the coin-flipper. You can’t apply that simplified abstraction of your model to the real world without showing that it’s a valid model with respect to the properties that affected your conclusion.
The ultra-simple coin flip model isn’t valid for the real world, because it deliberately omits factors of the real world that could influence the conclusion. In the real world, there are unfair coins (both deliberately unfair coins, and coins that have become unfair either through minting flaws or through wear and tear). There are magnetic fields. There are irregular surfaces.
The problem with many of the conclusions that bad mathematicians draw from mathematical models is that *the models don’t represent what they claim to represent*. They abstract away relevant features of reality, and then demand that reality doesn’t possess those features, because they aren’t in the model.
[demsbki-info-theory]: http://scienceblogs.com/goodmath/2006/06/dembskis_profound_lack_of_comp.php
[garote]: http://garote.bdmonkeys.net/
[modeling]: http://goodmath.blogspot.com/2006/03/on-how-not-to-apply-mathematical.html
An Introduction to Information Theory (updated from Blogspot)
Back when I first started this blog on blogspot, one of the first things I did was write an introduction to information theory. It’s not super deep, but it was a decent overview – enough, I thought, to give people a good idea of what it’s about, and to help understand why the common citations of it are almost always misusing its terms to create bizarre conclusions, like some [“law of conservation of information”][conservation].
[conservation]: http://en.wikipedia.org/wiki/Law_of_conservation_of_information “wikipedia on Dembski’s law of CoI”
This is a slight revision of that introduction. For the most part, I’m just going to clean up the formatting, but once I’m going through it, I’ll probably end up doing some other polishing.
****
So what is information theory?
================================
Information theory is a relatively new field of mathematics that tries to characterize what information is in a quantifiable way. It’s an area of math that was almost totally obscure until not too long ago, and one which is frequently misunderstood, even by people who mean well.
A bit of history
—————–
Modern information theory comes from two very different sources: Shannon’s information theory, and Kolmogorov/Chaitin algorithmic information theory. Fortunately, they both converge, mostly, on the same place.
### Shannon’s Information Theory
The more famous branch of IT is Claude Shannon’s information theory, as presented in his famous paper, [Communication in the Presence of Noise][shannon]. Shannon’s interest came from his work at Bell Labs for AT&T, which wanted a way of being able to figure out how much wire they needed to lay for the telephone network.
AT&T was interested because, for a telephone company, laying wire is an incredibly expensive operation. The wire itself (especially back in the days when it was bundles of copper) is quite expensive; the process of running and burying the cables is even more expensive. So ideally, as a phone company, you want to run the cables once; you want to lay enough that you’ll never have to come back and do it again; and you want to lay the smallest amount that will do the job, since the materials are so expensive.
The problem that they wanted Shannon to help solve was: how could they determine how much wire they actually needed to lay? It’s actually an astonishingly tricky question. The less they laid, the less it cost them in the short term. But they *knew* that the number of phone lines was going to increase dramatically – so if they were cheap, and laid too little wire, when the number of phones exceeded the capacity of what they had already laid, they’d have to go back and lay more. So they wanted to be able to make a good prediction of the minimum amount of wire they could lay that would meet their needs both at the time it was laid, and in the future.
But there’s a big trick to that. First: how much information can a single wire carry? And when you bundle a bunch of wires together, how much can you pump over a single wire without it interfering with signals on its neighbors in the bundle?
That’s what Shannon was trying to understand. He was trying to find a mathematical model that he could use to describe information transmission, including the effects of imperfections in transmission, including the introduction of noise and interference. He wanted to be able to quantify information in a meaningful way that would allow him to ask questions like: “At what point does noise in the line start to eliminate the information I want to transmit?”.
The answer is just fascinating: a given communication channel has a capacity for carrying information; and a given message has a quantity of information in it. Adding noise to the signal *adds* information to the message *until* the capacity of the channel is reached, at which point information from the noise will start to *replace* information from the message; at that point, you can say that information from the message will start to be lost.

Shannon called the information content of a signal its *entropy*, because he saw a similarity between his information entropy and thermodynamic entropy: in a communicating system, entropy *never* decreases: it increases until the capacity of the channel is reached, and then it stays constant. You can’t reduce the amount of information in a message. (There are various rumors that this choice of terminology was actually suggested by von Neumann himself.)
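To make the “quantity of information” part concrete: Shannon’s entropy of a source is the average surprise per symbol, H = Σ p<sub>i</sub> log<sub>2</sub>(1/p<sub>i</sub>), measured in bits. Here’s a minimal Python sketch of that standard formula (the example strings are arbitrary):

    import math
    from collections import Counter

    def shannon_entropy(message):
        """Estimate bits per symbol from the symbol frequencies in message."""
        counts = Counter(message)
        total = len(message)
        return sum((n / total) * math.log2(total / n) for n in counts.values())

    print(shannon_entropy("AAAAAAAA"))   # 0.0: completely predictable
    print(shannon_entropy("ABABABAB"))   # 1.0 bit per symbol
    print(shannon_entropy("AABBCCDD"))   # 2.0 bits per symbol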
That naming led directly to the most common misuse of information theory. But that’s a post for another day.
### Algorithmic Information Theory
The other main branch of information theory came from a very different direction. The two pioneers were Andrey Kolmogorov, and Greg Chaitin. Kolmogorov and Chaitin didn’t know each other at the time they independently invented the same mathematical formalization (in fact, Chaitin was a teenager at the time!), a fact which led to some friction.
In the interests of honesty, I’ll say that Greg Chaitin works in the same building that I do, and I’ve met him a number of times, and been greatly impressed by him, so my description is heavily influenced by Greg.
Anyway – algorithmic information theory grew out of some of the fundamental mathematics being done at the beginning of the 20th century. There was an effort to create a single, fundamental, set of mathematical axioms and inference rules, which was both complete (every true statement was provably true), and consistent (no statement could ever be proved both true and false). This effort failed, miserably; the nail in the coffin was Gödel’s incompleteness theorem. Gödel presented an extremely complicated proof that showed, essentially, that no sufficiently powerful formal system could be both complete and consistent. (See my recent article about [Extreme Math][extreme] for more about this.)
[extreme]: http://scienceblogs.com/goodmath/2006/06/extreme_math_1_1_2.php “1+1=2”
Most people saw Gödel’s proof, did the equivalent of saying “Oh, shit!”, and then pretended that they hadn’t seen it. It does mean that there are fundamental limits to what you can do using mathematical formalization, but for the most part, they don’t affect you in normal day to day math. (Sort of the way that car designers didn’t change the way they build cars because of relativity. Yes, it showed that the fundamental physical model that they were using was wrong – but at the speeds that cars move, that wrongness is so small that it just doesn’t matter.)
But for people in the early stages of what would become computer science, this was a big deal. One of the early hopes was that the mathematical system would produce a “decision procedure”, a mechanical process by which a computing device could check the truth or falsehood of any mathematical statement. Gödel showed that it couldn’t be done.
But the early computer scientists – in particular, Alan Turing – embraced it. It led directly to two of the fundamental rules of computer science:
1. The Church-Turing Thesis: all mechanical computing systems are basically the same: there is a fundamental limit to what is computable, and any “acceptable” system can compute anything up to that limit – so if you can find a way of doing it on any computing device, then you can, ultimately, do it on every acceptable computing device.
2. The Halting Problem: there are some things that you cannot program a device to do. In particular, you cannot write a program for any computing device that examines another program, and tells you if that other program will eventually stop.
The halting problem turns out to say *exactly* the same thing as the Gödel incompleteness theorem. Only you can write a proof of it that a freshman college student can understand in ten minutes, instead of a proof that the average math grad student will need a semester to plow through.
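Here’s that freshman-friendly proof as a Python sketch. The halts function below is hypothetical; that’s the whole point, since no real implementation of it can exist:

    def halts(program, argument):
        """Hypothetical oracle: True iff program(argument) eventually stops."""
        raise NotImplementedError("no such oracle can exist")

    def paradox(program):
        # Do the opposite of whatever the oracle predicts the program does
        # when run on its own source.
        if halts(program, program):
            while True:       # oracle says it halts, so loop forever
                pass
        else:
            return            # oracle says it loops, so halt immediately

    # Now ask: does paradox(paradox) halt?  If the oracle answers True,
    # paradox(paradox) loops forever; if it answers False, it halts.
    # Either way the oracle is wrong, so the oracle cannot exist.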
Chaitin and Kolmogorov saw this as a big deal: using an algorithmic approach to how to process information, you could prove something very simply, which would require a much more complicated proof using other methods.
K-C information theory grew out of this. According to Kolmogorov and Chaitin, the fundamental amount of information contained in a string (called the string’s information entropy, after Shannon) is the length of the shortest pairing of program and data, for a computing device, that can generate that string.
(If you’re interested in Gödel’s incompleteness theorem, then Hofstadter’s [Godel, Escher, Bach: an Eternal Golden Braid][geb] is a really fun introduction. For K-C theory, Greg Chaitin has written a bunch of really [great][chaitin1] [books][chaitin2] for mathematical laymen.)
[geb]: http://www.amazon.com/gp/search/ref=br_ss_hs/002-3998406-1014458?search-alias=aps&keywords=godel%20escher%20bach “GEB amazon link”
[chaitin1]: http://www.amazon.com/gp/product/9814021725/sr=8-4/qid=1141849782/ref=pd_bbs_4/002-3998406-1014458?%5Fencoding=UTF8
[chaitin2]: http://www.amazon.com/gp/product/1852336684/sr=8-3/qid=1141849782/ref=pd_bbs_3/002-3998406-1014458?%5Fencoding=UTF8
Information and Entropy
————————-
Now that I’ve got a bit of background and origins done, it’s time to get to some of the fun stuff.
As I said yesterday, both of the big branches of information theory have a concept of *entropy*. While the exact presentations differ, because they’re built on somewhat different theoretical foundations, the basic meaning of entropy in both branches is pretty much the same. I’m going to stick with the terminology of the Kolmogorov-Chaitin branch, but the basic idea is the same in either.
In K-C theory, we talk about the information content of *strings*. A string is an ordered sequence of characters from a fixed alphabet. It doesn’t really matter what alphabet you choose; the point of this kind of theory isn’t to come up with a specific number describing the complexity of a string, but to be able to talk about what information means in a formal algorithmic sense.
The K-C definition of the entropy of a string is the total quantity of information encoded in that string.
Here’s where it gets fun.
Another way of saying exactly the same thing is that the entropy of a string is a measure of the *randomness* of the string.
Structured strings are structured *precisely* because they contain *redundancy* – and redundancy does not add information.
Which string has more information: “XXXXYYYYY” or “4X5Y”? They have the same amount. One is a lot smaller than the other. But they contain the same amount of information.
Here’s a third way of saying exactly the same thing: the entropy of a string is the length of the smallest compressed string that can be decompressed into the original.
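You can watch that effect with any off-the-shelf compressor. A quick Python sketch using zlib (exact byte counts will vary with the compressor and its settings, but the gap won’t):

    import os
    import zlib

    structured = b"XY" * 500        # 1000 bytes of pure redundancy
    random_ish = os.urandom(1000)   # 1000 bytes with essentially no structure

    print(len(zlib.compress(structured)))   # tiny: the redundancy compresses away
    print(len(zlib.compress(random_ish)))   # close to 1000: nothing to squeeze out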
Here’s a better example. Go back and look at the first two paragraphs of yesterday’s post. It took 514 characters.
Here’s the same information, compressed (using gzip) and then made
readable using a unix utility called uuencode:
M’XL(“)E8$$0“VIU;FL`19$Q=JPP#$7[6≤7KIN’,`M+]≤HHLPH“)8BMOPY
M[#Y/GCDG#
MLY2EI8$9H5GLX=*R(_+ZP/,-5-1#HRJNT`77@LL,MZD,”H7LSUKDW3@$#V2
MH(KTO$Z^$%CN1Z>3L*J^?6ZW?^Y2+10;+SOO’OC”/;7T5QA%987SY02I3I$
MLKW”W,VZ-(J$E”[$;’2KYI^-_L./3BW.+WF3XE8)≤@D8X^U59DQ7IA*F+X/
MM?I!RJ*%FE%])Z+EXE+LSN*,P$YNX5/P,OCVG;IK=5_K&CK6J7%’+5M,R&J]
M95*W6O5EI%G^K)8B/XV#L=:5_`=5ELP#Y#UJ??>[[DI=J”*2D];K_F230″
$`@(““`
`
That’s only 465 characters; and if I didn’t have to encode it to stop
it from crashing your web-browser, it would only have been 319
characters. Those characters are a lot more random than the original –
so they have a higher information density. The closer we get
to the shortest possible compressed length, the higher the information
density gets.
Actually, I’m lying slightly; there’s an additional factor that needs to be considered in K-C theory.
Remember that K-C comes from the foundational mathematics of computer science, something called recursive function theory. We’re talking about strings being processed algorithmically by a program. So really, we get to talk about programs that generate strings. So it’s not just strings: you have a program and a data string. Really, you measure information content of a string as the total size of the smallest pairing of program and data that generates that string. But for most purposes, that doesn’t make that much difference: the combination of the program and the data can be encoded into a string.
In the two examples above, the program to follow is implied by the contents of the string; if we wanted to really model it precisely, we’d need to include the programs:
* For the string “XXXXYYYYY”, the program is roughly:
while there are more characters in the data:
read a character
output the character that you read
end
* For the string “4X5Y”, the program is roughly:
while there are more characters in the data:
read a number n from the data
read a character c from the data
output c n times
end
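If you’d like something runnable, here’s a rough Python rendering of those two program-plus-data pairs (single-digit counts only, to keep it as simple as the example); both produce the same string, which is exactly the point:

    def literal(data):
        """The 'XXXXYYYYY' program: echo each character of the data."""
        return "".join(ch for ch in data)

    def run_length(data):
        """The '4X5Y' program: read a count, read a character, repeat it."""
        out = []
        chars = iter(data)
        for count in chars:
            ch = next(chars)
            out.append(ch * int(count))
        return "".join(out)

    assert literal("XXXXYYYYY") == run_length("4X5Y") == "XXXXYYYYY"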
Now, specifics aside: the definitions I gave earlier are pretty much correct in both K-C and Shannon information theory: information content is proportional to randomness is proportional to minimum compressed length.
So – what happens if I take a string in very non-compressed format (like, say, an uncompressed digitized voice) and send it over a noisy phone line? Am I gaining information, or losing information?
The answer is: *gaining* information. Introducing randomness into the string is *adding* information.
“AAAABBBB” contains less information than “AAAABxBB”.
The way this is used to refute bozos who claim things like “You can’t create information” should be obvious.
Category Theories: Some definitions to build on
Sorry, but I actually jumped the gun a bit on Yoneda’s lemma.
As I’ve mentioned, one of the things that I don’t like about category theory is how definition-heavy it is. So I’ve been trying to minimize the number of definitions at any time, and show interesting results of using the techniques of category theory as soon as I can.
Well, there are some important definitions which I haven’t talked about yet. And for Yoneda’s lemma to make sense, you really need to see some more examples of how categories express structure. And to be able to show how category theory lets you talk about structure, we need to plough our way through a few more definitions.
Initial Objects
Suppose we have a category, C. An object o ∈ Obj(C) is called an initial object in C iff (∀ b ∈ Obj(C)), (∃! f : o → b ∈ Arrows(C)).

In English: an object o is initial in C if there’s exactly one arrow from o to each other object in C. We often write “0<sub>C</sub>” for the initial object of a category C, or just “0” if it’s obvious what category we’re talking about.

There’s a dual notion of a terminal object: an object is terminal if there’s exactly one arrow from every object in the category to it. Terminal objects are written “1<sub>C</sub>” or just “1”.

Given two objects in a category, if they’re both initial, they must be isomorphic. It’s pretty easy to prove; here’s the sketch. Remember the definition of isomorphism in category theory. An isomorphism is an arrow f : a → b, where (∃ g : b → a) such that f º g = 1<sub>b</sub> and g º f = 1<sub>a</sub>. If an object is initial, then there’s an arrow from it to every other object. Including the other initial object. And there’s an arrow back, because the other one is initial. The two arrows between the two initial objects compose to arrows from an initial object to itself; but the only arrow from an initial object to itself is its identity, so the compositions are identities, and the two arrows form an isomorphism.
Categorical Products
The product of two objects is a generalization of the cartesian product of two sets. It’s important because products are one of the major ways of building complex structures using simple categories.
Given a category C, and two objects a,b ∈ Obj(C), the categorical product a×b consists of:

- An object p, often written a×b;
- two arrows p<sub>a</sub> and p<sub>b</sub>, where p ∈ Obj(C), p<sub>a</sub> : p → a, and p<sub>b</sub> : p → b.
- a “pairing” operation, which for every object c ∈ Obj(C), maps the pair of arrows f : c → a and g : c → b to an arrow Pair<sub>c</sub>(f,g) : c → a×b. Pair<sub>c</sub>(f,g) is often written <f,g><sub>c</sub>.

Pair<sub>c</sub> must have three properties.

- p<sub>a</sub> º Pair<sub>c</sub>(f,g) = f.
- p<sub>b</sub> º Pair<sub>c</sub>(f,g) = g.
- (∀ h : c → a×b) Pair<sub>c</sub>(p<sub>a</sub> º h, p<sub>b</sub> º h) = h.
The first two of those properties are the separation arrows, to get from the product to its components; and the third is the merging arrow, to get from the components to the product. We can say the same thing about the relationships in the product in an easier way using a commutative diagram:
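To ground that in the category of sets, here’s a minimal Python sketch (finite sets, with purely illustrative names): the product object is the set of pairs, the two projections pick out the components, and the pairing operation satisfies the laws above.

    a = {1, 2}
    b = {'x', 'y'}
    product = {(m, n) for m in a for n in b}    # the object a*b

    p_a = lambda pr: pr[0]                      # projection a*b -> a
    p_b = lambda pr: pr[1]                      # projection a*b -> b

    def pairing(f, g):
        """Pair_c: combine f : c -> a and g : c -> b into an arrow c -> a*b."""
        return lambda x: (f(x), g(x))

    # Spot-check the first two laws for one choice of c, f, and g.
    c = {10, 20, 30}
    f = lambda x: 1 if x < 25 else 2            # c -> a
    g = lambda x: 'x' if x % 20 == 0 else 'y'   # c -> b
    h = pairing(f, g)
    assert all(h(x) in product for x in c)
    assert all(p_a(h(x)) == f(x) and p_b(h(x)) == g(x) for x in c)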
Notices of the AMS special issue on Kurt Gödel
Harald Hanche-Olsen, in the comments on my earlier post about the Principia Mathematica, has pointed out that this month’s issue of the Notices of the American Mathematical Society is a special issue in honor of the 100th anniversary of Kurt Gödel’s birth. The entire issue is available for free online.

I haven’t read much of the journal yet; but Martin Davis’s article The Incompleteness Theorem is a really great overview of the theorem and the proof, how it works, and what it means.
Category Theory: Natural Transformations and Structure
The thing that I think is most interesting about category theory is that what it’s really fundamentally about is structure. The abstractions of category theory let you talk about structures in an elegant way; and category diagrams let you illustrate structures in a simple visual way. Morphisms express the structure of a category; functors are higher level morphisms that express the structure of relationships between categories.
In my last category theory post, one of the things I mentioned was how category theory lets you explain the idea of symmetry and group actions – which are a kind of structural immunity to transformation, and a definition of transformation – in a much simpler way than it could be talked about without categories.
It turns out that symmetry transformations are just the tip of the iceberg of the kinds of structural things we can talk about using categories. In fact, as I alluded to in my last post, if we create a category of categories, we end up with functors as arrows between categories.
What happens if we take the same kind of thing that we did to get group actions, and pull it out a level? There, instead of looking at the whole category of categories, we focused on arrows from the specific category of a group to the category of sets; what if we do the same thing with arrows between members of the category of functors?
We get the general concept of a natural transformation. A natural transformation is a morphism from functor to functor, which preserves the full structure of morphism composition within the categories mapped by the functors.
Suppose we have two categories, C and D. And suppose we also have two functors, F, G : C → D. A natural transformation from F to G, which we’ll call η, maps every object x in C to an arrow η<sub>x</sub> : F(x) → G(x). η<sub>x</sub> has the property that for every arrow a : x → y in C, η<sub>y</sub> º F(a) = G(a) º η<sub>x</sub>. If this is true, we call η<sub>x</sub> the component of η for (or at) x.
That paragraph is a bit of a whopper to interpret. Fortunately, we can draw a diagram to help illustrate what that means. The following diagram commutes if η has the property described in that paragraph.
I think this is one of the places where the diagrams really help. We’re talking about a relatively straightforward property here, but it’s very confusing to write about in equational form. But given the commutative diagram, you can see that it’s not so hard: the path η<sub>y</sub> º F(a) and the path G(a) º η<sub>x</sub> compose to the same thing: that is, the transformation η hasn’t changed the structure expressed by the morphisms.
And that’s precisely the point of the natural transformation: it’s a way of showing the relationships between different descriptions of structures – just the next step up the ladder. The basic morphisms of a category express the structure of the category; functors express the structure of relationships between categories; and natural transformations express the structure of relationships between relationships.
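One tiny example, phrased informally in terms of Python types and functions rather than a formal category (the function names are just illustrative): “take the first element, if there is one” is a natural transformation from the list functor to the maybe/optional functor, because it commutes with mapping any function over either side.

    def safe_head(xs):
        """The component of the transformation at a type X: List(X) -> Maybe(X)."""
        return xs[0] if xs else None

    def map_list(a, xs):      # the list functor's action on an arrow a : X -> Y
        return [a(x) for x in xs]

    def map_maybe(a, m):      # the maybe functor's action on the same arrow
        return None if m is None else a(m)

    a = lambda n: n * n       # an arbitrary arrow X -> Y
    for xs in ([], [3, 1, 4]):
        # Naturality: eta_Y o F(a) == G(a) o eta_X
        assert safe_head(map_list(a, xs)) == map_maybe(a, safe_head(xs))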
Coming soon: a few examples of natural transformation, and then… Yoneda’s lemma. Yoneda’s lemma takes the idea we mentioned before of a group being representable by a category with one object, and generalizes all the way up from the level of a single category to the level of natural transformations.