Category Archives: Good Math

Combining Non-Disjoint Probabilities

In my previous post on probability, I talked about how you need to be careful about covering cases. To understand what I mean by that, it’s good to see some examples.

And we can do that while also introducing an important concept which I haven’t discussed yet. I’ve frequently talked about independence, but equally important is the idea of disjointness.

Two events are independent when they have no ability to influence one another. So two coin flips are independent. Two events are disjoint when they can’t possibly occur together. Flipping a coin, the event “flipped a head” and the event “flipped a tail” are disjoint: if you flipped a head, you can’t have flipped a tail, and vice versa.

So let’s think about something abstract for a moment. Let’s suppose that we’ve got two events, A and B. We know that the probability of A is 1/3 and the probability of B is also 1/3. What’s the probability of A or B?

Naively, we could say that it’s P(A) + P(B). But that’s not necessarily true. It depends on whether or not the two events are disjoint.

Suppose that it turns out that the probability space we’re working in is rolling a six-sided die. There are three basic scenarios that we could have:

  1. Scenario 1: A is the event “rolled 1 or 2”, and B is “rolled 3 or 4”. That is, A and B are disjoint.
  2. Scenario 2: A is the event “rolled 1 or 2”, and B is “rolled 2 or 3”. A and B are different, but they overlap.
  3. Scenario 3: A is the event “rolled 1 or 2”, and B is the event “rolled 1 or 2”. A and B are really just different names for the same event.

In scenario one, we’ve got disjoint events. So P(A or B) is P(A) + P(B). One way of checking that that makes sense is to look at how the probabilities of the events work out. P(A) is 1/3. P(B) is 1/3. The probability of neither A nor B – that is, the probability of rolling either 5 or 6 – is 1/3. The sum is 1, as it should be.

But suppose that we looked at scenario 2. If we made a mistake and added them as if they were disjoint, how would things add up? P(A) is 1/3. P(B) is 1/3. P(neither A nor B) = P(4 or 5 or 6) = 1/2. The total of these three probabilities is 1/3 + 1/3 + 1/2 = 7/6. So just from that addition, we can see that there’s a problem, and we did something wrong.

If we know that A and B overlap, then we need to do something a bit more complicated to combine probabilities. The general equation is:

 P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Using that equation, we’d get the right result. P(A) = 1/3; P(B) = 1/3; P(A and B) = 1/6. So the probability of A or B is 1/3 + 1/3 − 1/6 = 1/2. And P(neither A nor B) = P(4 or 5 or 6) = 1/2. The total is 1, as it should be.
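
Here’s a quick sanity check of that inclusion-exclusion rule, written as a small Python sketch (my own illustration, not part of the original post), brute-forcing the six equally likely die rolls:

from fractions import Fraction

outcomes = range(1, 7)          # the six equally likely die rolls
A = {1, 2}                      # "rolled 1 or 2"
B = {2, 3}                      # "rolled 2 or 3"

def prob(event):
    return Fraction(sum(1 for roll in outcomes if roll in event), 6)

print(prob(A | B))                       # 1/2
print(prob(A) + prob(B) - prob(A & B))   # 1/2: inclusion-exclusion agrees
print(prob(A) + prob(B))                 # 2/3: the naive sum overcounts the overlap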

From here, we’ll finally start moving on to some more interesting stuff. Next post, I’ll look at how to use our probability axioms to analyze the probability of winning a game of craps. That will take us through a bunch of applications of the basic rules, as well as an interesting example of working through a limit case.

And then it’s on to combinatorics, which is the main tool that we’ll use for figuring out how many cases there are, and what they are, which as we’ve seen is an essential skill for probability.

Correction, Sigma Algebras, and Mass functions

So, I messed up a bit in the previous post. Let me get that out of the way before we move forward!

In measure theory, you aren’t just working with sets. You’re working with something called σ-algebras. It’s a very important distinction.

The problem is, our intuition of sets doesn’t always work. Sets, as defined formally, are really pretty subtle. We expect certain things to be true, because they make sense. But in fact, they are not implied by the definition of sets. A σ-algebra is, essentially, a well-behaved set – a set whose behavior matches our usual expectations.

To be formal, a σ-algebra over a set S is a collection Σ of subsets of S such that:

  1. Σ contains S itself.
  2. Σ is closed under set complement.
  3. Σ is closed under countable union.

The reason why you need to make this restriction is, ultimately, because of the axiom of choice. Using the axiom of choice, you can create sets which are unmeasurable. They’re clearly subsets of a measurable set, and supersets of other measurable sets – and yet, they are, themselves, not measurable. This leads to things like the Banach-Tarski paradox: you can take a measurable set, divide it into non-measurable subsets, and then combine those non-measurable subsets back into measurable sets whose sizes seem to make no sense. You can take a sphere the size of a baseball, slice it into pieces, and then re-assemble those pieces into a sphere the size of the earth, without stretching them!

These non-measurable sets blow away our expectations about how things should behave. The restriction to σ-algebras is just a way of saying that we need to be working in a space where all of the sets we talk about are measurable. When we’re looking at measure theory (or probability theory, where we’re building on measures), we need to exclude non-measurable sets. If we don’t, we’re seriously up a creek without a paddle. If we allowed non-measurable sets, then the probability theory we’re building would be inconsistent, and that’s the kiss of death in mathematics.

Ok. So, with that out of the way, how do we actually use Kolmogorov’s axioms? It all comes down to the idea of a sample space. You need to start with an experiment that you’re going to observe. For that experiment, there are a set of possible outcomes. The set of all possible outcomes is the sample space.

Here’s where, sadly, even axiomatized probability theory gets a bit handwavy. Given the sample space, you can define the structure of the sample space with a function, called the probability mass function, f, which maps each possible event in the sample space to a probability. To be a valid mass function for a sample space S, it’s got to have the following properties:

  1. For each event e in S, f(e) ≥ 0 and f(e) ≤ 1.
  2. The sum of the probabilities over the whole sample space must be 1: Σ_{e ∈ S} f(e) = 1.
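
As a concrete illustration (mine, not from the post – the helper name is made up), here’s a tiny Python sketch of checking those two properties for a candidate mass function; the fair-die assignment passes, and a deliberately broken one fails:

from fractions import Fraction

def is_valid_mass_function(f):
    # f maps each outcome in the sample space to its probability.
    return all(0 <= p <= 1 for p in f.values()) and sum(f.values()) == 1

fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
print(is_valid_mass_function(fair_die))                        # True
print(is_valid_mass_function({"heads": 0.7, "tails": 0.7}))    # False: sums to 1.4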

So we wind up with a sort of circularity: in order to describe the probability of events, we need to start by knowing the probability of events. In fact, this isn’t really a problem: we’re talking about taking something that we observe in the real world, and mapping it into the abstract space of math. Whenever we do that, we need to take our observations of the real world and create an approximation as a mathematical model.

The point of probability theory isn’t to do that primitive mapping. In general, we already understand how rolling a single die works. We know how it should behave, and we know how and why its actual behavior can vary from our expectation. What we really want to know is what happens when we combine events.

We don’t need any special theory to figure out what the probability of rolling a 3 on a six-sided die is: that’s easy, and it’s obvious: it’s 1 in 6. But what’s the probability of winning a game of craps?

If all days of the year 2001 are equally likely as birthdays, then we don’t need anything fancy to find the probability that someone born in 2001 has July 21st as their birthday. It’s easy: 1 in 365. But if I’ve got a group of 35 people, what’s the probability of two of them sharing the same birthday?

Both of those questions start with the assignment of a probability mass function, which is trivial. But they involve combining the probabilities given by those mass functions, using them with Kolmogorov’s axioms to figure out the probabilities of the complicated events.
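
To give a taste of the kind of combination we’re building toward, here’s a small Python sketch (my own, under the usual simplifying assumption of 365 equally likely birthdays) of the birthday question: multiply together the chances that each successive person misses all of the birthdays seen so far, and subtract from 1:

from fractions import Fraction

def prob_shared_birthday(k, days=365):
    # Probability that at least two of k people share a birthday.
    p_all_distinct = Fraction(1)
    for i in range(k):
        p_all_distinct *= Fraction(days - i, days)
    return 1 - p_all_distinct

print(float(prob_shared_birthday(35)))   # roughly 0.81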

Kolmogorov's Axioms of Probability

The way that I’ve talked about probability so far is mostly informal. That’s the way that probability theory was treated for a long time. You defined probability spaces over collections of equal probability sets. You combined probability spaces by combining their events into other kinds of equally probable events.

The problem with that should be obvious: it’s circular. You want to define the probability of events; to do that, you need to start with equally probable events, which means that on some level, you already know the probabilities. If you don’t know the probabilities, you can’t talk about them. The reality is somewhat worse than that, because this way of looking at things completely falls apart when you start trying to think about infinite probability spaces!

So what can you do?

The answer is to reformulate probability. Mathematicians knew about this kind of problem for a very long time, but they mostly just ignored it: probability wasn’t considered a terribly interesting field.

Then, along came Kolmogorov – the same brilliant guy whose work on algorithmic complexity is so fascinating to me! Kolmogorov created a new formulation of probability theory. Instead of starting with a space of equally probable discrete events, you start with a measure space.

Before we can look at how Kolmogorov reformulated probability (the Kolmogorov axioms), we need to look at just what a measure space is.

A measure space is just a set with a measure function. So let X be a set. A measure μ on X is a function from subsets of X to the real numbers, μ: 2^X → ℝ, with the following properties:

  • Measures are non-negative: ∀ x ⊆ X: μ(x) ≥ 0.
  • The measure of the empty set is always 0: μ(∅) = 0.
  • The measure of the union of countably many disjoint sets is the sum of their individual measures: for disjoint x and y, μ(x ∪ y) = μ(x) + μ(y), and likewise for any countable disjoint collection.

So the idea is pretty simple: a measure space is just a way of defining the size of a subset in a consistent way.

To work with probability, you need a measure space where the measure of the entire set is 1. With that idea in mind, we can put together a proper, formal definition of a probability space that will really allow us to work with, and to combine probabilities in a rigorous way.

Like our original version, a probability space has a set of events, called its event space. We’ll use F to represent the set of all possible events, and e to represent an event in that set.

There are three fundamental axioms of probability, which are going to look really similar to the three axioms of a measure space:

  1. Basic measure: the probability of any event is a non-negative real number: ∀ e ∈ F: P(e) ≥ 0.
  2. Unit measure: the probability of the unit event Ω is 1: P(Ω) = 1. (Ω is called the unit event, and is the union of all possible events.) Alternatively, the probability of no event occurring is 0: P(∅) = 0.
  3. Combination: for any two disjoint events or sets of events e and f, the probability of e or f is P(e) + P(f): ∀ e, f ∈ F: e ∩ f = ∅ ⇒ P(e ∪ f) = P(e) + P(f). This can be extended to any countable sequence of unions of pairwise disjoint events.

This is very similar to the informal version we used earlier. But as we’ll see later, this simple formulation from measure theory will give us a lot of additional power.

It’s worth taking a moment to point out two implications of these axioms. (In fact, I’ve seen some presentations that treat some of these as additional axioms, but they’re provable from the first three.)

  • Monotonicity: if e ⊆ f, then P(e) ≤ P(f).
  • Upper Bound: for any event or set of events e, 0 ≤ P(e) ≤ 1.

The brilliance of Kolmogorov was realizing that these rules were everything you need to work out any probability you want – in both finite and infinite spaces. We’ll see that there’s a lot of complexity in the combinatorics of probability, but it will all always ultimately come back to these three rules.
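
To make the axioms concrete, here’s a toy Python check (my own example, not from the post): events are subsets of a six-outcome sample space, P just counts outcomes, and the axioms plus the two derived properties all hold:

from fractions import Fraction

omega = frozenset(range(1, 7))            # the unit event: all outcomes

def P(event):
    return Fraction(len(event & omega), len(omega))

e, f = frozenset({1, 2}), frozenset({3, 4})               # disjoint events
assert P(e | f) == P(e) + P(f)                            # combination axiom
assert P(omega) == 1 and P(frozenset()) == 0              # unit event, empty event
assert P(frozenset({1})) <= P(frozenset({1, 5}))          # monotonicity
assert all(0 <= P(s) <= 1 for s in (e, f, e | f, omega))  # upper bound
print("axioms hold on this toy space")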

Independence and Combining probabilities.

As I alluded to in my previous post, simple probability is really a matter of observation, to produce a model. If you roll a six-sided die over and over again, you’ll see that each face comes up pretty much equally often, and so you model it as 1/6 probability for each face. There’s not really a huge amount of principle behind the basic models: it’s really just whatever works. This is the root of the distinction between interpretations: a frequentist starts with an experiment, and builds a descriptive model based on it, and says that the underlying phenomenon being tested has the model as a property; a Bayesian does almost the same thing, but says that the model describes the state of their knowledge.

Where probability starts to become interesting is when you combine things. I know the probability of outcomes for rolling one die: how can I use that to see what happens when I roll five dice together? I know the probability of drawing a specific card from a deck: what are the odds of being dealt a particular poker hand?

We’ll start with the easiest part: combining independent probabilities. Two events are independent when there’s no way for the outcome of one to influence the outcome of the other. For example, if you’re flipping a coin several times, the result of one coin flip has no effect on the result of a subsequent flip. On the other hand, dealing 10 cards from a deck is a sequence of dependent events: once you’ve dealt one card, the next deal must come from the remaining cards: you can’t deal the same card twice.

If you know the probability space of your trials, then recognizing an independent situation is easy: if the outcome of one trial doesn’t alter the probability space of other trials, then they’re independent.

Look back at the coin flip example: we know what the probability space of a coin flip looks like: it’s got two, equally probable outcomes. If you’ve flipped a coin once, and you’re going to flip another coin, the result of the first flip can’t do anything that alters the probability space of a subsequent flip.

But if you think about dealing cards, that’s not true. With a standard deck of cards, the initial probability space has 52 outcomes, each of which is equally likely. So the odds of being dealt the 5 of spades is exactly 1/52.

Now, suppose that you got lucky, and you did get dealt the 5 of spades on the first card. What’s the probability of being dealt the 5 of spades on the second? If they were independent events, it would still be 1/52. But once you’ve dealt one card, you can’t deal it again. The probability of being dealt the 5 of spades as the second card is 0: it’s impossible. The probability space only has 51 possible outcomes, and the 5 of spades is not one of them. The space has changed. That’s the definition of a dependent event.

When you’re faced with dependent probabilities, you need to figure out how the probability space will be changed, and incorporate that into your computation. Once you’ve incorporated the change in the probability space of the second test, then you’ve got a new independent probability, and you can combine them. Figuring out how to alter the probability space can be extremely difficult, but that’s what makes it interesting.
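
Here’s a small Python sketch (my own example) of incorporating a changed probability space: the chance that the first two cards dealt are both hearts, computed by adjusting the space for the second deal, and then double-checked by brute-force enumeration:

from fractions import Fraction
from itertools import permutations

p_first_heart = Fraction(13, 52)                 # initial space: 52 cards, 13 hearts
p_second_heart_after_first = Fraction(12, 51)    # new space: 51 cards, 12 hearts left
print(p_first_heart * p_second_heart_after_first)    # 1/17

# Brute-force check over every ordered pair of distinct cards.
deck = [(rank, suit) for suit in "SHDC" for rank in range(1, 14)]
pairs = list(permutations(deck, 2))
both_hearts = sum(1 for a, b in pairs if a[1] == "H" and b[1] == "H")
print(Fraction(both_hearts, len(pairs)))             # 1/17 again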

When you’re dealing with independent events, it’s easy to combine them. There are two basic ways of combining event probabilities, and they should be familiar from logic: event1 AND event2, and event1 OR event2.

Suppose you’re looking at two tests with independent outcomes. I know that the probability of event e is P(e), and the probability of event f is P(f). Then the probability of e AND f – that is, of having e as the outcome of the first trial, and f as the outcome of the second – is P(e)×P(f). The odds of flipping HTTH with a coin are (1/2)×(1/2)×(1/2)×(1/2) = 1/16.

If you’re looking at disjoint alternatives – that is, the probability of e OR f, where e and f can’t both occur – you combine the probabilities of the events with addition: P(e) + P(f). So, the odds of drawing any heart from a deck: for each card, it’s 1/52. There are thirteen different hearts. So the odds of drawing a heart are 1/52 + 1/52 + … = 13/52 = 1/4.
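
Both rules are easy to check by brute force. Here’s a short Python sketch (mine, not from the post) that enumerates the 16 equally likely four-flip sequences for the AND example, and sums the per-card probabilities for the OR example:

from fractions import Fraction
from itertools import product

# AND of independent events: P(HTTH) = (1/2)**4.
flips = list(product("HT", repeat=4))
print(Fraction(flips.count(tuple("HTTH")), len(flips)))   # 1/16

# OR of disjoint events: thirteen hearts, each with probability 1/52.
print(sum(Fraction(1, 52) for _ in range(13)))            # 1/4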

That still doesn’t get us to the really interesting stuff. We still can’t quite work out something like the odds of being dealt a flush. To get there, we need to learn some combinatorics, which will allow us to formulate the probability spaces that we need for an interesting probability.

Probability Spaces

Sorry for the slowness of the blog lately. I finally got myself back onto a semi-regular schedule when I posted about the Adria Richards affair, and that really blew up. The amount of vicious, hateful bile that showed up, both in comments (which I moderated) and in my email was truly astonishing. I’ve written things which pissed people off before, and I’ve gotten at least my fair share of hatemail. But nothing I’ve written before came close to preparing me for the kind of unbounded hatred that came in response to that post.

I really needed some time away from the blog after that.

Anyway, I’m back, and it’s time to get on with some discrete probability theory!

I’ve already written a bit about interpretations of probability. But I haven’t said anything about what probability means formally. When I say that the probability of rolling a total of 3 with a pair of fair six-sided dice is 1/18, how do I know that? Where did that 1/18 figure come from?

The answer lies in something called a probability space. I’m going to explain the probability space in frequentist terms, because I think that that’s easiest, but there is (of course) an equivalent Bayesian description. Suppose I’m looking at a particular experiment. In classic mathematical form, a probability space consists of three components (Ω, E, P), where:

  1. Ω, called the sample space, is a set containing all possible outcomes of the experiment. For a pair of dice, Ω would be the set of all possible rolls: {(1,1), (1,2), (1,3), (1,4), (1,5), (1, 6), (2,1), …, (6, 5), (6,6)}.
  2. E is an equivalence relation over Ω, which partitions Ω into a set of events. Each event is a set of outcomes that are equivalent. For rolling a pair of dice, an event is a total – each event is the set of outcomes that have the same total. For the event “3” (meaning a roll that totalled three), the set would be {(1, 2), (2, 1)}.
  3. P is a probability assignment. For each event e in E, P(e) is a value between 0 and 1, where:

     Σ_{e ∈ E} P(e) = 1

    (That is, the sum of the probabilities of all of the possible events in the space is exactly 1.)

The probability of an event e being the outcome of a trial is P(e).
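
Here’s a minimal Python sketch (my own illustration) of that (Ω, E, P) construction for the pair of dice: enumerate the 36 equally likely outcomes, partition them into events by their total, and read the probability of each event off the sizes of the partitions:

from collections import defaultdict
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))       # all 36 outcomes of two dice

events = defaultdict(set)                          # partition of omega by total
for outcome in omega:
    events[sum(outcome)].add(outcome)

P = {total: Fraction(len(members), len(omega)) for total, members in events.items()}

print(P[3])                      # 1/18: the probability of rolling a total of 3
print(sum(P.values()) == 1)      # True: the event probabilities sum to 1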

So the probability of any particular event as the result of a trial is a number between 0 and 1. What’s it mean? If the probability of event e is p, then if we repeat the trial N times, we expect N*p of those trials to have e as their result. If the probability of e is 1/4, and we repeat the trial 100 times, we’d expect e to be the result 25 times.

But in an important sense, that’s a cop-out. We’ve defined probability in terms of this abstract model, where the third component is the probability. Isn’t that circular?

Not really. For a given trial, we create the probability assignment by observation and/or analysis. The important point is that this is really just a bare minimum starting point. What we really care about in probability isn’t the chance associated with a single, simple, atomic event. What we want to do is take the probability associated with a group of single events, and use our understanding of that to allow us to explore a complex event.

If I give you a well-shuffled deck of cards, it’s easy to show that the odds of drawing the 3 of diamonds is 1/52. What we want to do with probability is things like ask: What are the odds of being dealt a flush in a poker hand?

The construction of a probability space gives us a well-defined platform to use for building probabilistic models of more interesting things. Given the probability spaces of two single dice, we can combine them to create the probability space of the two dice rolled together. Given the probability space of a pair of dice, we can construct the probability space of a game of craps. And so on.

Probability and Interpretations

I’m going to do some writing about discrete probability theory. Probability is an extremely important area of math. We encounter aspects of it every day. It’s also a very poorly understood area – it’s one that we see abused or just fouled up every day.

I’m going to focus on discrete probability theory. What that means is that we’re going to look at things where the space containing the things that we’re going to look at contains a countable number of elements. The probability of getting a certain sequence of coin flips, or of getting a certain hand of cards are described by discrete probability theory. On the other hand, the odds of a radioactive isotope decaying at a particular time requires continuous probability theory.

Before getting into the details, there’s one important thing to mention. When you’re talking about probability, there are two fundamental schools of interpretation. There are frequentist interpretations, and there are Bayesian interpretations.

In a frequentist interpretation, when you say the probability of an event is 0.6, what you mean is that if you were to perform a series of experiments precisely reproducing the event, then on average, if you did 100 experiments, the event would occur 60 times. In the frequentist interpretation, the probability is an intrinsic property of the event. For a frequentist, it makes sense to say that there is a “real” probability associated with an event.

In a Bayesian interpretation, when you say that the probability of an event is 0.6, what you mean is that based on your current state of knowledge about the event, you have a 60% certainty that the event will occur. In a strict Bayesian interpretation, the event doesn’t have any kind of intrinsic probability associated with it. The specific event that you’re interested in either will occur, or it won’t. There’s no real probability involved. What probability measures is how certain you are about whether or not it will occur.

For example, think about flipping a fair coin.

A frequentist would say that you can flip a coin many times, and half of the time, it will land on heads. So the probability of a coin flip landing on the head of the coin is 0.5. A Bayesian would say that the coin will land either on heads or on tails. Since you don’t know which, and you have no other information to use to be able to make a better prediction, you can have a certainty of 0.5 that it will land on the head of the coin.

In the real world, I think that most people are really somewhere in between.

I think that all but the most fervent Bayesians do rely on an intuitive notion of the “intrinsic” probability of an event. They may describe it in different terms, but when it comes down to it, they’re using the basic frequentist notion. And I don’t think that you can find a sane frequentist anywhere who won’t use Bayes theorem to update their priors in the face of new information – which is the most fundamental notion in the Bayesian interpretation.

One note before I finish this, and get started on the real meaty posts. In the past, when I’ve talked about probability, people have started stupid flamewars in the comments. People get downright religious about interpretations of probability. There are religious Bayesians, who think that all frequentists are stupid idiots who should be banished from the field of math; likewise, there are religious frequentists who think that Bayesians are all a crop of arrogant know-it-alls who should be sent to Siberia. I am not going to tolerate any of that nonsense. If you feel that you cannot read posts on probability without going into a diatribe about those stupid frequentists/Bayesians and their deliberately stupid ideas, please go away and don’t even read these posts. If you do go into such a diatribe, I will delete your comments without any hesitation.

What the heck is a DNS amplification DoS attack?

A couple of weeks ago, there was a bunch of news about a major DoS attack on Spamhaus. Spamhaus is an online service that maintains a blacklist of mail servers that are known for propagating spam. I’ve been getting questions about what a DoS attack is, and more specifically what a “DNS amplification attack” (the specific attack at the heart of last week’s news) is. This all became a bit more relevant to me last week, because some asshole who was offended by my post about the Adria Richards affair launched a smallish DoS attack against scientopia. (This is why we were intermittently very slow last week, between Tuesday and Thursday. Also, to be clear, the DNS amplification attack was used on Spamhaus. Scientopia was hit by a good old fashioned DDoS attack.)

So what is a DoS attack? And what specifically is a DNS amplification attack?

Suppose that you’re a nasty person who wants to take down a website like scientopia. How could you do it? You could hack in to the server, and delete everything. That would kill it pretty effectively, right?

It certainly would. But from the viewpoint of an attacker, that’s not a particularly easy thing to do. You’d need to get access to a privileged account on our server. Even if we’re completely up to date on patches and security fixes, it’s probably possible to do that, but it’s still probably going to be a lot of work. Even for a dinky site like scientopia, getting that kind of access probably isn’t trivial. For a big security-focused site like spamhaus, that’s likely to be close to impossible: there are layers of security that you’d need to get through, and there are people constantly watching for attacks. Even if you got through, if the site has reliable backups, it won’t be down for long, and once they get back up, they’ll patch whatever hole you used to get in, so you’d be back to square one. It’s a lot of work, and there are much easier ways to take down a site.

What you, as an attacker, want is a way to take the site down without having any kind of access to the system. You want a way that keeps the site down for as long as you want it down. And you want a way that doesn’t leave easily traced connections to you.

That’s where the DoS attack comes in. DoS stands for “denial of service”. The idea of a DoS attack is to take a site down without really taking it down. You don’t actually kill the server; you just make it impossible for legitimate users to access it. If the site’s users can’t access the site even though the server is technically still up and running, you’ve still effectively killed it.

How do you do that? You overwhelm the server. You target some finite resource on the server, and force it to use up that resource just dealing with requests or traffic that you sent to the server, leaving it with nothing for its legitimate users.

In terms of the internet, the two resources that people typically target are CPU and network bandwidth.

Every time that you send a request to a webserver, the server has to do some computation to process that request. The server has a finite amount of computational capability. If you can hit it with enough requests that it spends all of its time processing your requests, then the site becomes unusable, and it effectively goes down. This is the simplest kind of DoS attack. It’s generally done in a form called a DDoS – distributed denial of service attack – where the attacker uses thousands or millions of virus-infected computers to send requests. The server gets hit by a vast storm of requests, and it can’t distinguish the legitimate requests from the ones generated by the attacker. This is the kind of attack that hit Scientopia last week. We were getting streams of a couple of thousand malformed requests per second.

This kind of attack can be very effective. It’s hard – not impossible, but hard – to fight. You need to identify the common traits of the attackers, and set up some kind of filter to discard those requests. From the attacker’s point of view, it’s got one problem: price. Most people don’t have a personal collection of virus-infected machines that they can use to mount an attack. What they actually do is rent machines! Virus authors run services where they’ll use the machines that they’ve infected to run an attack for you, for a fee. They typically charge per machine-hour. So keeping a good attack going for a long time is expensive! Another problem with this kind of attack is that the traffic you inflict on the server also consumes resources on the attacking clients: each client needs to establish a connection to the server, and that consumes CPU, network connections, and bandwidth on the client.

The other main DoS vector is network bandwidth. Every server running a website is connected to the network by a connection with a fixed capacity, called its bandwidth. A network connection can only carry a certain quantity of information. People love to make fun of the congressman who said that the internet is like a series of tubes, but that’s not really a bad analogy. Any given connection is a lot like a pipe. You can only cram so much information through that pipe in a given period of time. If you can send enough traffic to completely fill that pipe, then the computer on the other end is, effectively, off the network. It can’t receive any requests.

For a big site like spamhaus, it’s very hard to get enough machines attacking to effectively kill the site. The amount of bandwidth, and the number of different network paths connecting spamhaus to the internet is huge! The number of infected machines available for an attack is limited, and the cost of using all of them is prohibitive.

What an attacker would like for killing something like Spamhaus is an attack where the amount of work/cpu/traffic used to generate the attack is much smaller than the amount of work/cpu/traffic used by the server to combat the attack. That’s where amplification comes in. You want to find some way of using a small amount of work/traffic on your attacker machines to cause your target to lose a large amount of work/traffic.

In this recent attack on Spamhaus, they used an amplification attack that was based on a basic internet infrastructure service called the Domain Name Service (DNS). DNS is the service which is used to convert between the name of a server (like scientopia.org) and its numeric internet address (184.106.221.182). DNS has some technical properties that make it ideal for this kind of attack:

  1. It’s not a connection-based service. In most internet services, you establish a connection to a server, and send a request on that connection. The server responds on the same connection. In a connection-based service, that means two things. First, you need to use just as much bandwidth as the target, because if you drop the connection, the server sees the disconnect and stops processing your request. Second, the server knows who it’s connected to, and it always sends the results of a request to the client that requested it. But DNS doesn’t work that way. In DNS, you send a request without a connection, and in the request, you provide an address that the response should be sent to. So you can fake a DNS request, by putting someone else’s address as the “respond-to” address in the request.
  2. It’s possible to set up DNS to create very large responses to very small requests. There are lots of ways to do this. The important thing is that it’s really easy to use DNS in a way that allows you to amplify the amount of data being sent to a server by a factor of 100. In one common form of DNS amplification, you send 60 byte requests, which generate responses larger than 6,000 bytes.

Put these two properties together, and you get a great attack vector: you can send tiny, cheap requests to a server, which don’t cause any incoming traffic on your attacker machine, and which send large quantities of data to your target. Doing this is called a DNS amplification attack: it’s an amplification attack which uses properties of DNS to generate large quantities of data sent to your target, using small quantities of data sent by your attackers.

That’s exactly what happened to Spamhaus last week. The attackers used a very common DNS extension, which allowed them to amplify 60 byte requests into 4,000 byte responses, and to send the responses to the spamhaus servers.
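
If you want to see the size asymmetry for yourself, here’s a rough sketch using the third-party dnspython package (my own illustration; the domain, resolver, and record type are arbitrary choices, and the exact ratio depends entirely on the record you query and the resolver’s configuration). It just compares the size of one query on the wire to the size of the response it triggers:

import dns.message
import dns.query
import dns.rdatatype

# Build a single DNS query and send it over UDP to a public resolver.
query = dns.message.make_query("example.com", dns.rdatatype.TXT, want_dnssec=True)
response = dns.query.udp(query, "8.8.8.8", timeout=5)

query_bytes = len(query.to_wire())
response_bytes = len(response.to_wire())
print(query_bytes, response_bytes, round(response_bytes / query_bytes, 1))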

There are, of course, more details. (For example, when direct attacks didn’t work, they tried an indirect attack that didn’t target the spamhaus servers, but instead tried to attack other servers that spamhaus relied on.) But this is the gist.

Finally: Gödel’s Proof of Incompleteness!

Finally, we’re at the end of our walkthrough of Gödel’s great incompleteness proof. As a refresher, the basic proof sketch is:

  1. Take a simple logic. We’ve been using a variant of the Principia Mathematica’s logic, because that’s what Gödel used.
  2. Show that any statement in the logic can be encoded as a number using an arithmetic process based on the syntax of the logic. The process of encoding statements numerically is called Gödel numbering.
  3. Show that you can express meta-mathematical properties of logical statements in terms of arithmetic properties of their Gödel numbers. In particular, we need to build up the logical infrastructure that we need to talk about whether or not a statement is provable.
  4. Using meta-mathematical properties, show how you can create an unprovable statement encoded as a Gödel number.

What came before:

  1. Gödel numbering: The logic of the Principia, and how to encode it as numbers. This covered steps 1 and 2 in the sketch.
  2. Arithmetic Properties: what it means to say that a property can be expressed arithmetically. This set the groundwork for step 3 in the proof sketch.
  3. Encoding meta-math arithmetically: how to take meta-mathematical properties of logical statements, and define them as arithmetic properties of the Gödel numberings of the statements. This was step 3 proper.

So now we can move on to the final step, where we actually see why mathematical logic is necessarily incomplete.

When we left off with Gödel, we’d gone through a very laborious process showing how we could express meta-mathematical properties of logical statements as primitive recursive functions and relations. We built up to being able to express one non-primitive recursive property, which describes the property that a given statement is provable:

pred provable(x) = {
  some y {
    proofFor(y, x)
  }
}

The reason for going through all of that was that we really needed to show how we could capture all of the necessary properties of logical statements in terms of arithmetic properties of their Gödel numbers.

Now we can get to the target of Gödel’s effort. What Gödel was trying to do was show how to defeat the careful stratification of the Principia’s logic. In the Principia, Russell and Whitehead had tried to avoid problems with self-reference by creating a very strict stratification, where each variable or predicate had a numeric level, and could only reason about objects from lower levels. So if natural numbers were the primitive objects in the domain being reasoned about, then level-1 objects would be things like specific natural numbers, and level-1 predicates could reason about specific natural numbers, but not about sets of natural numbers or predicates over the natural numbers. Level-2 objects would be sets of natural numbers, and level-2 predicates could reason about natural numbers and sets of natural numbers, but not about predicates over sets of natural numbers, or sets of sets of natural numbers. Level-3 objects would be sets of sets of natural numbers… and so on.

The point of this stratification was to make self-reference impossible. You couldn’t make a statement of the form “This predicate is true”: the predicate would be a level-N predicate, and only a level N+1 predicate could reason about a level-N predicate.

What Gödel did in the laborious process we went through in the last post is embed a model of logical statements in the natural numbers. That’s the real trick: the logic is designed to work with a set of objects that are a model of the natural numbers. By embedding a model of logical statements in the natural numbers, he made it possible for a level-1 predicate (a predicate about a specific natural number) to reason about any logical statement or object. A level-1 predicate can now reason about a level-7 object! A level-1 predicate can reason about the set defined by a level-1 predicate: a level-1 predicate can reason about itself!

Now, we can finally start getting to the point of all of this: incompleteness! We’re going to use our newfound ability to nest logical statements into numbers to construct an unprovable true statement.

In the last post, one of the meta-mathematical properties that we defined for the Gödel-numbered logic was immConseq, which defines when some statement x is an immediate consequence of a set of statements S. As a reminder, that means that x can be inferred from statements in S in one inference step.

We can use that property to define what it means to be a consequence of a set of statements: it’s the closure of immediate consequence. We can define it in pseudo-code as:

def conseq(κ) = {
  K = κ + axioms
  added_to_k = false
  do {
    added_to_k = false
    for all c in immConseq(K) {
      if c not in K {
        add c to K
        added_to_k = true
      }
    }
  } while added_to_k
  return K
}

In other words, Conseq(κ) is the complete set of everything that can possibly be inferred from the statements in κ and the axioms of the system. We can say that there’s a proof for a statement x in κ if and only if x ∈ Conseq(κ).
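
The pseudo-code above is just a fixpoint computation. As a side illustration (mine, not Gödel’s), here’s the same closure idea as runnable Python, parameterized by whatever one-step inference function you like; it terminates whenever the closure is finite:

def conseq(kappa, axioms, immediate_consequences):
    # Keep adding one-step consequences until nothing new appears.
    known = set(kappa) | set(axioms)
    while True:
        new = immediate_consequences(known) - known
        if not new:
            return known
        known |= new

# Toy usage: "statements" are integers, and one inference step derives x + 2
# from any known x below 10.
closure = conseq({0}, {1}, lambda s: {x + 2 for x in s if x < 10})
print(sorted(closure))    # [0, 1, 2, ..., 11]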

We can use the idea of Conseq to define a strong version of what it means for a logical system with a set of facts to be consistent. A system is ω-consistent if and only if there is not a statement a such that: a ∈ Conseq(κ) ∧ not(forall(v, a)) ∈ Conseq(κ).

In other words, the system is ω-consistent as long as you can never infer both a statement and the negation of its universal generalization. That’s not quite the same thing as ordinary consistency, but for our purposes, we can treat it as being pretty much the same thing. (Yes, that’s a bit hand-wavy, but I’m not trying to write an entire book about Gödel here!)

(Gödel’s version of the definition of ω-consistency is harder to read than this, because he’s very explicit about the fact that Conseq is a property of the numbers. I’m willing to fuzz that, because we’ve shown that the statements and the numbers are interchangeable.)

Using the definition of ω-consistency, we can finally get to the actual statement of the incompleteness theorem!

Gödel’s First Incompleteness Theorem: For every ω-consistent primitive recursive set κ of formulae, there is a primitive-recursive predicate r(x) such that neither forall(v, r) nor not(forall(v, r)) is provable.

To prove that, we’ll construct the predicate r.

First, we need to define a version of our earlier isProofFigure that’s specific to the set of statements κ:

pred isProofFigureWithKappa(x, kappa) = {
  all n in 1 to length(x) {
    isAxiom(item(n, x)) or
    item(n, x) in kappa or
    some p in 0 to n {
      some q in 0 to n {
        immConseq(item(n, x), item(p, x), item(q, x))
      }
    }
  } and length(x) > 0
}

This is the same as the earlier definition – just specialized so that it ensures that every statement in the proof figure is either an axiom, or a member of κ.

We can do the same thing to specialize the predicate proofFor and provable:

pred proofForStatementWithKappa(x, y, kappa) = {
  isProofFigureWithKappa(x, kappa) and
  item(length(x), x) = y
}

pred provableWithKappa(x, kappa) = {
  some y {
    proofForStatementWithKappa(y, x, kappa)
  }
}

If κ is the set of basic truths that we can work with, then provableWithKappa is equivalent to provable.

Now, we can define a predicate NotAProofWithKappa:

pred NotAProofWithKappa(x, y, kappa) = {
  not (proofForStatementWithKappa(x, subst(y, 19, number(y)), kappa))
}

Based on everything that we’ve done so far, NotAProofWithKappa is primitive recursive.

This is tricky, but it’s really important. We’re getting very close to the goal, and it’s subtle, so let’s take the time to understand this.

  • Remember that in a Gödel numbering, each prime number is a variable. So 19 here is just the name of a free variable in y.
  • Using the Principia’s logic, the fact that variable 19 is free means that the statement is parametric in variable 19. For the moment, it’s an incomplete statement, because it’s got an unbound parameter.
  • What we’re doing in NotAProofWithKappa is substituting the numeric coding of y for the value of y’s parameter. When that’s done, y is no longer incomplete: its unbound variable has been replaced by a binding.
  • With that substitution, NotAProofWithKappa(x, y, kappa) is true when x does not prove that y(y) is true.

What NotAProofWithKappa does is give us a way to check whether a specific sequence of statements x is not a proof of y.

We want to expand NotAProofWithKappa to something universal. Instead of just saying that a specific sequence of statements x isn’t a proof for y, we want to be able to say that no possible sequence of statements is a proof for y. That’s easy to do in logic: you just wrap the statement in a “∀ x ( )”. In Gödel numbering, we defined a function that does exactly that. So the universal form of unprovability is: ∀ a (NotAProofWithKappa(a, y, kappa)).

In terms of the Gödel numbering, if we assume that the Gödel number for the variable a is 17, and the variable y is numbered as 19, we’re talking about the statement p = forall(17, NotAProofWithKappa(17, 19, kappa)).

p is the statement that for some logical statement (the value of variable 19, or y in our definition), there is no possible value for variable 17 (a) where a proves y in κ.

All we need to do now is show that we can make p become self-referential. No problem: we can just put number(p) in as the value of y in NotAProofWithKappa. If we let q be the numeric value of the statement NotAProofWithKappa(a, y), then:

r = subst(q, 19, p)

i = subst(p, 19, r)

i says that there is no possible value x that proves p(p). In other words, p(p) is unprovable: there exists no possible proof that there is no possible proof of p!

This is what we’ve been trying to get at all this time: self-reference! We’ve got a predicate y which is able to express a property of itself. Worse, it’s able to express a negative property of itself!

Now we’re faced with two possible choices. Either i is provable – in which case, κ is inconsistent! Or else i is unprovable – in which case κ is incomplete, because we’ve identified a true statement that can’t be proven!

That’s it: we’ve shown that in the Principia’s logic, using nothing but arithmetic, we can create a true statement that cannot be proven. If, somehow, it were to be proven, the entire logic would be inconsistent. So the Principia’s logic is incomplete: there are true statements that cannot be proven true.

We can go a bit further: the process that we used to produce this result about the Principia’s logic is actually applicable to other logics. There’s no magic here: if your logic is powerful enough to do Peano arithmetic, you can use the same trick that we demonstrated here, and show that the logic must be either incomplete or inconsistent. (Gödel proved this formally, but we’ll just handwave it.)

Looking at this with modern eyes, it doesn’t seem quite as profound as it did back in Gödel’s day.

When we look at it through the lens of today, what we see is that in the Principia’s logic, proof is a mechanical process: a computation. If every true statement was provable, then you could take any statement S, and write a program to search for a proof of either S or ¬ S, and eventually, that program would find one or the other, and stop.

In short, you’d be able to solve the halting problem. The proof of the halting problem is really an amazingly profound thing: on a very deep level, it’s the same thing as incompleteness, only it’s easier to understand.

But at the time that Gödel was working, Turing hadn’t written his paper about the halting problem. Incompleteness was published in 1931; Turing’s halting paper was published in 1936. This was a totally unprecedented idea when it was published. Gödel produced one of the most profound and surprising results in the entire history of mathematics, showing that the efforts of the best mathematicians in the world to produce the perfection of mathematics were completely futile.

Passwords, Hashing, and Salt

Over on twitter, some folks were chatting about the latest big security botch. A major service, called Evernote, had a security breach where a password file was stolen. Evernote has handled the situation quite well, being open about what happened, and explaining the risks.

In their description of the breach, they said that the stolen passwords were “both hashed and salted”. Apparently this sounds funny to people outside of software. (Amazing how jargon becomes so ingrained that I didn’t even notice the fact that it could be interpreted in a funny way until it was pointed out to me!)

Anyway, since discussion of this is going around, I thought I’d explain just what password hashing and salting means.

Let’s start at the beginning. You’re some kind of system that wants to have password security. Obviously, you need to save the passwords somewhere, right?

As we’ll see, that’s only partially correct – but let’s go with it for now. You need to store something that lets you check if a user supplied the right password, so you need to store something.

The most naive approach is create a file (or database, or whatever) that contains the usernames and passwords of all of your users. Something like:

alice:abc
mark:pass
joe:123
jen:hello

Suppose you were a thief, and you wanted to break into this system: what would you do? You’d try to steal that password file! If you can get hold of that password file, then you’d have all of the passwords for all of the users of the system.

That means that this is a terrible way of storing the passwords. One step, and a thief has completely breached your system. We don’t want that. So what should we do?

First, we could encrypt the file.

This might seem like a good idea at first. If a thief were to steal the file, they wouldn’t be able to find your users’ passwords without figuring out the encryption key! It’s going to take a lot of work to figure out that key!

The first problem with this is that any encryption can be cracked given enough time and computing power. If there’s only one encryption key for the entire file, then it’s worth investing a lot of time and power into breaking it – and once it’s broken, everything is revealed.

The second problem is: how does your system check a user’s password? It needs to decrypt the file! That means that the encryption key must be available to your system! So all that a thief needs to do is figure out where your system is getting the key from. You’ve got your entire security system for all of your users set up with a single point of protection, and somewhere in your system, everything that you need to break that protection is available!

What can we do to improve this? The answer is something called cryptographic hashing.

Cryptographic hashing is a combination of two concepts: hashing, and one-way functions.

A really simple example of a not-very-good hash function for a string would be something like: convert all of the characters in the string to their numeric values, and exclusive-or the binary representations of those values together. With that hash function, you could take a string like “ABCD”, convert it to the numeric values of the characters ([65, 66, 67, 68]), and then x-or them together (1000001 xor 1000010 xor 1000011 xor 1000100 = 0000100) for a result of 4. Real practical hash functions are more complicated.

For example, at least some versions of Java use the following as the default hash for a string of characters:

hash(s) = ( Σ_{i=0}^{length(s)−1} 31^{length(s)−i−1} · s[i] ) mod 2^32
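
Here’s a small Python sketch (my own illustration) of both toy hashes just described – the xor-of-character-codes hash, and the Java-style polynomial hash reduced mod 2^32. Neither is remotely cryptographic; they’re only here to make the idea of “string in, number out” concrete:

from functools import reduce

def xor_hash(s):
    # Exclusive-or of the character codes.
    return reduce(lambda acc, ch: acc ^ ord(ch), s, 0)

def java_style_hash(s):
    # Horner-form evaluation of sum(31**(len(s)-i-1) * s[i]) mod 2**32.
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) % 2**32
    return h

print(xor_hash("ABCD"))         # 4, matching the worked example above
print(java_style_hash("ABCD"))  # a much less collision-prone value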

There’s a class of mathematical functions called one-way functions. A one way function is a function f, where given x, it’s easy to compute f(x), but given f(x) it’s extremely difficult (or even impossible) to compute x.

If we combine those two, we have what’s called a cryptographic hash function: a function that takes an arbitrary input string, and converts it to a number, in a way where it’s very difficult to figure out what the input string that produced the number was. That’s great for storing passwords! We don’t store the password at all. We just store the hash-value produced from the password. Then when a user comes and logs in, we take their password, hash it, and compare the result to the stored hash.

Instead of the file with explicit passwords, we get something like:

alice:7a2d28fc
mark:dfd4e1c6
joe:ed849ee1
jen:bb76e739

This is much better than storing the encrypted password. There is no encryption key that a thief can use to decrypt the password. Even if a thief knows the hash values of your user’s passwords, they can’t get in to the system! And your system actually never stores the actual values of the user’s passwords – just their hashcodes!
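
A minimal sketch of that store-the-hash scheme in Python (my own toy example – the hash values in the file above obviously came from some other function, and a real system would also add salt and a deliberately slow hash, as discussed below):

import hashlib

def hash_password(password):
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

# The stored file only ever contains hashes, never the passwords themselves.
stored = {name: hash_password(pw)
          for name, pw in [("alice", "abc"), ("mark", "pass"),
                           ("joe", "123"), ("jen", "hello")]}

def check_login(user, password):
    return stored.get(user) == hash_password(password)

print(check_login("alice", "abc"))   # True
print(check_login("alice", "xyz"))   # False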

So again, let’s look at this from the perspective of a thief. How can a thief break into a system with hashed passwords?

If they don’t know what hash function you’re using, then they’re completely stuck. Sadly, they can probably figure it out. Designing new cryptographic hash functions is hard. Implementing cryptographic hash functions correctly is hard. As a result, most people just use a hash function from a library. That means that for a thief, it’s usually pretty easy to figure out what hash function is being used by a system.

Once they know what hash function you used, their only choice to break your system is to try to guess the passwords. That is, they can guess passwords, compute their hash codes, and search through your password file to see if any of the users’ password hashes match. If they find one, they’re gold!

In fact, there’s a common strategy based on this idea called a rainbow table. A rainbow table is a list of common passwords, along with the numeric values that they hash to under a common cryptographic hash function. Something like:

Password string    Hash value
pass               1b93eb12
password           a4532c47
abc                7a2d28fc
If you can somehow steal the passwords file, then with a rainbow table, you can find users with common passwords. For example, in the table above, you can see that the hashcode “7a2d28fc” occurs in the passwords file for the username “alice”, and it’s also in the rainbow table for the password “abc”. So a thief could determine that alice’s password was “abc”. Even with the best cryptographic hash, all it takes is one idiot user who uses “password” as their password, and your system’s security is breached.

Salting passwords addresses that problem. In a salting strategy, you don’t hash a user’s password by itself: you combine it with some additional data, and then hash that combination. The additional information is called the salt.

You can use lots of different things for the salt. There’s a complex set of tradeoffs in the exact salting strategy, which are beyond the scope of this post, but a few examples include:

  1. Always use a fixed salt string. This is weak, but better than nothing. It’s got a similar weakness to the encrypted password system: you only need one salt to give you a handle on breaking all of the passwords, and that one salt needs to be in the system.
  2. Add a random piece of data for each password. The catch here is that you need to store the salt data for each password. This is what unix passwords used to use. They added 12 random bits to each password. In the passwords file, they stored the salt and the hashed password. The weakness of this is that the salt is right there with the password entry. But because each user has a different salt, that means that any attempt to breach the system needs to look at each user separately.
  3. Salt on metadata: that is, take information about the user that isn’t part of their username, and use that as the salt. For example, you could use a person’s birthday as the salt for their account.

If each user has a different salt, then even if you’ve got terrible passwords, a thief needs to do a lot of work to try to break your system. Even with a rainbow-table like strategy, they can’t compute the hashcode for a given common password once, and then search the password hash list for that code – they need to recompute it for each possible salt value!

What salting does is, effectively, increase the amount of effort needed to break the passwords. If you add 12 bits of salt, then a rainbow table needs 4096 times more entries to find common passwords! If your salt is long enough, then it can make it effectively impossible to create a rainbow table at all. If they try to attack you without a rainbow table, a 12 bit salt means that your attacker needs to attack the passwords of each of your users separately! Even if they know the value of the salt, you’ve made it much harder for them to breach your security.
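
To tie the whole post together, here’s a sketch of per-user salting in Python (my own example, using a random salt from os.urandom and the deliberately slow hashlib.pbkdf2_hmac; the salt is stored alongside the hash, much like the old Unix scheme described above):

import hashlib
import hmac
import os

def make_record(password, iterations=100_000):
    # Store (salt, iterations, digest) for each user; the salt is random per user.
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest

def check_password(password, record):
    salt, iterations, digest = record
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(candidate, digest)

record = make_record("correct horse battery staple")
print(check_password("correct horse battery staple", record))   # True
print(check_password("password", record))                       # False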

The Meta of Gödel

As you may be figuring out, there’s a reason why I resisted walking through Gödel’s proof of incompleteness for so long. Incompleteness isn’t a simple proof!

To refresh your memory, here’s a sketch of the proof:

  1. Take a simple logic. We’ve been using a variant of the Principia Mathematica’s logic, because that’s what Gödel used.
  2. Show that any statement in the logic can be encoded as a number using an arithmetic process based on the syntax of the logic. The process of encoding statements numerically is called Gödel numbering.
  3. Show that you can express meta-mathematical properties of logical statements in terms of arithmetic properties of their Gödel numbers. In particular, we need to build up the logical infrastructure that we need to talk about whether or not a statement is provable.
  4. Using meta-mathematical properties, show how you can create an unprovable statement encoded as a Gödel number.

What we’ve done so far is the first two steps, and part of the third. In earlier posts, we saw the form of the Principia’s logic that we’re using and how to numerically encode it as a Gödel numbering, and we started on the third point by figuring out just what it means to say that things are encoded arithmetically. Now we can get to the part where we see how to encode meta-mathematical properties in terms of arithmetic properties of the Gödel numbering. In this post, we’re going to build up everything we need to express syntactic correctness, logical validity, and provability in terms of arithmetical properties of Gödel numbers. (And, as a reminder, I’ve been using this translation of Gödel’s original paper on incompleteness.)

This is the most complex part of the incompleteness proof. The basic concept of what we’re doing is simple, but the mechanics are very difficult. What we want to do is define a set of predicates about logical statements, but we want those predicates to be expressed as arithmetic properties of the numerical representations of the logical statements.

The point of this is that we’re showing that done in the right way, arithmetic is logic – that doing arithmetic on the Gödel numbers is doing logical inference. So what we need to do is build up a toolkit that shows us how to understand and manipulate logic-as-numbers using arithmetic. As we saw in the last post, primitive recursion is equivalent to arithmetic – so if we can show how all of the properties/predicates that we define are primitive recursive, then they’re arithmetic.

This process involves a lot of steps, each of which is building the platform for the steps that follow it. I struggled quite a bit figuring out how to present these things in a comprehensible way. What I ended up with is writing them out as code in a pseudo-computer language. Before inventing this language, I tried writing actual executable code, first in Python and then in Haskell, but I wasn’t happy with the clarity of either one.

Doing it in an unimplemented language isn’t as big a problem as you might think. Even if this was all executable, you’re not going to be able to actually run any of it on anything real – at least not before your hair turns good and gray. The way that this stuff is put together is not what any sane person would call efficient. But the point isn’t to be efficient: it’s to show that this is possible. This code is really all about searching; if we wanted to be efficient, this could all be done in a different representation, with a different search method that was a lot faster – but that would be harder to understand.

So, in the end, I threw together a simple language that’s easy to read. This language, if it were implemented, wouldn’t really even be Turing complete – it’s a primitive recursive language.

Basics

We’ll start off with simple numeric properties that have no obvious connection to the kinds of meta-mathematical statements that we want to talk about, but we’ll use those to define progressively more and more complex and profound properties, until we finally get to our goal.

# divides n x == True if n divides x without remainder.
pred divides(n, x) = x mod n == 0

pred isPrime(0) = False
pred isPrime(1) = False
pred isPrime(2) = True
pred isPrime(n) = {
  all i in 2 to n - 1 {
    not divides(i, n)
  }
}

fun fact(0) = 1
fun fact(n) = n * fact(n - 1)

Almost everything we’re going to do here is built on a common idiom. For anything we want to do arithmetically, we’re going to find a bound – a maximum numeric value for it. Then we’re going to iterate over all of the values smaller than that bound, searching for our target.

For example, what’s the nth prime factor of x? Obviously, it’s got to be smaller than x, so we’ll use x as our bound. (A better bound would be the square root of x, but it doesn’t matter. We don’t care about efficiency!) To find the nth prime factor, we’ll iterate over all of the numbers smaller than our bound x, and search for the smallest number which is prime, which divides x, and which is larger than the n-1th prime factor of x. We’ll translate that into pseudo-code:

fun prFactor(0, x) = 0
fun prFactor(n, x) = {
  first y in 1 to x {
    isPrime(y) and divides(y, x) and prFactor(n - 1, x) < y
  }
}
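
Since this pseudo-language isn't implemented, it might help to see the same idiom in something you can actually run. Here's a minimal Python sketch of the bounded-search pattern; the names first_in and pr_factor, and the convention of returning 0 when the search fails, are my own choices for illustration, not part of the original:

# Bounded search: "first y in lo to hi { P(y) }" from the pseudo-language.
def first_in(lo, hi, pred):
    """Return the smallest value in [lo, hi] satisfying pred, or 0 if none does."""
    for y in range(lo, hi + 1):
        if pred(y):
            return y
    return 0

def divides(n, x):
    return n != 0 and x % n == 0

def is_prime(n):
    return n >= 2 and all(not divides(i, n) for i in range(2, n))

def pr_factor(n, x):
    """The nth prime factor of x, found by brute-force bounded search."""
    if n == 0:
        return 0
    return first_in(1, x, lambda y: is_prime(y)
                                    and divides(y, x)
                                    and pr_factor(n - 1, x) < y)

# The prime factors of 360 = 2**3 * 3**2 * 5 are 2, 3, and 5.
assert [pr_factor(n, 360) for n in (1, 2, 3)] == [2, 3, 5]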

Similarly, for extracting values from strings, we need to be able to ask, in general, what's the nth prime number? This is nearly identical to prFactor above. The only difference is that we need a different bound. Fortunately, we know that the nth prime number can't be larger than the factorial of the previous prime plus 1.

fun nthPrime(0) = 0
fun nthPrime(n) = {
  first y in 1 to fact(nthPrime(n - 1)) + 1  {
    isPrime(y) and y > nthPrime(n - 1)
  }
}

In composing strings of Gödel numbers, we use exponentiation. Given integers n and i, we can obviously compute n^i via primitive recursion. I'll define it below, but in the rest of this post, I'll just write it using the ** operator:

fun pow(n, 0) = 1
fun pow(n, i) = n * pow(n, i - 1)

String Composition and Decomposition

With those preliminaries out of the way, we can get to the point of defining something that's actually about one of the strings encoded in these Gödel numbers. Given a number x encoding a string, item(n, x) is the value of the nth character of x. (This is slow. This is really slow! We're getting to the limit of what a very powerful computer can do in a reasonable amount of time. But this doesn't matter. The point isn't that this is a good way of doing these things: it's that these things are possible. To give you an idea of just how slow this is, I started off writing the stuff in this post in Haskell. Compiled with GHC, which is a very good compiler, using item to extract the 6th character of an 8 character string took around 10 minutes on a 2.4GHz laptop.)

fun item(n, x) = {
  first y in 1 to x {
    divides(prFactor(n, x)**y, x) and
      not divides(prFactor(n, x)**(y+1), x)
  }
}

Given a string, we want to be able to ask how long it is; and given two strings, we want to be able to concatenate them.

fun length(x) = {
  first y in 1 to x {
    prFactor(y, x) > 0 and prFactor(y + 1, x) == 0
  }
}

fun concat(x, y) = {
  val lx = length(x)
  val ly = length(y)

  first z in 1 to nthPrime(lx + ly)**(x + y) {
    (all n in 1 to lx {
        item(n, z) == item(n, x)
     }) and (all n in 1 to ly {
        item(n + lx, z) == item(n, y)
      })
  }
}

fun concatl([]) = 0
fun concatl(xs) = {
  concat(head(xs), concatl(tail(xs)))
}

fun seq(x) = 2**x
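
To make the encoding concrete, here's a runnable Python sketch of the same string operations. It computes everything directly (trial division, re-encoding) instead of by bounded search, because the point here is just to show what the prime-power encoding means; the helper name encode_string is my own invention, and symbol codes are assumed to be at least 1, as they are in this numbering:

def is_prime(n):
    return n >= 2 and all(n % i != 0 for i in range(2, n))

def nth_prime(n):
    """The nth prime, 1-indexed: nth_prime(1) == 2."""
    count, p = 0, 1
    while count < n:
        p += 1
        if is_prime(p):
            count += 1
    return p

def encode_string(symbols):
    """Encode [s1, ..., sk] as nth_prime(1)**s1 * ... * nth_prime(k)**sk."""
    x = 1
    for i, s in enumerate(symbols, start=1):
        x *= nth_prime(i) ** s
    return x

def item(n, x):
    """The nth symbol of the encoded string x: the exponent of the nth prime."""
    p, e = nth_prime(n), 0
    while x % p == 0:
        x, e = x // p, e + 1
    return e

def length(x):
    """The number of symbols: how many consecutive primes (from 2 up) divide x."""
    n = 1
    while item(n, x) != 0:
        n += 1
    return n - 1

def concat(x, y):
    """Concatenate two encoded strings by re-encoding y's symbols after x's."""
    return encode_string([item(i, x) for i in range(1, length(x) + 1)] +
                         [item(i, y) for i in range(1, length(y) + 1)])

def seq(s):
    """The one-symbol string containing just s: 2**s."""
    return 2 ** s

# The string [3, 1, 4] encodes as 2**3 * 3**1 * 5**4 = 15000.
x = encode_string([3, 1, 4])
assert x == 15000
assert [item(i, x) for i in (1, 2, 3)] == [3, 1, 4]
assert length(x) == 3
assert concat(seq(3), encode_string([1, 4])) == x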

We want to be able to build statements represented as numbers from other statements represented as numbers. We'll define a set of functions that compose new strings from other strings, and predicates that check whether a particular string is a particular kind of syntactic element.

# x is a variable of type n.
pred vtype(n, x) = {
  some z in 17 to x {
    isPrime(z) and x == z**n
  }
}

# x is a variable
pred isVar(x) = {
  some n in 1 to x {
    vtype(n, x)
  }
}

fun paren(x) =
  concatl([seq(11), x, seq(13)])

# given the Gödel number for a statement x, find
# the Gödel number for not x.
fun gnot(x) =
  concat(seq(5), paren(x))

# Create the number for x or y.
fun gor(x, y) =
  concatl([paren(x), seq(7), paren(y)])

# Create the number for 'forall x(y)'.
fun gforall(x, y) =
  concatl([seq(9), seq(x), paren(y)])

# Create the number for x with n invocations of the primitive
# successor function.
fun succn(0, x) = x
fun succn(n, x) = concat(seq(3), succn(n - 1, x))

# Create the number n using successor and 0.
fun gnumber(n) = succn(n, seq(1))
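
Continuing the Python sketch (reusing seq, concat, item, length, and encode_string from above), these constructors are just concatenations of the constant codes that the pseudo-code has already fixed: 1 for zero, 3 for successor, 5 for not, 7 for or, 9 for forall, and 11/13 for the parentheses; a type-n variable is a prime greater than 13 raised to the nth power. The particular variables chosen below are purely for illustration:

def concatl(xs):
    """Concatenate a list of encoded strings, left to right."""
    result = xs[0]
    for x in xs[1:]:
        result = concat(result, x)
    return result

def paren(x):       return concatl([seq(11), x, seq(13)])
def gnot(x):        return concat(seq(5), paren(x))
def gor(x, y):      return concatl([paren(x), seq(7), paren(y)])
def gforall(v, y):  return concatl([seq(9), seq(v), paren(y)])

def gnumber(n):
    """The numeral n: n applications of successor (code 3) to zero (code 1)."""
    x = seq(1)
    for _ in range(n):
        x = concat(seq(3), x)
    return x

# x1 is a type-1 variable (a prime > 13 to the 1st power); p2 is a type-2
# variable (a prime > 13 squared).
x1 = 17
p2 = 19 ** 2

# The elementary formula p2(x1), then the formula forall x1 (not (p2(x1))).
elem = concat(seq(p2), paren(seq(x1)))
g = gforall(x1, gnot(elem))
assert [item(i, g) for i in range(1, length(g) + 1)] == \
    [9, 17, 11, 5, 11, 361, 11, 17, 13, 13, 13]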

# Check if a statement is type-1.
pred stype_one(x) = {
  some m in 1 to x {
    some n in 0 to x {
      (m == 1 or vtype(1, m)) and x == succn(n, seq(m))
    }
  }
}

# Check if a statement is type n.
pred stype(1, x) = stype_one(x)
pred stype(n, x) = {
  some v in 1 to x {
    vtype(n, v) and x == seq(v)
  }
}

That last predicate refers to R(v), which the translation of Gödel that I'm using never defines. It turns out not to be an error: in Gödel's original paper, R(v) is defined as 2^v – the sequence containing just the single sign v – which is exactly our seq, so I've written it as seq(v) above.

Formulae

Using what we've defined so far, we're now ready to start defining formulae in the basic Principia logic. Formulae are strings, but they're strings with a constrained syntax.

pred elFm(x) = {
  some y in 1 to x {
    some z in 1 to x {
      some n in 1 to x {
        stype(n, y) and stype(n+1, z) and x == concat(z, paren(y))
      }
    }
  }
}

All this is doing is expressing the grammar rule in arithmetic form: an elementary formula is a predicate application P(x), where x is a sign of type n, and P is a variable of type n + 1.

The next grammar rule that we encode this way says how we can combine elementary formulae using operators. There are three operators: negation, disjunction, and universal quantification.

pred op(x, y, z) = {
  x == gnot(y) or
  x == gor(y, z) or
  (some v in 1 to x { isVar(v) and x == gforall(v, y) })
}

And now we can start getting complex. We're going to define the idea of a valid sequence of formulae. x is a valid sequence of formulae when it's formed from a collection of formulae, each of which is either an elementary formula, or is produced from the formulae which occurred before it in the sequence using either negation, logical-or, or universal quantification.

In terms of a more modern way of talking about it, the syntax of the logic is a grammar. A formula sequence, in this system, is another way of writing the parse-tree of a statement: the sequence is the parse-tree of the last statement in the sequence.

pred fmSeq(x) = {
  all n in 0 to length(x) {
    elFm(item(n, x)) or
      some p in 0 to (n - 1) {
        some q in 0 to (n - 1) {
          op(item(n, x), item(p, x), item(q, x))
        }
      }
  }
}

The next one bugs me, because it seems wrong, but it isn't really! It's a way of encoding the fact that a formula is the result of a well-defined sequence of formulae. To keep everything primitive recursive, we always work in terms of sequences of formulae, where the later formulae are produced from the earlier ones. The goal of a sequence of formulae is to produce the last formula in the sequence. What this predicate is really saying is that a string is a valid formula if there is some sequence of formulae where it's the last one in the sequence.

Rephrasing that in grammatical terms, a string is a formula if there is a valid parse tree for the grammar that produces the string.

pred isFm(x) = {
  some n in 1 to nthPrime(length(x)**2)**(x*length(x)**2) {
    fmSeq(n) and item(length(n), n) == x
  }
}

So, now, can we say that a statement is valid because it's parsed according to the grammar? Not quite. It's actually a familiar problem for people who write compilers. When you parse a program in some language, the grammar doesn't usually specify that variables must be declared before they're used. It's too hard to get that into the grammar. In this logic, we've got almost the same problem: the grammar hasn't restricted us to only using bound variables. So we need to have ways to check whether a variable is bound in a Gödel-encoded formula, and then use that to check the validity of the formula.

# The variable v is bound in formula x at position n.
pred bound(v, n, x) = {
  isVar(v) and isFm(x) and
  (some a in 1 to x {
    some b in 1 to x {
      some c in 1 to x {
        x == concatl([a, gforall(v, b), c]) and
        isFm(b) and
        length(a) + 1 ≤ n ≤ length(a) + length(gforall(v, b))
      }
    }
  })
}

# The variable v is free in formula x at position n.
pred free(v, n, x) = {
  isVar(v) and isFm(x) and
  v == item(n, x) and n ≤ length(x) and not bound(v, n, x)
}

pred free(v, x) = {
  some n in 1 to length(x) {
    free(v, n, x)
  }
}

To do logical inference, we need to be able to do things like replace a variable with a specific inferred value. We'll define how to do that:

# replace the item at position n in x with y.
fun insert(x, n, y) = {
  first z in 1 to nthPrime(length(x) + length(y))**(x+y) {
    some u in 1 to x {
      some v in 1 to x {
        x == concatl([u, seq(item(n, x)), v]) and
        z == concatl([u, y, v]) and
        n == length(u) + 1
      }
    }
  }
}

There are inference operations and validity checks that we can only do if we know whether a particular variable is free at a particular position.

# freePlace(k, v, x) is the k+1st place in x (counting from the end)
# where v is free.
fun freePlace(0, v, x) = {
  first n in 1 to length(x) {
    free(v, n, x) and
    not some p in n + 1 to length(x) {
      free(v, p, x)
    }
  }
}

fun freePlace(k, v, x) = {
  first n in 1 to freePlace(k - 1, v, x) - 1 {
    free(v, n, x) and
    not some p in n + 1 to freePlace(k - 1, v, x) - 1 {
      free(v, p, x)
    }
  }
}

# number of places where v is free in x
fun nFreePlaces(v, x) = {
  first n in 1 to length(x) {
    freePlace(n, v, x) == 0
  }
}

In the original logic, some inference rules are defined in terms of a primitive substitution operator, which we write as subst[v/c](a) to mean substitute the value c for the variable v in the statement a. We'll build that up in a couple of steps, using the freePlace and nFreePlaces functions that we just defined.

# subst1(k, x, v, y) replaces the first k free instances of v in x
# (counting from the end) with y.
fun subst1(0, x, v, y) = x
fun subst1(k, x, v, y) =
  insert(subst1(k - 1, x, v, y), freePlace(k - 1, v, x), y)

# subst replaces all free instances of v with y.
fun subst(x, v, y) = subst1(nFreePlaces(v, x), x, v, y)

The next thing we're going to do isn't, strictly speaking, absolutely necessary. Some of the harder stuff we want to do will be easier to write using things like implication, which aren't built-in primitives of the Principia logic. To write those as clearly as possible, we'll define the full suite of usual logical operators in terms of the primitives.

# implication
fun gimp(x, y) = gor(gnot(x), y)

# logical and
fun gand(x, y) = gnot(gor(gnot(x), gnot(y)))

# iff (if and only if)
fun gequiv(x, y) = gand(gimp(x, y), gimp(y, x))

# existential quantification
fun gexists(v, y) = gnot(gforall(v, gnot(y)))
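
In the running Python sketch, these are one-liners on top of the constructors defined earlier (reusing gnot, gor, and gforall from above; note that gexists wraps its body in gnot, the encoded negation, not a host-language not):

def gimp(x, y):    return gor(gnot(x), y)                # x implies y
def gand(x, y):    return gnot(gor(gnot(x), gnot(y)))    # x and y
def gequiv(x, y):  return gand(gimp(x, y), gimp(y, x))   # x if and only if y
def gexists(v, y): return gnot(gforall(v, gnot(y)))      # there exists v such that y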

Axioms

The Peano axioms are valid logical statements, so they have Gödel numbers in this system. We could compute their value, but why bother? We know that they exist, so we'll just give them names, and define a predicate to check if a value matches them.

The form of the Peano axioms used in the incompleteness proof is:

  1. Zero: ¬(succ(x1) = 0)
  2. Uniqueness: succ(x1) = succ(y1) ⇒ x1 = y1
  3. Induction: x2(0) ∧ ∀x1(x2(x1) ⇒ x2(succ(x1))) ⇒ ∀x1(x2(x1))

const pa1 = ...
const pa2 = ...
const pa3 = ...

pred peanoAxiom(x) =
  (x == pa1) or (x == pa2) or (x == pa3)

Similarly, we know that the propositional axioms must have numbers. The propositional
axioms are:

  1. p ∨ p ⇒ p
  2. p ⇒ p ∨ q
  3. p ∨ q ⇒ q ∨ p
  4. (p ⇒ q) ⇒ (r ∨ p ⇒ r ∨ q)

I'll show the translation of the first - the rest follow the same pattern.

# Check if x is a statement that is a form of propositional
# axiom 1: y or y => y
pred prop1Axiom(x) = {
  some y in 1 to x {
    isFm(y) and x == gimp(gor(y, y), y)
  }
}

pred prop2Axiom(x) = ...
pred prop3Axiom(x) = ...
pred prop4Axiom(x) = ...
pred propAxiom(x) = prop1Axiom(x) or prop2Axiom(x) or
    prop3Axiom(x) or prop4Axiom(x)

Similarly, all of the other axioms are written out in the same way, and we'll add a predicate isAxiom to check if something is an axiom. Next are the quantifier axioms, which are complicated, so I'll only write out one of them - the other follows the same basic scheme.

The two quantifier axioms are:

  1. ∀v(a) ⇒ subst[v/c](a)
  2. ∀v(b ∨ a) ⇒ (b ∨ ∀v(a))

# z is free for v in y: no sign of z is bound in y at a place where v is free.
pred quantifier_axiom1_condition(z, y, v) = {
  not some n in 1 to length(y) {
    some m in 1 to length(z) {
      some w in 1 to z {
         w == item(m, z) and bound(w, n, y) and free(v, n, y)
      }
    }
  }
}

pred quantifier1Axiom(x) = {
  some v in 1 to x {
    some y in 1 to x {
      some z in 1 to x {
        some n in 1 to x {
          vtype(n, v) and stype(n, z) and
          isFm(y) and
          quantifier_axiom1_condition(z, y, v) and
          x == gimp(gforall(v, y), subst(y, v, z))
        }
      }
    }
  }
}

pred quantifier2Axiom(x) = ...
pred quantifierAxiom(x) = quantifier1Axiom(x) or quantifier2Axiom(x)

We need to define a predicate for the reducibility axiom (basically, the Principia's version of the ZFC axiom of comprehension). The reducibility axiom is a schema: for any predicate a, ∃u(∀v(u(v) ⇔ a)). In our primitive recursive system, we can check if something is an instance of the reducibility axiom schema with:

pred reduAxiom(x) = {
  some u in 1 to x {
    some v in 1 to x {
      some y in 1 to x {
        some n in 1 to x {
          vtype(n, v) and
          vtype(n+1, u) and
          not free(u, y) and
          isFm(y) and
          x == gexists(u, gforall(v, gequiv(concat(seq(u), paren(seq(v))), y)))
        }
      }
    }
  }
}

Now, the set axiom. In the logic we're using, this is the axiom that defines set equality. It's written as ∀x1(x2(x1) ⇔ y2(x1)) ⇒ x2 = y2. Set equality is defined for all types of sets, so we need to have one version of the axiom for each type level. We do that using type-lifting: we say that the axiom is true for type-1 sets, and that any type-lift of the level-1 set axiom is also a version of the set axiom.

fun typeLift(n, x) = {
  first y in 1 to x**(x**n) {
    all k in 1 to length(x) {
      (item(k, x) ≤ 13 and item(k, y) == item(k, x)) or
      (item(k, x) > 13 and item(k, y) == item(k, x) * prFactor(1, item(k, x))**n)
    }
  }
}

We haven't defined the type-1 set axiom. But we just saw the axiom above, and it's obviously a simple logical statement, which means that it's got a Gödel number. Instead of computing it, we'll just say that that number is called sa1. Now we can define a predicate to check if something is a set axiom:

val sa1 = ...
pred setAxiom(x) = {
  some n in 1 to x {
    x == typeLift(n, sa1)
  }
}

We've now defined all of the axioms of the logic, so we can now create a general predicate to see if a statement fits into any of the axiom categories:

pred isAxiom(x) =
  peanoAxiom(x) or propAxiom(x) or quantifierAxiom(x) or
  reduAxiom(x) or setAxiom(x)

Proofs and Provability!

With all of the axioms expressible in primitive recursive terms, we can start on what it means for something to be provable. First, we'll define what it means for some statement x to be an immediate consequence of some statements y and z. (Back when we talked about the Principia's logic, we said that x is an immediate consequence of y and z if either y is the formula z ⇒ x, or x is the formula ∀v.y for some variable v.)

pred immConseq(x, y, z) = {
  y == gimp(z, x) or
  some v in 1 to x {
    isVar(v) and x == gforall(v, y)
  }
}

Now, we can use our definition of an immediate consequence to specify when a sequence of formulae is a proof figure. A proof figure is a sequence of statements where each statement in it is either an axiom, or an immediate consequence of two of the statements that preceded it.

pred isProofFigure(x) = {
  (all n in 0 to length(x) {
    isAxiom(item(n, x)) or
    some p in 0 to n - 1 {
      some q in 0 to n - 1 {
        immConseq(item(n, x), item(p, x), item(q, x))
      }
    }
  }) and
  length(x) > 0
}

We can say that x is a proof of y if x is a proof figure, and the last statement in x is y.

pred proofFor(x, y) =
  isProofFigure(x) and
  item(length(x), x) == y
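
Here's a hedged Python sketch of just this proof-checking layer. To keep it runnable, the proof figure is a plain Python list of statement numbers rather than a single Gödel number (the real construction packs the list into one number with the same prime-power trick, which is why the numbers get so astronomically large), and is_axiom and imm_conseq are passed in as stand-ins for the predicates defined above:

def is_proof_figure(stmts, is_axiom, imm_conseq):
    """Each statement is an axiom, or an immediate consequence of two
    statements appearing earlier in the sequence."""
    if not stmts:
        return False
    return all(
        is_axiom(s) or any(imm_conseq(s, stmts[p], stmts[q])
                           for p in range(n) for q in range(n))
        for n, s in enumerate(stmts)
    )

def proof_for(stmts, y, is_axiom, imm_conseq):
    """stmts is a proof of y: a proof figure whose last statement is y."""
    return is_proof_figure(stmts, is_axiom, imm_conseq) and stmts[-1] == y

# Toy demo with opaque statement codes: from P and P => Q, conclude Q.
P, Q, IMP_PQ = 101, 102, 103          # stand-ins, not real Gödel numbers

def toy_imm_conseq(x, y, z):
    # x follows from y and z when y is the statement "z => x".
    return y == IMP_PQ and z == P and x == Q

assert proof_for([P, IMP_PQ, Q], Q,
                 lambda s: s in {P, IMP_PQ}, toy_imm_conseq)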

Finally, we can get to the most important thing! We can define what it means for something to be provable! It's provable if there's a proof for it!

pred provable(x) = {
  some y {
    proofFor(y, x)
  }
}

Note that this last one is not primitive recursive! There's no way that we can create a bound for this: a proof can be any length.

At last, we're done with these definitions. What we've done here is really amazing: now, every logical statement can be encoded as a number. Every proof in the logic can be encoded as a sequence of numbers: if something is provable in the Principia logic, we can encode that proof as a string of numbers, and check the proof for correctness using nothing but (a whole heck of a lot of) arithmetic!

Next post, we'll finally get to the most important part of what Gödel did. We've been able to define what it means for a statement to be provable - we'll use that to show that there's a way of creating a number encoding the statement that something is not provable. And we'll show how that means that there is a true statement in the Principia's logic which isn't provable using the Principia's logic, which means that the logic isn't complete.

In fact, the proof that we'll do shows a bit more than that. It doesn't just show that the Principia's logic is incomplete. It shows that any consistent formal system like the Principia, any system which is powerful enough to encode Peano arithmetic, must be incomplete.