Category Archives: Good Math

Defining Properties Arithmetically (part 1): Gödel and Primitive Recursion

When I left off, we’d seen how to take statements written in the logic of the Principia Mathematica, and convert them into numerical form. What we need to see now is how to get meta-mathematical.

We want to be able to write second-order logical statements. The basic trick to incompleteness is that we’re going to use the numerical encoding of statements to say that a predicate or relation is represented by a number. Then we’re going to write predicates about predicates by defining predicates on the numerical representations of the first-order predicates. That’s going to let us create a true statement in the logic that can’t be proven with the logic.

To do that, we need to figure out how to take our statements and relations represented as numbers, and express properties of those statements and relations in terms of arithmetic. To do that, we need to define just what it means to express something arithmetically. Gödel did that by defining “arithmetically” in terms of a concept called primitive recursion.

I learned about primitive recursion when I studied computational complexity. Nowadays, it’s seen as part of theoretical computer science. The idea, as we express it in modern terms, is that there are many different classes of computable functions. Primitive recursion is one of the basic complexity classes. You don’t need a Turing machine to compute primitive recursive functions – they’re a simpler class.

The easiest way to understand primitive recursion is that it’s what you get in a programming language with integer arithmetic and simple for-loops. The only way you can iterate is by repeating things a bounded number of times. Primitive recursion has a lot of interesting properties; the two key ones for our purposes here are that number-theoretic proofs are primitive recursive, and that every computation of a primitive recursive function is guaranteed to complete within a bounded amount of time.

The formal definition of primitive recursion, the way that Gödel wrote it, is quite a bit more complex than that. But it means the same thing.

We start with what it means to define a formula via primitive recursion. (Note the language that I used there: I’m not explaining what it means for a function to be primitive recursive; I’m explaining what it means to be defined via primitive recursion.) And I’m defining formulae, not functions. In Gödel’s proof, we’re always focused on numerical reasoning, so we’re not going to talk about programs or algorithms; we’re going to talk about the definition of formulae.

A formula phi(x_1, x_2, ..., x_n) is defined via primitive recursion if, for some other formulae psi and mu:

  • Base: phi(0, x_2, ..., x_n) = psi(x_2, ..., x_n)
  • Recursive: phi(i+1, x_2, ..., x_n) = mu(i, phi(i, x_2, ..., x_n), x_2, ..., x_n).

So, basically, the first parameter is a bound on the number of times that phi can be invoked recursively. When it’s 0, you can’t invoke phi any more.
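
Here’s what that definition looks like for ordinary addition – a minimal Scala sketch (my own illustration, not Gödel’s notation), with the first parameter playing exactly the role of the recursion bound:

    object PrimitiveRecursionExample {
      def succ(n: Int): Int = n + 1

      // add(i, x) defined via primitive recursion:
      //   base:      add(0, x)     = x                 (psi is just the identity)
      //   recursive: add(i + 1, x) = succ(add(i, x))   (mu applies succ to the previous value)
      def add(i: Int, x: Int): Int =
        if (i == 0) x              // base case
        else succ(add(i - 1, x))   // recursive case, bounded by the first argument

      def main(args: Array[String]): Unit = {
        println(add(3, 4))         // prints 7
      }
    }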

A formula is primitive recursive if it is defined from a collection of formulae phi_1, ..., phi_n, where each formula phi_i is either defined via primitive recursion from phi_1, ..., phi_{i-1}, or is the primitive succ function from Peano arithmetic.

For any formula phi_i in that sequence, the degree of the formula is the number of other primitive recursive formulae used in its definition.

Now, we can define a primitive recursive property: R(x_1, ..., x_n) is primitive recursive if and only if there exists a primitive recursive function phi such that R(x_1, ..., x_n) holds if and only if phi(x_1, ..., x_n) = 0.

With primitive recursive formulae and relations defined, there’s a bunch of theorems about how you can compose primitive recursive formulae and relations:

  1. Every function or relation that you get by substituting a primitive recursive function for a variable in a primitive recursive function/relation is primitive recursive.
  2. If R and S are primitive recursive relations, then ¬R, R∧S, and R∨S are all primitive recursive.
  3. If phi(x_1, ..., x_n) and psi(x_1, ..., x_n) are primitive recursive functions, then the relation R(x_1, ..., x_n) ⇔ (phi(x_1, ..., x_n) = psi(x_1, ..., x_n)) is also primitive recursive.
  4. Let xv and zv be finite-length tuples of variables. If the function phi(xv) and the relation R(y, zv) are primitive recursive, then so are the relations:
    • S(xv, zv) ⇔ (∃ y ≤ phi(xv). R(y, zv))
    • T(xv, zv) ⇔ (∀ y ≤ phi(xv). R(y, zv))
  5. Let xv and zv be finite-length tuples of variables, and let argmin[y ≤ phi(xv). R(y, zv)] be the smallest value of y ≤ phi(xv) for which R(y, zv) is true, or 0 if there is no such value. Then if the function phi(xv) and the relation R(y, zv) are primitive recursive, so is the function P(xv, zv) = argmin[y ≤ phi(xv). R(y, zv)]. (There’s a short sketch of this just below.)

By these definitions, addition, subtraction, multiplication, and integer division are all primitive recursive.
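
To make item 5 concrete: bounded minimization is just a loop with a fixed upper limit, which is why it stays primitive recursive. Here’s a minimal Scala sketch (my own illustration, not Gödel’s notation):

    object BoundedSearch {
      // argmin[y <= bound. r(y)]: the smallest y <= bound satisfying r, or 0 if there is none.
      // The search runs at most bound + 1 times, so it is always bounded.
      def argmin(bound: Int)(r: Int => Boolean): Int =
        (0 to bound).find(r).getOrElse(0)

      def main(args: Array[String]): Unit = {
        // The smallest divisor of 91 that's greater than 1, searched up to 91 itself.
        println(argmin(91)(y => y > 1 && 91 % y == 0))   // prints 7
      }
    }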

Ok. So, now we’ve got all of that out of the way. It’s painful, but it’s important. What we’ve done is come up with a good formal description of what it means for something to be an arithmetic property: if we can write it as a primitive recursive relation or formula, it’s arithmetic.

So now, finally, we’re ready to get to the really good part! Now that we know what it means to define something arithmetically, we can see how to define meta-mathematical properties and formulae arithmetically. Next post, hopefully tomorrow, I’ll start showing you arithmetic expressions of meta-math!

Gödel Numbering: Encoding Logic as Numbers

The first step in Gödel’s incompleteness proof was finding a way of taking logical statements and encoding them numerically. Looking at this today, it seems sort-of obvious. I mean, I’m writing this stuff down in a text file – that text file is a stream of numbers, and it’s trivial to convert that stream of numbers into a single number. But when Gödel was doing it, it wasn’t so obvious. So he created a really clever mechanism for numerical encoding. The advantage of Gödel’s encoding is that it makes it much easier to express properties of the encoded statements numerically.

Before we can look at how Gödel encoded his logic into numbers, we need to look at the logic that he used. Gödel worked with the specific logic variant used by the Principia Mathematica. The Principia logic is minimal and a bit cryptic, but it was built for a specific purpose: to have a minimal syntax, and a complete but minimal set of axioms.

The whole idea of the Principia logic is to be purely syntactic. The logic is expected to have a valid model, but you shouldn’t need to know anything about the model to use the logic. Russell and Whitehead were deliberately building a pure logic where you didn’t need to know what anything meant to use it. I’d really like to use Gödel’s exact syntax – I think it’s an interestingly different way of writing logic – but I’m working from a translation, and the translator updated the syntax. I’m afraid that switching between the older Gödel syntax, and the more modern syntax from the translation would just lead to errors and confusion. So I’m going to stick with the translation’s modernization of the syntax.

The basic building blocks of the logic are variables. Already this is a bit different from what you’re probably used to in a logic. When we think of logic, we usually consider predicates to be a fundamental thing. In this logic, they’re not. A predicate is just a shorthand for a set, and a set is represented by a variable.

Variables are stratified. Again, it helps to remember where Russell and Whitehead were coming from when they were writing the Principia. One of their basic motivations was avoiding self-referential statements like Russell’s paradox. In order to prevent that, they thought that they could create a stratified logic, where on each level, you could only reason about objects from the level below. A first-order predicate would be a second-level object that could only reason about first-level objects. A second-order predicate would be a third-level object which could reason about second-level objects. No predicate could ever reason about itself or anything on its own level. This leveling property is a fundamental property built into their logic. The way the levels work is:

  • Type one variables, which range over simple atomic values, like specific single natural numbers. Type-1 variables are written as a1, b1.
  • Type two variables, which range over sets of atomic values, like sets of natural numbers. A predicate, like IsOdd, about specific natural numbers would be represented as a type-2 variable. Type-2 variables are written a2, b2, …
  • Type three variables range over sets of sets of atomic values. The individual mappings of a function could be represented as type-3 values: in set-theoretic terms, a function is a set of ordered pairs. Ordered pairs, in turn, can be represented as sets of sets – for example, the ordered pair (1, 4) would be represented by the set { {1}, {1, 4} }. A whole function, then, would be represented by a type-4 variable – a set of ordered pairs, which is a set of sets of sets of values.

Using variables, we can form simple atomic expressions, which in Gödel’s terminology are called signs. As with variables, the signs are divided into stratified levels:

  • Type-1 signs are variables, and successor expressions. Successor expressions are just Peano numbers written with “succ”: 0, succ(0), succ(succ(0)), succ(a1), etc.
  • Signs of any type greater than 1 are just variables of that type/level.

Now we can assemble the basic signs into formulae. Gödel explained how to build formulae in classic logician’s form, which I think is hard to follow, so I’ve converted it into a grammar:

 elementary_formula → sign_{n+1}(sign_n)
 formula → elementary_formula
 formula → ¬(formula)
 formula → (formula) ∨ (formula)
 formula → ∀ sign_n (formula)

That’s all that’s really included in Gödel’s logic. It’s enough: everything else can be defined in terms of combinations of these. For example, you can define logical and in terms of negation and logical or: (a ∧ b) ⇔ ¬(¬a ∨ ¬b).
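
Here’s one way to render that grammar as a Scala sketch (my own illustration – representing signs as plain strings is just a placeholder):

    object PrincipiaSyntax {
      sealed trait Formula
      case class Elementary(pred: String, arg: String) extends Formula   // sign_{n+1}(sign_n)
      case class Not(f: Formula) extends Formula                         // ¬(formula)
      case class Or(left: Formula, right: Formula) extends Formula       // (formula) ∨ (formula)
      case class ForAll(v: String, body: Formula) extends Formula        // ∀ sign_n (formula)

      // Logical "and", defined from the primitives: (a ∧ b) ⇔ ¬(¬a ∨ ¬b)
      def and(a: Formula, b: Formula): Formula = Not(Or(Not(a), Not(b)))
    }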

Next, we need to define the axioms of the system. In the way I think about logic, these axioms include both “true” axioms and the inference rules defining how the logic works. There are five families of axioms.

  • First, there are the Peano axioms, which define the natural numbers.
    1. ¬(succ(x1) = 0): 0 is a natural number, and it’s not the successor of anything.
    2. succ(x1) = succ(y1) ⇒ x1 = y1: successors are unique.
    3. (x2(0) ∧ ∀ x1 (x2(x1) ⇒ x2(succ(x1)))) ⇒ ∀ x1 (x2(x1)): induction works on the natural numbers.
  • Next, we’ve got a set of basic inference rules about simple propositions. These are defined as axiom schemata, which can be instantiated for any formulae p, q, and r.
    1. p ∨ p ⇒ p
    2. p ⇒ p ∨ q
    3. p ∨ q ⇒ q ∨ p
    4. (p ⇒ q) ⇒ ((p ∨ r) ⇒ (q ∨ r))
  • Axioms that define inference over quantification. v is a variable, a is any formula, b is any formula where v is not a free variable, and c is a sign of the same level as v which doesn’t have any free variables that would become bound if it were substituted for v.
    1. ∀ v (a) ⇒ subst[v/c](a): if formula a is true for all values of v, then you can substitute any specific value c for v in a, and a must still be true.
    2. (∀ v (b ∨ a)) ⇒ (b ∨ ∀ v (a))
  • The Principia’s version of the set-theoretic axiom of comprehension: ∃ u (∀ v (u(v) ⇔ a)).
  • And last but not least, an axiom defining set equivalence: ∀ x_i (x_{i+1}(x_i) ⇔ y_{i+1}(x_i)) ⇒ x_{i+1} = y_{i+1}.
    

    So, now, finally, we can get to the numbering. This is quite clever. We’re going to use the simplest encoding: for every possible string of symbols in the logic, we’re going to define a representation as a number. So in this representation, we are not going to get the property that every natural number is a valid formula: lots of natural numbers won’t be. They’ll be strings of nonsense symbols. (If we wanted to, we could make every number be a valid formula, by using a parse-tree based numbering, but it’s much easier to just let the numbers be strings of symbols, and then define a predicate over the numbers to identify the ones that are valid formulae.)

    We start off by assigning numbers to the constant symbols:

    Symbol    Numeric Representation
    0         1
    succ      3
    ¬         5
    ∨         7
    ∀         9
    (         11
    )         13

    Variables will be represented by powers of prime numbers greater than 13. For a prime number p > 13, p will represent a type-1 variable, p^2 will represent a type-2 variable, p^3 a type-3 variable, and so on.

    Using those symbol encodings, we can take a formula written as a sequence of symbols x_1 x_2 x_3 … x_n, and encode it numerically as the product 2^{x_1} * 3^{x_2} * 5^{x_3} * … * p_n^{x_n}, where p_n is the nth prime number.

    For example, suppose I wanted to encode the formula: ∀ x1 (y2(x1)) ∨ x2(x1).

    First, I’d need to encode each symbol:

    1. “∀” would be 9.
    2. “x1” = 17
    3. “(” = 11
    4. “y2” = 19^2 = 361
    5. “(” = 11
    6. “x1” = 17
    7. “)” = 13
    8. “∨” = 7
    9. “x2” = 17^2 = 289
    10. “(” = 11
    11. “x1” = 17
    12. “)” = 13
    13. “)” = 13

    The formula would thus be turned into the sequence: [9, 17, 11, 361, 11, 17, 13, 7, 289, 11, 17, 13, 13]. That sequence would then get turned into a single number 2^9 * 3^17 * 5^11 * 7^361 * 11^11 * 13^17 * 17^13 * 19^7 * 23^289 * 29^11 * 31^17 * 37^13 * 41^13, which according to Hugs is the number (warning: you need to scroll to see it. a lot!):

    1,821,987,637,902,623,701,225,904,240,019,813,969,080,617,900,348,538,321,073,935,587,788,506,071,830,879,280,904,480,021,357,988,798,547,730,537,019,170,876,649,747,729,076,171,560,080,529,593,160,658,600,674,198,729,426,618,685,737,248,773,404,008,081,519,218,775,692,945,684,706,455,129,294,628,924,575,925,909,585,830,321,614,047,772,585,327,805,405,377,809,182,961,310,697,978,238,941,231,290,173,227,390,750,547,696,657,645,077,277,386,815,869,624,389,931,352,799,230,949,892,054,634,638,136,137,995,254,523,486,502,753,268,687,845,320,296,600,343,871,556,546,425,114,643,587,820,633,133,232,999,109,544,763,520,253,057,252,248,982,267,078,202,089,525,667,161,691,850,572,084,153,306,622,226,987,931,223,193,578,450,852,038,578,983,945,920,534,096,055,419,823,281,786,399,855,459,394,948,921,598,228,615,703,317,657,117,593,084,977,371,635,801,831,244,944,526,230,994,115,900,720,026,901,352,169,637,434,441,791,307,175,579,916,097,240,141,893,510,281,613,711,253,660,054,258,685,889,469,896,461,087,297,563,191,813,037,946,176,250,108,137,458,848,099,487,488,503,799,293,003,562,875,320,575,790,915,778,093,569,309,011,025,000,000,000.
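
    If you want to play with the encoding, here’s a minimal Scala sketch of the prime-power product (my own illustration; the symbol codes are just the ones worked out above, and BigInt keeps the product from overflowing):

    object GodelNumbering {
      // Prime-power encoding: the i-th symbol code becomes the exponent of the i-th prime.
      def encode(symbolCodes: Seq[Int]): BigInt = {
        val primes = Iterator.from(2).filter(isPrime).take(symbolCodes.length).toList
        primes.zip(symbolCodes).map { case (p, code) => BigInt(p).pow(code) }.product
      }

      private def isPrime(n: Int): Boolean =
        n >= 2 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)

      def main(args: Array[String]): Unit = {
        // Symbol codes for ∀ x1 (y2(x1)) ∨ x2(x1), as listed above.
        val formula = Seq(9, 17, 11, 361, 11, 17, 13, 7, 289, 11, 17, 13, 13)
        println(encode(formula))
      }
    }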

    Next, we’re going to look at how you can express interesting mathematical properties in terms of numbers. Gödel used a property called primitive recursion as an example, so we’ll walk through a definition of primitive recursion, and show how Gödel expressed primitive recursion numerically.

    Gödel's Incompleteness

    I’ve mentioned Gödel’s incompleteness theorems many times on this blog, but I’ve never actually written about them in detail. I was asking, on twitter, for topics that readers would be interested in, and one thing that came up was actually going through the proof of incompleteness. I’m happy to take the time to do that: incompleteness is one of the most beautiful and profound proofs that I’ve ever seen.

    It takes a fair bit of effort though, so it’s going to take a couple of posts just to get through the preliminaries. What I’m going to do is work with this translation of the original paper where Gödel published his first incompleteness proof. Before we can get to the actual proof, we need to learn a bit about the particular kind of logic that he used in his proof.

    The story of incompleteness is really pretty interesting. In the late 19th and early 20th centuries, mathematicians started to get really interested in formal foundations. Why, exactly, this happened is somewhat of a debate, but the story I find most likely is that it’s a result of Cantor and his naive set theory.

    Set theory came along, and it seemed like such a great idea. It seems incredibly simple, and it makes many proofs in many different subject areas appear to be really simple. But there’s a huge problem: it’s not consistent.

    The problem is what’s come to be known as Russell’s paradox. I’ve explained Russell’s paradox a bunch of times on the blog. For this post, I’m going to do it a bit differently. As part of my prep for this series of posts, I’ve been reading the book Gödel’s Proof, and in that book, they have a way of rephrasing Russell’s paradox which I think clarifies it nicely, so I’m going to try their version.

    If you think about the set of all sets, you can partition it into two non-overlapping subsets. There’s a subset of normal sets, and a subset of abnormal sets. What makes a set abnormal is that it’s self-referential – an abnormal set contains itself as a member.

    So consider the set of all normal sets. Is it, itself, a normal set?

    The answer is: argh! Because if the set of all normal sets is normal, then it must be a member of itself. But if it’s a member of itself, it can’t be normal. Similarly, if you start off saying that the set of all normal sets isn’t normal, then by definition, it is normal. No matter what you do, you’re screwed.

    What Russell’s paradox did was show that there was a foundational problem in math. You could develop what appeared to be a valid mathematical structure and theory, only to later discover that all the work you did was garbage, because there was some non-obvious fundamental inconsistency in how you defined it. But the way that foundations were treated simply wasn’t strong or formal enough to be able to detect, right up front, whether you’d hard-wired an inconsistency into your theory. So foundations had to change, to prevent another Cantor incident.

    So, eventually, along came two mathematicians, Russell and Whitehead, and they created an amazing piece of work called the Principia Mathematica. The Principia was supposed to be an ideal, perfect mathematical foundation. It was supposed to have two key properties: it was supposed to be consistent, and it was supposed to be complete.

    • Consistent meant that the system would not allow any inconsistencies of any kind. If you used the logic and the foundations of the Principia, you couldn’t derive anything like Russell’s paradox: you couldn’t even write it as a valid statement.
    • Complete meant that every true statement was provably true, every false statement was provably false, and every statement was either true or false.

    A big part of how it did this was by creating a very strict stratification of logic. You could reason about a specific number using level-0 statements. You could reason about sets of numbers using level-1 statements. And you could reason about sets of sets of numbers using level-2 statements. In a level-1 statement, you could make meta-statements about level-0 properties, but you couldn’t reason about level-1. In level-2, you could reason about level-1 and level-0, but not about level-2. This meant that you couldn’t make a set like the set of normal sets – because that set is formed using a predicate, and when you’re forming a set like “the set of normal sets”, the sets you can reason about are a level below you. You can form a level-1 set using a level-1 statement about level-0 objects. But you can’t make a level-1 statement about level-1 objects! So self-referential systems would be completely impossible in the Principia’s logic.

    As a modern student of math, it’s hard to understand what a profound thing they were trying to do. We’ve grown up learning math long after incompleteness became a well-known fact of life. (I read “Gödel Escher Bach” when I was a college freshman – well before I took any particularly deep math classes – so I knew about incompleteness before I knew enough to really understand what completeness would have meant!) The Principia would have been the perfection of math, a final ultimate perfect system. There would have been nothing that we couldn’t prove, nothing in math that we couldn’t know!

    What Gödel did was show, using the Principia’s own system and its own syntax, that not only was the Principia itself flawed, but that any possible effort like the Principia would inevitably be flawed!

    With the incompleteness proof, Gödel showed that even in the Principia, even with all of the effort that it made to strictly separate the levels of reasoning, that he could form self-referential statements, and that those self-referential statements were both true and unprovable.

    The way that he did it was simply brilliant. The proof was a sequence of steps.

    1. He showed that using Peano arithmetic – that is, the basic definition of natural numbers and natural number arithmetic – you could take any Principia-logic statement and uniquely encode it as a number, so that every logical statement corresponded to its own unique number.
    2. Then, using that basic mechanic, he showed how you could take any property defined by a predicate in the Principia’s logic, and encode it as an arithmetic property of the numbers. So a number encoded a statement, and a property of that statement could be encoded arithmetically as a property of the number. A number, then, could be both a statement and a definition of an arithmetic property of a statement, and a logical description of a logical property of a statement – all at once!
    3. Using that encoding – which can be formed for any logic that can express Peano arithmetic – he showed that you could form a self-referential statement: a number that was a statement about numbers, including the number that was the statement itself. And more, it could encode a meta-property of the statement in a way that was both true and unprovable: he showed how to create a logical property “there is no proof of this statement”, applied to its own numeric encoding. So the statement said, about itself, that it couldn’t be proven.

    The existence of that statement meant that the Principia – and any similar system! – was incomplete. Completeness means that every true statement is provably true within the system. But the statement encodes the fact that it cannot be proven. If you could prove it, the system would be inconsistent. If you can’t, it’s consistent, but incomplete.

    We’re going to go through all of that in detail. But to do that in a way where we can really understand it, we’re going to need to start by looking at the logic used by the Principia. Then we’ll look at how to encode Principia-logical statements as natural numbers using a technique called Gödel numbering, and how to express predicates numerically. And then, finally, we can look at how you can form the classic Gödel statement that shows that the Principia’s system of logic is incomplete.

    Least Square Linear Regression

    There’s one topic I’ve been asked about multiple times, but which I’ve never gotten around to writing about. It happens to be one of the first math things that my dad taught me about: linear regression.

    Here’s the problem: you’re doing an experiment. You’re measuring one quantity as you vary another. You’ve got a good reason to believe that there’s a linear relationship between the two quantities. But your measurements are full of noise, so when you plot the data on a graph, you get a scattershot. How can you figure out what line is the best match to your data, and how can you measure how good the match is?

    When my dad taught me this, he was working for RCA manufacturing semiconductor chips for military and satellite applications. The focus of his work was building chips that would survive in the high-radiation environment of space – in the jargon, he was building radiation hard components. They’d put together a set of masks for an assembly line, and do a test run. Then they’d take the chips from that run, and they’d expose them to gamma radiation until they failed. That would give them a good way of estimating the actual radiation hardness of the run, and whether it was good enough for their customers. Based on a combination of theory and experience, they knew that the relationship they cared about was nearly linear: for a given amount of radiation, the number of specific circuitry failures was proportional to the amount of gamma exposure.

    [Graph: a scatter plot of the semi-randomly generated data points.] For example, here’s a graph of data points that I generated semi-randomly. The distribution of the points isn’t really what you’d get from real observations, but it’s good enough for demonstration.

    The way that we’d usually approach this is called least squares linear regression. The idea is that what we want to do is find the line for which the sum of the squares of the vertical distances between the chosen line and the measured data points is a minimum.

    For the purposes of this, we’ll say that one quantity is the independent variable, and we’ll call that x, and the other quantity is the dependent variable, and we’ll call that y. In theory, the dependent variable, as its name suggests, depends on the independent variable. In fact, we don’t always really know which value depends on the other, so we do our best to make an intelligent guess.

    So what we want to do is find a linear equation, y = mx + b where the mean-square distance is minimal. All we need to do is find values for m (the slope of the line) and b (the point where the line crosses the y axis, also called the y intercept). And, in fact, b is relatively easy to compute once we know the slope of the line. So the real trick is to find the slope of the line.

    The way that we do that is: first we compute the means of x and y, which we’ll call \overline{x} and \overline{y}. Then using those, we compute the slope as:

     m = \frac{\sum_{i=1}^n (x_i-\overline{x})(y_i-\overline{y})}{\sum_{i=1}^{n} (x_i-\overline{x})^2}

    Then for the y intercept: b = \overline{y} - m\overline{x}.

    In the case of this data: I set up the script so that the slope would be about 2.2 +/- 0.5. The slope in the figure is 2.54, and the y-intercept is 18.4.

    Now, we want to check how good the linear relationship is. There’s several different ways of doing that. The simplest is called the correlation coefficient, or r.

     r = \frac{\sum_{i=1}^n (x_i-\overline{x})(y_i - \overline{y})}{\sqrt{ \left(\sum_{i=1}^n (x_i-\overline{x})^2\right) \left(\sum_{i=1}^n (y_i - \overline{y})^2\right) }}

    If you look at this, it’s really a check of how well the variation in the measured values matches the variation you’d expect from a linear relationship. On the top, you’ve got a sum of products; on the bottom, you’ve got the square root of essentially the same quantities squared – which, in effect, strips the signs away. The end result is that if the correlation is perfect – that is, if the dependent variable increases linearly with the independent – then the correlation will be 1. If the dependent variable decreases linearly as the independent one increases, then the correlation will be -1. If there’s no relationship, then the correlation will be 0.

    For this particular set of data, I generated it with a linear equation plus a little bit of random noise. The correlation coefficient is slightly greater than 0.95, which is exactly what you’d expect.
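
    Here’s a minimal Scala sketch of the whole computation (the data set below is made up just to exercise the formulas; it isn’t the data from the figure):

    object LeastSquares {
      // Fit y = m*x + b by least squares, and compute the correlation coefficient r.
      def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double, Double) = {
        val n = xs.length
        val xMean = xs.sum / n
        val yMean = ys.sum / n
        val dx = xs.map(_ - xMean)
        val dy = ys.map(_ - yMean)
        val sxy = dx.zip(dy).map { case (a, b) => a * b }.sum
        val sxx = dx.map(d => d * d).sum
        val syy = dy.map(d => d * d).sum
        val m = sxy / sxx                    // slope
        val b = yMean - m * xMean            // y intercept
        val r = sxy / math.sqrt(sxx * syy)   // correlation coefficient
        (m, b, r)
      }

      def main(args: Array[String]): Unit = {
        val xs = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
        val ys = Seq(3.1, 5.2, 7.9, 9.8, 12.1)
        val (m, b, r) = fit(xs, ys)
        println(f"slope=$m%.3f intercept=$b%.3f r=$r%.3f")
      }
    }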

    When you see people use linear regression, there are a few common errors that you’ll see all the time.

    • No matter what your data set looks like, linear regression will find a line. It won’t tell you “Oops, I couldn’t find a match”. So the fact that you fit a line means absolutely nothing by itself. If you’re doing it right, you start off with a hypothesis based on prior plausibility for a linear relation, and you’re using regression as part of a process to test that hypothesis.
    • You don’t get to look at the graph before you do the analysis. What I mean by that is: if you look at the data, you’ll naturally notice some patterns. Humans are pattern seekers – we’re really good at noticing them. And almost any data set that you look at carefully enough will contain some patterns purely by chance. If you look at the data, and there’s a particular pattern that you want to see, you’ll probably find a way to look at the data that produces that pattern. For example, in the first post on this blog, I was looking at a shoddy analysis by some anti-vaxxers, who were claiming that they’d found an inflection point in the rate of autism diagnoses, and used linear regression to fit two lines – one before the inflection, one after. But that wasn’t supported by the data. It was random – the data was very noisy. You could fit different lines to different sections by being selective. If you picked one point in time, you’d get a steeper slope before that time, and a shallower one after. But by picking a different point, you could get a steeper slope after it instead. The point is, when you’re testing the data, you need to design the tests before you’ve seen the data, in order to keep your bias out!
    • A strong correlation doesn’t imply a truly linear relationship. If you fit a line to a bunch of data that’s not really linear, you can still get a strong positive (or negative) correlation. Correlation is really testing whether the data increases the way you’d expect it to, not whether the relationship is genuinely linear. Random data will have a near-zero correlation. Data where the dependent variable doesn’t vary consistently with the independent will have near-zero correlation. But there are plenty of ways of getting data where the dependent and independent variables increase together that produce a strong correlation even though the relationship isn’t linear. You need to do other things to judge the strength of the fit. (I might do some more posts on this kind of thing to cover some of that.)

    Hensel's Lemma: Newton's Method for p-Adic numbers

    This is, sadly, the last part of my posts about P-adic arithmetic. It’s not for lack of information, but rather for lack of competence on my part. The area of math that I’m worst at is analysis, and an awful lot of the interesting things that you can do with p-adic numbers is analysis.

    For an explanation of what p-adic numbers are, and how arithmetic on them works, you can refer to my first post on the p-adics, and for an explanation of p-adic metrics and distance, you can look at this post.

    Today, we’re going to look at Hensel’s lemma. This is really cool. Hensel’s lemma shows you a simple way of finding polynomial roots in p-adic arithmetic. It’s sometimes called Newton’s method for p-adics, which is both a good description of how it’s used and a complete misnomer, because the lemma is a description of a property, while Newton’s method is a description of a procedure.

    Hensel’s lemma says that if you’ve got a polynomial, and for a prime number p, you can find a root using modulo arithmetic using base p, then you can use the base-p root to find the corresponding root using modulo base p2, and given the root using base p2, you can find it for modulo p3, and so on!

    So far, this is just a statement about modulo arithmetic – not about p-adic arithmetic. But what ends up happening is that the result modulo p gives you a first approximation of the root in the p-adics. Using the process defined by Hensel’s lemma, you can take that root and extend it to a better approximation. It ends up being the case that each time you lift the result to a higher exponent of p, you’re performing the exact analog of one step in Newton’s method of finding roots! But it’s actually better than that. If you look at Hensel’s lemma in the p-adic numbers, then each step extends the approximation by exactly one digit!

    Let’s look at the lemma in formal terms:

    Suppose that you have a polynomial f(x).

    If there is an integer i and a prime number p such that f(i) = 0 mod p, then there is an integer j such that f(j) = 0 mod p^2. In general, if there’s an exponent k where f(i) = 0 mod p^k, then there’s a j where f(j) = 0 mod p^{k+1}.

    In fact, we can do even better than that! If we know the root i for an exponent k, then we can easily compute j using a process called lifting:

    j = i - f(i)/f'(i) (mod p^{k+1})

    (In this, f'(i) is the first derivative of f at i, and the division means multiplying by the inverse of f'(i) in mod-p^{k+1} arithmetic.)

    You can see, looking at that equation, that the derivative of f at i can’t be zero mod p, or else the denominator of the fraction wouldn’t be invertible, and everything would go kablooie in your face!

    In p-adic arithmetic, this becomes absolutely amazing. If i is the root mod p^k, then the root mod p^{k+1} is:

    j = i - f(i)/f'(i), with the division carried out in the p-adics.

    It can’t be simpler. And you can clearly see the relation to Newton’s method.

    For fun, let’s try looking at an example: let’s take the square root of 2 in 7-adic. In base-10, the square root of 2 is roughly 1.4142135624. In 7-adic, it’s going to be something completely different. We’ll start by phrasing it as a polynomial: x^2 - 2 = 0. So, what’s the solution to that mod 7? It’s 3: in mod-7 arithmetic, 3*3 = 9 = 2.

    So, our first approximation of a square root of 2 is 3.

    We can extend to the next digit using Hensel’s lemma: j = 3 - f(3)/f'(3) = 3 - (9 - 2)/6 = 3 - 7/6. We’re working mod 7^2 = 49 here, and 6 ≡ -1 mod 7, so 7/6 ≡ -7, which gives j = 3 + 7 = 10 – written in base 7, that’s 13.

    Repeated, we’d get 213, 6213, and so on.
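
    Here’s a minimal Scala sketch of that lifting process for f(x) = x^2 - 2 in the 7-adics (my own illustration – the polynomial is hard-coded just to keep the example short):

    object HenselLifting {
      // One lifting step for f(x) = x^2 - 2: given a root i of f mod p^k,
      // produce a root mod p^(k+1) via j = i - f(i)/f'(i), with the division
      // done by multiplying by the modular inverse of f'(i).
      def lift(i: BigInt, p: BigInt, k: Int): BigInt = {
        val modulus = p.pow(k + 1)
        val f = i * i - 2          // f(i)
        val fPrime = i * 2         // f'(i)
        (i - f * fPrime.modInverse(modulus)).mod(modulus)
      }

      def main(args: Array[String]): Unit = {
        val p = BigInt(7)
        var root = BigInt(3)       // 3^2 = 9 = 2 mod 7: the first approximation
        for (k <- 1 to 4) {
          root = lift(root, p, k)
          println(s"root mod 7^${k + 1}: $root (base 7: ${root.toString(7)})")
        }
      }
    }

    Running it prints the successive approximations 13, 213, 6213, … in base 7, matching the digits above.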

    Types Gone Wild! SKI at Compile-Time

    Over the weekend, a couple of my Foursquare coworkers and I were chatting on twitter, and one of my smartest coworkers, a great guy named Jorge Ortiz, pointed out that type inference in Scala (the language we use at Foursquare, and also pretty much my favorite language) is Turing complete.

    Somehow, I hadn’t seen this before, and it absolutely blew my mind. So I asked Jorge for a link to the proof. The link he sent me is a really beautiful blog post. It doesn’t just prove that Scala type inference is Turing complete, but it does it in a remarkably beautiful way.

    Before I get to the proof, what does this mean?

    A system is Turing complete when it can perform any possible computation that could be performed on any other computing device. The Turing machine is, obviously, Turing complete. So is lambda calculus, the Minsky machine, the Brainfuck computing model, and the Scala programming language itself.

    If type inference is Turing complete, then that means that you can write a Scala program where, in order to type-check the program, the compiler has to run an arbitrary program to completion. It means that there are, at least theoretically, Scala programs where the compiler will take forever – literally forever – to determine whether or not a given program contains a type error. Needless to say, I consider this to be a bad thing. Personally, I’d really prefer to see the type system be less flexible. In fact, I’d go so far as to say that this is a fundamental error in the design of Scala, and a big strike against it as a language. Having a type-checking system which isn’t guaranteed to terminate is bad.

    But let’s put that aside: Scala is pretty entrenched in the community that uses it, and they’ve accepted this as a tradeoff. How did the blog author, Michael Dürig, prove that Scala type checking is Turing complete? By showing how to implement a variant of lambda calculus called SKI combinator calculus entirely with types.

    SKI calculus is seriously cool. We know that lambda calculus is Turing complete. It turns out that for any lambda calculus expression, there’s a way of rewriting it without any variables, and without any lambdas at all, using three canonical master functions. If you’ve got those three, then you can write anything, anything at all. The three are called S, K, and I.

    • The S combinator is: S x y z = x z (y z).
    • The K combinator is: K x y = x.
    • The I combinator is: I x = x.

    They come from intuitionistic logic, where they’re fundamental axioms that describe how intuitionistic implication works. K is the rule A ⇒ (B ⇒ A); S is the rule (A ⇒ (B ⇒ C)) ⇒ ((A ⇒ B) ⇒ (A ⇒ C)); and I is A ⇒ A.

    Given any lambda calculus expression, you can rewrite it as a chain of SKIs. (If you’re interested in seeing how, please just ask in the comments; if enough people are interested, I’ll write it up.) What the author of the post did is show how to implement the S, K, and I combinators in Scala types.

    trait Term {
      type ap[x <: Term] <: Term
      type eval <: Term
    }
    

    He’s created a type Term, which is the supertype of any computable fragment written in this type-SKI. Since everything is a function, all terms have to have two methods: one of them is a one-parameter “function” which applies the term to a parameter, and the second is a “function” which simplifies the term into canonical form.

    He implements the S, K, and I combinators as traits that extend Term. We’ll start with the simplest one, the I combinator.

    trait I extends Term {
      type ap[x <: Term] = I1[x]
      type eval = I
    }
    
    trait I1[x <: Term] extends Term {
      type ap[y <: Term] = eval#ap[y]
      type eval = x#eval
    }
    

    I needs to take a parameter, so its apply type-function takes a parameter x, and returns a new type I1[x] which has the parameter encoded into it. Evaluating I1[x] does exactly what you’d want from the I combinator with its parameter – it returns it.

    The apply “method” of I1 looks strange. What you have to remember is that in lambda calculus (and in the SKI combinator calculus), everything is a function – so even after evaluating I.ap[x] to some other type, it’s still a type function. So it still needs to be applicable. Applying it is exactly the same thing as applying its parameter.

    So if you have any type A, and you write something like var a : I.ap[A].eval, the type of a will evaluate to A. If you apply I.ap[A].ap[Z], it’s equivalent to taking the result of evaluating I.ap[A], giving you A, and then applying that to Z.
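
    As a rough illustration (my own sketch, not from the original post – it assumes Scala 2-style type projections, and the trait A below is just a made-up "constant" term for the example):

    trait A extends Term {
      type ap[x <: Term] = A   // applying the dummy term to anything just gives A back
      type eval = A
    }

    object IdentityCheck {
      // This line compiles only if applying I to A really does evaluate back to A.
      implicitly[I#ap[A]#eval =:= A]
    }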

    The K combinator is much more interesting:

    // The K combinator
    trait K extends Term {
      type ap[x <: Term] = K1[x]
      type eval = K
    }
    
    trait K1[x <: Term] extends Term {
      type ap[y <: Term] = K2[x, y]
      type eval = K1[x]
    }
    
    trait K2[x <: Term, y <: Term] extends Term {
      type ap[z <: Term] = eval#ap[z]
      type eval = x#eval
    }
    

    It's written in curried form: there's a type trait K, which takes a parameter and returns a type trait K1, which takes a second parameter and returns a type trait K2.

    The implementation is a whole lot trickier, but it's the same basic mechanics. Applying K.ap[X] gives you K1[X]. Applying that to Y with K1[X].ap[Y] gives you K2[X, Y]. Evaluating that gives you X.

    The S combinator is more of the same.

    // The S combinator
    trait S extends Term {
      type ap[x <: Term] = S1[x]
      type eval = S
    }
    
    trait S1[x <: Term] extends Term {
      type ap[y <: Term] = S2[x, y]
      type eval = S1[x]
    }
    
    trait S2[x <: Term, y <: Term] extends Term {
      type ap[z <: Term] = S3[x, y, z]
      type eval = S2[x, y]
    }
    
    trait S3[x <: Term, y <: Term, z <: Term] extends Term {
      type ap[v <: Term] = eval#ap[v]
      type eval = x#ap[z]#ap[y#ap[z]]#eval
    }
    
    
    

    Michid then goes on to show examples of how to use these beasts. He implements equality testing, and then shows how to test if different type-expressions evaluate to the same thing. And all of this happens at compile time. If the equality test fails, then it's a type error at compile time!

    It's a brilliant little proof. Even if you can't read Scala syntax, and you don't really understand Scala type inference, as long as you know SKI, you can look at the equality comparisons, and see how it works in SKI. It's really beautiful.

    The Investors vs. the Tabby

    There’s an amusing article making its rounds of the internet today, about the successful investment strategy of a cat named Orlando.

    A group of people at the Observer put together a fun experiment.
    They asked three groups to pretend that they had 5000 pounds, and asked each of them to invest it, however they wanted, in stocks listed on the FTSE. They could only change their investments at the end of a calendar quarter. At the end of the year, they compared the result of the three groups.

    Who were the three groups?

    1. The first was a group of professional investors – people who are, at least in theory, experts at analyzing the stock market and using that analysis to make profitable investments.
    2. The second was a classroom of students, who are bright, but who have no experience at investment.
    3. The third was an orange tabby cat named Orlando. Orlando chose stocks by throwing his toy mouse at a
      targetboard randomly marked with investment choices.

    As you can probably guess by the fact that we’re talking about this, Orlando the tabby won, by a very respectable margin. (Let’s be honest: if the professional investors came in first, and the students came in second, no one would care.) At the end of the year, the students had lost 160 pounds on their investments. The professional investors ended with a profit of 176 pounds. And the cat ended with a profit of 542 pounds – more than triple the profit of the professionals.

    Most people, when they saw this, had an immediate reaction: “see, those investors are a bunch of idiots. They don’t know anything! They were beaten by a cat!”
    And on one level, they’re absolutely right. Investors and bankers like to present themselves as the best of the best. They deserve their multi-million dollar earnings, because, so they tell us, they’re more intelligent, more hard-working, more insightful than the people who earn less. And yet, despite their self-alleged brilliance, professional investors can’t beat a cat throwing a toy mouse!

    It gets worse, because this isn’t a one-time phenomenon: there’ve been similar experiments that selected stocks by throwing darts at a news-sheet, or by rolling dice, or by picking slips of paper from a hat. Many times, when people have done these kinds of experiments, the experts don’t win. There’s a strong implication that “expert investors” are not actually experts.

    Does that really hold up? Partly yes, partly no. But mostly no.

    Before getting to that, there’s one thing in the article that bugged the heck out of me: the author went out of his/her way to make sure that they defended the humans, presenting their performance as if positive outcomes were due to human intelligence, and negative ones were due to bad luck. In fact, I think that in this experiment, it was all luck.

    For example, the authors discuss how the professionals were making more money than the cat up to the last quarter of the year, and it’s presented as the human intelligence out-performing the random cat. But there’s no reason to believe that. There’s no evidence that there’s anything qualitatively different about the last quarter that made it less predictable than the first three.

    The headmaster at the students’ school actually said “The mistakes we made earlier in the year were based on selecting companies in risky areas. But while our final position was disappointing, we are happy with our progress in terms of the ground we gained at the end and how our stock-picking skills have improved.” Again, there’s absolutely no reason to believe that the students’ stock-picking skills miraculously improved in the final quarter; it’s much more likely that they just got lucky.

    The real question that underlies this is: is the performance of individual stocks in a stock market actually predictable, or is it dominantly random. Most of the evidence that I’ve seen suggests that there’s a combination; on a short timescale, it’s predominantly random, but on longer timescales it becomes much more predictable.

    But people absolutely do not want to believe that. We humans are natural pattern-seekers. It doesn’t matter whether we’re talking about financial markets, pixel-patterns in a bitmap, or answers on a multiple choice test: our brains look for patterns. If you randomly generate data, and you look at it long enough, with enough possible strategies,
    you’ll find a pattern that fits. But it’s an imposed pattern, and it has no predictive value. It’s like the images of jesus on toast: we see patterns in noise. So people see patterns in the market, and they want to believe that it’s predictable.

    Second, people want to take responsibility for good outcomes, and excuse bad ones. If you make a million dollars betting on a horse, you’re going to want to say that it was your superior judgement of the horses that led to your victory. When an investor makes a million dollars on a stock, of course he wants to say that he made that money because he made a smart choice, not because he made a lucky choice. But when that same investor loses a million dollars, he doesn’t want to say that he lost a million dollars because he’s stupid; he wants to say that he lost money because of bad luck, of random factors beyond his control that he couldn’t predict.

    The professional investors were doing well during part of the year: therefore, during that part of the year, they claim that their good performance was because they did a good job judging which stocks to buy. But when they lost money during the last quarter? Bad luck. But overall, their knowledge and skills paid off! What evidence do we have to support that? Nothing: but we want to assert that we have control, that experts understand what’s going on, and are able to make intelligent predictions.

    The students’ performance was lousy, and if they had invested real money, they would have lost a tidy chunk of it. But their teacher believes that their performance in the last quarter wasn’t luck – it was that their skills had improved. Nonsense! They were lucky.

    On the general question: Are “experts” useless for managing investments?

    It’s hard to say for sure. In general, experts do perform better than random, but not by a huge margin, and certainly not by as much as they’d like us to believe. The Wall Street Journal used to do an experiment where they compared dartboard stock selection against human experts, and against passive investment in the Dow Jones Index stocks, over a one-year period. The pros won 60% of the time. That’s better than chance: the experts’ knowledge and skills were clearly benefiting them. But blindly throwing darts at a wall could still beat the experts 2 out of 5 times!

    When you actually do the math and look at the data, it appears that human judgement does have value. Taken over time, human experts do outperform random choices, by a small but significant margin.

    What’s most interesting is a time-window phenomenon. In most studies, the human performance relative to random choice is directly related to the amount of time that the investment strategy is followed: the longer the timeframe, the better the humans perform. In daily investments, like day-trading, most people don’t do any better than random. The performance of day-traders is pretty much in line with what you’d expect from random chance. Monthly, it’s still mostly a wash. But if you look at yearly performance, you start to see a significant difference: humans do typically outperform random choice by a small but definite margin. If you look at longer time-frames, like 5 or 10 years, then you start to see really sizeable differences. The data makes it look like daily fluctuations of the market are chaotic and unpredictable, but that there are long-term trends that we can identify and exploit.

    Define Distance Differently: the P-adic norm

    As usual, sorry for the delay. Most of the time when I write this blog, I’m writing about stuff that I know about. I’m learning about the p-adic numbers as I write these posts, so it’s a lot more work for me. But I think many of you will be happy to hear about the other reason for the delay: I’ve been working on a book based on some of my favorite posts from the history of this blog. It’s nearly ready! I’ll have more news about getting pre-release versions of it later this week!

    But now, back to p-adic numbers!

    In the last post, we looked a bit at the definition of the p-adic numbers, and how p-adic arithmetic works. But what makes p-adic numbers really interesting and valuable is metrics.

    Metrics are one of those ideas that are simultaneously simple and astonishingly complicated. The basic concept of a metric is straightforward: I’ve got two numbers, and I want to know how far apart they are. But it becomes complicated because it turns out that there are many different ways of defining metrics. We tend to think of metrics in terms of euclidean geometry and distance – the words that I use to describe metrics, like “how far apart”, come from our geometric intuition. In math, though, you can’t ever just rely on intuition: you need to be able to define things precisely. And precisely defining a metric is difficult. It’s also fascinating: you can create the real numbers from the integers and rationals by defining a metric, and the metric will reveal the gaps between the rationals. Completing the metric – filling in those gaps – gives you the real numbers. Or, in fact, if you fill them in differently, the p-adic numbers.

    To define just what a metric is, we need to start with fields and norms. A field is an abstract algebraic structure which describes the behavior of numbers. It’s an abstract way of talking about the basic structure of numbers with addition and multiplication operations. I’ve talked about fields before, most recently when I was debunking the crackpottery of E. E. Escultura here.

    A norm is a generalization of the concept of absolute value. If you’ve got a field F, then a norm on F is a function |·| from values in F to non-negative real numbers, with three properties:

    1. |x| = 0 if and only if x = 0.
    2. |xy| = |x| |y|
    3. |x + y| ≤ |x| + |y|

    A norm on F can be used to define a distance metric d(x, y) between x and y in F as | x - y|.

    For example, the absolute value is clearly a norm over the real numbers, and it defines the euclidean distance between them.

    So where do the gaps come from?

    You can define a sequence a of values in F as a = { a_n }, indexed by the natural numbers. There’s a special kind of sequence called a Cauchy sequence, which is a sequence where lim_{m,n → ∞} |a_n - a_m| = 0.

    You can show that any Cauchy sequence converges to a real number. But even if every element of a Cauchy sequence is a rational number, it’s pretty easy to show that many (in fact, most!) Cauchy sequences do not converge to rational numbers. There’s something in between the rational numbers which Cauchy sequences of rational numbers can converge to, but it’s not a rational number. When we talk about the gaps in the rational numbers, that’s what we mean. (Yes, I’m hand-waving a bit, but getting into the details
    would be a distraction, and this is the basic idea!)

    When you’re playing with number fields, the fundamental choice that you get is just how to fill in those gaps. If you fill them in using a metric based on a Euclidean norm, you get the real numbers. What makes the p-adic numbers is just a different norm, which defines a different metric.

    The idea of the p-adic metric is that there’s another way of describing the distance between numbers. We’re used to thinking about distance measured like a ruler on a numberline, which is what gives us the reals. For the p-adics, we’re going to define distance in a different way, based on the structure of numbers. The way that the p-adic metric works is based on how a number is built relative to the prime-number base.

    We define the p-adic metric in terms of the p-adic norm exactly the way that we defined Euclidean distance in terms of the absolute value norm. For the p-adic numbers, we start off with a norm on the integers, and then generalize it. In the p-adic integers, the norm of a number is based around the largest power of the base that’s a factor of that number: for an integer x, if p^n is the largest power of p that’s a factor of x, then the p-adic norm of x (written |x|_p) is p^{-n}. So the more times you multiply a number by the p-adic base, the smaller the p-adic norm of that number is.

    The way we apply that to the rationals is to extend the definition of p-factoring: if p is our p-adic base, then we can define the p-adic norm of a rational number as:

    • |0|_p = 0
    • For other rational numbers x: |x|_p = p^{-ord_p(x)}, where:
      • If x is a natural number, then ord_p(x) is the exponent of the largest power of p that divides x.
      • If x is a rational number a/b, then ord_p(a/b) = ord_p(a) - ord_p(b).

    Another way of saying that is based on a property of rational numbers and primes. For any prime number p, you can take any rational number x and represent it as p^n * (a/b), where neither a nor b is divisible by p. That representation is unique – there’s only one possible set of values for a, b, and n where that’s true. In that case, the p-adic norm of x is |x|_p = p^{-n}.
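
    Here’s a minimal Scala sketch of that norm (my own illustration – a rational number is passed in as a numerator/denominator pair, and the norm is returned as a Double just for readability):

    object PAdicNorm {
      // ord_p(n): the exponent of the largest power of p dividing n (for n != 0).
      def ord(p: Int, n: BigInt): Int = {
        var m = n.abs
        var k = 0
        while (m % p == 0) { m /= p; k += 1 }
        k
      }

      // |a/b|_p = p^(-(ord_p(a) - ord_p(b))), and |0|_p = 0.
      def norm(p: Int, a: BigInt, b: BigInt): Double =
        if (a == 0) 0.0
        else math.pow(p.toDouble, -(ord(p, a) - ord(p, b)).toDouble)

      def main(args: Array[String]): Unit = {
        println(norm(7, 98, 1))   // 98 = 2 * 7^2, so |98|_7 = 1/49
        println(norm(7, 3, 49))   // |3/49|_7 = 49
        println(norm(7, 10, 3))   // neither is divisible by 7, so the norm is 1
      }
    }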

    Ok, that’s a nice definition, but what on earth does it mean?

    Two p-adic numbers x and y are close together if x - y is divisible by a large power of p.

    In effect, this is the exact opposite of what we’re used to. In the real numbers written out in decimal form as a series of digits, the metric says that the more digits numbers have in common moving from left to right, the closer together they are. So 9999 and 9998 are closer than 9999 and 9988.

    But with P-adic numbers, it doesn’t work that way. The P-adic numbers are closer together if, moving right to left, they have a common prefix. The distance ends up looking very strange. In 7-adic, the distance between 56666 and 66666 is smaller than the distance between 66665 and 66666!

    As strange as it looks, it does make a peculiar kind of sense. The p-adic distance is measuring a valuable and meaningful kind of distance between numbers – their distance in terms of
    their relationship to the base prime number p. That leads to a lot of interesting stuff, much of which is, to be honest, well beyond my comprehension! For example, the Wiles proof of Fermat’s last theorem uses properties of the P-adic metric!

    Without getting into anything as hairy as FLT, there are still ways of seeing why the p-adic metric is valuable. Next post, we’ll look at something called Hensel’s lemma, which both shows how something like Newton’s method for root-finding works in the p-adic numbers, and also shows some nice algebraic properties of root-finding that aren’t nearly as clear for the real numbers.

    P-adic arithmetic

    I’ve been having a pretty rough month. The day after thanksgiving,
    I was diagnosed with shingles. Shingles is painful – very, very painful. Just the friction of clothing brushing over the shingles when I move caused so much pain that I spent weeks being as immobile as possible.

    Needless to say, this has been extremely boring. When I wasn’t fuzzed out on pain-killers, I’ve been doing some reading lately on something called p-adic numbers. p-adics are a very strange kind of number: They’re strange-looking, strange-acting, and strangely useful.

    P-adics are an alternative to the real numbers. In fact, in a way, they are the alternative to real numbers.

    If you go back to number theory, there’s a reason why the rational numbers aren’t sufficient, and why we have to extend them to the reals by adding irrational numbers. If you take all of the possible rational numbers, you find that there are gaps – places where we know that there must be a number, and yet, if we limit ourselves to fractions, there isn’t anything to fit. For example, we know that there must be some number where if we multiply it by itself, it’s equal to 2. It’s more than 1 4/10ths, by about 1/100th. It’s more than 141/100ths, by about 4/1,000ths. It’s more than 1,414/1,000ths, by about 2/10,000ths. No matter how far we go, there’s no fraction that’s exactly right. But we know that there’s a number there! If we look at those gaps carefully, we’ll find that most numbers are actually in those gaps! The real numbers and the p-adics are both ways of creating number systems that allow us to define the numbers that fit in to those gaps.

    The easiest way to understand p-adic numbers is to think about how we represent real numbers in base-P. We represent real numbers using powers of the base. So, for example, we’d write 24.31 in base-5 to mean 2*5^1 + 4*5^0 + 3*5^-1 + 1*5^-2. Using our normal notation for real numbers, we fill in the gaps by writing numbers as an integer part, followed by a decimal point, followed by a fractional part. For real numbers that aren’t rationals, we say that the fractional part goes on forever. So the square root of two starts 1.4142135624, and continues on and on, forever, without repeating. That gives us the real numbers. In that system, every number can be written using a finite number of digits to the left of the decimal point, and an infinite number of digits to the right.
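
    For example, working that base-5 expansion out exactly (using Python’s Fraction type just to keep the arithmetic exact):

    ```python
    from fractions import Fraction

    # 24.31 in base-5, expanded digit by digit:
    value = 2 * 5**1 + 4 * 5**0 + 3 * Fraction(1, 5**1) + 1 * Fraction(1, 5**2)
    print(value, float(value))   # 366/25 14.64
    ```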

    P-adic numbers are exactly the opposite: every P-adic number has a finite number of digits to the right of the decimal, but it can have an infinite number to the left!

    Defining p-adic numbers starts off being pretty similar to how we compute the representation of numbers in a standard numeric base. Take a number for your base, like 10. For a number n in base 10, take n modulo 10. That’s the right-most digit of your number. Divide by ten, dropping the fractional part, and take the result modulo 10, and that’s the second-rightmost digit. Continue this until there’s nothing left.

    For example, take the number 222 in base-10. If we wanted to represent that in base-7, we’d do:

    1. If we divide 222 by 7, we get 31 with a remainder of 5. So the rightmost digit is 5.
    2. We take 31, and divide it by 7, giving us 4, with a remainder of 3. So the second digit is 3.
    3. We’re left with 4, so the last digit is 4.

    So – 222 in base-7 is 435. It’s the same in 7-adic. For a particular base B, all positive integers are written the same in both base-B and B-adic. Integer arithmetic that doesn’t involve negative numbers is also the same.
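
    Here’s that digit-extraction procedure as a small Python sketch (the function name is just for illustration):

    ```python
    def digits_in_base(n: int, base: int) -> str:
        """Write a positive integer in the given base by extracting digits right to left."""
        assert n > 0
        out = []
        while n > 0:
            out.append(str(n % base))   # the current rightmost digit
            n //= base                  # drop it and continue with what's left
        return "".join(reversed(out))

    print(digits_in_base(222, 7))       # "435"
    ```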

    There’s one reallybig catch, and it’s one that would make a lot of cranks (like E. E. Escultura) very happy: for real numbers, the decimal notation (for example) is a just a representation – 35 in base 10 is written differently from 43 in base 8 but they’re the same actual number, just written differently. 35 in 10-adic is not the same number as 43 in 8-adic. As we’ll see when we get to metrics, they’re quite different. Each p-adic base produces a distinct system of p-adic numbers, and you can’t convert between them as freely as you can in the conventional reals. Decimal notation and hexidecimal notation are writing numbers in the same number system; 2-adic and 3-adic are different number systems!

    The first essential difference between p-adic numbers and the reals comes when you try to do arithmetic.

    As I said above, for integers, if you don’t
    do anything with negative numbers, P-adic arithmetic is the same as real number integer arithmetic. In fact, if you’re working with P-adic numbers, there are no negative numbers at all! In a P-adic number system, subtracting 1 from 0 doesn’t give you -1. It “rolls over” like an infinite odometer. So for 7-adic, 0-1 = …666666666! That means that arithmetic gets a bit different. Actually, it’s really pretty easy: you just do arithmetic from the right to the left, exactly the way that you’re used to.
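
    One way to see the roll-over concretely is to truncate to a fixed number of digits – say, work with 7-adic integers modulo 7^6. This is only a sketch of the idea, not a full p-adic implementation, but it behaves exactly like the infinite odometer:

    ```python
    # 7-adic integers truncated to 6 digits are just integers mod 7**6.
    p, k = 7, 6
    n = (0 - 1) % p**k
    digits = "".join(str((n // p**i) % p) for i in reversed(range(k)))
    print(digits)   # "666666" -- subtracting 1 from 0 rolls over, like an odometer
    ```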

    For addition and subtraction, P-adic works almost like normal real-number arithmetic using decimals. Addition is basically what you know from decimal arithmetic. Just go right to left, adding digits, and carrying to the left.

    So, for example, in 5-adic, if you have a number …33333, and 24, to add them, you’d go through the following steps.

    1. 3 + 4 is 7, which is 12 in base-5. So the first digit of the sum is 2, and we carry 1.
    2. 3 + 2 is 5, plus the carried 1 is 6 – which is 11 in base-5. So the second digit is 1, and we carry 1.
    3. 3 + 0 is 3, plus the carried 1 is 4, so the third digit is 4.
    4. For all the rest of the infinite digits streaming off to the left, it’s 3+0=3.

    So the sum is …3333412.
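
    You can check that sum mechanically by truncating to a fixed number of digits – here, working mod 5^8 in Python, with a small helper (my own, just for illustration) to print the base-5 digits:

    ```python
    p, k = 5, 8

    def base_p_digits(n: int) -> str:
        """Render n mod p**k as k base-p digits, leftmost digit first."""
        n %= p**k
        return "".join(str((n // p**i) % p) for i in reversed(range(k)))

    x = int("3" * k, 5)                 # ...33333, truncated to 8 digits
    y = int("24", 5)
    print(base_p_digits(x + y))         # "33333412" -- the low digits ...412 match the hand sum
    ```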

    To do subtraction, it’s still basically the same as what you’re used to from decimal. There’s just one simple change: infinite borrowing. In normal subtraction, you can borrow from the position to your left if there’s anything to your left to borrow from. For example, in decimal, if you wanted to subtract 9 from 23, you’d borrow 10 from the 2, then subtract 9 from 13, giving you a result of 14. But if you wanted to subtract 3-9, you couldn’t borrow, because there’s nothing to the left to borrow from. In p-adic, you can always borrow. If you’re subtracting 3-9 in 10-adic, then you can borrow from the 0 to the left of the 3. Of course, there’s nothing there – so it needs to borrow from its left. And so on – giving you an infinite string of 9s. So 3-9 in 10-adic gives you ….999999994.

    Let’s do a full example: …33333 – 42 in 5-adic.

    1. As always, we start from the right. 3 – 2 = 1, so the first digit is 1.
    2. Since 3 is smaller than 4, we need to borrow 1 – so we have 13 base 5, which is 8. 8 – 4 = 4. So
      the second digit is 4.
    3. For the third digit, we just subtract the borrowed 1, so it’s 2.

    So the result is …3333241.
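
    The same truncation trick checks the subtraction, and also shows the “infinite borrow” for 3 - 9 in 10-adic:

    ```python
    def base_p_digits(n: int, p: int, k: int = 8) -> str:
        """Render n mod p**k as k base-p digits, leftmost digit first."""
        n %= p**k
        return "".join(str((n // p**i) % p) for i in reversed(range(k)))

    print(base_p_digits(int("3" * 8, 5) - int("42", 5), 5))   # "33333241" -- matches ...3333241
    print(base_p_digits(3 - 9, 10))                           # "99999994" -- 3 - 9 in 10-adic
    ```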

    Multiplication and division get even stranger in P-adic. Because we can’t have an infinite number of digits to the right of the decimal, p-adic ends up representing fractions using an infinite number of digits to the left of the decimal. And that means that we get collisions between fractions and negative numbers. (This should start to give you a clue why each p-adic base is really a different number system: the correspondence between roll-over negatives and infinitely long fractions is different in each base.) It’s easiest to see how this works by looking at a concrete example.

    The fraction 1/3 can’t be written as a finite-length string in base-5. In 5-adic, that means we can’t write it using digits to the right of the decimal point – we would need an infinite number of digits, and we’re not allowed to do that. Instead, we need to write it with an infinite number of digits to the left! 1/3 in 5-adic is: …1313132.

    Looks crazy, but it does work: if you do a tabular multiplication, right to left, multiplying …1313132 by 3 gives you one! Let’s work it through:

    • Start from the right: the rightmost digit is 2. 3*2 is 6, which is 11 in base 5; so the rightmost digit is 1, and you carry a one.
    • The next digit is 3: 3 times 3 is 9, plus the carried 1, gives 10 – which is 20 in base-5, so the next digit is a 0, and you carry 2.
    • The next digit is 1: 3*1 is 3 plus the carried 2 = 5; so 0, carry 1.

    And so on – …1313132 * 3 = 1, so …1313132 == 1/3 in 5-adic.
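
    If you want to check that claim without doing the tabular multiplication by hand, you can truncate again – work mod 5^8, keeping only the eight rightmost digits:

    ```python
    p, k = 5, 8
    x = int("31313132", 5)       # the eight rightmost digits of ...1313132
    print((3 * x) % p**k)        # 1 -- so, up to eight digits, 3 * ...1313132 = ...00000001
    ```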

    How can we compute the value of 1/3? Just like decimal real numbers: division.

    Division in p-adics is actually easy. The trick is that, like all of the other arithmetic, it goes from right to left. Suppose we want to divide N by M. To make it easy, we’ll talk about the digits of M and N using subscripts: the rightmost digit of a number X is X_1, the second-rightmost is X_2, and so on. The division algorithm (there’s a short code sketch of it after the worked example below) is:

    1. Start at the rightmost digit of both numbers.
    2. Find the smallest digit d which, multiplied by M, has N_1 as its rightmost digit.
    3. Subtract d*M from N. The result will end in a 0.
    4. Drop that trailing 0 from N, giving N’.
    5. Now divide N’ by M, and append d as the rightmost digit of the result.

    Let’s walk through 1/3 in 5-adic:

    • The rightmost digit of 1 is 1.
    • What, multiplied by 3 will have a trailing digit of 1 in base-5? 2*3=6, which is 11 in base-5. So d = 2.
    • Now we subtract the “11” from 1 – which is really …00001. So it becomes …444440.
    • We drop the trailing 0, and N’ is …4444.
    • So now we divide …4444 by 3. What’s the smallest number which, multiplied by 3, ends in a 4 in base-5? 3*3=9, which is 14 in base-5. So the next digit of the result is 3.
    • Now, we subtract 14 from …4444. That gives us …44430. Drop the zero, and it’s …4443
    • Next digit is a 1, leaving …444. And …444 is the same value we were dividing two steps ago, so from here the digits just repeat – 3, then 1, then 3, then 1, … – giving us …1313132.
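
    Here’s a small Python sketch of that digit-by-digit division. The function name is mine, and using a modular inverse to pick each digit is just shorthand for the “smallest d” search described above:

    ```python
    def p_adic_divide(n: int, m: int, p: int, k: int = 10):
        """Divide n by m in the p-adics, producing the k rightmost digits of the quotient.
        Assumes p is prime and m is not divisible by p."""
        assert m % p != 0
        inv_m = pow(m, -1, p)          # inverse of m mod p, used to pick each digit
        digits, r = [], n
        for _ in range(k):
            d = (r * inv_m) % p        # the digit d such that d*m ends in the same digit as r
            digits.append(d)
            r = (r - d * m) // p       # subtract d*m (the result ends in 0) and drop the 0
        return digits                  # rightmost digit first

    print(p_adic_divide(1, 3, 5))      # [2, 3, 1, 3, 1, 3, 1, 3, 1, 3] -> ...1313132
    ```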

    Crazy, huh? There is one catch about division: it only really works if the p-base in your p-adic system is a prime number. Otherwise, you get trouble – your p-adic system of numbers isn’t a field if p is non-prime.

    If you think about it, arithmetic with the P-adics is actually simpler than it is with conventional real numbers. Everything goes right to left. It’s all more uniform. In fact, p-adic has been seriously proposed as a number representation for computer hardware, because the hardware is much easier to build when everything can be done uniformly right to left!

    There’s a lot more wierdness in the p-adics. We haven’t even started to talk about metrics and distances in p-adic numbers. That’s both where the p-adics get even stranger, and where they actually get useful. That’ll be the next post, hopefully within a day or two!

    Debunking Two Nate Silver Myths

    I followed our election pretty closely. My favorite source of information was Nate Silver. He’s a smart guy, and I love the analysis that he does. He’s using solid math in a good way to produce excellent results. But in the aftermath of the election, I’ve seen a lot of bad information going around about him, his methods, and his results.

    First: I keep seeing proclamations that “Nate Silver proves that big data works”.

    Rubbish.

    There is nothing big data about Nate’s methods. He’s using straightforward Bayesian methods to combine data, and the number of data points is remarkably small.

    Big data is one of the popular jargon keywords that people use to appear smart. But it does actually mean something. Big data is using massive quantities of information to find patterns: using a million data points isn’t really big data. Big data means terabytes of information, and billions of datapoints.

    When I was at Google, I did log analysis. We ran thousands of machines every day on billions of log records (I can’t say the exact number, but it was in excess of 10 billion records per day) to extract information. It took a data center with 10,000 CPUs running full-blast for 12 hours a day to process a single day’s data. Using that data, we could extract some obvious things – like how many queries per day for each of the languages that Google supports. We could also extract some very non-obvious things that weren’t explicitly in the data, but that were inferrable from the data – like probable network topologies of the global internet, based on communication latencies. That’s big data.

    For another example, look at this image produced by some of my coworkers. At foursquare, we get about five million points of checkin data every day, and we’ve got a total of more than 2 1/2 billion data points. By looking at average checkin densities, and then comparing that to checkin densities after the hurricane, we can map out precisely where in the city there was electricity, and where there wasn’t. We couldn’t do that by watching one person, or a hundred people. But by looking at the patterns in millions and millions of records, we can. That is big data.

    This doesn’t take away from Nate’s accomplishment in any way. He used data in an impressive and elegant way. The fact is, he didn’t need big data to do this. Elections are determined by aggregate behavior, and you just don’t need big data to predict them. The data that Nate used was small enough that a person could do the analysis of it with paper and pencil. It would be a huge amount of work to do by hand, but it’s just nowhere close to the scale of what we call big data. And trying to do big data would have made it vastly more complicated without improving the result.

    Second: there are a bunch of things like this.

    The point that many people seem to be missing is that Silver was not simply predicting who would win in each state. He was publishing the odds that one or the other candidate would win in each statewide race. That’s an important difference. It’s precisely this data, which Silver presented so clearly and blogged about so eloquently, that makes it easy to check on how well he actually did. Unfortunately, these very numbers also suggest that his model most likely blew it by paradoxically underestimating the odds of President Obama’s reelection while at the same time correctly predicting the outcomes of 82 of 83 contests (50 state presidential tallies and 32 of 33 Senate races).

    Look at it this way, if a meteorologist says there a 90% chance of rain where you live and it doesn’t rain, the forecast wasn’t necessarily wrong, because 10% of the time it shouldn’t rain – otherwise the odds would be something other than a 90% chance of rain. One way a meteorologist could be wrong, however, is by using a predictive model that consistently gives the incorrect probabilities of rain. Only by looking a the odds the meteorologist gave and comparing them to actual data could you tell in hindsight if there was something fishy with the prediction.

    Bzzt. Sorry, wrong.

    There are two main ways of interpreting probability data: frequentist, and Bayesian.

    In a frequentist interpretation, when you say that an outcome of an event has an X% probability of occurring, you’re saying that if you were to run an infinite series of repetitions of the event, then on average, the outcome would occur in X out of every 100 repetitions.

    The Bayesian interpretation doesn’t talk about repetition or observation. What it says is: for any specific event, it will have one outcome. There is no repetition. But given the current state of information available to me, I can have a certain amount of certainty about whether or not the event will occur. Saying that I assign probability P% to an event doesn’t mean that I expect my prediction to fail (100-P)% of the time. It just means that given the current state of my knowledge, I expect a particular outcome, and the information I know gives me that degree of certainty.

    Bayesian statistics and probability is all about state of knowledge. The fundamental, defining theorem of Bayesian statistics is Bayes theorem, which tells you, given your current state of knowledge and a new piece of information, how to update your knowledge based on what the new information tells you. Getting more information doesn’t change anything about whether or not the event will occur: it will occur, and it will have either one outcome or the other. But new information can allow you to improve your prediction and your certainty of that prediction’s correctness.
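
    As a concrete illustration of a single Bayesian update, here’s a minimal sketch. The likelihood numbers below are entirely made up for illustration – they are not Silver’s model or the quoted author’s figures.

    ```python
    def bayes_update(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
        """Bayes' theorem: P(H | E) from P(H), P(E | H), and P(E | not H)."""
        numerator = p_evidence_if_true * prior
        return numerator / (numerator + p_evidence_if_false * (1 - prior))

    # Start 19.9% certain of rain; then see clouds, which we'll assume are twice as
    # likely on rainy days as on dry ones.
    print(bayes_update(0.199, 0.6, 0.3))   # ~0.33 -- more certain of rain; the event itself is unchanged
    ```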

    The author that I quoted above is being a frequentist. In another section of his article, he’s more specific:

    …The result is P= 0.199, which means there’s a 19.9% chance that it rained every day that week. In other words, there’s an 80.1% chance it didn’t rain on at least one day of the week. If it did in fact rain everyday, you could say it was the result of a little bit of luck. After all, 19.9% isn’t that small a chance of something happening.

    That’s frequentist intepretation of the probability – which makes sense, since as a physicist, the author is mainly working with repeated experiments – which is a great place for frequentist interpretation. But looking at the same data, a Bayesian would say: “I have an 19.9% certainty that it will rain today”. Then they’d go look outside, see the clouds, and say “Ok, so it looks like rain – that means that I need to update my prediction. Now I’m 32% certain that it will rain”. Note that nothing about the weather has changed: it’s not true that before looking at the clouds, 80.1 percent of the time it wouldn’t rain, and after looking, that changed. The actual fact of whether or not it will rain on that specific day didn’t
    change.

    Another way of looking at this is to say that a frequentist believes that a given outcome has an intrinsic probability of occurring, and that our attempts to analyze it just bring us closer to the true probability; whereas a Bayesian says that there is no such thing as an intrinsic probability, because every event is different. All that changes is our ability to make predictions with confidence.

    One last metaphor, and I’ll stop. Think about playing craps, where you’re rolling two six sided dice.
    For a particular die, a frequentist would say “A fair die has a 1 in 6 chance of coming up with a 1”. A
    Bayesian would say “If I don’t know anything else, then my best guess is that I can be 16% certain that a 1
    will result from a roll.” The result is the same – but the reasoning is different. And because of the difference in reasoning, you can produce different predictions.

    Nate Silver’s predictions of the election are a beautiful example of Bayesian reasoning. He watched daily polls, and each time a new poll came out, he took the information from that poll, weighted it according to the historical reliability of that poll in that situation, and then used that to update his certainty. So based on his data, Nate was 90% certain that his prediction was correct.