DNS and Yesterday's Outage

As I’m sure at least some of you noticed, Scientopia – along with a huge number of other sites – was unreachable for most of yesterday. I say “unreachable” instead of “down” because the site was up. But if you wanted to get to it, and you didn’t know its numeric address, you couldn’t reach it. That’s because your computer uses a service called DNS to find things.

So what’s this DNS thing? And what brought down so much of the net yesterday?

DNS stands for “domain name system”. What it does is provide a bridge between the way that you think of the name of a website, and the way that your computer actually talks to the website. You think of a website by a URL, like http://scientopia.org/blogs/goodmath. That URL basically says “the thing named ‘blogs/goodmath’ on a server named ‘scientopia.org’, which you can access using a system called ‘http’.”

The catch is that your computer can’t just send a message to something named “scientopia.org”. Scientopia is a server in a datacenter somewhere, and to send a message to it, it needs to know a numeric address. Numeric addresses are usually written as a set of four numbers, each ranging from 0 to 255. For example, scientopia.org is currently 184.106.221.182.

You don’t want to have to remember those four meaningless numbers in order to connect to scientopia. But there’s more to it than just remembering: those numbers can change. The physical computer that scientopia works on has been moved around several times. It lives in a datacenter somewhere, and as that datacenter has grown, and hardware has needed to be either expanded or replaced, the numeric address has changed. Even though I’m the administrator of the site, there’ve been a couple of times where it’s changed, and I can’t pinpoint exactly when!

The reason that all of that works is DNS. When you tell your computer to connect to another computer by name, it uses DNS to ask “What’s the numeric address for this name?”, and once it gets that numeric address, it uses that to connect to the other computer.
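If you want to see this in action yourself, here’s a tiny sketch in Scala (the language I use for examples elsewhere in these posts) that asks your system’s resolver exactly that question, using the JVM’s standard java.net.InetAddress API:

import java.net.InetAddress

// Ask the system's DNS resolver: "what's the numeric address for this name?"
object Lookup {
  def main(args: Array[String]): Unit = {
    val address = InetAddress.getByName("scientopia.org")
    println(address.getHostAddress)   // prints something like 184.106.221.182
  }
}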

When I first got access to the internet in college, it was shortly before DNS first got deployed. At the time, there was one canonical computer somewhere in California. Every other computer on the (then) arpanet would contact that computer every night, and download a file named “hosts.txt”. hosts.txt contained a list of every computer on the internet, and its numeric address. That was clearly not working any more, and DNS was designed as the replacement. When they decided to replace hosts.txt, they wanted to replace it with something much more resilient, and something that didn’t need to have all of the data in one place, and which didn’t rely on any single machine in order to work.

The result was DNS. It’s a fascinating system. It’s the first real distributed system that I ever learned about, and I still think it’s really interesting. I’m not going to go into all of the details: distributed systems always wind up with tons of kinks, to deal with the complexity of decentralized information in the real world. But I can explain the basics pretty easily.

The basic way that DNS works is really simple. You take a domain name. For example, my personal machine is named “sunwukong”, and my family network is “phouka.net” – so the hostname that I’m typing this on right now is “sunwukong.phouka.net”. To look it up, what happens is you take the host name, and split it at the dots. So for the example, that’s [“sunwukong”, “phouka”, “net”].

You start with the last element of the list, which is called the top-level domain (TLD), and you contact a special server called the root nameserver and ask “who do I ask for addresses ending with this TLD?” (“Who do I ask for ‘.net’?”) It sends a response, giving you the address of another nameserver. Then you contact that one, and ask it “who do I ask for the next element?” (“Who do I ask for ‘phouka’?”) Finally, when you get to the last one, the address that it gives you is the address of the machine you’re looking for, instead of the address of another DNS server. At each level of the hierarchy, there’s a server or set of servers that’s registered with the level above to be the canonical server for some set of names. The “.com” servers are registered with the root-level servers. The next level of domains is registered with the TLD servers. And so on.

So, to look up sunwukong.phouka.net, what happens is:

  1. You send a request to the root nameserver, and ask it for the server for “.net”.
  2. The root server responds, giving you the “.net” server.
  3. You send a request to the .net server address that you just got, asking “Who can tell
    me about phouka?”
  4. The .net server responds, and gives you an address for a DNS server that can tell
    you where things are in phouka.net.
  5. You ask the “phouka.net” server for the address of sunwukong
  6. It finally gives you the address of your server.

This distributes the information nicely. You can have lots of root nameservers – all they need to be able to do is find DNS servers for the list of top-level domains. And there can be – and are – thousands and thousands of nameservers for addresses in those top-level domains. And then each administrator of a private network can run a nameserver for the computers that they directly manage. In this mess, any nameserver at any level can disappear, and all that will be affected are the names that it manages.

The problem is, there’s no one place to get information; and more importantly, every time you need to talk to another computer, you need to go through this elaborate sequence of steps.

To get around that last issue, you’ve got something called caches, on multiple levels. A cache is a copy of the canonical data, which gets kept around for some period of time. In the DNS system, you don’t need to talk to the canonical DNS server for a domain if the data you need is already in a cache. You can (and generally do) talk to caching DNS servers. With a caching DNS server, you tell it the whole domain name that you want to look up, and it does the rest of the lookup process, and gives you the address you’re looking for. Every time it looks something up, it remembers what it did, so that it doesn’t need to ask again. So when you look up a “.com” address, your caching server will remember who it can ask for “.com” addresses. So most of the time, you only talk to one server, which does the hard work, and does it in a smart way so that it doesn’t need to unnecessarily repeat requests.
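Here’s a toy sketch of the caching idea in Scala. The resolveRecursively parameter is a hypothetical stand-in for the full root-to-TLD-to-domain lookup described above; a real caching server also honors a per-record TTL from the DNS response instead of one fixed lifetime:

import scala.collection.mutable

// A toy caching resolver: remember answers for a while, so that repeated
// lookups don't have to walk the whole hierarchy again.
class CachingResolver(resolveRecursively: String => String, ttlMillis: Long = 3600000L) {
  private val cache = mutable.Map.empty[String, (String, Long)]  // name -> (address, expiry time)

  def lookup(name: String): String = {
    val now = System.currentTimeMillis()
    cache.get(name) match {
      case Some((address, expiry)) if expiry > now => address    // cache hit, still fresh
      case _ =>
        val address = resolveRecursively(name)                   // do the slow, full lookup
        cache(name) = (address, now + ttlMillis)                  // remember it for next time
        address
    }
  }
}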

So now, we can get to what happened yesterday. GoDaddy, one of the major American domain registrars and DNS hosts, went down. Exactly why it went down is not clear – a group of jerks claimed to have done a denial-of-service attack against them, but the company claims to have just had a configuration glitch. Honestly, I’m inclined to believe that it was the hackers, but I don’t know for sure.

But – the canonical server for scientopia.org was gone. So if you needed to look up scientopia.org and it wasn’t in your cache, then your cacheing nameserver would go to the .org server, and ask for scientopia; the .org nameserver would return the address of GoDaddy’s nameserver, and that wouldn’t respond. So you’d never reach us.

That’s also why some people were able to reach the site, and others weren’t. If your cacheing server had cached the address for scientopia, then you’d get the server address, and since the server was up the whole time, you’d connect right up, and everything would work.

Nulls

So, my post on monads apparently set off a bit of a firestorm over my comments about avoiding null pointer exceptions. To show you what I mean, here’s a link to one of the more polite and reasonable posts.

Ok. So. Let me expand a bit.

The reason that I think it’s so great that I don’t get NPEs when I use a language with something like Option isn’t because it makes me a super-programmer who’s better than the lowly slime who deal with NPEs. It’s because it changes how I write code, in a way that helps me avoid making mistakes – mistakes that I make because I’m just a fallible idiot of a human.

There are two fundamental things about an option type, like what we have in Scala, that make a huge difference.

First, it narrows the field of errors. When I’m programming in Java, any call that returns a pointer could return a null. The language makes no distinction between a function/expression that could return a null, and one that can’t. That means that when I get an NPE, the source of that null pointer could be anything in the dynamic slice leading to the error. With an option type, I’ve got two kinds of functions: functions that always return a non-null value, and functions that sometimes return a non-null value, and sometimes return a None. That’s incredibly valuable.

Second, it forces me to explicitly deal with the None case. In Java, programmers constantly build code without null checks, because they know that a function won’t return null. And then it does, and ker-splat. With an option type, I have no choice: I have to explicitly deal with the potential error case. Sure, I can forcibly code around it – in Scala, I can use Option.get, which will turn into an error analogous to an NPE. But it forces me to make that choice, and make it explicitly.
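Here’s roughly what that explicit choice looks like in Scala; findUser is a made-up function, just for illustration:

def findUser(name: String): Option[String] =
  if (name == "markcc") Some("Mark") else None

findUser("markcc") match {
  case Some(user) => println(s"Found $user")   // the value is really there
  case None       => println("No such user")   // the "null" case, spelled out explicitly
}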

Even if I’m going the stupid, brute-force route and assuming that I know, without fail, that a function is going to return a non-null value… Consider an example:

  In Java:

  T g = f.doSomething();
  g.doSomethingElse();

  In Scala:

  val g: Option[T] = f.doSomething()
  g.get.doSomethingElse()

The scala case has to explicitly deal with the fact that it’s dealing with a potentially empty value, and using a statement that asserts the non-emptiness.

But in reality, if you’re a decent programmer, you never use .get to directly access an option. (The only exception is in cases where you call the .get in a context dominated by a non-empty test; but even then, it’s best to not, to avoid errors when the surrounding code is modified.) In real code, you pretty much always explicitly de-option a value using a function like getOrElse:

val f: User = getUser("markcc").getOrElse(new User("markcc"))

As I hope has become plain, the point of avoiding NPEs through option-like type structures isn’t that it somehow makes the entire category of unexpected-result-value errors disappear. It’s that it changes the way that you code, to distinguish where those errors can occur, and to force you to deal with them.

I think that ultimately, things like this are really just manifestations of the good old static vs. dynamic typing wars. Type errors in a dynamically typed language are really just unexpected value errors. Strong typing doesn’t stop you from making those errors. It just takes a bunch of common cases of those errors, and converts them from run-time errors to compile-time errors. Whether you want them to be run-time or compile-time depends on the project you’re working on, on your development team, and on your personal preferences.

I find in practice that I get many fewer errors by being forced to explicitly declare when a value might be null/None, and by being required to explicitly deal with the null/None case when it might occur. I’ve spent much less time debugging that kind of error in my year at foursquare than in the 15 years of professional development that I did before. That’s not because I magically became a better programmer a year ago when I joined foursquare. It’s because I’m using a better tool that helps me avoid mistakes.

Monads and Programming

Sorry things have been so slow around here. I know I keep promising that I’m going to post more frequently, but it’s hard. Life as an engineer at a startup is exhausting. There’s so much work to do! And the work is so fun – it’s easy to let it eat up all of your time.

Anyway… last good-math post ’round these parts was about monoids and programming. Today, I’m going to talk about monads and programming!

If you recall, monoids are an algebraic/categorical construct that, when implemented in a programming language, captures the abstract notion of foldability. For example, if you’ve got a list of numbers, you can fold that list down to a single number using addition. Folding collections of values is something that comes up in a lot of programming problems – capturing that concept with a programming construct allows you to write code that exploits the general concept of foldability in many different contexts.

Monads are a construct that have become incredibly important in functional programming, and they’re very, very poorly understood by most people. That’s a shame, because the real concept is actually simple: a monad is all about the concept of sequencing. A monad is, basically, a container that you can wrap something in. Once it’s wrapped, you can form a sequence of transformations on it. The result of each step is the input to the next. That’s really what it’s all about. And when you express it that way, you can begin to see why it’s such an important concept.

I think that people are confused by monads for two reasons:

  1. Monads are almost always described in very, very abstract terms. I’ll also get into the abstract details, but I’ll start by elaborating on the simple description I gave above.
  2. Monads in Haskell, which is where most people are introduced to them, are very confusing. The basic monad operations are swamped with tons of funny symbols, and they’re named in incredibly misleading ways. (“return” does almost the exact opposite of what you expect return to do!)

In programming terms, what’s a monad?

Basically, a monadic computation consists of three pieces:

  1. A monadic type, M, which is a parametric type wrapper that can wrap a value of any type.
  2. An operation which can wrap a value in M.
  3. An operation which takes a function that transforms a value wrapped in M into another value (possibly of a different type) wrapped in M.

Whenever you describe something very abstractly like this, it can seem rather esoteric. But this is just a slightly more formal way of saying what I said up above: it’s a wrapper for a series of transformations on the wrapped value.
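To make that a little more concrete, here’s a minimal sketch of those three pieces as a Scala trait. This is my own illustration, not anything from the standard library – in real Scala, each monadic type just defines these operations directly:

import scala.language.higherKinds

// The three pieces of a monadic computation, written as a Scala typeclass.
trait Monad[M[_]] {
  def wrap[A](a: A): M[A]                           // 2: wrap a value in M
  def flatMap[A, B](m: M[A])(f: A => M[B]): M[B]    // 3: transform a wrapped value
}

// For example, Option forms a monad:
object OptionMonad extends Monad[Option] {
  def wrap[A](a: A): Option[A] = Some(a)
  def flatMap[A, B](m: Option[A])(f: A => Option[B]): Option[B] = m.flatMap(f)
}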

Let me give you an example. At foursquare, we do all of our server programming in Scala. In a year at foursquare, I’ve seen exactly one null pointer exception. That’s amazing – NPEs are ridiculously common in Java programming. But in Scala at foursquare, we don’t allow nulls to be used at all. If you have a value which could either be an instance of T or no value at all, we use an option type. An Option[T] can be either Some(t: T) or None.

So far, this is nice, but not really that different from a null. The main difference is that it allows you to say, in the type system, whether or not a given value might be null. But: Option is a monad. In Scala, that means that I can use map on it. (map is one of the transformation functions!)

	val t: Option[Int] = ...
	val u: Option[String] = t.map( _ + 2 ).map(_.toString)

What this does is: if t is Some(x), then it adds two to it, and returns Some(x+2); then it takes the result of the first map, and converts it to a string, returning an Option[String]. If t is None, then running map on it always returns None. So I can write code which takes care of the null case, without having to write out any conditional tests of nullness – because optionality is a monad.

In a good implementation of a monad, I can do a bit more than that. If I’ve got a Monad[T], I can use a map-like operation with a function that takes a T and returns a Monad[U] – a monad wrapping a possibly different type.

For an example, we can look at lists – because List is a monad:

val l: List[Int] = List(1, 2, 3)
l.flatMap({ e => List( (e, e), (e+1, e+2) )  })
res0: List[(Int, Int)] = List((1,1), (2,3), (2,2), (3,4), (3,3), (4,5))

The monad map operation does a flatten on the map steps. That means a lot of things. You can see one in the rather silly example above.

You can take values, and wrap them as a list. Then you can perform a series of operations on those elements of a list – sequencing over the elements of the list. Each operation, in turn, returns a list; the result of the monadic computation is a single list, concatenating, in order, the lists returned by each element. In Scala, the flatMap operation captures the monadic concept: basically, if you can flatmap something, it’s a monad.
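To see the flattening concretely, compare plain map with flatMap on the same list (an illustrative REPL-style example):

val l = List(1, 2, 3)

// map keeps the nesting: each element becomes a list, so we get a list of lists.
l.map(e => List(e, e + 1))
// => List(List(1, 2), List(2, 3), List(3, 4))

// flatMap does the same mapping, then concatenates the inner lists in order.
l.flatMap(e => List(e, e + 1))
// => List(1, 2, 2, 3, 3, 4)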

Let’s look at it a bit more specifically.

  1. The monadic type: List[T].
  2. A function to wrap a value into the monad: List’s constructor function, def apply[T](value: T): List[T].
  3. The map operation, defined on List[T]: def flatMap[U](op: T => List[U]): List[U].

(In the original version of this post, I put the wrong type in flatMap in the list above. In the explanation demonstrating flatMap, the type is correct. Thanks to John Armstrong for catching that!)

You can build monads around just about any kind of type wrapper where it makes sense to map over the values that it wraps: collections, like lists, maps, and options. Various kinds of state – variable environments (where the wrapped values are, essentially, functions from identifiers to values), or IO state. And plenty of other things. Anything where you perform a sequence of operations over a wrapped value is a monad.

Now that we have some understanding of what this thing we’re talking about is, what is it in mathematical terms? For that, we turn to category theory.

Fundamentally, in category theory a monad is a category with a particular kind of structure. It’s a category with one object. That category has a collection of arrows which (obviously) are from the single object to itself. That one-object category has a functor from the category to itself. (As a reminder, a functor is an arrow between categories in the category of (small) categories.)

The first trick to the monad, in terms of theory, is that it’s fundamentally about the functor: since the functor maps from a category to the same category, you can almost ignore the category; it’s implicit in the definition of the functor. So we can almost treat the monad as if it were just the functor – that is, a kind of transition function.

The other big trick is closely related to that. For the programming language application of monads, we can think of the single object in the category as the set of all possible states. So we have a category object, which is essentially the collection of all possible states; and there are arrows between the states representing possible state transitions. So the monad’s functor is really just a mapping from arrows to different arrows – which basically represents the way that changing the state causes a change in the possible transitions to other states.

So what a monad gives us, in terms of category theory, is a conceptual framework that captures the concept of a state transition system, in terms of transition functions that invisibly carry a state. When that’s translated into programming languages, that becomes a value that implicitly takes an input state, possibly updates it, and returns an output state. Sound familiar?

Let’s take a moment and get formal. As usual for category theory, first there are some preliminary definitions.

  1. Given a category C, 1_C is the identity functor from C to C.
  2. Given a category C with a functor T : C → C, T² = T ∘ T.
  3. Given a functor T, 1_T : T → T is the identity natural transformation from T to T.

Now, with that out of the way, we can give the complete formal definition of a monad. Given a category C, a monad on C is a triple (T : C → C, η : 1_C → T, μ : T² → T), where T is a functor, and η and μ are natural transformations. The members of the triple must make the following two diagrams commute.

[Figure: the first monad law diagram – commutativity of composition with μ]

[Figure: the second monad law diagram – commutativity of composition with η]

What these two diagrams mean is that successive applications of the state-transition functor over C behave associatively, and that the monadic structure is always preserved. Together, these mean that any sequence of operations (that is, applications of the monad functor) is itself a monadic functor – so building a sequence of monadic state transformers is guaranteed to behave as a proper monadic state transition. To the rest of the universe, the difference between a sequence and a single simple operation is indistinguishable: the state will be consistently passed from application to application with the correct chaining behavior, and to the outside world, the entire monadic chain looks like a single atomic monadic operation.

Now, what does this mean in terms of programming? Each element of a monadic sequence in Haskell is an instantiation of the monadic functor – that is, it’s an arrow between states – a function, not a simple value – which is the basic trick to monads. They look like a sequence of statements; in fact, each statement in a monad is actually a function from state to state. And it looks like we’re writing sequence code – when what we’re actually doing is writing function compositions – so that when we’re done writing a monadic sequence, what we’ve actually done is written a function definition in terms of a sequence of function compositions.

Understanding that, we can now clearly understand why we need the return function to use a non-monad expression inside of a monadic sequence – because each step in the sequence needs to be an instance of the monadic functor; an expression that isn’t an instance of the monadic functor couldn’t be composed with the functions in the sequence. The return function is really nothing but a function that combines a non-monadic expression with the id functor.

In light of this, let’s go back and look at the definition of Monad in the Haskell standard prelude.

class  Functor f  where
  fmap              :: (a -> b) -> f a -> f b

class  Monad m  where
  (>>=)  :: m a -> (a -> m b) -> m b
  (>>)   :: m a -> m b -> m b
  return :: a -> m a
  fail   :: String -> m a

  -- Minimal complete definition:
  --      (>>=), return
  m >> k  =  m >>= \_ -> k
  fail s  = error s

The declaration of Monad is connected with the definition of Functor – if you look, you can see the connection. The fundamental operation of Monad is “>>=”, the chaining operation, which is basically the Haskell version of the flatMap operation. Its type, m a -> (a -> m b) -> m b, is deeply connected with Functor’s fmap operation – the a in m a is generally going to be a type which can be a Functor. (Remember what I said about Haskell and monads? I really prefer map and flatMap to >> and >>=.)

So the value type wrapped in the monad is a functor – in fact, the functor from the category definition! And the “>>=” operation is just the functor composition operation from the monad definition.

A proper implementation of a monad needs to follow some fundamental rules – the rules are, basically, just Haskell translations of the structure-preserving rules about functors and natural transformations in the category-theoretic monad. There are two groups of laws – laws about the Functor class, which should hold for the transition function wrapped in the monad class; and laws about the monadic operations in the Monad class. One important thing to realize about the functor and monad laws is that they are not enforced – in fact, cannot be enforced! – but monad-based code using monad implementations that do not follow them may not work correctly. (A compile-time method for correctly verifying the enforcement of these rules can be shown to be equivalent to the halting problem.)

There are two simple laws for Functor, and it’s pretty obvious why they’re fundamentally just structure-preservation requirements. The Functor class only has one operation, called fmap, and the two functor laws are about how it must behave.

  1. fmap id = id
    (Mapping id over any structured value results in an unmodified value.)
  2. fmap (f . g) = (fmap f) . (fmap g)
    (“.” is the function composition operation; this just says that fmap preserves structure: mapping a composition of two functions is the same as composing the two mapped functions.)

The monad laws are a bit harder, but not much. They mainly govern how monadic operations interact with non-monadic operations, in terms of the “return” and “>>=” operations of the Monad class. (After the list, I’ll spell a few of them out in Scala terms.)

  1. return x >>= f = f x
    (injecting a value into the monad is basically the same as passing it as a parameter down the chain – return is really just the identity functor passing its result on to the next step. I hate the use of “return”. In a state functor, in exactly the right context, it does sort-of look like a return statement in an imperative language. But in pretty much all real code, return is the function that wraps a value into the monad.)
  2. f >>= return = f
    (If you don’t specify a value for a return, it’s the same as just returning the result of the previous step in the sequence – again, return is just identity, so passing something into return shouldn’t affect it.)
  3. seq >>= return . f = fmap f seq
    (composing return with a function is equivalent to invoking that function on the result of the monad sequence to that point, and wrapping the result in the monad – in other words, it’s just composition with identity.)
  4. seq >>= (\x -> f x >>= g) = (seq >>= f) >>= g
    (Your implementation of “>>=” needs to be semantically equivalent to the usual translation; that is, it must behave like a functor composition.)
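Since I prefer the Scala spelling anyway, here’s the same idea written with Option, using Some in place of return and flatMap in place of >>=. This is just an illustrative spot-check of laws 1, 2, and 4 – each expression evaluates to true – not a proof:

// The monad laws, spelled out for Option.
val f: Int => Option[Int] = x => Some(x * 2)
val g: Int => Option[Int] = x => Some(x + 1)
val m: Option[Int] = Some(21)

// Law 1: return x >>= f  ==  f x
Some(21).flatMap(f) == f(21)

// Law 2: m >>= return  ==  m
m.flatMap(Some(_)) == m

// Law 4 (associativity): (m >>= f) >>= g  ==  m >>= (\x -> f x >>= g)
m.flatMap(f).flatMap(g) == m.flatMap(x => f(x).flatMap(g))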

There's always more Cantor crackpottery!

I’m not the only one who gets mail from crackpots!

A kind reader forwarded me yet another bit of Cantor crackpottery. It never ceases to amaze me how many people virulently object to Cantor, and how many of them just spew out the same, exact, rubbish, somehow thinking that they’re different than all the others who made the same argument.

This one is yet another variation on the representation scheme. That is, it’s an argument that you can write out all of the real numbers whose decimal forms have one digit after the decimal point; then all of the reals with two digits; then all of them with three digits; and so on. This will produce an enumeration, and therefore a one-to-one mapping from the naturals to the reals. Presto, Cantor goes out the window.

Or not.

As usual, the crank starts off with a bit of pomposity:

Dear Colleague,

My mathematic researshes lead me to resolve the continuum theory of Cantor, subject of controversy since a long time.

This mail is made to inform the mathematical community from this work, and share the conclusions.

You will find in attachment extracts from my book “Théorie critique fondamentale des ensembles de Cantor”,

Inviting you to contact me,

Francis Collot,
Member of the American mathematical society
Membre de la société mathématique de France
Member of the Bulletin of symbolic logic
Director of éditions européennes

As a quick aside, I love how he signs his email “Member of the AMS”, as if that were something meaningful. The AMS is a great organization – but anyone can be a member. All you need to do is fill out a form, and write them a check. It’s not something that anyone sane or reasonable brags about, because it doesn’t mean anything.

Anyway, let’s move on. Here’s the entirety of his proof. I’ve reproduced the formatting as well as I could; the original document sent to me was a PDF, so the tables don’t cut-and-paste.

The well-order on the set of real numbers result from this remark that it is possible to build, after the comma, a set where each subset has the same number of ordered elements (as is ordered the subset 2 : 10 …13 … 99).

Each successive integer is able to be followed after the comma (in french the real numbers have one comma after the integer) by an increasing number of figures.

0,0 0,10 0,100
0,1 0,11 0,101
0,2 0,12 0,102
0,9 0,99 0,999

It is the same thing for each successive interger before the comma.

1 2 3

So it is the 2 infinite of real number.

For this we use the binary notation.

But Cantor and his disciples never obtained this simple result.

After that, the theory displays that the infinity is the asymptote of the two branches of the hyperbole thanks to an introduction of trigonometry notions.

The successive numbers which are on cotg (as 1/2, 1/3, 1/4, 1/5) never attain 0 because it would be necessary to write instead (1/2, 1/3, 1/4, 1/4 ).

The 0 of the cotg is also the origin of the asymptote, that is to say infinite.

The beginning is, pretty much, a typical example of the representational crankery. It’s roughly a restatement of, for example, John Gabriel and his decimal trees. The problem with it is simple: this kind of enumeration will enumerate all of the real numbers with finite length representations. Which means that the total set of values enumerated by this won’t even include all of the rational numbers, much less all of the real numbers.
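To make that concrete (this is my gloss, not anything in Collot’s argument): at the stage of the enumeration with n digits after the decimal point, every entry is of the form k/10^n for some integer 0 ≤ k < 10^n. So the scheme only ever lists terminating decimals. Even a number as simple as 1/3 = 0.3333…, whose expansion never terminates, appears at no stage of the list – and neither does any irrational number.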

(As an interesting aside: you can see a beautiful example of what Mr. Collot missed by looking at Conway’s introduction to the surreal numbers, On Numbers and Games, which I wrote about here. He specifically deals with this problem in terms of “birthdays” and the requirement to include numbers who have an infinite birthday, and thus an infinite representation in the surreal numbers.)

After the enumeration stuff, he really goes off the rails. I have no idea what that asymptote nonsense is supposed to mean. I think part of the problem is that Mr. Collot isn’t very good at English, but the larger part of it is that he’s an incoherent crackpot.

Mathematical Illiteracy in the NYT

I know I’m late to the game here, but I can’t resist taking a moment to dive in to the furor surrounding yesterday’s appalling NY Times op-ed, “Is Algebra Necessary?”. (Yesterday was my birthday, and I just couldn’t face reading something that I knew would make me so angry.)

In case you haven’t seen it yet, let’s start with a quick look at the argument:

A typical American school day finds some six million high school students and two million college freshmen struggling with algebra. In both high school and college, all too many students are expected to fail. Why do we subject American students to this ordeal? I’ve found myself moving toward the strong view that we shouldn’t.

My question extends beyond algebra and applies more broadly to the usual mathematics sequence, from geometry through calculus. State regents and legislators — and much of the public — take it as self-evident that every young person should be made to master polynomial functions and parametric equations.

There are many defenses of algebra and the virtue of learning it. Most of them sound reasonable on first hearing; many of them I once accepted. But the more I examine them, the clearer it seems that they are largely or wholly wrong — unsupported by research or evidence, or based on wishful logic. (I’m not talking about quantitative skills, critical for informed citizenship and personal finance, but a very different ballgame.)

Already, this is a total disgrace. The number of cheap fallacies in this little excerpt is just amazing. To point out a couple:

  1. Blame the victim. We do a lousy job teaching math. Therefore, math is bad, and we should stop teaching it.
  2. Obfuscation. The author really wants to make it look like math is really terribly difficult. So he chooses a couple of silly phrases that take simple mathematical concepts, and express them in ways that make them sound much more complicated and difficult. It’s not just algebra: It’s polynomial functions and parametric equations. What do those two terms really mean? Basically, “simple algebra”. “Parametric equations” means “equations with variables”. “Polynomial equations” means equations that include variables with exponents. Which are, of course, really immensely terrifying and complex things that absolutely never come up in the real world. (Except, of course, in compound interest, investment, taxes, mortgages….)
  3. Qualification. The last paragraph essentially says “There are no valid arguments to support the teaching of math, except for the valid ones, but I’m going to exclude those.”

and from there, it just keeps getting worse. The bulk of the argument can be reduced to the first point above: lots of students fail high-school level math, and therefore, we should give up and stop teaching it. It repeats the same thing over and over again: algebra is so terribly hard, students keep failing it, and it’s just not useful (except for all of the places where it is).

One way of addressing the stupidity of this is to just take what the moron says, and try applying it to any other subject:

A typical American school day finds some six million high school students and two million college freshmen struggling with grammatical writing. In both high school and college, all too many students are expected to fail. Why do we subject American students to this ordeal? I’ve found myself moving toward the strong view that we shouldn’t.

My question extends beyond just simple sentence construction and applies more broadly to the usual English sequence – from basic sentence structure and grammar through writing full-length papers and essays. State regents and legislators — and much of the public — take it as self-evident that every young person should be made to master rhetoric, thesis construction, and logical synthesis.

Would any newspaper in the US, much less one as obsessed with its own status as the New York Times, ever consider publishing an article like that claiming that we shouldn’t bother to teach students to write? It’s an utter disgrace, but in America, land of the mathematically illiterate, this is an acceptable, respectable argument when applied to mathematics.

Is algebra really so difficult? No.

But more substantively, is it really useful? Yes. Here’s a typical example, from real life. My wife and I bought our first house back in 1997. Two years later, interest rates had gone down by a substantial amount, so we wanted to refinance. We had a choice between two refinance plans: one had an interest rate 1/4% lower, but required pre-paying 2% of the principal in a special lump interest payment. Which mortgage should we have taken?

The answer is, it depends on how long we planned to own the house. The idea is that we needed to figure out when the amount of money saved by the lower interest rate would exceed the 2% pre-payment.

How do you figure that out?

Well, the amortization equation describing the mortgage is:

m = p \frac{i(1+i)^n}{(1+i)^n - 1}

Where:

  • m is the monthly payment on the mortgage.
  • p is the amount of money being borrowed on the loan.
  • i is the interest rate per payment period.
  • n is the number of payments.

Using that equation, we can compute the monthly payment. If we calculate that for both mortgages, we get two values, m_1 and m_2. Now, how many months before the pre-payment pays off? If D is the amount of the pre-payment to get the lower interest rate, then D = k(m_2 - m_1), where k is the number of months – so it would take k = D/(m_2 - m_1) months. It happened that for us, k worked out to around 60 – that is, about 5 years. We did make the pre-payment. (And that was a mistake; we didn’t stay in that house as long as we’d planned.)
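If you want to play with the arithmetic yourself, here’s a small Scala sketch of that calculation. The numbers are made up for illustration – they’re not our actual mortgage terms:

object Refinance {
  // Monthly payment from the amortization formula: m = p * i(1+i)^n / ((1+i)^n - 1)
  def monthlyPayment(principal: Double, annualRate: Double, years: Int): Double = {
    val i = annualRate / 12                 // interest rate per payment period
    val n = years * 12                      // number of payments
    principal * i * math.pow(1 + i, n) / (math.pow(1 + i, n) - 1)
  }

  def main(args: Array[String]): Unit = {
    val principal = 200000.0
    val m2 = monthlyPayment(principal, 0.0700, 30)   // the plain refinance
    val m1 = monthlyPayment(principal, 0.0675, 30)   // 1/4% lower, bought with a 2% pre-payment
    val d  = 0.02 * principal                        // the pre-payment, D
    val k  = d / (m2 - m1)                           // months until the lower rate pays for itself
    println(f"Break-even after roughly $k%.0f months")
  }
}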

But whoa…. Polynomial equations!

According to our idiotic author, expecting the average american to be capable of figuring out the right choice in that situation is completely unreasonable. We’re screwing over students all over the country by expecting them to be capable of dealing with parametric polynomial equations like this.

Of course, the jackass goes on to talk about how we should offer courses in things like statistics instead of algebra. How on earth are you going to explain bell curves, standard deviations, margins of error, etc., without using any algebra? The guy is so totally clueless that he doesn’t even understand when he’s using it.

Total crap. I’m going to leave it with that, because writing about this is just making me so damned angry. Mathematical illiteracy is just as bad as any other kind of illiteracy.

Weekend Recipe: Catfish Banh Mi!

I’m a huge fan of Banh Mi, the amazing sandwich that came from the fusion of Vietnamese and French food during the 20th century. Banh Mi is, basically, a sandwich made on a Vietnamese baguette (which is like a french baguette, except that it’s got some rice flour mixed with the wheat, giving it a crisper texture), topped with a spicy mayo (or some other sauce), some kind of protein, pickled vegetables, and cilantro.

I only learned about these glorious things recently. When I started working at foursquare last September, our office was in the East Village in Manhattan, just a couple of blocks away from a terrific banh mi restaurant called Baoguette. Baoguette was one of our standard lunch joints. It’s a tacky little run-down hole-in-the-wall place – which is exactly the kind of place you want for what is, basically, street food.

In January, we moved to a new office space down in Soho. It’s a great space – I love our new office. But I miss my banh mi. So since the big move, I’ve been trying to figure out how to replicate my favorite sandwich. And today, I finally did it: I made a home-made catfish banh mi that is, I think, good enough to call an unqualified success. And I’m going to share how I did it with you!

The first thing you need is really good bread. In an ideal world, you want a baguette from a Vietnamese bakery. But since those are hard to find, go french. A really good french baguette is the closest thing – it’s got the right basic texture: a crisp but not crunchy crust.

Next, the veggies. You need to make these at least a couple of hours ahead, ideally at least a day.

You take a daikon radish (also called a chinese turnip), and julienne it, into strips about 1/4 inch by 1/8th of an inch by 3 inches or so. Do the same thing to a couple of carrots. Sprinkle with about a tablespoon of salt. (Don’t worry; it’s not going to turn out super-salty – we’re trying to extract water from the turnip.) Set it aside for about 1/2 an hour, and then drain off the liquid that will have accumulated.

Now, mix together about 1/2 cup of rice vinegar, and about 3/8ths of a cup of sugar. When you’ve got the sugar nicely dissolved, dump this over your veggies, and add cold water until the veggies are just covered. Add in a couple of fronds of cilantro, and a sliced chili pepper, and put the whole shebang into the fridge. Give it a stir once in a while. It’ll be ready to eat in about 4 hours, and it’ll be perfect in a day. After that, drain off the liquid, and put it into a tightly covered container to keep for about a week.

Now, finally, we’re up to the catfish. You really do want catfish for this. Catfish has a great, moist, tender texture, and a strong enough flavor to stand up to what we’re going to do to it. Take a good-size catfish filet, cut it in half down the rib line, and then in half again perpendicular to the rib line. Depending on how big your bread is, you’ll want between 2 and 4 of these catfish strips per sandwich.

Finely mince a clove of garlic, and a couple of slices of fresh ginger. Crush them up with just a bit of salt, to get them really pureed. Then mix them with about one teaspoon each of fish sauce and soy sauce, and about 1/4 teaspoon of Korean chili pepper or cayenne (or more, if you like it spicier – I actually used more like 1/2 a teaspoon, but that’s a lot!). Toss the fish into this mixture, to get it nicely coated, and let it marinate for about 10 minutes.

Now, prepare your bread. Cut a section about 8 inches long, slice it open about 2/3rds of the way through, and rip out some of the interior to make room for the fillings!

Now, sauce: The sauce is simple: half and half mayo and sriracha. Whip some of that up. Give your bread a good layer of the sriracha mayo all around. On top of that, lay a sprig of fresh cilantro. (In the past, I’ve tried using a homemade aioli, but this is one of the rare cases where the premade is actually just as good.)

Now you’re almost ready to cook the fish. Mix together equal parts of general purpose flour and cornstarch, and give the fish a light coat. Heat up a frying pan on medium heat, and add just enough oil to cover the bottom of the pan. (We’re frying here, but not deep-frying!). When the oil is hot, put in the fish, and cook it for about 1 1/2 minutes on each side. You want it to be just barely cooked through.

Put some fish inside the bread, then add a generous helping of the pickled vegetables, and top it all off with more sriracha. And then prepare to eat one of the best sandwiches you’ve ever tasted!

(I’d show you a picture, but we totally devoured these. There’s nothing left to photograph.)

The American Heat Wave and Global Warming

Global warming is a big issue. If we’re honest and we look carefully at the data, it’s beyond question that the atmosphere of our planet is warming. It’s also beyond any honest question that the preponderance of the evidence is that human behavior is the primary cause. It’s not impossible that we’re wrong – but when we look at the real evidence, it’s overwhelming.

Of course, this doesn’t stop people from being idiots.

But what I’m going to focus on here isn’t exactly the usual idiots. See, here in the US, we’re in the middle of a dramatic heat wave. All over the country, we’ve been breaking daily temperature records. As I write this, it’s 98 degrees outside here in NY, and we’re expecting another couple of degrees. Out in the west, there are gigantic wildfires, caused by poor snowfall last winter, poor rainfall this spring, and record heat to dry everything out. So: is this global warming?

We’re seeing lots and lots of people saying yes. Or worse, saying that it is because of the heat wave, while pretending that they’re not really saying that it is. For one example among all too many, you can look at Bad Astronomy here. Not to rag too much on Phil, though, because he’s just one of about two dozen examples of this that I’ve seen in the last three days.

Weather ten or twenty degrees above normal isn’t global warming. A heat wave, even a massive epic heat wave, isn’t proof that global warming is real, any more than an epic cold wave or blizzard is evidence that global warming is fake.

I’m sure you’ve heard many people say weather is not climate. But for human beings, it’s really hard to understand just what that really means. Climate is a world-wide long-term average; weather is instantaneous and local. This isn’t just a nitpick: it’s a huge distinction. When we talk about global warming, what we’re talking about is the year-round average temperature changing by one or two degrees. A ten degree variation in local weather doesn’t tell us anything about the worldwide trend.

Global warming is about climate. And part of what that means is that in some places, global warming will probably make the weather colder. Cold weather isn’t evidence against global warming. Most people realize that – which is why we all laugh when gasbags like Rush Limbaugh talk about how a snowstorm “proves” that global warming is a fraud. But at the same time, we look at weather like what we have in the US, and conclude that “Yes, global warming is real”. But we’re making the same mistake.

Global warming is about a surprisingly small change. Over the last hundred years, global warming amounts to a change of about 1 degree Celsius in the global average temperature. That’s about 1 1/2 degrees Fahrenheit, for us Americans. It seems minuscule, and it’s a tiny fraction of the temperature difference that we’re seeing this summer in the US.

But that tiny difference in climate can cause huge differences in weather. As I mentioned before, it can make local weather either warmer or colder – not just by directly warming the air, but by altering wind and water currents in ways that create dramatic changes.

For example, global warming could, likely, make Europe significantly colder. How? The weather in western Europe is greatly affected by an ocean water current called the atlantic conveyor. The conveyor is a cyclic ocean current, where (driven in part by the jet stream), warm water flows north from the equator in a surface current, cooling as it goes, until it finally sinks and starts to cycle back south in a deep underwater current. This acts as a heat pump, moving energy from the equator north and east to western Europe. This is why Western Europe is significantly warmer than places at the same latitude in Eastern North America.

Global warming could alter the flow of the atlantic conveyor. (We don’t know if it will – but it’s one possibility, which makes a good example of something counter-intuitive.) If the conveyor is slowed, so that it transfers less energy, Europe will get colder. How could the conveyor be slowed? By ice-melt. The conveyor works as a cycle because of the differences in density between warm and cold water: cold water is denser than warm water, so the cold water sinks as it cools. It warms in the tropics, gets pushed north by the jet stream, cools along the way and gradually sinks.

But global warming is melting a lot of Arctic and glacier ice, which produces freshwater. Freshwater is less dense than saltwater. So when the freshwater dilutes the cold water at the northern end of the conveyor, it reduces its density relative to pure salt water – and that reduces the tendency of the cold water to sink, which could slow the conveyor.

There are numerous similar phenomena that involve changes in ocean currents and wind due to relatively small temperature variations. El Niño and La Niña, conveyor changes, changes in the moisture-carrying capacity of wind currents – they’re all caused by relatively small changes, changes well within the couple of degrees of variation that we see occurring.

But we need to be honest and careful. This summer may be incredibly hot, and we had an unusually warm winter before it – but we really shouldn’t try to use that as evidence of global warming. Because if we do, when some colder-than-normal weather occurs somewhere, the cranks and liars that want to convince people that global warming is an elaborate fraud will use that to muddle things – and when they do, it’ll be our fault when people fall for it, because we’ll be the ones who primed them for that argument. As nice, as convenient, as convincing as it might seem to draw a correlation between a specific instance of extreme weather and global warming, we really need to stop doing it.

The Meaning of the Higgs

This isn’t exactly my area of expertise, but I’ve gotten requests by both email and twitter to try to explain yesterday’s news about the Higgs’ boson.

The questions.

  • What is this Higgs’ boson thing?
  • How did they find it?

  • What does the five sigma stuff mean?
  • Why do they talk about it as a “Higgs’-like particle”?

So, first things first. What is a Higgs’ boson?

When things in the universe interact, they usually don’t actually interact by touching each other directly. They interact through forces and fields. What that means is a bit tricky. I can define it mathematically, but it won’t do a bit of good for intuition. But the basic idea is that space itself has some properties. A point in space, even when it’s completely empty, has some properties.

Outside of empty space, we have particles of various types. Those particles interact with each other, and with space itself. Those interactions are what end up producing the universe we see and live in.

Fields are, essentially, a property of space. A field is, at its simplest, a kind of property of space that is defined at every point in space.

When particles interact with fields, they can end up exchanging energy. They do that through a particular kind of particle, called an exchange particle. For example, think about an electromagnetic field. An electron orbits an atomic nucleus, due to forces created by the electromagnetic fields of the electrons and protons. When an electron moves to a lower-energy orbital, it produces a photon; when it absorbs a photon, it can jump to a higher orbital. The photon is the exchange particle for the electromagnetic field. Exchange particles are instances of a kind of particle called a boson.

So.. one of the really big mysteries of physics is: why do some particles have mass, and other particles don’t? That is, some particles, like protons, have masses. Others, like photons, don’t. Why is that?

It’s quite a big mystery. Based on our best model – called the standard model – we can predict all of the basic kinds of particles, and what their masses should be. But we didn’t have a clue about why there’s mass at all!

So, following the usual pattern in particle physics, we predict that there’s a field. Particles moving through that field, if they interact with the field, experience a sort of drag. That drag is mass. So – just like particles like neutrinos aren’t affected by electromagnetic fields, some particles like photons won’t have mass because they don’t interact with the field that produces mass. We call that field the Higgs’ field.

(The previous paragraph formerly contained an error. The higgs field produces mass, not gravity. Just a stupid typo; my fingers got ahead of my brain.)

So physicists proposed the existence of the Higgs’ field. But how could they test it?

It’s a field. Fields have exchange particles. What would the exchange particles of the Higgs’ field be? Exchange particles are bosons, so this one is, naturally, called a Higgs’ boson. So if the Higgs’ field exists, then it will have an exchange particle. If the standard model of physics is right, then we can use it to predict the mass that that boson must have.

So – if we can find a particle whose mass matches what we predict, and it has the correct properties for a mass-field exchange particle, then we can infer that the Higgs’ field is real, and is the cause of mass.

How did they find the Higgs’ boson?

We have a pretty good idea of what the mass of the Higgs’ boson must be. We can describe that mass in terms of a quantity of energy. (See the infamous Einstein equation!) We can take particles that we can easily see and manipulate, accelerate them up to super-high speeds, and collide them together. If the energy of a collision matches the mass of a particle, it can create that kind of particle. So if we slam together, say, two protons at high enough energy, we can get a Higgs’ boson.

But things are never quite that easy. There are a bunch of problems. First, the kind of collision that can produce a Higgs’ doesn’t always produce one. It can produce a variety of results, depending on the specifics of the collision as well as purely random factors. Second, it produces a lot more than just a Higgs’. We’re talking about an extremely complex, extremely high energy collision, with a ton of complex results. And third, the Higgs’ boson isn’t particularly stable. It doesn’t really like to exist. So like many unstable things in particle physics, it decays, producing other particles. And many of those particles are themselves unstable, and decay into other particles. What we can observe is the last products of the collision, several steps back from the Higgs’. But we know what kind of things the Higgs’ can decay into, and what they can decay into, etc.

So, we slam these things together a couple of thousand, or a couple of million times. And we look at the results. We look at all of the results of all of the collisions. And we specifically look for a bump: if there’s really a specific collision energy level at which Higgs’ bosons are produced, then we’ll see a bump in the number of Higgs’ decay products that are produced by collisions at that energy. And what the announcement yesterday showed is that that’s exactly what they saw: a bump in the observations inside the expected range of values of the mass of a Higgs’ boson.

[Figure: the bump in the observed decay products, at the expected mass of the Higgs’ boson]

What does five sigmas mean?

Whenever we’re making observations of a complex phenomenon, there are all sorts of things that can confound our observations. There are measurement errors, calculation errors, and random noise, among many other things. So we can’t just look at one, or two, or ten data points. We need to look at a lot of data. And when you’ve got a lot of data, there’s always a chance that you’ll see what appears to be a pattern in the data, which is really just the product of random noise.

For example, there are some people who’ve won the lottery multiple times. That seems crazy – it’s so unlikely to win once! To win multiple times seems crazy. But probabilistically, if you keep observing lotteries, you’ll find repeat winners. Or you’ll find apparent patterns in the winning numbers, even though they’re being drawn randomly.

We don’t want to be fooled by statistics. So we create standards. We can compute how unlikely a given pattern would be, if it were occurring due to pure randomness. We can’t ever absolutely rule out randomness, but for any degree of certainty, we can determine just how unlikely a given observation is to be due to randomness.

We describe that in terms of standard deviations. An observation of a phenomenon has a roughly 68% chance of being measured within one standard deviation (one sigma) of the actual value, or a roughly 32% chance of being observed outside of one sigma. At two sigmas, there’s only a roughly 5% chance of being outside. At three sigmas out, you’re down to a roughly 0.3% chance of randomly observing an event outside. The odds continue that way.
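To put some concrete numbers on “the odds continue that way” (these are standard normal-distribution figures, not anything specific to the Higgs’ analysis): the chance of landing more than one sigma from the true value by pure chance is roughly 32%; more than two sigmas, about 5%; more than three sigmas, about 0.3%; more than four sigmas, about 0.006%; and more than five sigmas, about 0.00006% – in the one-sided form that particle physicists quote, roughly one chance in 3.5 million.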

So, the Higgs’ hunters computed the probability of observing the data that they found if they assumed that there was no Higgs’. The data that they found was more than 5 sigmas away from what you would expect by random chance if there were no Higgs’. That translates as damned unlikely. The ultimate choice of 5 sigmas is arbitrary, but it’s accepted as the threshold for certainty in particle physics. At five sigmas, we realistically rule out random chance.

Why do they keep saying Higgs’-like particle?

Remember up above, I said: “So – if we can find a particle whose mass matches what we predict, and it has the correct properties for a mass-field exchange particle, then we can infer that the Higgs’ field is real, and is the cause of mass”? There are two things we need to show to conclude that we’ve found the mediator of the Higgs’ field. There needs to be a particle with the right mass, and it needs to have the properties of a mass-mediator. What we’ve got right now is an observation that yes, there is a particle at the mass we’d expect for a Higgs’. But we don’t have observations yet of any properties of the particle other than its mass. Assuming the standard model is right, the odds of finding another particle at that mass are damned small, but the standard model could be wrong. It’s not likely at this point, but people like to be careful. So at this point, to be precise, we’ve observed a Higgs’-like particle – a particle that, according to all of the observations we’ve made so far, appears to be a Higgs’; but until we observe some properties other than mass, we can’t be absolutely certain that it’s a Higgs’.

Turing Machines – what they are, what they aren't.

It’s the anniversary of the birth of Alan Turing. So there’s a ton of people writing commemorative stories. And naturally, a ton of those stories get it wrong. And they get it wrong in a very sad way.

Of course, when you talk about Turing, you talk about Turing machines. Despite the fact that Turing did lots of stuff besides just that machine, it’s always the machine that people focus on. Partly that’s laziness – the machine has his name after all, so it’s one of the first things you find if you Google Turing. It’s also the easiest thing to write about.

What do they say about the Turing machine? It’s “the simplest computing device”. It’s the “basis for modern computers”. It’s “the theoretical model for the microchips in your laptop”. It’s the “mathematical description of your computer”. None of those things are true. And they all both over- and under-state what Turing really did. In terms of modern computers, the Turing machine’s contribution to the design of real computers is negligible if not non-existent. But his contribution to our understanding of what computers can do, and our understanding of how mathematics really works – those are far, far more significant than the architecture of any single machine.

The Turing machine isn’t a model of real computers. The computer that you’re using to read this has absolutely nothing to do with the Turing machine. As a real device, the Turing machine is absolutely terrible.

The Turing machine is a mathematical model not of computers, but of computation. That’s a really important distinction. The Turing machine is an easy-to-understand model of a computing device. It’s definitely not the simplest model: there are simpler computing devices (for example, I think that the rule 110 cellular automaton is simpler) – but their simplicity makes them harder to understand.

The reason that the Turing machine is so important comes down to two important facts. First, which machine you use to talk about computation doesn’t matter. There’s a limit to what a mechanical device can do. There are lots of machines out there – but ultimately, no machine can go past the limit. Any machine that can reach that limit is, for the purposes of the theory of computation, pretty much the same. When we talk about studying computation, what we’re talking about is the set of things that can be done by a machine – not by a specific machine, but by any machine. The specific choice of machine isn’t important. And that’s the point: computation is computation. That’s what Turing figured out.

Second, the Turing machine itself is a brilliant creation. It’s a simple machine. It’s really easy to understand. And it’s easy to tweak – that is, it’s easy to do experiments where you modify the machine, and see what effect that has.

So let’s take a step back, and see: what is a Turing machine?

The Turing machine is a very simple kind of theoretical computing device. In fact, it’s almost downright trivial. But according to everything that we know and understand about computation, this trivial device is capable of any computation that can be performed by any other computing device.

The basic idea of the Turing machine is very simple. It’s a machine that runs on top of a tape, which is made up of a long series of little cells, each of which has a single character written on it. The machine is a read/write head that moves over the tape, and which can store a little bit of information. At each step, the machine looks at the symbol on the cell under the tape head, and based on what it sees there, and whatever little bit of information it has stored, it decides what to do. The things that it can do are: change the information it has stored, write a new symbol onto the current tape cell, and move one cell left or right.

That’s really it. People who like to make computing sound impressive often have very complicated explanations of it – but really, that’s all there is to it. The point of it was to be simple – and simple it certainly is. And yet, it can do anything that’s computable.

To really understand how that trivial machine can do computations, it helps to be a bit formal. In formal terms, we talk about a Turing machine as a tuple (S, s0, A, T) – there’s a quick code sketch of this right after the list – where:

  • S is a finite list of possible states that the machine can be in. The state is the information that the machine can store in the head to make decisions. It’s a very limited amount of information; you can think of it as either a specific list of states, or a small set of small numbers. For most Turing machines, we’ll use it as a specific list of states. (You’ll see what I mean in a minute.)
  • s0 ∈ S is the initial state – the state that the machine will be in when it starts a computation.
  • A is the machine’s alphabet, which is the set of symbols which can appear on the machine’s tape.
  • T is the machine’s transition function. This is the real heart of the machine, where the computation is defined. It’s a function from the machine state and the alphabet character on the current tape cell to the action that the machine should take. The action is a triple consisting of a new value for the machine’s state, a character to write onto the current tape cell, and a direction to move the tape head – either left or right.
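
To make that concrete, here’s a minimal sketch of the same tuple as Haskell types. This is just my illustration – the names are made up, and the interpreter later in this post uses a slightly different, list-based representation:

-- A minimal sketch of the (S, s0, A, T) tuple as Haskell types.
-- Illustrative only; the interpreter below represents the transition
-- function as a list of transitions rather than a function.
data Direction = DirLeft | DirRight

data TM state sym = TM
  { tmStates  :: [state]                                   -- S
  , tmInitial :: state                                     -- s0
  , tmAlpha   :: [sym]                                     -- A
  , tmTrans   :: state -> sym -> (state, sym, Direction)   -- T
  }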

So, for example, let’s look at a simple machine. This is one of the classic examples: a Turing machine that does subtraction using unary numbers. A unary number “N” is written as a series of N 1s. For the program, we’ll give the machine an input in the format “N-M=” written onto the tape; after running the machine, the tape will contain the value of M subtracted from N. So, for example, we could use “1111-11=” as an input; the output would be “11”.

The alphabet is the characters “1”, ” ” (blank space), “-” (minus sign), and “=” (equal sign). The machine has four states: {“scanright”, “eraseone”, “subone”, “skip”}. It starts in the state “scanright”. Its transition function is given by the following table:

FromState   Symbol   ToState     WriteChar   Dir
scanright   space    scanright   space       right
scanright   1        scanright   1           right
scanright   minus    scanright   minus       right
scanright   equal    eraseone    space       left
eraseone    1        subone      equal       left
eraseone    minus    HALT        space       n/a
subone      1        subone      1           left
subone      minus    skip        minus       left
skip        space    skip        space       left
skip        1        scanright   space       right

What this machine does is move to the right until it sees the equal sign; it erases the equal sign and moves to the left, erases one digit off of the second number, and replaces it with the equal sign (so the second number is reduced by one, and the equal sign is moved over one position). Then it scans back to the minus sign (which separates the two numbers), erases one digit of the first number, and switches back to scanning to the right for the equal sign. So, one at a time, it erases one digit from each of the two numbers. When it goes back to erase a digit from the second number and hits the “-” before it finds a digit, the second number has been used up, which means it’s done, so it stops. Let’s look at a simple execution trace. In the trace, I’ll write the machine state, followed by a colon, followed by the tape contents surrounded by “[]”, with the current tape cell surrounded by “{}”.

	scanright:  [ {1}1111111-111= ]
	scanright:  [ 1{1}111111-111= ]
	scanright:  [ 11{1}11111-111= ]
	scanright:  [ 111{1}1111-111= ]
	scanright:  [ 1111{1}111-111= ]
	scanright:  [ 11111{1}11-111= ]
	scanright:  [ 111111{1}1-111= ]
	scanright:  [ 1111111{1}-111= ]
	scanright:  [ 11111111{-}111= ]
	scanright:  [ 11111111-{1}11= ]
	scanright:  [ 11111111-1{1}1= ]
	scanright:  [ 11111111-11{1}= ]
	scanright:  [ 11111111-111{=} ]
	eraseone :  [ 11111111-11{1}  ]
	subone   :  [ 11111111-1{1}=  ]
	subone   :  [ 11111111-{1}1=  ]
	subone   :  [ 11111111{-}11=  ]
	skip     :  [ 1111111{1}-11=  ]
	scanright:  [ 1111111 {-}11=  ]
	scanright:  [ 1111111 -{1}1=  ]
	scanright:  [ 1111111 -1{1}=  ]
	scanright:  [ 1111111 -11{=}  ]
	eraseone :  [ 1111111 -1{1}   ]
	subone   :  [ 1111111 -{1}=   ]
	subone   :  [ 1111111 {-}1=   ]
	skip     :  [ 1111111{ }-1=   ]
	skip     :  [ 111111{1} -1=   ]
	scanright:  [ 111111 { }-1=   ]
	scanright:  [ 111111  {-}1=   ]
	scanright:  [ 111111  -{1}=   ]
	scanright:  [ 111111  -1{=}   ]
	eraseone :  [ 111111  -{1}    ]
	subone   :  [ 111111  {-}=    ]
	skip     :  [ 111111 { }-=    ]
	skip     :  [ 111111{ } -=    ]
	skip     :  [ 11111{1}  -=    ]
	scanright:  [ 11111 { } -=    ]
	scanright:  [ 11111  { }-=    ]
	scanright:  [ 11111   {-}=    ]
	scanright:  [ 11111   -{=}    ]
	eraseone :  [ 11111   {-}     ]
	Halt:       [ 11111  { }-     ]

See, it works!

One really important thing to understand here is that we do not have a program. What we just did was define a Turing machine that does subtraction. The machine does not take any instructions: the states and the transition function are an intrinsic part of the machine. So the only thing this machine can do is to subtract.

The real genius of Turing wasn’t just defining this simple computing machine. It was realizing the concept of the program: you could design a Turing machine whose input tape contained a description of a Turing machine – that is, a program – followed by an input to the program. This single machine – the universal Turing machine – could simulate any other Turing machine: one machine which could perform any computation.

Turing machines are a lot of fun to play with. Figuring out how to do things with them can be fascinating. Figuring out how to define a universal Turing machine’s transition function is an amazing exercise; it’s astonishing how simple the universal machine can be!

As I said earlier – you can’t make a mechanical computing device that does anything that a Turing machine can’t do. One of the beauties of the Turing machine is that it lets you explore that. You can try adding and removing features to the basic machine, and see what happens.

For example: if you can do lots of great stuff with a Turing machine with one tape, what about a two-tape Turing machine? That is, take the basic Turing machine, and say that it’s got two tapes, each with its own read/write head. Each state transition rule on this machine depends on the pair of values found on the two tapes. For now, we’ll say that the tapes move together – that is, the transition rule says “move the heads right” or “move the heads left”.

Seems like this should represent a real increase in power, right? No. Take the two-tape machine; call the alphabet of its first tape A1, and the alphabet of its second tape A2. We can build a single-tape Turing machine whose alphabet is the cross-product of A1 and A2. Now each symbol on the single tape is the equivalent of a pair: one symbol from tape 1 and one symbol from tape 2. So we’ve got a single-tape machine which is equivalent to the two-tape machine. Bingo.
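
Here’s a minimal sketch of that alphabet construction in Haskell – purely illustrative, not part of the interpreter below:

-- The product alphabet for simulating two lock-stepped tapes with one:
-- each cell of the single tape holds a pair (symbol on tape 1, symbol on tape 2).
productAlphabet :: [a] -> [b] -> [(a, b)]
productAlphabet a1 a2 = [ (x, y) | x <- a1, y <- a2 ]

-- For example, productAlphabet "1 -=" "1 -=" gives the 16 pair-symbols
-- you'd need to simulate two copies of the subtraction machine's tape.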

We can lift the restriction on the heads moving together, but it’s a lot more work, and a two-tape machine can do things a lot faster than a one-tape machine, so the simulation will necessarily be slower. But it’s entirely doable. How about a two-dimensional tape? We can simulate that pretty easily with a two-tape machine, which means we can do it with a one-tape machine. For the two-tape version, what we do is map the two-D tape onto the one-D tape – so that cell 0 on the one-D tape corresponds to cell (0,0) on the two-D; cell (0,1) on the two-D corresponds to cell 1 on the one-D; cell (1,1) on the two-D is cell 2 on the one-D; and so on. Then we use the second tape for the book-keeping necessary to do the equivalent of two-D tape moves. And we’ve got a two-D Turing machine simulated with a two-tape one-D; and we know that we can simulate a two-tape one-D with a one-tape one-D.
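
The exact enumeration of cells doesn’t matter – any bijection between pairs and naturals will do. Here’s one standard choice (the Cantor diagonal pairing) as a quick sketch; it isn’t necessarily the same layout I described above, just an example of such a mapping:

-- Map a 2-D cell (x, y) to a 1-D cell index by walking the diagonals.
-- Any bijection works; this is the classic Cantor pairing function.
cellIndex :: Integer -> Integer -> Integer
cellIndex x y = let d = x + y in d * (d + 1) `div` 2 + x

-- cellIndex 0 0 == 0, cellIndex 0 1 == 1, cellIndex 1 0 == 2, cellIndex 0 2 == 3, ...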

This is, to me, the most beautiful thing about the Turing machine. It’s not just a fundamental theoretical construction of a computing device; it’s a simple construction of a computing device that’s really easy to experiment with. Consider, for a moment, lambda calculus. It’s more useful than a Turing machine for lots of purposes – we write real programs in lambda calculus, while no one would build a real application as a Turing machine program. But imagine trying experiments like the alternate constructions above: it’s a whole lot harder to build them in lambda calculus. Likewise for Minsky machines, Markov machines, and so on.

For your enjoyment, I’ve implemented a Turing machine programming language. You feed it a Turing machine description and an input string, and it will give you a trace of the machine’s execution, like the one above. Here’s the specification of the subtraction machine written in my little Turing language:

states { "scanright" "eraseone" "subtractOneFromResult" "skipblanks" } initial "scanright"
alphabet { '1' ' ' '=' '-' } blank ' '
trans from "scanright" to "scanright" on (' ') write ' ' move right
trans from "scanright" to "scanright" on ('1') write '1' move right
trans from "scanright" to "scanright" on ('-') write '-' move right
trans from "scanright" to "eraseone" on ('=') write ' ' move left
trans from "eraseone" to "subtractOneFromResult" on ('1') write '=' move left
trans from "eraseone" to "Halt" on ('-') write ' ' move left
trans from "subtractOneFromResult" to "subtractOneFromResult" on ('1') write '1' move left
trans from "subtractOneFromResult" to "skipblanks" on ('-') write '-' move left
trans from "skipblanks" to "skipblanks" on (' ') write ' ' move left
trans from "skipblanks" to "scanright" on ('1') write ' ' move right

I think it’s pretty clear as a syntax, but it still needs explanation.

  • The first line declares the possible states of the machine, and what state it starts in. This machine has four possible states – “scanright”, “eraseone”, “subtractOneFromResult”, and “skipblanks”. When the machine starts, it will be in the state “scanright”.
  • The second line declares the set of symbols that can appear on the tape, including what symbol will initially appear on any tape cell whose value wasn’t specified by the input. For this machine the symbols are the digit 1, a blank space, the equal sign, and the subtraction symbol; the blank symbol is on any tape cell whose initial value wasn’t specified.
  • After that is a series of transition declarations. Each declaration specifies what the machine will do for a given pair of current state and tape symbol. So, for example, if the machine is in state “scanright”, and the current tape cell contains the equal sign, then the machine will change to state “eraseone”, write a blank onto the tape cell (erasing the “=” sign), and move the tape head one position to the left.

Finally, the code. You’ll need GHCi to run it; at the moment, it won’t work in Hugs (I don’t have the parsing library in my version of Hugs), and I haven’t worked out the linkage stuff to make it work in GHC’s compiled mode.

--
-- A Simple Turing machine interpreter
-- Copyright 2007 by Mark C. Chu-Carroll
--    markcc@gmail.com
--   http://scienceblogs.com/goodmath
--
-- You can do anything you want with this code so long as you
-- leave this copyright notice intact.
--
module Turing where
import Text.ParserCombinators.Parsec
import qualified Text.ParserCombinators.Parsec.Token as P
import Text.ParserCombinators.Parsec.Language
import System.Environment

data Motion = MoveLeft  | MoveRight deriving (Show)

-- Transition from to on move write
data Transition = Transition String String [Char] Motion  Char deriving (Show)

-- TuringMachine states initial alphabet blank transitions
data TuringMachine = Machine [String] String   [Char]    Char    [Transition] deriving (Show)

data MachineState = TMState [Char] Char [Char]  String  deriving (Show)
--                  tape to the left (stored reversed, nearest cell first), current cell, tape to the right, current state

getBlankSym :: TuringMachine -> Char
getBlankSym (Machine _ _ _ bl _) = bl

findMatchingTransition :: [Transition] -> String -> Char -> Maybe Transition
findMatchingTransition [] _ _ =  Nothing
findMatchingTransition translist state c =
     let matches = filter (transitionMatches state c) translist
     in case matches of
          [] -> Nothing
          t:[] -> Just t
          _ -> error "Ambiguous transition"
       where transitionMatches state c (Transition from to on move write) =
                        (from == state) && (elem c on)

runTransition :: TuringMachine -> [Char] -> Char -> [Char] -> String -> Transition -> MachineState
runTransition m (l:left) c right state (Transition from tostate on MoveLeft write) =
   TMState left l (write:right) tostate
runTransition m [] c right state (Transition from tostate on MoveLeft write) =
   -- moving left past the recorded end of the tape: the new current cell is a blank
   TMState [] (getBlankSym m) (write:right) tostate
runTransition m left c [] state (Transition from tostate on MoveRight write) =
   TMState (write:left) (getBlankSym m) [] tostate
runTransition m left c (r:right) state (Transition from tostate on MoveRight write) =
   TMState (write:left) r right tostate

stepMachine :: TuringMachine -> MachineState -> MachineState
stepMachine machine@(Machine _ _ _ _ translist) st@(TMState tapeleft c taperight state) =
       let trans = findMatchingTransition translist state c
       in case trans of
          (Just t) -> runTransition machine tapeleft c taperight state t
          Nothing -> error "No applicable transition from state"

getMachineStateString (TMState left c right state) =
	(state ++ ":[" ++ (reverse left) ++ "{" ++ [c] ++ "}" ++ right ++ "]")

runTracedMachine :: TuringMachine -> [Char] -> [String]
runTracedMachine tm@(Machine states initial alphabet blank transitions) (t:tape) =
    runTracedMachineSteps tm (TMState [] t tape initial) where
        runTracedMachineSteps machine state =
           let newstate@(TMState left c right st) = stepMachine machine state
           in if st == "Halt"
               then [getMachineStateString newstate]
               else let trace=runTracedMachineSteps machine newstate
                    in ((getMachineStateString newstate):trace)

runMachine :: TuringMachine -> [Char] -> [Char]
runMachine tm@(Machine states initial alphabet blank transitions) (t:tape) =
    runMachineSteps tm (TMState [] t tape initial) where
        runMachineSteps machine state =
           let newstate@(TMState left c right st) = stepMachine machine state
           in if st == "Halt"
               then (concat [(reverse left), [c], right])
               else runMachineSteps machine newstate

---------------------------------------------------------------------------
-- Parsing code - implemented using the Parsec monadic parser library.
---------------------------------------------------------------------------

-- Basic setup stuff - use a standard haskell style lexer; set up the reserved words
-- and symbols in the lexer.
lexer :: P.TokenParser ()
lexer = P.makeTokenParser (haskellDef
                        { P.reservedNames = ["states","alphabet","trans","from","to","on","write","move","left","right","initial","blank"] })

reserved = P.reserved lexer
symbol = P.symbol lexer
braces = P.braces lexer
parens = P.parens lexer
charlit = P.charLiteral lexer
strlit = P.stringLiteral lexer
ident = P.identifier lexer
whitespace = P.whiteSpace lexer

states = reserved "states"
alphabet = reserved "alphabet"
trans = reserved "trans"
from = reserved "from"
to = reserved "to"
on = reserved "on"
write = reserved "write"
move = reserved "move"
initial = reserved "initial"
blank = reserved "blank"

left = do { reserved "left"
          ; return MoveLeft
          }

right = do { reserved "right"
           ; return MoveRight
           }

-- statesDecl ::= "states" "{"  strlit+  "}" "initial" strlit
statesDecl = do { states
                ; strlst <- braces (many1 strlit)
                ; initial
                ; i <- strlit
                ; return (i,strlst)
                }

-- alphaDecl ::= "alphabet" "{" charlit+  "}" "blank" charlit
alphaDecl = do { alphabet
               ; charlst <- braces (many1 charlit)
               ; blank
               ; bl <- charlit
               ; return (bl, charlst)
               }

-- transDecl ::= "trans" "from" strlit "to" strlit "on" "(" charlit+ ")" "write" charlit "move" ("left" | "right")
transDecl = do { trans
               ; from
               ; fromState <- strlit
               ; to
               ; targetState <- strlit
               ; on
               ; chrs <- parens (many1 charlit)
               ; write
               ; wrchar <- charlit
               ; move
               ; dir <- ( left <|> right)
      	   ; return (Transition fromState targetState chrs dir wrchar)
              }

-- machine ::= statesDecl alphaDecl transDecl+
machine = do { (i,sts) <- statesDecl
             ; (bl,alpha) <- alphaDecl
             ; trans <- many1 transDecl
             ; return (Machine sts i alpha bl trans)
             }

run :: (Show a) => Parser a -> String -> IO a
run p input
	= case (parse p "" input) of
		Left err -> do{ putStr "parse error at "
		; print err
		; error "Parse error"
		}
		Right x -> return x

runTParser ::  String -> IO TuringMachine
runTParser input =
	run (do { whitespace
	        ; x <- machine
	        ; eof
	        ; return x })  input

--------------------------------------------------------------------------
-- A sample Turing program - first in the comment, and then coded into
-- a string literal, to make it easy to run tests. This sample program
-- is a basic Turing subtract - it takes a unary equation "111111-111=",
-- and leaves the result of subtracting the second from the first.
--
-- states { "one" "two" "three" "four" } initial "one"
-- alphabet { '1' ' ' '=' '-' } blank ' '
-- trans from "one" to "one" on (' ') write ' ' move right
-- trans from "one" to "one" on ('1') write '1' move right
-- trans from "one" to "one" on ('-') write '-' move right
-- trans from "one" to "two" on ('=') write ' ' move left
-- trans from "two" to "three" on ('1') write '=' move left
-- trans from "two" to "Halt" on ('-') write '-' move left
-- trans from "three" to "three" on ('1') write '1' move left
-- trans from "three" to "four" on ('-') write '-' move left
-- trans from "four" to "four" on (' ') write ' ' move left
-- trans from "four" to "one" on ('1') write ' ' move right

sampleMachine = concat ["states { \"one\" \"two\" \"three\" \"four\" } initial \"one\"\n ",
                        " alphabet { '1' ' ' '=' '-' } blank ' '\n ",
                        "trans from \"one\" to \"one\" on (' ') write ' ' move right\n ",
                        "trans from \"one\" to \"one\" on ('1') write '1' move right\n ",
                        "trans from \"one\" to \"one\" on ('-') write '-' move right\n ",
                        "trans from \"one\" to \"two\" on ('=') write ' ' move left\n ",
                        "trans from \"two\" to \"three\" on ('1') write '=' move left\n ",
                        "trans from \"two\" to \"Halt\" on ('-') write '-' move left\n ",
                        "trans from \"three\" to \"three\" on ('1') write '1' move left\n ",
                        "trans from \"three\" to \"four\" on ('-') write '-' move left\n ",
                        "trans from \"four\" to \"four\" on (' ') write ' ' move left\n ",
                        "trans from \"four\" to \"one\" on ('1') write ' ' move right"  ]

runTracedMachineOnString :: String -> String -> IO ([String])
runTracedMachineOnString m str =
	do
		tm <- runTParser m
		return (runTracedMachine tm str)

runMachineOnString :: String -> String -> IO String
runMachineOnString m str =
    do
	    tm <- runTParser m
	    return (runMachine tm str)

sampleInput = " 11111111-111= "

------------------------------------------------------------------------
-- Main program execution scaffolding
-- main still needs a bit of work so that ghci will link correctly;
-- runs fine in GHCI, but linkage errors in GHC. For now, just load
-- this file, and then execute "runFromFile" from the command line.
------------------------------------------------------------------------
main =
    do
       [file] <- getArgs
       m <- parseFromFile (do { whitespace
                              ; x <- machine
                              ; eof
                              ; return x }) file
       case m of
          Right machine -> do
             print "Enter input for parser:"
             s <- getLine
             result <- return (runMachine machine s)
             print (concat ["Result:[", result, "]"])
          Left x -> do
	        print (concat ["Parse error"])

runFromFile :: String -> IO ()
runFromFile filename =
    do
       m <- parseFromFile (do { whitespace
                              ; x <- machine
                              ; eof
                              ; return x }) filename
       case m of
          Right machine -> do
             print "Enter input for parser:"
             s <- getLine
             result <- return (runMachine machine s)
             print (concat ["Result:[", result, "]"])
          Left x -> do
	        print (concat ["Parse error"])
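
For what it’s worth, here’s roughly how I exercise it from GHCi – just a sketch of a session, assuming the module above loads cleanly (the file names are whatever you saved things as):

-- A rough GHCi session (illustrative):
--   ghci> :load Turing.hs
--   ghci> trace <- runTracedMachineOnString sampleMachine sampleInput
--   ghci> mapM_ putStrLn trace
--   ghci> runMachineOnString sampleMachine sampleInput
--
-- Or, with a machine description saved in a file (the file name here is hypothetical):
--   ghci> runFromFile "subtract.tm"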


A Neat Trick with Partial Evaluators

This is something I was talking about with a coworker today, and I couldn’t quickly find a friendly writeup of it online, so I decided to write my own. (Yes, we do have a lot of people who enjoy conversations like this at foursquare, which is one of the many reasons that it’s such a great place to work. And we are hiring engineers! Feel free to get in touch with me if you’re interested, or just check our website!)

Anyway – what we were discussing was something called partial evaluation (PE). The idea of PE is almost like an eager form of currying. Basically, you have a two-parameter function, f(x, y). You know that you’re going to invoke it a bunch of times with the same value of x, but different values of y. So what you want to do is create a new function, fx(y), which evaluates as much as possible of f using the known parameter.

It’s often clearer to see it in programmatic form. Since these days I pretty much live in Scala, I’ll use Scala syntax. We can abstractly represent a program as an object which can be run with a list of arguments, and which returns some value as its result.

trait Program {
  def run(inputs: List[String]): String
}

If that’s a program, then a partial evaluator would be:

object PartialEvaluator {
  def specialize(prog: Program, knownInputs: List[String]): Program = {
     ...
  }
}

What it does is take a program and a partial input, and return a new program: the original program, specialized for the partial inputs supplied to the partial evaluator. So, for example, imagine that you have a program like:

object Silly extends Program {
  def run(inputs: List[String]): String = {
    val x: Int = augmentString(inputs(0)).toInt
    val y: Int = augmentString(inputs(1)).toInt
    if (x % 2 == 0) {
      (y * y).toString
    } else {
      ((y + 1) * (y + 1)).toString
    }
  }
}

This is obviously a stupid program, but it’s good for an example, because it’s so simple. If we’re going to run Silly a bunch of times with 1 as the first parameter, but with different second parameters, then we could generate a specialized version of Silly:

   val sillySpecialOne = PartialEvaluator.specialize(Silly, List("1"))

Here’s where the interesting part comes in, and where it’s really different from currying: a partial evaluator evaluates everything that it possibly can, given the inputs that it’s supplied. So the value produced by the specializer would be:

object Silly_1 extends Program {
  def run(inputs: List[String]): String = {
    val y: Int = augmentString(inputs(0)).toInt
    ((y + 1) * (y + 1)).toString
  }
}

This can be a really useful thing to do; it can turn into a huge optimization. (Which is, of course, why people are interested in it.) In compiler terms, it’s sort of like an uber-advanced form of constant propagation.

But the cool thing that we were talking about is going a bit meta. Suppose that you run a partial evaluator on a programming language interpreter?

object MyInterpreter extends Program {
  def run(inputs: List[String]): String = {
    val code = inputs(0)
    val inputsToCode = inputs.tail
    interpretProgram(code, inputsToCode)
  }

  def interpretProgram(code: String, inputs: List[String]): String = {
     ...
  }
}

We can run our partial evaluator to generate a version of the interpreter that’s specialized for the code that the interpreter is supposed to execute:

  1. We have an interpreter for language M, written in language L.
  2. We have a partial evaluator for language L.
  3. We run the partial evaluator on the interpreter, with a program
    written in M that we want to interpret:
    PartialEvaluator.specialize(M_Interpreter, M_Code).
  4. The partial evaluator produces a program written in L.

So the partial evaluator took a program written in M, and transformed it into a program written in L!

We can go a step further – and generate an M to L compiler. How? By running the partial evaluator on itself. That is, run the partial evaluator like this: PartialEvaluator.specialize(PartialEvaluator, M_Interpreter). The result is a program which takes an M program as input, and generates an L program: that is, it’s an M to L compiler!

We can go yet another step, and turn the partial evaluator into a compiler generator: PartialEvaluator.specialize(PartialEvaluator, List(source(PartialEvaluator))). What we get is a program which takes an interpreter written in L, and generates a compiler from it.

It’s possible to generate genuinely useful tools this way – people have actually implemented Lisp compilers like this. For example, you can see someone do a simple version here.