Category Archives: Good Math

Everyone should program, or Programming is Hard? Both!

I saw something on twitter a couple of days ago, and I promised to write this blog post about it. As usual, I’m behind on all the stuff I want to do, so it took longer to write than I’d originally planned.

My professional specialty is understanding how people write programs. Programming languages, development environments, code management tools, code collaboration tools, etc. – that’s my bread and butter.

So, naturally, this ticked me off.

The article starts off by, essentially, arguing that most of the programming tutorials on the web stink. I don’t entirely agree with that, but to me, it’s not important enough to argue about. But here’s where things go off the rails:

But that’s only half the problem. Victor thinks that programming itself is broken. It’s often said that in order to code well, you have to be able to “think like a computer.” To Victor, this is absurdly backwards– and it’s the real reason why programming is seen as fundamentally “hard.” Computers are human tools: why can’t we control them on our terms, using techniques that come naturally to all of us?

And… boom! My head explodes.

For some reason, so many people have this bizarre idea that programming is this really easy thing that programmers just make difficult out of spite or elitism or cluelessness or something, I’m not sure what. And as long as I’ve been in the field, there’s been a constant drumbeat of people saying that it’s all easy, that programmers just want to make it difficult by forcing you to think like a machine. That what we really need to do is just humanize programming, and then it will all be easy and everyone will do it and the world will turn into a perfect computing utopia.

First, the whole “think like a machine” thing is a verbal shorthand that attempts to make programming as we do it sound awful. It’s not just hard to program, but those damned engineers are claiming that you need to dehumanize yourself to do it!

To be a programmer, you don’t need to think like a machine. But you need to understand how machines work. To program successfully, you do need to understand how machines work – because what you’re really doing is building a machine!

When you’re writing a program, on a theoretical level, what you’re doing is designing a machine that performs some mechanical task. That’s really what a program is: it’s a description of a machine. And what a programming language is, at heart, is a specialized notation for describing a particular kind of machine.

No one will go to an automotive engineer, and tell him that there’s something wrong with the way transmissions are designed, because they make you understand how gears work. But that’s pretty much exactly the argument that Victor is making.

How hard is it to program? That all depends on what you’re trying to do. Here’s the thing: the complexity of the machine that you need to build is what determines the complexity of the program. If you’re trying to build a really complex machine, then a program describing it is going to be really complex.

Period. There is no way around that. That is the fundamental nature of programming.

In the usual argument, one thing that I constantly see is something along the lines of “programming isn’t plumbing: everyone should be able to do it”. And my response to that is: of course. Just like everyone should be able to do their own plumbing.

That sounds like an amazingly stupid thing to say. Especially coming from me: the one time I tried to fix my broken kitchen sink, I did over a thousand dollars worth of damage.

But: plumbing isn’t just one thing. It’s lots of related but different things:

  • There are people who design plumbing systems for managing water distribution and waste disposal for an entire city. That’s one aspect of plumbing. And that’s an incredibly complicated thing to do, and I don’t care how smart you are: you’re not going to be able to do it well without learning a whole lot about how plumbing works.
  • Then there are people who design the plumbing for a single house. That’s plumbing, too. That’s still hard, and requires a lot of specialized knowledge, most of which is pretty different from what the city designer needs.
  • There are people who don’t design plumbing, but are able to build the full plumbing system for a house from scratch using plans drawn by a designer. Once again, that’s still plumbing. But it’s yet another set of knowledge and skills.
  • There are people who can come into a house when something isn’t working, and without ever seeing the design, figure out what’s wrong, and fix it. (There’s a guy in my basement right now, fixing a drainage problem that left my house without hot water, again! He needed to do a lot of work to learn how to do that, and there’s no way that I could do it myself.) That’s yet another set of skills and knowledge – and it’s still plumbing.
  • There are non-professional people who can fix leaky pipes, and replace damaged bits. With a bit of work, almost anyone can learn to do it. Still plumbing. But definitely: everyone really should be able to do at least some of this.

  • And there are people like me who can use a plumbing snake and a plunger when the toilet clogs. That’s still plumbing, but it requires no experience and no training, and absolutely everyone should be able to do it, without question.

All of those things involve plumbing, but they require vastly different amounts and kinds of training and experience.

Programming is exactly the same. There are different kinds of programming, which require different kinds of skills and knowledge. The tools and training methods that we use are vastly different for those different kinds of programming – so different that for many of them, people don’t even realize that they are programming. Almost everyone who uses computers does do some amount of programming:

  • When someone puts together a presentation in PowerPoint, with things that move around, appear, and disappear on command: that is programming.
  • When someone puts formulas into a spreadsheet: that is programming.
  • When someone builds a website – even a simple one – and uses either a set of tools, or CSS and HTML, to put the site together: that is programming.
  • When someone writes a macro in Word or Excel: that is programming.
  • When someone sets up an autoresponder to answer their email while they’re on vacation: that is programming.

People like Victor refuse to count those things as programming, and then gripe about how all programming is supercomplexmagicalsymbolic gobbledygook. Most people do write programs without knowing it, precisely because they’re doing it with tools that present the programming task as something so natural to them that they don’t even recognize that they’re programming.

But on the other hand, the idea that you should be able to program without understanding the machine you’re using or the machine that you’re building: that’s also pretty silly.

When you get beyond the surface, and start to get to doing more complex tasks, programming – like any skill – gets a lot harder. You can’t be a plumber without understanding how pipe connections work, what the properties of the different pipe materials are, and how things flow through them. You can’t be a programmer without understanding something about the machine. The more complicated the kind of programming task you want to do, the more you need to understand.

Someone who does Powerpoint presentations doesn’t need to know very much about the computer. Someone who wants to write spreadsheet macros needs to understand something about how the computer processes numbers, what happens to errors in calculations that use floating point, etc. Someone who wants to build an application like Word needs to know a whole lot about how a single computer works, including details like how the computer displays things to people. Someone who wants to build Google doesn’t need to know how computers render text clearly on the screen, but they do need to know how computers work, and also how networks and communications work.
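To make the floating point point concrete, here’s a tiny sketch (in Scala, just because that’s what I write day to day) of the kind of surprise a spreadsheet or macro author needs to understand. The example is mine, not Victor’s.

object FloatingPointSurprise {
  def main(args: Array[String]): Unit = {
    // Decimal fractions like 0.1 can't be represented exactly in binary
    // floating point, so "obvious" arithmetic facts quietly fail.
    val sum = 0.1 + 0.2
    println(sum)                           // 0.30000000000000004
    println(sum == 0.3)                    // false!
    // The usual workaround: compare against a tolerance, not for exact equality.
    println(math.abs(sum - 0.3) < 1e-9)    // true
  }
}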

To be clear, I don’t think that Victor is being dishonest. But the way that he presents things often does come off as dishonest, which makes it all the worse. To give one demonstration, he presents a comparison between how we teach programming and how we teach cooking. In it, he talks about how we’d teach people to make a soufflé. He shows a picture of raw ingredients on one side, and a fully baked soufflé on the other, and says, essentially: “This is how we teach people to program. We give them the raw ingredients, and say fool around with them until you get the soufflé.”

The thing is: that really is how we teach people to cook – he’s just taken it far out of context. If we want someone to be able to prepare exactly one recipe, we give them complete, detailed, step-by-step instructions. But once they know the basics, we don’t do that anymore. We encourage them to start fooling around. “Yeah, that soufflé is great. But what would happen if I steeped some cardamom in the cream? What if I left out the vanilla? Would it turn out as good? Would that be better?” In fact, if you never do that experimentation, you’ll probably never learn to make a truly great soufflé! Because the ingredients are never exactly the same, and the way that it turns out is going to depend on the vagaries of your oven, the weather, the particular batch of eggs that you’re using, the amount of gluten in the flour, and so on.

Writing complicated programs is complicated. To write programs that manipulate symbolic data, you need to understand how the data symbolizes things. To write a program that manipulates numbers, you need to understand how the numbers work, and how the computer represents them. To build a machine, you need to understand the machine that you’re building. It’s that simple.

DNS and Yesterday's Outage

As I’m sure at least some of you noticed, Scientopia – along with a huge number of other sites – was unreachable for most of yesterday. I say “unreachable” instead of “down” because the site was up. But if you wanted to get to it, and you didn’t know it’s numeric address, you couldn’t reach it. That’s because your computer uses a service called DNS to find things.

So what’s this DNS thing? And what brought down so much of the net yesterday?

DNS stands for “domain name system”. What it does is provide a bridge between the way that you think of the name of a website, and the way that your computer actually talks to the website. You think of a website by a URL, like http://scientopia.org/blogs/goodmath. That URL basically says “the thing named ‘blogs/goodmath’ on a server named ‘scientopia.org’, which you can access using a system called ‘http’”.

The catch is that your computer can’t just send a message to something named “scientopia.org”. Scientopia is a server in a datacenter somewhere, and to send a message to it, it needs to know a numeric address. Numeric addresses are usually written as a set of four numbers, each ranging from 0 to 255. For example, scientopia.org is currently 184.106.221.182.

You don’t want to have to remember those four meaningless numbers in order to connect to scientopia. But there’s more to it than just remembering: those numbers can change. The physical computer that scientopia works on has been moved around several times. It lives in a datacenter somewhere, and as that datacenter has grown, and hardware has needed to be either expanded or replaced, the numeric address has changed. Even though I’m the administrator of the site, there’ve been a couple of times where it’s changed, and I can’t pinpoint exactly when!

The reason that all of that works is DNS. When you tell your computer to connect to another computer by name, it uses DNS to ask “What’s the numeric address for this name?”, and once it gets that numeric address, it uses that to connect to the other computer.
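Here’s a minimal sketch of that name-to-number step in Scala, using Java’s standard InetAddress API (which asks the system resolver, and thus DNS, on your behalf). The address it prints for scientopia.org today may well be different from the one quoted above.

import java.net.InetAddress

object NameToNumber {
  def main(args: Array[String]): Unit = {
    // Ask the system's resolver (which uses DNS under the covers)
    // for the numeric address behind a hostname.
    val addr = InetAddress.getByName("scientopia.org")
    println(addr.getHostAddress)  // e.g. "184.106.221.182" at the time this was written
  }
}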

When I first got access to the internet in college, it was shortly before DNS first got deployed. At the time, there was one canonical computer somewhere in California. Every other computer on the (then) arpanet would contact that computer every night, and download a file named “hosts.txt”. hosts.txt contained a list of every computer on the internet, and its numeric address. That was clearly not working any more, and DNS was designed as the replacement. When they decided to replace hosts.txt, they wanted to replace it with something much more resilient, and something that didn’t need to have all of the data in one place, and which didn’t rely on any single machine in order to work.

The result was DNS. It’s a fascinating system. It’s the first real distributed system that I ever learned about, and I still think it’s really interesting. I’m not going to go into all of the details: distributed systems always wind up with tons of kinks, to deal with the complexity of decentralized information in the real world. But I can explain the basics pretty easily.

The basic way that DNS works is really simple. You take a domain name. For example, my personal machine is named “sunwukong”, and my family network is “phouka.net” – so the hostname that I’m typing this on right now is “sunwukong.phouka.net”. To look it up, what happens is you take the host name, and split it at the dots. So for the example, that’s [“sunwukong”, “phouka”, “net”].

You start with the last element of the list, which is called the top-level domain (TLD), and you contact a special server called the root nameserver and ask “who do I ask for addresses ending with this TLD?” (“Who do I ask for ‘.net’?”) It sends a response, giving you the address of another nameserver. Then you contact that one, and ask it “who do I ask for the next element?” (“Who do I ask for phouka?”) Finally, when you get to the last one, the address that it gives you is the address of the machine you’re looking for, instead of the address of another DNS server. At each level of the hierarchy, there’s a server or set of servers that’s registered with the level above to be the canonical server for some set of names. The “.com” servers are registered with the root level servers. The next level of domains is registered with the TLD servers. And so on.

So, to look up sunwukong.phouka.net, what happens is:

  1. You send a request to the root nameserver, and ask for the server for “.net”.
  2. The root server responds, giving you the address of the “.net” server.
  3. You send a request to the .net server address that you just got, asking “Who can tell me about phouka?”
  4. The .net server responds, and gives you an address for a DNS server that can tell you where things are in phouka.net.
  5. You ask the “phouka.net” server for the address of sunwukong.
  6. It finally gives you the address of the machine you’re looking for.

This distributes the information nicely. You can have lots of root name servers – all they need to be able to do is find DNS servers for the list of top-level domains. And there can be – and are – thousands and thousands of nameservers for addresses in those top-level domains. And then each administrator of a private network can have a top-level nameserver for the computers that they directly manage. In this mess, any nameserver at any level can disappear, and all that will be affected are the names that it manages.

The problem is, there’s no one place to get information; and more importantly, every time you need to talk to another computer, you need to go through this elaborate sequence of steps.
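Here’s a rough sketch of that elaborate sequence of steps in Scala. The queryNameServer function is a hypothetical stand-in for “send a DNS query to this server, and get back either a referral to another nameserver or a final address”; a real resolver would use an actual DNS protocol library for that part.

object IterativeResolver {
  // Hypothetical helper: Left(nextServer) means "go ask that nameserver instead";
  // Right(address) means "here's the numeric address you wanted".
  def queryNameServer(server: String, name: String): Either[String, String] =
    sys.error("stub: a real resolver sends a DNS query to the server here")

  // Follow referrals from the root downward until we get an address.
  def resolve(name: String, rootServer: String): String = {
    def loop(server: String): String =
      queryNameServer(server, name) match {
        case Right(address)   => address          // found the numeric address
        case Left(nextServer) => loop(nextServer) // follow the referral down a level
      }
    loop(rootServer)
  }
}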

To get around that last issue, you’ve got something called caches, on multiple levels. A cache is a copy of the canonical data, which gets kept around for some period of time. In the DNS system, you don’t need to talk to the canonical DNS server for a domain if someone nearby has the data you need in a cache. You can (and generally do) talk to caching DNS servers. With a caching DNS server, you tell it the whole domain name that you want to look up, and it does the rest of the lookup process, and gives you the address you’re looking for. Every time it looks something up, it remembers what it did, so that it doesn’t need to ask again. So when you look up a “.com” address, your caching server will remember who it can ask for “.com” addresses. So most of the time, you only talk to one server, which does the hard work, and does it in a smart way so that it doesn’t unnecessarily repeat requests.
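And here’s roughly what the caching layer adds, as a sketch: remember every answer you’ve worked out, and only do the full walk when you haven’t seen the name before. (A real caching server also honors a time-to-live on each answer, so cached entries eventually expire; this toy version just remembers forever, which is exactly how you end up with stale addresses.)

import scala.collection.mutable

object CachingResolver {
  private val cache = mutable.Map.empty[String, String]

  // Stand-in for the full root -> TLD -> domain walk sketched above.
  def lookupFromScratch(name: String): String =
    sys.error("stub: do the full iterative lookup")

  // Answer from the cache if we can; otherwise do the real work and remember it.
  def resolve(name: String): String =
    cache.getOrElseUpdate(name, lookupFromScratch(name))
}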

So now, we can get to what happened yesterday. GoDaddy, one of the major American DNS registrars, went down. Exactly why it went down is not clear – a group of jerks claimed to have done a denial-of-service attack against them, but the company claims to have just had a configuration glitch. Honestly, I’m inclined to believe that it was the hackers, but I don’t know for sure.

But – the canonical server for scientopia.org was gone. So if you needed to look up scientopia.org and it wasn’t in your cache, then your caching nameserver would go to the .org server, and ask for scientopia; the .org nameserver would return the address of GoDaddy’s nameserver, and that wouldn’t respond. So you’d never reach us.

That’s also why some people were able to reach the site, and others weren’t. If your caching server had cached the address for scientopia, then you’d get the server address; and since the server was up the whole time, you’d connect right up, and everything would work.

Nulls

So, my post on monads apparently set off a bit of a firestorm over my comments about avoiding null pointer exceptions. To show you what I mean, here’s a link to one of the more polite and reasonable posts.

Ok. So. Let me expand a bit.

The reason that I think it’s so great that I don’t get NPEs when I use something like Option isn’t because it makes me a super-programmer who’s better than the lowly slime who deal with NPEs. It’s because it changes how I write code, in a way that helps me avoid making mistakes – mistakes that I make because I’m just a fallible idiot of a human.

There are two fundamental things about an option type, like what we have in Scala, that make a huge difference.

First, it narrows the field of errors. When I’m programming in Java, any call that returns a pointer could return a null. The language makes no distinction between a function/expression that could return a null, and one that can’t. That means that when I get an NPE, the source of that null pointer could be anything in the dynamic slice leading up to the error. With an option type, I’ve got two kinds of functions: functions that always return a real value, and functions that sometimes return a value and sometimes return a None. That’s incredibly valuable.

Second, it forces me to explicitly deal with the None case. In Java, programmers constantly build code without null checks, because they know that a function won’t return null. And then it does, and ker-splat. With an option type, I have no choice: I have to explicitly deal with the potential error case. Sure, I can forcibly code around it – in Scala, I can use Option.get, which will turn into an error analogous to an NPE. But it forces me to make that choice, and make it explicitly.

Even if I’m going the stupid, brute-force route and assuming that I know, without fail, that a function is going to return a non-null value… Consider an example:

   Java:
  T g = f.doSomething();
  g.doSomethingElse();

   Scala:
  val g: Option[T] = f.doSomething()
  g.get.doSomethingElse()

The Scala case has to explicitly deal with the fact that it’s working with a potentially empty value, by using an expression (.get) that asserts non-emptiness.

But in reality, if you’re a decent programmer, you never use .get to directly access an option. (The only exception is in cases where you call .get in a context dominated by a non-empty test; but even then, it’s best not to, to avoid errors when the surrounding code is modified.) In real code, you pretty much always explicitly de-option a value using a function like getOrElse:

val f: User = getUser("markcc").getOrElse(new User("markcc"))

As I hope has become plain, the point of avoiding NPEs through option-like type structures isn’t that it somehow makes the entire category of unexpected-value errors disappear. It’s that it changes the way that you code, so as to distinguish where those errors can occur, and to force you to deal with them.
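To make that concrete, here’s roughly what explicitly dealing with the None case looks like in practice. The User and getUser names echo the example above, sendWelcomeEmail is a made-up stand-in, and the stubs at the top are only there so the sketch stands alone.

// Stubs so this sketch stands alone; in the post, getUser is the real lookup.
case class User(name: String)
def getUser(name: String): Option[User] = None
def sendWelcomeEmail(u: User): Unit = ()

val maybeUser: Option[User] = getUser("markcc")

// 1. Supply a default for the None case:
val u: User = maybeUser.getOrElse(new User("markcc"))

// 2. Pattern match, handling both cases out in the open:
val greeting: String = maybeUser match {
  case Some(user) => "hello, " + user.name
  case None       => "hello, whoever you are"
}

// 3. Only act when there's actually a value present:
maybeUser.foreach(user => sendWelcomeEmail(user))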

I think that ultimately, things like this are really just manifestations of the good old static vs. dynamic typing wars. Type errors in a dynamically typed language are really just unexpected-value errors. Strong typing doesn’t stop you from making those errors. It just takes a bunch of common cases of those errors, and converts them from run-time errors to compile-time errors. Whether you want them to be run-time or compile-time depends on the project you’re working on, on your development team, and on your personal preferences.

I find in practice that I get many fewer errors by being forced to explicitly declare when a value might be null/None, and by being required to explicitly deal with the null/None case when it might occur. I’ve spent much less time debugging that kind of error in my year at foursquare than in the 15 years of professional development that I did before. That’s not because I magically became a better programmer a year ago when I joined foursquare. It’s because I’m using a better tool that helps me avoid mistakes.

Monads and Programming

Sorry things have been so slow around here. I know I keep promising that I’m going to post more frequently, but it’s hard. Life as an engineer at a startup is exhausting. There’s so much work to do! And the work is so fun – it’s easy to let it eat up all of your time.

Anyway… last good-math post ’round these parts was about monoids and programming. Today, I’m going to talk about monads and programming!

If you recall, monoids are an algebraic/categorical construct that, when implemented in a programming language, captures the abstract notion of foldability. For example, if you’ve got a list of numbers, you can fold that list down to a single number using addition. Folding collections of values is something that comes up in a lot of programming problems – capturing that concept with a programming construct allows you to write code that exploits the general concept of foldability in many different contexts.

Monads are a construct that have become incredibly important in functional programming, and they’re very, very poorly understood by most people. That’s a shame, because the real concept is actually simple: a monad is all about the concept of sequencing. A monad is, basically, a container that you can wrap something in. Once it’s wrapped, you can form a sequence of transformations on it. The result of each step is the input to the next. That’s really what it’s all about. And when you express it that way, you can begin to see why it’s such an important concept.

I think that people are confused by monads for two reasons:

  1. Monads are almost always described in very, very abstract terms. I’ll also get into the abstract details, but I’ll start by elaborating on the simple description I gave above.
  2. Monads in Haskell, which is where most people are introduced to them, are very confusing. The basic monad operations are swamped with tons of funny symbols, and they’re named in incredibly misleading ways. (“return” does almost the exact opposite of what you expect a return to do!)

In programming terms, what’s a monad?

Basically, a monadic computation consists of three pieces:

  1. A monadic type, M, which is a parametric type wrapper that can wrap a value of any type.
  2. An operation which can wrap a value in M.
  3. An operation which takes a function that transforms a value wrapped in M into another value (possibly with a different type) wrapped in M.

Whenever you describe something very abstractly like this, it can seem rather esoteric. But this is just a slightly more formal way of saying what I said up above: it’s a wrapper for a series of transformations on the wrapped value.
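In Scala-ish terms, here’s a minimal sketch of those three pieces. This isn’t how the standard library actually spells things – it’s just the shape of the idea: a parametric wrapper, a way to put a value into the wrapper, and a way to chain transformations on wrapped values.

// The three pieces, as a Scala trait. M[_] is the monadic type wrapper.
trait Monad[M[_]] {
  def wrap[A](a: A): M[A]                            // piece 2: wrap a value in M
  def flatMap[A, B](ma: M[A])(f: A => M[B]): M[B]    // piece 3: transform a wrapped value
}

// An instance for Option: wrap is Some, and flatMap unwraps, applies, and rewraps.
object OptionMonad extends Monad[Option] {
  def wrap[A](a: A): Option[A] = Some(a)
  def flatMap[A, B](ma: Option[A])(f: A => Option[B]): Option[B] =
    ma match {
      case Some(a) => f(a)
      case None    => None
    }
}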

Let me give you an example. At foursquare, we do all of our server programming in Scala. In a year at foursquare, I’ve seen exactly one null pointer exception. That’s amazing – NPEs are ridiculously common in Java programming. But in Scala at foursquare, we don’t allow nulls to be used at all. If you have a value which could either be an instance of A, or no value, we use an option type. An Option[T] can be either Some(t: T) or None.

So far, this is nice, but not really that different from a null. The main difference is that it allows you to say, in the type system, whether or not a given value might be null. But: Option is a monad. In Scala, that means that I can use map on it. (map is one of the transformation functions!)

	val t: Option[Int] = ...
	val u: Option[String] = t.map( _ + 2 ).map(_.toString)

What this does is: if t is Some(x), then it adds two to it, and returns Some(x+2); then it takes the result of the first map, and converts it to a string, returning an Option[String]. If t is None, then running map on it always returns None. So I can write code which takes care of the null case, without having to write out any conditional tests of nullness – because optionality is a monad.

In a good implementation of a monad, I can do a bit more than that. If I’ve got a Monad[T], I can use a map-like operation with a function that takes a T and returns a Monad[U].

For an example, we can look at lists – because List is a monad:

val l: List[Int] = List(1, 2, 3)
l.flatMap({ e => List( (e, e), (e+1, e+2) )  })
res0: List[(Int, Int)] = List((1,1), (2,3), (2,2), (3,4), (3,3), (4,5))

The monad map operation does a flatten on the map steps. That means a lot of things. You can see one in the rather silly example above.

You can take values, and wrap them as a list. Then you can perform a series of operations on the elements of that list – sequencing over the elements of the list. Each operation, in turn, returns a list; the result of the monadic computation is a single list, concatenating, in order, the lists returned for each element. In Scala, the flatMap operation captures the monadic concept: basically, if you can flatMap something, it’s a monad.

Let’s look at it a bit more specifically.

  1. The monadic type: List[T].
  2. A function to wrap a value into the monad: the constructor function from List def apply[T](value: T): List[T]
  3. The map operation: def flatMap[T, U](op: T => List[U]): List[U].

(In the original version of this post, I put the wrong type on flatMap in the list above. In the explanation demonstrating flatMap, the type is correct. Thanks to John Armstrong for catching that!)

You can build monads around just about any kind of type wrapper where it makes sense to map over the values that it wraps: collections, like lists, maps, and options. Various kinds of state – variable environments (where the wrapped values are, essentially, functions from identifiers to values), or IO state. And plenty of other things. Anything where you perform a sequence of operations over a wrapped value, it’s a monad.
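In Scala, the sequencing flavor of all this shows up most clearly in for-comprehensions, which are nothing but syntactic sugar for chains of flatMap and map. The lookup functions here are made-up stand-ins; the point is only the shape of the chaining.

// Two stand-in functions that might fail, so they return Options.
def lookupUser(id: String): Option[String] = Some("markcc")
def lookupEmail(user: String): Option[String] = Some(user + "@example.com")

// The for-comprehension version: reads like a sequence of steps...
val email1: Option[String] =
  for {
    user  <- lookupUser("markcc")
    email <- lookupEmail(user)
  } yield email

// ...and desugars into exactly this chain of monadic operations.
// If either lookup returns None, the whole result is None.
val email2: Option[String] =
  lookupUser("markcc").flatMap(user => lookupEmail(user).map(email => email))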

Now that we have some understanding of what this thing we’re talking about is, what is it in mathematical terms? For that, we turn to category theory.

Fundamentally, in category theory a monad is a category with a particular kind of structure. It’s a category with one object. That category has a collection of arrows which (obviously) are from the single object to itself. That one-object category has a functor from the category to itself. (As a reminder, a functor is an arrow between categories in the category of (small) categories.)

The first trick to the monad, in terms of the theory, is that it’s fundamentally about the functor: since the functor maps from a category to the same category, you can almost ignore the category; it’s implicit in the definition of the functor. So we can almost treat the monad as if it were just the functor – that is, a kind of transition function.

The other big trick is closely related to that. For the programming language application of monads, we can think of the single object in the category as the set of all possible states. So we have a category object, which is essentially the collection of all possible states; and there are arrows between the states representing possible state transitions. So the monad’s functor is really just a mapping from arrows to different arrows – which basically represents the way that changing the state causes a change in the possible transitions to other states.

So what a monad gives us, in terms of category theory, is a conceptual framework that captures the concept of a state transition system, in terms of transition functions that invisibly carry a state. When that’s translated into programming languages, that becomes a value that implicitly takes an input state, possibly updates it, and returns an output state. Sound familiar?

Let’s take a moment and get formal. As usual for category theory, first there are some preliminary definitions.

  1. Given a category C, 1_C is the identity functor from C to C.
  2. Given a category C with a functor T : C → C, T² = T ∘ T.
  3. Given a functor T, 1_T : T → T is the identity natural transformation from T to T.

Now, with that out of the way, we can give the complete formal definition of a monad. Given a category C, a monad on C is a triple (T : C → C, η : 1_C → T, μ : T² → T), where T is a functor, and η and μ are natural transformations. The members of the triple must make the following two diagrams commute.

[Diagram: commutativity of composition with μ]


[Diagram: commutativity of composition with η]
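In equation form, what those two diagrams assert is:

μ ∘ Tμ = μ ∘ μT        (associativity: the square involving μ)
μ ∘ ηT = μ ∘ Tη = 1_T   (the unit laws: the triangles involving η)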

What these two diagrams mean is that successive applications of the state-transition functor over C behave associatively, and that the unit η behaves neutrally: any sequence of composed monadic functors results in a functor that still has the full monadic structure, and that structure is always preserved. Together, these mean that any sequence of operations (that is, applications of the monad functor) is itself a monadic functor – so building a sequence of monadic state transformers is guaranteed to behave as a proper monadic state transition. What happens inside of the monadic functor stays inside: the state will be consistently passed from application to application with the correct chaining behavior, and to the outside world, the entire monadic chain is indistinguishable from a single atomic monadic operation.

Now, what does this mean in terms of programming? Each element of a monadic sequence in Haskell is an instantiation of the monadic functor – that is, it’s an arrow between states – a function, not a simple value – which is the basic trick to monads. They look like a sequence of statements; in fact, each statement in a monad is actually a function from state to state. It looks like we’re writing sequential code – when what we’re actually doing is writing function compositions – so that when we’re done writing a monadic sequence, what we’ve actually done is written a function definition in terms of a sequence of function compositions.

Understanding that, we can now clearly understand why we need the return function to use a non-monad expression inside of a monadic sequence – because each step in the sequence needs to be an instance of the monadic functor; an expression that isn’t an instance of the monadic functor couldn’t be composed with the functions in the sequence. The return function is really nothing but a function that combines a non-monadic expression with the id functor.
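Here’s a tiny Scala sketch of that idea – a stripped-down State monad, where each step really is a function from an input state to a (result, new state) pair, and flatMap is just composition that threads the state through. This is my own illustrative version, not the one from any particular library.

// Each State value wraps a function from an input state S to a result A plus
// an output state. flatMap composes two such functions, threading the state
// from one step into the next.
case class State[S, A](run: S => (A, S)) {
  def flatMap[B](f: A => State[S, B]): State[S, B] =
    State { s =>
      val (a, s1) = run(s)
      f(a).run(s1)
    }
  def map[B](f: A => B): State[S, B] =
    flatMap(a => State(s => (f(a), s)))
}

// A step that returns the current count and increments it.
def tick: State[Int, Int] = State(n => (n, n + 1))

// Looks like a sequence of statements; it's really one composed function.
val threeTicks: State[Int, Int] =
  for { _ <- tick; _ <- tick; last <- tick } yield last

// threeTicks.run(0) == (2, 3): the result is 2, the final state is 3.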

In light of this, let’s go back and look at the definition of Monad in the Haskell standard prelude.

class  Functor f  where
  fmap              :: (a -> b) -> f a -> f b

class  Monad m  where
  (>>=)  :: m a -> (a -> m b) -> m b
  (>>)   :: m a -> m b -> m b
  return :: a -> m a
  fail   :: String -> m a

  -- Minimal complete definition:
  --      (>>=), return
  m >> k  =  m >>= \_ -> k
  fail s  = error s

The declaration of Monad is connected with the definition of Functor – if you look, you can see the connection. The fundamental operation of Monad is “>>=” – the chaining operation, whose type, m a -> (a -> m b) -> m b, is basically the Haskell version of the map (flatMap) operation described above. It’s deeply connected with Functor’s fmap operation – the a in m a is generally going to be a type which can be a Functor. (Remember what I said about Haskell and monads? I really prefer map and flatMap to >> and >>=.)

So the value type wrapped in the monad is a functor – in fact, the functor from the category definition! And the “>>=” operation is just the functor composition operation from the monad definition.

A proper implementation of a monad needs to follow some fundamental rules – the rules are, basically, just Haskell translations of the structure-preserving rules about functors and natural transformations in the category-theoretic monad. There are two groups of laws – laws about the Functor class, which should hold for the transition function wrapped in the monad class; and laws about the monadic operations in the Monad class. One important thing to realize about the functor and monad laws is that they are not enforced – in fact, cannot be enforced! – but monad-based code using monad implementations that do not follow them may not work correctly. (A compile-time method for correctly verifying the enforcement of these rules can be shown to be equivalent to the halting problem.)

There are two simple laws for Functor, and it’s pretty obvious why they’re fundamentally just structure-preservation requirements. The Functor class only has one operation, called fmap, and the two functor laws are about how it must behave.

  1. fmap id = id
    (Mapping ID over any structured sequence results in an unmodified sequence)
  2. fmap (f . g) = (fmap f) . (fmap g)
    (“.” is the function composition operation; this just says that fmap preserves the structure to ensure that that mapping is associative with composition.)

The monad laws are a bit harder, but not much. They mainly govern how monadic operations interact with non-monadic operations, in terms of the “return” and “>>=” operations of the Monad class.

  1. return x >>= f = f x
    (injecting a value into the monad is basically the same as passing it as a parameter down the chain – return is really just the identity functor passing its result on to the next step. I hate the use of “return”. In a state functor, in exactly the right context, it does sort-of look like a return statement in an imperative language. But in pretty much all real code, return is the function that wraps a value into the monad.)
  2. f >>= return = f
    (If you don’t specify a value for a return, it’s the same as just returning the result of the previous step in the sequence – again, return is just identity, so passing something into return shouldn’t affect it.)
  3. seq >>= return . f = fmap f seq
    (composing return with a function is equivalent to invoking that function on the result of the monad sequence to that point, and wrapping the result in the monad – in other words, it’s just composition with identity.)
  4. seq >>= (\x -> f x >>= g) = (seq >>= f) >>= g
    (Your implementation of “>>=” needs to be semantically equivalent to the usual translation; that is, it must behave like a functor composition.)
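Translated into Scala’s Option – with Some(x) playing the role of return x, and flatMap playing the role of >>= – the first, second, and fourth laws look like this. These are just spot checks on particular values to show the shape of the laws, not proofs.

val f: Int => Option[Int] = x => Some(x * 2)
val g: Int => Option[Int] = x => if (x > 0) Some(x + 1) else None

// Law 1 (left identity):  return x >>= f  ==  f x
assert(Some(3).flatMap(f) == f(3))

// Law 2 (right identity): m >>= return  ==  m
assert(Some(3).flatMap(x => Some(x)) == Some(3))

// Law 4 (associativity):  (m >>= f) >>= g  ==  m >>= (\x -> f x >>= g)
assert(Some(3).flatMap(f).flatMap(g) == Some(3).flatMap(x => f(x).flatMap(g)))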

The Meaning of the Higgs

This isn’t exactly my area of expertise, but I’ve gotten requests by both email and twitter to try to explain yesterday’s news about the Higgs’ boson.

The questions.

  • What is this Higgs’ boson thing?
  • How did they find it?

  • What does the five sigma stuff mean?
  • Why do they talk about it as a “Higgs’-like particle”?

So, first things first. What is a Higgs’ boson?

When things in the universe interact, they usually don’t actually interact by touching each other directly. They interact through forces and fields. What that means is a bit tricky. I can define it mathematically, but it won’t do a bit of good for intuition. But the basic idea is that space itself has some properties. A point in space, even when it’s completely empty, has some properties.

Outside of empty space, we have particles of various types. Those particles interact with each other, and with space itself. Those interactions are what end up producing the universe we see and live in.

Fields are, essentially, properties of space. A field is, at its simplest, a kind of property of space that has a value at every point in space.

When particles interact with fields, they can end up exchanging energy. They do that through a particular kind of particle, called an exchange particle. For example, think about an electromagnetic field. An electron orbits an atomic nucleus, due to forces created by the electromagnetic fields of the electrons and protons. When an electron moves to a lower-energy orbital, it produces a photon; when it absorbs a photon, it can jump to a higher orbital. The photon is the exchange particle for the electromagnetic field. Exchange particles are instances of a kind of particle called a boson.

So.. one of the really big mysteries of physics is: why do some particles have mass, and other particles don’t? That is, some particles, like protons, have masses. Others, like photons, don’t. Why is that?

It’s quite a big mystery. Based on our best model – called the standard model – we can predict all of the basic kinds of particles, and what their masses should be. But we didn’t have a clue about why there’s mass at all!

So, following the usual pattern in particle physics, we predict that there’s a field. Particles moving through that field, if they interact with the field, experience a sort of drag. That drag is mass. So – just like particles like neutrinos aren’t affected by electromagnetic fields, some particles like photons won’t have mass because they don’t interact with the field that produces mass. We call that field the Higgs’ field.

(The previous paragraph formerly contained an error. The higgs field produces mass, not gravity. Just a stupid typo; my fingers got ahead of my brain.)

So physicists proposed the existence of the Higgs’ field. But how could they test it?

It’s a field. Fields have exchange particles. What would the exchange particles of the Higgs’ field be? Exchange particles are bosons, so this one is, naturally, called a Higgs’ boson. So if the Higgs’ field exists, then it will have an exchange particle. If the standard model of physics is right, then we can use it to predict the mass that that boson must have.

So – if we can find a particle whose mass matches what we predict, and it has the correct properties for a mass-field exchange particle, then we can infer that the Higgs’ field is real, and is the cause of mass.

How did they find the Higgs’ boson?

We have a pretty good idea of what the mass of the Higgs’ boson must be. We can describe that mass in terms of a quantity of energy. (See the infamous Einstein equation!) So we take particles that we can easily see and manipulate, accelerate them up to super-super high speeds, and collide them together. If the energy of a collision matches the mass of a particle, it can create that kind of particle. So if we slam together, say, two protons at high enough energy, we can get a Higgs’ boson.

But things are never quite that easy. There are a bunch of problems. First, the kind of collision that can produce a Higgs’ doesn’t always produce one. It can produce a variety of results, depending on the specifics of the collision as well as purely random factors. Second, it produces a lot more than just a Higgs’. We’re talking about an extremely complex, extremely high energy collision, with a ton of complex results. And third, the Higgs’ boson isn’t particularly stable. It doesn’t really like to exist. So like many unstable things in particle physics, it decays, producing other particles. And many of those particles are themselves unstable, and decay into other particles. What we can observe is the last products of the collision, several steps back from the Higgs’. But we know what kind of things the Higgs’ can decay into, and what they can decay into, etc.

So, we slam these things together a couple of thousand, or a couple of million times. And we look at the results. We look at all of the results of all of the collisions. And we specifically look for a bump: if there’s really a specific collision energy level at which Higgs’ bosons are produced, then we’ll see a bump in the number of Higgs’ decay products that are produced by collisions at that energy. And what the announcement yesterday showed is that that’s exactly what they saw: a bump in the observations inside the expected range of values of the mass of a Higgs’ boson.

The bump

What does five sigmas mean?

Whenever we’re making observations of a complex phenomenon, there are all sorts of things that can confound our observations. There are measurement errors, calculation errors, and random noise, among many other things. So we can’t just look at one, or two, or ten data points. We need to look at a lot of data. And when you’ve got a lot of data, there’s always a chance that you’ll see what appears to be a pattern in the data, but which is really just the product of random noise.

For example, there are some people who’ve won the lottery multiple times. That seems crazy – it’s so unlikely to win once! To win multiple times seems crazy. But probabilistically, if you keep observing lotteries, you’ll find repeat winners. Or you’ll find apparent patterns in the winning numbers, even though they’re being drawn randomly.

We don’t want to be fooled by statistics. So we create standards. We can compute how unlikely a given pattern would be if it were occurring due to pure randomness. We can’t ever absolutely rule out randomness, but for any degree of certainty, we can determine just how unlikely a given observation is to be due to randomness.

We describe that in terms of standard deviations. An observation of a phenomenon has a roughly 68% chance of being measured within one standard deviation (one sigma) of the actual value, or a roughly 32% chance of being observed outside of one sigma. At two sigmas, there’s only a roughly 5% chance of being outside. At three sigmas out, you’re down to a roughly 0.3% chance of randomly observing an event outside. The odds continue that way.
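To put rough numbers on how fast those odds fall off: for a normally distributed measurement, the chance of landing more than k sigmas away from the true value (in either direction) is approximately:

k = 1: 31.7%    k = 2: 4.6%    k = 3: 0.27%    k = 4: 0.006%    k = 5: 0.00006%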

So, the Higgs’ hunters computed the probability of observing the data that they found under the assumption that there was no Higgs’. The signal they found was more than 5 sigmas away from what you would expect by random chance if there was no Higgs’. That translates as damned unlikely. The choice of 5 sigmas as the threshold is arbitrary, but it’s accepted as the standard for certainty in particle physics. At five sigmas, we realistically rule out random chance.

Why do they keep saying Higgs’-like particle?

Remember up above, I said: “So – if we can find a particle whose mass matches what we predict, and it has the correct properties for a mass-field exchange particle, then we can infer that the Higgs’ field is real, and is the cause of mass”? There are two things we need to show to conclude that we’ve found the mediator of the Higgs’ field. There needs to be a particle with the right mass, and it needs to have the properties of a mass-mediator. What we’ve got right now is an observation that yes, there is a particle at the mass we’d expect for a Higgs’. But we don’t have observations yet of any properties of the particle other than its mass. Assuming the standard model is right, it’s damned unlikely that we’d find another particle with that mass; but the standard model could be wrong. It’s not likely at this point, but people like to be careful. So at this point, to be precise, we’ve observed a Higgs’-like particle – a particle that, according to all of the observations we’ve made so far, appears to be a Higgs’; but until we observe some properties other than mass, we can’t be absolutely certain that it’s a Higgs’.

Turing Machines – what they are, what they aren't.

It’s the anniversary of the birth of Alan Turing. So there’s a ton of people writing commemorative stories. And naturally, a ton of those stories get it wrong. And they get it wrong in a very sad way.

Of course, when you talk about Turing, you talk about Turing machines. Despite the fact that Turing did lots of stuff besides just that machine, it’s always the machine that people focus on. Partly that’s laziness – the machine has his name after all, so it’s one of the first things you find if you Google Turing. It’s also the easiest thing to write about.

What do they say about the Turing machine? It’s “the simplest computing device”. It’s the “basis for modern computers”. It’s “the theoretical model for the microchips in your laptop”. It’s the “mathematical description of your computer”. None of those things are true. And they all both over- and under-state what Turing really did. In terms of modern computers, the Turing machine’s contribution to the design of real computers is negligible if not non-existent. But his contribution to our understanding of what computers can do, and to our understanding of how mathematics really works – those are far, far more significant than the architecture of any single machine.

The Turing machine isn’t a model of real computers. The computer that you’re using to read this has absolutely nothing to do with the Turing machine. As a real device, the turing machine is absolutely terrible.

The Turing machine is a mathematical model not of computers, but of computation. That’s a really important distinction. The Turing machine is an easy-to-understand model of a computing device. It’s definitely not the simplest model. There are simpler computing devices (for example, I think that the rule 110 machine is simpler) – but their simplicity makes them harder to understand.

The reason that the Turing machine is so important comes down to two important facts. First, which machine you use to talk about computation doesn’t matter. There’s a limit to what a mechanical device can do. There are lots of machines out there – but ultimately, no machine can go past the limit. Any machine that can reach that limit is, for the purposes of the theory of computation, pretty much the same. When we talk about studying computation, what we’re talking about is the set of things that can be done by a machine – not by a specific machine, but by any machine. The specific choice of machine isn’t important. And that’s the point: computation is computation. That’s what Turing figured out.

The Turing machine is a brilliant creation. It’s a simple machine. It’s really easy to understand. And it’s easy to tweak – that is, it’s easy to do experiments where you can modify the machine, and see what effect it has.

So let’s take a step back, and see: what is a Turing machine?

The Turing machine is a very simple kind of theoretical computing device. In fact, it’s almost downright trivial. But according to everything that we know and understand about computation, this trivial device is capable of any computation that can be performed by any other computing device.

The basic idea of the Turing machine is very simple. It’s a machine that runs on top of a tape, which is made up of a long series of little cells, each of which has a single character written on it. The machine is a read/write head that moves over the tape, and which can store a little bit of information. Each step, the machine looks at the symbol on the cell under the tape head, and based on what it sees there, and whatever little bit of information it has stored, it decides what to do. The things that it can do are change the information it has stored, write a new symbol onto the current tape cell, and move one cell left or right.

That’s really it. People who like to make computing sound impressive often have very complicated explanations of it – but really, that’s all there is to it. The point of it was to be simple – and simple it certainly is. And yet, it can do anything that’s computable.

To really understand how that trivial machine can do computations, it helps to be a bit formal. In formal terms, we talk about a turing machine as a tuple: (S, s0, A, T), where:

  • S is a finite list of possible states that the machine can be in. The state is the information that the machine can store in the head to make decisions. It’s a very limited amount of information; you can think of it as either a specific list of states, or a small set of small numbers. For most Turing machines, we’ll use it as a specific list of states. (You’ll see what I mean in a minute.)
  • s0 ∈ S is the initial state – the state that the machine will be in when it starts a computation.
  • A is the machine’s alphabet, which is the set of symbols which can appear on the machine’s tape.
  • T is the machine’s transition function. This is the real heart of the machine, where the computation is defined. It’s a function from the machine state and the alphabet character on the current tape cell to the action that the machine should take. The action is a triple consisting of a new value for the machine’s state, a character to write onto the current tape cell, and a direction to move the tape head – either left or right.

So, for example, let’s look at a simple machine. This is one of the classic examples: a Turing machine that does subtraction using unary numbers. A unary number “N” is written as a series of N 1s. For the program, we’ll give the machine an input in the format “N-M=” written onto the tape; after running the machine, the tape will contain the value of M subtracted from N. So, for example, we could use “1111-11=” as an input; the output would be “11”.

The alphabet is the characters “1”, “ ” (blank space), “-” (minus sign), and “=” (equal sign). The machine has four states: {“scanright”, “eraseone”, “subone”, “skip”}. It starts in the state “scanright”. Its transition function is given by the following table:

FromState Symbol ToState WriteChar Dir
scanright space scanright space right
scanright 1 scanright 1 right
scanright minus scanright minus right
scanright equal eraseone space left
eraseone 1 subone equal left
eraseone minus HALT space n/a
subone 1 subone 1 left
subone minus skip minus left
skip space skip space left
skip 1 scanright space right

What this machine does is move to the right until it sees the equal sign; it erases the equal sign and moves to the left, erases one digit off of the second number, and replaces it with the equal sign (so the second number is reduced by one, and the equal sign is moved over one position). Then it scans back to the minus sign (which separates the two numbers), and erases one digit of the first number, and then switches back to scanning to the right for the equal sign. So one at a time, it erases one digit from each of the two numbers. When it reaches the equal sign, starts going back to erase a digit from the second number, and hits the “-” before it finds a digit, that means it’s done, so it stops. Let’s look at a simple execution trace. In the trace, I’ll write the machine state, followed by a colon, followed by the tape contents surrounded by “[]”, with the current tape cell surrounded by “{}”.

	scanright:  [ {1}1111111-111= ]
	scanright:  [ 1{1}111111-111= ]
	scanright:  [ 11{1}11111-111= ]
	scanright:  [ 111{1}1111-111= ]
	scanright:  [ 1111{1}111-111= ]
	scanright:  [ 11111{1}11-111= ]
	scanright:  [ 111111{1}1-111= ]
	scanright:  [ 1111111{1}-111= ]
	scanright:  [ 11111111{-}111= ]
	scanright:  [ 11111111-{1}11= ]
	scanright:  [ 11111111-1{1}1= ]
	scanright:  [ 11111111-11{1}= ]
	scanright:  [ 11111111-111{=} ]
	eraseone :  [ 11111111-11{1}  ]
	subone   :  [ 11111111-1{1}=  ]
	subone   :  [ 11111111-{1}1=  ]
	subone   :  [ 11111111{-}11=  ]
	skip     :  [ 1111111{1}-11=  ]
	scanright:  [ 1111111 {-}11=  ]
	scanright:  [ 1111111 -{1}1=  ]
	scanright:  [ 1111111 -1{1}=  ]
	scanright:  [ 1111111 -11{=}  ]
	eraseone :  [ 1111111 -1{1}   ]
	subone   :  [ 1111111 -{1}=   ]
	subone   :  [ 1111111 {-}1=   ]
	skip     :  [ 1111111{ }-1=   ]
	skip     :  [ 111111{1} -1=   ]
	scanright:  [ 111111 { }-1=   ]
	scanright:  [ 111111  {-}1=   ]
	scanright:  [ 111111  -{1}=   ]
	scanright:  [ 111111  -1{=}   ]
	eraseone :  [ 111111  -{1}    ]
	subone   :  [ 111111  {-}=    ]
	skip     :  [ 111111 { }-=    ]
	skip     :  [ 111111{ } -=    ]
	skip     :  [ 11111{1}  -=    ]
	scanright:  [ 11111 { } -=    ]
	scanright:  [ 11111  { }-=    ]
	scanright:  [ 11111   {-}=    ]
	scanright:  [ 11111   -{=}    ]
	eraseone :  [ 11111   {-}     ]
	Halt:       [ 11111  { }-     ]

See, it works!

One really important thing to understand here is that we do not have a program. What we just did was define a Turing machine that does subtraction. The machine does not take any instructions: the states and the transition function are an intrinsic part of the machine. So the only thing this machine can do is to subtract.

The real genius of Turing wasn’t just defining this simple computing machine. It was realizing the concept of the program: you could design a Turing machine whose input tape contained a description of a Turing machine – that is, a program – followed by an input to that program. This single machine – the Universal Turing machine – could simulate any other Turing machine; one machine which could perform any computation.

Turing machines are a lot of fun to play with. Figuring out how to do things can be fascinating. Figuring out how to define a Universal turing machine’s transition function is an amazing thing to do; it’s astonishing how simple the universal machine can be!

As I said earlier – you can’t make a mechanical computing device that does anything that a Turing machine can’t do. One of the beauties of the Turing machine is that it lets you explore that. You can try adding and removing features to the basic machine, and see what happens.

For example: if you can do lots of great stuff with a Turing machine with one tape, what if you had a two-tape turing machine? That is, take the basic turing machine, and say that it’s got two tapes, each with a read/write head. Each state transition rule on this machine depends on the pair of values found on the two tapes. For now, we’ll say that the tapes move together – that is, the transition rule says “move the heads right” or “move the heads left”.

Seems like this should represent a real increase in power, right? No. Take a single-tape Turing machine. Take the alphabet for tape one, and call it A1; take the alphabet for tape 2, and call it A2. We can create a single-tape Turing machine whose alphabet is the cross-product of A1 and A2. Now each symbol on the tape is equivalent to a pair: a symbol on tape 1 and a symbol on tape 2. So we’ve got a single-tape machine which is equivalent to the two-tape machine. Bingo.

We can lift the restriction on the heads moving together, but it’s a lot more work. A two-tape machine can do things a lot faster than a one-tape machine, and the simulation necessarily gives up some of that speed. But it’s entirely doable. How about a two-dimensional tape? We can simulate that pretty easily with a two-tape machine, which means we can do it with a one-tape machine. For a two-tape machine, what we do is map the two-D tape onto the one-D tape – so that cell 0 on the one-D tape corresponds to cell (0,0) on the two-D tape; cell (0,1) on the two-D corresponds to cell 1 on the one-D; cell (1,1) on the two-D is cell 2 on the one-D; etc. Then we use the second tape for the book-keeping necessary to do the equivalent of two-D tape moves. And we’ve got a two-D Turing machine simulated with a two-tape one-D; and we know that we can simulate a two-tape one-D with a one-tape one-D.

This is, to me, the most beautiful thing about the Turing machine. It’s not just a fundamental theoretical construction of a computing device; it’s a simple construction of a computing device that’s really easy to experiment with. Consider, for a moment, lambda calculus. It’s more useful than a Turing machine for lots of purposes – we write real programs in lambda calculus, when no one would build a real application using a Turing machine program. But imagine trying experiments like the alternate constructions of the Turing machine above. It’s a whole lot harder to build experiments like those in lambda calculus. Likewise for Minsky machines, Markov machines, etc.

For your enjoyment, I’ve implemented a Turing machine programming language. You feed it a Turing machine description and an input string, and it will give you a trace of the machine’s execution like the one above. Here’s the specification of the subtraction machine written in my little Turing language:

states { "scanright" "eraseone" "subtractOneFromResult" "skipblanks" } initial "scanright"
alphabet { '1' ' ' '=' '-' } blank ' '
trans from "scanright" to "scanright" on (' ') write ' ' move right
trans from "scanright" to "scanright" on ('1') write '1' move right
trans from "scanright" to "scanright" on ('-') write '-' move right
trans from "scanright" to "eraseone" on ('=') write ' ' move left
trans from "eraseone" to "subtractOneFromResult" on ('1') write '=' move left
trans from "eraseone" to "Halt" on ('-') write ' ' move left
trans from "subtractOneFromResult" to "subtractOneFromResult" on ('1') write '1' move left
trans from "subtractOneFromResult" to "skipblanks" on ('-') write '-' move left
trans from "skipblanks" to "skipblanks" on (' ') write ' ' move left
trans from "skipblanks" to "scanright" on ('1') write ' ' move right

I think the syntax is pretty clear, but it still needs some explanation.

  • The first line declares the possible states of the machine, and what state it starts in. This machine has four possible states – “scanright”, “eraseone”, “subtractOneFromResult”, and “skipblanks”. When the machine starts, it will be in the state “scanright”.
  • The second line declares the set of symbols that can appear on the tape, including what symbol will initially appear on any tape cell whose value wasn’t specified by the input. For this machine the symbols are the digit 1, a blank space, the equal sign, and the subtraction symbol; the blank symbol is on any tape cell whose initial value wasn’t specified.
  • After that is a series of transition declarations. Each declaration specifies what the machine will do for a given pair of initial state and tape symbol. So, for example, if the machine is in state “scanright”, and the current tape cell contains the equal sign, then the machine will change to state “eraseone”, write a blank onto the tape cell (erasing the “=” sign), and move the head one cell to the left.

Finally, the code. You’ll need GHCI to run it; at the moment, it won’t work in Hugs (I don’t have the parsing library in my version of Hugs), and I haven’t worked out the linkage stuff to make it work under the GHC compiled mode.

--
-- A Simple Turing machine interpreter
-- Copyright 2007 by Mark C. Chu-Carroll
--    markcc@gmail.com
--   http://scienceblogs.com/goodmath
--
-- You can do anything you want with this code so long as you
-- leave this copyright notice intact.
--
module Turing where
import Text.ParserCombinators.Parsec
import qualified Text.ParserCombinators.Parsec.Token as P
import Text.ParserCombinators.Parsec.Language
import System.Environment

data Motion = MoveLeft  | MoveRight deriving (Show)

-- Transition from to on move write
data Transition = Transition String String [Char] Motion  Char deriving (Show)

-- TuringMachine states initial alphabet blank transitions
data TuringMachine = Machine [String] String   [Char]    Char    [Transition] deriving (Show)

data MachineState = TMState [Char] Char [Char]  String  deriving (Show)
--                          tape-left  curcell  tape-right  curstate

getBlankSym :: TuringMachine -> Char
getBlankSym (Machine _ _ _ bl _) = bl

findMatchingTransition :: [Transition] -> String -> Char -> Maybe Transition
findMatchingTransition [] _ _ =  Nothing
findMatchingTransition translist state c =
     let matches = filter (transitionMatches state c) translist
     in case matches of
          [] -> Nothing
          t:[] -> Just t
          _ -> error "Ambiguous transition"
       where transitionMatches state c (Transition from to on move write) =
                        (from == state) && (elem c on)

runTransition :: TuringMachine -> [Char] -> Char -> [Char] -> String -> Transition -> MachineState
-- Moving left past the recorded end of the tape extends it with a blank cell,
-- just like moving right past the end does below.
runTransition m [] c right state (Transition from tostate on MoveLeft write) =
   TMState [] (getBlankSym m) (write:right) tostate
runTransition m (l:left) c right state (Transition from tostate on MoveLeft write) =
   TMState left l (write:right) tostate
runTransition m left c [] state (Transition from tostate on MoveRight write) =
   TMState (write:left) (getBlankSym m) [] tostate
runTransition m left c (r:right) state (Transition from tostate on MoveRight write) =
   TMState (write:left) r right tostate

stepMachine :: TuringMachine -> MachineState -> MachineState
stepMachine machine@(Machine _ _ _ _ translist) st@(TMState tapeleft c taperight state) =
       let trans = findMatchingTransition translist state c
       in case trans of
          (Just t) -> runTransition machine tapeleft c taperight state t
          Nothing -> error "No applicable transition from state"

getMachineStateString (TMState left c right state) =
	(state ++ ":[" ++ (reverse left) ++ "{" ++ [c] ++ "}" ++ right ++ "]")

runTracedMachine :: TuringMachine -> [Char] -> [String]
runTracedMachine tm@(Machine states initial alphabet blank transitions) (t:tape) =
    runTracedMachineSteps tm (TMState [] t tape initial) where
        runTracedMachineSteps machine state =
           let newstate@(TMState left c right st) = stepMachine machine state
           in if st == "Halt"
               then [getMachineStateString newstate]
               else let trace=runTracedMachineSteps machine newstate
                    in ((getMachineStateString newstate):trace)

runMachine :: TuringMachine -> [Char] -> [Char]
runMachine tm@(Machine states initial alphabet blank transitions) (t:tape) =
    runMachineSteps tm (TMState [] t tape initial) where
        runMachineSteps machine state =
           let newstate@(TMState left c right st) = stepMachine machine state
           in if st == "Halt"
               then (concat [(reverse left), [c], right])
               else runMachineSteps machine newstate

---------------------------------------------------------------------------
-- Parsing code - implemented using the Parsec monadic parser library.
---------------------------------------------------------------------------

-- Basic setup stuff - use a standard haskell style lexer; set up the reserved words
-- and symbols in the lexer.
lexer :: P.TokenParser ()
lexer = P.makeTokenParser (haskellDef
                        { P.reservedNames = ["states","alphabet","trans","from","to","on","write","move","left","right","initial","blank"] })

reserved = P.reserved lexer
symbol = P.symbol lexer
braces = P.braces lexer
parens = P.parens lexer
charlit = P.charLiteral lexer
strlit = P.stringLiteral lexer
ident = P.identifier lexer
whitespace = P.whiteSpace lexer

states = reserved "states"
alphabet = reserved "alphabet"
trans = reserved "trans"
from = reserved "from"
to = reserved "to"
on = reserved "on"
write = reserved "write"
move = reserved "move"
initial = reserved "initial"
blank = reserved "blank"

left = do { reserved "left"
          ; return MoveLeft
          }

right = do { reserved "right"
           ; return MoveRight
           }

-- statesDecl ::= "states" "{"  strlit+  "}" "initial" strlit
statesDecl = do { states
                ; strlst <- braces (many1 strlit)
                ; initial
                ; i <- strlit
                ; return (i,strlst)
                }

-- alphaDecl ::= "alphabet" "{" charlit+  "}" "blank" charlit
alphaDecl = do { alphabet
               ; charlst <- braces (many1 charlit)
               ; blank
               ; bl <- charlit
               ; return (bl, charlst)
               }

-- transDecl ::= "trans" "from" strlit "to" strlit "on" "(" charlit+ ")" "write" charlit "move" ("left" | "right")
transDecl = do { trans
               ; from
               ; fromState <- strlit
               ; to
               ; targetState <- strlit
               ; on
               ; chrs <- parens (many1 charlit)
               ; write
               ; wrchar <- charlit
               ; move
               ; dir <- ( left <|> right)
      	   ; return (Transition fromState targetState chrs dir wrchar)
              }

-- machine ::= statesDecl alphaDecl transDecl+
machine = do { (i,sts) <- statesDecl
             ; (bl,alpha) <- alphaDecl
             ; trans <- many1 transDecl
             ; return (Machine sts i alpha bl trans)
             }

run :: (Show a) => Parser a -> String -> IO a
run p input
	= case (parse p "" input) of
		Left err -> do{ putStr "parse error at "
		; print err
		; error "Parse error"
		}
		Right x -> return x

runTParser ::  String -> IO TuringMachine
runTParser input =
	run (do { whitespace
	        ; x <- machine
	        ; eof
	        ; return x })  input

--------------------------------------------------------------------------
-- A sample Turing program - first in the comment, and then coded into
-- a string literal, to make it easy to run tests. This sample program
-- is a basic Turing subtract - it takes a unary equation "111111-111=",
-- and leaves the result of subtracting the second from the first.
--
-- states { "one" "two" "three" "four" } initial "one"
-- alphabet { '1' ' ' '=' '-' } blank ' '
-- trans from "one" to "one" on (' ') write ' ' move right
-- trans from "one" to "one" on ('1') write '1' move right
-- trans from "one" to "one" on ('-') write '-' move right
-- trans from "one" to "two" on ('=') write ' ' move left
-- trans from "two" to "three" on ('1') write '=' move left
-- trans from "two" to "Halt" on ('-') write '-' move left
-- trans from "three" to "three" on ('1') write '1' move left
-- trans from "three" to "four" on ('-') write '-' move left
-- trans from "four" to "four" on (' ') write ' ' move left
-- trans from "four" to "one" on ('1') write ' ' move right

sampleMachine = concat ["states { \"one\" \"two\" \"three\" \"four\" } initial \"one\"\n ",
                        " alphabet { '1' ' ' '=' '-' } blank ' '\n ",
                        "trans from \"one\" to \"one\" on (' ') write ' ' move right\n ",
                        "trans from \"one\" to \"one\" on ('1') write '1' move right\n ",
                        "trans from \"one\" to \"one\" on ('-') write '-' move right\n ",
                        "trans from \"one\" to \"two\" on ('=') write ' ' move left\n ",
                        "trans from \"two\" to \"three\" on ('1') write '=' move left\n ",
                        "trans from \"two\" to \"Halt\" on ('-') write '-' move left\n ",
                        "trans from \"three\" to \"three\" on ('1') write '1' move left\n ",
                        "trans from \"three\" to \"four\" on ('-') write '-' move left\n ",
                        "trans from \"four\" to \"four\" on (' ') write ' ' move left\n ",
                        "trans from \"four\" to \"one\" on ('1') write ' ' move right"  ]

runTracedMachineOnString :: String -> String -> IO ([String])
runTracedMachineOnString m str =
	do
		tm <- runTParser m
		return (runTracedMachine tm str)

runMachineOnString :: String -> String -> IO String
runMachineOnString m str =
    do
	    tm <- runTParser m
	    return (runMachine tm str)

sampleInput = " 11111111-111= "

------------------------------------------------------------------------
-- Main program execution scaffolding
-- main still needs a bit of work so that ghc will link correctly;
-- runs fine in GHCI, but linkage errors in GHC. For now, just load
-- this file, and then execute "runFromFile" from the command line.
------------------------------------------------------------------------
main =
    do
       [file] <- getArgs
       m <- parseFromFile (do { whitespace
                              ; x <- machine
                              ; eof
                              ; return x }) file
       case m of
          Right machine -> do
             print "Enter input for parser:"
             s <- getLine
             result <- return (runMachine machine s)
             print (concat ["Result:[", result, "]"])
          Left x -> do
	        print (concat ["Parse error"])

runFromFile :: String -> IO ()
runFromFile filename =
    do
       m <- parseFromFile (do { whitespace
                              ; x <- machine
                              ; eof
                              ; return x }) filename
       case m of
          Right machine -> do
             print "Enter input for parser:"
             s <- getLine
             result <- return (runMachine machine s)
             print (concat ["Result:[", result, "]"])
          Left x -> do
	        print (concat ["Parse error"])


A Neat Trick with Partial Evaluators

This is something I was talking about with a coworker today, and I couldn’t quickly find a friendly writeup of it online, so I decided to write my own. (Yes, we do have a lot of people who enjoy conversations like this at foursquare, which is one of the many reasons that it’s such a great place to work. And we are hiring engineers! Feel free to get in touch with me if you’re interested, or just check our website!)

Anyway – what we were discussing was something called partial evaluation (PE). The idea of PE is almost like an eager form of currying. Basically, you have a two parameter function, f(x, y). You know that you’re going to invoke it a bunch of times with the same value of x, but different values of y. So what you want to do is create a new function, fx(y), which evaluates as much as possible of f using the known parameter.

It’s often clearer to see it in programmatic form. Since these days I pretty much live in Scala, I’ll use Scala syntax. We can abstractly represent a program as an object which can be run with a list of arguments, and which returns some value as its result.

trait Program {
  def run(inputs: List[String]): String
}

If that’s a program, then a partial evaluator would be:

object PartialEvaluator {
  def specialize(prog: Program, knownInputs: List[String]): Program = {
     ...
  }
}

What it does is take a program and a partial input, and return a new program: the original program, specialized for the partial inputs supplied to the partial evaluator. So, for example, imagine that you have a program like:

object Silly extends Program {
  def run(inputs: List[String]): String = {
    val x: Int = augmentString(inputs(0)).toInt
    val y: Int = augmentString(inputs(1)).toInt
    if (x % 2 == 0) {
      (y * y).toString
    } else {
      ((y+1)*(y+1)).toString
    }
  }
}

This is obviously a stupid program, but it’s good for an example, because it’s so simple. If we’re going to run Silly a bunch of times with 1 as the first parameter, but with different second parameters, then we could generate a specialized version of Silly:

   val sillySpecialOne = PartialEvaluator.specialize(Silly, List("1"))

Here’s where the interesting part comes in, where it’s really different from currying. A partial evaluator evaluates everything that it possibly can, given the inputs that it’s supplied. Here it knows that x is 1, so it can evaluate the test x % 2 == 0 (it’s false), and throw away the branch that can never be taken. So the value produced by the specializer would be:

object Silly_1 extends Program {
  def run(inputs: List[String]): String = {
    val y: Int = augmentString(inputs(0)).toInt
    ((y+1)*(y+1)).toString
  }
}

This can be a really useful thing to do. It can turn into a huge optimization. (Which is, of course, why people are interested in it.) In compiler terms, it’s sort of like an uber-advanced form of constant propagation.

But the cool thing that we were talking about is going a bit meta. What happens if you run a partial evaluator on a programming language interpreter?

object MyInterpreter extends Program {
  def run(inputs: List[String]): String = {
    val code = inputs(0)
    val inputsToCode = inputs.tail
    interpretProgram(code, inputsToCode)
  }

  def interpretProgram(code: String, inputs: List[String]): String = {
     ...
  }
}

We can run our partial evaluator to generate a version of the interpreter that’s specialized for the code that the interpreter is supposed to execute:

  1. We have an interpreter for language M, written in language L.
  2. We have a partial evaluator for language L.
  3. We run the partial evaluator on the interpreter, with a program
    written in M that we want to interpret:
    PartialEvaluator.specialize(M_Interpreter, M_Code).
  4. The partial evaluator produces a program written in L.

So the partial evaluator took a program written in M, and transformed it into a program written in L!

We can go a step further – and generate an M to L compiler. How? By running the partial evaluator on itself. That is, run the partial evaluator like this: PartialEvaluator.specialize(PartialEvaluator, M_Interpreter). The result is a program which takes an M program as input, and generates an L program: that is, it’s an M to L compiler!

We can go yet another step, and turn the partial evaluator into a compiler generator: PartialEvaluator.specialize(PartialEvaluator, List(source(PartialEvaluator))). What we get is a program which takes an interpreter written in L, and generates a compiler from it.
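
To make the types concrete, here’s a sketch of the three steps written against the Program and PartialEvaluator definitions from above. Everything marked ??? is an assumption for illustration: an interpreter for M, program sources represented as input data, and some way of packaging the partial evaluator itself up as a Program.

object FutamuraSketch {
  def mInterpreter: Program = ???            // interpreter for language M, written in L
  def mProgram: List[String] = ???           // an M program, represented as input data
  def mInterpreterSource: List[String] = ??? // the interpreter's own source, as data
  def peAsProgram: Program = ???             // the partial evaluator, packaged as a Program
  def peSource: List[String] = ???           // the partial evaluator's own source, as data

  // Step 1: specialize the interpreter to one M program.
  // The result is that M program, translated into L.
  def compiledMProgram: Program =
    PartialEvaluator.specialize(mInterpreter, mProgram)

  // Step 2: specialize the partial evaluator to the interpreter.
  // The result takes an M program and produces an L program: an M-to-L compiler.
  def mToLCompiler: Program =
    PartialEvaluator.specialize(peAsProgram, mInterpreterSource)

  // Step 3: specialize the partial evaluator to itself.
  // The result takes any interpreter and produces a compiler from it.
  def compilerGenerator: Program =
    PartialEvaluator.specialize(peAsProgram, peSource)
}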

It’s possible to generate genuinely useful tools this way; people have actually implemented Lisp compilers like this! For example, you can see someone do a simple version here.

Introducing Algebraic Data Structures via Category Theory: Monoids

Since joining foursquare, I’ve been spending almost all of my time writing functional programs. At foursquare, we do all of our server programming in Scala, and we have a very strong bias towards writing our scala code very functionally.

This has increased my interest in category theory in an interesting way. As a programming language geek, I’m obviously fascinated by data structures. Category theory provides a really interesting way of looking at generic data structures.

Historically (as much as that word can be used for anything in computer science), we’ve thought about data structures primarily in algorithmic and structural ways.

For example, binary trees. A binary tree consists of a collection of linked nodes. We can define the structure recursively really easily: a binary tree is a node, which contains pointers to at most two other binary trees.

In the functional programming world, people have started to think about things in algebraic ways. So instead of just defining data structures in terms of structure, we also think about them in very algebraic ways. That is, we think about structures in terms of how they behave, instead of how they’re built.

For example, there’s a structure called a monoid. A monoid is a very simple idea: it’s an algebraic structure with a set of values S, one binary operation *, and one value i in S which is an identity value for *. To be a monoid, these objects must satisfy some rules called the monoid laws:

  1. \forall s \in S: s * i = s, i * s = s
  2. \forall x, y, z \in S: (x * y) * z = x * (y * z)

There are some really obvious examples of monoids – like the set of integers with addition and 0 or integers with multiplication and 1. But there are many, many others.

Lists with concatenation and the empty list are a monoid: for any list, l ++ [] == l, [] ++ l == l, and concatenation is associative.

Why should we care whether data structures like these are monoids? Because we can write very general code in terms of the algebraic construction, and then use it over all of the different structures. Monoids provide the tools you need to build fold operations. Every kind of fold – that is, every operation that collapses a sequence of values into a single value – can be defined in terms of monoids. So you can write one fold operation that works on lists, strings, numbers, optional values, maps, and god-only-knows what else. Any data structure which is a monoid is a data structure with a meaningful fold operation: monoids encapsulate the requirements of foldability.
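
Here’s a small sketch of what that looks like in Scala: a hand-rolled Monoid trait (libraries like cats ship something very similar), two instances, and one generic fold that works for any of them. The names here are my own, chosen for illustration.

trait Monoid[A] {
  def empty: A                  // the identity value i
  def combine(x: A, y: A): A    // the binary operation *
}

object Monoid {
  // Integers with addition and 0.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    def empty: Int = 0
    def combine(x: Int, y: Int): Int = x + y
  }

  // Lists with concatenation and the empty list.
  def listConcat[A]: Monoid[List[A]] = new Monoid[List[A]] {
    def empty: List[A] = Nil
    def combine(x: List[A], y: List[A]): List[A] = x ++ y
  }

  // The generic fold: written once, usable with any monoid at all.
  def fold[A](values: List[A])(m: Monoid[A]): A =
    values.foldLeft(m.empty)(m.combine)
}

So Monoid.fold(List(1, 2, 3, 4))(Monoid.intAddition) gives 10, and Monoid.fold(List(List(1), List(2, 3)))(Monoid.listConcat[Int]) gives List(1, 2, 3); the fold code never changes, only the monoid instance does.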

And that’s where category theory comes in. Category theory provides a generic method for talking about algebraic structures like monoids. After all, what category theory does is provide a way of describing structures in terms of how their operations can be composed: that’s exactly what you want for talking about algebraic data structures.

The categorical construction of a monoid is, alas, pretty complicated. It’s a simple idea – but defining it solely in terms of the composition behavior of function-like objects does take a bit of effort. But it’s really worth it: when you see a monoidal category, it’s obvious what the elements are in terms of programming. And when we get to even more complicated structures, like monads, pullbacks, etc., the advantage will be even clearer.

A monoidal category is a category with a functor, where the functor has the basic properties of an algebraic monoid. So it’s a category C, paired with a bi-functor – that is, a two-argument functor ⊗:C×C→C. This is the categorical analogue of the monoid’s binary operation, and it’s usually called the tensor. To make it a monoidal category, we need to take the tensor operation, and define the properties that it needs to have. They’re called its coherence conditions, and basically, they’re the properties that are needed to make the diagrams that we’re going to use commute.

So – the tensor functor is a bifunctor from C×C to C. There is also an object I∈C, which is called the unit object, which is basically the identity element of the monoid. As we would expect from the algebraic definition, the tensor functor has two basic properties: associativity, and identity.

Associativity is expressed categorically using a natural isomorphism, which we’ll name α. For any three objects X, Y, and Z, α includes a component αX,Y,Z (which I’ll label α(X,Y,Z) in diagrams, because subscripts in diagrams are a pain!), which is a mapping from (X⊗Y)⊗Z to X⊗(Y⊗Z). The natural isomorphism says, in categorical terms, that the two objects on either side of its mappings are equivalent.

The identity property is again expressed via natural isomorphism. The category must include an object I (called the unit), and two natural isomorphisms, called λ and ρ. For any object X in C, λ and ρ contain components λX and ρX such that λX maps from I⊗X to X, and ρX maps from X⊗I to X.
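
If you want a rough programmer’s picture of those components, here’s a sketch in Scala types, where the tensor is the pair type and the unit object is Unit. This is only an intuition aid; it doesn’t capture functoriality, naturality, or the coherence conditions below.

object MonoidalIntuition {
  // The associator α: ((X ⊗ Y) ⊗ Z) is isomorphic to (X ⊗ (Y ⊗ Z)).
  def assoc[X, Y, Z](p: ((X, Y), Z)): (X, (Y, Z)) = (p._1._1, (p._1._2, p._2))
  def assocInv[X, Y, Z](p: (X, (Y, Z))): ((X, Y), Z) = ((p._1, p._2._1), p._2._2)

  // The left unitor λ: I ⊗ X is isomorphic to X.
  def leftUnit[X](p: (Unit, X)): X = p._2
  def leftUnitInv[X](x: X): (Unit, X) = ((), x)

  // The right unitor ρ: X ⊗ I is isomorphic to X.
  def rightUnit[X](p: (X, Unit)): X = p._1
  def rightUnitInv[X](x: X): (X, Unit) = (x, ())
}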

Now, all of the pieces that we need are on the table. All we need to do is explain how they all fit together – what kinds of properties these pieces need to have for this to work; that is, for these definitions to give us a structure that looks like the algebraic notion of monoidal structures, but built in category theory. The properties are, more or less, exact correspondences with the associativity and identity requirements of the algebraic monoid. But with category theory, we can say it visually. The two diagrams below each describe one of the two properties.

[Diagram: the coherence diagrams for a monoidal category – the pentagonal associativity diagram, with the identity diagram below it.]

The upper (pentagonal) diagram must commute for all A, B, C, and D. It describes the associativity property. Each arrow in the diagram is a component of the natural isomorphism over the category, and the diagram describes what it means for the natural isomorphism to define associativity.

Similarly, the bottom diagram defines identity. The arrows are all components of natural isomorphisms, and they describe the properties that the natural isomorphisms must have in order for them, together with the unit I to define identity.

Like I said, the categorical definition is a lot more complicated than the algebraic one. But look at the diagram: you can see folding in it, in the chains of arrows in the commutative diagram.

Categorical Equalizers

Category theory is really all about building math using composition. Everything we do, we do it by defining things completely in terms of composition. We’ve seen a lot of that. For example, we defined sub-objects (and other sub-things) in terms of composition. It sounds strange, but it took me an amazingly long time to really grasp that. (I learned category theory from some not-very-good textbooks, which were so concerned with the formalisms that they never bothered to explain why any of it mattered, or to build any intuition.)

One thing that’s really important that we haven’t talked about yet is equality. In category theory, we define equality on arrows using something called a pullback. We’ll use that notion of equality for a lot of other things – like natural transformations. But before we can do pullbacks, we need to look at a weaker notion of arrow equality – something called the equalizer of a pair of arrows.

As part of my attempt to be better than those books that I complained about, we’ll start with intuition, by looking at what an equalizer is in terms of sets. Suppose we have two functions f, g : A \rightarrow B.

Now, in addition, suppose that they agree for at least some inputs: that is, that there’s some set of values x \in A for which f(x) == g(x). We can take that set of values, and call it C. C is the equalizer of the functions f and g. It’s the largest subset C of A such that, if you restrict the inputs of both functions to members of C, then f and g are equal.
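
For finite sets, you can compute this directly. Here’s a small Scala sketch (toy code of my own, not anything standard) of the set-level equalizer and its inclusion arrow:

object SetEqualizer {
  // The equalizer of f and g: the subset of the domain on which they agree.
  def equalizer[A, B](domain: Set[A])(f: A => B, g: A => B): Set[A] =
    domain.filter(x => f(x) == g(x))

  // The arrow i : C -> A from the categorical version is just the inclusion:
  // it maps each element of C to itself, viewed as an element of A.
  def inclusion[A](c: Set[A]): A => A = identity
}

For example, over A = {0, ..., 9}, the functions f(x) = x * x and g(x) = 3 * x - 2 agree exactly on {1, 2}, so SetEqualizer.equalizer((0 to 9).toSet)(x => x * x, x => 3 * x - 2) returns Set(1, 2).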

Now, let’s look at the category theoretic version of that. We have objects A and B. We have two arrows f, g : A \rightarrow B. This is the category analogue of the setup of sets and functions from above.

To get to the equalizer, we need to add an object C which is a subobject of A (which corresponds to the subset of A on which f and g agree in the set model).

The equalizer of f and g is the pair of the object C, and an arrow i : C \rightarrow A. (That is, the object and arrow that define C as a subobject of A.) This object and arrow must satisfy the following conditions:

  1.  f \circ i = g \circ i
  2. (\forall j: D \rightarrow A) if f \circ j = g \circ j then (\exists! k : D \rightarrow C) such that i \circ k = j

That second one is the mouthful. What it says is:

  • Suppose that I have any arrow j from some other object D to A:
  • if f and g agree when composed with j, then there is exactly one arrow from D to C which you can compose with i to get j.

In other words, (C, i) is a selector for the arrows on which f and g agree; you can only compose an arrow to A in a way that will compose equivalently with f and g to B if you go through (C, i).

Or in diagram form, k in the following diagram is necessarily unique:

[Diagram: the equalizer – arrows f and g from A to B, the inclusion i from C to A, and the unique arrow k from D to C.]

There are a couple of interesting properties of equalizers that are worth mentioning. The morphism in an equalizer is always a monic arrow (a monomorphism); and if it’s epic (an epimorphism), then it must also be iso (an isomorphism).

The pullback is very nearly the same construction as the equalizer we just looked at; except it’s abstracting one step further.

Suppose we have two arrows pointing to the same target: f : B \rightarrow A and g : C \rightarrow A. Then the pullback of f and g is the triple of an object and two arrows (B \times_A C, p : B \times_A C \rightarrow B, q : B\times_A C \rightarrow C). The elements of this triple must meet the following requirements:

  1. f \circ p = g \circ q
  2. (f \circ p) : B\times_A C \rightarrow A
  3. for every triple (D, h : D \rightarrow B, k : D \rightarrow C), there is exactly one arrow \langle h,k \rangle_A : D \rightarrow B \times_A C where p \circ \langle h,k \rangle_A = h, and q \circ \langle h,k \rangle_A = k.

As happens so frequently in category theory, this is clearer using a diagram.

[Diagram: the pullback square – B \times_A C with projections p and q to B and C, f and g into A, and the unique arrow \langle h,k \rangle_A from any candidate D.]

If you look at this, you should definitely be able to see how this corresponds to the categorical equalizer. If you’re careful and clever, you can also see the resemblance to the categorical product (which is why we use the \times_A syntax). It’s a general construction that says that f and g are equivalent with respect to the product-like object B \times_A C.

Here’s the neat thing. Work backwards through this abstraction process to figure out what this construction means if objects are sets and arrows are functions: what’s the pullback of the sets B and C?

{ (x,y) \in B \times C : f(x) = g(y) }

Right back where we started, almost: the pullback is very nearly an equalizer, and working it backwards shows that.
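
For finite sets, the pullback is just as easy to compute directly as the equalizer was. Here’s a Scala sketch in the same spirit as the equalizer example above:

object SetPullback {
  // Given f : B -> A and g : C -> A, the pullback object is the set of pairs
  // that f and g send to the same place, and p and q are the two projections.
  def pullback[A, B, C](bs: Set[B], cs: Set[C])(f: B => A, g: C => A): Set[(B, C)] =
    for { b <- bs; c <- cs; if f(b) == g(c) } yield (b, c)

  def p[B, C](pair: (B, C)): B = pair._1   // projection to B
  def q[B, C](pair: (B, C)): C = pair._2   // projection to C
}

For example, with f and g both taken to be “mod 3” maps into A = {0, 1, 2}, SetPullback.pullback(Set(1, 2, 3), Set(4, 5, 6))(_ % 3, _ % 3) returns Set((1, 4), (2, 5), (3, 6)).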

Substuff

What’s a subset? That’s easy: if we have two sets A and B, A is a subset of B if every member of A is also a member of B.

We can take the same basic idea, and apply it to something which has a tad more structure, to get subgroups. What’s a subgroup? If we have two groups A and B, and the values in group A are a subset of the values in group B (with A’s group operation being B’s operation, restricted to those values), then A is a subgroup of B.

The point of category theory is to take concepts like “subset” and generalize them so that we can apply the same idea in many different domains. In category theory, we don’t ask “what’s a subset?”. We ask, for any structured THING, what does it mean to be a sub-THING? We’re being very general here, and that’s always a bit tricky. We’ll start by building a basic construction, and look at it in terms of sets and subsets, where we already understand the specific concept.

In terms of sets, the most generic way of defining subsets is using functions. Suppose we have a set, A. How can we define all of the subsets of A, in terms of functions? We can do it using injective functions, as follows. (As a reminder, an injective function is a function from X to Y where every value in X is mapped to a distinct value in Y.)

For the set A, we can take the set of all injective functions into A. We’ll call that set of functions Inj(A).

Given Inj(A), we can define equivalence classes over Inj(A), so that f: X \rightarrow A and g: Y \rightarrow A are equivalent if there is an isomorphism h between X and Y such that g \circ h = f (that is, if f and g pick out the same part of A).

The domain of each function in one of the equivalence classes of Inj(A) is a set isomorphic to a subset of A: the subset that all of the functions in that class pick out. So each equivalence class of injective functions defines a subset of A.

And there we go: we’ve got a very abstract definition of subsets.

Now we can take that, and generalize that function-based definition to categories, so that it can define a sub-object of any kind of object that can be represented in a category.

Before we jump in, let me review one important definition from before; the monomorphism, or monic arrow.

A monic arrow is an arrow f : a \rightarrow b such that
\forall g_1, g_2: x \rightarrow a: f \circ g_1 = f \circ g_2 \Rightarrow g_1 = g_2. (That is, if two arrows end up at the same place after being composed with f, then they must have been the same arrow to begin with.)

So, basically, the monic arrow is the category theoretic version of an injective function. We’ve taken the idea of what an injective function means, in terms of how functions compose, and when we generalized it, the result is the monic arrow.

Suppose we have a category C, and an object a in \mbox{Obj}(C). If there are two monic arrows f : x \rightarrow a and g : y \rightarrow a, and there is an arrow h such that g \circ h = f, then we say f \le g (read “f factors through g”). Now, we can take that “≤” relation, and use it to define an equivalence class of morphisms using f \equiv g \Leftrightarrow f \le g \land g \le f.

What we wind up with using that equivalence relation is a set of equivalence classes of monomorphisms pointing at A. Each of those equivalence classes defines a subobject of A. (The sources of the arrows within one equivalence class are all isomorphic to each other, which is why they count as equivalent under this relation.) A subobject of A is the source of an arrow in one of those equivalence classes.

It’s exactly the same thing as the function-based definition of subsets. We’ve created a very general concept of sub-THING, which works exactly the same way as subsets, but can be applied to any category-theoretic structure.