How Computers Really Work: Math via Boolean Logic

As I try to write more posts about ARM programming, I’m finding that I keep getting sidetracked by background stuff. Hopefully it’s interesting; if not, let me know in the comments!

Today’s sidetrack: how the heck does a computer actually do math?

As I said in my post about binary the other day, at a deep level, computers don’t actually work with numbers, not even zeros and ones. Computers are electronic devices, and what they really do is work with electrical signals. In computer hardware, there are really just a few fundamental operations that can be performed on those signals. By taking those basic operations, and combining them in the right way, we can do math. How that works is very mysterious to most people. Obviously, I can’t describe all of how a computer works in a blog post. But what I’m going to do is explain the basic parts (the gates that are used to build the hardware), and how to combine them to implement a piece of hardware that does integer addition.

In computer hardware, there are four fundamental components that we can use to build operations, and they’re the basic operations of boolean logic: And, Or, Exclusive Or, and Not.

  • AND: Boolean AND takes two inputs, and outputs a 1 if both of its inputs are 1.
  • OR: Boolean OR takes two inputs, and outputs a 1 if at least one of its inputs is 1.
  • XOR: Boolean Exclusive-OR takes two inputs, and outputs a 1 if one, but not both, of its inputs is 1.
  • NOT: Boolean NOT (also called negation) takes one input, and outputs a 1 if its input was 0.

In talking about these kinds of operations, we frequently use truth tables to explain them. A truth table just lists all of the possible input combinations, and tells you what the output for them is. For example, here’s a truth table showing you AND, OR, and XOR:

A B A∧B A∨B A⊕B
0 0 0 0 0
0 1 0 1 1
1 0 0 1 1
1 1 1 1 0
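
If it helps to see those operations in code, here’s a minimal Python sketch (just an illustration, not hardware) that defines the four operations on single bits and reproduces the truth table:

def AND(a, b): return a & b
def OR(a, b):  return a | b
def XOR(a, b): return a ^ b
def NOT(a):    return 1 - a

# Print the truth table for AND, OR, and XOR.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), XOR(a, b))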

In terms of hardware, you can actually implement all of these using one primitive gate type, the NAND (negated AND) gate. For example, in the diagram below, you can see how to build a NOT gate using a NAND gate, and then, using that NAND-based NOT gate, how to build an AND gate out of two NANDs.

[Figure: building a NOT gate and an AND gate out of NAND gates]
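
Here’s the same construction as a small Python sketch, where NAND is the only primitive and everything else is built on top of it (again, purely illustrative):

def NAND(a, b):
    # The one primitive gate: output 0 only when both inputs are 1.
    return 0 if (a == 1 and b == 1) else 1

def NOT(a):
    # A NOT gate is just a NAND with both inputs tied together.
    return NAND(a, a)

def AND(a, b):
    # An AND gate is a NAND followed by a NOT, two NANDs in total.
    return NOT(NAND(a, b))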

In the hardware, those gates are all that we have to work with. To build arithmetic, we need to figure out how to combine these primitive pieces to build up something that produces the right results. First, we need to be able to describe what the correct results are. We’ll start by figuring out what it means to add two 1-bit numbers. We’ll have two one-bit inputs. We need two outputs – because two one-bit numbers can add up to a two-bit sum. We’ll call the outputs S and C (sum and carry – S is the low-order bit, which would be the sum if we could only have a one-bit result, and C is the value of the second bit.)

We can describe addition with a truth table:

A B S C
0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1

We can read those rows as “If A is 0 and B is 0, then A+B is 0 with a carry of 0”.

If we compare that table to the table up above showing the basic operations, then we can see that S is exactly the same as XOR, and C is exactly the same as AND. So we can build a one-bit adder out of gates:

[Figure: a half-adder built from an XOR gate and an AND gate]

This little one-bit adder is commonly called a half-adder, because to really implement a useful add operation, we need two of them for each bit.
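
As a Python sketch mirroring the gate diagram (illustration only, not hardware), the half-adder is just one XOR and one AND:

def half_adder(a, b):
    # The sum bit is the XOR of the inputs; the carry bit is the AND.
    s = a ^ b
    c = a & b
    return s, c

# half_adder(1, 1) -> (0, 1): one plus one is 0, with a carry of 1.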

If you think about what you’d need to do to add together two two-bit numbers, you couldn’t just use a half-adder for each bit. The carry from the first bit needs to get added to the second bit. You need to plumb the carry result from the first bit into a third input for the second bit. To get that second bit to work properly, you need to add together three one-bit values: the two inputs for the high-order bit, and the carry from the low-order bit. Generalizing, if you want to add together N bits,
when you’re computing the sum of the Mth bit, you need to include the carry from the M-1th bit.

To include the carry, we need to combine half-adders into a full adder, which looks like the following:

[Figure: a full adder built by combining two half-adders]
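
In code, using the half_adder function from the sketch above, the standard construction looks like this (a sketch of the idea, not a description of any particular circuit):

def full_adder(a, b, carry_in):
    # The first half-adder adds the two input bits.
    s1, c1 = half_adder(a, b)
    # The second half-adder adds in the carry from the previous bit.
    s, c2 = half_adder(s1, carry_in)
    # At most one of the two half-adders can produce a carry,
    # so an OR gate combines them into the carry-out.
    return s, c1 | c2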

Using that, we can create an N-bit adder, by chaining the carry output from bit M-1 to the carry input of bit M. This creates the simplest adder circuit, which is called a ripple-carry adder.

[Figure: a ripple-carry adder built by chaining full adders]
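
Chaining the full_adder sketch from above gives a toy ripple-carry adder in Python. Real adders are combinational circuits, not loops, so treat this purely as an illustration of the wiring:

def ripple_carry_add(a_bits, b_bits):
    # a_bits and b_bits are equal-length lists of bits, least significant first.
    carry = 0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry

# 7 + 13: [1, 1, 1, 0] + [1, 0, 1, 1] -> ([0, 0, 1, 0], 1), i.e. 10100 = 20.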

Ripple-carry adders are the simplest way of building an integer addition operation in hardware. They’ve got one very big downside: to add the Nth bit, you need to have the result of adding the N-1th bit. That means that the more bits you add, the slower the ripple-carry adder gets. You have to wait for the signals to propagate all the way through the circuit from the low bits to the high bits.

So ripple-carry isn’t really used in hardware anymore. There are a bunch of more complicated ways of building the adder that get rid of that propagation delay. It comes down to a very common tradeoff for engineers: performance versus complexity. But at the end of the day, the concept is still basically the same: it’s still a chain of multiple simple adders, that work the way I described here.

Closeness without distance

In my introduction, I said that topology is fundamentally built on the notion of closeness. Someone very quickly responded on Twitter, because they thought that was wrong. It wasn’t wrong, but it’s easy to see where the confusion came from. Like so much math, Topology is built on a very precise logical and set-theoretic formalism. Mathematicians build those formalisms not because they’re annoying people who want to be mysterious and incomprehensible, but because the precision of those formalisms is critically important.

When you hear a statement like “point A is close to point B in a space S”, you have an intuitive idea of what the word “close” means. But when you try to expand that to math, it could potentially mean several different things. The easiest meaning would be: the distance between A and B is small.

Mathematicians have used that definition for a lot of interesting work. It’s got one limitation though: For it to work, you need to be able to define “distance” in the space. How do you do that? In conventional Euclidean space, we have an easy definition. Describe the position of the two points using Cartesian coordinates: A=(x1, y1), B = (x2, y2). The distance between A and B is:

d(A, B) = \sqrt{(x_2-x_1)^2 + (y_2-y_1)^2}

But we’re moving towards the world of topology. We can’t count on our spaces to be Euclidean. In fact, the whole point of topology is, in some sense, to figure out what happens when you have different spatial structures – that is, structures other than the familiar Euclidean one! We need to be able to talk about distances in some more general way. To do that, we’ll create a new kind of space – a space with an associated distance metric. This new space is called a metric space.

A distance metric is conceptually simple. It’s just a special kind of function, from pairs of points in a space to a real number. To be a distance metric, it needs a couple of properties. Suppose that the set of points in the space is S. Then a function d: S \times S \rightarrow \mathbf{R} is a distance metric if it satisfies the following requirements:

  1. Identity: \forall s_i, s_j \in S: d(s_i, s_j) = 0 \Leftrightarrow s_i = s_j
  2. Symmetry: \forall s_i, s_j \in S: d(s_i, s_j) = d(s_j, s_i)
  3. Triangle Inequality: \forall s_i, s_j, s_k \in S: d(s_i, s_k) \le d(s_i, s_j) + d(s_j, s_k)
  4. Non-negativity: \forall s_i, s_j \in S: d(s_i, s_j) \ge 0

A metric space is just the pair (S,d) of a set S, and a metric function d over the set. For example (there’s a short code sketch of these after the list):

  1. A cartesian plane is a metric space whose metric function is the euclidean distance: d((a_x,a_y), (b_x,b_y)) = \sqrt{(a_x-b_x)^2 + (a_y-b_y)^2}.
  2. A checkerboard is a metric space with the number of kings moves as the metric function.
  3. The Manhattan street grid is a metric space where the distance function between two intersections is the sum of the number of horizontal blocks and the number of vertical blocks between them.
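
If it helps to see those three metrics concretely, here’s a small Python sketch (the function names are mine, chosen just for this illustration):

import math

def euclidean(a, b):
    # Straight-line distance between two points in the Cartesian plane.
    return math.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)

def kings_moves(a, b):
    # Number of king's moves between two squares on a checkerboard.
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def manhattan(a, b):
    # Horizontal blocks plus vertical blocks on a street grid.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])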

All of this is the mathematical work necessary to take one intuitive notion of closeness – the idea of “two points are close if there’s a small distance between them” and turn it into something formal, general, and unambiguous. But we still haven’t gotten to what closeness means in topology! It’s not based on any idea of distance. There are many topological spaces which aren’t metric spaces – that is, there’s no way to define a metric function!

Fortunately, metric spaces give us a good starting point. In topological spaces, closeness is defined in terms of neighborhoods and open balls.

Take a metric space, (S, d), and a point p \in S. An open ball B(p, r) (that is, a ball of radius r around the point p) is the set of points \{ x \in S \,|\, d(p, x) < r \}.

Given a large enough set of points, you can create an infinite series of concentric open spheres: B(p, \epsilon), B(p, 2\epsilon), B(p, 3\epsilon), and so on. Once you’ve got that series of ever-smaller and ever-larger open balls around a point p, you’ve got another notion of closeness. A is closer to p than B is if A is in a smaller open ball around p.
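
In code terms, an open ball is nothing more than a membership test against the metric. A tiny sketch, reusing the euclidean function from the earlier example:

def in_open_ball(d, p, r, x):
    # x is in the open ball B(p, r) exactly when d(p, x) < r.
    return d(p, x) < r

# in_open_ball(euclidean, (0, 0), 1.0, (0.5, 0.5)) -> True
# in_open_ball(euclidean, (0, 0), 1.0, (2.0, 0.0)) -> False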

This is the heart of topology. You can define something like an open-ball on a set without a metric. As long as you can create a consistent sequence of open balls, where each larger ball is a strict superset of all of the smaller ones, you can define closeness without any notion of measurable distance!

In the next post, we’ll use this notion of a distance-free sense of closeness to define what a topology actually is.

Another pass at Topology!

A long time ago – in 2006! – I wrote a ton of blog posts about topology. In the course of trying to fix up some of the import glitches from migrating this blog to its new home, I ended up looking at a bunch of them. And… well… those were written in my early days of blogging, and looking back at those posts now… well, let’s just say that my writing has come a long way with 8 years of practice! I was thinking, “I could do a much better job of writing about that now!”

So that’s what I’m going to do. These aren’t going to be reposts, but rather complete rewrites.

Topology is a good example of one of the things that I love about math: abstraction. Mathematicians pick some fundamental concept, focus tightly on it, and discard everything else. In topology, you want to understand shapes. Starting from that basic idea, topology lets us understand shapes, distortions, dimensions, continuity, and more.

The starting point in topology is closeness. You can define what a shape is by describing which points are close to which other points. Two shapes are equivalent if they can be built using the same closeness relationships. That means that if you can take one shape, and pull, squash, and twist it into another shape – as long as you don’t have to either break closeness relations (by tearing or punching holes in it), or add new closeness relations (by gluing edges together) – the two shapes are really the same thing.

This leads to a very, very bad math joke.

How do you recognize topologists at breakfast?

They’re the people who can’t tell their donut from their coffee.

The easiest way to ruin a joke is to over-explain it. Happily, when the joke is this bad, it’s already ruined, so this isn’t my fault. See, in topology, a coffee mug is the same shape as a donut. They’re each three-dimensional shapes with one hole. If you had a donut made of an infinitely stretchable material, you could shape it into a coffee mug without ever tearing it, ripping a hole, or gluing two edges together.

Anyway – starting tomorrow, I’ll be posting a new version of that old topology series.

Representing Numbers in Binary

Before we can really start writing interesting programs with our ARM, even simple ones, we need to understand a bit about how a computer actually works with numbers.

When people talk about computers, they frequently say something like “It’s all zeros and ones”. That’s wrong, in a couple of different ways.

First, when you look at the actual computer hardware, there are no zeros and ones. There are two kinds of signals, represented as two different voltage levels. You can call them “high” and “low”, or “A” and “B”, or “+” and “-“. It doesn’t matter: it’s just two signals. (In fact, one fascinating thing to realize is that there’s a fundamental symmetry in those signals: you can swap the choice of which signal is 0 and which is 1, and if you also swap the AND gates and the OR gates, you won’t be able to tell the difference! So one ARM processor could use one signal for 1, and a different ARM could use that signal as 0, and you wouldn’t actually be able to tell. In practice, everyone does it the same way, because chip design and manufacture are standardized. But mathematically, it’s possible. We’ll talk about that duality/symmetry in another post.)

Second, the computer really doesn’t work with numbers at all. Computer hardware is all about binary logic. Even if you abstract from voltages to the basic pair of values, the computer still doesn’t understand numbers. You’ve got bits, but to get from bits to numbers, you need to decide on a meaning for the bits’ two possible values, and you need to decide how to put them together.

That’s what we’re really going to focus on in this post. Once you’ve decided to call one of the two primitive values 1, and the other one 0, you need to decide how to combine multiple zeros and ones to make a number.

It might seem silly, because isn’t it obvious that it should use binary? But the choice isn’t so clear. There have been a lot of different numeric representations. For one example, many of IBM’s mainframe computers used (and continue to use) something called binary coded decimal (BCD). It’s not just different for the sake of being different: in fact, for financial applications, BCD really does have some major advantages! Even if you have decided to use simple binary, it’s not that simple. Positive integers are easy. But how do you handle negative numbers? How do you handle things that aren’t integers?

We’re going to start with the simplest case: unsigned integers. (We’ll worry about fractions, decimals, and floating point in another post.) Like pretty much all modern computers, the ARM uses the basic, mathematical binary exponential representation. This works basically the same as our usual base-10 numbers. We look at the digits from right to left. The rightmost digit (also called the least significant digit) counts ones; the next digit counts 10s; the next counts 100s, and so on. So in base 10, the number 3256 means 6*10^0 plus 5*10^1 plus 2*10^2 plus 3*10^3.

[Figure: the positional values of the bits in a binary number]

In binary, we do exactly the same thing, only we do it with powers of 2 instead of powers of 10. So in binary, the number 1001101 is 1*2^0 + 0*2^1 + 1*2^2 + 1*2^3 + 0*2^4 + 0*2^5 + 1*2^6 = 1 + 4 + 8 + 64 = 77.
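
Here’s that positional computation spelled out as a tiny Python sketch, just to show the weights at work:

def binary_to_int(bit_string):
    # The i'th digit from the right counts 2**i.
    total = 0
    for i, digit in enumerate(reversed(bit_string)):
        total += int(digit) * (2 ** i)
    return total

print(binary_to_int("1001101"))  # 1 + 4 + 8 + 64 = 77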

Arithmetic on binary is easy enough – you do the same thing you would with decimal, except that carries and borrows happen at 2 instead of 10. As a quick example, let’s look at 7 + 13, which is 111 + 1101. (There’s a little code check after the list below.)

  1. We start at the right edge. We have 1 + 1 = 10 – so the first digit of the sum is 0, and we carry 1.
  2. Next we have 1 + 0 + 1(carry) = 10 – so the second digit is again 0, and we carry 1. Our sum so far is 00.
  3. Now we’re on to the third digit. 1 + 1 + 1(carry) = 11. So the third digit is 1, and we carry one. Our sum so far is 100.
  4. Now the fourth digit, 1 + 0 + 1(carry), so we get 10. So the sum is 10100, or 20.
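
Here’s the same by-hand procedure as a short Python sketch, adding digit by digit and carrying into the next column:

def add_binary(x, y):
    # Add two non-negative integers one binary digit at a time.
    result = []
    carry = 0
    while x or y or carry:
        digit = (x & 1) + (y & 1) + carry
        result.append(str(digit % 2))
        carry = digit // 2
        x >>= 1
        y >>= 1
    return ''.join(reversed(result)) or '0'

print(add_binary(0b111, 0b1101))  # '10100', which is 20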

We’ll ignore subtraction for a moment, because as we’ll see in a little while, in computer hardware, we don’t actually need to have subtraction as a primitive. Addition is the core of computer arithmetic, and we’ll use addition to implement subtraction by taking the negative of the number being subtracted. (That is, to compute A-B, we’ll do A+(-B).)

Positive integers with addition aren’t enough to do most stuff we want to do on a computer. Even if we’re never going to write a program that manipulates numbers, we absolutely need to be able to subtract. To a computer, the only way to compare values is to subtract one from another! So we need to be able to do negatives and subtractions. How can we represent negative numbers in binary?

There are three basic choices, called sign-bit/sign-magnitude, one’s complement, and two’s complement. I’ve put an example of the three representations of the number 75 in the figure below.

[Figure: sign-magnitude, one’s complement, and two’s complement representations]

In sign-bit representation, what you do is take the leftmost bit (also called the high-order bit), and use it to indicate sign (for obvious reasons, you call it the sign bit.). If the sign bit is 0, then the number is positive; if it’s 1, then the number is negative. For example, 01010 would be +10; 11010 would be -10.

For a human being, sign-bit looks great. But in practice, sign-bit was never used much, because while it looks simple to a human, it’s quite complicated to build in computer hardware. IBM did use it in some early machines, but even they gave up on it.

Next is one’s complement. In 1’s complement, the high-order bit is still a sign bit. But to convert a number from positive to negative, you don’t just change the sign bit – you invert every single bit in the number. You can still tell whether a number is positive or negative by its sign bit, but the rest of the bits are also different. +10 in one’s complement binary is 01010; -10 is 10101.

Arithmetic in 1s complement is a bit weird. You can almost just add a negative number to a positive number as if they were both positive. Almost, but not quite.

For example, let’s try 6 + -6. 6 is 0110, and -6 is 1001. Add them up: 1111. In one’s complement, that’s -0. And there’s one of the weird issues about one’s complement: it’s got two distinct values for 0 – +0 and -0. Since they’re both just 0, we treat them as equal, and it’s not really a problem.

How about 6 + -8? 6 is 00110 (we need 5 bits to handle 8), and -8 is 10111. Add
them up, and you get 11101 – which is -2, the correct answer.

Now, what about 8 + -6? 8 is 01000, and -6 is 11001. Add them up, and you
get 00001, with a carry of 1. So 8 + -6 = 1? That’s wrong! In one’s complement,
there are a bunch of places where simple binary addition will be off by one.
So you need to work out the algorithm for where it’s off-by-one and where it’s not, and you need to build more complicated hardware to incorporate it. That’s not attractive.
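
Here are those examples as a quick Python sketch, using a 5-bit one’s complement representation (the bit width and helper names are mine, purely for illustration):

BITS = 5
MASK = (1 << BITS) - 1  # 0b11111

def ones_complement(n):
    # Negate by flipping every bit within the 5-bit width.
    return n ^ MASK

def add_5bit(a, b):
    # Plain binary addition, reporting the overflow carry separately.
    total = a + b
    return total & MASK, total >> BITS

print(bin(add_5bit(6, ones_complement(8))[0]))  # 0b11101, which is -2: correct
result, carry = add_5bit(8, ones_complement(6))
print(bin(result), carry)  # 0b1 with a carry of 1: off by one
# (Adding that carry back in gives 0b10, the correct answer of 2.)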

Still, one’s complement has been used a lot. In particular, one of the first computers I got to use was an old, beaten-up PDP-1, which used 1’s complement numbers.

Finally, we get to the representation that’s used in pretty much all modern computers: 2’s complement!

Once again, 2’s complement uses a sign bit. But instead of flipping all of the bits, you do something different. In 2’s complement, you need to know how many bits you’re using. If you’re doing an N-bit 2’s complement binary number, then the number -x is represented by 2^N - x.

So if we’re doing 6 bits, and we wanted to represent -5, then we’d take 2^6 - 5, or 64 - 5 = 59. In binary, that’s 111011.

The really beautiful thing about 2s complement is that it’s pretty much the same thing as a truncated 2-adic integer – which means that arithmetic just works. If you’re adding two numbers, it doesn’t matter whether they’re signed numbers or not – it works.

It’s also really easy to implement negation. You don’t have to do that whole “subtract from 2^N” thing. In 2s complement, -N is 1+(ones_complement(N)). That’s super-easy to implement in hardware, and it’s also easy to understand and do for a human: flip the bits and add one!
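
A quick Python sketch of both facts: negation by “flip the bits and add one”, and addition that just works modulo 2^N (the 6-bit width is chosen to match the example above):

BITS = 6
MASK = (1 << BITS) - 1  # 0b111111

def twos_complement_negate(n):
    # Flip the bits and add one, staying within the fixed width.
    return ((n ^ MASK) + 1) & MASK

print(bin(twos_complement_negate(5)))           # 0b111011, i.e. 59 = 2**6 - 5
print((12 + twos_complement_negate(5)) & MASK)  # 7: plain addition computes 12 + (-5)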

Two’s complement is, in my opinion, a clear winner in integer representation, and the world of computer hardware makers agrees – everyone now uses 2’s complement for integers.

Bayes Theorem

I’ve been meaning to get back to some of the probability stuff. We’re currently recovering from a major snow/ice storm, and I’m snowed/iced in, so this is a good time!

Today, we’ll talk about what is, according to many people, the most important rule in all of probability: Bayes theorem. It’s also, in my experience, the single most abused rule in all of mathematics. Nothing else has been used so poorly, by so many people, to support sloppy, dumb arguments. After we talk about what the rule is, and what it means, we’ll move on to talk about how it gets abused.

In a pure mathematical sense, Bayes theorem is simple. The interpretation of it, and what it means, gets pretty hairy. Suppose that you’ve got two related events, A and B. You know the probability of A occurring is P(A). You know the probability of B occurring is P(B). And you know, if A has already occurred, what the probability of B occurring is. (We write that P(B | A), which you can read as “the probability of B given A”.) What you’d like to know is: suppose that I know that B occurred. What’s the probability that A also occurred? (What is P(A | B)?)

Bayes theorem says:

P(A|B) = \frac{P(B|A) P(A)}{P(B)}

Let’s be concrete. I go to work, walk into my office in the morning, and get into the elevator with one other person that I work with. What is the probability that it’s a man?

Without knowing anything about the people that I work with, a reasonable guess would be 50% – the population is pretty close to evenly divided between the genders.

But I’m an engineer, and one of the very unfortunate facts about my job is that the gender pool of engineers is very skewed. Let’s say that it’s 80% men. (In reality, that’s probably actually pretty low.)

Let’s say that about 1/3 of the office is engineering. So the odds that someone I bump into will be an engineer are about 1 in 3.

I can do a couple of things with that information. I could ask, suppose that I walked into the elevator with a woman. What’s the probability that she’s an engineer?

To answer that, I’ll use Bayes law. Here the hypothesis A is “this person is an engineer”, and the evidence B is “this person is a woman”. P(A), the probability that a random person is an engineer, is 1/3. P(B), the probability that a random person is a woman, is 1/2. And if I know that a given person is an engineer, the probability of that person being a woman is P(B | A), or 1/5. So what’s the probability of my random female coworker being an engineer (P(A | B))? The numbers are worked out below, with a quick code check after them.

  • P(\text{woman}) = 1/2
  • P(\text{eng}) = 1/3
  • P(\text{woman} | \text{eng}) = 1/5
  • P(\text{eng} | \text{woman}) = \frac{P(\text{woman}|\text{eng})P(\text{eng})}{P(\text{woman})} = \frac{(1/5)(1/3)}{1/2} = \frac{2}{15}
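
Just to check the arithmetic, here’s the same calculation as a couple of lines of Python:

p_woman = 1 / 2
p_eng = 1 / 3
p_woman_given_eng = 1 / 5

# Bayes theorem: P(eng | woman) = P(woman | eng) * P(eng) / P(woman)
p_eng_given_woman = p_woman_given_eng * p_eng / p_woman
print(p_eng_given_woman)  # 0.1333..., which is 2/15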

See? That was easy, wasn’t it?

Now, what’s it actually mean? If you look at it this way, it doesn’t seem to be such a big deal. Sure, it’s a way of combining probabilities in another situation, but so what? Why’s it any more important than any other?

Because it’s the mathematical method for how to incorporate new knowledge into your expectations. What we did above was start with one understanding of the thing we were trying to predict. Knowing nothing but the typical distribution of genders in the general population, we made a guess about a 50% probability of encountering a woman. But then we added in new information. We knew the proportion of engineers, and the fact that the gender ratio was skewed in engineering – and we incorporated that new information into our prediction.

That answer comes from interpretations. One of the classic interpretations of probability theory is the Bayesian interpretation – named Bayesian specifically because of how it interprets this rule! The Bayesian interpretation says that a statement about probability is really a statement about the state of our knowledge. If I say that the probability of flipping heads on a coin is 1/2, what I’m saying under the Bayesian interpretation is that my certainty that I’ll flip heads is just 1/2.

In that kind of knowledge-based interpretation, there is no intrinsic probability of any event. There is just our degree of certainty about whether the event will occur. Given new information, our degree of certainty can change. Bayes theorem tells us, given new information, exactly how we should change our interpretation.

To explain the bayesian interpretation, we’ll add a couple of terms.

Hypothesis
The hypothesis is the thing whose degree of certainty we’re trying to measure. In the formulation of Bayes law up above, we call it A; here, we’ll call it H.
Prior
The prior, P(H), is the degree of certainty about the hypothesis given no other information.
Evidence
The evidence is the new piece of information that we’re trying to add to our measurement of certainty. Above, we called it B, but here, we’ll call it E.
Likelihood
The likelihood P(E | H) of a piece of evidence is our degree of certainty that a specific piece of evidence would be found if the hypothesis is true.
Model Evidence
The model evidence is P(E), and it’s a bit confusing. It’s the analytic likelihood of any piece of evidence occurring. If you’re considering a set of possible hypotheses using Bayes rule, P(E) will be the same for all of them, but P(E | H) will be the specific likelihood of finding that particular piece of evidence under the hypothesis.
Posterior
The posterior, P(H|E), is the degree of certainty that we will have about H if we add the new knowledge E.
Support
Support is the change in our certainty created by the addition of our new evidence. The support is \frac{P(E|H)}{P(E)}.

So Bayes theorem is a formal statement of how, given evidence, we can modify our certainty about the truth of a particular statement. The classical textbook statement of it is the following. (I took this specific formulation from wikipedia, but any textbook will have nearly the same sentence.)

The posterior probability of a hypothesis is determined by a combination of the inherent likeliness of a hypothesis (the prior) and the compatibility of the observed evidence with the hypothesis (the likelihood).

Or, in mathematical terms, P(H | E) = \frac{P(E | H)}{P(E)} \times P(H) – or exactly what we wrote for Bayes theorem up above.

Why is this abused so badly? Because under a naive, stupid understanding of Bayes rule, you can essentially randomly estimate the probability of anything. After all, Bayes says that probability is just the combination of our certainties about some collection of facts. So if I can line up some set of facts, along with an estimate of the individual probabilities of those facts, then I can combine those probabilities, and come up with an estimate of the probability of anything! And if I don’t know the probability of an event occurring at all, then the state of my initial knowledge is really simple: it’s always 1/2 – 1/2 is always the starting point given absolutely no other knowledge.

That leads to rubbish like this proof that there are no extra-terrestrial intelligences, or this or this purported proof of the existence of God.

All of these arguments fail in the same way. They don’t really use Bayes theorem. The quality of the priors – all of the priors, including the priors used to come up with measures of the likelihoods of the evidences – are crucial. They don’t bother with that. They just make up priors, and combine them without good likelihoods.

Controlling Thousands of Machines. Aka, My Day Job.

I promised my friend Toby that I’d link to one of his blog posts. Toby is one of the SREs that I work with at twitter, and let me tell you, you always want to stay on the good side of the SREs!

And to introduce it, I actually get a chance to talk about what I really do!

Since last July, I’ve been working for twitter. I haven’t yet gotten to talk about what it is that I do at twitter. For obvious reasons, I think it’s absolutely fascinating. And since we just recently released it as open-source, there’s no reason to keep it secret anymore. I work on something called Aurora, which is a framework for Mesos.

Let’s start with Mesos.

When you run a web application like Twitter, you can’t run your application on a single computer. There’s a bunch of reasons for that, but the biggest one is that there’s not a computer in the world that’s big enough to do it; and if there was, depending on one epically massive computer would be incredibly unreliable. Instead, what we do is set up a data center, and fill it up with thousands upon thousands of cheap machines. Then we just use all of those thousands of little computers. These days, everyone does things that way. Twitter, Google, Facebook, Amazon, Microsoft, Apple – we all run on gigantic clusters of cheap machines in a datacenter.

But there’s a problem there. How do you use a thousand computers? Much less 10,000 or a million, or more?

What you do is build a substrate, called a cluster management system. You write a program, and you say that you want to run 1,000 copies of it, and hand it to the cluster manager substrate. The substrate takes care of figuring out where to put those processes, and telling the specific individual machines that it selects what to do to run your program.

One of the hardest problems in a cluster management system is figuring out how to assign programs to machines. Depending on what a particular program needs to do, its requirements can vary enormously. Running a quick map-reduce has one set of requirements. Running a reliable service – a service that can survive if part of a datacenter loses electricity – has a different set of requirements.

One approach – the one used by Google when I worked there – was to define a constraint language that ultimately had hundreds of possible parameters, and then treat it as a classical optimization problem, trying to optimize over all of it. The good side of that is that it meant that every program at Google could express its requirements exactly. The bad side of it was that the constraint system was incredibly complicated, and almost no one could predict what results a given set of constraints would produce. Configuration turned into a sort of game, where you’d make a guess, look at what the borg gave you, and then modify your constraint – repeat until you got something satisfactory.

At Twitter, Mesos is our cluster management system. It takes a completely different approach. Basically, it takes the borg-like approach and turns it upside down. Mesos itself doesn’t try to do constraint satisfaction at all. What it does is talk to components, called frameworks, and it offers them resources. It basically says “Here’s all of the possible ways I can give you a fair share of resources in the clusters. Tell me which ones you want.”.

Each framework, in turn, figures out how to allocate the resources offered to it by Mesos to individual jobs. In some ways, that might seem like it’s just deferring the problem: don’t the frameworks end up needing to do the same allocation nightmare? But in fact, what you do is write a framework for a particular kind of job. A framework needs to work out some constraint satisfaction – but it’s a much simpler problem, because instead of needing to satisfy every possible customer with one way of expressing constraints, each framework can decide, on its own, how to allocate resources to one particular kind of job. Map-reduce jobs get one framework that knows how to do a good job scheduling map-reduces. Services get a framework that knows how to do a good job allocating resources for services. Databases get a framework that knows how to do a good job allocating resources for storage and bandwidth intensive processes. As a result, the basic Mesos cluster manager is dramatically simpler than a one-size-fits-all scheduler, and each of the component frameworks is also much simpler.

You can see a really cute sort-of demonstration of how Mesos works by looking at @MesosMaster and @MesosSlave on twitter.

I don’t work on Mesos.


I work on Aurora. Aurora is a framework that runs under Mesos. It’s specialized for running services. Services are basically little server processes that run a lot of identical copies for a long time. Things like web-servers – where you need a ton of them to support millions of users. Or things like common processes that live behind the web-server answering requests for particular kinds of information.

At Twitter, all of the services that make up our system run in our datacenters on top of Mesos. They’re scheduled by Aurora.

With Aurora, running a service is incredibly easy. You write a configuration:

my_server_task = SequentialTask(
  # Two processes, run in sequence: first stage the service binary onto
  # the machine, then start it.
  processes = [
    Process(name='stage_binary',
        cmdline='hadoopfs-copy /glorp/foo/hello_service .'),
    Process(name='run_service', cmdline='./myservice')
  ],
  # Each instance gets one CPU, 128MB of RAM, and 128MB of disk.
  resources = Resources(cpu=1.0, ram=128*MB, disk=128*MB))

jobs = [
    Service(task = my_server_task,
        cluster = 'mycluster',
        role='markcc_service',
        environment = 'prod',
        name = 'hello',
        # Run 300 identical instances of the task.
        instances=300)]

The configuration says how to install a program and then run it on a machine, once Aurora assigns a machine to it. Aurora will find 300 machines, each of which can dedicate one full CPU, 128MB of memory, and 128MB of disk to the process, and it will run the service on those 300 machines.

Both Aurora and Mesos are open-source, under the Apache license. We’ve got a test environment in the distribution, so that you could be running tests in a virtual Mesos/Aurora cluster one hour from now! And my good friend Toby wrote an excellent blog post on how to set up a working Mesos/Aurora cluster.

I don’t want to toot my own horn too much, but I’m incredibly proud that Twitter open-sourced this stuff. Most companies consider their cluster management systems to be super-proprietary secrets. But at Twitter, we decided that this is a problem that no one should have to solve from scratch. It’s something we all know how to do, and it’s time that people stop being forced to waste time reinventing the same old wheel. So we’re giving it away, to anyone who wants to use it. That’s pretty damned awesome, and I’m proud to be a part of it!

Farewell Scientopia, Welcome to Goodmath.org!

Welcome to the new home of Good Math/Bad Math!

Sorry about the change. I know that it’s a pain for readers to switch to following the site at a new location. But it was, honestly, necessary.

Scientopia was originally set up to be a community. I know this, because I was the founder. Back before ScienceBlogs pulled its PepsiBlog stunt, I’d been considering leaving and setting up an alternative, non-profit community. When PepsiGate happened, and my friend Scicurious volunteered to help, I flipped the switch, and turned on what became scientopia.

Since then, I’ve been footing the bills – around $250/month. And I’ve been doing all of the systems administration – every backup, every upgrade, every cache tweak, every DDOS attack – I’ve taken care of them all on my own time. Even after we started showing ads, I’ve never gotten back a single cent. Just that endless drain of time and money.

And that was OK. Really. I’ve been doing it for 3 1/2 years, and while it was a lot of work, I kept at it. I genuinely believed in the ideals that we wrote into the scientopia charter. I really believed that the community we’d built was a good thing, a thing worth supporting.

Scientopia was supposed to be a community. A community where we made decisions, as a group. Where we interacted with each other as peers. Where, as our code said, “It is a community, held together by mutual respect and operated by consensus, in which people can write, educate, discuss, and learn about science and the process of doing science”.

But people are people. If you’ve got more than two people in a group, you’ll wind up getting some politics. And Scientopia, as a community, is no different.

As you may have noticed, Scientopia was down, for about 36 hours. Why?

Because our DNS record got messed up. DNS is the system on the internet that’s used to map from hostnames to numbers. It’s the thing that your web-browser uses to get from “scientopia.org” to the numeric 184.106.221.182, which tells it where scientopia.org can be found on the network. The DNS registration was expiring, and the person who controlled the DNS decided to move to a different registrar, and they created an invalid DNS record with the new registrar. As a result, no one could get to scientopia. This was completely beyond my control: this person had sole control of the DNS record, and refused to allow anyone else access.

(For the tech-heads out there, after switching DNS providers, this person only created a CNAME record for the site. CNAME records are aliases/redirects from one resolvable hostname to another resolvable hostname. Since our IP address isn’t a resolvable hostname, DNS servers rejected the record as invalid, and thus Scientopia was not resolved.)

Fine. Screwups happen, right?

Except for the part where the guilty party decided to blame me. To cover up for their screwup by lying about what was wrong, and pinning it on me. To scapegoat me – not just about this, but to accuse me of general incompetence, and to blame me for the repeated DNS screwups.

The DNS record continues to be the one piece of infrastructure of scientopia that remains in the hands of one person. Despite repeated promises to turn it over to the community, it hasn’t happened. There’s been one condition, one excuse after another. When the community elects a governing board. When the community incorporates. When the community formally registers members as owners. For three and a half years, it’s always been an unkept promise. The community was never allowed to have any access to managing the DNS record.

Even when the site was down, allowing anyone else to have any access to the password needed to fix it was simply out of the question. I was expected to – and did, years ago – turn over the passwords to the site hosting account, which was secured with my personal credit card. But getting the site back up? Too much to ask for. And then they didn’t even have the decency to admit that they made a mistake, but instead tried to blame it on me. (And, I’ll add, has still refused to admit that registering the CNAME was an error, at all.)

This wasn’t the first time something like this happened. This has been a repeated pattern. Not the first, but I decided that it had to be the last.

Running a blog site like Scientopia is a lot of work. Keeping it up, monitoring it, keeping it up to date, dealing with every problem that any of the bloggers have – it’s a lot of work. At times, it’s a lot of aggravation. When you add malicious behavior and abuse on top of that? It’s just too much to put up with.

I’ve still got a lot of people at Scientopia that I consider my friends. I wish them well. But I’m done with it.

Hello World in ARM Assembly Language

Since K&R’s book on C, it’s become traditional to start any tutorial on a new language with a program that prints “Hello world”. For ARM assembly running on Raspbian Linux, that traditional program looks like:

.global _start
_start:
  MOV R7, #4        @ syscall number 4: write
  MOV R0, #1        @ first argument: file descriptor 1, stdout
  MOV R2, #12       @ third argument: the number of bytes to write
  LDR R1, =string   @ second argument: the address of the string
  SWI 0             @ software interrupt: ask the OS to perform the call
  MOV R7, #1        @ syscall number 1: exit
  SWI 0             @ software interrupt again: end the program
  .data
string:
  .ascii "Hello World\n"

It’s definitely a bit more cryptic than most languages, but it doesn’t look all that bad, now does it? Before I can explain how that works, we’ll need to talk a bit about what we’re programming, and how we can program it. We’re going to go through a bunch of introductory material here; everything that we touch on, I’ll come back to in more detail in a later post.


[Figure: a simplified diagram of the main parts of an ARM CPU]
In the diagram to the right, you can see my attempt to draw a diagram illustrating the important parts of an ARM CPU from our initial perspective. As we learn more about it, we’ll gradually refine this picture, adding more details – but for now, this is what things look like.

For now, we’ll say that the CPU has 5 main parts:

  1. A collection of 16 registers. A register is a memory cell that’s built in to the CPU. On an ARM processor, any time that you want to do any kind of operation – arithmetic, logic, comparison, you name it! – you’ll need to have the values in a register. The first thirteen registers are available to you, to use for whatever you want. The last three are special; R13 is called the stack pointer (SP), R14 is called the link register (LR), and R15 is called the program counter (PC). We’ll talk about what those three mean as we learn to program.
  2. An arithmetic/logic unit (ALU). This is where the CPU does integer arithmetic and logic. Most of our programs will work exclusively with the ALU. (Floating point is important, but it’s possible to do an awful lot of programming without it.)
  3. A floating point unit (FPU). This is where the CPU does floating point arithmetic.
  4. A status register. This is, like the other registers, a chunk of internal storage. But you can’t manipulate it or access it directly. It’s automatically updated by the ALU/FPU. Individual bits of the status register get updated to reflect various conditions about the current status of the CPU, and the results of the previous instruction. For example, the way that you can compare two values in the ARM is to subtract one from the other. If the two values were equal, then the ZERO flag in the status register will be set to 1; otherwise it will be set to 0. There’s a branch instruction that only actually branches if the ZERO flag is set.
  5. A data channel, called the bus. The bus connects the CPU to the rest of the computer. Memory, other storage devices, and input and output devices are all connected to the CPU via the bus. Doing anything that involves communicating through the bus is slow compared to doing anything that doesn’t. For now, we’ll say that memory is the only thing on the bus.

Now that we have a bit of a clue about the basic pieces of this thing we’re going to program, we can start looking at our hello world program. We still need to talk about one other bit of background before we can get started.

For a computer, on the lowest level, a “program” is just a chunk of numbers. It’s not even a chunk of instructions – it’s just numbers. The numbers can be instructions, data, or both at the same time! That last bit might sound strange, but you’ll see instructions like MOV R0, #4. That’s saying load the literal value 4 into register R0. The 4 is a value encoded as a part of an instruction. So that 4 is both literal data sitting in the middle of a collection of instructions, and it’s also a part of an instruction. The actual instruction doesn’t really say “load the value 4”; it says “load the data value that’s at this position in the instruction sequence”.

We’re not going to program the ARM using the numeric instructions directly. We’re going to program the ARM using assembly language. Assembly language is a way of writing that chunk of numbers that is your program, but doing it with a syntax that’s easy for a human being to read. Then a program called an assembler will translate from that readable format into the raw numeric format used by the computer. Conceptually, the assembler sounds a lot like the compiler that you’d use with a higher level language. But it’s quite different: compilers take your code, and change it. Frequently, if you look at code that your compiler generates, you’d have a hard time recognizing code that was generated for a program that you wrote! But an assembler doesn’t change anything. There’s no restructuring, no optimization, no changes at all. In an assembly language program, you’re describing how to lay out a bunch of instructions and data in memory, and the assembler does nothing but generate that exact memory layout.


Ok. That said, finally, we can get to the program!

Programming in assembly is quite different from programming in any reasonable programming language. There are no abstractions to make your life easier. You need to be painfully explicit about everything. It really brings home just how many abstractions you generally use in your code.

For example, in assembly language, you don’t really have variables. You can store values anywhere you want in the computer’s memory, but you have to decide where to put them, and how to lay them out, by yourself. But as I said before – all of the arithmetic and logic that makes up a program has to be done on values in registers. So a value in memory is only good if you can move it from memory into a register. It’s almost like programming in a language with a total of 16 variables – except that you’re only really allowed to use 13 of them!

Not only do you not have variables, but you don’t really have parameters. In a high level programming language, you can just pass things to subroutines. You don’t need to worry about how. Maybe they’re going onto a stack; maybe they’re doing some kind of fancy lambda calculus renaming thing; maybe there are some magic variables. You don’t need to know or care. But in assembly, there is no built-in notion of parameter-passing. You need to use the computer’s registers and memory to build a parameter passing system. In the simplest form of that, which is what we’re using here, you designate certain registers as carrying certain parameters. There’s nothing in assembly to enforce that: if your program puts something into register R3, and a function was expecting it to be in R4, you won’t get any kind of error.

In our “Hello world” program above, the first three instructions are loading specific values into registers expected by the operating system “print” function. For example, MOV R7, #4 means move the specific number 4 into register R7.

Loading literal values into registers is done using the MOV instruction. It’s got two operands: the register to move the data into, and the source of the data. The source of the data can be either a literal value, or another register. If you want to load data from memory, you need to use a different instruction – LDR.

With the LDR instruction, we can see one of the conveniences of using assembly language. We want to print the string “Hello world”. So we need to have that string in memory somewhere. The assembler lets us do that using a .ascii directive. The directive isn’t an ARM instruction; it’s an instruction to the assembler telling it “I want to put this string data into a block in memory”. The .ascii directive is prefaced with a label, which allows us to refer to the beginning of the memory block populated by the directive. Now we can use “string” to refer to the memory block. So the instruction LDR R1, =string is exactly the same as saying LDR R1, address, where address is the memory location where the first byte of the string is stored.

These four instructions have been preparation for calling a function provided by the operating system. R0 and R7 are used by the operating system to figure out what function we want to call. R1 and R2 are being used to pass parameters to the function. The print function expects R1 to contain the memory location of the first byte in the string we want to print, and R2 to contain the number of characters in the string.

We call the function using SWI 0. SWI is the software interrupt function. We can’t call the operating system directly! One of the purposes of the operating system is to provide a safe environment, where different programs can’t accidentally interfere with one another. If you could just branch into an OS function directly, any program would be able to do anything it wanted! But we don’t allow that, so the program can’t directly call anything in the OS. Instead, what it does is send a special kind of signal called an interrupt. Before it runs our program, the operating system has already told the CPU that any time it gets an interrupt, control should be handed to the OS. So the operating system gets called by the interrupt. It sees the values in R0 and R7, and recognizes that the interrupt is a request to run the “print” function, so it does that. Then it returns from the interrupt – and execution continues at the first instruction after the SWI call.

Now it’s returned from the print, and we don’t want to do anything else. If we didn’t put something here to tell the operating system that we’re done, the CPU would just proceed to the next memory address after our SWI, and interpret that as an instruction! We need to specifically say “We’re done”, so that the operating system takes control away from our program. The way we do that is with another SWI call. This SWI is the operating system “exit” call. To exit a program and kill the process, you call SWI with R0=1 and R7=1.

And that’s it. That’s hello-world in assembly.
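
If you want to try it yourself, the usual route on Raspbian is the GNU assembler and linker that ship with the system (the file name here is just an example):

as -o hello.o hello.s
ld -o hello hello.o
./hello

That assembles the source into an object file, links the object file into an executable, and runs it.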

Everyone stop implementing programming languages, right now! It's been solved!

Back when I was a student working on my PhD, I specialized in programming languages. Lucky for me I did it a long time ago! According to Wired, if I was working on it now, I’d be out of luck – the problem is already solved!

See, these guys built a new programming language which solves all the problems! I mean, just look how daft all of us programming language implementors are!

Today’s languages were each designed with different goals in mind. Matlab was built for matrix calculations, and it’s great at linear algebra. The R language is meant for statistics. Ruby and Python are good general purpose languages, beloved by web developers because they make coding faster and easier. But they don’t run as quickly as languages like C and Java. What we need, Karpinski realized after struggling to build his network simulation tool, is a single language that does everything well.

See, we’ve been wasting our time, working on languages that are only good for one thing, when if only we’d had a clue, we would have just been smart, and built one perfect language which was good for everything!

How did they accomplish this miraculous task?

Together they fashioned a general purpose programming language that was also suited to advanced mathematics and statistics and could run at speeds rivaling C, the granddaddy of the programming world.

Programmers often use tools that translate slower languages like Ruby and Python into faster languages like Java or C. But that faster code must also be translated — or compiled, in programmer lingo — into code that the machine can understand. That adds more complexity and room for error.

Julia is different in that it doesn’t need an intermediary step. Using LLVM, a compiler developed by University of Illinois at Urbana-Champaign and enhanced by the likes of Apple and Google, Karpinski and company built the language so that it compiles straight to machine code on the fly, as it runs.

Ye bloody gods, but it’s hard to know just where to start ripping that apart.

Let’s start with that last paragraph. Apparently, the guys who designed Julia are geniuses, because they used the LLVM backend for their compiler, eliminating the need for an intermediate language.

That’s clearly a revolutionary idea. I mean, no one has ever tried to do that before – no programming languages except C and C++ (the original targets of LLVM). Except for Ada. And D. And fortran. And Pure. And Objective-C. And Haskell. And Java. And plenty of others.

And those are just the languages that specifically use the LLVM backend. There are others that use different code generators to generate true binary code.

But hey, let’s ignore that bit, and step back.

Let’s look at what they say about how other people implement programming languages, shall we? The problem with other languages, they allege, is that their implementations don’t actually generate machine code. They translate from a slower language into a faster language. Let’s leave aside the fact that speed is an attribute of an implementation, not a language. (I can show you a CommonLisp interpreter that’s slow as a dog, and I can show you a CommonLisp interpreter that’ll knock your socks off.)

What do the Julia guys actually do? They write a front-end that generates LLVM intermediate code. That is, they don’t generate machine code directly. They translate code written in their programming languages into code written in an abstract virtual machine code. And then they take the virtual machine code, and pass it to the LLVM backend, which translates from virtual code to actual true machine code.

In other words, they’re not doing anything different from pretty much any other compiled language. It’s incredibly rare to see a compiler that actually doesn’t do the intermediate code generation. The only example I can think of at the moment is one of the compilers for Go – and even it uses some intermediates internally.

Even if Julia never displaces the more popular languages — or if something better comes along — the team believes it’s changing the way people think about language design. It’s showing the world that one language can give you everything.

That said, it isn’t for everyone. Bezanson says it’s not exactly ideal for building desktop applications or operating systems, and though you can use it for web programming, it’s better suited to technical computing. But it’s still evolving, and according to Jonah Bloch-Johnson, a climate scientist at the University of Chicago who has been experimenting with Julia, it’s more robust than he expected. He says most of what he needs is already available in the language, and some of the code libraries, he adds, are better than what he can get from a seasoned language like Python.

So, our intrepid reporter tells us, the glorious thing about Julia is that it’s one language that can give you everything! This should completely change the whole world of programming language design – because us idiots who’ve worked on languages weren’t smart enough to realize that there should be one language that does everything!

And then, in the very next paragraph, he points out that Julia, the great glorious language that’s going to change the world of programming language design by being good at everything, isn’t good at everything!

Jeebus. Just shoot me now.

I’ll finish with a quote that pretty much sums up the idiocy of these guys.

“People have assumed that we need both fast and slow languages,” Bezanson says. “I happen to believe that we don’t need slow languages.”

This sums up just about everything that I hate about what happens when idiots who don’t understand programming languages pontificate about how languages should be designed/implemented.

At the moment, in my day job, I’m doing almost all of my programming in Python. Now, I’m not exactly a huge fan of Python. There’s an awful lot of slapdash and magic about it that drive me crazy. But I can’t really dispute the decision to use it for my project, because it’s a very good choice.

What makes it a good choice? A certain kind of flexibility and dynamism. It’s a great language for splicing together different pieces that come from different places. It’s not the fastest language in the world. But for my purposes, that’s completely irrelevant. If you took a super-duper brilliant, uber-fast language with a compiler that could generate perfectly optimal code every time, it wouldn’t be any faster than my Python program. How can that be?

Because my Python program spends most of its time idle, waiting for something to happen. It’s talking to a server out on a datacenter cluster, sending it requests, and then waiting for them to complete. When they’re done, it looks at the results, and then generates output on a local console. If I had a fast compiler, the only effect it would have is that my program would spend more time idle. If I were pushing my CPU anywhere close to its limits, using less CPU before going idle might be helpful. But it’s not.

The speed of the language doesn’t matter. But by making my job easier – making it easier to write the code – it saves something much more valuable than CPU time. It saves human time. And a human programmer is vastly more expensive than another 100 CPUs.

We don’t specifically need slow languages. But no one sets out to implement a slow language. People implement useful languages. And they make intelligent decisions about where to spend their time. You could implement a machine code generator for Python. It would be an extremely complicated thing to do – but you could do it. (In fact, someone is working on an LLVM front-end for Python! It’s not for Python code like my system, but there’s a whole community of people who use Python for implementing numeric processing code with NumPy.) But what’s the benefit? For most applications, absolutely nothing.

According to the Julia guys, the perfectly rational decision to not dedicate effort to optimization when optimization won’t actually pay off is a bad, stupid idea. And that should tell you all that you need to know about their opinions.

On outing in the sciblogging community

I’m coming in to this a bit late, but since I really do care about the online science blogging community, I still have something that I want to say.

For those who don’t know, there’s a complete horse’s ass named Henry Gee. Henry is an editor at the science journal Nature. Poor Henry got into some fights with DrIsis (a prominent science blogger), and DrIsis was mean to him. The poor little guy was so hurt that he decided that he needed to get back at her – and so, Henry went ahead and he outed her, announcing her real name to the world.

This was a thoroughly shitty thing to do.

It’s not that I think Isis didn’t do anything wrong. We’ve got history, she and I. My experience with her led me to conclude that she’s a petty, vicious bully that takes great pleasure in inflicting pain and anguish on other people. She’s also someone who’s done a lot of good things for her friends, and if you want to find out about any of it, go read another blog – plenty of people have written about her in the last couple of days.

If she’s so awful, why do I care that someone outed her?

Because it’s not just about her.

The community that we’re a part of isn’t something which has been around for all that long. There’s still a lot of fudging around, figuring out the boundaries of our online interactions. When people play games like outing someone who’s using a pseudonym, they’re setting a precedent: they’re declaring to the community that “I know Xs real name, and here it is”. But beyond that, they’re also declaring to the community that “I believe that our community standards should say that this is an appropriate way to deal with conflict”.

I don’t want that to be something that people in my community do.

People use pseudonyms for a lot of different reasons. Some people do it for bad reasons, like separating unethical online behavior from their professional identity. But some people do it to avoid professional retaliation for perfectly reasonable behaviors – there are tenure committees at many universities that would hold blogging against a junior faculty member; there are companies that won’t allow employees to blog under their real names; there are people who blog under a pseudonym in order to protect themselves from physical danger and violence!

Once you say “If someone makes me angry enough, it’s all right for me to reveal their real identity”, what you’re saying is that none of those reasons matter. Your hurt feelings take precedence. You’ve got the right to decide whether their reasons for using a pseudonym are important enough to protect or not.

Sorry, but no. People’s identities belong to them. I don’t care how mean someone is to you online: you don’t have the right to reveal their identity. Unless someone is doing something criminal, their identity isn’t yours to reveal. (And if they are doing something criminal, you should seriously consider reporting them to the appropriate legal authorities, rather than screwing around online!)

But to be like Mr. Gee, and just say “Oh, she hurt my feelings! I’m going to try to hurt her back”! That’s bullshit. That’s childish, kindergarten level bullshit. And frankly, for someone who’s an editor at a major scientific journal, who has access to all sorts of information about anonymous referees and authors? It’s seriously something that crosses the line of professional ethics to the point where if I were in the management at Nature, I’d probably fire him for it.

But Henry didn’t stop there: no! He also went ahead and – as an editor of Nature! – told people who criticized him for doing this that he was “adding them to the list”.

What kind of list do you think Henry is adding them to? This guy who’s showed how little he cares about ethics – what do you think he’s going to do to the people who he’s adding to his list?

I think that if Nature doesn’t fire this schmuck, there’s something even more seriously wrong over there than any of us expected.