What I do in my day job is working on infrastructure systems at dropbox. That means that I’m neck-deep in distributed systems programming – that’s my bread and butter. One of the problems that comes up over and over again in distributed systems is time.
In distributed systems, time is a problem. Each computer has a clock built in, but those clocks are independent. The clocks on different machines can vary quite a bit. If a human being is setting them, then they’re probably at best accurate to one second. Even using a protocol like NTP, which synchronizes clocks between different computers, you can only get the clocks accurate to within about a millisecond of each other.
That sounds pretty good. In human timescales, a millisecond is a nearly imperceptible time interval. But to a modern computer, which can execute billions of instructions per second, it’s a long time: long enough to execute a million instructions! To get a sense of how much time that is to a computer, I just measured the time it took to take all of the source code for my main project, compile it from scratch, and execute all of its tests: it took 26 milliseconds.
That’s a lot of work. On the scale of a machine running billions of instructions per second, a millisecond is a long time.
Why does that matter?
For a lot of things that we want to do with a collection of computers, we need to know what event happened first. This comes up in lots of different contexts. The simplest one to explain is a shared resource locked by a mutex.
A mutex is a mutual exclusion lock. It’s basically a control that only allows one process to access some shared resource at a time. For example, you could think of a database that a bunch of processes all talk to. To make an update, a process P needs to send a request asking for access. If no one is using it when the server receives the request, it will give a lock to P, and and then block anyone else from accessing it until P is done. Anyone else who asks for access to the the database will have to wait. When P is done, it releases the lock on the mutex, and then if there’s any processes waiting, the database will choose one, and give it the lock.
Here’s where time comes into things. How do you decide who to give the lock to? You could give it to whoever you received the request from first, using the time on the database host. But that doesn’t always work well. It could easily end up with hosts with a lower-bandwidth connection to the server getting far worse service than a a closer host.
You get better fairness by using “send time” – that is, the time that the request was sent to the server by the client. But that’s where the clock issue comes up. Different machines don’t agree perfectly on the current time. If you use their clocktime to determine gets the lock first, then a machine with a slow clock will always get access before one with a fast clock. What you need is some fair way of determining ordering by some kind of timestamp that’s fair.
There are a couple of algorithms for creating some notion of a clock or timestamp that’s fair and consistent. The simplest one, which we’ll look at in this post, is called Lamport timestamps. It’s impressively simple, but it works really well. I’ve seen it used in real-world implementations of Paxos at places like Google. So it’s simple, but it’s serious.
The idea of Lamport timestamps is to come up with a mechanism that defines a partial order over events in a distributed system. What it defines is a causal ordering: that is, for any two events, A and B, if there’s any way that A could have influenced B, then the timestamp of A will be less than the timestamp of B. It’s also possible to have two events where we can’t say which came first; when that happens, it means that they couldn’t possible have affected each other. If A and B can’t have any affect on each other, then it doesn’t matter which one “comes first”.
The way that you make this work is remarkably simple and elegant. It’s based on the simplest model of distributed system, where a distributed system is a collection of processes. The processes only communicate by explicitly sending messages to each other.
- Every individual process in the distributed system maintains an integer timestamp counter, .
- Every time a process performs an action, it increments . Actions that trigger increments of include message sends.
- Every time a process sends a message to another process, it includes the current value of in the message.
- When a process receives a message from a process , that message includes the value of when the message was sent. So it updates its to the (one more than the maximum of its current timestamp and the incoming message timestamp).
For any two events A and B in the system, if (that is, if A causally occurred before B – meaning that A could have done something that affected B), then we know that the timestamp of A will be smaller than the timestamp of B.
The order of that statement is important. It’s possible for timestamp(A) to be smaller than timestamp(B), but for B to have occurred before A by some wallclock. Lamport timestamps provide a causal ordering: A cannot have influenced or caused B unless ; but A and B can be independent.
Let’s run through an example of how that happens. I’ll write it out by describing the clock-time sequence of events, and following it by a list of the timestamp counter settings for each host. We start with all timestamps at 0: [A(0), B(0), C(0), D(0).
- [Event 1] A sends to C; sending trigger a timestamp increment. [A(1), B(0), C(0), D(0)].
- [Event 2] C receives a message from A, and sets its counter to 2. [A(1), B(0), C(2), D(0).
- [Event 3] C sends a message to A (C increments to 3, and sends.) [A(1), B(0), C(3), D(0).
- [Event 4] A recieves the message from C, and sets its clock to 4. [A(4), B(0), C(3), D(0)]
- [Event 5] B sends a message to D. [A(4), B(1), C(3), D(0)]
- [Event 6] D receives the message. [A(4), B(1), C(3), D(2)].
- [Event 7] D sends a message to C. [A(4), B(1), C(3), D(3)].
- [Event 8] C receives the message, and sets its clock to 4.
According to the Lamport timestamps, in event 5, B sent its message to D at time 1. But by wallclock time, it sent its message after C’s timestamp was already 3, and A’s timestamp was already 4. We know that in our scenario, event 5 happened before event 3 by wallclock time. But in a causal ordering, it didn’t. In causal order, event 8 happened after event 4, and event 7 happened before event 8. In causal comparison, we can’t say whether 7 happened before or after 3 – but it doesn’t matter, because which order they happened in can’t affect anything.
The Lamport timestamp is a partial ordering. It tells us something about the order that things happened in, but far from everything. In effect, if the timestamp of event A is less than the timestamp of event B, it means that either A happened before B or that there’s no causal relation between A and B.
The Lamport timestamp comparisons only become meaningful when there’s an actual causal link between events. In our example, at the time that event 5 occurs, there’s no causal connection at all between the events on host A, and the events on host B. You can choose any arbitrary ordering between causally unrelated events, and as long as you use it consistently, everything will work correctly. But when event 6 happens, now there’s a causal connection. Event 5 could have changed some state on host D, and that could have changed the message that D sent in event 7. Now there’s a causal relationship, timestamp comparisons between messages after 7 has to reflect that. Lamport timestamps are the simplest possible mechanism that captures that essential fact.
When we talk about network time algorithms, we say that what Lamport timestamps do is provide weak clock consistency: If A causally happened before B, then the timestamp of A will be less than the timestamp of B.
For the mutex problem, we’d really prefer to have strong clock consistency, which says that the timestamp of A is smaller than the timestamp of B if and only if A causally occurred before B. But Lamport timestamps don’t give us enough information to do that. (Which is why there’s a more complex mechanism called vector clocks, which I’ll talk about in another post.
Getting back to the issues that this kind of timestamp is meant to solve, we’ve got a partial order of events. But that isn’t quite enough. Sometimes we really need to have a total order – we need to have a single, correct ordering of events by time, with no ties. That total order doesn’t need to be real – by which I mean that it doesn’t need to be the actual ordering in which events occured according to a wallclock. But it needs to be consistent, and no matter which host you ask, they need to always agree on which order things happened in. Pure lamport timestamps don’t do that: they’ll frequently have causally unrelated events with identical timestamps.
The solution to that is to be arbitrary but consistent. Take some extra piece of information that uniquely identifies each host in the distributed system, and use comparisons of those IDs to break ties.
For example, in real systems, every host has a network interface controller (NIC) which has a universally unique identifier called a MAC address. The MAC address is a 48 bit number. No two NICs in the history of the universe will ever have the same MAC address. (There are 281 trillion possible MAC codes, so we really don’t worry about running out.) You could also use hostnames, IP addresses, or just random arbitrarily assigned identifiers. It doesn’t really matter – as long as it’s consistent.
This doesn’t solve all of the problems of clocks in distributed systems. For example, it doesn’t guarantee fairness in Mutex assignment – which is the problem that I used as an example at the beginning of this post. But it’s a necessary first step: algorithms that do guarantee fairness rely on some kind of consistent event ordering.
It’s also just a beautiful example of what good distributed solutions look like. It’s simple: easy to understand, easy to implement correctly. It’s the simplest solution to the problem that works: there is, provably, no simpler mechanism that provides weak clock consistency.
Sorry I didn’t read the whole thing, but you’re compile time + test runtime is 26 ms?? What build system do you use?
It’s Go.
I’m not a huge fan of Go as a language – but the tools for Go are astonishing. The compiler is so fast that at first, I thought I was doing something wrong, and it wasn’t compiling. But no: it’s really just that fast.
What is your notion of ‘simplest’ solution? Things like that turn out to be really difficulty to formally define.
I mean in one sense of the word a central timestamper is a trivial instance of a timestamp for a distributed system and in many senses simpler and I can’t think of a good definition.
—-
Nit: The ordering induced by comparing timestamps is a linear order. The ordering induced by saying A <= B just if the timestamp of A is <= that of B and some chain of messages witnesses that A could causally influence B is a partial order. Presumably you could easily make lamport give you the partial order just uses maps from processes to integers as the timestamps instead of integers.
Nit: Duh, it was just a typo you meant to say causal ordering not partial. Sorry for being pedantic.
Unfortunately, observation of the real world seems to contradict this rather nice view. Most (from my understanding, at least) manufacturers of network devices have one or more 24-bit designated prefix, leaving 24 bits to actually vary. This is normally done by using a PRNG of some sort, in each production line, seeded with (possibly a hash of) the ID of the production line, and then simply stepped to generate the MAC of the next bit of hardware. This means you’re unlikely to get two network devices from the same production line, with the same MAC, but you may end up with two devices, from two different production lines, with the same MAC.
From actual real-world observation, I’ve seen “we didn’t do anything to cause it” MAC collisions on the order of 5 times between ~1990 and ~2010. I’ve also seen well-intentioned, but fatally flawed, “active firewalls/IPS systems” send out RSTs to close down TCP sessions they were unhappy with, cloning the local end all the way from TCP down to MAC, but I don’t really count those.
It is also the case that (1) hardware MACs can be set, not just read, (2) MACs are sometimes cloned deliberately, and (3) manufacturers goof.
Sun Solaris (now Oracle) machines used to explicitly set the MAC for every NIC to be the same on a given system. I don’t know that they still do.
IPv6 stateless automatic address configuration (SLAAC) takes the MAC address of the NIC it is running on and combines it with an advertised prefix to pretend to get a unique IPv6 address. I would guess that enabling IPv6 on a Solaris machine turns off the shared-MAC setting, but you never know.
Pingback: Visto nel Web – 227 | Ok, panico
In step 4 don’t you mean to say that P updates its clock to tau-p, rather than tau-q?
The problem of time in distributed system is already solved in the network/telecom world using PTP. The secret of PTP is hardware timestamp engine that puts a timestamp in the Ethernet packet.
Imagine if every Ethernet port in every computer is PTP-compliant, then all computers will be synchronous to a master clock to the accuracy of a few ns.
In causal order, event 8 happened after event 4
I cant understand why