I’d like to start with a quick apology. Sorry that both the abstract algebra and the new game theory posts have been moving so slowly. I’ve been a bit overwhelmed lately with things that need doing right away, and
by the time I’m done with all that, I’m too tired to write anything that
requires a lot of care. I’ve known the probability theory stuff for so long,
and the parts I’m writing about so far are so simple that it really doesn’t take nearly as much effort.
With that out of the way, today, I’m going to write about probability distributions.
Take a probabilistic event. That’s an event where we don’t know how
it’s going to turn out. But we know enough to describe the set of possible outcomes. We can formalize the event as a random variable, or we can consider it in isolation and use our knowledge of it to describe
the possible outcomes. A probability distribution describes the set of outcomes of some probabilistic event – and how likely each one is. A
probability distribution always has the property that if S is the set
of possible outcomes, and ∀s∈S, P(s) is the probability of outcome s, then Σ_{s∈S} P(s) = 1 – which is really just another way
of saying that the probability distribution covers all possible outcomes.
In the last post, about random variables, I described a formal abstract
version of what we mean by “how likely” a given outcome is. From here on,
we’ll mainly use the informal version, where an outcome O with a probability
P(O)=1/N means that if we watched the event N times, we’d expect to see O occur
once.
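To make that reading concrete, here’s a minimal Python sketch (my own illustration, not part of the original post; the fair-die example and the trial count are just placeholders). It encodes a distribution as a dictionary, checks the sum-to-1 property, and compares simulated frequencies against P(s).

```python
import random

# A toy distribution over the outcomes of a fair six-sided die.
die = {face: 1 / 6 for face in range(1, 7)}

# The defining property of a distribution: the probabilities sum to 1.
assert abs(sum(die.values()) - 1.0) < 1e-9

# Informal reading of P(O) = 1/N: over many trials, outcome O shows up
# in roughly 1/N of them.
trials = 60_000
counts = {face: 0 for face in die}
for _ in range(trials):
    outcome = random.choices(list(die), weights=list(die.values()))[0]
    counts[outcome] += 1

for face, p in die.items():
    print(face, p, counts[face] / trials)  # empirical frequency ≈ 1/6
```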
Probability distributions are incredibly important for understanding statistics. If we know the type of distribution, we can make much stronger
statements about what a given statistic means. We can also do all
sorts of interesting things if we understand what the distributions look like.
For example, when people work on computer networks, two of the key
concerns are bandwidth and utilization. One of the really
interesting things about networks is that they’re not generally as fast
as we think they are. The bandwidth of a channel is the maximum amount of
information that can be transmitted through that channel in a given time. The utilization is the percentage of bandwidth actually being used at a particular moment. The utilization virtually never reaches 100%. In fact, on most networks, it’s much lower. The higher utilization gets, the harder it is for anyone else to use what’s left.
For different network protocols, the peak utilization – the amount of the bandwidth that can be used before performance starts to drop off – varies. There’s often a complex tradeoff. To give two extreme cases, you can build
a protocol in which there is a “token” associated with the right to transmit on the wire, and the token is passed around the machines in the network. This
guarantees that everyone will get a turn to transmit, and prevents more than one machine from trying to send at the same time. On the other hand, there’s a family of protocols (including ethernet) called “Aloha” protocols, where you
just transmit whenever you want to – but if someone else happens to be
transmitting at the same time, both of your messages will be corrupted, and will need to be resent.
In the token-ring case, you’re increasing the amount of time it takes
to get a message onto the wire during low utilization, and you’re eating up a
slice of the bandwidth with all of those token-maintenance messages. But as traffic increases, in theory, you can scale smoothly up to the point where all of the bandwidth is being used either by token maintenance or by real
traffic. In the case of Aloha protocols, there’s no delay and no token-maintenance overhead – but when the utilization goes up, so does the chance
of two machines transmitting simultaneously. So in an Aloha network, you
really can’t effectively use all of the bandwidth: as utilization goes up, the fraction of the bandwidth carrying actual messages decreases, because more and more of it is eaten up by messages that collide and have to be resent.
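To put a rough number on that saturation effect, here’s a small sketch of the classic pure-Aloha throughput formula, S = G·e^(−2G), where G is the offered load in frames per frame-time and S is the fraction of the bandwidth carrying successful frames. This is the standard textbook model, not something from the post itself, and the load values below are arbitrary.

```python
import math

def pure_aloha_throughput(offered_load: float) -> float:
    """Fraction of the channel carrying successful (collision-free) frames,
    under the textbook pure-Aloha model with Poisson frame arrivals."""
    return offered_load * math.exp(-2.0 * offered_load)

for g in [0.1, 0.25, 0.5, 1.0, 2.0]:
    print(f"offered load {g:.2f} -> useful throughput {pure_aloha_throughput(g):.3f}")

# Throughput peaks at G = 0.5 with S = 0.5 * e**-1 ≈ 0.184; past that point,
# collisions eat up more bandwidth than the extra traffic contributes.
```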
From that brief overview, you can see why it’s important to be able to
accurately assess what the effective bandwidth of a network is. To
do that, you get into something called queueing theory, which
uses probability distributions to determine how much of your bandwidth will
likely be taken up by collisions/token maintenance under various utilization scenarios. Without the probability distribution, you can’t do it; and if the probability distribution doesn’t match the real pattern of usage, it will
give you results that won’t match reality.
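As a small taste of what queueing theory does with those distributions, here’s a minimal sketch of the textbook M/M/1 queue, which assumes Poisson arrivals and exponentially distributed service times. The rates below are made up for illustration; the point is just that average delay blows up as utilization approaches 100%.

```python
def mm1_stats(arrival_rate: float, service_rate: float):
    """Textbook M/M/1 results: utilization, mean number in system, mean time in system.
    Only valid when arrival_rate < service_rate."""
    assert arrival_rate < service_rate, "queue is unstable at or above 100% utilization"
    rho = arrival_rate / service_rate                 # utilization
    mean_in_system = rho / (1.0 - rho)                # average number of messages queued or in service
    mean_time = 1.0 / (service_rate - arrival_rate)   # average delay per message
    return rho, mean_in_system, mean_time

# Service rate fixed at 100 messages/sec; watch delay explode as load rises.
for lam in [10, 50, 80, 90, 99]:
    rho, n, t = mm1_stats(lam, 100)
    print(f"utilization {rho:.0%}: avg {n:.1f} messages in system, avg delay {t * 1000:.1f} ms")
```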
There’s a collection of basic distributions that we like to work with, and it’s worth taking a few moments to run through and describe them. (There’s a short sampling sketch after the list.)
- Uniform: the simplest distribution. In a uniform distribution, there are N outcomes, o_1, …, o_N, and the probability of any one of them is P(o_i) = 1/N. Rolling a fair die, picking a card from a well-shuffled deck, or picking a ticket in a raffle are all events with uniform probability distributions.
- Bernoulli: the Bernoulli distribution covers a binary event – that is, an event that has exactly two possible outcomes, “1” and “0”, and there’s a fixed probability “p” of an outcome of “1” – so P(1) = p, P(0) = 1 − p.
- Binomial: the binomial distribution describes the probability of a particular number of “1” outcomes in a limited series of independent trials of an event with a Bernoulli distribution. So, for example, a binomial probability distribution will say something like “Given 100 trials, there is a 1% probability of one success, a 3% probability of 2 successes, …”, etc. The binomial distribution is one of the sources of the perfect bell curve. As the number of trials increases, a histogram of the binomial distribution becomes closer and closer to the perfect bell.
- Zipf: as a guy married to a computational linguist, I’m familiar with this one. It’s a power-law distribution: given an ordered set of outcomes o_1, …, o_n, the probabilities work out so that P(o_1) = 2×P(o_2), P(o_2) = 2×P(o_3), and so on. The most famous example of this is that if you take a huge volume of English text, you’ll find that the most common word is “the”; randomly picking a word from English text, the probability of “the” is about 7%. The number 2 word is “of”, which has a probability of about 3.5%. And so on. If you plot a Zipf distribution on a log-log scale, with the vertical axis in descending powers of 10 and the horizontal axis the ordered o_i’s, you’ll get a descending line.
- Poisson: the Poisson distribution is the one used in computer networks, as I described above. It describes the probability of a given number of events happening in a particular length of time, given that (a) the events have a fixed average rate of occurrence, and (b) the time to the next event is independent of the time since the previous event. It turns out that this is a pretty good model of network traffic: on a typical network, taken over a long period of time, each computer will generate traffic at a particular average rate. There’s an enormous amount of variation – sometimes it’ll go hours without sending anything, and sometimes it’ll send a thousand messages in a second. But overall, it averages out to something regular. And there’s no way of knowing what pattern it will follow: just because you’ve seen 1,000 messages per second for the last 3 seconds, you can’t predict that it will be 1/1000th of a second before the next message.
- Zeta: the zeta distribution is basically the Zipf distribution over an infinite number of possible discrete outcomes.
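As mentioned before the list, here’s a short sampling sketch (my own, using NumPy; every parameter value is an arbitrary choice for illustration) that draws from several of these distributions and checks a couple of empirical numbers against what the definitions predict.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Uniform: a fair six-sided die; every face should come up about 1/6 of the time.
die = rng.integers(1, 7, size=n)
print("uniform:", np.bincount(die)[1:] / n)

# Bernoulli: a single binary event with P(1) = 0.3.
flips = rng.random(n) < 0.3
print("bernoulli: P(1) ≈", flips.mean())

# Binomial: number of 1s in 100 Bernoulli trials; the mean should be about
# 100 * 0.3 = 30, and a histogram of these counts approaches the bell curve.
successes = rng.binomial(100, 0.3, size=n)
print("binomial: mean ≈", successes.mean())

# Poisson: events per unit time at an average rate of 3; mean ≈ variance ≈ 3.
arrivals = rng.poisson(3.0, size=n)
print("poisson: mean ≈", arrivals.mean(), "variance ≈", arrivals.var())

# Zeta (Zipf over infinitely many outcomes): heavy-tailed; rank 1 dominates.
ranks = rng.zipf(2.0, size=n)
print("zeta: fraction of samples at rank 1 ≈", (ranks == 1).mean())
```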
A good start!
My only complaint is about an engineering detail. While you are writing about computer networking, you only cover one subtype, local area networks, and limit yourself to link-layer protocols.
In wide area networks the Poisson distribution begins to fall apart. One possible culprit is in the higher protocol layers, especially Transmission Control Protocol (TCP). Since it resends missing packets after timeout, there will be duplicate packets in flight, and as load approaches capacity, the distributions will have longer tails than Poisson predicts.
http://en.wikipedia.org/wiki/Long-tail_traffic
Real life measurements show that the Internet has a self-similar behaviour over wide time ranges, i.e. it is chaotic in nature. Therefore it can’t be modelled accurately using any classical distribution.
Lassi:
Thanks! I wasn’t actually aware of that. I haven’t studied queueing theory since grad school, and on the networks of the time, I think that poisson still worked pretty well. But the class where I studied that came before the web or the mainstreaming of the internet.
Do you have any references I could look at about the self-similar nature of the internet? I’d like to read more about it.
Aloha based networks are not actually “transmit when you want to”, they are “transmit when you want to, if nobody else is talking”. This makes a big difference, since a collision is virtually guaranteed at high utilizations in the first case, but allows a much higher utilization in the second. In fact, the limit of utilization becomes dependent on the overall physical size of the network in the second case.
I second Lassi’s commendation of this as a good start. My only quibble would be that it seems a bit odd to list basic distributions — even referring to a ‘bell’ curve as the limit of the binomial — and yet decline to at least mention the normal, even in a “then there’s this other thing that turns out to be rather important in ways we’ll go into next time” sort of way. If it’s because you’re being picky about distributions as discrete, then you should say so — especially given your less pedantic approach in the previous post…
geometric (exponential decay, roughly, describing the distribution of how many games one needs to play to achieve the first “win”)?
(rough) example: my team wins about .40 of its games, based on past performance. How many games will we play this year before we win our first league game?
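For what it’s worth, here’s a quick worked version of that example as a tiny Python snippet; the only number taken from the comment is the 0.40 win rate, and the formulas are the standard geometric-distribution ones (mean 1/p, probability of no win in k games = (1 − p)^k).

```python
p = 0.4                                        # per-game win probability, from the comment above
expected_games_to_first_win = 1 / p            # mean of the geometric distribution: 2.5 games on average
chance_still_winless_after_5 = (1 - p) ** 5    # about 0.078, i.e. roughly an 8% chance
print(expected_games_to_first_win, chance_still_winless_after_5)
```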
“Do you have any references I could look at about the self-similar nature of the internet? I’d like to read more about it.”
I tried to Google for something that is on-line (using “self-similar distribution internet”), but all relevant articles seem to be behind a paywall. Of course many readers of your blog have access to IEEE or ACM libraries, so they can read them.
Sorry for a terse answer, but I’ll be in Dubai for the rest of the week, and the plane leaves pretty soon…
They just wrote an interesting post about visualizing higher dimensional distributions over at Overcoming Bias:
http://www.overcomingbias.com/2008/04/conf-space.html
Citeseer seems to have a number of papers. The one below seems pretty good, though it doesn’t discuss the “Hurst” parameter, which is an interesting way of gauging the level of self-similarity in traffic.
http://citeseer.ist.psu.edu/uhlig01understanding.html
What you quoted as Zipf is actually exponential:
p(x) ~ (1/C)^x, with C constant.
A power law is p(d) ~ d^(−C), where C is a constant.
For words I think it’s C = 3/2 or 5/2, but I’m too lazy to Google.
Nothing about the normal distribution?
Or the t-distribution?
Or the F-distribution?
or Chi-square?
I’d say those are the four I use most as a statistician.
And the Poisson, at least for regression problems, usually needs to be abandoned for a negative binomial distribution, since the conditional variance almost always increases with the conditional mean.
Minor nitpick: Ethernet actually uses CSMA/CD, which means that nodes wait until the line is free before transmitting. Collisions occur only when two nodes start transmitting at the same time.
Love your blog as always.
secondclass:
Yeah, I know that my description was a bit oversimplified, but I was already feeling like I was giving too much detail; “try sending, and look for a collision” is close enough to give people a sense of how it works, and I was debating whether or not to even include that much detail.
(As an interesting aside, when I worked as a sysadmin, we actually had some trouble with an ethernet that got too long. We’d run an ether as a coil through four floors of the EE building. Turned out that it got long enough that at Ethernet propagation speed, the machines at the far ends of the network could collide without detecting it, because ether only looks for a collision when it starts sending. Took us a while to figure out that that was the problem, because we couldn’t believe that the signal propagation could take long enough on a cable in one building to allow missed collisions! Once we finally figured it out, we split the building into two subnets with a repeater between them, and that took care of it.)
http://www.ams.org/notices/199808/paxson.pdf
is a very good introduction to the failure of poisson modeling due to the self-similar structure of network traffic