I’ve been meaning to get back to some of the probability stuff. We’re currently recovering from a major snow/ice storm, and I’m snowed/iced in, so this is a good time!
Today, we’ll talk about what is, according to many people, the most important rule in all of probability: Bayes theorem. It’s also, in my experience, the single most abused rule in all of mathematics. Nothing else has been used so poorly, by so many people, to support sloppy, dumb arguments. After we talk about what the rule is, and what it means, we’ll move on to talk about how it gets abused.
In a pure mathematical sense, Bayes theorem is simple. The interpretation of it, and what it means, gets pretty hairy. Suppose that you’ve got two related events, A and B. You know the probability of A occurring is P(A). You know the probability of B occurring is P(B). And you know the probability of B occurring given that A has already occurred. (We write that P(B | A), which you can read as “the probability of B given A”.) What you’d like to know is: suppose that I know that B occurred. What’s the probability that A also occurred? (What is P(A | B)?)
Bayes theorem says:

P(A | B) = P(B | A) × P(A) / P(B)
Let’s be concrete. I go to work, and walk into my office in the morning, and get into the elevator with one other person that I work with. What is the probability that it’s a man?
Without knowing anything about the people that I work with, a reasonable guess would be 50% – the population is pretty close to evenly divided between the genders.
But I’m an engineer, and one of the very unfortunate facts about my job is that the gender pool of engineers is very skewed. Let’s say that it’s 80% men. (In reality, that’s probably actually pretty low.)
Let’s say that about 1/3 of the office is engineering. So the odds that someone I bump into will be an engineer is about 50%.
I can do a couple of things with that information. I could ask, suppose that I walked into the elevator with a woman. What’s the probability that she’s an engineer?
To answer that, I’ll use Bayes law. We’ll say that P(A) is the probability that a random person is an engineer – 1/3. P(B) is the probability that a random person is a woman – 1/2. If I know that a given person is an engineer, the probability of that person being a woman is P(B | A), or 1/5. So what’s the probability of my random female coworker being an engineer (P(A | B))? Plugging in: P(A | B) = P(B | A) × P(A) / P(B) = (1/5 × 1/3) / (1/2) = 2/15, or about 13%.
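If you’d rather see that spelled out in code, here’s a minimal sketch in Python, using the same made-up office numbers from above:

```python
# Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)
# A = "coworker is an engineer", B = "coworker is a woman".
p_a = 1 / 3          # P(A): a random coworker is an engineer
p_b = 1 / 2          # P(B): a random coworker is a woman
p_b_given_a = 1 / 5  # P(B|A): a random engineer is a woman

# P(A|B): a random female coworker is an engineer
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)   # 0.1333... = 2/15
```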
See? That was easy, wasn’t it?
Now, what does it actually mean? If you look at it this way, it doesn’t seem to be such a big deal. Sure, it’s a way of combining probabilities in another situation, but so what? Why is it any more important than any other rule?
Because it’s the mathematical method for how to incorporate new knowledge into your expectations. What we did above was start with one understanding of the thing we were trying to predict. Knowing nothing but the typical distribution of genders in the general population, we made a guess about a 50% probability of encountering a woman. But then we added in new information. We knew the proportion of engineers, and the fact that the gender ratio was skewed in engineering – and we incorporated that new information into our prediction.
That answer comes from interpretations. One of the classic interpretations of probability theory is the Bayesian interpretation – named Bayesian specifically because of how it interprets this rule! The Bayesian interpretation says that a statement about probability is really a statement about the state of our knowledge. If I say that the probability of flipping heads on a coin is 1/2, what I’m saying under the Bayesian interpretation is that my certainty that I’ll flip heads is just 1/2.
In that kind of knowledge-based interpretation, there is no intrinsic probability of any event. There is just our degree of certainty about whether the event will occur. Given new information, our degree of certainty can change. Bayes theorem tells us, given new information, exactly how our degree of certainty should change.
To explain the Bayesian interpretation, we’ll need to define a few terms.
- Hypothesis
- The hypothesis is the thing whose degree of certainty we’re trying to measure. In the formulation of Bayes law up above, we called it A; here, we’ll call it H.
- Prior
- The prior, P(H), is the degree of certainty about the hypothesis given no other information.
- Evidence
- The evidence is the new piece of information that we’re trying to add to our measurement of certainty. Above, we called it B; here, we’ll call it E.
- Likelihood
- The likelihood, P(E|H), of a piece of evidence is our degree of certainty that that specific piece of evidence would be found if the hypothesis were true.
- Model Evidence
- The model evidence is P(E), and it’s a bit confusing. It’s the analytic likelihood of any piece of evidence occurring. If you’re considering a set of possible hypotheses using Bayes rule, P(E) will be the same for all of them, but P(E|H) will be the specific likelihood of finding that particular piece of evidence under each hypothesis.
- Posterior
- The posterior, P(H|E), is the degree of certainty that we will have about the hypothesis H after we add the new knowledge E.
- Support
- Support is the change in our certainty created by the addition of our new evidence. The support is P(E|H)/P(E).
So Bayes theorem is a formal statement of how, given evidence, we can modify our certainty about the truth of a particular statement. The classical textbook statement of it is the following. (I took this specific formulation from Wikipedia, but any textbook will have nearly the same sentence.)
The posterior probability of a hypothesis is determined by a combination of the inherent likeliness of a hypothesis (the prior) and the compatibility of the observed evidence with the hypothesis (the likelihood).
Or, in mathematical terms, P(H|E) = P(E|H) × P(H) / P(E) – exactly what we wrote for Bayes theorem up above.
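To make that concrete, here’s a toy sketch in Python – the hypotheses and all of the numbers are invented purely for illustration – showing how a set of priors gets updated by a piece of evidence. Note that the model evidence P(E) is just the normalizing sum, and it’s the same for every hypothesis:

```python
# Toy Bayesian update over a set of competing hypotheses.
# All priors and likelihoods here are invented for illustration.
priors = {"H1": 0.5, "H2": 0.3, "H3": 0.2}       # P(H) for each hypothesis
likelihoods = {"H1": 0.1, "H2": 0.6, "H3": 0.3}  # P(E|H) for each hypothesis

# Model evidence P(E): the same normalizer for every hypothesis.
p_e = sum(priors[h] * likelihoods[h] for h in priors)

# Posterior P(H|E) = P(E|H) * P(H) / P(E).
posteriors = {h: likelihoods[h] * priors[h] / p_e for h in priors}
print(posteriors)  # H2's posterior rises, because the evidence fits it best
```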
Why is this abused so badly? Because under a naive, stupid understanding of Bayes rule, you can essentially randomly estimate the probability of anything. After all, Bayes says that probability is just the combination of our certainties about some collection of facts. So if I can line up some set of facts, along with an estimate of the individual probabilities of those facts, then I can combine those probabilities, and come up with an estimate of the probability of anything! And if I don’t know the probability of an event occurring at all, then the state of my initial knowledge is really simple: it’s always 1/2 – 1/2 is always the starting point given absolutely no other knowledge.
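One way to see why made-up priors are so dangerous: hold the evidence fixed, and the posterior swings from “almost certainly false” to “almost certainly true” depending entirely on what prior you assume. A quick sketch, again with invented numbers:

```python
# Posterior sensitivity to the prior: same evidence, different priors.
p_e_given_h = 0.9      # invented: likelihood of the evidence if H is true
p_e_given_not_h = 0.2  # invented: likelihood of the evidence if H is false

for p_h in (0.01, 0.1, 0.5, 0.9):
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)  # model evidence
    p_h_given_e = p_e_given_h * p_h / p_e
    print(f"prior {p_h:.2f} -> posterior {p_h_given_e:.2f}")
```

With exactly the same “evidence”, the posterior runs from about 0.04 up to about 0.98 – the conclusion is almost entirely an artifact of the assumed prior.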
That leads to rubbish like this proof that there are no extra-terrestrial intelligences, or this or this purported proof of the existence of God.
All of these arguments fail in the same way. They don’t really use Bayes theorem. The quality of the priors – all of the priors, including the ones used to come up with measures of the likelihood of the evidence – is crucial. They don’t bother with that. They just make up priors, and combine them without good likelihoods.
Typo?
“Let’s say that about 1/3 of the office is engineering. So the odds that someone I bump into will be an engineer is about 50%.”
33% perhaps?
Otherwise, nice post!
Shouldn’t that say “… about 33%.”?
Also, I’m probably missing something (perhaps your point?), but at a firm where 1/3rd is engineering with 1 woman to every 4 men and (implied) the rest of the company is 1 woman to 1 man, wouldn’t the chance that a random coworker is female be 1/5 * 1/3 + 1/2 * 2/3 = 6/15?
So with P(woman) = 6/15, P(eng|woman) becomes 1/6, which makes sense (to me): say the company size is 300, that means 100 coworkers are engineering, with 20 women and 80 men; the other departments combined are 200 coworkers with 100 women and 100 men, so 120 women and 180 men work at that company. Of those 120 women 20 are engineers, or 1/6, so if you meet a random female coworker on the elevator there’s a ~17% chance she’s an engineer.
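(A quick Python sketch checking that arithmetic, with the same assumed headcounts:)

```python
# 300 people: 1/3 engineering (1 woman per 5 people), the rest 1 woman per 2.
engineers, others = 100, 200
women_eng = engineers // 5    # 20 women in engineering
women_other = others // 2     # 100 women elsewhere
print(women_eng / (women_eng + women_other))  # 0.1666... = 1/6
```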
Mark, can you think of one good book about Bayesian analysis combined with fuzzy logic?
Nice introduction to the Bayes formulation but I think you should have left the discussion of how it is used badly to another entire post. As you know (but is not explained here) the validity of Bayes is entirely dependent on independence, which is where folks go wrong by assuming independence without sufficient justification.
“Entirely dependent”? But A and B (or E and H) are expressly /not/ independent. If they were, then P(A|B) = P(A) and you wouldn’t even have to bother with Bayes.
Maybe you’re thinking of applications where P(B|A) is written as a product of P(b|A) terms. That’s a conditional independence assumption and is not inherent to working with Bayes Rule.
Yes, the details have to do with independent and conditionally independent random variables. I wasn’t trying to explain that in my comment. My point was that in order to understand why the application of Bayes rule can be either valid or invalid it is necessary to understand the nature of independence which is what makes it possible to multiply and divide probabilities and have a valid probability as a result.
There are good examples to use, like the Monty Hall problem, where Bayes wins over folks’ common sense, and cases where common sense fails, such as medical tests for rare diseases.
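For instance, here’s a sketch of the rare-disease case, with invented numbers (prevalence 1 in 1000, 99% sensitivity, 5% false-positive rate):

```python
# Rare-disease testing: even a good test yields mostly false positives
# when the condition is rare. All numbers here are invented.
p_disease = 0.001            # prevalence: 1 in 1000
p_pos_given_disease = 0.99   # sensitivity
p_pos_given_healthy = 0.05   # false-positive rate

p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)   # ~0.019: under a 2% chance, despite the positive test
```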
One can then explain how there are successful applications of Naïve Bayes which work even in cases where we know the probabilities are not independent – such as statistical NLP (HMMs for POS tagging, for example) – but experience has shown the results to be useful anyhow, as long as the sample is “close enough” to the relevant population.
Ed Jaynes’s book Probability Theory: The Logic of Science is a great reference for Bayesian probability theory with emphasis on the logical aspects. Indeed, Bayes’s Rule corresponds exactly to the “cut principle” of logic, which states that logical entailment is transitive. Conditional probability is the analogue of logical entailment, and Bayes’s Rule is just the cut principle under this correspondence.
You got through the rant section without mentioning Richard Carrier even once. Congratulations! That must have been hard.