Statistics is something that surrounds us every day – we’re constantly bombarded with statistics, in the form of polls, tests, ratings, etc. Understanding those statistics can be important, but unfortunately, most people have never been taught just what statistics really mean, how they’re computed, or how to distinguish between statistics used properly and statistics misused to deceive.
The most basic concept in statistics is the idea of an average. An average is a single number which represents the idea of a typical value. There are three different numbers which can represent the idea of an average value, and it’s important to know which one is being used, and whether or not that is appropriate. The three values are the mean, the median, and the mode.
The mean is what most people are taught as the average in middle school math. Given a set of values, the mean is what you get by adding up all of the values, and dividing that sum by the number of values. Written in
math notation, you have a set {x_1, …, x_n} of values called the population. The mean is:
(x_1 + x_2 + … + x_n)/n = (Σ_{i=1}^{n} x_i)/n
The mean is a very useful number – it summarizes the properties of the group. It’s important to understand that the mean does not represent an individual – in fact, there may be no individual whose value matches the mean; but the mean is a summary of the entire population.
The median is often a better representative of a typical member of a group. If you take all of the values in a list and arrange them in increasing order, the number at the center will be the median. The median is an actual value belonging to some member of the group. Depending on the distribution of values, the mean may not be particularly close to the value of any member of the group; and the mean is also subject to skew – as few as one value significantly different from the rest of the group can dramatically change the mean. The median gives you a central member of the group without the skew factor introduced by outliers. If you have a normal distribution, then the median value will be a typical member of the population. (“Normal” here is a technical term that I’ll explain in a later post; for now, just take it to mean “not a totally weird distribution”.)
The mode is the most common member of the group. It doesn’t matter whether it’s the biggest or smallest value in the group – whatever value is most common is the mode. The mode is the least commonly used of these three average measures, and that’s because it’s generally the least meaningful. But once in a while, it’s useful. If your data is perfectly
regular, then the mean, median, and mode will all be the same value. This
almost never happens in real life. In general, you’ll find that the median is in between the mode and the mean, closer to the mean.
Let’s look at a couple of examples. Suppose we wanted to know something about the income of a population of people. (Just to be clear, the following numbers are completely made up to give me good examples of how these three measures differ, not because they are in any way representative of real income distribution!) We’ll imagine a group of 24 people. In order, they make
$20,000, $20,000, $22,000, $25,000,
$25,000, $25,000, $28,000, $30,000,
$31,000, $32,000, $34,000, $35,000,
$37,000, $39,000, $39,000, $40,000,
$42,000, $42,000, $43,000, $80,000,
$100,000, $300,000, $700,000, and $3,000,000.
The mean income of this group is about $200,000. (The sum of the incomes is $4,789,000 and there are 24 members, so the mean is $4,789,000/24, or roughly $199,500.)
For the median, we arrange the values in order, with half the values on one side, and half on the
other. To make it all fit, we’ll write the number of thousands, without the trailing “000”:
20, 20, 22, 25, 25, 25, 28, 30, 31, 32, 34,
35, 37,
39, 39, 40, 42, 42, 43, 80, 100, 300, 700, 3000
The median is the value with the same number of things on either side of it. In this case, our population has an even number of members, which means we need to pick one of the two. The correct thing to do in a case like this depends on what you’re doing – in general, you try to do the conservative thing, which means choosing the one that is least likely to skew the data in favor of a particular conclusion. In this case, since we’re looking at uneven salary distributions, skewing it upwards (picking the larger of the two) is going to reduce the imbalance, but certainly not eliminate it; skewing it downwards will make the difference even larger, which will exaggerate the results. So we’ll pick the higher of the middle values: 37,000.
The mode is $25,000.
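If you want to check these three numbers yourself, here is a minimal sketch using Python’s standard `statistics` module. The variable names are mine, and `median_high` is used only because it matches the choice made above of taking the larger of the two middle values; it is one reasonable way to do the arithmetic, not the only one.

```python
import statistics

# Incomes from the example above, in thousands of dollars.
incomes = [20, 20, 22, 25, 25, 25, 28, 30, 31, 32, 34, 35,
           37, 39, 39, 40, 42, 42, 43, 80, 100, 300, 700, 3000]

mean_income = statistics.mean(incomes)           # sum of values / number of values
median_income = statistics.median_high(incomes)  # larger of the two middle values
mode_income = statistics.mode(incomes)           # most frequently occurring value

print(f"sum    = {sum(incomes)} (thousand), n = {len(incomes)}")
print(f"mean   = {mean_income:.1f}")
print(f"median = {median_income}")
print(f"mode   = {mode_income}")
```

Swapping `median_high` for `statistics.median` gives the other common convention, which averages the two middle values.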
This case demonstrates the skew effect of the mean quite clearly. The value of the mean is larger than 80% of the actual members of the group! Even with our selecting the larger of the two possible values of the median, the mean is more than five times larger than the median! The median is a much better measure of a typical member of a group. In this case (as is pretty common) the mode is not particularly meaningful.
One common trick that people use with statistics is using the median where the mean is more appropriate, or the mean where the median is more appropriate.
For example, it’s a pretty common trick when talking about
incomes to talk about how the mean income of a large group of people
has increased – when in fact, the typical member of the group did not
get any raise – instead one or two outliers got huge raises, and everyone else
got nothing. Suppose that you had ten employees, and you gave them pay changes of -2%, -2%, 0%, 0%, 0%, 0%, 1%, 1%, 3%, 20%. The mean salary change would be about +2% (2.1%, to be exact). But six of the ten employees saw either no change or a decrease; and in fact, almost all of the increase went to just one person. Take that one person out, and the average raise drops by nearly a factor of 20 to 0.11%.
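The pay-change arithmetic is easy to reproduce. This is a small sketch with names of my own choosing; it treats the percentages as plain numbers, which quietly assumes all ten employees start from the same salary.

```python
import statistics

# Percentage pay changes for the ten employees in the example.
changes = [-2, -2, 0, 0, 0, 0, 1, 1, 3, 20]

mean_all = statistics.mean(changes)          # mean change, outlier included
median_all = statistics.median(changes)      # what the middle employee saw

without_top = sorted(changes)[:-1]           # drop the single largest raise
mean_without_top = statistics.mean(without_top)

print(f"mean change, everyone:        {mean_all:.2f}%")
print(f"median change, everyone:      {median_all:.2f}%")
print(f"mean change, minus top raise: {mean_without_top:.2f}%")
```

The median here is zero, which is a fair description of what the typical employee actually got.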
Another example of misuse the other way: there’s a creationist who published a book under the pen-name of John Woodmorappe explaining how Noah’s ark could have actually held all of the animals. He did this by computing the median size of a species, and then multiplying that by the number of species. This is wrong, because the median is simply not an appropriate measure for talking about the population as a whole; it identifies a typical member of the population, but it doesn’t extrapolate well.
To see why, let’s look at a specific example. Imagine we had a group of
animals, and their masses (written as count@mass) were: 10@0.02kg, 10@0.1kg, 20@0.2kg, 20@0.3kg, 5@1kg, 5@2kg, and 10@5kg. The total mass is 76.2kg. There are 80 individuals; their mean mass is about 0.95kg; their median mass is (leaning toward the high side) 0.3kg. If you use the mean to estimate the mass of the population, you’ll get about 76kg (the small difference from 76.2kg is rounding error). If you use the median to estimate the mass of the population, you’ll get 24kg – less than 1/3 the correct value. Woodmorappe used this trick to try to make it look like you could fit more animals on the ark than you actually could – as you can see from the example, using the median to reason about the population as
a whole can give you ridiculously wrong answers.
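Here is the animal example worked through in code, as a rough sketch. The `groups` list and the variable names are my own encoding of the count@mass figures above.

```python
import statistics

# (count, mass in kg) pairs from the example above.
groups = [(10, 0.02), (10, 0.1), (20, 0.2), (20, 0.3),
          (5, 1.0), (5, 2.0), (10, 5.0)]

# Expand into one mass per individual animal.
masses = [mass for count, mass in groups for _ in range(count)]

total_mass = sum(masses)
mean_mass = statistics.mean(masses)
median_mass = statistics.median_high(masses)  # leaning toward the high side

# Extrapolating back to the whole population from each "average".
estimate_from_mean = mean_mass * len(masses)
estimate_from_median = median_mass * len(masses)

print(f"true total:           {total_mass:.1f} kg")
print(f"estimate from mean:   {estimate_from_mean:.1f} kg")
print(f"estimate from median: {estimate_from_median:.1f} kg")
```

Multiplying the unrounded mean by the head count recovers the true total, which is exactly why the mean is the right tool for this kind of extrapolation and the median is not.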
One of the reasons why the mean is the preferred representative of a population is that statistical tests, such as Student’s t-test, can be performed on it.
Thank you for the basic stats refresher. Interestingly statistics was the only math class I absolutely aced in college. I actually understood it. Everything else was just a blur.
You might mention how median and mean are defined if you have a continuous probability distribution, instead of simply a number of samples.
Of course, you could assume that anyone able to use the term “continuous probability distribution” accurately already knows, but just in case:
Given a probability distribution p[x] such that:
Integrate[p[x], {x, -Infinity, Infinity}] == 1
Then the mean is:
Integrate[x*p[x], {x, -Infinity, Infinity}]
And the median is the solution L to the problem:
Integrate[p[x], {x, -Infinity, L}] == 0.5
(The notation used is Mathematica‘s)
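To make those integrals concrete, here is a small numerical sketch. The density p(x) = e^(-x) for x ≥ 0 is my own choice of example (it is not from the comment above), picked because its mean is 1 and its median is ln 2, so the two visibly differ. The `integrate` helper is a plain trapezoid rule, and the upper limit of 50 stands in for infinity.

```python
import math

def integrate(f, a, b, steps=100_000):
    """Trapezoid-rule approximation of the integral of f from a to b."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

def p(x):
    """Assumed example density: exponential with rate 1, defined for x >= 0."""
    return math.exp(-x)

UPPER = 50.0  # stand-in for infinity; the tail beyond this is negligible

total_prob = integrate(p, 0.0, UPPER)             # should be close to 1
mean = integrate(lambda x: x * p(x), 0.0, UPPER)  # integral of x * p(x)

# Median: find L such that the integral of p from 0 to L equals 0.5 (bisection).
lo, hi = 0.0, UPPER
for _ in range(40):
    mid = (lo + hi) / 2
    if integrate(p, 0.0, mid, steps=10_000) < 0.5:
        lo = mid
    else:
        hi = mid
median = (lo + hi) / 2

print(f"total probability = {total_prob:.4f}")
print(f"mean   = {mean:.4f}")
print(f"median = {median:.4f}  (ln 2 = {math.log(2):.4f})")
```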
Another thing I just realized: what you have there is the arithmetic mean. Occasionally, it is useful to get into other definitions of mean (geometric mean, harmonic mean, or the generalized power mean). Presumably, anyone interested in those can go look it up in wikipedia.
Related to that, one thing you might want to address is how the mean of derived quantities generally needs to be computed by aggregating the components first. An example helps to clarify this:
It wasn’t 45mph. It was 40mph.
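The example behind that title isn’t reproduced here, so the numbers below are an assumption on my part: a round trip driven at 30mph one way and 60mph back is one scenario that produces exactly this 45-versus-40 discrepancy. The leg distance is hypothetical too, and the answer doesn’t depend on it.

```python
import statistics

# Hypothetical trip: same distance each way, at two different speeds.
leg_distance = 60.0            # miles per leg (assumed; any value gives the same result)
speeds = [30.0, 60.0]          # mph for each leg (assumed)

# Naive approach: average the derived quantity (speed) directly.
naive_average = statistics.mean(speeds)

# Correct approach: aggregate the components first, then divide.
total_distance = leg_distance * len(speeds)
total_time = sum(leg_distance / s for s in speeds)
true_average = total_distance / total_time

# For equal distances this is the harmonic mean of the speeds.
harmonic = statistics.harmonic_mean(speeds)

print(f"mean of the speeds:        {naive_average:.1f} mph")
print(f"total distance / time:     {true_average:.1f} mph")
print(f"harmonic mean of speeds:   {harmonic:.1f} mph")
```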
Last night I saw a commercial for some anti-cholesterol medication (clearly poorly made, since I can’t remember the brand name). In the commercial it claimed “[[drug name]] lowered cholesterol by an average of 30 points – that’s 18%”. Now, I want to know how that “18%” figure was arrived at – was it done by taking the average starting cholesterol of everyone on the medication and dividing 30 by that figure, or was it done by averaging the percentage drops everyone got? The two methods are not equivalent, and I don’t trust the drug company to use the most appropriate definitions if that fails to paint their product in the absolute best light possible. (And I’m not certain either method is appropriate – I want to know the median effect of this drug, or would if I were a potential patient)
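To see that the two ways of getting a percentage really do differ, here is a sketch with three completely made-up patients. None of these numbers come from the commercial or from any real study; they are chosen only so that every patient drops by the same 30 points.

```python
# Hypothetical starting cholesterol levels and point drops (made-up data).
starts = [150.0, 200.0, 300.0]
drops = [30.0, 30.0, 30.0]

mean_drop = sum(drops) / len(drops)
mean_start = sum(starts) / len(starts)

# Method 1: average drop divided by average starting level.
ratio_of_means = mean_drop / mean_start

# Method 2: average of each patient's own percentage drop.
mean_of_ratios = sum(d / s for d, s in zip(drops, starts)) / len(starts)

print(f"average drop:            {mean_drop:.0f} points")
print(f"ratio of the means:      {ratio_of_means:.1%}")
print(f"mean of the percentages: {mean_of_ratios:.1%}")
```

Both figures describe “an average of 30 points”, yet they give different percentages, so an advertiser is free to quote whichever looks better.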
Actually, although creationist claims are easy prey around here, I’m certain that there are just tons of misleading/bad uses of statistics all over any product being promoted for money.
The newspapers can be particularly misleading when they talk about “the average employee”, “the average American”, etc. One can be lulled into the false sense that they’re talking about somebody who actually exists.
For example, last year the financial press, who really should show some numeracy, reported that the average employee of Goldman Sachs earns half a million dollars per year, even when every secretary and janitor was taken into account. Of course this is skewed massively by the few people at the top who earn many hundreds of millions. I’d love to know what the median figure was.
If you’re going for basics, I’d leave out the formula you provide. While it does add extra information, it may lead to more confusion amongst beginners than if it weren’t there. Also, you use the term distribution without adequately defining it. I’d avoid the term altogether (same goes for “normal distribution”) in an introduction to mean/median/mode.
But one glaring error is that it’s unclear (to someone who isn’t familiar with the notation, a beginner) how to calculate the mean. Your example using incomes is good, but it could be presented more clearly. For example, you could format the data in a table to show how the median is the middle number when the data are sorted. It would also help to show skew graphically, so that it’s clear how the mean, median, and mode are different when the data are not normal (although don’t use the term normal distribution).
You introduce the topic quite well, and the closing examples at the end are very good. But the description in the middle is a bit muttledstart off good, examples at end are good, but middle is a bit muddled. The examples at the end would have much more impact if the following suggestions are implemented.
SLC,
There is a t-test using the median or percentile instead. It’s called a “ranked t-test” and some people say that’s what we should always use since it’s more robust to outliers. I think that’s what I would use if I believed in null hypothesis testing.
And my comment above would be a lot clearer if the final paragraph read as follows:
You introduce the topic quite well, and the closing examples at the end are very good. But the description in the middle is a bit muddled. The examples at the end would have much more impact if the suggestions above are implemented.
“If you’re going for basics, I’d leave out the formula you provide.”
Wow, a big part of the beauty of mathematics is the concision and expressiveness of its language and formulas. These make it easier to understand otherwise hard concepts, not the other way around.
Although, since we are going back to the basics, maybe a post on the basics of mathematical notation, what it means, how it’s made and guidelines to make new useful notational constructs would be in order.
I’m curious about one point. I recall being taught that “mean” was a more specific and preferable term for “average.” I have no recollection of median or mode being classed as different types of “average.”
In the professional community, if I were to say “average,” would folks ask, “Which one – mean, median, or mode?”.
Nice article!
For me, it always helped to use the physical analogy–the mean is the “center of gravity” of your distribution. I.e. if you plunked your data points down on the x-axis, the mean is where they balance perfectly. (The fulcrum, if you think of it as a lever.) The median, on the other hand, is the position where you have equally many data points to the right or to the left.
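The balance-point picture can be checked numerically: the deviations from the mean sum to zero (the lever balances), while the median simply splits the count in half. A minimal sketch on a small made-up data set:

```python
import statistics

# Made-up data with one large value pulling the mean to the right.
data = [1, 2, 3, 4, 10]

mean = statistics.mean(data)
median = statistics.median(data)

# "Torque" about the mean: signed distances on either side cancel out.
net_torque_about_mean = sum(x - mean for x in data)

# The median has equally many points on each side.
points_left_of_median = sum(1 for x in data if x < median)
points_right_of_median = sum(1 for x in data if x > median)

print(f"mean = {mean}, net torque about it = {net_torque_about_mean}")
print(f"median = {median}, points left/right = "
      f"{points_left_of_median}/{points_right_of_median}")
```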
On a side note, I feel the urge to point out (to anyone who might be wondering why we talk about the mode at all, if it’s so useless) that the mode is much more meaningful if your data are continuous, rather than discrete. So there is a good reason for defining the thing–other than making things more confusing for students 🙂
“Wow, a big part of the beauty of mathematics is the concision and expressiveness of its language and formulas.”
Yep. To anyone with a CS or math background (most decent undergrad CS programs provide you with the equivalent of a minor in math) it is obvious and meaningful.
To most everyone else not so much, and for a wider audience a primer in all the neat Greek letters and their meanings might be useful.
The big problem with including that formula is that the math markup here is so bad that it doesn’t look like an equation in a book, typeset in LaTeX or written on a chalkboard. Can we please browbeat the ScienceBlogs tech people into catching up with Jacques Distler? I mean, superscripts and subscripts don’t work; HTML entities get replaced in source text upon preview, making dashes and Greek letters come out funny; and I can’t cite more than one URL per post without dumping my comment in the spam queue. It’s, well, discouraging.
Other than that, it would be a good idea to explain the notation at slightly greater length. What is that Σ and how does it relate to a sum?
Blake:
I fixed the problem with superscripts and subscripts in comments – you should be able to use them with no trouble now. Things with URLs do get out of the mod queue – it just takes a bit of time for me to get to them. (If things like the links causing delays really bugs people, I can turn typekey back on, and then not moderate anything that comes from an authenticated user.)
The HTML entities in preview issue, there’s nothing I can do about; it’s deep in the guts of the MoveableType software that we’re using.
Many thanks, or, should I say, (thanks)^n.
Oh, also:
In the paragraph beginning, “The mean income of this group”, the phrase inside parentheses is unfinished. There’s also a dangling “><p>” at the end of the penultimate paragraph.
I would also like to request a refresher on basic notation, especially that used in logical propositions, such as upside-down A’s and backwards E’s. I had some of that in freshman calculus many years ago, but don’t remember all of it.
Very nice! I hope you do standard deviation and standard error.
As to RPM’s point, I think the equations are important, don’t take them out!
But it would be helpful to explain some of the notation. Some people don’t know or won’t remember that this symbol Σ, for example, is used to represent a sum.
One area where the distinction between arithmetic and geometric mean is important is investment returns. For the geometric mean, you multiply the n data points together and take the nth root of the product (see http://en.wikipedia.org/wiki/Geometric_mean for the formula). An investment that goes up 20% one year and down 20% the next has an annual arithmetic mean return of 0%, but you have actually lost 4% of your beginning dollars. The geometric mean of (20%, -20%) is (1.2 * 0.8)^.5 = 0.9798, a return of -2.02%, which is the compound annualized return. The geometric mean is also easily convertible into logarithms and therefore continuous, rather than periodic, rates of return. Most research in financial economics deals with continuous returns rather than periodic as the math is much simpler to deal with.
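The investment arithmetic above can be sketched in a few lines. The variable names are mine; the only inputs are the +20% and -20% returns from the comment.

```python
import math

# Yearly returns from the example: up 20%, then down 20%.
returns = [0.20, -0.20]

# Convert each return to a growth factor (1.2 and 0.8).
growth_factors = [1 + r for r in returns]

arithmetic_mean_return = sum(returns) / len(returns)

# Value of one starting dollar after both years.
final_value = math.prod(growth_factors)

# Geometric mean growth factor, expressed as a yearly return.
geometric_mean_return = final_value ** (1 / len(growth_factors)) - 1

print(f"arithmetic mean return: {arithmetic_mean_return:.2%}")
print(f"final value of $1:      {final_value:.2f}")
print(f"geometric mean return:  {geometric_mean_return:.2%}")
```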
A more general ensemble average discussion would be fun too! (i.e. observables and higher order distribution moments)
One group of cases where looking for the mode is useful is when you have heterogeneity in the population. The easiest way for beginners to visualize this is by making a stem-and-leaf plot (they teach those in middle / high school math) for test scores in a large school, half of whose students are a “magnet school” group. In the s-&-l plot, they’d see two bulges popping out, one representing the most frequent value for the regular students and another for the most frequent value of the eggheads who’ve been shipped in from around the state.
Given how popular meta-analyses are these days in all fields, beginners should also know that the same can happen in looking for the effect size of a treatment. Effect sizes found among industry-funded and non-industry-funded studies, effects among one racial group vs another (like BiDil), and so on — the mean and median will obscure this heterogeneity, but looking for and finding two modes will let people know something’s up. Industry-funded studies may be vastly more likely to find support for their drugs, Europeans and Africans might not respond the same way to heart failure drugs, and a decent “average test score” at a magnet school might mean that there are haves and have-nots forming separate sub-populations on either side of the average.
Just a note to say that the *only* college course that had direct application in every job I ever held (I’m now retired.) was statistics.
I’ll also second the request for a refresher in logical notation.
Mark – thanks, nice post. You’ve been reading Huff, haven’t you?
You probably wouldn’t, because “average” would either be taken to be a bit vaguer, or to refer to the (arithmetic) mean: you can usually tell from the context.
Incidentally, we have a whole bunch of means to use to confuse people: arithmetic, geometric, harmonic, weighted, trimmed, and even Winsorised.
Bob
The mode is most useful in the case where the distribution of your observations is bi-modal or multi-modal. This sometimes comes from combining two disparate populations into one measure.
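A quick sketch of that situation, using invented test scores for two sub-populations lumped into one list. The scores are hypothetical; `statistics.multimode` (Python 3.8 and later) returns every value tied for most frequent.

```python
import statistics

# Hypothetical test scores: one cluster near 60, another near 90.
scores = [58, 60, 60, 60, 62, 64, 70, 75, 86, 88, 90, 90, 90, 92]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
modes = statistics.multimode(scores)   # all most-common values, in first-seen order

print(f"mean   = {mean_score:.1f}")
print(f"median = {median_score}")
print(f"modes  = {modes}")
```

The mean and median both land in the sparse gap between the two clusters; only the pair of modes hints that two different groups are hiding in the data.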
I used to explain mode by reminding students of
pie a la mode (that’s the pie as in American Apple and the rest is French to me)
or fashionable (designer fashion)
but the analogies are no longer in Vogue so the advantage as an explanation is no longer stylish.
Seriously, the “average” is an important concept for people to understand, especially in the -est battles between states, cities, and tribes, e.g., http://ykalaska.wordpress.com/2006/03/13/richest-cities-in-the-us-bethel/
Thanks for helping.
Quantitative PCR to measure viral load is an example where the geometric mean is the appropriate “average” to use because the technique yields a lognormal error distribution.
This seems to confuse HIV “rethinkers” who claim that using the geometric mean is deceptive.
Why have they used the geometric mean here? Well the only reason we could think of (myself and my colleague Sylvano Lucchetti, who is a statistician) is that the geometric mean smooths out ratio changes.
Very nice. I hope you take it further. A couple of things that the world needs are an intuitive (graphical perhaps) explanation of correlation and an explanation of statements such as “genetic factors account for 50% of the variance in IQ”.
For JimV, the upside-down “A” is called a universal quantifier. The back to front “E” is called an existential quantifier. I won’t try to use these signs here, but will stick with the ordinary “A” and “E” in their place. To give a basic idea of how they work, take this expression:
[AxGx]Ux
It might mean “For all x such that x is a gorilla, x is ungodly”, or, more simply “All gorillas are ungodly” or “Every gorilla is ungodly” (these all come to the same thing).
Take this expression:
[ExGx]~Ux
It might mean, “There is an x such that x is a gorilla and x is not ungodly”; more simply “Some gorillas are not ungodly” or, technically better, “There’s at least one gorilla that’s not ungodly.”
It’s obviously a lot better to use the proper quantifiers, rather than an ordinary capital A and E, because capital letters are used to stand for predicates (basically properties of things, in a broad sense). Thus, we might want “E” to mean “is an elephant” or “A” to mean “is aggressive”.
The bit of the predicate calculus that I found most counterintuitive when I first studied it was the way upper case letters are used for properties of things, while lower case letters are used as names. Thus
Pc
might mean “Colin is predatory” or “Carolyn is a polytheist”. The expression
Pc –> ~Ub
might mean “If Carolyn is a polytheist then Belinda is not ungodly.”
Pc –> (Gb & Ub)
might mean “If Carolyn is a polytheist then Belinda is an ungodly gorilla”.
And so on …
Hope I haven’t made any blunders here; I’m feeling a bit rusty with all this.
One of the first steps to eliminating innumeracy in America is to get those darn weather people to stop calling the average temperature “normal.” As in, “It was 5 degrees warmer than normal.” NORMAL would be + or – one standard deviation from the mean. Those 5 degrees are likely well within the “normal” range. If they want to say 5 degrees warmer than average, they should just say so…
This blog entry has single handedly confused the age old linkage between “classical” average and mean… somehow referring to average to mean “central tendency” to take a phrase from wikipedia. … Even Wikipedia is a bit saucy on the definition: http://en.wikipedia.org/wiki/Average
From the OED:
5. transf. The distribution of the aggregate inequalities (in quantity, quality, intensity, etc.) of a series of things among all the members of the series, so as to equalize them, and ascertain their common or mean quantity, etc., when so treated; the determination or statement of an arithmetical mean; a medial estimate. Now only in phrases at an average, on an average.
6 a. The arithmetical mean so obtained; the medium amount, the generally prevailing, or ruling, quantity, rate, or degree; the ‘common run.’
The last thing I need are politicians and others confusing means and medians… like median income for example… a very important measure from the US census, but definitely not the mean income.
Thanks.
A minor point but IMO the definition of medians needs tweaked just a bit, providing I correctly understand the conventions for calculating these. Medians exist as an actual value in a group only when the number of members is odd or when the two adjacent values are identical when the number of members is even. When the number of members is even but the two adjacent members have different ranked values, by convention the median is the simple average of the two adjacent ranked values. This principle applies to the other ranked measures, such as quartile boundaries and so on.
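Python’s standard library happens to expose all of these conventions, which makes the difference easy to see. A small sketch on made-up lists:

```python
import statistics

# Even number of values: the two middle values are 35 and 37.
even_data = [20, 25, 35, 37, 80, 100]

conventional = statistics.median(even_data)    # average of the two middle values
low = statistics.median_low(even_data)         # smaller middle value
high = statistics.median_high(even_data)       # larger middle value

# Odd number of values: the median is an actual member of the data.
odd_data = [20, 25, 35, 37, 80]
odd_median = statistics.median(odd_data)

print(f"even count: median={conventional}, low={low}, high={high}")
print(f"odd count:  median={odd_median}")
```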
Now, on to the explanation of Box-and-Whisker Plots. 🙂
I have yet to take statistics and haven’t seen the inside of a math classroom since 1975 and I understood every word you just wrote. You must be (or would make) a fantastic teacher!
I have a pet peeve about my local radio weather report, which is that when they give the forecast daily high temperature, they usually note that it is X degrees above/below what is “typical” (or even “where it’s supposed to be”). Presumably, they’re comparing it to the historical average (mean) for that date. The language used implies that the situation represents some sort of aberration, as if the mean has some magical attractive power. It happens almost every day, indicating that in fact the temperature rarely matches the historical mean — IOW, it is “typical” (for one common usage of that word) for it to be a few degrees off the mean.
I’ve never gathered the stats to test this, but I hypothesize that the historical temperature for any particular date has a large variance, possibly even bimodal (since we seem to alternate roughly weekly between warm and cold spells, as systems move through the area).
The town where I live is weird. Nearly everyone has more than the average number of legs.
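The joke holds up under arithmetic. A sketch with a hypothetical town of 1,000 people, one of whom has a single leg:

```python
import statistics

# Hypothetical town: 999 people with two legs, one person with one leg.
legs = [2] * 999 + [1]

mean_legs = statistics.mean(legs)
median_legs = statistics.median(legs)
mode_legs = statistics.mode(legs)

# How many residents have more legs than the mean?
above_mean = sum(1 for n in legs if n > mean_legs)

print(f"mean = {mean_legs}, median = {median_legs}, mode = {mode_legs}")
print(f"{above_mean} of {len(legs)} residents are above the mean")
```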
That’s actually a great example, and should find a place in the posting, IMHO.
Cheers,
–Bob
i don’t know if it’s me, my computer, or your writing, but in your first example of income, there are 24 people…
“Nearly everyone has more than the average number of legs.”
Tip of the iceberg.
In my town, the average person has close to one breast, one ovary, one testicle, and half a penis.
That is a drop in the ocean.
In my town every average family has 1.7 kids. The 0.7 kid is the runt.
Sixth paragraph, third sentence reads “The most …” should be ‘The mode …’. Nice basic intro. Thanks
Suppose a new high temperature were recorded in Europe, and the new mean temperature became 120 Fahrenheit. What is Europe’s new high temperature?
This is a question from a girl having trouble with her homework, so I want the answer to this question NOW PLEASE.
Anonymous:
(A) As you experience more of life, you’ll learn that you rarely get help by going to people and DEMANDING THAT THEY HELP YOU NOW PLEASE!
(B) I don’t write this blog so that I can do people’s homework for them.
(C) Even if I wanted to, you can’t solve a problem without all of the information – and you’re missing one.
I just used your site to refresh my memory. It has been a long time since I worked with mean, median, and mode.
Thank you