Several people have asked me to write a few basic posts on statistics. I’ve
written a few basic posts on the subject – like, for example, this post on mean, median and mode. But I’ve never really started from the beginnings, for people
who really don’t understand statistics at all.
To begin with: statistics is the mathematical analysis of aggregates. That is, it’s a set of tools for looking at a large quantity of data about a population, and finding ways to measure, analyze, describe, and understand the information about that population.
There are two main kinds of statistics: sampled statistics, and
full-population statistics. Full-population statistics are
generated from information about all members of a population; sampled statistics
are generated by drawing a representative sample – a subset of the population that should have the same pattern of properties as the full population.
My first exposure to statistics was full-population statistics, and that’s
what I’m going to talk about in the first couple of posts. After that, we’ll move on to sampled statistics.
The way that I learned this stuff was from my father. My father was
working for RCA on semiconductor manufacturing. They were producing circuits for
satellites and military applications. They’d do a test-run of a particular
design and manufacturing process, and then test all of the chips from that
run. They’d basically submit them to increasing stress until they failed. They’d get failure data about every chip in the manufacturing run. My father’s job was to
take that data, and use it to figure out the basic failure properties of the run, and whether or not a full production run using that design and process would produce chips with the desired reliability.
One evening, he brought some work home. After dinner, he spread out a ton of
little scraps of paper all over our dining room table. I (in third or fourth grade at the time) walked in and asked him what he was doing. So he explained it to me.
The little slips were test results. They were using a test system called, if I remember correctly, a Teradyne. It printed out results on these silly little slips of paper. If you’ve ever watched “Space: 1999”, they were like the slips that come out of the computer on that show.
Together, we went through the slips of paper, taking information off of them,
and putting them into long columns. Then we’d add up all of the information in
the column, and start doing the statistics. We did a couple of things. We computed
the mean and the standard deviation of the data; we did a linear regression;
and we computed a correlation coefficient. I’m going to explain each of those in turn.
First, we come to the mean. The mean is the average of a set of values. If you imagine a single theoretical object whose behavior exactly matched what the aggregate information predicts, the behavior of that object would be the mean. To compute the mean, you sum up all of the values in the dataset, and divide by the number of values. To write it formally, if your data are the $n$ values $x_1, \ldots, x_n$, then the mean, which is usually written $\bar{x}$, is defined by:

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \]
The mean is a tricky thing. It’s not nearly as informative as you might
hope. A very typical example of what’s wrong with it is an old joke: Bill Gates walks into a homeless shelter, and suddenly, the average person in the shelter is a millionaire.
To be more concrete, suppose you had a set of salaries at a small company. The receptionist makes $30K. Two tech support guys make $40K each. Two programmers make
$70K each. The technical manager makes $100K. And the CEO makes $600K. What’s the mean salary? (30+40+40+70+70+100+600)/7 = 950/7 ≈ 135.7. So the mean salary of an employee is roughly $135,000. But that’s more than the second-highest salary! So knowing the mean salary doesn’t tell you very much on its own.
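If you want to check that arithmetic, here’s a minimal Python sketch (purely illustrative; the figures are the salaries from the example, in thousands of dollars):

```python
# Salaries from the example above, in thousands of dollars.
salaries = [30, 40, 40, 70, 70, 100, 600]

# The mean: sum of the values divided by the number of values.
mean = sum(salaries) / len(salaries)
print(mean)  # 135.71..., i.e. roughly $135,000
```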
One fix for that is called the standard deviation. The standard deviation
tells you how much variation there is in the data. If everything is
very close together, the standard deviation will be small. If the data is very
spread out, then the standard deviation will be large.
To compute the standard deviation, for each value in the
population, you take the difference between that value and the
mean. You square it, so that it’s always positive. Then you take
those squared differences, and take their mean. The result is
called the variance. The standard deviation is the square root
of the variance. The square root is generally written σ, so:
\[ \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
So, let’s go back to our example. The following table shows, for each salary, the difference between the salary and the mean (rounded to 135), and the square of that difference. All figures are in thousands of dollars.
| Salary | Difference | Square |
|---|---|---|
| 30 | -105 | 11025 |
| 40 | -95 | 9025 |
| 40 | -95 | 9025 |
| 70 | -65 | 4225 |
| 70 | -65 | 4225 |
| 100 | -35 | 1225 |
| 600 | 465 | 216225 |
Now, we take the sum of the squares, which gives us 254975. Then we divide by the number of values (7), giving us about 36425. Finally, we take the square root of that number, giving us about 191. So the standard deviation of the salaries is roughly $190,000. That’s pretty darned big, for a mean of $135,000!
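Here’s the same computation as a short Python sketch, using the exact mean rather than the rounded 135 (illustrative only; it lands very close to the hand calculation above):

```python
# Salaries from the example, in thousands of dollars.
salaries = [30, 40, 40, 70, 70, 100, 600]
n = len(salaries)

mean = sum(salaries) / n                                # ~135.71
variance = sum((x - mean) ** 2 for x in salaries) / n   # ~36424.5
sigma = variance ** 0.5                                 # ~190.9

print(mean, variance, sigma)
```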
The real meaning of the standard deviation is very specific. For data that’s roughly normally distributed, about 68 percent of the data will be within the range (x̄ − σ, x̄ + σ) (which we usually say as “within one standard deviation of the mean”, or even “within one sigma”); and about 95 percent of the data will be within 2 sigmas of the mean.
What should you take away from this? A couple of things. First,
that these statistics are about aggregates, not individuals. Second,
that when you see someone draw a conclusion from a mean without telling you anything more than the mean, you really don’t know enough to draw any
particularly meaningful conclusions about the data. To know how much the mean tells you, you need to know how the data is distributed – and the easiest way of
describing that is by the standard deviation.
Next post, I’ll talk about something called linear regression, which was the next thing my dad taught me when I learned this stuff. Linear regression is a way of taking a bunch of data, and analyzing it to see if there’s a simple linear relationship between some pair of attributes.
Be careful with your comments about the standard deviation; the `1 sd = 68 percent’ and `2 sd = 95 percent’ only work for normal distributions; they’re not true for other distributions (and, in particular, they’re not true for the example you constructed). The question of how to tell whether a distribution is `normal enough’ to use the `1 sd = 68 percent’ and `2 sd = 95 percent’ shortcuts is a problem in and of itself.
Look up Chebyshev’s Inequality (on Wikipedia, for example); it states that
`At least (1 − 1/k^2) × 100% of the values are within k standard deviations from the mean’
so, for any distribution, around 75 percent of the data will be within 2 sd of the mean; a normal distribution raises that 75 percent to 95 percent.
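As a rough illustration of that bound (an example, not code from the comment), here’s a quick Python check of Chebyshev against the salary data from the post:

```python
# Salary data from the post, in thousands of dollars.
salaries = [30, 40, 40, 70, 70, 100, 600]
n = len(salaries)
mean = sum(salaries) / n
sigma = (sum((x - mean) ** 2 for x in salaries) / n) ** 0.5  # population sigma

k = 2
within = sum(1 for x in salaries if abs(x - mean) <= k * sigma)
print(within / n)       # ~0.86: six of the seven values are within 2 sigma
print(1 - 1 / k ** 2)   # 0.75: Chebyshev's guarantee, satisfied but well short of 95%
```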
Mark, mad propz. I am firmly convinced that the lack of a fundamental understanding of the various ways to think about central tendency and (especially) variance of the distribution is one of the biggest problems we have in public policy, public health and other personal thinking and decision making. That relate to math, anyway.
Is it possible to open a daily paper and not be bombarded by articles which muddy concepts of “average”? Which completely ignore the all-critical concept of variance from said “average”? From political polls to the sports page. From the real estate section to the local society-ball philanthropic report. The list goes on and on….
A very basic yet super useful post that I hope more people read. I’m a college student majoring in engineering (with hopes to be a teacher after retirement) and it always annoys the crap out of me when *anyone*, including professors or whatnot, declare that the average for some exam was a 75% so there’s no curve on a test — but the standard deviation is so high that only a few people actually did better than the average.
I just had a conversation with my girlfriend today (a math major, no less!) about how she’s frustrated with classes because she got some tests back today. She got an 80% on one of them, and the average was an 84% — “I’m worse than average!!!! =( =( =(” But how many kids actually did better than average?
It’s funny to me that most people assume that because something is average, it’s the exact middle of the set — okay, fine, I guess that’s the definition — but how often is there exactly 50% of a population below an average and 50% above?
Anyway, enough of my rambling. Good post.
The income illustration is confusing because income is not normally distributed; it follows a Pareto or power law distribution, where there often is infinite variance (and in theory sometimes an infinite expectation). If you randomly sampled the wealth of 1,000 or 10,000 people on the planet and inferred the population standard deviation, then Bill Gates would be a statistical impossibility (even if the standard deviation were as high as $100,000, Bill Gates would still be close to a 600,000-sigma outlier).
I hope you’ll include abuse of linear regression in your next post on that topic. Some papers assume that high r squared means a good predictive relationship. R squared can be increased by putting several unrelated sets of samples on the same graph just to stretch out the axes. Also, I learned that to test the ability of a relationship to make predictions, I should randomly split the data and use half of it to generate an equation, then predict values for the other half and compare predictions with reality. There is some software now that rotates values in and out iteratively and I think mixes equation generation with prediction generation. Should that be trusted?
Excellent post! I have been really interested in learning some stats and this really wet my whistle.
Good article.
Your first paragraph is missing the actual link to your earlier article here: http://scienceblogs.com/goodmath/2007/01/basics_mean_median_and_mode.php
When monitoring something that should hold constant to see if it behaves, we could track the mean and standard deviation to see if they both hold constant, but a more insightful approach is to take the difference between the actual and the nominal values and monitor the RMS of the residuals.
Very nearly: RMS = sqrt( mean**2 + variance )
If the RMS value stays negligibly small, we know not to bother looking further.
If the RMS is too large, we look at the mean to find any bias, and at the standard deviation to find any noise increase.
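Here’s a small Python sketch of that identity (the residual values are made up purely for illustration); with the 1/N variance the relationship is exact:

```python
import math

# Made-up residuals: actual readings minus the nominal value.
residuals = [0.3, -0.1, 0.4, 0.2, -0.3, 0.1]
n = len(residuals)

mean = sum(residuals) / n
variance = sum((r - mean) ** 2 for r in residuals) / n   # population (1/N) variance
rms = math.sqrt(sum(r * r for r in residuals) / n)

print(rms)
print(math.sqrt(mean ** 2 + variance))  # same value: RMS^2 = mean^2 + variance
```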
A question I’ve always been a bit hazy on is the distinction between the two standard deviations, the one where the factor in front of the sum is 1/N and the one where it’s 1/(N-1). Is there an appropriate time to use the 1/(N-1) standard deviation?
Thanks!
tcmJOE:
A sample variance computed using 1/N is slightly biased (its expected value is not exactly the population variance), while the sample variance using 1/(N-1) is unbiased (its expected value is the population variance).
While biased, the 1/N method has some advantages, and the bias decreases rapidly as the sample size increases.
I’m not sure if there’s a rule on when to use one or the other, but I hope this helps.
Wow! A topic I know about!
I’m a statistician for a living.
Full-population vs. sample is a dichotomy I had not heard before. When you said “two types of statistics” I was thinking you were going to go for descriptive and inferential, which isn’t totally different from what you have, but not exactly the same, either.
The mean can be abused and confused.
What, for example, is the average time you go to bed?
Let’s say that on Saturday you went to bed at 1 AM, and on Sunday at 11PM. Add them up, that’s 12. Divide by 2 = 6. hmmmm. Maybe a 24 hour clock? 1 + 23 = 24, divide by two…. NOON! oops.
Also useful are the median and (less commonly used) the trimmed mean, or, even less common, Winsorized mean.
The median is the number that’s got half higher than it, and half lower.
The trimmed mean is the mean, after discarding some of the highest and lowest values (typically 10% or 20% of the highest and 10% or 20% of the lowest); the Winsorized mean doesn’t discard values, it substitutes the highest or lowest value that’s left. So…
10 12 15 20 25 35 40 100 200
mean = 50.78
median = 25
20% trimmed mean = 35.29
20% Winsorized mean = 39.89
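For the curious, here’s a short Python sketch reproducing those numbers (rounding conventions for the trim count vary between packages; this one simply drops or replaces int(0.2 * n) values at each end):

```python
data = sorted([10, 12, 15, 20, 25, 35, 40, 100, 200])
n = len(data)
k = int(0.2 * n)   # values to trim (or replace) at each end; 1 here

mean = sum(data) / n                   # 50.78
median = data[n // 2]                  # 25 (n is odd)

trimmed = data[k:n - k]
trimmed_mean = sum(trimmed) / len(trimmed)            # 35.29

winsorized = [trimmed[0]] * k + trimmed + [trimmed[-1]] * k
winsorized_mean = sum(winsorized) / n                 # 39.89

print(mean, median, trimmed_mean, winsorized_mean)
```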
To C. Chu….
You seem to be slightly confusing the median and the mean
the median is the middle of a data set
the mean is the expected value…..it’s the number to guess if you pay a penalty for being wrong, and get a reward for being right.
So, if you were playing a game and had to guess the score of some random student, and got $10 if you were exactly right, $9 if you were off by 1, $8 if you were off by 2, and so on, the best thing to guess would be 80
But that’s not necessarily the middle of the data.
Peter: 1am is 25, not 1, in this case: (25+23)/2 = 24, i.e. midnight. Likewise if you were looking for your average rising time and woke up at 11pm, 12am, 2am, and 3am you would use -1, 0, 2, and 3: (-1+0+2+3)/4 = 1, or 1:00am. Don’t conflate inability to work with time properly with the confusion surrounding statistics. A better way of dealing with this would be to measure amount of time asleep and awake (going to bed at 1am after 48 hours of being awake is not the same as going to bed at 1am after being awake for 16 hours).
Actually, Chas. Owens, Peter does know what he’s talking about. It is a statistical problem, and there are whole books on the subject.
I’m having a hard time finding a good introductory source on this, but here’s the wikipedia article.
To Chas Owens
I was using that as an example of how you can go wrong with the mean. Whether you say the problem is about “working with the data properly” or “confusion about statistics” is, to me, irrelevant. Your method is one way of getting a right answer. (another is to take time since some point the previous day, e.g. hours after noon, but it’s the same concept, mine just avoids negative numbers)
I was just trying to point out that the adage
“There are no routine statistical questions, only questionable statistical routines” applies in all cases, even ones that appear simple
To Dave
Thanks for that defense, and the links!
It’s even more complex than I thought. Chas Owens’s solution (which is what I would have suggested) will, I think, work in most cases. It bogs down when the angles (that is, times) are uniformly distributed:
hours after midnight: 0 4 8 12 16 20
that is midnight 4AM 8AM noon 4PM 8PM
mean = 10AM, which is sort of silly, I think we’ll agree.
But if people generally go to bed around the same time (which seems likely) then I think the methods are roughly equivalent, but right now I don’t have time to check.
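One standard way to make “average bedtime” well-defined is the circular mean: put each time on a 24-hour clock face and average the corresponding unit vectors. A rough Python sketch (the function name and examples are just illustrations, not anything from the post):

```python
import math

def circular_mean_hour(hours):
    """Circular mean of times given as hours on a 24-hour clock (a sketch)."""
    angles = [h / 24.0 * 2 * math.pi for h in hours]
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    if abs(s) < 1e-9 and abs(c) < 1e-9:
        raise ValueError("times are spread evenly around the clock; no meaningful mean")
    mean_hour = math.atan2(s, c) * 24 / (2 * math.pi)
    return round(mean_hour % 24, 6) % 24   # final % 24 folds a rounded 24.0 back to 0.0

print(circular_mean_hour([23, 1]))   # 0.0, i.e. midnight, as expected

try:
    circular_mean_hour([0, 4, 8, 12, 16, 20])   # the uniformly spread case above
except ValueError as e:
    print(e)   # no meaningful mean, matching the point about the uniform case
```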
“The mean is a tricky thing. It’s not nearly as informative as you might hope. A very typical example of what’s wrong with it is an old joke:”
The mean person in my town has approximately one breast, one ovary, one testicle, half a penis, and half a vagina.
tcmJoe:
If you’ve got full-population data, then you use the “/N” standard deviation. When you’re using sampled data, you use “/N-1”. The basic reasoning is that the probabilistic expectation for samples is that they’ll be narrower than the full population. Using the “N-1” denominator is a compensation for that.
No, the standard deviation is generally written σ
I was taught that you use the ‘N-1’ for sample data ’cause you’ve already used one degree of freedom to calculate the mean.
Much appreciated, thank you!
When you say that using 1/(N-1) is a compensation for the slight difference in probabilistic expectation between sample and census data, is there some sort of proof that the variation is compensated by subtracting 1 from N? Would (and I’m just throwing this out) taking something like 1/(N-2) ever be a “more” accurate guess of census deviation in some cases?
tcmJoe:
I’ve never studied the formal derivations of the sampled standard deviation, so I may well be wrong. My father, when he taught me this stuff, told me that it was purely an empirical thing.
The fact that the sample is likely to be narrower should be sort of clear: on a sample of a very large data set, you’re likely to miss the outliers. That’s what narrows the standard deviation. So the fact that some correction will help describe that should be fairly obvious. But the specific “/N-1” correction is, I think, empirical: if you look at the standard deviation of samples versus populations, where you know the population data, “/N-1” is what produces the best result.
tcmJOE,
There is! When you do a linear regression, the denominator in the unbiased estimator of the variance is N-p, where p is the number of parameters being estimated. Estimating the mean in the manner described by MarkCC can be viewed as a special case of linear regression where there is only one parameter being estimated — hence N-1.
MarkCC,
It’s not empirical — it’s an expected value, i.e., an integral. You can calculate it if you’ve got the mathematical chops. (I don’t have the chops, but I get the same formula from Bayesian posterior expectations… which is a whole other story.)
Here’s the intuitive explanation (which I got from David MacKay’s book). When you estimate the distribution mean using the sample mean, the estimated mean minimizes the sum of the squares of the residuals (SSR). Any other estimate of the distribution mean would give a larger SSR — and in particular, the true distribution mean would give a larger SSR. The denominator N-p exactly counteracts (in expectation) the shrinkage of the SSR.
This is what people mean when they say that you use up a degree of freedom estimating the parameters.
edit:
…you use up a degree of freedom estimating each parameter.
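A quick simulation (a sketch of the idea, not code from the comment; the sample size and trial count are arbitrary) makes the expectation argument concrete: averaged over many small samples, the 1/N estimator comes out low by a factor of (N-1)/N, while the 1/(N-1) estimator matches the true variance:

```python
import random

random.seed(0)
n = 5              # small samples, where the bias is most visible
trials = 200_000   # enough trials for the averages to settle down

biased_total = unbiased_total = 0.0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]   # true variance is 1
    m = sum(sample) / n
    ssr = sum((x - m) ** 2 for x in sample)
    biased_total += ssr / n
    unbiased_total += ssr / (n - 1)

print(biased_total / trials)    # ~0.8, i.e. (n-1)/n of the true variance
print(unbiased_total / trials)  # ~1.0, the true variance
```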
Peter and Dave:
I misspoke, I should have said “Don’t conflate inability to work with time properly with the confusion surrounding what the mean average and other statistical functions mean.” rather than “Don’t conflate inability to work with time properly with the confusion surrounding statistics.” The article is about what comes out of the mean average function and how it is useful, not how to make sure what goes into the function is meaningful. Another example of meaningless input could be mean(“running shoes”, “socks”, “slacks”, “underwear”, “shirt”) to try to get the average price of the clothes a person is wearing. In this case it is obvious that the understanding of the data is at fault, not the understanding of the statistical function being used (because the inputs don’t look like numbers, the way times do). Like the time problem, this is not an issue of the statistical functions producing data that is not very enlightening about the population (as is the case with the salaries from the article), but rather a problem of how to represent the data in such a way that the functions can operate on them. The time issue would be a wonderful thing to bring up if the article were about the GIGO rule, but this article is about what the various statistical functions mean and how to use them to get information about a population.
@Chas
Peter’s example was entirely appropriate for the article. If you read the article carefully you will see that MCC uses the example of Bill Gates walking into a homeless shelter to illustrate misuse of mean values. Peter’s bed time example leads to similarly humorous results. He knew full well that it was a silly way to compute a mean.
To Chas. Owens
No, it’s not GIGO at all.
The average of “running shoes” and “shirts” is, as you point out, obvious nonsense.
The average time going to bed is perfectly meaningful, and, as pointed out in another comment, your solution (which, I admit, was mine too) wasn’t even fully correct.
Is this a data problem? Well, clearly. But, it’s also a problem with understanding what the mean is.
HJT:
Bill Gates walking into a homeless shelter increases the mean and the standard deviation, but the mean is still correct. Trying to take the mean of 11pm and 1am produces garbage if you naively average 23 and 1 (producing an average of 12, i.e. noon). It isn’t a matter of the result having a large standard deviation, the result is pure garbage because the inputs were pure garbage (as bad as my example with the clothes). The data just doesn’t look like garbage because they are numbers. The example from the article shows how people abuse valid results; the time example is not a valid result.
Peter:
The clothes example suffers from the same problem as the time problem: the values of the population must be converted into a usable form before the mean is taken. It is entirely possible to find the mean cost of the clothing a person is wearing, but first you must convert the names of the items of apparel to their monetary value. Both problems have nothing to do with the mean, except that when the mean is presented with garbage its output is also garbage.
Why is the sum of squares used in the standard deviation instead of the absolute value?
I have wondered this a long time, but no one has been able to give me a good answer.
This is why I don’t report the mean score when I hand back exams, but the median. We all understand the median: half the class scored higher, and half the class scored lower.
Ooooh! I’ve always wanted better explanations of statistics than I currently have. Would you also be willing to tackle why kurtosis is important, and why we use standard deviation instead of absolute deviation?
Anonymous:
It’s not simple to explain the reason for the square. I might try to explain it in a later post… The simple version is that the root-mean-square comes from the variance (the value of the mean-square-difference), which is intimately related to properties of the distribution. For example, when you try to do line-fitting in linear regression, the best line fit comes from minimizing the variance, not minimizing the mean-difference.
@33
The square is used because usually, in real life, you don’t calculate the standard deviation exactly how Mark has shown it here (by taking the difference between each individual sample and the mean, and squaring it).
What you want to find is the sum of (x – µ)² for all values of x, µ being the mean. That’s the sum of (x² – 2µx + µ²) for all values of x, which simplifies to (the sum of x²) – nµ², where n is the number of values. So you only need the sum of the values (to get µ) and the sum of their squares; you don’t need to work out the difference between each sample and the mean.
I guess that in the early days of computers, not needing to revisit data was a definite advantage. You only need to keep running totals of the sum of values of x, the sum of values of x² and a count of values. If you’re sure you know what you’re doing, you can even undo a bad entry by subtracting from the relevant totals.
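Here’s a minimal Python sketch of that running-totals approach (illustrative only; for values that are large or very close together, Welford’s one-pass algorithm is numerically safer):

```python
def running_variance(values):
    """Population (1/N) variance from running totals, in one pass over the data."""
    n = 0
    sum_x = 0.0
    sum_x2 = 0.0
    for x in values:
        n += 1
        sum_x += x
        sum_x2 += x * x
    mean = sum_x / n
    return sum_x2 / n - mean * mean   # sum((x - mean)^2)/n, rearranged

salaries = [30, 40, 40, 70, 70, 100, 600]
print(running_variance(salaries) ** 0.5)   # ~190.9, matching the two-pass version
```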
Mark – You have a good memory to recall the name of the testing equipment from your childhood. As it happens, Teradyne Corp. manufactures a range of semiconductor test equipment. I first learned about them from a Harvard Business School case in an MBA course.
Mark, I think you are being somewhat misleading about the square. There is nothing ‘wrong’ with using the absolute deviation instead of the squared deviation. For mathematical reasons, using the absolute deviation is more consistent (so to speak) with using the median, as opposed to the mean, as a central estimate, but it’s OK. Even for regression problems, there are many papers about median-absolute-deviation instead of mean-squared-deviation estimation.
(The misleading part is when you say squares are used because they provide the ‘best’ fit – that’s a tautology, as minimizing the mean squared error is commonly the *definition* of what ‘best fit’ means.)
I’m playing along in Excel and used the STDEV function. But it gives me a different value (206). When I tried Mark’s method, I got 190. After reading the comments, it turns out this is the N vs N-1 thing. Does this mean the STDEV function assumes I have entered sample data rather than population data? Is that the norm in these kinds of calculation programs? Sorry if this sounds too ignorant … my understanding is limited to what I’ve gleaned from the comments. Thanks!
to WCYEE
Well, relying on ‘norms’ in statistical software is tricky and fraught with danger. But, here, Excel’s default probably is sensible (unusual, for Excel, to have things make sense!).
It’s very rare to have the whole population of anything.
Again, thank you for clarifying things.
A request: At some point would you please cover Bayesian statistics? This, of course, could be far in the future.
Thank you, thank you, thank you!
I have linked your blog to my fellow cohorts at Tiffin University in Ohio! (Hi everyone!) We appreciate, as adult learners taking an online statistics class, the ability to link this new-found knowledge to real life applications!
Your topic was extremely timely for us this week! We are heading into the importance of “hypothesis testing” next week.
Thanks again,
Greta
Peter,
Thanks. It makes sense that people using statistics packages are more likely to deal with sample rather than population data. After reading the help files a bit, I discovered that Excel has the STDEVP function for population data that seems to use N rather than N-1 in the calculation. I also tried out R and the sd function assumes sample data (N-1) as well. Thank you for clearing that up.
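For anyone following along in Python instead of Excel, the standard library makes the same distinction (a quick illustration using the salary data from the post):

```python
import statistics

salaries = [30, 40, 40, 70, 70, 100, 600]

print(statistics.pstdev(salaries))  # ~190.9 -- population (1/N), like Excel's STDEVP
print(statistics.stdev(salaries))   # ~206.2 -- sample (1/(N-1)), like Excel's STDEV
```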
Mark, thanks so much for writing on these basics. You have a way of describing this stuff which I love.
Mark – if you assume a finite population of N objects and then consider a sample (possibly including duplicates) of n of them, it’s quite easy to analyse the situation with some big meaty-looking sums. You find the expected value of the mean, or variance, over the population by considering the set of all samples and summing over it and dividing by its size in the usual way; the values, and denominators, you want then drop out in the wash. It’s a one-side-of-paper calculation, pleasantly enough.
Interesting things to note:
– if you want sampling WITHOUT replacement/duplication, the unbiased estimator turns out to have the population size in it.
– the square root of the unbiased estimator of the variance is NOT an unbiased estimator of the standard deviation (you need Gamma functions for that, I gather).
New in arXiv as of 8 April 2008.
Quadratic distances on probabilities: A unified foundation
Authors: Bruce G. Lindsay, Marianthi Markatou, Surajit Ray, Ke Yang, Shu-Chuan Chen
Comments: Published in the Annals of Statistics by the Institute of Mathematical Statistics
Journal-ref: Annals of Statistics 2008, Vol. 36, No. 2, 983-1006
Subjects: Statistics (math.ST)
This work builds a unified framework for the study of quadratic form distance measures as they are used in assessing the goodness of fit of models. Many important procedures have this structure, but the theory for these methods is dispersed and incomplete. Central to the statistical analysis of these distances is the spectral decomposition of the kernel that generates the distance. We show how this determines the limiting distribution of natural goodness-of-fit tests. Additionally, we develop a new notion, the spectral degrees of freedom of the test, based on this decomposition. The degrees of freedom are easy to compute and estimate, and can be used as a guide in the construction of useful procedures in this class.
What are the similarities and differences between the two kinds of standard deviations and the two kinds of means? We are taking a stats course and are lost! HELP!!!
please i need to know how to calculate sigma values in a basic program.