There’s one topic I’ve been asked about multiple times, but which I’ve never gotten around to writing about. It happens to be one of the first math things that my dad taught me about: linear regression.
Here’s the problem: you’re doing an experiment. You’re measuring one quantity as you vary another. You’ve got a good reason to believe that there’s a linear relationship between the two quantities. But your measurements are full of noise, so when you plot the data on a graph, you get a scattershot. How can you figure out what line is the best match to your data, and how can you measure how good the match is?
When my dad taught me this, he was working for RCA manufacturing semiconductor chips for military and satellite applications. The focus of his work was building chips that would survive in the high-radiation environment of space – in the jargon, he was building radiation hard components. They’d put together a set of masks for an assembly line, and do a test run. Then they’d take the chips from that run, and they’d expose them to gamma radiation until they failed. That would give them a good way of estimating the actual radiation hardness of the run, and whether it was good enough for their customers. Based on a combination of theory and experience, they knew that the relationship they cared about was nearly linear: for a given amount of radiation, the number of specific circuitry failures was proportional to the amount of gamma exposure.
For example, here’s a graph of data points that I generated semi-randomly. The distribution of the points isn’t really what you’d get from real observations, but it’s good enough for demonstration.
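If you want to play with data like this yourself, here’s a rough sketch of how you could generate it (the particular parameters, seed, and variable names here are just plausible guesses for illustration, not the exact ones behind the figure):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical setup: a true slope near 2.2, an intercept near 20,
# and gaussian noise layered on top.
rng = np.random.default_rng(1)
x = np.linspace(0, 50, 40)                    # independent variable
y = 2.2 * x + 20 + rng.normal(0, 15, x.size) # dependent variable, with noise

plt.scatter(x, y)
plt.xlabel("x (independent)")
plt.ylabel("y (dependent)")
plt.show()
```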
The way that we’d usually approach this is called least-squares linear regression. The idea is that what we want to do is find a line where the sum of the squares of the vertical distances between the chosen line and the measured data points is a minimum.
For the purposes of this, we’ll say that one quantity is the independent variable, and we’ll call that x; the other quantity is the dependent variable, and we’ll call that y. In theory, the dependent variable, as its name suggests, depends on the independent variable. In practice, we don’t always really know which value depends on the other, so we do our best to make an intelligent guess.
So what we want to do is find a linear equation, $y = mx + b$, where the mean-square distance is minimal. All we need to do is find values for $m$ (the slope of the line) and $b$ (the point where the line crosses the $y$ axis, also called the y intercept). And, in fact, $b$ is relatively easy to compute once we know the slope of the line. So the real trick is to find the slope of the line.
The way that we do that is: first we compute the means of $x$ and $y$, which we’ll call $\bar{x}$ and $\bar{y}$. Then using those, we compute the slope as:

$$m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Then for the y intercept: $b = \bar{y} - m\bar{x}$.
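Here’s a minimal sketch of those two formulas in code (the function name and structure are just for illustration, not any standard API):

```python
import numpy as np

def least_squares_fit(x, y):
    """Fit y = m*x + b by minimizing the sum of squared vertical distances."""
    x_mean = np.mean(x)
    y_mean = np.mean(y)
    # Slope: sum of products of the deviations from the means,
    # divided by the sum of squared deviations of x.
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line always passes through (x_mean, y_mean).
    b = y_mean - m * x_mean
    return m, b
```

If you’d rather not write it by hand, `np.polyfit(x, y, 1)` returns the same slope and intercept.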
In the case of this data, I set up the script so that the slope would be about 2.2 +/- 0.5. The slope of the fitted line in the figure is 2.54, and the y-intercept is 18.4.
Now, we want to check how good the linear relationship is. There are several different ways of doing that. The simplest is called the correlation coefficient, or $r$:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)\left(\sum_{i=1}^{n} (y_i - \bar{y})^2\right)}}$$
If you look at this, it’s really a check of how well the variation in the measured values matches the variation you’d expect from the regression. On the top, you’ve got a sum of products of deviations; on the bottom, you’ve got the square root of essentially the same thing squared. The bottom is, essentially, just stripping the signs away. The end result is that if the correlation is perfect – that is, if the dependent variable increases linearly with the independent – then the correlation will be 1. If the dependent variable decreases linearly as the independent increases, then the correlation will be -1. If there’s no relationship, then the correlation will be 0.
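As a sketch, the same computation in code (again, a helper written just for this post, not a library call):

```python
import numpy as np

def correlation_coefficient(x, y):
    """Pearson's r: products of the deviations of x and y from their means
    on top, normalized by the spreads of x and y on the bottom."""
    dx = x - np.mean(x)
    dy = y - np.mean(y)
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
```

numpy will also do this for you: `np.corrcoef(x, y)[0, 1]` gives the same number.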
For this particular set of data, which I generated with a linear equation plus a little bit of random noise, the correlation coefficient is slightly greater than 0.95, which is exactly what you’d expect.
When you see people use linear regression, there are a few common errors that you’ll see all the time.
- No matter what your data set looks like, linear regression will find a line. It won’t tell you “Oops, I couldn’t find a match”. So the fact that you fit a line means absolutely nothing by itself. If you’re doing it right, you start off with a hypothesis based on prior plausibility for a linear relation, and you’re using regression as part of a process to test that hypothesis.
- You don’t get to look at the graph before you do the analysis. What I mean by that is: if you look at the data, you’ll naturally notice some patterns. Humans are pattern seekers – we’re really good at noticing them. And almost any data set that you look at carefully enough will contain some patterns purely by chance. If there’s a particular pattern that you want to see, you’ll probably find a way of looking at the data that produces it. For example, in the first post on this blog, I was looking at a shoddy analysis by some anti-vaxxers, who claimed that they’d found an inflection point in the rate of autism diagnoses, and used linear regression to fit two lines – one before the inflection, one after. But that wasn’t supported by the data. It was random – the data was very noisy. You could fit different lines to different sections by being selective. If you picked one time, you’d get a steeper slope before that time, and a shallower one after. But by picking different points, you could get a steeper slope after. The point is, when you’re testing the data, you need to design the tests before you’ve seen the data, in order to keep your bias out!
- A strong correlation doesn’t imply a linear relationship. If you fit a line to a bunch of data that’s not really linear, you can still get a strong positive (or negative) correlation. Correlation is really testing whether the data is increasing the way you’d expect it to, not whether the relationship is truly linear. Random data will have a near-zero correlation. Data where the dependent variable doesn’t vary consistently with the independent will have near-zero correlation. But there are plenty of ways of getting data where the dependent and independent variables increase together that produce a strong correlation, even when the relationship isn’t linear – the sketch after this list shows a simple case. You need to do other things to judge the strength of the fit. (I might do some more posts on this kind of thing to cover some of that.)
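To make that last point concrete, here’s a small sketch (a made-up example, not data from the figure above): a plain quadratic is obviously not linear, but its correlation with x is still very strong.

```python
import numpy as np

x = np.linspace(1, 10, 50)
y = x ** 2                                   # a clearly non-linear relationship

r = np.corrcoef(x, y)[0, 1]
print(f"correlation coefficient: {r:.3f}")   # roughly 0.97, which looks "strong"
```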