As [Tara](http://scienceblogs.com/aetiology/2006/10/aids_and_viral_load.php), [Nick](http://aidsmyth.blogspot.com/2006/09/viral-load-paradigm-shift-not-really.html), and [Orac](http://scienceblogs.com/insolence/2006/10/more_distortion_of_peerreviewed_data_by.php) have already discussed, there’s been a burst of
activity lately from the HIV denialist crowd, surrounding [a new paper](http://jama.ama-assn.org/cgi/content/full/296/12/1498) studying the correlation between viral loads and onset and progression of symptoms in AIDS. For example, Darin Brown, allegedly a mathematician (and recently a troll in the comments here on GM/BM), has [written](http://barnesworld.blogs.com/barnes_world/2006/10/it_must_be_jell.html):
>Even if one is willing to endure the intellectual contortions necessary to
>reconcile these findings with the HIV/AIDS hypothesis, it is impossible to deny
>that they are incompatible with the justification for the treatment strategies
>advocated over the past 10 years.
>
>In case anyone was in a cave, for a decade, the treatment dogma has been:
>
>(1) CD4 counts and “viral load” are accurate predictors of progression to
>“AIDS” and death. In fact,
>
>(2) All three are correlated to each other. As viral loads go up, CD4 counts go
>down, and each indicates progression to “AIDS”. This is because HIV causes loss
>of CD4 cells. This is why they are called “surrogate markers”. This is why
>dozens and dozens of studies used viral load and CD4 counts as outcomes.
>Conversely, as viral load goes down, CD4 counts go up, and the patient is
>“healthier”.
>
>(3) If viral load goes up and CD4 counts go down sufficiently, you should go on
>ARVs immediately. Who knows how many healthy people have been put on these
>drugs on the basis of viral load and CD4 counts alone.
>
>The above 3 points have been drummed beyond belief over the past 10 years. For
>the AIDS establishment to deny now that this is what they have been saying all
>this time boggles the mind, but is not surprising.
When it comes to the science of it, I can’t contribute anything beyond what Tara and friends had to say. But the denialist argument around this is actually a classic example of one of my personal bugaboos concerning statistics. Details below the fold.
To briefly summarize the paper: it studies the correlation between HIV viral load and the onset and progression of AIDS symptoms. As in earlier work, the aggregate data for the population of infected people shows a *strong* correlation between viral load and symptoms. But in addition to looking at the aggregate data, the authors *also* looked at how well an *individual*’s viral load could be used *as a predictor* for the progression of the disease. The conclusion was that while viral load shows a very strong correlation in the aggregate data, in individuals it is not a good predictor.
Naturally, the denialist crowd is all over this: after all, if the HIV viral load is *not* a good predictor of the onset of full-blown AIDS, then how can scientists credibly claim that AIDS is caused by HIV?
This is a perfect example of one of the most common errors in statistics. Statistics looks at large collections of data in order to find patterns that appear *in the aggregate*: it is about *aggregates*, not individuals. Reasoning from the aggregate back to the individual is error-prone *at best*. The predictive value of aggregate data does *not* automatically translate into predictive value about individuals.
It’s pretty easy to see why. Let’s take a really simple example. Suppose we’ve got a company with 100 employees: 50 of them make $20,000/year, 30 more make $30,000/year, 10 make $50,000/year, 5 make $100,000/year, and 5 make $200,000/year. So payroll for one year in this company is $3,900,000.
Suppose that for the next year those top five earners are given raises to $300,000; everyone else in the company gets a 5% raise.
The next year’s payroll is $4,545,000. From the *aggregate* data, we can easily see that the average raise was about 16%. Does that mean that we should be able to conclude that an average *individual* employee got anything close to a 16% raise? Obviously not – each employee got either a 5% raise or a 50% raise.
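Here’s the same arithmetic as a quick Python sketch, using exactly the numbers from the example above:

```python
# Aggregate vs. individual: the payroll grows ~16%, but no individual
# got a raise anywhere near 16%.

# (head count, salary) for each band
bands = [(50, 20_000), (30, 30_000), (10, 50_000), (5, 100_000), (5, 200_000)]
payroll_before = sum(n * s for n, s in bands)

# Next year: the five $200,000 earners jump to $300,000 (a 50% raise);
# everyone else gets 5%.
def next_salary(s):
    return 300_000 if s == 200_000 else s * 1.05

payroll_after = sum(n * next_salary(s) for n, s in bands)

print(f"payroll before:  ${payroll_before:,}")                        # $3,900,000
print(f"payroll after:   ${payroll_after:,.0f}")                      # $4,545,000
print(f"aggregate raise: {payroll_after / payroll_before - 1:.1%}")   # 16.5%
# Every individual raise, though, was either exactly 5% or exactly 50%.
```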
That’s *exactly* what the denialists are doing to this paper. They’re *claiming* that if you can’t use *aggregate* data as a predictor for *individual* outcome, that means that conclusions formed from the aggregate data *about the aggregate* must also be invalid.
As Orac points out, if you apply this reasoning to other medical studies, you’ll end up concluding that we know *nothing* about any diseases or their causes. You can’t even prove that *smoking* causes cancer: in the *aggregate*, smoking increases the risk of getting lung cancer very dramatically; in individuals, many (or even *most*) smokers won’t develop lung cancer. Blood serum cholesterol levels certainly show a strong correlation with heart disease: high cholesterol definitely increases the risk of heart problems like heart attacks. But my great-grandfather had *incredibly* high cholesterol; he ate eggs for breakfast every morning, and used schmaltz (chicken fat) as a condiment. Yet he lived to be *96* years old, and never had any heart trouble. Meanwhile my father, who has only moderately high cholesterol, had to have a quad bypass two years ago, or he would have died.
Aggregate data, by itself, *does not* have specific predictive value for individuals. That doesn’t mean you can’t say that *in general* smoking increases your risk of getting cancer, or that high cholesterol *in general* increases your risk of heart disease. But the *specific* application to individuals isn’t automatic: you can’t reason back from the aggregate to the individual unless you take the additional step of showing *how well the data applies to individuals*. Working back from the aggregate to the individual, you’re always introducing a degree of uncertainty, and the only way to really understand that degree of uncertainty is *to measure it* by experimentation.
The study that we’re looking at was attempting to see whether or not the
aggregate data showed the same kind of *individual* correlation that it showed
in aggregate. They were doing exactly the kind of experiment that *any* good
scientist would do to figure out *how well* the aggregate data can be applied to
individuals. Disappointingly, the aggregate data does *not* form a particularly good predictor for individuals.
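To make the aggregate/individual distinction concrete, here’s a hedged little simulation. The numbers are entirely synthetic – nothing here comes from the actual study – but they show how a predictor can have a real, unmistakable effect in the aggregate while explaining only a few percent of the variation between individuals:

```python
# A sketch with purely synthetic data (assumed parameters, not the
# study's): a weak-but-real signal plus lots of individual noise.
import random

random.seed(0)
n = 10_000
xs = [random.gauss(0, 1) for _ in range(n)]          # the "predictor"
ys = [0.2 * x + random.gauss(0, 1) for x in xs]      # the "outcome"

# Aggregate view: split the population in half by the predictor.
low  = [y for x, y in zip(xs, ys) if x < 0]
high = [y for x, y in zip(xs, ys) if x >= 0]
print(f"mean outcome, low-predictor half:  {sum(low)/len(low):+.3f}")
print(f"mean outcome, high-predictor half: {sum(high)/len(high):+.3f}")
# With 10,000 points the group means separate cleanly: a real,
# strong aggregate correlation.

# Individual view: what fraction of outcome variance does the
# predictor explain (the coefficient of determination, R^2)?
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
vx = sum((x - mx) ** 2 for x in xs) / n
vy = sum((y - my) ** 2 for y in ys) / n
print(f"R^2 = {cov**2 / (vx * vy):.2f}")   # ~0.04: nearly useless for
                                           # predicting any one person
```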
So, going back to Darin’s rant: does that mean that the standard medical practice WRT HIV/AIDS – the practice of basing decisions about when to start anti-retroviral therapy on viral load – is wrong? No. No more than my doctor was wrong to advise *me* to change my diet to get my cholesterol down.
Viral load isn’t a great predictor of the onset of symptoms for individuals: it *can* cause us to start therapy too soon for some individuals (exposing them to the side effects of the medication); and it *can* cause us to delay starting therapy too *long* for some individuals (allowing them to develop symptoms sooner than they would have if they had taken the medication). But given our current level of knowledge, we *do* know that there is a very strong correlation between viral load and onset of symptoms *in the aggregate*.
The comparison to cholesterol and heart disease risk is quite informative. Some people with high cholesterol will *never* develop heart trouble; but we try to make them change their diet, and give them medication to try to lower it, even though those medications/diet changes *might not* be doing them any good. And some people who *don’t* have high cholesterol will die of heart attacks that could have been prevented by appropriate medication. What determines whether an individual will develop heart disease is a complex mix of many different factors, and we don’t even know what all of them are. But we use the aggregate data to determine the best estimate that we can of the optimal risk/benefit balance, and use that to determine when to start treating the problem.
Treating HIV is very much the same kind of thing. We have very good aggregate data showing that the viral load correlates with the onset of the disease. But there are many different factors involved in when an HIV infection turns into full-blown AIDS, and we don’t even know what all of them are. But we use the aggregate data to determine the best estimate of the optimal risk/benefit balance, and use it to determine when to start anti-retroviral treatment.
This *is* an extremely common statistical error. People frequently try to use statistics to reason from the aggregate to the specific. It doesn’t work particularly well, except as a way to produce a very rough initial estimate; and it’s difficult to know *how* rough that initial estimate is without doing the experiments and the math to see.
I completely agree with you; I just want to clarify one thing.
This paper did not look at the onset of AIDS or symptoms. It only tried to determine the degree to which HIV viral load at presentation could predict the rate of CD4 cell decline.
I suspect that even though viral load was not a good predictor of the rate of CD4 cell decline for individuals, it would still be a good predictor for onset of AIDS. But, I suspect this based on the aggregate data used in this and other studies rather than individual data on viral load and AIDS progression, which is not addressed in this study.
Another non-medical example of their type of reasoning:
Because some drunk drivers arrive home safely, drunk driving must not really be dangerous. The connection between drunkenness of driver and number of accidents isn’t reliable – many people have accidents while sober, and many drunk drivers don’t have accidents at all. It would be unreasonable to say that someone should call a cab on the basis of his blood alcohol level alone.
Actually, this argument is *stronger* than the HIV denialists’ argument, unless they can produce a case of someone having AIDS while HIV-negative (the equivalent of the sober person having an accident). Nonsmokers do get cancer, sober people do have car accidents, but where are the HIV-negatives with AIDS? Without them you would be forced to accept that HIV is a necessary (even if not sufficient) condition for developing AIDS.
This is somewhat off topic but funny. My favorite ‘use’ of statistics comes from a local newscast that I saw as a child in Wisconsin. The talking heads were discussing a new study on the effect of dietary sodium on all-cause mortality. They claimed that the study concluded that “people who don’t eat salt are 20% more likely to die”. I had never had a statistics course at that point, but I imagined a guy in some greasy spoon not putting salt on his hash browns and then promptly dropping dead. In my opinion, the misapplication and misunderstanding of statistics (particularly the inference from population parameters to individual outcomes) is the #1 problem in scientific communication to the public.
@Ethan Romero:
Funny that. I thought we were all 100% likely to die.
Thank you, thank you, I’ll be here all week. . . .
98.8% of statistics are made up on the spot.
– Vic Reeves
But it is true: statistics can be twisted to say almost anything. I can’t remember the exact figures, but it was claimed that one in six men are gay. This was based on a question that asked ‘have you ever considered another man to be attractive?’, not even ‘have you ever been attracted to a man?’. Not quite the same thing as being gay.
Perhaps a simpler example is height and gender.
On average men are significantly taller than women.
Does this mean you can predict the gender of a person by measuring their height? No.
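To put rough numbers on that (assumed, purely illustrative distributions – means of about 175 cm and 162 cm with a common standard deviation of 7 cm; real figures vary by population):

```python
# The aggregate gap between groups is real and easy to measure, but
# classifying one person by height alone misfires often.
import random

random.seed(1)
men   = [random.gauss(175, 7) for _ in range(50_000)]
women = [random.gauss(162, 7) for _ in range(50_000)]

print(f"mean male height:   {sum(men) / len(men):.1f} cm")
print(f"mean female height: {sum(women) / len(women):.1f} cm")

# Best single cutoff: call anyone above the midpoint "male".
cut = (175 + 162) / 2
errors = sum(h <= cut for h in men) + sum(h > cut for h in women)
print(f"misclassified: {errors / 100_000:.1%}")   # roughly 1 in 6 wrong
```

The group means differ by a full 13 cm, yet the best possible single threshold still gets about one person in six wrong.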
While you’re changing your diet, don’t forget exercise.
I’d heard a number like one in seven are gay. Besides, it follows from the poem:
…But the child that is born on the Sabbath day
is fair and wise and good and gay…
If half of the one in seven that are gay are women, does that mean that only 1 in 14 of men are gay?
“For example, Darin Brown, allegedly a mathematician”
‘Allegedly’…hmmm…a very strange comment coming from someone who calls disconnected spaces “separated” and independent events “Bayesian”…
http://www.genealogy.math.ndsu.nodak.edu/html/id.phtml?id=83534
Your “aggregate/individual” argument (puppeted by everyone) is silly to anyone who actually has read the Rodriguez paper (which I doubt you even have) or understands what coef. of determination means (which I doubt you do).
The 4% coef. of determination *IS* an aggregate number, based on the entire population!
“The study that we’re looking at was attempting to see whether or not the aggregate data showed the same kind of individual correlation that it showed in aggregate.”
Tell me, Mark, what is the difference between “individual correlation” and an “aggregate correlation”?
I have already explained how the “aggregate correlation” is a mathematical artifact:
“If you take a cloud of data points that are essentially random (no correlation) and you break them into 5 subgroups by magnitude of the predictor variable and choose the median outcome of the response variable for each subgroup, this will have the effect of obscuring the lack of correlation. It’s the statistical equivalent of squinting your eyes so you can’t see any details anymore.”
Darin:
Your “explanation” for why aggregate correlation is an artifact is pure gibberish, as I suspect you know. *If* that were a valid argument, it would mean that pretty much the entire science of statistical analysis of populations would be completely useless, which it isn’t.
The reason I suspect you know that is because you handwaved your way around the fact that in this comment thread, there are multiple examples of aggregate correlation that *don’t* generate valid results when applied to individuals.
Here’s another example, just to remind you, and save you the trouble of actually reading the other comments:
If I look at a population as a whole, one thing that will stand out very clearly is that among people who work in salaried jobs, there is a strong correlation between level of education and salary. As a population-level pattern, the trend is absolutely undeniable: high school dropouts tend to make less than high school graduates, who on average make less than college graduates, who on average make less than people with master’s degrees, etc.
But this doesn’t work particularly well when applied to individuals. There are definitely many PhDs who get very generous salaries; but there are also plenty of PhDs who barely manage to make a living. Given an individual, if all you know about them is their level of education, you cannot
make any good predictions about their salary.
Do you seriously want to argue that the aggregate correlation between salary and educational level is a meaningless artifact?
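And to put some numbers behind the “squinting” claim itself, here’s a minimal simulation of exactly the quintile-and-median procedure you describe, run on synthetic data. If binning really manufactured correlations, the quintile medians of pure noise would show a trend. They don’t: the medians come out flat for uncorrelated data, and show a steady trend only when there is a real (even if weak) underlying signal:

```python
# Synthetic data only: quintile-median "squinting" applied to pure
# noise vs. to a weak-but-real correlation.
import random

random.seed(2)

def quintile_medians(pairs):
    """Sort by the predictor, cut into 5 equal bins, and return the
    median outcome within each bin."""
    pairs = sorted(pairs)
    k = len(pairs) // 5
    meds = []
    for i in range(5):
        ys = sorted(y for _, y in pairs[i * k:(i + 1) * k])
        meds.append(ys[len(ys) // 2])
    return meds

n = 10_000
xs = [random.gauss(0, 1) for _ in range(n)]
no_corr   = [(x, random.gauss(0, 1)) for x in xs]            # no signal
weak_corr = [(x, 0.2 * x + random.gauss(0, 1)) for x in xs]  # weak signal

print("medians, no correlation:  ",
      [f"{m:+.2f}" for m in quintile_medians(no_corr)])      # flat, near 0
print("medians, weak correlation:",
      [f"{m:+.2f}" for m in quintile_medians(weak_corr)])    # steady trend
```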