My fellow SBer Craig Hilberth at the Cheerful Oncologist writes about a
meta-analysis that purports to show a positive effect of intercessory prayer. Neither Craig
nor I have access to the full paper. But what we do know is that the meta-analysis is claimed to
show a result of g=-0.171, p=0.015.
This really ticks me off. Why? Because g=-0.17 is not significant. Meta-analysis generally considers |g|=0.20 to be the minimum cutoff for statistical significance.
Briefly, what is meta-analysis? The idea is this: suppose you’ve got a bunch of studies of the same topic. Meta-analysis lets you take the data from all of the studies in the group and attempt to combine them. What you get are aggregate means and standard deviations, along with measures of the significance and reliability of those aggregate figures.
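To make the mechanics concrete, here’s a rough sketch in Python of the simplest (fixed-effects) way of pooling per-study estimates. The numbers are made up; they’re not from any of the prayer studies.

```python
import numpy as np

# Hypothetical per-study effect estimates and their standard errors
# (made-up numbers, just to show the mechanics of pooling).
effects = np.array([0.10, -0.25, 0.05, -0.30])
std_errs = np.array([0.15, 0.20, 0.10, 0.25])

# Fixed-effects pooling: weight each study by the inverse of its variance.
weights = 1.0 / std_errs**2
pooled_effect = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

print(f"pooled effect = {pooled_effect:.3f} +/- {pooled_se:.3f}")
```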
Meta-analysis is a useful technique, but it’s very prone to a number of errors. It’s very easy to manipulate a meta-analysis to make it say whatever you want; and even if you’re being
scrupulously honest, it’s prone to sampling bias. After all, since meta-analysis is based on
combining the results of multiple published studies, the sample is only drawn from the studies that were published. And one thing that we know is that in most fields, it’s much harder to publish negative results than positive ones. So the published data that’s used as input to meta-analysis tends to incorporate a positive bias. There are techniques to try to work around
that, but it’s hard to accurately correct for bias in data when you have no actual measurements
to tell you how biased your data is.
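To see why that bias matters, here’s a toy simulation (entirely made up, not a model of the actual prayer studies): if only the studies that happen to show a “significant” positive result get published, then pooling the published results drifts away from the truth even when the true effect is zero.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.0     # in this toy world, the treatment does nothing
n_per_group = 50      # subjects per arm in each simulated study
n_studies = 500       # studies actually run

published = []
for _ in range(n_studies):
    treated = rng.normal(true_effect, 1.0, n_per_group)
    control = rng.normal(0.0, 1.0, n_per_group)
    diff = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / n_per_group + control.var(ddof=1) / n_per_group)
    # Crude publication filter: only positive, "significant-looking" results get written up.
    if diff / se > 1.96:
        published.append(diff)

print(f"{len(published)} of {n_studies} studies 'published'")
print(f"mean effect among published studies: {np.mean(published):.3f} (true effect is 0)")
```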
So getting back to the meta-analysis results that they cited, what’s g? g, also called “Hedges’ g”, is a measure of how much the overall data set of the combined studies differs from the individual data sets’ means. g is a measure of the significance
of any aggregate result from combining the studies. The idea is, you’ve got a bunch of studies, each of which has a control group and a study group. You compute the aggregate mean for both the study and control groups, take the difference, and divide it by the aggregate standard deviation. That’s g. Along with g, you compute a p-value, which essentially measures the reliability of the g figure computed from the aggregate data.
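Here’s a rough sketch of that computation for a single study, with invented data. (Hedges’ g also applies a small-sample correction to the plain standardized mean difference; none of these numbers come from the prayer studies.)

```python
import numpy as np

def hedges_g(treatment, control):
    """Standardized mean difference with Hedges' small-sample correction."""
    nt, nc = len(treatment), len(control)
    # Pooled standard deviation across the two groups.
    pooled_var = ((nt - 1) * np.var(treatment, ddof=1) +
                  (nc - 1) * np.var(control, ddof=1)) / (nt + nc - 2)
    d = (np.mean(treatment) - np.mean(control)) / np.sqrt(pooled_var)
    # Hedges' correction factor (approximate form).
    correction = 1.0 - 3.0 / (4.0 * (nt + nc) - 9.0)
    return d * correction

rng = np.random.default_rng(1)
treatment = rng.normal(0.1, 1.0, 80)   # made-up "treated" outcomes
control = rng.normal(0.0, 1.0, 80)     # made-up control outcomes
print(f"g = {hedges_g(treatment, control):.3f}")
```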
Assuming a fixed effects model – that is, that the studies are essentially compatible, and all measuring the same basic events – the minimum level at which g is considered significant is |g|=0.2, with a minimum p-value of 0.05.
This meta-analysis has |g|=0.17, with a p-value of 0.015. So they’re well below the minimum level of statistical significance for a fixed effects model meta-analysis, and their p-value is less than one third of the level at which a |g|=0.2 would be considered significant.
So – what it comes down to is, they did a meta-analysis which produced no meaningful results, and they’re trying to spin it as a “small but statistically significant result”. In other words, they’re misrepresenting their results to try to claim that they say what they wanted them to say.
Larry Moran reported on this as well. Commenter “nicholas” says that the meta-analysis includes both the study by Elisabeth Targ, which is known to have been fraudulent, and the Columbia University study, which is widely believed to be fraudulent (one author withdrew from the paper, another was convicted of fraud in unrelated incidents, and the third has committed plagiarism).
I just made a comment which is stalled in the hopper, waiting for approval, because it has 3 links.
It occurs to me the world could use a good basics post on statistics as found in medical and other studies: sample sizes, confidence, null hypotheses, confidence limits, goodness-of-fit, that sort of stuff.
Mustafa:
That’s just pathetic – doing a meta-analysis that includes studies with artificially positive results, they still couldn’t produce a statistically significant result.
Reminds me of that Doonesbury strip where one of the characters says, apropos of some shady Pentagon dealings, “Apparently a reliable cheat is still some years off.”
Mark,
I just emailed a copy of the study to you at your gmail account. Look for it.
Small p-values are considered more significant. P-values of 0.05 are the maximum limit for statistical significance, not the minimum.
Of course, I’m a Bayesian, so I consider the p-value approach to be sub-optimal in any case.
Zombie
What you’re asking for is, I think, very worthwhile, but it’s too much for one post. Over at dailyKos, I’ve posted a bunch of Stats101 diaries (I have the same handle there and here), though not focusing on medicine.
Canuckistani
I think Mark is right, but I see where you got confused – I had to read it a couple of times myself, and I’m a stats person. I THINK that the p-value Mark is talking about is one for testing the legitimacy of combining the studies. A p that’s too low REJECTS the null that the means (or whatever) being combined are equivalent. So, a low p would mean “don’t combine these”, I think.
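If that’s the right reading, the usual statistic for it is Cochran’s Q, which tests whether the per-study effects are homogeneous enough to pool. Here’s a rough sketch with made-up numbers (I have no idea what the paper actually reports):

```python
import numpy as np
from scipy import stats

# Hypothetical per-study effect sizes and standard errors (made up).
effects = np.array([0.10, -0.25, 0.05, -0.30, 0.15])
std_errs = np.array([0.15, 0.20, 0.10, 0.25, 0.18])

weights = 1.0 / std_errs**2
pooled = np.sum(weights * effects) / np.sum(weights)

# Cochran's Q: weighted squared deviations of each study from the pooled mean.
Q = np.sum(weights * (effects - pooled)**2)
df = len(effects) - 1
p_homogeneity = stats.chi2.sf(Q, df)

# A *small* p here argues against lumping the studies together.
print(f"Q = {Q:.2f}, df = {df}, p = {p_homogeneity:.3f}")
```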
I tried to get the whole article, but unfortunately my university doesn’t have online access. This journal looks pretty obscure.
Speaking of statistics… I was really bothered by my school’s paper today. They were talking about a crime survey done in the local city (Binghamton). They claimed there were some problems because of the “small sample” size (220 out of 47,380). But that’s meaningless: a 0.5% sampling fraction can be perfectly fine (see the quick check below). The real problem is the selection criteria, and even a sample of 50% of the population will be “too small” if the sampling criteria aren’t chosen judiciously. One of the extrapolations drawn from the survey is that 61% of residents are unemployed. Hmm… seems like the data might as well be thrown out.
Article in question: http://www.bupipedream.com/pipeline_web/display_article.php?id=4516
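For what it’s worth, here’s the back-of-the-envelope version, assuming simple random sampling (which this survey may well not have used): the margin of error depends on the absolute sample size, not the 0.5% sampling fraction.

```python
import math

n = 220          # respondents
N = 47380        # population (Binghamton, per the article)
p = 0.5          # worst-case proportion

# 95% margin of error for a simple random sample, with finite-population correction.
fpc = math.sqrt((N - n) / (N - 1))
moe = 1.96 * math.sqrt(p * (1 - p) / n) * fpc

print(f"margin of error ~ {moe * 100:.1f} percentage points")
```

That works out to roughly plus-or-minus 6.6 percentage points, which is unremarkable for a city survey; the selection bias is the real worry.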
OT: Whatever happened to Stellation? Back when I was doing CSCW research, I was really excited to follow the project and was disappointed to see it disappear.
Ben Goldacre at Bad Science points out that the Royal Statistical Society have made the latest issue of their journal ‘Significance’ available for free here.
Mark – Are you mixing up statistical and practical significance? You quote a p-value below 0.05, so it is statistically significant: the magnitude of the effect size is not what’s relevant for this. However, it is apparently not of practical significance, as Hedges’ g is below the 0.2 cutoff.
Michael:
Stellation ended up dying for two reasons. One was that IBM bought Rational, and it was hard to justify working on a SCM system that we were going to give away when ClearCase really needed some help from Research. The other was pure resource stuff – I was doing the work of supporting the open-source community around Stellation single-handedly, and trying to do that *and* still have enough time to get any research done was really difficult. After a while, I just couldn’t do it by myself anymore, and there simply wasn’t any budget to get anyone to help me.
A lot of Stellation-ish ideas wound up in the project I worked on for the last two years, which is finally going public. It’s called “Jazz”, and the public website is “Jazz.net”. There’s a public preview release due out soon; I don’t know exactly when, but keep an eye on Jazz.net. It’s a fantastic system.
Peter,
I went and looked at the Wikipedia article on Effect Size, and as near as I can tell, it actually supports the study authors’ interpretation. g is a measure of effect size, which means it relates to practical significance, not statistical significance. I think it must be the case that |g| > 0.2 is the nominal level for practical significance, not statistical significance as MarkCC stated. I believe the null hypothesis relevant to the p-value is that the underlying difference of means for prayer versus non-prayer is equal to 0. My understanding is that the claimed p-value means that under the assumption that prayer has no effect, the probability of getting |g| > 0.171 for these sample sizes is 0.015. I don’t think the legitimacy of combining the studies is the hypothesis being tested by either of these statistics.
Shorter me: I think MarkCC got this one wrong. As I am not a big supporter of woo, I would be happy if someone were to demonstrate a flaw in my understanding. Unfortunately, it’s hard to write accurate criticism without access to the paper.
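To sketch the arithmetic behind that reading: the sample sizes below are invented, chosen only so that |g| around 0.17 lands near p = 0.015, since I don’t have the paper’s actual Ns.

```python
import math
from scipy import stats

def p_value_for_g(g, n_treat, n_ctrl):
    """Two-sided p-value for a standardized mean difference under the null of no effect."""
    # Large-sample standard error of g for a two-group comparison.
    se = math.sqrt((n_treat + n_ctrl) / (n_treat * n_ctrl)
                   + g**2 / (2.0 * (n_treat + n_ctrl)))
    z = g / se
    return 2.0 * stats.norm.sf(abs(z))

# Invented sample sizes -- just to show that a modest |g| can be
# "statistically significant" once the combined N is large enough.
print(p_value_for_g(-0.171, 400, 400))
```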
Interestingly, the p-value approach and the Bayesian approach can sometimes disagree strongly about correct inferences. I wonder what a Bayesian meta-analysis would conclude.
I can’t see there being much of a difference: the model is pretty simple, it just uses a weighted mean of the effect sizes. Assuming flat priors, I suspect you will get exactly the same result. The Bayesian way of setting up the analysis is more elegant, but the results will be the same.
What would make a difference is something Mark alluded to. Apparently the analysis was done using a fixed effects model, so the assumption is that prayer had exactly the same effect in all studies. However, one could imagine that different studies had different effects, e.g. if different denominations of prayers have different amounts of other-worldly influence. In that case, a random effects model would be more appropriate, and would lead to a higher p-value (and wider confidence intervals).
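A rough sketch of the difference, with made-up effect sizes (not the studies from the paper): the random-effects weights fold in an estimated between-study variance, which widens the interval whenever the studies disagree.

```python
import numpy as np
from scipy import stats

# Made-up per-study effect sizes (g) and standard errors.
g = np.array([-0.40, -0.05, 0.10, -0.35, -0.20])
se = np.array([0.12, 0.15, 0.10, 0.20, 0.14])

# Fixed effects: inverse-variance weights.
w_fe = 1.0 / se**2
mu_fe = np.sum(w_fe * g) / np.sum(w_fe)
se_fe = np.sqrt(1.0 / np.sum(w_fe))

# Random effects (DerSimonian-Laird): add a between-study variance tau^2.
Q = np.sum(w_fe * (g - mu_fe)**2)
k = len(g)
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w_fe) - np.sum(w_fe**2) / np.sum(w_fe)))
w_re = 1.0 / (se**2 + tau2)
mu_re = np.sum(w_re * g) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))

for name, mu, s in [("fixed", mu_fe, se_fe), ("random", mu_re, se_re)]:
    p = 2 * stats.norm.sf(abs(mu / s))
    print(f"{name:6s}: g = {mu:+.3f}, 95% CI +/- {1.96 * s:.3f}, p = {p:.3f}")
```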
Bob
Canuckistani: OK, my mistake.
The real problem with this meta-analysis (as other comments here have pointed out) is the data that was combined, which included fraudulent articles. GIGO applies even to meta analysis.
Thanks to all who pointed out my error in interpreting the p-value of the meta-analysis. I’m going to be away from the network most of today, but I’m bringing a few articles on meta-analysis. When I’m sure that I’m properly understanding it, I’ll post a correction. (I don’t want to be half-assed, and post a correction now where I make another similar error before I make sure that I’ve corrected my misunderstandings!)
Peter:
Can I suggest a minor rephrase? Instead of “GIGO applies even to meta analysis”, I’d say “GIGO applies especially to meta-analysis”. Since meta-analysis doesn’t include any experimental methodology of its own, but rather just does mathematical analysis to combine the results of other experiments, it’s a perfect example of GIGO: if you include garbage data in a meta-analysis, then your meta-analysis becomes pure garbage.
I’ve published just this sort of meta-analysis on a medical topic. Small g (or, as I would call it, a ‘less than small effect size’) but p < 0.05.
Mark: Right you are.
GIGO applies ESPECIALLY here. Meta-analysis of garbage is, so to speak, concentrated garbage.
If you combine real-life apples and oranges, you get fruit salad. But in meta-analysis, combining apples and oranges gives you mush.
MarkCC wrote:
Cool. I actually interviewed with CUE back when I completed my Master’s in 2004, so I’ve been aware of Jazz. I didn’t realize it was actually coming out for a public release. That’s very cool.
My coauthor, economics professor Philip V. Fellman, and I have published on the meta-problem of analytically comparing Competitive Intelligence consultants.
Competitive Intelligence (CI) is a multi-billion dollar industry, which is finally being taught at some business schools. It is the white side of what, on the dark side, is Industrial Espionage.
The end of the Cold War flooded the field with ex-spooks entering the private sector. The field of CI is still fairly ad hoc, even with its first textbooks and Proceedings of its professional organization’s conferences.
So Philip V. Fellman and I “went meta” on them and published about meta-CI, applying the established CI methodologies to the CI industry and its major players.
To begin with, what data is there for or against the null hypothesis: that CI has zero correlation with the bottom-line performance of the firms that do it in-house or outsource it?
The issue becomes: how do you get good bottom-line data from firms that specialize in manipulating the bottom line? It led us to our multiyear effort (several publications so far, the Big Book still in draft form) on Mathematical Disinformation Theory.
But that’s another meta-story…