In yet another attempt to describe why scientists get things wrong, Nature this month featured a story headlined "Scientific method: Statistical errors," by Regina Nuzzo. The story highlights a practice that confuses scientists and journalists alike – calculations of statistical significance.
The material on misuse of statistics could be useful, though not new, and it makes this feature more focused than many other recent stories about the growing concern over irreproducible results. Previous stories in the New Yorker, The Economist and The New York Times are critiqued on the Tracker here, here and here. But the Nature story does not provide evidence to back the contention that the problem pervades all of science rather than a few fields, and there is something misleading about the anecdote used to open the story.
This anecdote uses a social scientist, Matt Motyl. We’re told his study turned up an apparent connection between political affiliation and the ability to see the world in shades of grey.
“Sensitive to controversies over reproducibility, Motyl and his adviser, Brian Nosek, decided to replicate the study.” With extra data, the p value came out as 0.59 – not even close to the conventional level of significance, 0.05.
A reader would likely conclude that this recent concern about reproducibility has already done some good, pushing working scientists to be more careful. But wait. I was sure Nosek had figured in a similar piece I wrote about statistical shenanigans two years ago. How could that be? But there he was, not in an example but as one of the crusaders for reproducibility. And according to this story, he’s also a recipient of a multi-million-dollar grant from a couple of billionaires who are on a crusade against irreproducible results.
The program’s signature grantee is the Center for Open Science, a project of psychologist Brian Nosek to improve openness and reproducibility of research. The center was established by the Arnold Foundation last year with a $5.25 million grant, and has received an additional $1.5 million for its efforts to test past studies by replicating them.
I guess money like that would make a person “sensitive”. Isn’t it convenient that Nosek and his student are providing the anecdotal evidence that this expensive venture is worthwhile? Why didn’t the Nature author mention this affiliation and massive grant, rather than painting Nosek and his student as some kind of representative example plucked from the scientific community?
Next we hear from Stanford University physician John Ioannidis, who has become a fixture in stories on this topic. We’re told that he’s found that “most published findings are false”. Can a medical doctor really make such an evaluation of the literature in physical chemistry? In cosmology? In atomic physics? In glaciology?
If he’s just talking about medical research, then it’s the obligation of the journalist to be clear, precise and explicit in stating this, especially in a piece that bills itself as being about “science”.
The claim by Ioannidis is frustratingly vague. By “false” does he mean that the right answer is outside of the error bars? Does he mean the data are in error or that scientists misinterpret the conclusions? What is his evidence that all this published research is false? Published where? Is he counting all the less reputable journals?
There’s a link to an Ioannidis paper but the answers are not readily accessible there. Since these questions are crucial to the premise of the story, they should be addressed in the story.
The part of the Nature story on scientists misunderstanding statistics is pretty good.
There’s a decent section on the history of p-values, explaining their original intent as a way to flag results that might be worth further examination. Later, p-values morphed into “a common index for the strength of evidence.” P-values are not a measure of the likelihood that a hypothesis is false. They are a measure of the probability of getting a result at least as extreme as the one observed if a null hypothesis (i.e. no correlation between x and y) is true.
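To make the distinction concrete, here is a minimal sketch in Python (my own illustration, not from the Nature story; the sample size and effect are arbitrary assumptions). The analytic p-value from a correlation test closely matches the fraction of shuffled, relationship-free datasets that produce a correlation at least as strong as the one observed – which is exactly what a p-value is supposed to measure.

```python
# Minimal sketch: a p value is the probability of a result at least this extreme
# if the null hypothesis (no correlation between x and y) were true.
# The sample size and effect below are arbitrary assumptions for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = 0.3 * x + rng.normal(size=40)      # data with a modest real relationship

r_obs, p_analytic = stats.pearsonr(x, y)

# Simulate the null by shuffling y, destroying any real x-y link, and count how
# often chance alone yields a correlation at least as strong as the observed one.
null_r = np.array([stats.pearsonr(x, rng.permutation(y))[0] for _ in range(20_000)])
p_null_sim = np.mean(np.abs(null_r) >= abs(r_obs))

print(f"observed r = {r_obs:.2f}, analytic p = {p_analytic:.3f}, "
      f"shuffle-test p = {p_null_sim:.3f}")
```

Neither number says anything about the probability that the hypothesis itself is true or false; both describe how surprising the data would be if the null were true.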
Scientists, according to Nature, share some of the blame for the proliferation of news nuggets about effects that are “statistically significant” but in a practical sense, insignificant:
Critics also bemoan the way that P values can encourage muddled thinking. A prime example is their tendency to deflect attention from the actual size of an effect. Last year, for example, a study of more than 19,000 people showed that those who meet their spouses online are less likely to divorce (p < 0.002) and more likely to have high marital satisfaction (p < 0.001) than those who meet offline (see Nature http://doi.org/rcg; 2013). That might have sounded impressive, but the effects were actually tiny: meeting online nudged the divorce rate from 7.67% down to 5.96%, and barely budged happiness from 5.48 to 5.64 on a 7-point scale.
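A quick back-of-the-envelope check shows how a sample that large turns a tiny difference into a "significant" one. (This is my own sketch: the quoted passage doesn't say how the roughly 19,000 respondents split between online and offline meetings, so the even split below is an assumption.)

```python
# Back-of-the-envelope check: with ~19,000 people, a divorce-rate gap of
# 7.67% vs 5.96% yields a tiny p value. The 50/50 split between groups is an
# assumption made only for this illustration.
import numpy as np
from scipy import stats

n_online, n_offline = 9_500, 9_500            # assumed group sizes
divorced_online = round(0.0596 * n_online)    # 5.96% of the online group
divorced_offline = round(0.0767 * n_offline)  # 7.67% of the offline group

# Two-proportion comparison via a chi-squared test on the 2x2 table
table = np.array([[divorced_online, n_online - divorced_online],
                  [divorced_offline, n_offline - divorced_offline]])
chi2, p, dof, expected = stats.chi2_contingency(table)

print(f"divorce rates: {divorced_online/n_online:.2%} vs {divorced_offline/n_offline:.2%}")
print(f"p = {p:.2g}")   # "significant", yet the practical gap is under 2 points
```

The p value comes out vanishingly small even though the practical difference in divorce rates is under two percentage points – which is the Nature story's point about effect size.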
There are a number of other experts in this story, including Uri Simonsohn from the University of Pennsylvania, who has for several years been pointing out some of the misleading ways social scientists pull alleged patterns out of the noise.
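The simplest version of the trap Simonsohn describes is worth seeing in miniature. The toy simulation below (my own sketch, not Simonsohn's analysis or his data) compares two arbitrary groups on many pure-noise outcome variables and then reports whichever comparison looks best.

```python
# Toy example of pattern-fishing: test 20 unrelated outcome variables against
# pure-noise data and report whichever comparison comes out "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng()                 # unseeded so each run differs
n_subjects, n_outcomes = 50, 20

group = rng.integers(0, 2, n_subjects)                 # two arbitrary groups, no real difference
outcomes = rng.normal(size=(n_subjects, n_outcomes))   # pure noise

pvals = [stats.ttest_ind(outcomes[group == 0, k], outcomes[group == 1, k]).pvalue
         for k in range(n_outcomes)]
best = int(np.argmin(pvals))
print(f"best of {n_outcomes} noise-only comparisons: outcome {best}, p = {pvals[best]:.3f}")
```

Run it a few times: with 20 tries, the smallest p value slips below 0.05 more often than not, even though nothing real is going on.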
But does this statistics problem apply beyond medical research and social science, as the story’s wording implies?
One result is an abundance of confusion about what the P value means. Consider Motyl's study about political extremists. Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm. But they would be wrong.
Most social scientists might make this error – maybe; that’s plausible at least. But most scientists? Physicists use statistics, but in a very different way, based on characterizing the specific “background” events that might mimic, say, a Higgs boson, a particular kind of neutrino or a decaying proton. Relative likelihoods are taken into account, and more stringent standards are used for significance.
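For readers who want to see why a p value of 0.01 is not a 1% false-alarm rate, here is a rough simulation (my own sketch, not from the Nature piece or the studies it cites). The 50/50 prior odds, the effect size and the sample size are all assumptions; the point is only that among results landing near p = 0.01, the share coming from a true null is typically far above 1%.

```python
# Rough sketch: among experiments that happen to land near p = 0.01, how many
# actually had no real effect? The 50/50 prior, effect size and sample size
# are all assumptions chosen for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_per_group, effect = 100_000, 30, 0.5

null_is_true = rng.random(n_experiments) < 0.5        # half the hypotheses are duds
a = rng.normal(0.0, 1.0, (n_experiments, n_per_group))
b = rng.normal(0.0, 1.0, (n_experiments, n_per_group))
b += np.where(null_is_true, 0.0, effect)[:, None]     # add the effect only when it's real

# Two-sample t test (equal n, pooled variance) computed for every experiment at once
se = np.sqrt((a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)) / n_per_group)
t = (b.mean(axis=1) - a.mean(axis=1)) / se
pvals = 2 * stats.t.sf(np.abs(t), df=2 * n_per_group - 2)

near_001 = (pvals > 0.005) & (pvals < 0.015)          # results that "achieved" p ~ 0.01
print(f"Share of p~0.01 results where the null was true: "
      f"{null_is_true[near_001].mean():.0%}")          # well above 1%
```

Physicists guard against this partly by demanding far smaller p values – the "five sigma" convention corresponds to roughly 3 in 10 million – and partly by modelling the background processes explicitly.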
If you cover just one area of science, it’s dangerous to assume that the methods and standards of your particular branch are universal, or that the problems besetting one field necessarily afflict all the others. This kind of overgeneralization can feed the public’s misconceptions about evolution and climate science. It can give fuel to AIDS deniers, homeopaths, anti-vaccine nuts and other cranks.
Still, the Nature piece includes some enlightening material, especially for journalists who don’t understand the notion of statistical significance, and above all for those who still wrongly think it gives the probability that a given conclusion is false.
Grist for the Mill:
In this piece, Steve Novella gives a more nuanced explanation of Ioannidis' claim about wrong research.
Wikipedia’s history of the p-value is clear and a bit easier to digest, since it isn’t conflated with Ioannidis’ claim.
There are a number of good primers, including one in the Field Guide for Science Writers.