What constitutes proof? How much weight can we put on research results?
I’ve been reporting on memory research for 20 years, and this issue has always been at the back of my mind. Do my readers understand these questions? Do they have the background and training to give the proper amount of weight to these particular research findings? I put in hints and code words (“pilot study”; “this study confirms”; “adds to the evidence”; “conclusive”; and so on), but are these enough?
So here is the article I’ve always meant to write.
First of all: proof. I never talk about proof. "Proof", in the colloquial sense of absolute certainty, is not something scientists are ever comfortable claiming. All we can do is weigh the evidence.
Now, weighing the evidence is what it's all about, and this has become progressively harder in every scientific field as we delve into the detail. Like quantum physics, genetics, and medicine, modern psychology is usually about statistical inference. Situations are rarely so clear-cut that we can point to one group of people who all did something, another identical group who all did something completely different, and a single point of difference between them that we can pounce on with joy and say: this is it, the smoking gun. This is what has this effect.
People are variable. Rarely do you have an experimental intervention that is so dramatic that it has an absolutely clear effect that doesn’t need abstruse statistics to reveal. And the statistics have become progressively more abstruse. Today there are so many different, and complex, tests, each one appropriate for a specific situation, that no one knows them all. Scientists learn the few they are told are appropriate for the sort of experiments they run, and then try to keep up when they are told a new test is better — more discerning, more subtle, better able to sort the wheat from the chaff. Is it true? At the end of the day it’s a matter of faith; few researchers have the statistical background to really understand the statistics they’re using.
So that tempers how much faith we can put in statistical results.
But the main point is simply understanding that it is a matter of statistics. Research is all about significance: is this result showing a significant difference, or not? And significance is a statistical term with a very precise meaning. It means a statistical test has been passed: that, as a matter of probability (5% is the standard threshold; 1% is great; 0.1% is absolutely terrific), the experimental result is unlikely to have occurred by chance. That is, at the standard 5% level, if there were really no difference between the groups, a difference this large would be expected to turn up by chance only about one time in twenty.
In other words: it could have occurred by chance.
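To make that "one in twenty" concrete, here is a minimal simulation (in Python, with invented numbers, not from any real study). Both groups are drawn from the very same population, so any "significant" difference is pure chance; Welch's t statistic with a cutoff of about 2.0 stands in for the standard 5% test.

```python
import random
import statistics

random.seed(42)

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5)

n_experiments = 2000
n_per_group = 50
critical_t = 1.98   # two-tailed 5% cutoff for roughly 98 degrees of freedom

false_positives = 0
for _ in range(n_experiments):
    # Both groups come from the SAME population (mean 100, sd 15):
    # there is no real effect, so any "difference" is random noise.
    group_a = [random.gauss(100, 15) for _ in range(n_per_group)]
    group_b = [random.gauss(100, 15) for _ in range(n_per_group)]
    if abs(welch_t(group_a, group_b)) > critical_t:
        false_positives += 1

print(f"'Significant' results with no real effect: "
      f"{false_positives / n_experiments:.1%}")   # roughly 5%
```

Run enough experiments on a question where nothing is going on, and about one in twenty will still pass the test. That is exactly why a single "significant" result proves nothing on its own.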
That is why replication is so important.
When I report on a research study, I do so on the basis that it is interesting. That it is part of a body of research, or that it may become part of a body of research.
On its own, no experimental result is proof of anything.
So the important thing is building up experiments, preferably by different researchers, on the same question. We want replication, that is, repeating the experiment in exactly the same way; we want variations, both broad and fine, in the experimental procedure; and we want different approaches that connect the results to a broader picture.
It’s all about consistency.
Conspiracy theorists can rant against the scientific establishment, and claim that it ignores findings that don’t fit into the established beliefs, but the issue is rather more subtle. No one is more excited than a scientist by a truly new finding, but the less it is consistent with all the other evidence, the greater the evidence must be.
Repeat after me: no single study is proof. Ever. Of anything.
Because scientists make mistakes. Because scientists are human and to be human is to see the world through our minds, not our eyes. Because physical objects (e.g., cellular material) can become contaminated; because human subjects are influenced by far too many factors to list, including the experimenter's beliefs. And because, at the end of the day, results are a matter of statistical probability.
So, we have to weigh the evidence. We weigh it on the basis of numbers of subjects (was it a pilot study, a large study, a very large study — the greater the number of experimental subjects, the less likely it is that the difference occurred by chance), on the basis of type of study (e.g., was it an experimental intervention or a population-based epidemiological study), on the basis of how well the experimenters designed the study, on the statistical significance (is the probability that this result occurred by chance five in a hundred, or one in a thousand, or one in ten thousand?).
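The point about subject numbers can be sketched too (again in Python, with an invented population of mean 100 and standard deviation 15): the typical size of a purely chance gap between two group means shrinks as the groups grow, which is why a large study can trust a small difference that a pilot study cannot.

```python
import random
import statistics

random.seed(0)

def typical_chance_gap(n_per_group, trials=2000):
    """Average absolute difference between two group means when
    both groups come from the same population (mean 100, sd 15)."""
    gaps = []
    for _ in range(trials):
        a = statistics.fmean(random.gauss(100, 15) for _ in range(n_per_group))
        b = statistics.fmean(random.gauss(100, 15) for _ in range(n_per_group))
        gaps.append(abs(a - b))
    return statistics.fmean(gaps)

for n in (10, 100, 1000):
    # The chance gap shrinks roughly with the square root of group size.
    print(f"n = {n:4d}: typical chance gap = {typical_chance_gap(n):.2f}")
```

With ten subjects per group, a several-point gap can appear out of nowhere; with a thousand, chance alone can barely move the means apart, so any sizeable gap is much more likely to be real.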
And, most of all, we weigh it on the basis of how many studies are all saying the same thing, and how well different results are all chiming in to tell a consistent story. So, if we're wondering if blueberries really are good for the brain, we look at animal studies and human studies and cell studies. Human studies are important because, at the end of the day, we have to confirm these findings in our own species. But we can't control all the variables with humans as we can with captive animals, so animal studies are needed to construct the procedurally tight experiments we need to truly compare the effects of, say, a daily dose of blueberries. And cell studies are important to tell us why blueberries might have this effect.
If we can point to a specific effect in the cells that could have the sort of cognitive effect we have observed, then we have a much stronger basis for believing in the effect.
We also weigh it in the knowledge that this is consistent with a much larger body of research looking at the effects of fruit and vegetables and their constituents.
As lay people trying to weigh the evidence (and given the extreme specialization needed now in all the sciences, everyone is a ‘lay person’ in most areas), we also need to realize that different standards are necessary for different results.
I’ve been eating blueberries (or boysenberries or blackberries) every day for years, since I saw the first reports that blueberries were good for the aging brain. Why not? I like them; they fit into my diet (I have them in a smoothie either for breakfast or lunch); they are very unlikely to do me any harm.
My standard for taking a drug would be WAY higher (which is why I heartily recommend a recent article in The Atlantic — “Lies, Damned Lies, and Medical Science”).
When deciding whether to act on research findings, you need to weigh the costs and benefits. You also should be making different decisions depending on whether you are making the decision for an individual or a group. Experimental results are always only pointers at an individual level. Group differences, I say again, are statistical. That means, some individuals will react one way, and some another. No research result will tell you whether something is true for an individual (witness those people who smoke heavily for decades and live till 90 — but the odds are heavily against you).
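The gap between a group result and an individual prediction can be made concrete with one last sketch (hypothetical numbers again): even when an intervention genuinely shifts a group's average, the two distributions overlap so much that many untreated individuals still outscore the treated group's average.

```python
import random
import statistics

random.seed(1)

# Hypothetical scores: the intervention shifts the group mean by 5 points,
# but individuals in both groups vary widely (sd 15).
group_a = [random.gauss(100, 15) for _ in range(10_000)]  # no intervention
group_b = [random.gauss(105, 15) for _ in range(10_000)]  # intervention

mean_b = statistics.fmean(group_b)
beat_it = sum(score > mean_b for score in group_a) / len(group_a)

print(f"Group means: A = {statistics.fmean(group_a):.1f}, B = {mean_b:.1f}")
print(f"Share of non-intervention individuals scoring above the "
      f"intervention group's average: {beat_it:.0%}")
```

With these numbers, well over a third of the untreated group still beats the treated group's average. The group difference is real, and it is the right basis for policy; it just cannot tell you which side of the overlap any one person will land on.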