You may have heard of the replication crisis. It mostly refers to studies in the social sciences that get a lot of media attention when they're first published but then fail to produce the same results when other researchers try to replicate them. The reason it's become a crisis is that a lot of studies fail to replicate, calling into question the whole enterprise of the social sciences.
But one problem with this is that it's hard to replicate a study. Small things can make a big difference sometimes, and doing a precise replication isn't always easy. A recent paper shows this pretty dramatically. First off, here's what the researchers were trying to replicate:
The study was published in 2021 and investigated an association of glucagon-like peptide 1 receptor agonists (GLP-1RA) and chronic lower respiratory disease (CLRD) exacerbation in a population with type 2 diabetes mellitus (T2D) and CLRD.
This is no dashed-off social science experiment. This is a big and very serious medical study with very specific methods and goals. What's more, the researchers weren't trying to replicate the whole thing. They were only trying to replicate one little part: deciding which patients to include in the study. Here are the criteria:
New users of GLP1-RA add-on therapy aged more than 17 years with at least 1 outpatient or 2 inpatient encounters with T2D and CLRD in the year before the index date with no prior insulin or dipeptidyl peptidase 4 inhibitors exposure and no prior type 1 diabetes mellitus, cystic fibrosis, lung cancer, pulmonary embolism, pulmonary hypertension, conditions requiring chronic systemic corticosteroid therapy within a year or pregnancy at the index date.
Got that? The researchers gathered together nine teams of qualified experts whose only task was to slice and dice a database to come up with a cohort of patients who met the inclusion criteria. The results were dismal. Interpretations of what the criteria meant were all over the map and none of the nine teams came close to matching the cohort from the original study. They couldn't even come close to agreeing on how many people were in the cohort:
The green bars represent patients chosen by both the original study and the team of replicators. The largest overlap is only 35%. The number of people chosen for the study ranges chaotically from 2,000 to 64,000. Every team deviated from the inclusion criteria in the original study in at least four different ways.
So replication is no walk in the park, even in the hard sciences where the procedures are presumably more concrete. But if replication is this hard, what chance do we have of properly replicating anything?
"So replication is no walk in the park, even in the hard sciences where the procedures are presumably more concrete. ..."
Of course exact replication is difficult if the original paper is badly written and the authors don't respond to queries. Did any of the groups trying to replicate this ask the authors for clarification about any confusing points?
And if you have actually discovered a robust effect, exact replication should not be necessary to observe it, since it will still be present under slightly different conditions.
Yes, a robust effect should be reproducible without going to extremes; the ability to be reproduced is part of robustness. The problem is not really the technicalities of reproduction. It's that a lot of studies reach conclusions that are either not really warranted or do not have the importance that the authors attribute to them. And the popular media pick up studies according to their ability to inspire interest, not according to their soundness.
The fact that a result is "statistically significant" is virtually meaningless in itself. That kind of test does not answer the right questions about most studies. Here's a piece from a while back that discusses p-values:
https://www.vox.com/science-and-health/2017/7/31/16021654/p-values-statistical-significance-redefine-0005
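To make that point concrete, here is a minimal, purely illustrative simulation (not drawn from either study; the sample sizes and experiment count are assumed): run many experiments in which the true effect is zero, and about 5% of them will still clear p < 0.05 by chance alone.

```python
# Illustrative sketch: simulate experiments with NO real effect and count how
# many still come out "statistically significant" at p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_per_group = 1000, 30  # assumed values for illustration

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(size=n_per_group)  # control group, no real effect
    b = rng.normal(size=n_per_group)  # "treatment" group, same distribution
    _, p = stats.ttest_ind(a, b)
    false_positives += p < 0.05

print(f"{false_positives / n_experiments:.1%} of null experiments were 'significant'")
# Expected output: roughly 5% -- significance alone says little about truth.
```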
Achieving exact replication is, well, exacting. But if you can come close, and others can come close, then at least consensus can emerge, and with luck meta-analysis can yield some sturdy insights.
And even if the insights or consensus aren't, ahem, exacting, they can still be enough to guide decisions and actions, at least in the near term and in the absence of any other (or most importantly, contravening) evidence. And in situations where doing nothing absent "proof" (not that proof exists in science) is foolish or dangerous or not an option at all, then good enough may have to do.
Science strives to minimize error, but it cannot and should not be held to some imaginary standard of being without error at all.
The obvious question is whether or not this study can be replicated. 😉
Also, how well did the cohort chosen by ChatGPT match the master cohort?
They did provide a potential solution:
"Sharing analytical code supported by a common data model and open-source tools allows reproducing a study unambiguously thereby preserving initial design choices."
The chance of properly replicating anything is precisely zero if we don't try. Science is about trying. Even the knowledge that a useful result could not be achieved is still useful, if frustrating.
But medicine is not a hard science. It isn't like physics or chemistry, with underlying immutable laws.
But I've done things like this with learning investigations. You can't just turn people loose with a set of criteria and expect them to all come out the same way. We were using a simple Likert-like coding scheme to characterize student activity in videos and it still didn't work. There is ALWAYS a need for a training period where people do their codes and then everyone gets together to discuss the variations in their coding, what they were seeing, and why they thought about it a particular way. Eventually, a common understanding emerges and from that point on, the codes are reproducible and reliable. We also found that reaching a common understanding is a lot harder and takes a lot longer if there are more than about 4 criteria. And that is true even with experts.
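As an illustration of the kind of check involved (a hypothetical sketch, not the commenter's actual procedure or data), inter-rater agreement on a coding scheme like this is commonly quantified with something like Cohen's kappa before and after the training period:

```python
# Hypothetical example: measure how well two coders agree on a Likert-like
# coding scheme using Cohen's kappa. The codes below are invented.
from sklearn.metrics import cohen_kappa_score

# Codes assigned by two coders to the same ten video segments (1-4 scale).
coder_a = [1, 2, 2, 3, 4, 1, 2, 3, 3, 4]
coder_b = [1, 2, 3, 3, 4, 1, 1, 3, 3, 4]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa = {kappa:.2f}")  # values near 1 indicate strong agreement
```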
Yes, this discussion is frustrating. First Drum criticizes replication in the social sciences, then he illustrates his complaints with a study in medicine that bears little resemblance to a behavioral study. For one thing, social science studies aim for random selection of subjects and random assignment to conditions. In this medical study, they were defining selection criteria and trying to be consistent across researchers. In most social science studies, research assistants conduct the tasks without knowing the overall hypothesis (usually they are students or paid staff), and the main concern is to use the methods consistently across subjects and groups, for which they are trained using written procedures that are vetted before any data are collected. None of that happens in a medical study.
Too many social science studies published in journals arise from doctoral dissertations. In addition to being newbies at research, the authors are rushing to finish the dissertation and feel pressured to publish so that they can get a job. In my doctoral program, we were required to replicate our own research before being allowed to present our work. Further, a good part of our training involved being assigned a published paper and told to replicate it. That doesn't happen in all programs.
There are problems in social science, largely with p-hacking, not consistency of subject selection. Yes, there are difficult technical details involved in replication, but I doubt Drum knows what they are. Not only is medicine not a hard science, but medical students receive far less training in methodology than social scientists, and nearly all of the techniques in biostatistics were first invented for dealing with human variability in social science. His disdain is offensive under the circumstances.
I really wish Drum would stick to speaking confidently only about the things he actually knows.
It is tough being a scientist......
A lot depends on who conducted the original study. If it was a pharmaceutical company I'd be very suspicious. They tend to measure many more outcomes than the one or two outcomes of primary interest. This increases the probability of a Type I statistical error (i.e., a false positive).
Furthermore, study design is important. Data that come from a double-blind, randomized controlled study are more trustworthy than those from a cohort study. In addition, the results of retrospective (as opposed to prospective) studies are also suspect.
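A back-of-the-envelope illustration of the multiple-outcomes point (the figure of 20 outcomes is assumed, not taken from any particular study):

```python
# If a study tests 20 independent outcomes at p < 0.05 and none of them reflects
# a real effect, the chance of at least one false positive is already large.
p_at_least_one_false_positive = 1 - 0.95 ** 20
print(f"{p_at_least_one_false_positive:.0%}")  # about 64%
```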
If I understand Kevin correctly, the authors did not try to reproduce the study themselves. They only tried to prove that reproduction is hard, so they chose a paper with a complicated set of test-person characteristics and farmed out the task of selecting candidates (only this task) to different teams. Then they compared the job these teams did to the original paper. I'd bet that the nine teams were not given the citation of the original paper, so they could not go ask questions.
This is silly. Anybody can get anybody to stumble. This does not mean that walking is hard.
Of course this is for reproducing statistical studies. Classic experimental science is easier to reproduce, though it depends on the original author's precision in describing the experiment. A typo can spoil the thing. I once had a fairly simple experiment fail because the original paper had a typo in the temperature.
One more point: Reproduction with somewhat altered parameters adds strength to a finding. If the precise selection criteria for candidates are essential for reproduction you have a finding with extremely limited usefulness since the vast majority of potential patients (if the study is in medicine) will not fit the criteria.
From someone who teaches research methods to PhD students.
If a study is really tough to replicate, meaning that it is sensitive to small choices in research design or sampling, it often means that the study at best points to something with a small effect size. A lot of the studies that had the hardest time replicating were psychological studies with very light treatments, such as a picture of money on the wall behind the computer they are taking a test on. They tended to be done on small and cheap samples. They didn't publish the studies that failed to get to p<0.05, but random chance pushed a few coefficients over that threshold.
There might be some real effect size, but so small that it isn't that important substantively.
The point of reporting statistics with error bars is that within the level of error specified, the results may fail to replicate even if there is a genuine effect and the researcher did everything right. This is due to random variability. In psychology we use p < .05 to specify error. Failure to replicate doesn't necessarily invalidate the findings. It leaves them indeterminate.
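A quick, hypothetical simulation of that point (the effect size and sample size are assumed for illustration): even when a small effect is real, a faithful replication with a modest sample frequently misses p < .05.

```python
# Sketch of the sampling-error point: a genuine but small effect, replicated
# with a modest sample, often fails to reach p < .05 purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, n_per_group, n_replications = 0.3, 30, 1000  # assumed values

successes = 0
for _ in range(n_replications):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_effect, 1.0, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    successes += p < 0.05

print(f"{successes / n_replications:.0%} of replications reached p < .05")
# With d = 0.3 and n = 30 per group, only roughly a fifth do, so many
# "failed" replications are expected even when the effect is real.
```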
Totally agree with ejfagan and add that using a p of 0.05 means you still get a lot of studies that are pretty meaningless. I would note that this study is a bit more complex with exclusion and inclusion criteria than many others. There are tons more studies with much simpler criteria and methods that could be more easily replicated.
Steve
Wow, the cited study and its result seem to me very valuable to the scientific process, in that they highlight a specific methodological problem. Classifying patients using what seem to be objective criteria can result in very dramatic differences in populations.
A follow-up question would be how different the groups admitted under the criteria actually are. However, you would have to establish a metric to measure those differences and calculate it for each of them, and by the time you've done that, you might as well have computerized the whole thing to begin with.
In fact, it's probably better to do it with a computer program, which can be run repeatedly with the same results, and shared with others.
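A rough sketch of what that could look like (the column names and criteria below are invented and drastically simplified; this is not the original study's code): the selection rules, and even an overlap metric like the one mentioned above, become a small script that anyone can rerun and share.

```python
# Hypothetical sketch: express inclusion criteria as code so the same cohort
# falls out every time the script runs. Column names are invented and the
# criteria are a tiny subset of the real protocol.
import pandas as pd

def select_cohort(patients: pd.DataFrame) -> pd.DataFrame:
    """Return patients meeting a simplified subset of the inclusion criteria."""
    return patients[
        (patients["age"] > 17)
        & (patients["new_glp1ra_user"])
        & (~patients["prior_insulin"])
        & (~patients["prior_type1_diabetes"])
    ]

def overlap(cohort_a: pd.DataFrame, cohort_b: pd.DataFrame) -> float:
    """Share of patients selected by both teams (Jaccard index on patient IDs)."""
    a, b = set(cohort_a["patient_id"]), set(cohort_b["patient_id"])
    return len(a & b) / len(a | b) if (a | b) else 1.0
```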
I've long been wary of any field or discipline that has to tack "Science!" onto its name in order to sound significant. It's like being a "VIP": if you have to tell people that you are, then you aren't.
All these non-replicable, tiny sample, poor stats studies seem to support this.
And this is important: if you're studying, say, a medication, and it works in select cases... if you compare it to cases where it didn't or couldn't work, it won't show efficacy.
When you shift the tested cohort, you can really mess things up. And sometimes, it's not obvious what was messed up to someone outside.
Lewis Thomas, early in his medical career, tried to duplicate a study showing that penicillin made rabbits' ears floppy. He failed repeatedly. Years later, other researchers duplicated the effect. It had to do with whether it was rabbit breeding season or not.
On the other hand, if the effect is so weak (social science), or so dependent on unknown conditions (medicine), then we shouldn't be using it to make decisions anyway...
I don't know that this actually shows what you think it shows, Kevin. To me it shows that "facts" that are hard to replicate should be treated a lot more tentatively than they are. And if replication is hard, that means that many of these supposed facts, in medicine and in social science, are in fact tentative.