Sunday Science Lesson: Why people mistakenly think RCTs (etc.) are always better

by Carl V Phillips

I recently completed a report in another subject area that explains and rebuts the naive belief, common among non-scientists (including some who have the title of scientist but are clearly not really scientists), that particular epidemiologic study types are always better, no matter what question you are trying to answer. I thought it might be worthwhile to post some of that here, since it has a lot of relevance to studies of THR.

Readers of this page will recall that I recently posted talking-points about why clinical trials (RCTs) are a stupid way to try to study THR. A more detailed version is here, and the summary of the summary is: RCTs, like all study designs, have advantages and disadvantages. It turns out that when studying medical treatments, the advantages are huge and the disadvantages almost disappear, whereas when trying to study real-world behavioral choices of free-living people, the disadvantages are pretty much fatal and what are sometimes advantages actually become disadvantages. Similarly, some other epidemiologic study designs (e.g., case-control studies) are generally best for studying cancer and other chronic diseases, which are caused by the interplay of myriad factors that occurred long before the event, but are not particularly advantageous for studying things like smoking cessation. Asking someone why he thinks he got cancer is utterly worthless, but asking someone why he quit smoking can provide pretty good data.

In-depth data about individual people practicing THR, like the CASAA testimonials, provide overwhelmingly compelling evidence about the success of THR and how it often plays out. These provide better answers to many questions than the types of studies you see in most epidemiology papers, whereas this would be a useless way to study cancer and a potentially seriously flawed way to study a medical intervention. Similarly, the convenience sample surveys of vapers, of which there are many (the first having been done by me and my colleagues — just thought I would throw that in there), are informative about many points, but would be useless for cancer or medical treatment research. Simple cross-sectional data, like CDC’s recent report showing that e-cigarette use is overwhelmingly concentrated among people who recently quit smoking, are compelling evidence of e-cigarettes causing THR, whereas similar simple cross-tabs would be uninformative about many other questions.

And yet there persists a myth that there is some hierarchical ranking among types of studies. That is, the claim is that no matter what the question is, the best method of answering it is the same (and, moreover, this supposed ranking even trumps the quality of the study). This is so obviously absurd to anyone with any familiarity with science that it is frustrating to even have to explain it. But the myths persist, and sometimes merely pointing out the obvious incorrectness of the myth is not as effective as also explaining where the myth came from. The following is an excerpt from what I wrote recently on that topic:

While RCTs may be the most storied health science experiments, the above shows that they are not the only experimental methodology. Indeed, they are not the most common experimental method in medicine – not even close. Every time a clinician says, “try this for a few weeks and if that does not work we will….”, she is performing a crossover experiment [this was previously discussed in the report, but its meaning should be obvious here]. Every time an individual takes a similar action for himself, he is performing an experiment. This everyday experiment can provide a definitive answer for such immediate questions as “what can I do to feel better?” Indeed, it usually provides better causal information about what affects the individual than an RCT that finds that one such intervention works for 40% of all cases while another works for 60%. To take a simple example, I am extremely confident that eating raw onion causes me acute stomach pain, having performed the crossover experiment numerous times; I am not at all interested in learning the results of an RCT that estimates how often this occurs in the population.
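
To make the structure of that everyday experiment concrete, here is a minimal sketch in Python (my own illustration, with entirely invented observations; nothing here comes from the report): the individual simply alternates exposure and non-exposure days and compares how often the outcome follows the exposure. When the pattern is as consistent as in the onion example, repeated over many alternations, the causal answer for that individual is about as solid as evidence gets.

```python
# A minimal sketch of the record an everyday n-of-1 crossover produces
# (invented observations, purely illustrative): alternate exposure and
# non-exposure days and see whether the outcome tracks the exposure.
exposure_days = [True, False, True, False, True, False, True, False]  # ate raw onion that day?
pain_days     = [True, False, True, False, True, False, True, True]   # acute stomach pain that day?

exposed_pain  = sum(p for e, p in zip(exposure_days, pain_days) if e)
exposed_total = sum(exposure_days)
control_pain  = sum(p for e, p in zip(exposure_days, pain_days) if not e)
control_total = len(exposure_days) - exposed_total

print(f"pain on onion days:      {exposed_pain}/{exposed_total}")
print(f"pain on onion-free days: {control_pain}/{control_total}")
```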

It turns out, however, that sloppy generalization of such results can create a great deal of scientific error. Individuals, including individual physicians, have a natural tendency to over-conclude from personal experience, and to not even interpret that experience correctly. Clinicians have an inclination to draw confident conclusions (“in my professional experience, drug X works better than drug Y”) without recognizing they do not have nearly enough data to distinguish the signal from the random error, without recognizing confounding factors (e.g., they were less likely to assign drug Y to the patients who seemed most likely to quickly get better), and without recognizing that perhaps their patients are importantly unrepresentative of the entire population. In response to these errors, in the 1970s the field with the presumptuous name “evidence-based medicine” was created (I say that as someone who was once a professor of evidence-based medicine). Its focus included educating physicians that they should elevate their trust in the formal research evidence about treatment effectiveness above their personal experience, and that some types of external evidence are, more often than not, better than other types. This effort has been remarkably successful, but there has been a cost in the form of the necessarily simplified approach to scientific reasoning that was taught. The core message had to be simple enough to be memorable and to be taught to extremely busy physicians and medical students in a few hours.
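
As a rough illustration of that signal-versus-noise point (again my own sketch, not part of the report): suppose, purely hypothetically, that drug X truly helps 60% of patients and drug Y helps 40%. A quick simulation shows how often a clinician comparing a small personal caseload would see Y doing as well as or better than X by chance alone.

```python
# A rough simulation (invented numbers, purely illustrative) of how often a
# small personal caseload makes the truly worse drug look at least as good.
import random

random.seed(1)

def ranking_misleads(n_per_drug, p_x=0.6, p_y=0.4):
    """Simulate n patients on each drug; True if X does not come out ahead."""
    x_successes = sum(random.random() < p_x for _ in range(n_per_drug))
    y_successes = sum(random.random() < p_y for _ in range(n_per_drug))
    return x_successes <= y_successes

trials = 20_000
for n in (5, 10, 30, 100):
    share = sum(ranking_misleads(n) for _ in range(trials)) / trials
    print(f"{n:>3} patients per drug: misleading or tied comparison {share:.0%} of the time")
```

With only a handful of patients per drug, the “in my experience” comparison points the wrong way, or cannot distinguish the drugs, a large share of the time; that is precisely the kind of error the formal research evidence is meant to guard against.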

It was these simplistic messages that created such prevalent myths as: that there are exactly four or five or six types of epidemiologic studies (in reality there are not bright lines among these, there are countless variations on each, and there are numerous other methods that do not fit well into any of the usually listed types); that there is an epistemic hierarchy among that list of study methods (in reality the optimal study to answer a particular question varies depending on the question and what type of study can be done well under the circumstances); that “an RCT is the gold standard” (even if we are talking about the best possible RCT of a medical treatment, it falls far short of the genuine gold standard, the Platonic ideal data, so in real-world epidemiology the best we can hope for is rather alloyed gold; moreover, if the RCT is bad or the question is not amenable to RCT research, it will fall far short of even that); and that “individual case reports [“anecdotes”] are uninformative” (that is true for some questions, but not for others).

There is nothing wrong with having simplified rules-of-thumb about how to interpret evidence about medical treatments. We all make most of our assessments and decisions in life based on simplified rules-of-thumb, and if we choose good rules it works out pretty well. But it would be foolish to mistake rules-of-thumb for absolute truth, let alone natural law, and even more foolish to generalize them beyond the particular area where they work. This evidence-based medicine simplification of epidemiologic inquiry is close enough to true for the purpose for which it was designed, teaching people how to assess information about medical treatments. It also happens to be fairly close to accurate for cancer research. But it fails miserably in other cases, and fortunately physicians and everyone else ignore it in most of those cases. Everyone intuitively knows that a crossover experiment (“in order to figure out whether a particular switch turns on a particular light, the study you need consists of: flip the switch” or “try this for a week and see if you feel better”) is often the best scientific method for answering a particular causal question, and employs that method despite “case-crossover study” [i.e., those studies like flipping a switch or trying something for a week, a major topic of the report this is quoted from] not even appearing on those lists that supposedly contain all the types of epidemiology studies.

Restricting yourself to the constraints of these simplifications would be naïve and foolish for deciding whether to avoid eating onions. It is similarly naïve and foolish – or perhaps something more nefarious than that – to insist that the evidence in the present case conform to ill-fitting rule-of-thumb restrictions on methods of inquiry that are, wisely, not actually even obeyed by their target audience when they clearly are a bad fit.

Simplified rules of thumb have their place. We tell children to not get into cars with strangers. But if we generalized that simplification as if it were a universal rule, rather than a guard against errors by people with limited skills, we would frequently find ourselves stranded at airports. Ostensible experts reciting simplified rules-of-thumb from one specific area of inquiry as if they were real scientific rules, let alone as if they apply to all areas of inquiry, is the equivalent of them not taking a taxi because their mommy told them not to get in cars with strangers. It is easy to see that this is a statement of their ignorance, not their expertise: All you have to do is ask them to justify their claims about particular study designs always being superior. They cannot, of course, because the claims are not true. But watching them struggle to come up with any answer is generally sufficient to demonstrate that they simply do not have any idea what they are talking about.

In that report, I was addressing absurd demands by those who seek to deny overwhelming evidence in exactly the same way the ANTZ do with THR. The game in both cases is to insist we do not know anything in the absence of RCTs or other particular types of studies, even though these would clearly be less informative than the simpler types of evidence we already have. I describe this as confusing necessity with virtue. Just because sometimes we have to employ a particular methodology, that does not make it universally better. In order to identify a subtle cause of cancer, we need to collect a lot of particular data, which is very difficult to do without using one of a few particular study methods, and then perform some moderately complicated statistical analysis on it. Similarly, if we want to know if one drug is a slightly more effective treatment than another, we are unlikely to get a good answer without clinical trials. But if we want to figure out if THR caused someone to quit smoking, we can just ask him. The answer might still be wrong of course (it is theoretically possible he would have spontaneously quit the same day even without THR) — such is always the case with science. But the answer is far more definitive than anything we ever get out of case-control studies or RCTs.

The complicated and expensive methods are not used because they provide better answers — they typically provide really lousy information — but because they are necessary to deal with particular complications. If those complications do not exist, those methods have no advantages. Suggesting that complicated methods should be used when they are not necessary is the equivalent of saying that all health researchers should always wear biohazard suits when working because for some health researchers this is sometimes a necessary expedient.

Bringing this back to specifics, would we know more if our testimonials and surveys of vapers were collected systematically rather than as convenience samples of volunteers? Absolutely, but we still know a lot based on them. It requires real scientific thinking, not just blind adherence to simplified recipes, to know which questions are answered and which are not. (Fortunately, such scientific thinking is mostly within the skill set of anyone who understood high-school-level science classes.) Would we have better evidence if the CDC had collected more retrospective information from their subjects (such as simply asking them if they used e-cigarettes to quit smoking)? Absolutely, but we can be pretty confident of what we are seeing even without that. Would we have more useful information if we did some RCTs? Well, no. RCTs are pretty useless in this area unless the question is something like, “if we conscripted all smokers into a clinical setting where we required them to try switching to e-cigarettes, how many would switch?”, which is not a particularly interesting question to answer.

3 responses to “Sunday Science Lesson: Why people mistakenly think RCTs (etc.) are always better”

  1. Irrespective of the methodology, fanciful “concerns” about risks to The Cheeeldrin outweigh proven health benefits to adults.

    • Carl V Phillips

      It is certainly true that trying to deny the overwhelming evidence of THR via the claims addressed here is not the primary play in the ANTZ playbook (as opposed to the topic I was addressing in the original report — health effects of wind turbine generators on nearby residents — where it is their dominant game). Still, it plays a role. At least once a week I see some claim about there being “no evidence”. So it is worth understanding where the myths come from in order to better counter them.

  2. That was a very interesting read, thank you Carl.
