by Carl V Phillips
This week, in my major essay (and breezy follow-up), I argued that the current dominance of hate-filled nanny-staters in public health is actually a product of medic and technocrat influence more than of the wingnuttery itself. The worst problem there has to do with inappropriate goals that stem from a medical worldview morphing into a pseudo-ethic. The seemingly inevitable chain of events created by that pseudo-ethic resulted in public health professionals hating the human beings we think of as the public, because we are a threat to what they think of as the public, which is just the collection of bodies we occupy.
But this is not the only damaging legacy in public health of the thoughtless application of medical thinking. The science itself has also suffered, most notably (though far from only) because of the fetishization of clinical experiments (aka RCTs: randomized controlled trials) and denial of research methods that are more appropriate for public health. This is something I have written and taught about extensively. I will attempt to summarize it in a couple of thousand words.
It can be argued that modern public health science got its start with the formalization of new observational epidemiology methods by Hill, Doll, Wynder (who may have “borrowed” his insights from the Nazi-era Germans), and colleagues in the 1950s. Observational studies are at the heart of epidemiology, the science of public health, as they are in most social sciences, because realistic experiments are not possible the way they are in, say, chemistry. It turns out that experiments are possible in clinical medicine also, and work quite well there too — often better than any other source of evidence.
You can see where this is going if you read the previous essay.
Just as the thoughtless porting of clinical goals (“fix this body; that is all that matters”) crippled public health’s political view and ultimately destroyed its compassion, porting the much simpler scientific context of clinical medicine damaged the ability of the field to understand science. It is ridiculous how much of my career I have spent responding to the damage that this has done (though I guess I cannot complain when I get paid for it).
It is easy to find a zillion sources for the naive and incorrect claims that an RCT is at the top of some hierarchy of study types or is the gold standard. Both of these are totally wrong. “Gold standard” refers to something that is exactly right and which other measures can be calibrated against. Obviously this is not accurate since RCTs are just one of many imperfect methods; they are sometimes best but nothing is gold in this world. Less obvious, but equally true, is that there is no hierarchy of study types. Under some circumstances one study type provides better information, and under other circumstances another does. It depends both on the question being asked (e.g., when the question is “is it possible that X occurs?”, then a single case study is the perfect study design) and physical realities (e.g., if it is not possible to control the exposure, then an experiment is obviously not going to be useful). The typical naive hierarchy puts RCTs on top, followed by the various systematic observational studies (which are, even more absurdly, sometimes themselves ordered), followed by case series, individual case studies, and “expert opinion”.
Why is this mythology so pervasive? It is a classic case of offering a simplified point to an audience that is not going to understand the more complex reality, but is in need of something. The target audience is people who start out so misguided that they need some guidance, but who are not going to sit still long enough to really understand at a deep level. It is akin to telling a toddler that pulling the cat’s tail makes her sad. It may be inappropriate to project the feeling “sad” onto a cat. Strictly speaking we are concerned with the concept of simple physical pain (which may not make an impression on the sympathetic but pre-moral toddler), plus the possibility of terror that might result in behavior changes for the cat that make her life less rich (which would be far too complicated for the kid to understand). But “sad” might motivate the right behavior change. And it is not so far off from the technical reality as to be an outright lie. Of course, if the kid grows up thinking “sad” is a complete and precise description of the effects of abusing critters, some bad results might ensue. But chances are he will not stick with this simplification as he learns to deal with the complex real world. Substitute physicians for the toddler, remove the bit about growing up, and you have an explanation for the fetishization of RCTs (that is, the belief that they have magical properties that they are associated with but do not actually possess) and for the other simplifications.
To start with, medics have a bad habit of thinking they can learn the relevant science through personal experience. (Note that this and similar observations here are based on an extensive literature on the topic and the reported wisdom of those who educate physicians, a group that once included me, as well as my own concurring observations.) That is, they think they are basically like plumbers. Plumbers acquire a useful set of specialized skills and knowledge through some technical training and then an apprenticeship and continuing experience. That works well for them to figure out how to deal with the variety of situations and outcomes they face. Physicians are trained and acculturated in a similar fashion, which was about right a hundred years ago. Of course, most medics would resent the comparison, and they are certainly right that they face a much more complicated list of situations than plumbers. More importantly, the cause and effect in their world, and even the outcomes themselves, are much harder to judge.
The result of these differences is that the approach that works great for plumbers fails for medics. Each individual’s experience is woefully inadequate to judge, say, which of two drugs is better to treat a particular patient. The plumber learns that a particular fitting often fails if it is not angled just right because he can see the failure happen. The physician may not even know whether a drug worked for a particular patient (unless it is for a condition with no spontaneous remissions and the drug is the only intervention attempted), and would need to see hundreds of such patients — administering one treatment to some and the other to the rest, and keeping careful track of successes — to detect a statistical difference in effectiveness. This is unlikely to happen. More likely is that she will fixate on a small number of apparent successes and think those are sufficient evidence (“I have seen this work many times”). Thus, it needs to be hammered into medics that they should generally defer to expert assessments of the body of formal research findings rather than substituting what they think they have learned from their own experience.
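To put a rough number on “hundreds”: here is a minimal sketch of the standard sample-size arithmetic for comparing two proportions. The cure rates (60% versus 70%) are purely hypothetical numbers of my own choosing for illustration, not figures from any real drug comparison:

```python
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for a two-sided
    comparison of two proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_b = NormalDist().inv_cdf(power)          # power requirement
    p_bar = (p1 + p2) / 2
    numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p1 - p2) ** 2

# Hypothetical drugs curing 60% vs 70% of patients (illustrative only):
print(round(n_per_group(0.60, 0.70)))  # about 356 per drug, roughly 700 total
```

Even with a true difference as large as ten percentage points, it takes on the order of seven hundred carefully tracked patients to distinguish the drugs from chance, and no individual practice casually generates that.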
This last bit explains the greatest absurdity of those naive hierarchy lists: putting “expert opinion” at the bottom. This is simply silly, since real expert opinion is the ultimate (indeed, only) source of interpretation of the evidence. Study results do not speak for themselves. The purpose of these lists is to convince medics to stop paying attention to the highly inexpert opinions that come from their limited experience, and it is easier to sell that message to them if their experience is flatteringly labeled “expert opinion”. Once written down, this gets misinterpreted as suggesting that the synthesis of all available evidence by the real experts — the best source of organized knowledge — is unimportant.
As a related digression, when I talk to reporters about THR, the standard estimate that smoke-free tobacco products are roughly 99% less harmful than smoking is usually part of the conversation. Sometimes they will ask “where does that number come from?” or “whose estimate is that?” I explain that it is mine, that it traces back to calculations I did about smokeless tobacco epidemiology in 2006, and that we extrapolate it to e-cigarettes, which are probably about as low risk, or at least close to it, based on what we know about the chemistry of the exposure. Reporters are usually quite curious, so I explain more, including how this number is certainly not precise, but that it has stood the test of time, never being seriously challenged and being accepted by the other experts on the topic. In other words, it is expert opinion — mine and others’. What else could it possibly be? But it does not rest on one battery of calculations I did a decade ago. It rests on the experts synthesizing those calculations, along with all other available relevant knowledge, and concluding that it is about right. The key here is assessing all of the available formal study evidence and synthesizing it. No individual study result can substitute for that.
Returning to the narrative, the next step, after overcoming the urge to think that their limited and imperfectly-measured experience constitutes sufficient research, is getting the medics to not pay undue attention to minimally useful evidence. Again this has to be simplified because most of them are not scientists and do not understand scientific inquiry, but we are trying to get them to be science-based. A few are good scientists in addition to being medics, of course (and they understand the points I am making here and find them as frustrating as I do), but the goal is to get the rest of them to behave as if they understood the science. Like people in most professions, they are so busy with the day-to-day activities and technical details of their practice that they are unlikely to engage in much analysis. So they are told that case studies are the lowest form of study (just ahead of “expert opinion”), with collections of case studies just above that. When a medic reads a captivating case report about some utterly improbable series of events (reported because it was unique and thus interesting), we do not want him trying to recreate this in his practice. In general, it is not a good idea for any of us to base our assessment of how the world normally works on a one-off story we have heard. So the rule of thumb is a pretty good idea.
But this simplification — almost always good advice for someone engaged in the biological side of medicine — has been turned into a fetish. There is even a pejorative term for case-based evidence (“anecdotes”, of course). The thing is, case studies / testimonials / anecdotes are often quite useful when studying public health topics. To take the obvious example, the testimonials about the miraculous role that e-cigarettes played in the lives of many people who tried a dozen times to quit smoking using every “approved” method are very informative. They do not tell us what portion of smokers e-cigarettes might work for, but they do tell us that the described phenomenon does happen. And because there are so many of them, they tell us it happens a lot, not just rarely. This works because for human behavioral phenomena — those that incorporate preferences, feelings, and volition — such individual experiences can be informative to the individual, and thus to those to whom they report that experience. People, unlike molecules or plants, are capable of contemplating what they are experiencing and what they are deciding, and reporting details about it. They are not always right, of course, because none of us fully understand ourselves. However, they are generally pretty accurate with observations like “I tried to quit smoking using every pharmaceutical out there” and “I quit for a month five different times, but I was always so miserable I started again”, along with “after a few weeks of using e-cigarettes and finding a flavor I liked, I was happy to never smoke again, and I haven’t done so for two years.” When thousands of people tell stories like that, you have learned a lot, and it is knowledge that is unlikely to be generated by any study design that is supposedly higher on that mythical hierarchy.
But the public health people, stuck in the misleading simplification that originates in medicine (and is generally good advice in that comparatively simple realm), say “no, that is not good evidence about anything, because those are anecdotes, and anecdotes are never good evidence.” If asked to explain why case studies are not good evidence about anything, or specifically about the existence of people for whom e-cigarettes are apparently the only way to quit smoking, they have no answer. (“It was on a list of rules we had to memorize for a test in school, and I have never questioned it” is an answer that few will admit.) They have no answer because the claim is wrong. They could answer, say, “I know that a couple of anecdotes about someone recovering from gastric ulcers after adding large quantities of nutmeg to their diet does not mean medics should recommend nutmeg as a treatment.” And that would be sound reasoning; biological conditions spontaneously change, and it is easy to be superstitious about the cause. But that observation about biological interventions does not generalize to studies of human behavior.
There is another misguided generalization of this “anecdotes are not informative” myth that I have dealt with extensively, specifically in the context of the health problems caused by industrial wind turbines (electric generators) near people’s homes. There are countless testimonials of people who experienced a particular set of health problems when wind turbines started operating in their neighborhood, who also found that the problems abated when they spent time away, and then recurred when they returned to the exposure. People who do not understand science and believe the simplification from the faux-hierarchy dismiss these as “just anecdotes”. But not only are these “anecdotes” useful, they are actually near-perfect experimental evidence, which is generally the most compelling evidence when the experiment actually measures what you are interested in, as it does in this case. The individual experiences, even though they are the testimony of a single person, involve changing the exposure and observing the effects, which is the very definition of an experiment, and this provides very compelling evidence. (For those interested in more details, I wrote that up in the article available here.) You will probably notice a similarity with individuals’ smoking cessation experiments, wherein someone tries a particular intervention, such as switching to e-cigarettes, and can definitely figure out whether it works (for that individual) from personal observation.
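The statistical logic behind those repeated reversals is just the sign test. A minimal sketch, using hypothetical reversal counts rather than figures from any particular testimonial: if the symptoms had nothing to do with the turbines, the chance that they would happen to track every change in exposure shrinks geometrically:

```python
# Probability that symptoms coincidentally track exposure across every
# one of k on/off reversals (sign-test logic). The values of k here are
# hypothetical, not taken from any particular testimonial.
for k in (3, 6, 10):
    print(f"{k} consistent reversals: p = {0.5 ** k:.4f}")
# 3 -> 0.1250, 6 -> 0.0156, 10 -> 0.0010
```

A handful of consistent reversals already pushes coincidence below the significance thresholds that the RCT fetishists themselves insist on, which is exactly why these individual experiments are so compelling.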
Returning to the simplified advice to medics, the other disastrous failure in medical research occurs when observational studies are so confounded as to be useless. The most obvious example is an experimental treatment (drug, surgical technique, whatever) that is used only in circumstances where the standard method looks hopeless. If we naively compare outcomes, the experimental method can look terrible compared to the standard approach, even if it is genuinely a more effective treatment, because it is attempted only for patients who are in particularly bad shape. (You might recognize the similarity of this to the critique that West and Hajek recently leveled at a junk-science study that claimed e-cigarettes do not help cancer patients quit smoking.) Conversely, if the new option is preferred only by the best physicians working in the best facilities, then it might look better than it is. The solution to this is an experiment (RCT) that randomizes who gets which treatment and carries the treatments out under similar circumstances. And thus the simplistic advice to the medics is that the RCT is the best study method.
The main advantage (for relevant purposes, the only advantage) of an RCT is that it eliminates systematic confounding such as that described in the previous paragraph. (Confounding is the problem of the people getting one treatment differing from those getting the other, for reasons other than the treatment itself, and thus having different outcomes that are not caused by the treatment.) But this needs to be balanced against some serious disadvantages of the method. Conveniently, it turns out that those disadvantages are usually pretty minimal for a real clinical intervention. (Though there are exceptions, like the mess that the huge Women’s Health Initiative randomized trial of hormone replacement therapy turned into.) The disadvantages are, however, enormous when we try to do experiments in social sciences like public health. This is where that unfortunate history of public health — a naive port of clinical ways of thinking into a social science, with those in the field not even recognizing that they are doing social science — really becomes a disaster for the science side of public health, just as it did for the ethical and political side.
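To make confounding by indication concrete, here is a minimal simulation with entirely made-up numbers: an experimental treatment that genuinely adds fifteen percentage points to the recovery probability, but that is attempted only on the sickest patients. The naive observational comparison makes it look worse than the standard treatment; randomizing the assignment, as an RCT does, recovers the truth:

```python
import random

random.seed(0)

def recovers(severity, experimental):
    # Made-up model: sicker patients recover less often, and the
    # experimental treatment genuinely adds 15 percentage points.
    p = 0.9 - 0.6 * severity + (0.15 if experimental else 0.0)
    return random.random() < min(1.0, p)

def success_rates(randomized, n=200_000):
    outcomes = {True: [], False: []}
    for _ in range(n):
        severity = random.random()  # 0 = mild, 1 = severe
        if randomized:
            experimental = random.random() < 0.5
        else:
            # Real-world assignment: the new treatment is attempted
            # only when the standard method looks hopeless.
            experimental = severity > 0.7
        outcomes[experimental].append(recovers(severity, experimental))
    return {("experimental" if k else "standard"): round(sum(v) / len(v), 3)
            for k, v in outcomes.items()}

print("confounded observational:", success_rates(randomized=False))
print("randomized (RCT):        ", success_rates(randomized=True))
# Observational: experimental ~0.54 vs standard ~0.69 (looks worse).
# Randomized:    experimental ~0.75 vs standard ~0.60 (truth: it is better).
```

That reversal of the apparent ranking is the entire service that randomization provides; it buys nothing else.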
RCTs became a fetish in public health because they are often (not always) the best study design in clinical medicine. So we have RCTs of NRT products that are interpreted as saying NRTs are nearly useless. This is actually not quite fair, because in public health we are asking different questions than we do in clinical medicine. If the question is the clinical question, “if I ‘administer’ this ‘treatment’ (NRT) to a ‘patient’ presenting with this ‘disease’ (smoking), is it likely to ‘cure’ her?”, then the answer is no. The RCTs show a dismal success rate. But if the question is the real public health question, “does having NRTs available for people who want to quit smoking do any good at all?”, then the answer is yes. On the other side of that is the absurdity: “never mind that smokeless tobacco is obviously responsible for male smoking prevalence in Sweden being ten to twenty percentage points lower, we do not have an RCT that shows it works.” It boggles the mind that people dismiss clear real-world evidence for lack of a completely inappropriate experiment.
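The gap between the clinical question and the public health question is easy to see with arithmetic. A minimal illustration with hypothetical numbers, not actual trial figures:

```python
# Hypothetical numbers, not actual trial figures: a "treatment" with a
# dismal per-attempt success rate can still have a large population effect.
quit_attempts = 10_000_000   # assumed number of quit attempts
success_with_nrt = 0.06      # assumed per-attempt success rate with NRT
success_unaided = 0.03       # assumed unaided per-attempt success rate

extra_quitters = quit_attempts * (success_with_nrt - success_unaided)
print(f"additional ex-smokers: {extra_quitters:,.0f}")  # -> 300,000
```

A clinician looking at the 6% sees a failed cure; the public health question is about the 300,000.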
However, I honestly think that the willingness to dismiss the real-world evidence is not entirely politicized posturing. People in public health have been so damaged by the medical simplification that they really think an RCT would be better evidence than real-world observations. Similarly, an RCT of e-cigarette use could not tell us much of anything we do not already know, and could never be a proper measure of the real-world phenomenon of e-cigarette use in any case. But still there are suggestions that we somehow need such trials.
THR advocacy in the face of the “public health” establishment is saddled with not just the political burden associated with THR being an “impure” behavior that they hate, but also the entrenched anti-science that public health mis-learned from its origins in clinical medicine.
[Update: the next post applies this specifically to research on THR.]