by Carl V Phillips
This continues from last week’s post. In that post, I examined what a death caused by smoking even means. (Recall: it technically means a death hastened by even one second, so basically every death among ever-smokers could be counted, though this is clearly not how people interpret the figures, and even those trying to exaggerate the number do not actually game it this way. Still, it is not clear what the claims do mean.) I then explored what data you would ask for if you could have any data you wanted to answer the question, a critically important thought experiment in epidemiology that is almost never done. (Recall: you would want to run an alternative history of the world in which no one smoked but all else was the same, and compare the death counts.) Today I am going to move from that to what we can actually do with the data we can get, and why it fails to do a very good job of answering the question.
Estimating attributable all-cause mortality
Given that we cannot ask for divine data intervention or rerun the history of the world as an experiment, we are stuck having to compare people who chose to smoke to people who chose not to smoke. If we are interested in estimating deaths caused, the ideal way to use such data is to look at the lifespans in each group and compare them. To the extent that mortality occurs earlier in the smoking group, the difference is an estimate of the effect of smoking. There are two important problems with this. No, actually there are a hundred important problems, but two of them happen to be enormous.
First, because smoking is a choice rather than an accident of demonic possession (which would make this much easier), it is inevitably associated with other choices and characteristics, many of which might change someone’s longevity. This creates confounding, which means that the exposed group has different outcomes (for the endpoint of interest) than the unexposed group that are not caused by the exposure. Dealing with confounding is often a major challenge and in the present case it is a huge problem that makes the task nearly impossible, even for those trying to do it right (which does not describe the actual researchers in the present case).
The way researchers attempt to deal with confounding is by measuring covariates that can be used to adjust the statistics to try to correct for it, thereby removing the effects of confounding from the estimate of the actual effect (of smoking in this case). There are variations on this theme that match subjects based on these covariates and improvements such as limiting the study population to those who are already a lot alike; these have similar limitations.
The problem with such corrections is that even if the researchers seriously think through what the pathways for confounding might be and what data they would like to have to correctly address them (which 99% of epidemiologists never do), they discover that they do not have the right data. What happens in practice 99.9% of the time, then, is that researchers just throw whatever variables they happen to have into their statistical model. Moreover, even to the extent that these are the variables a thoughtful researcher might want, they are often measured badly or are poor proxies. E.g., smokers are more likely to have jobs that involve hard manual labor, which shortens life expectancy, but the data may include only income and education level, which are proxies for this, but poor ones. And even if the dataset contains a variable for “did hard manual labor,” it is probably going to be badly measured.
Needless to say, just throwing in convenient variables is not good enough. Even worse, they have to be put into the model using the right functional form, but almost no consideration is given to this. (E.g., imagine that eating more than 40% of one’s calories from meat shortens lifespan, creating important confounding among those above that threshold but none among those below it. If there is a variable for meat consumption, it will almost certainly be put into the model as a linear continuous variable. That functional form is a terrible proxy for the actual threshold effect, so even though this confounding could have been corrected for using the available data, it is not.) Because of all this, there will inevitably be “residual confounding” — bias in the estimate of interest caused by confounding that remains even after the attempts to correct for it.
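To make the functional-form point concrete, here is a toy simulation (every number in it is invented for illustration, not taken from any real study). In this imaginary world smoking is completely harmless, the only real risk factor is crossing the 40%-of-calories-from-meat threshold, and smokers are more likely to cross it. Adjusting for meat as a linear continuous variable leaves a spurious positive “effect” of smoking; adjusting with the correct threshold indicator removes it.

```python
import random

def ols(X, y):
    """Ordinary least squares via the normal equations (Gaussian elimination)."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for c in range(k):  # forward elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        b[c], b[p] = b[p], b[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            for j in range(c, k):
                A[r][j] -= f * A[c][j]
            b[r] -= f * b[c]
    beta = [0.0] * k
    for c in reversed(range(k)):  # back substitution
        beta[c] = (b[c] - sum(A[c][j] * beta[j] for j in range(c + 1, k))) / A[c][c]
    return beta

random.seed(1)
smoker, meat, died = [], [], []
for _ in range(200_000):
    m = random.uniform(0.0, 0.6)          # fraction of calories from meat
    heavy = 1.0 if m > 0.40 else 0.0      # the true confounder is this threshold
    s = 1.0 if random.random() < 0.2 + 0.3 * heavy else 0.0    # smoking tracks the threshold
    d = 1.0 if random.random() < 0.05 + 0.10 * heavy else 0.0  # smoking itself is harmless here
    smoker.append(s); meat.append(m); died.append(d)

# Adjusting with the wrong (linear) functional form leaves residual confounding:
b_linear = ols([[1.0, s, m] for s, m in zip(smoker, meat)], died)
# Adjusting with the correct threshold indicator removes it:
b_thresh = ols([[1.0, s, 1.0 if m > 0.40 else 0.0] for s, m in zip(smoker, meat)], died)
print(f"smoking coefficient, linear meat adjustment:    {b_linear[1]:+.4f}")
print(f"smoking coefficient, threshold meat adjustment: {b_thresh[1]:+.4f}")
```

The linear adjustment fails precisely because the model’s assumed form does not match the true threshold effect — the confounding was correctable with the available data, but the wrong functional form left much of it in place.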
Confounding is the most troublesome cause of error here, but not the only one. There is always a problem of measurement error. It can be completely random and thus produce pure noise. [More technical aside: This means that the stated confidence intervals, which are supposed to be a measure of the random error, are always narrower — often hugely so — than they should be.] But for a particular study method the measurement error might be reasonably expected to be biased in a particular direction. And there are other sources of study error I will not go into.
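The point in the bracketed aside can be illustrated with another invented simulation: suppose smoking status is recorded incorrectly for 20% of subjects (all parameters below are hypothetical). Each study’s nominal 95% confidence interval reflects only sampling error, so across repeated studies the intervals cluster tightly around a biased estimate and essentially never cover the true relative risk.

```python
import math
import random

random.seed(2)

TRUE_RR = 2.0          # true relative risk of death for smokers (invented)
RISK_UNEXPOSED = 0.05  # baseline risk among nonsmokers (invented)
MISCLASS = 0.20        # chance a subject's smoking status is recorded wrong
N = 5000               # subjects per group

def one_study():
    a = n1 = c = n0 = 0  # deaths and totals by *recorded* smoking status
    for truly_smokes in [True] * N + [False] * N:
        died = random.random() < RISK_UNEXPOSED * (TRUE_RR if truly_smokes else 1.0)
        recorded_smoker = truly_smokes ^ (random.random() < MISCLASS)
        if recorded_smoker:
            n1 += 1; a += died
        else:
            n0 += 1; c += died
    rr = (a / n1) / (c / n0)
    se = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)  # SE of log RR for cohort data
    return rr, math.exp(math.log(rr) - 1.96 * se), math.exp(math.log(rr) + 1.96 * se)

studies = [one_study() for _ in range(200)]
mean_rr = sum(rr for rr, _, _ in studies) / len(studies)
coverage = sum(lo <= TRUE_RR <= hi for _, lo, hi in studies) / len(studies)
print(f"true RR {TRUE_RR}, mean estimated RR {mean_rr:.2f}")
print(f"share of nominal 95% CIs that cover the true RR: {coverage:.0%}")
```

With these particular (nondifferential) errors the estimates are pulled toward the null, and the stated intervals confidently exclude the truth — narrower than any honest accounting of the total error would allow.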
Now if you are merely trying to figure out whether smoking causes lung cancer — a dichotomous question — you can hammer away at your data and bring in outside knowledge to make a convincing case that no plausible level of correction for the apparent confounding and other errors could make the association disappear. You can try different functional forms. You can adjust for possible levels of measurement error. You can postulate values for missing covariates that you would really like to have and see what effect they have. Epidemiology can definitely be a very robust science when done with this level of seriousness. It gets much dicier when we are asking it to produce a quantitative measure, however.
[Aside: I said “can” because this obviously requires more than taking one cut at the data and observing that the assumptions embodied in that one statistical model do not make the association disappear. Yet that is exactly what most epidemiologists do — and they probably do not even understand that they are making particular assumptions by using the particular statistical model, let alone have thought about what those assumptions were. It is even worse than that: researchers often try multiple models and report only the one with an extreme result, intentionally biasing what is published away from the truth. Epidemiology is a very difficult science that is hard to do well and very easy to do wrong, but it is practiced mainly by people who have no idea how to do it right and often are trying to do it wrong.]
In the comments from last week, someone mentioned that I am making you delve into the philosophy of science. That excursion is useful if you want to really get this, because the issues are far deeper than typical discussions might suggest. The greatest contribution I think I have made to philosophy of science per se is an observation about the contrast between scientific methods that are partially self-confirming and those that are not. (I use “contribution” loosely since by publishing this observation here, I am dramatically increasing the total number of people who ever read it.)
Riffing on an observation of the great Ian Hacking, we can note that when Galileo looked through his telescope and claimed to have seen moons orbiting around Jupiter that fit the pattern of Kepler’s laws of orbital mechanics, there was something self-confirming about the observation. Yes, he could have just been fantasizing the whole thing. But assuming he was not, the fact that the data roughly matched a proposed general law was a strong indication that he really was seeing what he thought he was seeing. After all, it would have been really unlikely that there was something out there that was acting in 75% agreement with the prediction from physics theory. (Background: A fact that is obscured by grade-school science education is that Galileo’s primitive telescope produced such noisy data that most people looking through it could not make out anything at all. It was nothing like the precise instruments that anyone can buy today.)
There are countless other examples like this. Though there is inevitably error in all empirical work, sometimes the results are close enough to some focal hypothesis that they not only support that hypothesis but, when combined with that hypothesis, also support our confidence that the research results were pretty accurate. As another example, consider a high school chemistry lab practicum, where the goal is to do some tests and figure out what chemical is in a sample. If you are given a list of candidate chemicals and you do a series of tests that produce results close to what you would expect if it were potassium hydroxide and quite different from what any of the others would produce, then it probably is indeed KOH and you probably did your measurements right. If the chemical were something else, it would be extremely unlikely that you did your measurements and experiments wrong in just such a way that they generated data close to what you would expect if the sample were KOH.
By contrast, epidemiology measurements have no such self-confirmation. Yet those who practice epidemiology and repeat its claims seem completely unaware of this fundamental contrast with the baby-steps science classes they took. Whenever you do an epidemiology study, no matter how badly you screw it up, you produce a result — you generate a quantitative estimate of the effect you are attempting to measure. Whatever errors you make in the study, you produce a result. However bad the data is, you produce a result. Even if the data is completely made up and you mis-program your statistical software, you produce a result. There is no underlying theory that makes any particular quantitative result a focal point, as in the telescope or chem lab examples. It is not the case that, say, the candidate true relative risk values are 1.0 and 2.1, and because you estimated 2.2 you have supported the 2.1 hypothesis and apparently did a pretty good job with your measurement.
Thus, no matter how bad your research is, the results themselves do not look any more or less right. The only exception is if they are serious outliers compared to previous empirical results or fail obvious reality checks (e.g., “if it really were that bad, exposed people would be dropping like flies, and they are not”). It is only then that we can say that the results suggest the research was done badly. We can more or less never say that the results suggest the research was done right.
So this means (no, I am not aimlessly wandering into the weeds here!) that even though you can hammer away at a quantitative result to make a very convincing argument that study error cannot explain away the entire estimated association between smoking and lung cancer, you cannot claim that the actual quantitative point estimate is supported because the yes-vs-no dichotomous conclusion stands up to hammering. Thus, attempts to estimate total attributable mortality or any other population totals are always dependent on quantitative estimates that we have little reason to believe are accurate. The only way to use study results to confirm that a study was apparently done right and the quantitative estimate is reasonable is to do multiple studies that are designed so it is unlikely that each produces the same errors, and then find that the results are consistent (not just on the same side of the null, mind you, but quantitatively consistent with one another).
For the case of estimating the effects of smoking, the multiple studies that are done are subject to basically the exact same confounding. Moreover, we have good reasons to believe this bias consistently creates overestimates. When every method of estimating the same number has the same problems (as is the case with confounding here) and there is no theoretical focal point, then we have no reason to believe that the estimates are not all wrong in the same way, even if they agree with one another (which they do not in this case, by the way, making the problems even clearer). In particular, due to the imperfect controls for obvious confounding, we have a good reason to believe they are consistent overestimates.
Further complicating this, even if an effect estimate is exactly right for a particular study population, it does not necessarily translate to other populations. Thus cross-population confirmation can help us with the dichotomous question (if smoking causes heart attacks, then we should see some effect in both Brits in 1970 and Italians in 2010), but it helps little with the quantification (since the rate at which smoking causes heart attacks in those two populations will differ because of differences in other causes that interact with it). There are no physical constants in epidemiology. Notice that this means, among other things, that a study estimate from 1985 is not a measure of what is happening today, let alone a basis for predicting the next 50 years. A study estimate for Americans does not apply to Pakistanis. Yet those who play with these numbers almost invariably ignore this.
Epidemiologists also ignore even the most obvious errors in their effect estimates — approximately always. It is truly appalling. If you look at those stupid and pointless “limitations of this study” paragraphs in research reports, you frequently find a statement that there probably is bias from residual confounding or some other error (not so much in the tobacco control literature, where the authors seem to not even understand this, but elsewhere in epidemiology it is practically a mantra). But they ignore this observation when reporting their results to three decimal places and their overly narrow confidence intervals.
In the rare cases where authors seem to even care that they have just announced that their results are wrong in the penultimate paragraph of their paper, you often find an argument that plausible levels of these biases could not move their estimated relative risk to 1.0. That might well be true (though, often this protest is unconvincing), but that would merely allow the dichotomous statement that, say, smoking causes some heart attacks. But they then implicitly claim that because the true value could not be 1.0 then their point estimate is exactly right. Think about it a moment and you will realize that is what is being done. They do not actually say this of course, because it would be obviously wrong even to them, but that is what they are claiming: “There probably is some residual confounding, resulting from smokers being different from nonsmokers in a lot of ways, that almost certainly makes our RR estimate of 2.1 too high, but it is implausible that this error is great enough that the true value is 1.0, and therefore 2.1 is correct.”
The right thing to do, of course, would be to correct for the reasonable estimates about the confounding and other study errors when reporting the quantitative estimates. But they do not.[*] So when we are trying to estimate actual numbers, like death counts, they are based on incorrect inputs. In this case, almost certainly high. But even if someone doubts that the error due to residual confounding is necessarily an upward bias, there is no doubt that the estimates are wrong one way or another, probably by a lot. And yet the figures are reported as if they were precise and accurate measurements. They are not.
[*I will concede that this is a little bit my fault. When I first started presenting the work that appeared in that linked paper, c.1999, it made a tremendous splash — it won awards, top thinkers in the field said it was the most important insight in the field at the time, stuff like that. But I started focusing elsewhere; THR’s gain was epidemiology methods’ loss, I guess. Over time, others have developed methods that are far better than what I was toying with back then and are still working on it (e.g., Igor Burstyn — perhaps you have heard of him without realizing he is one of the great serious thinkers in the field). But while these improvements are now sometimes employed in more serious corners of epidemiology, you would never know they even exist if you just read what appears in the “public health” research.]
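To show what correcting the reported numbers could look like, here is a minimal sketch using the classic external-adjustment formula for a single unmeasured binary confounder (a Bross-type correction; every input number below is hypothetical, chosen only to echo the RR of 2.1 in the caricatured quote above, and is not from any real study):

```python
def confounding_bias_factor(p_exposed, p_unexposed, rr_conf):
    """External-adjustment (Bross-type) bias factor for a single unmeasured
    binary confounder with risk ratio rr_conf, present in a fraction
    p_exposed of the exposed group and p_unexposed of the unexposed group."""
    return (p_exposed * rr_conf + 1 - p_exposed) / (p_unexposed * rr_conf + 1 - p_unexposed)

# Hypothetical inputs: an observed RR of 2.1, and a confounder that triples
# risk and is present in 50% of smokers but only 20% of nonsmokers.
observed_rr = 2.1
bias = confounding_bias_factor(0.5, 0.2, 3.0)
adjusted_rr = observed_rr / bias
print(f"bias factor: {bias:.2f}")       # 1.43
print(f"adjusted RR: {adjusted_rr:.2f}")  # 1.47
```

Even a back-of-the-envelope correction like this moves the point estimate substantially — from 2.1 down toward 1.5 under these assumed inputs — which is exactly why reporting the uncorrected number as if it were precise is misleading.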
The second fundamental problem with this method of measurement is more subtle but also important. If you compared those two distributions of how long smokers and nonsmokers live, you would (assuming they were accurate) have a measure of years of potential life lost due to smoking. But this does not translate into a count of deaths; indeed, it does not even make such a count well-defined or meaningful. Ten years of potential life lost could be one person dying 10 years earlier than he would have, or 10 people dying one year earlier, or 3653 people dying one day earlier. You cannot tell which from the data, making it rather difficult to translate the observations into a count of deaths. (To see this, imagine you have perfect unconfounded data for 10 people in each category, with the nonsmokers dying at ages 71, 72, …, 80 and the smokers at 70, 71, …, 79. Since these are not the same individuals, as they would be in the hypothetical rerun-history experiment, you cannot say whether each smoker lost one year, or whether the one who died at 70 would have made it to 80 and the others were unaffected. Or any of more than 10^7 other possible combinations.)
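The age-at-death example can be worked through directly. This sketch shows two counterfactual stories that are equally consistent with the same lifespan data yet imply very different death counts:

```python
# The same lifespan data, two equally consistent counterfactual stories.
nonsmokers = list(range(71, 81))  # ages at death: 71, 72, ..., 80
smokers = list(range(70, 80))     # ages at death: 70, 71, ..., 79

# Total years of potential life lost is pinned down by the data:
ypll = sum(nonsmokers) - sum(smokers)

# Story A: every smoker died exactly one year earlier than he would have.
story_a = [(age, age + 1) for age in smokers]
# Story B: the smoker who died at 70 lost a full decade; the rest were unaffected.
story_b = [(70, 80)] + [(age, age) for age in smokers if age != 70]

def deaths_caused(pairs):
    """Count deaths hastened at all, given (actual, counterfactual) ages."""
    return sum(actual < counterfactual for actual, counterfactual in pairs)

print(ypll)                    # 10 years lost, under either story
print(deaths_caused(story_a))  # 10 deaths "caused"
print(deaths_caused(story_b))  # 1 death "caused"
```

Both stories produce exactly the observed data and exactly 10 years of potential life lost, yet one implies ten deaths caused by smoking and the other implies one. Nothing in the data can adjudicate between them.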
Also, there is the problem I addressed last week, of deciding which of these even count — is 11 months enough, or should it be a year? As I explained, if you count everyone who died even a tiny bit earlier, then you could say that smoking caused almost every single death among everyone who smoked.
One thing that you might take away from these observations is that trying to count up deaths from smoking is an inherently misguided exercise that exists, at best, at the margins of real science. You would be right about that. It is not the right measure and it is not measured well.
Estimating total mortality by adding up individual diseases
It turns out, however, that comparing mortality statistics is not how most of these deaths-from-smoking calculations are done. Instead what is usually done — including the CDC statistics that are most often repeated and mis-extrapolated — is to try to add up deaths from individual diseases rather than looking at overall mortality. At first blush, that might seem to solve some of the problems noted above, but in reality it mostly just hides some and exacerbates others, while adding a few new ones to the mix. But I trust we are all tired already, so I will pick that up in next week’s class. [Update: The final post in the series actually took a month. Sigh.]