# Sunday Science Lesson: How they estimate deaths from smoking etc.

by Carl V Phillips

This continues, and finishes, the series that started here and continued here. (FWIW, the second one is probably more interesting than either the first or this one, if you want to read just one.) Recall that this series is the very long answer to a question along the lines of “how do they determine how many deaths among smokers, and especially among ex-smokers, are attributable to smoking?” In the first post, I discussed the causal contrast that anyone addressing that question should be thinking of (but probably is not), and noted that the outcome is badly defined (caused the death to occur how much sooner? they never say and probably do not even understand that is an important question).

In the second post, I pointed out that the straightforward way to measure this is to compare a population of smokers or ex-smokers to an otherwise identical population of never-smokers and look for the earlier deaths. It turns out that such estimates of all-cause mortality are not actually what is usually done. Doing that requires having a really large representative dataset, with enough data about individuals to be able to try to fix the enormous problems of confounding I discussed. And you have to figure out some magic to get the right functional form for the control variables for all-cause mortality, which is much harder than doing it for a single disease. (This is quite important, though most people doing epidemiology probably do not even understand this point — so don’t feel bad if you didn’t.) This is not impossible, but it comes pretty close.

So how do most estimates like this add up the deaths?

The method that is most often used is to add up estimated deaths attributed to the exposure across a list of specific diseases that the exposure is believed to cause (meaning it causes some of the cases of those, of course). For each disease, they take the estimate of the portion of deaths attributed to the exposure and multiply it by the total number of deaths from that disease in the population to get an absolute number. They then add those numbers up across the list of diseases.
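The arithmetic of that method is simple enough to sketch in a few lines. This is only an illustration: the disease names, attributable fractions, and death counts are made-up placeholders, not real estimates, and the function name is my own.

```python
# Hypothetical inputs for illustration only; none of these are real estimates.
# Each entry maps a disease to (portion of deaths attributed to the exposure,
# total deaths from that disease in the population).
inputs = {
    "disease_a": (0.80, 1000),
    "disease_b": (0.25, 4000),
    "disease_c": (0.10, 2000),
}


def attributable_deaths(inputs):
    """Multiply each attributable portion by that disease's total deaths,
    then add the products across the list of diseases."""
    per_disease = {
        name: portion * deaths for name, (portion, deaths) in inputs.items()
    }
    return per_disease, sum(per_disease.values())


per_disease, total = attributable_deaths(inputs)
for name, deaths in per_disease.items():
    print(f"{name}: {deaths:.0f} attributed deaths")
print(f"total across the list: {total:.0f}")
```

Everything that matters is hidden in where the portions come from and whether they apply to the population whose deaths are being counted.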

There are some good reasons for doing this rather than trying to directly measure the outcome. It is potentially (see below) anchored in some (see below) more robust numbers that are much easier to get. There are much better estimates of what portion of lung cancers are caused by smoking than it is possible to calculate for the portion of all deaths. Modest studies can be purpose-built to get at the former, while only a really huge and detailed dataset lets you reasonably estimate the latter. Indeed, if we were trying to estimate the deaths from something that is much less harmful than smoking, this would really be the only choice because the excess deaths may not even produce a signal beyond the noise in the overall death statistics, even if you have a good measure of the exposure, which you often do not. If the estimates available from the many studies of an individual disease are aggregated into an overall estimate, that offers a potentially very good estimate for the effects of the one exposure on the one disease.

In addition, if done right (again, see below), this disease-list-based method is a way to reduce confounding. Smokers have a higher rate of death from cirrhosis, for example, but we are pretty confident that is not caused by the smoking. If you compare all deaths among smokers to all deaths among non-smokers, you almost inevitably will end up attributing some of those deaths to smoking, even if you try to “control for” drinking (which is pretty much never adequate). Thus, this method lets you just leave those deaths out.

However, there are several problems inherent in this method. There is some risk of double counting or exclusion of some of what should be counted, depending on what inputs are used. Someone whose death was caused by bladder cancer might also have died from stroke if surgery for the cancer caused the stroke (this makes both the cancer and the stroke a cause of that death). A study of bladder cancer would likely count that as a death from bladder cancer (and estimate the excess among smokers accordingly), while a study of cardiovascular disease might also count it. Of course many studies of CVD might exclude anyone with cancer to avoid this complication, but that introduces a problem if there are excess strokes in that population that are not actually caused by the cancer. And it is possible the study of bladder cancer might miss it. There is no clean way to do this — studies of individual diseases are simply not designed to be added up.

The added-up estimates will also suffer from publication bias. If you do a single study of a big dataset to estimate overall mortality, you will publish it. If you do a little study of bladder cancer and find a null result, you might not. If you do any study of lung cancer and find “too low” an estimate for the effect of smoking, you might bury it for political reasons. Of course, for all such studies, including the big dataset study, there are ways to bias the reported results and this makes everything in this space (perhaps everything in “public health” research) dubious. But in this case there is yet another layer created by the method.

There is also the problem that many of the available estimates are for the wrong population. This becomes a huge problem when you apply that “multiply by the number of deaths from the disease in the population” step. For example, the baseline rates of many cancers vary wildly, even among high-income populations. If smoking increases the risk for one of those cancers to some degree, then it might account for 30% of the cases in a population where the cancer is less common but only 10% of them where the cancer is three times as common. If you apply the 30% estimate to the total number of deaths in the latter population, you overestimate the smoking-attributable deaths by a factor of three. This error is not possible for the “one big dataset” method.
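To see the factor of three fall out of the arithmetic, here is a minimal sketch. It assumes, purely for illustration, that smoking adds the same absolute number of cases in both populations and that the cancer is three times as common overall in the second one; the case counts and names are hypothetical.

```python
def attributable_fraction(cases_from_exposure, total_cases):
    """Share of all cases of the disease that the exposure accounts for."""
    return cases_from_exposure / total_cases


# Hypothetical numbers: the same absolute excess from smoking in both places.
excess_cases = 300
total_cases_low = 1000   # population where the cancer is less common
total_cases_high = 3000  # population where it is three times as common

frac_low = attributable_fraction(excess_cases, total_cases_low)
frac_high = attributable_fraction(excess_cases, total_cases_high)

# Borrow the fraction from the low-rate population and apply it to the
# high-rate population's total, as the "multiply by the deaths" step does.
misapplied = frac_low * total_cases_high
correct = frac_high * total_cases_high
overestimate_factor = misapplied / correct

print(f"fraction in low-rate population:  {frac_low:.0%}")
print(f"fraction in high-rate population: {frac_high:.0%}")
print(f"overestimate factor: {overestimate_factor:.1f}")
```

If the effect of smoking were instead a constant multiple of the baseline risk, the fraction would carry over between populations, so the size of this error depends on which of those pictures is closer to the truth for a given disease.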

Then there are some more subtle and idiosyncratic problems. Recall that I noted that a few deaths are delayed by smoking. I also noted that it may be more important, depending on exactly what common language claim is being made, that previous mortality due to the exposure removes some deaths. That is, if someone already died from smoking in 2005 and thus did not die from something else in 2010, then an estimate of the deaths in 2010 will ignore her. Depending on exactly what claim is being made (e.g., how many more people died that year because smoking exists), that might be an error. If you are looking at the excess deaths over time in a population, this error will be avoided (if you know what you are doing) because you will avoid the phrasing that creates the error. But the disease-by-disease method introduces various ways in which that problem can arise (e.g., smokers who die of heart attacks at 55 do not live long enough to die of prostate or oral cancer, which almost always only kill much later than that).

Related to these, the method is particularly problematic for making estimates for ex-smokers: The definition of the exposure is unclear and may vary a lot across the input studies. Even the exposure definition for smoking is rather imprecise and thus can vary. Smokers vary in terms of quantity consumed, depth of inhalation, types of products smoked, and other factors that radically change how hazardous it is (notwithstanding the silly anti-tobacco propaganda that portrays it as always exactly the same). But ex-smokers vary even more, layering on the even bigger effects of time since quitting and length of the period smoked before quitting. Since this also varies a lot based on who is being studied, it is easy to apply an estimate to the wrong population.

It turns out that these problems alone are enough to make this method very rough, at best. That alone makes the reporting of statistics to two (let alone three or more) significant figures absurd, even apart from all the uncertainty in the input estimates. Still, it is not a terrible way to get a rough estimate, and it does have the advantages I noted.

So what does CDC do?

When you see an estimate of a total number of deaths, even for a non-US population (which, of course, is erroneous), it almost always traces to US CDC estimates. It turns out that what they do is a weird hybrid of the all-cause-mortality method and the disease-by-disease method, which suffers from many of the problems specific to each. Their exact methods and the data they use are guarded secrets (flatly contrary to the proper behavior of real scientists), but the general approach is known.

They do most or perhaps all of their calculations using a single dataset, an ACS cohort from decades ago. But they do their estimates on the disease-by-disease basis. This means that they do not take advantage of a benefit of the disease-by-disease approach, using the many estimates of the disease-specific risk to triangulate the true value; anything odd in their dataset ends up in their results with no external check. And their dataset is definitely odd, with a bias toward middle-class white people, in addition to being all Americans and from a generation ago, which means the typical interpretations of it applying to the entire current US population, let alone to the future or other populations, are extremely rough.

Their approach avoids blaming smoking for cirrhosis and transport deaths, so that is a plus. But over the years, they have “discovered” that smoking causes a longer list of diseases, and so have added those in. This has reduced this advantage of the disease-by-disease method (in the direction of increasing the claimed deaths, of course). If excess breast cancers among smokers are all basically due to confounding (which is plausible) and they just left breast cancer out like they used to, no problem. But now they have added it to the list, and so whatever association shows up in their one dataset, even if entirely due to confounding, is counted among the deaths.

By doing the analysis disease-by-disease, they have a chance to do a better job of controlling for confounding by using the right variables and functional forms. But they almost certainly do not do so. Instead, it is a safe bet that they — like most people in “public health” — pick their covariates relatively thoughtlessly, and if they do fiddle with them at all, it is to try to produce a “better” (i.e., higher) estimate of the risk.

One advantage of them using a single dataset goes back to the original question that started this series. If a study of ex-smokers only looks at, say, those who smoked regularly for more than 10 years and quit less than 30 years ago, then it would get a higher estimate of the effects of being an ex-smoker than if it also included the person from the question, a guy who smoked socially for ten years in the 1970s. If you then applied those estimates to that guy, it would clearly overestimate the risk of his death being caused by his former smoking. But whatever measure of ex-smoking CDC uses, it is necessarily going to be consistent; the estimate and the counting up of ex-smokers average together the same mix of people. There are still a few problems from the heterogeneity (the ex-smokers who actually are at risk of dying from it will die slightly sooner, and this shift will inflate CDC’s numbers), but the worst problem is solved.
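A toy calculation shows why the consistency matters. The subgroups, death counts, and attributable portions below are invented for illustration, and this is a simplification of my own, not a reconstruction of what CDC actually computes.

```python
# Hypothetical ex-smoker subgroups: (description, deaths, portion of those
# deaths attributable to the former smoking). All numbers are made up.
subgroups = [
    ("smoked socially, quit decades ago", 600, 0.00),
    ("smoked 20 years, quit 15 years ago", 300, 0.10),
    ("smoked 50 years, quit last year", 100, 0.40),
]

total_deaths = sum(deaths for _, deaths, _ in subgroups)
true_attributable = sum(deaths * portion for _, deaths, portion in subgroups)

# Consistent approach: the average is estimated over the same mix of
# ex-smokers that it is later applied to, so the low-risk and high-risk
# people average out and the total comes out right.
pooled_portion = true_attributable / total_deaths
pooled_estimate = pooled_portion * total_deaths

# Mismatched approach: the estimate comes from a study that excluded the
# low-risk ex-smokers, but it is then applied to all ex-smoker deaths.
high_risk = subgroups[1:]
high_risk_portion = (
    sum(deaths * portion for _, deaths, portion in high_risk)
    / sum(deaths for _, deaths, _ in high_risk)
)
mismatched_estimate = high_risk_portion * total_deaths

print(f"true attributable deaths: {true_attributable:.0f}")
print(f"consistent pooled estimate: {pooled_estimate:.0f}")
print(f"mismatched estimate: {mismatched_estimate:.0f}")
```

The pooled number says nothing useful about any individual in the group, but the total is right for the mix of people it was estimated from.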

So, this finally brings us back to the incredulity that prompted this discourse and all of its delightful tangents, the implicit thought of “how could they possibly say that that guy’s death was X% likely to have been caused by his ancient smoking, as if he were the same as someone who smoked 50 years and quit last year, and therefore count him as X% of a smoking-caused death?” The answer is that they do not make that particular error. When they count him in the ex-smokers, they also average in his not-any-higher risk for disease among all the higher-risk ex-smokers, and those cancel out.

So no problem there. Of course, there is still the matter of the other 6000 words worth of problems I covered in the series.