by Carl V Phillips
In the previous post, I gave some background about the new proposed rule from FDA’s Center for Tobacco Products (CTP) that would cap the concentration of the tobacco-specific nitrosamine (TSNA) known as NNN allowed in smokeless tobacco products (ST). Naturally, I think you should read that post, but to follow the scientific analysis which begins here, you do not need to.
Before getting to the even worse nonsense about NNN itself, it is worth addressing CTP’s key premise here: they claim that ST causes enough cancer risk, specifically oral cancer, that reducing the quantity of the putatively carcinogenic NNN could avert a lot of cancer deaths.
Readers of this blog will know that the evidence shows ST use does not cause a measurable cancer risk. That is, whatever the net effect of ST use on cancer (oral or otherwise), it is not great enough to be measured using the methods we have available. That does not necessarily mean it is zero, of course. Indeed, it is basically impossible that any substantial exposure has exactly zero (or net zero) effect on cancer risk. But even if all the research to date had been high-quality and genuinely truth-seeking — standards not met by much of the epidemiology, unfortunately — there is no way that we could detect a risk increase of 10% (aka, a relative risk of 1.1) or, for that matter, a risk decrease of 10%. Realistically, we could not even detect 30%. For some exposure-disease combinations it is possible to measure changes that small with reasonable confidence (anyone who tries to tell you that all small relative risk estimates should be ignored does not know what he is talking about). But it is not possible for this one, at least not without enormously more empirical work than has been done.
Despite that, FDA bases the justification for the rule on the assumption that ST causes a relative risk for oral cancer of 2.16 (aka, a 116% increase), or a bit more than double. This eventually leads to their estimate that 115 lives will be saved per year. Before even getting to their basis for that assumption, it is worth observing just how big this claimed risk is. (I will spare you a rant about their absurd implicit claims of precision, as evidenced in their use of three significant figures — claiming precision of better than one percent — to report numbers that could not possibly be known within tens of percent. I wrote it but deleted it and settled for this parenthetical.)
A doubling of risk, unlike the change of 10% or 30%, would be impossible to miss. Almost every remotely useful study would detect an increase. Due to various sources of imprecision, some would have a point estimate for the relative risk of 1.5 (aka, a 50% increase) and some 3.0, but very few would generate a point estimate near or below 1.0. Yet the results from most published studies cluster around 1.0, falling on both sides of it.
You would not even need complicated studies to spot a risk this high. More than 5% of U.S. men use smokeless tobacco. The percentages are even higher, obviously, for ever-used or ever-long-term-used, which might be the preferred measure of exposure. This would show up in any simple analysis of oral cancer victims. With 5% exposed, doubling the risk would mean about 10% of oral cancer cases among nonsmoking males would be in this minority. A single oral pathology practice that just asked its patients about tobacco use would quickly accumulate enough data to spot this. It is not quite that simple (e.g., you have to remove the smokers, who do have higher risk) but it is pretty close. The point is that the number is implausible.
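The arithmetic behind that observation is simple enough to sketch. Here is a minimal back-of-the-envelope in Python, using the 5% prevalence from above and a hypothetical relative risk of 2.0 (round illustrative inputs, not measured values):

```python
# Back-of-the-envelope: if 5% of nonsmoking men use ST and ST doubled
# oral cancer risk, what share of oral cancer cases would be ST users?
# (Both inputs are round illustrative numbers, not measured values.)
prevalence = 0.05  # fraction of the population exposed
rr = 2.0           # hypothetical relative risk

# Relative contributions to the case count, per unit of baseline risk
exposed = prevalence * rr
unexposed = (1 - prevalence) * 1.0
share_of_cases = exposed / (exposed + unexposed)
print(f"{share_of_cases:.1%}")  # prints "9.5%"
```

That roughly 10% of cases concentrated in a 5% minority is exactly the kind of signal a single oral pathology practice could not miss.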
In Sweden, ST use among men is in the neighborhood of 30% (and smoking is much less common). A doubling of risk for any disease that is straightforward to identify, like oral cancer and most other cancers, would be much more obvious still. But no such pattern shows up. The formal epidemiology also shows approximately zero risk. Most of the ST epidemiology is done in Swedish populations, basically because relatively common exposures are much easier to study.
So how could someone possibly get a relative risk estimate of more than double?
The answer is that they created the absurd construct, “all available U.S. studies” and then took an average of all such results. (They actually used someone else’s averaging together of the results. They cite two papers that did such averaging and — surprise! — chose the higher of the results, though that hardly matters in comparison to everything else.) This is absurd for a couple of reasons which are obvious to anyone who understands epidemiologic science, but not so obvious to the laypeople that the construct is designed to trick.
You might be thinking that it is perfectly reasonable to expect that different types of ST pose different levels of risk. Indeed, that seems to be the case (however, the difference is almost certainly less than the difference among different cigarette varieties, despite the tobacco control myth I mentioned in Part 1, the claim they are all exactly the same). But nationality obviously does not matter. Should Canadian regulators conclude that nothing is known about ST because there are no available Canadian studies? This is like assessing the healthfulness of eating nuts by country; the difference is not about nationality but mostly about what portion of those nuts are peanuts (which are less healthful than tree nuts). If the category of nuts is to be divided, the first cut should be health-relevant categories of nuts, not nationality. Nutrition researchers and “experts” are notoriously bad at what they do, but few would make this mistake like FDA did.
The error is particularly bad in this case: It turns out the evidence does not show a measurable difference in risk between the products commonly used in the USA and those commonly used in Sweden. The data for all those is in the “harmless as far as we can tell” range. But it appears that an archaic niche ST product, a type of dry powdered oral snuff, that was popular with women in the US Appalachian region up until the mid-20th century, posed a measurable oral cancer risk. It turns out that a hugely disproportionate fraction of the U.S. research is about this niche product — disproportionate compared to even historical usage prevalence, let alone the current prevalence of about nil. There is nothing necessarily wrong with disproportionate attention; health researchers have perfectly good reasons to study the particular variations on products or behaviors that seem to cause harm. Also, it is much easier to study an exposure if you can find a population that has a high exposure prevalence, in this case Appalachian women from the cohorts born in the late 19th and early 20th centuries.
It is not the disproportionate attention that is the problem. The problem is the averaging together of the results for the different products. That averaging might have some meaning if it were weighted correctly, but it was very much not weighted correctly.
The 2.16 estimate was derived using the method typically called meta-analysis, though it is more accurately labeled synthetic meta-analysis since there are many types of meta-analysis. It consists basically of just averaging together the results of whatever studies happen to have been published. Even in cases that are not as absurd as the present one, this is close to always being junk science in epidemiology. The problems, as I have previously explained on this page, include heterogeneity of exposures, diseases, and populations, which are assumed away; failure to consider any study errors other than random sampling error; and masking of the information contained in the heterogeneity of the results. To give just a few examples of these problems: Two studies may look at what could be described in common language as “smokeless tobacco use”, but actually be looking at totally different measures of quite different products. Similarly, one study might look at deaths as the outcome and another look at diagnoses, which might have different associations with the exposure. A study might have a fairly glaring confounding problem (e.g., not controlling for smoking), but get counted just the same, obscuring its fatal flaw as it is assimilated into the collective. One study might produce an estimate that is completely inconsistent with the others, making clear there is something different about it, but it still gets averaged in.
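To make the mechanics concrete, here is a minimal sketch of the pooling step such analyses use: a fixed-effect, inverse-variance-weighted average of log relative risks. All the study results below are made up for illustration, including one wildly inconsistent outlier of the kind just described:

```python
import math

# Made-up study results: (relative risk, 95% CI lower, 95% CI upper).
# The third "study" is a glaring outlier, but the method averages it
# in anyway, exactly as described above.
studies = [
    (0.9, 0.6, 1.35),
    (1.1, 0.8, 1.51),
    (5.0, 2.5, 10.0),  # inconsistent outlier
]

weights, log_rrs = [], []
for rr, lo, hi in studies:
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)  # SE from CI width
    weights.append(1 / se**2)                         # inverse-variance weight
    log_rrs.append(math.log(rr))

pooled = math.exp(sum(w * x for w, x in zip(weights, log_rrs)) / sum(weights))
print(round(pooled, 2))  # prints 1.22: the outlier drags the pooled
                         # estimate above the null the other studies show
```

Two essentially null studies plus one outlier produce an "elevated" pooled estimate, with nothing in the output flagging that the inputs disagree.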
But beyond all those serious problems with the method in general, all of which occur in the present case, this case is even worse. It is worse in a way that makes the result indisputably wrong for what FDA used it for; there is simply no room for “well, that might be a problem but…” excuses. It is easy to understand this glaring error by considering an analogy: Imagine that you wanted to figure out whether blue-collar work causes lung disease. This might not be a question anyone really wants an answer to, but it is still a scientific question that can be legitimately asked. Now imagine that to try to answer it, you gather together whatever studies happen to have been published in journals about lung disease and blue-collar occupations. As a simplified version of what you would find, let us say that you found two about coal miners, one about Liberty ship welders, one about auto body repair workers, one about secretaries, and two about retail workers. So you average those all together to get the estimated effect on lung disease risk of being a blue-collar worker.
See any problem there? If you do, you might be a better scientist than they have at FDA.
Obviously the mix of studies does not reflect the mix of exposures. Why would it? There is absolutely no reason to think it would. Notwithstanding current political rhetoric, only a minuscule fraction of blue-collar workers are in the lung-damaging occupations at the start of the list. The month-to-month change in the number of retail jobs exceeds the total number of jobs in coal mining. But the meta-analysis approach is to calculate an average that is weighted by the effective sample size of each study, with no consideration of the size of the underlying population each study represents. The proper weighting could easily be done, but it was not in my analogy, nor in the ST estimate FDA used (nor almost ever). If all the studies in our imaginary meta-analysis have about the same effective size, this average puts more weight on the <1% of jobs that cause substantial risk than on the majority that cause approximately zero risk. (Assume that you effectively controlled for smoking, which would be a major confounder here, creating the illusion that even harmless blue-collar jobs cause lung disease — as is also a problem with ST research.)
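The gap between the two weightings is easy to see with the analogy's numbers. Everything here is made up purely to illustrate: seven equal-sized imaginary studies, invented relative risks, and invented workforce shares:

```python
# Hypothetical relative risks for the seven imaginary blue-collar
# studies, plus made-up workforce shares for each occupation. None of
# these numbers are real; they only illustrate the weighting problem.
study_rrs = {  # occupation -> (number of studies, relative risk)
    "coal miners":       (2, 3.0),
    "ship welders":      (1, 4.0),
    "auto body workers": (1, 2.0),
    "secretaries":       (1, 1.0),
    "retail workers":    (2, 1.0),
}
workforce_share = {"coal miners": 0.005, "ship welders": 0.005,
                   "auto body workers": 0.01, "secretaries": 0.18,
                   "retail workers": 0.80}

# Meta-analysis-style average: each (equal-sized) study counts once
n_studies = sum(n for n, _ in study_rrs.values())
by_study = sum(n * rr for n, rr in study_rrs.values()) / n_studies

# Prevalence-weighted average: each occupation counts by workforce share
by_prevalence = sum(workforce_share[j] * study_rrs[j][1] for j in study_rrs)

# The study-count average lands over 2, while the prevalence-weighted
# average stays near the null, because nearly all workers hold the
# near-zero-risk jobs.
print(by_study, by_prevalence)
```

Under these toy numbers the study-count average roughly doubles the risk while the prevalence-weighted one barely budges from 1.0 — a pattern worth keeping in mind when looking at FDA's 2.16.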
As previously noted, it is not only possible, but almost inevitable that studies will focus on the variations of exposures that we believe cause a higher risk. No one would collect data to study retail workers and lung disease. If they have a dataset that happens to include that data, they will never write a paper about it. (This is a kind of publication bias, by the way. Publication bias is the only one of the many flaws in meta-analysis that people who do such analyses usually admit to. However, they seldom understand or admit to this version of it.)
It turns out that this same problem is no less glaring in the list of “all available U.S. studies” of ST. In that case, about 50% of the weight in the average is on the studies of the Appalachian powdered dry snuff[*], which accounts for approximately 0% of what is actually used. Indeed, the elevated risk from the average is almost entirely driven by a single such study (Winn, 1981), which is particularly worth noting because this study’s results are so far out of line with the rest of the estimates in the literature. A real scientific analysis would look at that and immediately say that this study cannot plausibly be a valid estimate of the same effect being measured in the other studies; it is clearly measuring something else, or the authors made some huge error. Thus it clearly makes no sense to average it together with the others.
[*] As far as we can tell. The reporting of methods in the studies was so bad — presumably intentionally in some cases — that the papers did not say what product they were observing. We know that the Winn study subjects used powdered dry snuff because she admitted it in a meeting some years later, and this was transcribed. She has made every effort to keep that from getting noticed in order to create the illusion that the products that are actually popular cause measurable risk. For some of the other studies we can infer the product type from gender and geography (i.e., women in particular places tended to be users of powdered dry snuff, not Skoal).
It is amusing to note what Brad Rodu did with this. Recall that the over-represented powdered dry snuff was used by Appalachian women. So effectively Brad said, “ok, so if you are going to blindly apply bad cookie-cutter epidemiology methods rather than seeking the truth with scientific thinking, you should play by all the rules of cookie-cutter epidemiology: you are always supposed to stratify by sex” (my words, not his). It turns out that if you stratify the results from “all available U.S. studies” by sex (or gender, assuming that is what they measured — close enough), there is a huge association for women (relative risk of 9) and a negative (protective) association for men. ST users in the USA are well over 90% male. Brad has some fun with that, doing a back-of-the-envelope calculation to show that if you apply that 9 to women and zero excess risk to men, you get only a small fraction of the supposed total cases claimed by FDA. And this is a charitable approach: if you actually applied the apparent reduced risk that is estimated for men, the result is that ST use prevents oral cancer deaths on net.
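A toy version of that stratified back-of-the-envelope looks like this. All the inputs are illustrative assumptions, not Rodu's actual figures: a 95/5 male/female split among U.S. ST users, the stratified relative risk of 9 for women, a charitable null RR of 1.0 for men (rather than the estimated protective effect), and equal baseline oral cancer risk for both sexes:

```python
# Toy stratified back-of-the-envelope. All inputs are illustrative
# assumptions, not the actual figures used in Rodu's analysis.
share_male, share_female = 0.95, 0.05  # assumed split of ST users

def excess_risk(rr):
    # Excess risk per user, in multiples of the baseline risk: RR - 1
    return rr - 1.0

# FDA's approach: one relative risk (2.16) applied to every user
one_size_fits_all = excess_risk(2.16)

# Stratified: null for men, RR of 9 for the small female minority
stratified = (share_male * excess_risk(1.0)
              + share_female * excess_risk(9.0))

print(f"stratified estimate is {stratified / one_size_fits_all:.0%} "
      f"of the one-size-fits-all estimate")  # prints "34%" under
                                             # these toy inputs
```

Even under these deliberately simple assumptions, most of FDA's claimed excess cases evaporate; using the actual protective point estimate for men instead of a null would push the net effect below zero.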
Notice that in my blue-collar example, you would also get a large difference by sex, with almost all the elevated risk among men. Of course, there is no reason to expect that sex has a substantial effect on either of these, or most other exposure-disease combinations. Results typically get reported as if any observed sex difference is real, but that is just another flaw in how epidemiology is practiced. The proper reason for doing those easy stratifications is to see if they pop out something odd that needs to be investigated, not because any observed difference should be reported as if it were meaningful. When there is a substantial difference in results by sex for any study where the outcome is not strongly affected by sex (e.g., not something like breast cancer or heart disease), it might really be an inherent effect of sex, but it is much more likely to be a clue about some other difference. Maybe it shows an effect of body size or lifestyle. Or perhaps the “same” exposure actually varied by sex. In the ST and blue-collar cases, we do not have to speculate: it is obvious the exposure varied by sex.
The upshot is not actually that when assessing the average effect, you should stratify the analysis by sex (though it is hard not to appreciate the nyah-nyah aspect of doing that). It is that averaging together effects of fundamentally different exposures produces nonsense. If there is a legitimate reason to average them together (which is not the case here), the average needs to be weighted by prevalence of the different exposures, not by how many studies of each happen to have appeared in journals.
It gets even worse. I put a clue about the next level of error in my blue-collar example: the shipyard welders worked on Liberty ships. In the 1940s, ship builders had very high asbestos exposures, the consequences of which were not appreciated at the time. Today’s ship welders undoubtedly suffer some lung problems from their occupational exposures, but nothing like that. Similarly, regulations and better-informed practices have dramatically reduced harmful exposures for coal miners and auto body workers. In other words, calendar time matters. Exposures change over time, and the effects of the same exposure often change too, with changes in nutrition, other exposures, and medical technology. There are no constants in epidemiology. (That last sentence, by the way, is a good six-word summary of why meta-analyses in health science are usually junk.)
One of the meta-analysis papers FDA cites breaks the study results out into those from before 1990 and those after. It turns out that the older group averages out to an elevated risk, while the later ones average out to almost exactly the null. This is true whether you look at just U.S. studies or studies of all Western products. Does this mean that ST once caused risk, but now does not? Perhaps (a bit on that possibility in Part 3). Some of it is clearly a function of study quality; I have pored over all those papers and some of the data, and the older ones — done to the primitive standards of their day — make today’s typical lousy epidemiology look like physics by comparison. A lot of this difference is just a reprise of the difference between the sexes: the use of powdered dry snuff was disappearing by the 1970s or so (basically because the would-be users smoked instead). In case it is not obvious, if you have a collection of modern studies that show one result and a smaller collection of older studies that show something different, you should not be averaging them together.
In short, a proper reading of the evidence does not support the claim that ST causes cancer in the first place. But even if someone disagrees and wants to argue that it does, that 2.16 number is obviously wrong and based on methodology that is fatally flawed three or four times over. That is, even if one believes that ST causes oral cancer, and even if one believes it could double the risk (setting aside that such a belief is insane), relying on this figure makes the core analysis that justifies this regulation junk science.
The next post takes up the issue of NNN specifically.