by Carl V Phillips
Some of you may have seen this new paper by Nutt et al. that purports to show the comparative costs imposed by various tobacco products. There might be some temptation to cite it as evidence of the benefits of switching from smoking to smoke-free alternatives. But I urge you not to do that for the reasons explained below.
I am corresponding with the authors and other researchers about this now. I wrote a 4,000-word review of the paper that goes into a lot of detail. I will eventually post some version of that – taking into consideration anything the authors might have to say about it. [UPDATE, August 2014: Upon linking to this post, I remembered this promise. I did not turn it into quite the finished product that I wanted, but time moved on. So to keep the promise, I have appended the whole thing as it exists in my file archives as a comment below. Note in particular that the bit about gutka that is the basis for linking from the other post does not appear in the main text here, but is there.] Perhaps I will back off on some of my criticisms after that exchange (one of the problems with the paper is that the methodology is not explained and the results are presented in a fairly opaque way, so it is possible I misinterpreted something). But I did not want to delay in warning people off of misinterpreting this paper or making the tactical and ethical mistake of citing the numbers in it. What I am presenting here are fairly general points that I am quite confident about, and do not expect would change upon learning more.
1. The numbers presented are probably not what you think they are.
The results are presented in the graph below. (That is basically the entire presentation of the results so, as I said, it is difficult to get the details right. Clicking through to the original from the above link might give you a higher quality image, but you will see it is still very difficult to estimate the magnitudes of many of the bars, and the similar colors make it difficult to know which bar matches what label.)
It is easy to glance at the graph and think “this is a calculation of the comparative risks from the various products”. But that is wrong for two reasons. First, what is graphed is not actually the risks (which most readers would interpret as meaning “the health impacts”) but an arbitrary combination of health risks, purchase price, amount of criminal activity surrounding the product, environmental impacts, and numerous other factors. Some of them measure the cost for each individual who uses the product, while others are some vague notion of the “global burden” caused by the level of use of the category. The way in which they are combined is never justified, or even explained, and there are some specific choices that I consider very dubious, which I will discuss in a later post.
Second, it is not a research-based calculation. All those numbers create the illusion of quantitative research, but what was apparently actually done is that a small group of researchers and pundits sat in a room, took guesses about the numbers, made arbitrary choices about how to combine them, and then presented that as if it were a fact-based quantification. It was a study of a focus group, not scientific research about the costs. Put another way, this is basically an opinion piece with a graph that implies it was quantitative research. Because of that, citing it as evidence is much the same as citing any other editorial. Citing the actual numbers – which are basically just made up – is a mistake that would be harmful to the cause of THR.
2. If you dig into the details, you will see that this offers fairly tepid support for THR, and substantially overstates the costs of e-cigarettes and smokeless tobacco. As readers of this blog know, there is overwhelming evidence that smokeless tobacco is about 99% less harmful than smoking, and every reason to believe that e-cigarettes and NRT are about the same. There is no evidence to suggest there are measurable differences in risk among these products, contrary to what Nutt et al. claim. The authors assign much larger risk numbers to smoke-free products, and then add in what seems to be purchase price (incorrectly, by my reading) and some other considerations, and thereby elevate those bars in the graph to the neighborhood of 5% of the height of the one for cigarettes. This is simply not right, and understates the case for THR. (It is worth specifically noting that the “Smokeless unrefined” category is not even about tobacco, but other more hazardous dip products.)
Moreover, the authors assign the NRT products shorter bars, despite no evidence existing to support such a claim, and in the text they specifically suggest that THR should aim to switch smokers to NRT products (which, of course, few smokers find satisfying), rather than e-cigarettes or smokeless tobacco. Thus, if you look at the details, this paper makes a much poorer case for THR than the evidence does.
3. And THR only looks as good as it does because that graph overstates the cost from smoking. A large portion of the cost attributed to smoking turns out to be exaggerated claims about effects other than the smokers’ health. It blames the crime (black markets) surrounding smoking on the cigarette and not on the punitive taxes that are actually responsible for black markets. Similarly, it counts the purchase price as part of the cost, but again that is mostly taxes and not the cost of the product itself. A large part of the cost is the mythical extra resource costs imposed on society by smokers (medical costs, lost work), but the reality is that the health effects of smoking clearly reduce, not increase, someone’s net consumption (basically, people who die ten or twenty years early do not produce much less in their lifetime but consume a lot less). Thus smoking should actually get a credit for this, according to the authors’ approach. Additionally, the “global burden” type measures seem to reflect the fact that cigarettes are more popular, which has nothing to do with the costs of each individual using one product rather than another. If these erroneous costs assigned to smoking were removed, the bars for THR products would probably be in the range of >10% of the one for cigarettes. Thus, to claim that this paper shows that smoke-free products’ comparative costs are those shown in the graph is to make that claim based on the bad numbers for smoking.
We know, of course, that smoke-free products impose enormously lower costs than smoking. The point here is that this paper does not add any support for that. We object when Glantz et al. make claims based on misinterpretations or junk numbers, and we should hold ourselves to a higher standard. This is both for ethical reasons, and because it becomes impossible to credibly criticize ANTZ junk science claims if we are trafficking in similar claims ourselves.
In short, please do not cite this paper as supporting THR. It does not provide very good support, grossly understating the benefits of THR. To the extent it does seem to provide support, it is not proper to cite it because (a) it is just made-up numbers and (b) the reason those numbers even look as good for THR as they do is because the numbers for cigarettes are wrong.
I will provide more details as they become available.
[Update 23mar15: I promised to post more and did not do it. But I was reminded by a recent spate of naive repeating of the above graph. I sent the following text to a discussion list that included the authors. They replied, though I did not ask for permission to share that. I will summarize by simply saying that I did not change my mind about any of my criticisms as a result of the reply.
This reminds me of the Levy paper from 2004, which was inaccurately cited for years as a basis for the absurd claim that smokeless tobacco causes 10% the risk from smoking. Since it was not an estimate of the risk (about which there was ample information), but rather a study of a small group’s beliefs (or, more accurately, their convictions), it obviously offered no such information. I very unhappily foresee the same thing happening with this one.
As with Levy, this is basically a focus group study, which makes it a study of the participants, not of the world. There is no evidence of any review of the empirical research or calculations; rather, the focus group relied on whatever they happened to know (or thought they knew) coming in, filtered through an arbitrary process for quantifying and combining. It would be fine if it were presented as what it really was – a study of a few people’s opinions, not a scientific measure of the world – but like Levy, it is presented as if it were a scientific measurement of some physical reality.
As a measure of reality, it is the functional equivalent of a bunch of guys sitting around at a bar debating which American football team has the best defense. Indeed the methodology is basically the same (minus the beer, presumably): assemble a group of people who know something about a topic (but are otherwise a very small random subset of all the people who know about the topic), trigger a discussion, and let them argue their points based on whatever they happen to believe sitting there that day. Fortunately in the case of the bar conversation, no one writes down the results or quotes the numbers they come up with.
At least with the Levy study, the criteria and process were well described, unlike in the present case. The methodology in Levy was a non-realtime barstool conversation which at least allowed the participants time to review literature and do calculations. There is no evidence they did so (and ample evidence that most of them certainly did not), but they could have. Apparently in the present case (though it is not clear), the discussion took place in realtime over two days. The mind boggles at how it would be possible to seriously address the second and third steps in the methodology within that time (determining what products are to be considered and establishing criteria for comparison), let alone moving on to the remaining five steps that created the numbers.
The goal of the present process was also far less concrete than Levy, creating a hybrid seat-of-the-pants index measure of some sort. This makes it difficult to declare that the numbers are wrong, but that is only because it is impossible for them to be right. For something to be right in science, it needs to be a measure of some defined real phenomenon. This is most definitely not a measure of any real thing, though it is portrayed as such. It is addressing a question like “how many angels can dance on the head of a pin?”
[To head off one inevitable response to that of “this was not intended to be science in the first place”: It pretends to be, and therefore should be analyzed as science. It is basically an opinion piece on the topic, “which product should we worry the most about?” But it is dressed up as if it is presenting scientific measures based on real data. Expressing normative opinions is fine. Dressing them up as scientific results is not.]
Though the results cannot be right and thus it is hard to say that they are wrong, it is not difficult to show that there are flaws in the reasoning that was used. An arbitrary undefined index cannot be right or wrong, unlike a real scientific measure (compare: “who has the best defense” with “who allowed the most points scored”), but the reasoning and background that went into creating it can clearly be wrong.
The problems begin with determining the products themselves:
-The authors point out that little cigars are basically the same as cigarettes. And yet they are considered separately (and earn a radically different score). The category boundaries are, of course, quite arbitrary (carving little cigars out from the cigarette category is not much different from carving Marlboros out as their own separate category). This is just the most obvious example of that.
-The division between “smokeless refined” and “snus” is a worse case of drawing a line where none exists (there are not even regulatory differences in this case).
-Gutka is not smokeless tobacco. Apparently it was considered such (it was mentioned in the introduction and described as “dry snuff” that is “common in SE Asia”, which it most definitely is not). This makes a hash of the ST results, which are thus fundamentally misleading (even to the extent that the numbers are meaningful).
-The use of the term “ENDS” is unethical. This is an aside that does not affect the results, but it does reflect badly on the authors and I cannot let it pass without mention. (For those not aware of this, consider British Empire era studies of the world’s peoples and their behaviors and artifacts, in which the researchers considered the subjects subhuman and thus ignored both their own self-knowledge and imposed derogatory jargon on them instead of using their own vocabulary. For obvious reasons, this is considered completely inappropriate and unethical in modern science. And yet tobacco researchers routinely treat their subjects as subhuman, and while this term is hardly the worst of it, it is one of those soft “-isms” that creeps in and makes even those supposedly sympathetic commentators from on-high part of the denigration. This would never be allowed in a science that is seriously concerned about human study ethics.)
Similar problems continue with the mere statement of the criteria being considered:
-The rehashing of criteria used for banned drugs is immediately evident in the evaluation criteria table. The mortality and morbidity entries include “misuse or abuse” of the product categories. This conceptualization might or might not be useful for other drugs; it is certainly inappropriate for tobacco, and speaks to this process being the equivalent of a quick-and-dirty port of a piece of software to another platform without consideration of whether it works there. Creating a division of “Product-specific mortality” (the basically non-existent misuse or abuse category) versus “Product-related mortality” (the harms caused by the actual use of products, like cigarettes causing cancer) is very strange and speaks to the process not being well reasoned.
-The “dependence” category tries to hide an arbitrary notion of “addiction” (not just dependence) behind the more respectable word. It includes “the product creates a propensity or urge to continue use despite adverse consequences”. Even setting aside the fact that the product is not an actor, this is one of the many badly failed attempts at defining addiction (everything has adverse consequences).
-Those who look at the tobacco experience as outsiders might not realize it, but the “Loss of relationships” entry is completely infuriating to those of us who study real people in the context of the tobacco wars — to say nothing of to the real people themselves. Whatever is bad about it, smoking has always been a basis for building relationships, and it is anti-tobacco measures that cause their loss.
-The categories of crime and international damage, as well as the personal loss of tangibles and family adversities, also largely measure the effects of regulatory policies and not the products. I will come back to those.
-As an economist, I have to object to the misuse of the term “economic cost” to refer to only costs to capitalists and government. Economics is the study of tradeoffs among all costs and benefits.
The problems become worse with the scoring and weighting:
-The paper explicitly claims that the measures on the [0,100] scale are linear. That is, it claims that numbers from it can be arithmetically compared. But the scale is an arbitrary index measure (an amalgam of incommensurate measures, added together) so this is clearly not true. This represents a scientific claim, and it is dead wrong. (As a bizarre minor point, they do not even seem to get this right, later reporting that the highest entry in their scale that they normalize to [0,100] is actually 99.6.)
-The Weighting section consists of only four paragraphs that (sort of) explain the concept of weighting different inputs to create an index measure. It describes one process for operationalizing that. And then it stops. It gives no information at all about what was done. The reader does not need to be told what weighting is (if they do not already know, they are not going to be able to understand this anyway), but rather needs to know what weighting was used. And we do not. This reads like something out of pomo critical theory, not the methodology section of a paper that produces numerical results.
-Similarly, the Scoring section is content-free. It explains only that the range was anchored at 100 and zero. There is not a word about what measures were used for the various items on the list of criteria.
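The 99.6 oddity noted above is easy to see with a few lines of arithmetic. If a scale is genuinely normalized to [0,100] by a linear (min-max) rescaling, the largest entry must map to exactly 100 by construction. This is a minimal sketch of that rescaling; the raw numbers are hypothetical illustrations, not values from the paper.

```python
def rescale_to_0_100(scores):
    """Linearly rescale raw index scores so the minimum maps to 0
    and the maximum maps to exactly 100 (standard min-max normalization)."""
    lo, hi = min(scores), max(scores)
    return [100 * (s - lo) / (hi - lo) for s in scores]

# Hypothetical raw index values, purely for illustration.
raw = [3.0, 12.5, 40.0, 87.2]
scaled = rescale_to_0_100(raw)
# By construction the top entry is exactly 100 -- so a reported maximum
# of 99.6 means the normalization was not actually done as described.
```

If the largest reported value on a supposedly [0,100]-normalized scale is 99.6, either the rescaling was not linear, was anchored to something other than the data, or was simply done inconsistently; none of these is explained.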
To summarize the Methods section, the reader is given a list of considerations and a textbook description of how to create an index scale. And then the results are presented. Methods reporting in health science is typically sloppy, with even clearly-defined methods used by the authors not reported properly. But what happened here is far worse.
It is difficult to imagine coming up with any quantitative scale for half of the considerations listed. It could be done, but throwing a scale together – let alone populating it – during the five minutes allotted to it during a two-day group meeting would not qualify as having done it. Any one of these (creating a quantitative scale for, e.g., “the extent to which the use of the product increases criminal behavior” or “the extent to which the use of the product causes family adversities”) would require at least a paper in itself. It would need to be clear about what was being measured and how it was being used. It would need to check the quantification against what objective measures can be found for how bad something is, and show that it is a decent measure for it. Most of all, it would need to be presented as one of the infinite arbitrary quantitative indexings possible for even one such consideration, and not as the measure of it. It is pretty clear that the reason that none of this was reported is that it was not done. I would guess that the actual methodology was that whoever in the group asserted themselves as the expert on, say, crime, arbitrarily assigned numbers to each product and after a few minutes of science-free “group consensus building” these were accepted. To return to the very similar study of football, whichever bar patron asserts his expertise on who has the better deep pass coverage has his opinion entered into the forming expert consensus (and while there are very good statistics on deep pass coverage available, unlike for most of what went into the present paper, it is a safe bet that they were not used as the basis for the entry).
So that would explain why the methods for quantifying the individual categories were not reported – there basically were no methods. This still does not explain why the weighting factors were not presented and justified. While it might be impossible to explain the methods that went into deciding why one product scored the 100 for “International damage” and why another got 76.32, it is trivial to report how this score was entered into the index. (“After scoring each criterion individually on a scale of [0,100], we created the index with the equation, Mortality*10 + Loss of wealth (in units of…)*5 +…., which was then rescaled linearly to [0,100].”) Once that was reported, of course, there would be a need to justify it. It seems that the only reason that this was not reported was that it would call attention to how arbitrary the process was. This is inexcusable.
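The kind of reporting called for here really is trivial: the entire combination step of such an index fits in a few lines. Here is a minimal sketch of how a weighted index of this sort could be constructed and reported; every criterion name and weight below is a hypothetical illustration, not the (unreported) values Nutt et al. used.

```python
# Hypothetical weights -- the very thing the paper should have reported and justified.
WEIGHTS = {"mortality": 10.0, "loss_of_wealth": 5.0, "crime": 2.0}

def index_score(criterion_scores):
    """Combine per-criterion scores (each on [0,100]) into one raw index value."""
    return sum(WEIGHTS[c] * s for c, s in criterion_scores.items())

def normalize(raw_values):
    """Linearly rescale raw index values so the largest equals 100."""
    top = max(raw_values)
    return [100 * r / top for r in raw_values]

# Hypothetical products and scores, purely for illustration.
products = {
    "product_a": {"mortality": 100, "loss_of_wealth": 60, "crime": 40},
    "product_b": {"mortality": 5, "loss_of_wealth": 20, "crime": 2},
}
raw = {name: index_score(scores) for name, scores in products.items()}
final = dict(zip(raw, normalize(list(raw.values()))))
# Change any entry in WEIGHTS and every ratio between products changes --
# which is why the index is meaningless unless the weights are reported and justified.
```

The point of the sketch is that reporting the combination rule costs one line, while the choice of WEIGHTS drives every ratio in the final graph; withholding it hides exactly the arbitrariness being criticized.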
[The Results section seems to hint at more information about the methods. It appears that the weightings can be inferred from Figure 4 in the results, though this is sufficiently badly labeled that it is a leap of faith to interpret it that way. I put this section in brackets because I am not sure of the interpretations; “Figure 4 is a garbled mess” is a plausible alternative interpretation. But interpreting it as best I can makes this process appear even more bizarre. First, no justification for the weights was given. And yet the weights were apparently determined to at least three significant figures for a methodology that is so arbitrary that one sig fig already implies more precision than really exists.
Second, the weighting for “related” mortality is an order of magnitude higher than for “specific” mortality. But any sensible index would add up all deaths and count them the same. This suggests the weighting is not only arbitrary and unjustified, but it does not even attempt to eliminate the bits of it that could be made non-arbitrary (weighting crime against mortality is necessarily arbitrary; weighting mortality against other mortality need not be at all arbitrary, but it appears they made it so).
Third, it appears, based on the discussion surrounding Figure 4, that some of the criteria are weighted based entirely on the metric “how much harm is caused per person doing this”, which is certainly what the reader will interpret the results as meaning. However, some of them are weighted based on total impact (“global usage”). This not only makes a hash of the index – needlessly measuring two incommensurate outcomes and squashing them together in an arbitrary way – but it is contrary to the way the results are presented and will inevitably be interpreted (presumably by design).]
So, lacking reporting of how the entries to the index were scored, or even how they were combined (unless one wants to infer a bit from Figure 4), the reader is left to infer what was done from the Results. The inference is not pretty. Basically this seems to just be a rehashing of easily-debunked myths, made worse by dressing them up with numbers.
I will just identify a few points that jumped out at me, and I suspect other readers can find many more with only a few minutes’ consideration. (Note that the bar chart, the only reporting of the results, is rather hard to read, using five almost-identical shades of blue, four almost-identical shades of orange, etc. It is also almost impossible to resolve at the low end. I am reading it as best I can. Needless to say, this is inexcusable in itself.)
-The financial, crime, international etc. costs all seem to be about taxes. As with everything else, this must be inferred because there is almost no useful information presented. But this seems likely from the moment one reads the list of considerations, and the skewing of these measures toward heavily-taxed products confirms it. It is badly wrong to conflate the effects of the product with the effects of the policies surrounding the product. Cigarettes (unlike, say, alcohol or amphetamines) do not cause any crime other than littering. Taxes on cigarettes cause crime. There can be no justification for attributing the effects of taxes to the product, and this error is made enormously worse by pretending that the results are about the product, with no acknowledgment that this is not the case. Moreover, such measures do not generalize – tax rates vary radically, and in low-tax jurisdictions, these harms do not exist.
-This and other problems are exemplified by the claim that a third of costs from cigarettes are “to others”. This seems to be a combination of the effects of the taxes and the supposed costs that are mislabeled “economic costs”. But the latter is also wrong, since it is clear that the impact on net production-minus-consumption from smoking is positive (that is, mortality from smoking reduces consumption more than it reduces production; indeed, it is not entirely clear whether smoking even net increases healthcare consumption alone, and when all reduced consumption is considered, the results are clear). So smoking should actually get a credit for the impact on others according to the criteria used.
-It is claimed that the American and Swedish smokeless tobacco products are radically different in terms of health risk. There is no evidence to support this claim, and all available evidence suggests it is false. This has been pointed out repeatedly for a decade.
-It appears (again, the smaller numbers are almost impossible to discern) that it is claimed that the health risks from e-cigarettes are lower than those from smokeless tobacco. There is no reason to believe this is the case, and good reasons to doubt it. It is clearly being claimed that NRT are lower risk (and referred to as “even purer”, whatever that means, in the Discussion), but there is not a shred of evidence to support this claim, and no good reason to believe it is true.
-Most of the costs attributed to smokeless tobacco and e-cigarettes seem to be “Family adversities”, which seems to mean “being out the purchase price”. But while smokeless tobacco users sometimes suffer from punitive taxes (and again, it is the taxes, not the product, causing the harm), e-cigarette users do not, and vaping is cheap. Yet the contribution is still half the total for e-cigarettes and similar in absolute magnitude to the contribution for other products.
-There is somewhat better health risk information about cigars and pipes than there is for NRT products, but not nearly enough to be making the specific claims that appear. There is even less about hookahs. Additionally, the use behaviors for these products vary so radically – even beyond quantity, which is a pretty good measure of heterogeneous use for the other categories – that making blanket claims about the categories is absurd.
-Gutka etc. do not even belong on this list, given that their impacts are clearly not caused by the tertiary ingredient, tobacco. But since they are here, they should be measured correctly. The evidence about their effects suggests their harms are comparable to those from cigarettes. This is not to say they are exactly the same (I would never make unjustified precise claims like that), but they are a lot closer to cigarettes than to smokeless tobacco by any reasonable measure.
The Discussion leaves little doubt that the authors are encouraging misinterpretation of these numbers. Arithmetic comparisons are reported as if they have concrete meaning, purposefully hiding the fact that the index is arbitrary. The barstool scorings of scientific facts are reported as if they were scientific analyses. Also, 12 discrete points are described as a continuum – a trivial problem with this, all else considered, but still worth mentioning.
The authors conclude that moving tobacco users from combustion products to non-combustion products would be good for public health. We knew that, but we learned nothing from this paper that changes our knowledge about that. They emphasize the NRT products as the preferred target for switching, even though that is the one low-risk product category that we know does not appeal to many smokers. So what is basically happening is that they collect a few people’s (unsubstantiated and probably wrong) guesses that NRT is substantially less harmful than other smoke-free alternatives, and then use that as a basis for the conclusion that NRT is less harmful. This is pure circularity (in addition to probably being detrimental to the cause due to the lack of consumer interest in NRT, and the faith-based orthodoxy among ANTZ to insist they are better nonetheless, and thus appealing low-risk products should be banned).
Similarly, the THR message itself is based on this. The impact of cigarettes is grossly inflated because it blames the product for the effects of taxes and incorrectly claims that it causes net consumption increases. Even with that, the numbers overestimate the true relative costs from smoke-free products substantially (which are approximately zero). But the point is that the THR message in this is based on an obviously incorrect analysis, so using this to promote THR is dishonest even apart from the more subtle points about the numbers themselves being arbitrary.
The analogy to an alcohol harm reduction strategy of encouraging drinks with lower alcohol concentration reinforces the appearance of this methodology as a bad software port, and suggests a lack of understanding of tobacco use by the lead author. Tobacco harm reduction reduces risk far more dramatically than the reductions from changing doses from other drugs, and for reasons that have nothing to do with dosage. Moreover, the fixation on dosage reduction is a disaster for THR; attempts to reduce the delivery of nicotine from low-risk products, which people with this mindset seem to support, interferes with THR.
In summary, this exercise would be – at its best – a subjective assessment that was nothing more than a group off-the-cuff commentary dressed up with numbers. If it were presented as that, it would at least be honest, though still not informative. But it is inaccurately presented as if it is a genuine quantitative analysis. Moreover, it is not even that good, because some glaring errors are evident. Even though no “right” version of these numbers could be produced, wrong inputs are certainly possible, and they are widely evident. Probably worst of all, these numbers are going to be generally misinterpreted as being estimates of the comparative real harm of product use (i.e., health impacts on the user), which they are clearly not; the authors do nothing to discourage this inevitable error.
I realize that some of the authors are on this list. Before I post this critique in other forums, please respond in this one to any criticisms you feel are unwarranted, factually incorrect or unfair.
This paper is going to be interpreted by naïve readers in the ways I describe, and as such the authors have created a problem. But the ethical test comes in how we report it. We have excoriated anti-THR activists for making unwarranted claims based on thin to non-existent evidence. Are we willing to hold work that is (mostly) pro-THR to the same standards?