by Carl V Phillips
The recent controversy (see the previous two posts) about Stanton Glantz’s “meta-analysis” that ostensibly showed — counter to actual reality — that e-cigarette users are less likely to quit smoking than other smokers, has left some readers wanting to better understand what this “meta-analysis” thing is, and why (as I noted in the first of the above two links) Glantz’s use of it was inherently junk science.
What is “meta-analysis”?
Synthetic meta-analysis consists of synthesizing the results of previous quantitative estimates of ostensibly(!) the same phenomenon into a single result, theoretically to get a better measure than is otherwise obtainable. For epidemiology, particularly including research about medical therapies, and other social sciences, this generally means averaging together various study results, each of which was already an average across the population that was studied.
In what follows, except when I note otherwise, I am writing about that particular type of meta-analysis, which is what is usually meant when the term is used in health sciences. I will leave out the modifier, synthetic, and the scare quotes, which I used to emphasize that this jargon does not mean what the word literally means. (In a sidebar in the first linked post, I briefly explain the other study methods that fall under the rubric “meta-analysis”.)
There are two basic approaches (though, of course, there are countless variations of these, and hybrid versions):

1. Taking the statistical results of each of the studies (e.g., the estimated odds ratios), and creating a weighted average of them (weighting based on the size of the studies).

2. Obtaining the original data from each study and pooling all the individual observations into a single dataset.

These theoretically accomplish the same thing, though the second offers several advantages in the rare instances it can be done (e.g., adjustment for confounding can be done by estimating the effect of “confounder variables” across all the data, rather than just accepting the adjusted result from each study).
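The first, weighted-average approach can be sketched in a few lines. This is a minimal fixed-effect (inverse-variance) pooling sketch; the odds ratios and confidence intervals are invented purely for illustration, not taken from any real study:

```python
import math

# Three hypothetical studies: (odds ratio, 95% CI lower, 95% CI upper).
# These numbers are invented for illustration only.
studies = [
    (0.8, 0.6, 1.1),
    (0.7, 0.4, 1.2),
    (0.9, 0.7, 1.2),
]

log_ors, weights = [], []
for or_, lo, hi in studies:
    # Back out each study's standard error (log scale) from its 95% CI width
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)
    log_ors.append(math.log(or_))
    weights.append(1 / se ** 2)  # inverse-variance weight: bigger study, bigger weight

pooled_log = sum(w * x for w, x in zip(weights, log_ors)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

print(f"pooled OR = {math.exp(pooled_log):.2f}, "
      f"95% CI ({math.exp(pooled_log - 1.96 * pooled_se):.2f}, "
      f"{math.exp(pooled_log + 1.96 * pooled_se):.2f})")
```

Note that all the arithmetic happens on the log scale, and that the only thing each study contributes is its point estimate and its (size-driven) weight — everything else about the study is discarded.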
The unstated fiction that justifies a meta-analysis (or, I should say, supposedly justifies this) is the “separated at birth” assumption: We pretend that all of these studies were actually bits of one large study, but the datasets were divided up and sent to different researchers to analyze. Since each of these smaller subsets will produce less reliable results, due to random error, than the (fictitious) whole dataset, the synthesis is intended to put them back together and get the more reliable result. That is what this kind of meta-analysis can accomplish, and nothing more.
Three things should be immediately apparent from this:
1. If it is absurd to pretend the studies were parts of a single large study, the whole enterprise is absurd. I will address that at length below.
2. Garbage in, garbage out. This is the simplest criticism of meta-analyses. If there are serious problems in the original studies (for purposes of answering the question at hand), the meta-analysis does nothing to fix them. It just incorporates their results and thus enshrines the biases. Indeed, it is really worse. It is “garbage in, garbage papered over”, because the meta-analysis not only does not fix the issues but it hides them. Someone reading the original faulty/biased/inappropriate-for-purpose study would at least have a chance of recognizing the problem, but someone just looking at the meta-analysis would not.
3. The only advantage of this method is averaging out random errors. In some sense this is a subset of point 2, but it is important enough to separate out. There are many different ways an epidemiologic study’s results can differ from the true value it is trying to estimate, but the only such error that is ever quantified in 99.99% of public health papers is random sampling error. Those confidence intervals you see are a heuristic quantification of roughly how much uncertainty there is from random error — i.e., that bad luck-of-the-draw resulted in an odd sample of the population. Confidence intervals ignore confounding, selection bias, measurement error (the data not representing the true values), and other problems. A common question about confidence intervals is “how likely is it that the true value falls outside of that range?” It turns out that this is rather complicated to answer, even if there were no other errors, and so most answers offered are wrong. But it is fairly easy to answer it in light of those other errors: The real answer is “extremely likely.”
I explain this because the single benefit of synthesizing those supposedly separated-at-birth studies is to reduce random error — it effectively creates a larger sample, and the larger the sample, the smaller the probability that the estimate is far from the true value due to chance alone. It averages out the errors from pure bad luck that produces unrepresentative samples of the population in each of the smaller studies. If a study had a major selection bias problem, it is possible that other studies in the collection could have a random scattering of selection bias impacts that average out. But much more likely is that other studies have bias in the same direction, caused by similar selection issues. This is even more true with confounding bias, since confounding is a characteristic of the underlying population, which should(!) be the same for the various studies. The “residual confounding” (confounding that is not “controlled for” with covariates) might not be quite so homogeneous, but it will probably be the case that the bias across studies is mostly in the same direction. Some measurement error (e.g., a typo in coding the data) is random, but some is not (e.g., people consistently underreporting how much alcohol they drink). Thus, the non-random errors are extremely unlikely to be remedied by the meta-analysis. They are just buried and enshrined in the summary statistic.
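This asymmetry — random error averages out, shared bias does not — can be demonstrated with a toy simulation. Here I assume twenty hypothetical studies that all carry the same fixed bias (say, a common confounder); pooling their estimates shrinks the random scatter dramatically, but the pooled result lands on the bias, not on the truth:

```python
import random
import statistics

random.seed(0)
TRUE_EFFECT = 0.0  # the true value we are trying to estimate
BIAS = 0.5         # an assumed non-random error shared by every study (e.g., confounding)

def run_study(n):
    # Each observation = truth + shared bias + random noise
    return statistics.mean(TRUE_EFFECT + BIAS + random.gauss(0, 2) for _ in range(n))

estimates = [run_study(100) for _ in range(20)]  # twenty biased studies
pooled = statistics.mean(estimates)              # the meta-analytic average

# The random scatter across studies has been averaged away, but the pooled
# estimate is still about 0.5 away from the truth: the bias is untouched.
print(f"pooled estimate = {pooled:.2f}, truth = {TRUE_EFFECT}")
```

The pooled estimate has a very tight confidence interval around a wrong answer — which is exactly the “buried and enshrined” problem.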
Why would you want to do meta-analysis?
The short answer, for anything in the realms covered by this blog, is: You wouldn’t. Not if your goal was truth-seeking and you wanted to do valid science, that is. Even apart from the specific problems described below, it simply serves no legitimate purpose.
To see this, consider a simplified example where synthetic methods make enormous sense. Astrophysicists are trying to tease out an extremely subtle signal from somewhere in the sky. They would do well to synthesize the data from many nights of observations from their telescope, and also observations from the same area and spectrum from other telescopes. It is possible that each one of these studies, considered alone, would produce nothing useful because there is too much random noise in the data, but when combined they produce useful information. Each study of the sky is observing the same phenomenon, so that condition is met, and the problem (in this story) is that random error from every single study overwhelms any signal.
Now consider the closest analogy to that in the world of epidemiology. We are trying to figure out whether drug X or drug Y is a more successful treatment for a disease. Imagine that a hundred hospitals across the country were interested in this, so they randomized patients with the disease to X or Y and recorded their outcomes. The problem is that each hospital only treated ten patients, and so for each of those reports, random error (e.g., by pure chance, assigning three patients who were doomed to not recover to X, but only two to Y) overwhelms the small difference in effect we are hoping to measure. A person just reading through the collection of reports would not be able to sort out the signal, as with the astrophysics case. But a meta-analysis could combine them all, as if they were a single study of a thousand people, which could be enough to estimate the difference in the drugs’ effects.
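A toy simulation of that story shows why pooling genuinely helps in this one scenario. The recovery rates here are assumed for illustration: individual hospitals are hopelessly noisy (many will even “favor” the worse drug by chance), while the pooled counts recover the difference:

```python
import random

random.seed(1)
P_X, P_Y = 0.65, 0.50            # assumed true recovery rates for drugs X and Y
N_HOSPITALS, N_PER_ARM = 100, 5  # a hundred hospitals, ten patients each

per_hospital = []
total_x = total_y = 0
for _ in range(N_HOSPITALS):
    x = sum(random.random() < P_X for _ in range(N_PER_ARM))  # recoveries on X
    y = sum(random.random() < P_Y for _ in range(N_PER_ARM))  # recoveries on Y
    per_hospital.append(x - y)  # a single hospital's "result": hopelessly noisy
    total_x += x
    total_y += y

favored_y = sum(d < 0 for d in per_hospital)  # hospitals that "favored" Y by chance
rate_x = total_x / (N_HOSPITALS * N_PER_ARM)
rate_y = total_y / (N_HOSPITALS * N_PER_ARM)
print(f"{favored_y} of {N_HOSPITALS} hospitals favored Y; "
      f"pooled recovery rates: X={rate_x:.2f}, Y={rate_y:.2f}")
```

This is the “separated at birth” assumption actually being true: every hospital really is sampling the same phenomenon, so combining them is legitimate.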
So how often do we actually have a situation like that for medical experiment data? Very rarely.
What we have instead are data from ten, maybe twenty, rarely fifty, such experiments that each included more subjects. Each produces enough information — imperfect, as always, of course — that it can be interpreted without meta-analysis tricks. Take a look at this distribution of study results Glantz reported. The figure in that post is a standard representation of a review of study results (leaving out the last row, which is the scientifically meaningless synthesis of the other rows). The rows are a list of studies. The center point of each graph to the right is the point estimate from that study, with the error bar representing how much propensity for random error the study had (the size of the grey blob for each study shows the same information). The exact size of the bar has an arcane meaning that can just be ignored by non-experts, but everyone can understand that the wider the bar, the smaller the study, and thus the greater probability of more random error.
Now pretend this figure summarizes the collection of decent studies testing drug X versus drug Y from the above story. You can look at that and immediately say “almost all the results are to the left, meaning X (let’s say) did better, so it is pretty clear that X works better. That is what we know right now.” But as long as you do not make the mistake of synthesizing all your results into the bottom row, thus throwing away most of the available information, you can say more. For example: “There are a couple of studies on the other side of the null. One is small, so that could have been just some extreme random error, though it would have to be really extreme. But for the larger one, with its lower likelihood of much random error, that is really not plausible. Something needs to be explained, rather than just pretending these studies differ due only to random error and lumping them together.”
In fact, if you know how the error bars are calculated and see how far apart they are, you can observe that the outlier study (“West”) and the third on the list (“Vickerman”) produce results that are utterly incompatible with the “part of the same large study separated at birth” assumption. That is, it is implausible that they were both reasonably unbiased studies of the same phenomenon. The probability of seeing a pair of results from studies of that size that differ so much by chance alone is down in the range of “might have never happened, even once, in the history of all medical research.” Thus we have no business assuming they differed due to chance and just throwing the results together. We need to think about why those studies that favored drug Y did so. Maybe it will reveal circumstances where drug Y really is better. Maybe there is something identifiably wrong with the study called “West” that means we should not be using its results at all. Or maybe that thinking would tell us, “hey, ‘West’ turns out to be the only one here that actually measures what we are interested in; all the others were really measuring something else or were hopelessly biased. So it alone, rather than a combination of all the studies, gives us our best current estimate.”
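That incompatibility check can be sketched as a simple z-test on the log odds ratios. The numbers below are illustrative only (not the actual West or Vickerman figures): under the separated-at-birth null, both studies estimate one true value, so their difference divided by its standard error should look like a standard normal draw:

```python
import math

# Two hypothetical, decent-sized studies of "the same" effect
# (illustrative numbers only, not the actual West or Vickerman results)
or_a, se_a = 1.60, 0.10  # study A: estimated OR and its log-scale standard error
or_b, se_b = 0.50, 0.12  # study B

diff = math.log(or_a) - math.log(or_b)
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # SEs combine in quadrature for a difference
z = diff / se_diff

# Two-sided p-value under the "separated at birth" null (one shared true value)
p = math.erfc(abs(z) / math.sqrt(2))
print(f"z = {z:.1f}, p = {p:.0e}")
```

A z-score in this range corresponds to a probability so small that “never happened once in the history of medical research” is the right way to read it — the honest conclusion is that the studies measured different things or that at least one is badly biased, not that chance explains the gap.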
Notice that all of this information is lost in the meta-analysis summary. Meta-analysis, then, serves the Orwellian language role of preventing particular lines of thinking because there is no available vocabulary to build the thoughts upon. If a meta-analysis is not serving the purpose of seeking an otherwise unobservable signal within a lot of noise (as in the astrophysics example), then it is destroying signal that does exist. Well, not destroying it, of course. It is still possible to go back and use that information. But it is hidden to consumers of the meta-analysis result. More to the point, using that other information is what should have been done. That other information tells us not just that the studies should not have been combined as if they were separated at birth, but tells us we need to figure out which of these clearly contradictory studies were faulty measures and why. (In this case it was most of those toward the left, due to selection bias, as I discussed previously.)
The more complicated statistical methods are, the easier they are for liars to use to produce a result that actually contradicts the real evidence. Imagine that there were “public health”-type liars in astrophysics, trying to concoct new supernovas that do not really exist. The statistical analysis of all those 1s and 0s the telescopes produce is so arcane that no one reading the press release could ever catch them at it. This could only be done by another expert who reanalyzed the original data or replicated the studies (though that is reasonably likely to happen in that field, unlike in public health). Most statistics in epidemiology are easy, and tell us little more than we can learn from a simple cross-tab of the study results. A critical thinker with high-school-level science can do that. So meta-analysis is a liar’s dream, offering a chance to bury that information beyond reach.
So, you might ask, why is meta-analysis ever done? Surely there must be reasons other than reverse-engineering a result that supports a personal policy preference, as Glantz did. There are various reasons. In descending order of legitimacy (or, rather, ascending order of illegitimacy) they are:
1. Our assessment of the quantity of interest might really be very close to some razor’s-edge decision point, and we need to make a decision. This is a broadening of the legitimate use of meta-analysis from that “hundreds of tiny experiments” example. What we really need to do is decide whether drug X or Y is going to be used from this point forward, but the results of the 15 (decent-sized and not apparently hopelessly biased) studies are normally distributed around it being a tie. This is apparently the result of random error. It is so close to a tie that (as with the astrophysics study) we cannot just spot the difference by thinking through the body of evidence. We need to break the tie as best we can to make a decision, and a meta-analysis can do that. But it is important to note that the real scientific assessment should remain “it is really too close to be sure, given the limits of our evidence, but as best we can tell, X has the edge”, rather than “the meta-analysis proves X is better.”
2. To dumb things down. At this point, we have already left the realm of scientific legitimacy and are serving other purposes. When reporting pre-election survey results, the news media often do a meta-analysis of several surveys (they would call it “averaging” them, which is an accurate description), to be able to provide a point estimate of support for each candidate. This is done for entertainment purposes. Those who are serious about truth-seeking, like people working for the campaigns to analyze survey results (if they are competent), are not going to just look at the simplistic averages reported on television. They are going to try to make sense of the full corpus of original information. The differences among the surveys contain information too.
Is there any legitimate reason for dumbing down the tests of drugs X and Y under the scenario where we pretended the Glantz figure represented those studies? No. Decisions should be made by people who are expert enough to make sense of the distribution — that most studies clearly favored X, though one outlier strongly supported Y. Averaging the studies together subtracts, rather than adds, information. If they need a soundbite for the media after making their decision, they can go with, “Almost every study favors X. One major study favored Y, and this has been a source of controversy. But the other evidence leads us to conclude that study must have gotten it wrong.”
3. To achieve “statistical significance”. This is purely a legal game, not real scientific inquiry. As with the debate game metaphor I offered in the first post in this collection, sometimes there are artificial rules for games that ape scientific truth-seeking but impose departures from it. Imagine that there are twenty studies that each produced an odds ratio of 0.8, but each is small enough that the result is not “statistically significantly” different from the null. A scientific analysis of that information would say “the OR seems to be about 0.8, and we should make decisions accordingly.” But a legalistic game — like a drug approval process or product liability trial — might have a rule that without a “statistically significant” result, all that evidence does not “count”. So someone does the meta-analysis to spit out a “statistically significant” synthetic point estimate to adhere to these rules.
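A quick sketch of the arithmetic behind this game, using the hypothetical numbers from the example (the log-scale standard error is an assumed value chosen so each study alone falls short of “significance”):

```python
import math

# Twenty hypothetical studies, each estimating OR = 0.8, each too small for
# "statistical significance" on its own (the log-scale SE is an assumed value)
log_or, se, n_studies = math.log(0.8), 0.15, 20

single_hi = math.exp(log_or + 1.96 * se)         # one study's 95% CI upper bound
pooled_se = se / math.sqrt(n_studies)            # pooling identical studies shrinks SE by sqrt(n)
pooled_hi = math.exp(log_or + 1.96 * pooled_se)  # pooled 95% CI upper bound

print(f"single study CI upper bound: {single_hi:.2f} (crosses the null, OR = 1)")
print(f"pooled CI upper bound:       {pooled_hi:.2f} (now 'significant')")
```

Notice that the point estimate never moves — the pooling changes nothing about what the evidence says, only whether it clears the arbitrary legalistic bar.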
Legalistic rules are not necessarily a bad thing for society. Rules for strict adversarial systems (like games, approval processes, or trials) cannot be as flexible as science, which is characterized by doing whatever works to figure out the truth. Criminal defendants are often found not guilty because of rules of evidence, even though it is obvious to anyone who is truth-seeking based on the evidence that they are guilty. Those rules might serve a greater purpose of discouraging police misconduct or protecting innocent defendants who the prosecution is trying to railroad. Of course it is also possible to argue that a particular rule does not serve the greater social good. But what is definitely true is that these rules do not exist to best seek the truth in a scientific way. They are designed to try to create a game that is pretty good at getting to the truth in spite of many players involved being willing to do or say anything to win. That is, rules of the legal game inevitably depart from truth-seeking scientific behavior, and so should not be confused with it.
[Aside: Such confusion is rife, of course. In real scientific thinking, “statistical significance” is merely a rule of thumb about whether to gather more data to reduce random error, or to move on to figuring out what the data means. It is an arbitrary line with no inherent importance. I am using the scare quotes to point out that this term sounds like it is a lot more significant(!) than it really is. For decades in epidemiology, it has been agreed (among those who really understand the science) that if you speak of statistical significance when talking about your results, you are doing something wrong.
Reporting your confidence intervals, to give readers a quick rough heuristic for how big the random error problem might be, is useful. But the mere fact of whether that confidence interval overlaps the null or not (which is the same as the result being statistically significant) is of no important consequence. Imagine one good honest study that estimated, say, e-cigarette use increases the risk of lung cancer by 20%. It would still be our best estimate of the risk even if the confidence interval overlapped the null. It would be out-and-out wrong to say that because of the lack of statistical significance, the study does not suggest there is increased risk.]
4. Because we can. There are tens of thousands of semi-skilled researchers in public health searching for ways to get publications to further their careers. The scientific value of the work does not really matter. Simplistic meta-analyses are trivial exercises that do not require any of the hassle of doing new field research, let alone tough scientific thinking. Just do a literature review and run the results through some software. And yet everyone writing in the area will cite it. In fairness, those who can add a bit of tough thinking can do better than the most simplistic meta-analyses, like Glantz’s, but it is still a “because we can” motivation.
5. To hide the flaws (or mere heterogeneity) in a collection of studies behind a summary statistic. It is the most scientifically illegitimate reason for a meta-analysis, and is clearly the reason for Glantz’s.
Characteristics that make a meta-analysis invalid for any purpose
Apart from the fact that there are vanishingly few cases where a meta-analysis could serve a serious scientific goal, there are often affirmative reasons why it is actively wrong to do it. Much of this has been covered already, but it is worth highlighting.
Meta-analysis does not work if the studies are attempts to measure different phenomena. Glantz threw into his mix studies of people who happened to have encountered and used e-cigarettes and studies of people who were encouraged to use e-cigarettes after volunteering for a clinical smoking cessation trial. Whatever the effects of each of these two very different exposures, there is clearly no reason to believe they would be the same. If studies of one got results that were different from studies of the other, it would not be because of random errors that need to be averaged out, but because they were measuring the effects of different exposures. Suggesting otherwise because “both measure effects of e-cigarettes” is the same as the astrophysics meta-analysis combining data from different areas of the sky because “they are all measures of the sky.” (Anyone who cites smoking cessation trial results as if they inform about the real-world effects of e-cigarettes is making the same mistake — see what I have written before.)
But worse than that, it is vanishingly rare that any two non-clinical epidemiologic studies ever measure the same phenomenon. A study of American smokers’ use of e-cigarettes and one of British smokers’ use of e-cigarettes are studying different phenomena. Even more so for other populations. They should produce different results, apart from random or nonrandom error, and averaging them together produces something as meaningless as “what is the average household square footage across a group of 1000 Americans, 400 Brits, 300 French, and 200 Dutch.”
And it is even worse than that. Maybe different populations of Westerners do have similar effects for the phenomenon in question. But we can be sure we are studying populations who are going to have different effects from the exposure if one study is of people who recently had heart surgery and are really motivated to quit smoking, another is of random smokers, another is of volunteers for a cessation study, another is of smokers who are so desperate about their inability to quit that they call a quitline, and so on. And more so when some of the studies look at populations in 2010 and others in 2014, a difference that would not matter if you were studying the effects of vitamin intake on a cancer, but that is huge for e-cigarettes and behavior.
Then there is the problem that the exposures vary. “Used e-cigarettes” is not a well-defined exposure. It can obviously vary tremendously, and so studies that select or define the exposure differently are measuring different phenomena. The easiest illustration can be found in cessation trials. While cessation trials could not possibly measure the real world impacts that Glantz purports to be trying to measure, they could theoretically be made fairly compatible with one another. But they are not and will not be, and so meta-analysis of them alone is inappropriate. Each will offer subjects different products. Far more important, each will offer somewhat different levels of information and persuasion to subjects, and these differences will not even be reported in the methods. (And, of course, the populations will be different too.) The effects of these experiments will differ because the experiments differ, not merely because of random error that can be averaged out.
Now you might recall that Glantz ran a weak “sensitivity analysis” looking at how the results differed across some of the most glaring of these differences. But who cares? That is not the point. The point is that there is no conceivable legitimate meaning to a statistical average of their results, whatever any sensitivity analysis showed.
A more legitimate reply to these concerns is that if you have enough studies that sort of represent some real-world distribution of exposures and populations, then the average could mean something. In theory, a bunch of cessation clinics that provide their patients with e-cigarettes and advice about them might collect data about outcomes. Each study would be a different exposure in a different population, but collectively they might represent the breadth of the general phenomenon of clinics engaging in that practice, and so the average result might represent some real average from the world. But nothing in sight looks like that — and it would take some serious thinking to ascertain if something did, not just throwing whatever came along into a statistical soup — so that is moot.
Notice that what was just discussed is not about the original studies being faulty. All of these problems still exist even if each study was as perfect a measure as it could be of the effect of interest, for its specific definition of the exposure, in its particular population, measured in its particular way. An additional layer of fatal flaw is added when they are not that.
In particular, consider the selection bias problem that I explained in the previous posts. Because of selection bias, most of the studies Glantz used in his meta-analysis were terrible measures of the supposed phenomenon of interest in the first place. Some were biased by the original authors’ intent, while others never purported to offer a measure of what Glantz used them for (and the authors told him so), but he created the selection bias by interpreting them that way. Obviously averaging together a bunch of incommensurate results that are terrible measures of the phenomena of interest is even worse than merely averaging together incommensurate results. Once again, the role of meta-analysis in this case is to hide the fact that they are terrible measures behind statistical games.
Glantz’s meta-analysis is not just junk science because of details about the studies, though those are problems in themselves. It is junk science because there are probably not even two of the studies in his collection that are similar enough to average together, let alone all of them. I cannot imagine there ever being behavioral studies of tobacco use that could be legitimately combined into a meta-analysis, nor any scientific reason for wanting to do so.