What is peer review really? (part 9 — it is really a crapshoot)

by Carl V Phillips

I haven’t done a Sunday Science Lesson in a while, and have not added to this series about peer review for more than two years, so here goes. (What, you thought that just because I halted two years ago I was done? Nah, I consider everything I have worked on since graduate school to be still a work in progress. Well, except for my stuff about what is and is not possible with private health insurance markets; reality and the surrounding scholarship have pretty much left that in the dust. But everything else is disturbingly unresolved.)

I explained earlier in this series that the picking of journal reviewers (and recall this is about health science journals except where otherwise specified; other sciences are better, some a lot better) is just a roll of the dice. Every now and then, a paper will be reviewed by someone who is highly qualified — which mostly means being a sufficient expert on the methods — and puts serious effort into it. For obvious reasons there is no data on how often that occurs, but it is pretty clearly down in the single-digit percentage range. I would guess that a slight majority of the time, exactly one reviewer has some passable skills and puts in enough effort to catch some simple problems, though quite often the entire review process is utterly useless other than for tinkering with cosmetic improvements and minor clarifications.

One implication of this, which I have noted, is that basically no gatekeeping function is served by the journal review process beyond what an editor could do at a glance (e.g., making sure the paper is not some bizarre manifesto). Even for a paper that is patently fundamentally wrong, but has the superficial appearance of a proper journal article, it only takes a few submission attempts to get it accepted. (That is, unless it is “politically incorrect” — a serious challenge to vested interests like, say, an analysis of the failure of peer review — in which case it is nearly impossible to get it accepted even if no serious flaws are identified.)

I recently reviewed a paper for a journal I have a good relationship with. It is a solid journal (by the standards of health science, so that is not super high praise) that tries to publish the truth (in contrast with anti-tobacco journals like Tobacco Control or Pediatrics, which publish political propaganda). The paper was not about tobacco or another substantive topic I am particularly expert in, which does not matter at all: As I noted, good reviewers need to be very expert in the methods, but a passing knowledge of the subject matter is sufficient. That is to say, someone with expertise like mine should be reviewing most every epidemiology or social science paper in health research, but that probably does not happen even 1% of the time. As I have noted previously in this series, editors typically send papers to be reviewed by authors of previous papers in the subject area, without regard to their methodological expertise or even competence. Many of those authors do not even understand how to apply the research methods correctly in their own work. And even among those who are perfectly competent researchers and can grind out a legit paper, vanishingly few have the particular expertise required to assess and critique the analytic core of whatever paper is dropped on their desk. No such skill is needed to be a health science researcher, after all, even a good one (which most are not). So sending this paper to me was a fortunate move on the part of one editor.

I am not going to identify the paper or provide details that will identify it (though the publisher does open review, so you could track it down later if you really cared; thus I am definitely not violating any confidence). For this lesson, I will make up an analogy for what they did. It was clear that the authors really were trying to report the truth (i.e., this was not an attempt to create propaganda), but were not really properly trained/experienced scientists (as is the case with most people writing in health science).

So in this made-up version of the paper (a close analogy to the real methods and numbers, with the topic area changed), the authors surveyed public high schools in New Hampshire. There are 110 of them. Of those, 30 did not respond, a few for unknown reasons, but most because their school district refused to participate in the survey (which was distributed through the districts). Because they were only disseminating the survey through public school districts, the authors did not attempt to survey the non-public schools in the area; they were not sure exactly how many of those there even were, though they suggested it was more than half as many as there were public schools. They then reported simple tabulation results (i.e., they basically just presented the data), with some obvious cross-tabulations (e.g., “of the schools that said they provided free breakfasts to needy students, 85% of them did so for every student who was eligible for free lunches”), with no regressions or any other analysis beyond counting up.

There were three reviewers. One of the other two said the paper was perfectly fine the way it was and should be accepted. The other fell into the “not completely clueless and/or lazy” category and discussed, as I did, how the methods reporting needed to be more clear. That was useful, though rather understated; it was really like a scavenger hunt to piece together what was reported in the Methods section as well as information that only trickled out later in the paper, in order to figure out the basic information that comprised the first three sentences of the previous paragraph. So one out of the three reviewers did not even read the paper carefully enough to figure out that it was extremely difficult to figure out who the target population and the study population were. I was the only reviewer who noted that a simple table with all the response counts would have been the best way to clearly present all the information that appeared in multiple tables and prose. I was also the only one who noted that the authors needed to report actual numbers, not just percentages, especially for their many unquantified subsamples (e.g., “of the 26 schools that said they provided free breakfasts, 22 of them (85%) did so for….”).

Actually, when I say that one reviewer out of three missed the failure to explain basic methods, which even a casual reader should have noticed, I am being generous. After all, the petit anthropic principle applies: wherever you go, there you are. That is, I really should not count myself, because if I were not a reviewer then no one would even be doing this analysis. So really the meaningful data is that one out of the two who were randomly (from the position of my worldview) selected from the pool of potential reviewers caught this obvious problem. That conforms to my suggestion above that a small majority of reviewers are not even slightly useful. Not that this is terribly strong support for that estimate, of course: If we assume those two were (effectively) randomly selected from the population of potential reviewers, the 95% confidence interval around the resulting point estimate of .5 of reviewers being somewhat useful is (.09, .91).
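For readers who want to check the (.09, .91) figure: the post does not say which method produced it, but the Wilson score interval for 1 success out of 2 reproduces those numbers, so the sketch below assumes that method. The function name is just a label chosen for this illustration.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion.

    z = 1.959964 is the normal quantile for a two-sided 95% interval.
    Assumption: this is the method behind the (.09, .91) figure; the
    post itself does not name one.
    """
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# One useful reviewer out of the two "randomly selected" ones.
lo, hi = wilson_interval(1, 2)
print(f"{lo:.2f}, {hi:.2f}")  # 0.09, 0.91
```

Other interval methods give different bounds for the same data, which fits the point that the exact values matter far less than the rough width.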

I present those statistics because of the real punchline of this story. A quick refresher (or perhaps a first time explanation) for those who need it: Those intervals you see around statistics when you read papers are frequentist confidence intervals. They basically offer a rough yardstick for how uncertain the estimate is due to random sampling error. If you randomly call a bunch of people and ask them something, and then use those responses to estimate the values for the whole population, you are probably not going to get exactly the right number; by chance alone, you might call people who answer “yes” more often than if you asked every single person in the population. The confidence interval is basically a qualitative way of telling the reader how big that potential problem is (yes, despite it being numbers, it is really best thought of as a qualitative measure, like just choosing a word from {“huge”, “big”, “substantial” … “trivial”}).

If you care (and you really shouldn’t), technically those two numbers that form the interval are the following: The lower number is the smallest possible true value for the population such that, if that really were the true value, there would be no more than a 2.5% chance that by luck-of-the-draw alone you would accidentally get a result as high as or higher than the point estimate you actually got from your data; the upper number is analogous. Did you follow that? It matters not at all if you didn’t, because the exact technical meaning of the two numbers, as well as their exact values, matters not at all. As I said, it is really just a rough scale of how vulnerable to random sampling error a point estimate was. The same information could be communicated if those statistics that probably do not have much random error (i.e., the effective sample size was large) were printed in dark ink, and as the statistic became more uncertain due to potential random error it was printed in increasingly lighter ink. The numbers themselves mean nothing important, though few consumers of health science papers, or authors for that matter, understand this. I would be shocked if one out of a hundred authors who report those intervals could properly define what they are, as I did (and you said “huh?” about) in this paragraph. Most would say something like “there is a 95% chance the true value is in this interval” or even “the value could be anything between…”, which are out-and-out wrong.
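The definition in that paragraph can be turned into code almost word for word. The sketch below is one way to do it, assuming simple random sampling from a very large population (a binomial model): it searches for the true proportion at which the chance of a result as high as (or as low as) the observed one is exactly 2.5%. This is usually called the "exact" or Clopper-Pearson interval. The function names and the 20-of-80 example are illustrative choices, not anything taken from the paper.

```python
from math import comb

def prob_at_least(x, n, p):
    """P(X >= x) when X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

def prob_at_most(x, n, p):
    """P(X <= x) when X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, x + 1))

def bisect_for(f, target, increasing, tol=1e-10):
    """Find p in [0, 1] with f(p) == target, for a monotone f."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid  # the solution lies above mid
        else:
            hi = mid  # the solution lies below mid
    return (lo + hi) / 2

def exact_interval(x, n, alpha=0.05):
    """Interval built directly from the tail-probability definition."""
    # Lower bound: the true value at which a result this high or higher
    # has probability alpha/2 (this tail probability rises with p).
    lower = 0.0 if x == 0 else bisect_for(
        lambda p: prob_at_least(x, n, p), alpha / 2, increasing=True)
    # Upper bound: the true value at which a result this low or lower
    # has probability alpha/2 (this tail probability falls with p).
    upper = 1.0 if x == n else bisect_for(
        lambda p: prob_at_most(x, n, p), alpha / 2, increasing=False)
    return lower, upper

# Illustrative input: 20 "yes" answers out of 80 randomly sampled units.
print(exact_interval(20, 80))
```

Swapping in a different method changes the bounds somewhat, which is another reminder that the interval is a rough yardstick rather than a pair of meaningful numbers.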

As a further aside that is not relevant to the peer review story, but is here because this is a Sunday Science Lesson: The word “effective” in “effective sample size was large” matters. Some types of studies can have an enormous sample size but a small effective sample size. That could not be the case in the present peer review story, where the statistics are just the percentage of people giving a particular response to a question. But if you are looking at, say, whether nonsmoking teenagers who vaped were more likely to start smoking than nonsmoking teenagers who did not vape, you could survey twenty thousand random 18-year-olds and only find two kids who were nonsmokers at 16, but who vaped, and then later started smoking. In epidemiology parlance, there are four “cells” representing the nonsmokers (at 16) who either vaped (at 16) or did not, crossed with either started smoking (by 18) or did not. There would be thousands of subjects in the two non-vaping cells, and might be a few hundred in the “vaped but did not start smoking” cell, but only the two in the vaped and started smoking cell (plus a few thousand who smoked at 16 who are not in any of those cells). In that case, your vulnerability to random error (your effective sample size) is a function not of the 20,000, but for all practical purposes a function of the smallest cell count, the 2. The point estimate of the relative risk of starting smoking, comparing different vaping statuses, would have to be printed in very light ink. (Note, however, that the point estimate would still be the point estimate. There was a recent kerfuffle in the vaping research world when an anti-vaping study based its main result on data where the count for what was basically this cell was 6. Some pseudo-scientific critics tried to claim that this small number invalidated the whole analysis. No, sorry: it just means there is a very wide confidence interval, which was reported.)
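To see why the smallest cell dominates, here is a minimal sketch using the standard large-sample formula for the standard error of the log relative risk. The cell counts are hypothetical, picked only to match the rough magnitudes described above (2, a few hundred, thousands, thousands); they are not from any real study.

```python
import math

def log_rr_standard_error(a, b, c, d):
    """Standard error of log(relative risk) for a 2x2 table.

    a: exposed, outcome        b: exposed, no outcome
    c: unexposed, outcome      d: unexposed, no outcome
    """
    return math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))

# Hypothetical counts: vaped and started smoking, vaped and did not,
# did not vape and started smoking, did not vape and did not.
a, b, c, d = 2, 300, 2000, 13000

se = log_rr_standard_error(a, b, c, d)
se_smallest_cell_only = math.sqrt(1 / a)

# Ratio of the upper to the lower 95% bound on the relative risk.
bound_ratio = math.exp(2 * 1.96 * se)

print(f"SE of log(RR): {se:.3f}")
print(f"SE from the smallest cell alone: {se_smallest_cell_only:.3f}")
print(f"upper bound / lower bound: {bound_ratio:.1f}")
```

With these made-up counts, the 1/a term contributes nearly all of the variance, so the thousands of subjects in the other cells barely tighten the interval.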

Returning to that “rough heuristic” point: If you ever see anyone suggesting that the specific values of the confidence interval bounds have real meaning (e.g., saying “the true value could fall anywhere between .09 and .91”), you can immediately be sure that he does not understand the statistics he is using. You can further be sure that he is willing to make pronouncements about things that he does not understand. This is true even setting aside the more important point that the confidence interval assesses only one possible source of error: that your sample was, due to bad luck alone, not representative of the whole population you were attempting to characterize. There are also measurement error problems (e.g., the survey respondents may not give accurate answers about whether they vaped or not, either intentionally or accidentally) and sampling bias (e.g., your sampling method might have missed demographic groups who are more likely to start smoking). I made my name in epidemiology by suggesting methods for quantifying these types of errors alongside that single, often far less important, error that does get quantified. (That was two decades ago. To this day, and despite important practical improvements on what I came up with back then, approximately no one ever does such quantification.) And if someone is trying to draw causal conclusions, rather than just report the proportions, there is also the matter of confounding, which is a whole other deep challenge.

Anyway, back to the paper I reviewed: The authors reported confidence intervals for each of the percentages they reported. E.g., “25% (95% CI: 16%, 36%) of the schools reported they supplemented their free lunch funding allocation with additional general funds.” Those who understand the statistics, or anyone who has been reading carefully, will be saying “wait, what?!” It took only a minute to confirm that what the authors reported were the confidence intervals that would apply if their 80 subjects (the schools that responded) were a random sample from a very large population. But they were very much not that. The methodology was an attempt at complete enumeration: to get data about every single member of the targeted population, all public high schools in New Hampshire. It was not quite complete, since 30 of them did not submit surveys, but that is obviously not the same as having a small sample from a very large population. It is not even the same as having a random sample of 80 from 110. I have to wonder what they would have done if all 110 had reported. Would they still have reported random sampling error statistics even though there was no sampling?

If what I am saying is not already obvious, try this. If there were 10,000 schools and they sampled 80 of them, then the point estimate for this question would be that it was true for 2500 of them, with a confidence interval of 1600 to 3600. That works. But with a sample of 80 from 110, they are saying the point estimate for the whole population is 28, with a confidence interval of 17 to 40. But, um, that 25% means that they already observed 20 who are affirmative for this question, which means 17 is not even a candidate true value, let alone the lower bound for the confidence interval. Oops.
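The arithmetic behind that "oops" takes only a few lines. This sketch uses the made-up numbers from the analogy (110 schools, 80 responses, 20 of them affirmative, reported interval of 16% to 36%); the variable names are just labels for this illustration.

```python
N_POPULATION = 110   # all public high schools in the analogy
N_RESPONDED = 80     # schools that returned the survey
observed_yes = 20    # 25% of the 80 respondents

ci_low_pct, ci_high_pct = 16, 36   # the interval the authors reported

# What the reported interval implies when scaled up to all 110 schools.
implied_low = N_POPULATION * ci_low_pct / 100
implied_high = N_POPULATION * ci_high_pct / 100

# What the data actually permit: the 20 observed "yes" schools exist no
# matter what, and at most all 30 non-respondents could also be "yes".
non_respondents = N_POPULATION - N_RESPONDED
min_possible_yes = observed_yes
max_possible_yes = observed_yes + non_respondents

print(f"implied by reported CI: {implied_low:.1f} to {implied_high:.1f} schools")
print(f"logically possible:     {min_possible_yes} to {max_possible_yes} schools")
print(f"lower bound is impossible: {implied_low < min_possible_yes}")
```

The logically possible range is not a confidence interval; it is only the hard limit that any sensible interval for the 110 schools would have to respect.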

It turns out that if you pretend that the 80 was a random sample from 110, you can calculate a proper confidence interval using what are known as “exact” methods. The resulting interval would be roughly half as wide as the one they reported (if you sample most of the population, there is much less room left for random sampling error). But that would be wrong too, since the non-responses were inevitably not random. E.g., perhaps one of the two large districts in the state (far more impoverished than average) refused to participate, or perhaps understaffed schools that do not have time to indulge the research are systematically different.
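A rough way to gauge how much sampling most of a small population shrinks the random error is the textbook finite population correction. This is a normal-approximation sketch, not the "exact" method mentioned above, and it still leans on the pretense that the 80 respondents were a random draw from the 110, which they were not.

```python
import math

def finite_population_correction(population, sample):
    """Factor that scales a standard error when sampling without
    replacement from a finite population."""
    return math.sqrt((population - sample) / (population - 1))

p_hat, n = 0.25, 80

# Standard error assuming an effectively infinite population.
se_infinite = math.sqrt(p_hat * (1 - p_hat) / n)

# The same standard error, corrected for drawing 80 out of only 110.
factor = finite_population_correction(110, n)
se_finite = se_infinite * factor

print(f"correction factor: {factor:.2f}")
print(f"SE, huge population:   {se_infinite:.3f}")
print(f"SE, population of 110: {se_finite:.3f}")

# For comparison, with 80 drawn from 10,000 the factor is close to 1.
print(f"factor for 80 of 10,000: {finite_population_correction(10_000, n):.3f}")
```

The correction only addresses random sampling error; it says nothing about non-respondents who differ systematically from respondents.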

Of course, all that is overthinking the problem a bit. The real problem is that these authors remembered those confidence interval thingies they learned about in the one class they ever took in research methods, and which they see in most every paper they read, and thought “ok, I am supposed to do this, so I will.” They then took the numbers that are spit out from the plug-and-play computer program they were using. Since they do not understand the magic that the program is doing, they do not understand that those spit-out numbers are estimates based on the assumption that you are randomly drawing 80 subjects from an infinitely large population. Like most people publishing in health science, they just aped what others seem to do without any idea what it means. As I said, these authors seemed to be perfectly legitimate in their goals, trying to report the results of their study accurately (i.e., they were not “public health” propagandist types), but simply did not know what they were doing. That tends to happen when almost everyone in a field treats research methods as if they are trivial recipe-following that requires no thought, and treats statistical software as if it were a crystal ball whose workings are beyond mortal ken.

But here is the point that makes this a story: Neither of the other two reviewers of the paper noticed the problem. I read their reviews, and neither said a word about this wee problem of the authors calculating random sampling error statistics when there was no random sampling.  There is some chance that the better of the other two reviewers (assuming he read my review like I read his) found himself saying “oh, right; I really f-ed up not noticing that.” I am pretty sure that the other reviewer would not even understand what I was talking about, and probably never bothered to read the other reviews to try to learn from them.

In case it is not clear, this is an error at the level of a paper reporting that 2+3=9. This is not a case where something is sketchy, like an apparent bias that is ignored or a suspicious omission, and definitely not a case where reasonable people can disagree. This is out-and-out wrong. It takes a little more knowledge to notice than the 2+3 error, of course, but this is knowledge that anyone who is remotely qualified to be reviewing this paper for a journal should have. And yet zero out of two reviewers (again, leaving me out, per the “there you are” principle) noted it. And I can tell you that I was not exactly shocked. This blatant failure is not at all surprising to someone who understands the reality of the process that creates the vaunted “peer-reviewed journal article” status in the health sciences.

In my review I pointed out that the authors could make a knowledge-based assessment of whether the missing 30 were probably like the 80, or how they were different. I also pointed out that if they were attempting to account for the uncertainty about the potential subjects they did not attempt to contact (what I am calling private schools in this fictional version), which was never clear, they would need instead to create some priors about how those differed from the public schools and calculate some Bayesian uncertainty intervals rather than pretending they were members of the target population that were merely not selected via random sampling. I am sure I was wasting my virtual breath with that, though, and that the authors (and probably the other reviewers and editor) would have little idea what I was talking about. So my real suggestion was that the authors simply eliminate the whole issue of sampling error statistics and just define their results as the counts from the 80 responses they got, with no quantitative claims about how the missing potential subjects would have responded.

My copies of the other reviews came with the notice from the journal that the paper had been accepted (despite my advice that this be a revise-and-resubmit so I could make sure they fixed this glaring error and other problems). It has not yet appeared so I do not know whether the problems were actually fixed. I would say there is at least a 10% chance that both the authors and the editor failed to understand that I was pointing out an indisputable out-and-out error, and thus the error remains in the article version. If the editor had not thought to send the paper to me despite this not being my area of research — I am guessing because he got two reviews and was thinking, “um, these are not good enough” (which bodes in his favor, thus the 10% rather than 50%) — then the indisputable error would have, 100%, appeared in the official peer reviewed journal article.

One more science lesson note: If 0 out of 2 other reviewers caught this error, then the point estimate for the proportion of potential reviewers for this journal (excluding me) who would catch it is 0%. Presumably that is too pessimistic. We are talking about a sample size of only 2. So what is the 95% confidence interval? It is (0%, 84%), which means that it is quite plausible that this was just bad luck and things are not nearly so bad. Two points about that: First, the confidence interval does not mean that any value between 0% and 84% is an equally good guess. If this observation were all the knowledge we had about that percentage (for those who know what this means, I am saying: if our priors were flat), then the proper interpretation would be that the true value is indeed probably down near zero, at most a few tens of percentage points, even though it was not out of the question it was much higher. Second, you may have noticed in journal articles when you have a point estimate of zero (or of 100%), the confidence interval is almost never reported. That is another example of authors and reviewers not understanding their statistics. Confidence interval estimates are not reported by simple statistical software if one of the cells is zero because the default estimation method has a division-by-zero problem. But it is not difficult to estimate a confidence interval using other methods if you have any understanding of the statistics you are reporting. I suspect that most health researchers who run into this think “the magical software did not report a confidence interval, so it must not exist.”
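For the zero-count case there is a closed form, so no special software is needed. The (0%, 84%) figure matches the exact (Clopper-Pearson) upper bound for 0 successes in 2 trials; the post does not name the method, so treating it as Clopper-Pearson is an assumption of this sketch.

```python
def upper_bound_zero_successes(n, alpha=0.05):
    """Exact upper confidence bound when 0 of n trials are successes.

    With zero successes, the chance of seeing a result this low is
    (1 - p)**n. Setting that equal to alpha/2 and solving for p gives
    the bound. The lower bound is simply 0.
    """
    return 1 - (alpha / 2) ** (1 / n)

# Zero of the two other reviewers caught the error.
print(f"{upper_bound_zero_successes(2):.0%}")  # 84%
```

The same function shows how quickly the bound tightens as the sample grows, for example by calling it with n=10 or n=50.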

Finally, to reiterate the crapshoot point: Let’s very optimistically assume that one out of three reviewers would have caught this problem, and optimistically assume each time the paper is submitted it is reviewed by three reviewers (two is more realistic). Then there is still about one chance in three that no selected reviewer would catch the problem. In this case, that roll of the dice would be unfortunate for the authors (who were trying to do a proper report) and really would not have created important misinformation. But imagine authors who are trying to sneak something through that they know is a fatal problem in their analysis, but they do not want to bother to fix it or — as is often the case in public health — they want to report a particular conclusion even though it is not actually supported by their data. If those same numbers applied, they would probably only have to submit it to three or four journals (and I am talking about “legitimate” journals, not all the ones that will publish anything for a fee), simply ignoring that pesky reviewer from the previous submissions who caught them, before they won the dice roll and got it published as-is.
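The dice-roll numbers can be checked with a short calculation. The sketch assumes, as the paragraph does, that each reviewer independently has a one-in-three chance of catching the problem and that each submission gets three reviewers; it further assumes the paper is accepted whenever no reviewer catches the problem, which is a simplification.

```python
p_catch = 1 / 3        # assumed chance a single reviewer catches the problem
reviewers = 3          # assumed reviewers per submission

# Chance that all three reviewers miss it on a given submission.
p_all_miss = (1 - p_catch) ** reviewers   # 8/27, about 0.30

# Chance the flawed paper has slipped through within k submissions.
p_through_within = {
    k: 1 - (1 - p_all_miss) ** k for k in range(1, 5)
}

# Expected number of submissions until it slips through.
expected_submissions = 1 / p_all_miss

print(f"P(no reviewer catches it): {p_all_miss:.2f}")
for k, prob in p_through_within.items():
    print(f"P(published as-is within {k} submissions): {prob:.2f}")
print(f"expected submissions needed: {expected_submissions:.2f}")
```

Under these assumptions the expected number of submissions is a little over three, in line with the "three or four journals" estimate.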

6 responses to “What is peer review really? (part 9 — it is really a crapshoot)”

  1. It’s been TWO YEARS since the last installment in this series?

    That doesn’t even seem possible.

    • Carl V Phillips

      Yeah it did not seem right to me either, but 8 is date stamped March 2015.

      • natepickering

        You’ve spent a lot of time (you and no one else, from what I can see) trying to draw attention to the problem of public innumeracy, and explaining how that problem A) enables the sorts of abuses we see from tobacco control, and B) starts to infect what should be (and sometimes once were) legitimate fields of scientific inquiry.

        How do you suppose we got to this point? When people of yours and my generation were small children, we learned our arithmetic by rote and repetition. Somewhere around the early 90s, it was decided that this was bad, and a more esoteric approach should be taken, where the answer to the problem is negotiable and it’s the journey that matters.

        I’m unacquainted with the relevant statistics (I’m sure you are not), but it seems to me that the problem of innumerate adults is drastically more widespread now than it was 25 years ago.

        • Carl V Phillips

          Good question, and I have some response to offer, though definitely not The Answer. I am dropping this in here as a placeholder because I am not sure when I will have time to answer, but will get back to it. Preview: People are actually less innumerate than before, just not good enough.

  2. Roberto Sussman

    I fully agree that statistics software packages (like all technical software) are two blades swords. These are powerful tools that really allow you to manage and (possibly) solve problems that involve complicated mathematics and large data sets, but the reverse blade is that you have to know what you are doing when using these packages, as it happens with all advanced and powerful tools. There are lots of examples in which authors using powerful software packages produce a lot of empiric fireworks to disguise theoretical shallowness. Results obtained in this manner can be misleading or even suspect. This happens in all disciplines and in all shades of grey, but the darker shades of grey seem to be the rule, rather than the exception, in a lot epidemiological studies, specially the meta-analysis on lifestyle issues. However, what you mention is far more worrying: not identifying the random variables correctly in a study based on simple statistic inference is an inexcusable basic fundamental error. Following your report, the reviewers also failed to see this error. I assume these authors and reviewers must have taken (at least) an elementary undergraduate statistics course. It is hard to believe that this disastrous situation could be so widespread in public health sciences. Perhaps this happens when academic advancement becomes too much conditioned to endorsing the “right” politics. In this context, forgetting basic staple concepts of statistics becomes a minor issue.

  3. Pingback: What is peer review really? (part 1) | Anti-THR Lies and related topics
