Null Hypothesis Significance Testing: The Fault in Our Stars


[…] The same is true on amazon, where the book’s average rating has actually gone up a bit in the past six months (although not in a statistically significant way). […]

Actually, the ratings have decreased in a statistically significant way (p < .05). I used the two most recently archived pages from, which do not cover exactly 6 months. Still, ratings before 2013-02-03 were higher than those after that date.

  • Before (2110 ratings): mean = 4.76 (SE = 0.014)
  • After (1232 ratings): mean = 4.67 (SE = 0.021)

A t-test (two-sided, unequal variances) yields p = 0.0009 (d = -0.12); and for the non-parametric fans, the Wilcoxon rank-sum (Mann-Whitney) test yields p = 0.0001.
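For anyone who wants to check the t-test without the raw data, here is a minimal Python sketch that works from the summary figures alone. It rests on two assumptions of mine: first, that the dispersion figures in the lists are standard errors of the mean (a standard deviation of 0.014 is not possible for 1-to-5 star ratings with a mean of 4.76); second, that a normal approximation to the t distribution is good enough, which seems reasonable with thousands of degrees of freedom. The function name `welch_from_summary` is made up for this illustration. Because the inputs are rounded, the output should only land in the same ballpark as the figures reported here, not reproduce them exactly.

```python
import math


def welch_from_summary(mean1, se1, n1, mean2, se2, n2):
    """Welch's unequal-variances t-test from group means and standard errors.

    Returns (t, df, p, d). The two-sided p-value uses a normal
    approximation, an assumption that only holds for large df.
    """
    diff = mean2 - mean1
    se_diff = math.sqrt(se1**2 + se2**2)
    t = diff / se_diff

    # Welch-Satterthwaite degrees of freedom
    df = (se1**2 + se2**2) ** 2 / (se1**4 / (n1 - 1) + se2**4 / (n2 - 1))

    # Two-sided p-value, normal approximation to the t distribution
    p = math.erfc(abs(t) / math.sqrt(2))

    # Cohen's d based on the pooled standard deviation
    sd1 = se1 * math.sqrt(n1)
    sd2 = se2 * math.sqrt(n2)
    pooled_sd = math.sqrt(
        ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    )
    d = diff / pooled_sd
    return t, df, p, d


# Before vs. after 2013-02-03, using the rounded summary figures
t, df, p, d = welch_from_summary(4.76, 0.014, 2110, 4.67, 0.021, 1232)
print(f"t = {t:.2f}, df = {df:.0f}, p = {p:.4f}, d = {d:.2f}")
```

The Wilcoxon rank-sum test cannot be reconstructed this way, since it needs the individual ratings.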

Using 2012-10-19 as the dividing date yields similar results:

  • Before (1051 ratings): mean = 4.77 (SE = 0.020)
  • After (2291 ratings): mean = 4.71 (SE = 0.015)

A t-test (two-sided, unequal variances) yields p = 0.0188 (d = -0.09); the Wilcoxon rank-sum test yields p = 0.0008. Of course, significance testing might be a questionable procedure in this case – and also in general.

This is actually a census of all Amazon ratings, so there’s no need to test whether ratings differ. The sample is the population. However, the written reviews could be regarded as a subsample of the ratings of all readers.

Is it a random sample? I don’t think so. So can we draw proper conclusions from the significance test results? Nah. I won’t provide a comprehensive discussion of the benefits and problems associated with null hypothesis significance testing (NHST). I’ll just name one of my favourite objections, which Cohen (1990, p. 1308) phrased nicely: “The null hypothesis, taken literally (and that’s the only way you can take it in formal hypothesis testing), is always false in the real world.” In the present case, the null hypothesis would mean that the average rating of newer readers is exactly the same as the average rating of those who pre-ordered the book, etc.
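One practical consequence of Cohen's point, as I read it: if the true difference is never exactly zero, a large enough sample will eventually flag it as significant, however trivial it is. The sketch below is my own illustration, not part of the analysis above. It assumes two equally sized groups, a fixed and tiny true effect of d = 0.01, and a normal approximation for the test statistic; the function name `p_for_effect` is hypothetical.

```python
import math


def p_for_effect(d, n_per_group):
    """Two-sided p-value when the observed standardized difference
    equals d in two groups of n_per_group each (normal approximation)."""
    z = d * math.sqrt(n_per_group / 2)
    return math.erfc(abs(z) / math.sqrt(2))


# Same negligible effect, growing sample size
for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9,}: p = {p_for_effect(0.01, n):.3g}")
```

The effect size stays the same in every row; only the sample size changes, and with it the verdict of the test.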

Anyway, the effect size suggests that the drop in ratings is very small, so it should be safe to argue that the book keeps appealing to new readers.

PS: Sorry for nitpicking; this should in no way diminish the article, which I think is highly insightful.

PPS: I spent a good 15 minutes in R trying to beat the data into shape, but I feel much more comfortable in Stata, so I switched and had the analysis in a few minutes. Here’s the do-file in case anyone is curious. (Haha, as if!)
