## Null Hypothesis Significance Testing: The Fault in Our Stars

fishingboatproceeds

[…] The same is true on amazon, where the book’s average rating has actually gone up a bit in the past six months (although not in a statistically significant way). […]

Actually, the ratings have decreased in a statistically significant way (p < .05). I used the two most recently archived pages from archive.org, which do not cover exactly 6 months. Still, ratings before 2013-02-03 were higher than those after that date.

• Before (2110 ratings): mean = 4.76 (SE = 0.014)
• After (1232 ratings): mean = 4.67 (SE = 0.021)

A t-test (two-sided, unequal variances) yields p = 0.0009 (d = -0.12); and for the non-parametric fans, the Wilcoxon rank-sum (Mann-Whitney) test yields p = 0.0001.
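For anyone without Stata, here is a minimal Python sketch of the same comparison, using only the standard library. The star counts are copied from the do-file at the end of this post. Two things are assumptions of this sketch rather than what Stata does: the p-value uses a normal approximation instead of the t distribution (reasonable with thousands of ratings), and Cohen's d uses the conventional pooled SD with n1 + n2 - 2 in the denominator.

```python
import math

# Star counts (5 stars down to 1 star), copied from the do-file.
before = {5: 1772, 4: 226, 3: 65, 2: 33, 1: 14}   # archived page, 2013-02-03
total = {5: 2743, 4: 393, 3: 122, 2: 60, 1: 24}   # live page, 2013-05-29
after = {star: total[star] - before[star] for star in total}


def summarize(counts):
    """Return (n, mean, sample variance) for a {rating: count} table."""
    n = sum(counts.values())
    mean = sum(rating * c for rating, c in counts.items()) / n
    var = sum(c * (rating - mean) ** 2 for rating, c in counts.items()) / (n - 1)
    return n, mean, var


n1, m1, v1 = summarize(before)
n2, m2, v2 = summarize(after)

# Standard errors of the two means.
se1 = math.sqrt(v1 / n1)
se2 = math.sqrt(v2 / n2)

# Welch's t statistic (unequal variances).
t = (m2 - m1) / math.sqrt(v1 / n1 + v2 / n2)

# Two-sided p-value via the normal approximation (assumption, see above).
p = math.erfc(abs(t) / math.sqrt(2))

# Cohen's d with the conventional pooled standard deviation.
pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
d = (m2 - m1) / pooled_sd

print(f"before: n={n1}, mean={m1:.2f}, SE={se1:.3f}")
print(f"after:  n={n2}, mean={m2:.2f}, SE={se2:.3f}")
print(f"t={t:.2f}, p={p:.4f}, d={d:.2f}")
```

Because the ratings only take five values, the whole analysis can be run from the frequency table; there is no need to expand it into one row per rating as the do-file does.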

Using 2012-10-19 as the dividing date yields similar results:

• Before (1051 ratings): mean = 4.77 (SE = 0.020)
• After (2291 ratings): mean = 4.71 (SE = 0.015)

A t-test (two-sided, unequal variances) yields p = 0.0188 (d = -0.09); the Wilcoxon rank-sum test yields p = 0.0008. Of course, significance testing might be a questionable procedure in this case – and also in general.
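The rank-sum test can also be computed straight from the frequency tables. The following is a rough Python sketch, not the Stata implementation: it assumes the usual large-sample normal approximation with a tie correction and no continuity correction. The function name `ranksum_p` is made up for this example; the counts again come from the do-file at the end of this post.

```python
import math


def ranksum_p(group1, group2):
    """Wilcoxon rank-sum test for two {rating: count} tables.

    Returns (z, two-sided p) using the tie-corrected normal
    approximation without a continuity correction.
    """
    n1 = sum(group1.values())
    n2 = sum(group2.values())
    n = n1 + n2
    rank_sum1 = 0.0      # sum of midranks in group 1
    tie_term = 0         # sum of t^3 - t over all tied blocks
    ranked_so_far = 0    # observations with a lower rating
    for rating in sorted(set(group1) | set(group2)):
        c1 = group1.get(rating, 0)
        ties = c1 + group2.get(rating, 0)
        midrank = ranked_so_far + (ties + 1) / 2
        rank_sum1 += c1 * midrank
        tie_term += ties ** 3 - ties
        ranked_so_far += ties
    expected = n1 * (n + 1) / 2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (rank_sum1 - expected) / math.sqrt(variance)
    return z, math.erfc(abs(z) / math.sqrt(2))


# Counts per star rating, copied from the do-file.
total = {5: 2743, 4: 393, 3: 122, 2: 60, 1: 24}
splits = {
    "2013-02-03": {5: 1772, 4: 226, 3: 65, 2: 33, 1: 14},
    "2012-10-19": {5: 899, 4: 91, 3: 35, 2: 20, 1: 6},
}

results = {}
for date, before in splits.items():
    after = {star: total[star] - before[star] for star in total}
    z, p = ranksum_p(before, after)
    results[date] = p
    print(f"split at {date}: z={z:.2f}, p={p:.4f}")
```

With almost all ratings tied at five stars, the tie correction matters a lot here; leaving it out would overstate the variance of the rank sum considerably.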

This is actually a census of all Amazon ratings, so there’s no need to test whether ratings differ. The sample is the population. However, the written reviews could be regarded as a subsample of the ratings of all readers.

Is it a random sample? I don’t think so. So can we draw proper conclusions from the significance test results? Nah. I won’t provide a comprehensive discussion of the benefits and problems associated with null hypothesis significance testing (NHST). I’ll just name one of my favourite objections, which Cohen (1990, p. 1308) phrased nicely: “The null hypothesis, taken literally (and that’s the only way you can take it in formal hypothesis testing), is always false in the real world.” In the present case, the null hypothesis would mean that the average rating of newer readers is exactly the same as the average rating of those who pre-ordered the book etc.

Anyway, the effect size suggests that the drop in ratings is very small, so it should be safe to argue that the book keeps appealing to new readers.

PS: Sorry for nitpicking; this should in no way diminish the article, which I think is highly insightful.

PPS: I spent a good 15 minutes in R trying to beat the data into shape, but I feel much more comfortable in Stata, so I switched and had the analysis in a few minutes. Here’s the do-file in case anyone is curious. (Haha, as if!)

```
* written by Johannes Schult (jutze@jutze.com)
* last updated 2013-05-29
*
* A quick significance test.
* H_0: The average rating of The Fault in Our Stars is the same before and after 2013-02-03
*
* Time 1 ratings are from 2013-02-03: http://wayback.archive.org/web/20130203060552/http://www.amazon.com/The-Fault-Stars-John-Green/dp/0525478817
* Time 2 ratings are from 2013-05-29: http://www.amazon.com/dp/0525478817/
*

version 12.1
set more off
clear
set obs 10

generate var1 = 5 in 1
replace var1 = 4 in 2
replace var1 = 3 in 3
replace var1 = 2 in 4
replace var1 = 1 in 5
replace var1 = 5 in 6
replace var1 = 4 in 7
replace var1 = 3 in 8
replace var1 = 2 in 9
replace var1 = 1 in 10

generate var2 = 1 in 1/5
replace var2 = 2 in 6/10

generate var3 = 1772 in 1
replace var3 = 226 in 2
replace var3 = 65 in 3
replace var3 = 33 in 4
replace var3 = 14 in 5
replace var3 = (2743-1772) in 6
replace var3 = (393-226) in 7
replace var3 = (122-65) in 8
replace var3 = (60-33) in 9
replace var3 = (24-14) in 10

* uncomment the next ten lines to use alternative split time point 2012-10-19:
* http://wayback.archive.org/web/20121019073306/http://www.amazon.com/The-Fault-Stars-John-Green/dp/0525478817
*replace var3 = 899 in 1
*replace var3 = 91 in 2
*replace var3 = 35 in 3
*replace var3 = 20 in 4
*replace var3 = 6 in 5
*replace var3 = (2743-899) in 6
*replace var3 = (393-91) in 7
*replace var3 = (122-35) in 8
*replace var3 = (60-20) in 9
*replace var3 = (24-6) in 10

rename var1 rating
rename var2 time
rename var3 amount

expand amount
drop amount
compress

oneway rating time, tabulate
ttest rating, by(time) unequal
* effect size: Cohen's d
di -(r(mu_1)-r(mu_2))/(sqrt(((r(N_1)-1)*r(sd_1)^2+(r(N_2)-1)*r(sd_2)^2)/(r(N_1)+r(N_2))))
ranksum rating, by(time)

clear
exit

* DFTBA!
```