The R notebook associated with this post is available on github.

Cohen’s *d* is a popular measure of effect size. In the one-sample case, *d* is simply computed as the mean divided by the standard deviation (SD). For repeated measures, the same formula is applied to difference scores (see detailed presentation and explanation of variants in Lakens, 2013).
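As a quick illustration (the notebook for this post is in R; below is a minimal Python sketch of the same formulas), *d* in the one-sample case, and the same computation applied to difference scores for repeated measures. The pre/post values are made up for illustration:

```python
import numpy as np

def cohens_d(x, mu=0.0):
    """One-sample Cohen's d: (mean - mu) / SD, with the usual n-1 SD."""
    x = np.asarray(x, dtype=float)
    return (x.mean() - mu) / x.std(ddof=1)

# Repeated measures: apply the same formula to the difference scores
# (pre/post values below are made up for illustration).
pre  = np.array([10.0, 12.0, 9.0, 11.0, 13.0])
post = np.array([12.0, 13.0, 11.0, 12.0, 15.0])
d = cohens_d(post - pre)
```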

Because *d* relies on a non-robust measure of central tendency (the mean), and a non-robust measure of dispersion (SD), it is a non-robust measure of effect size, meaning that a single observation can have a dramatic effect on its value, as explained here. Cohen’s *d* also makes very strict assumptions about the data, so it is only appropriate in certain contexts. As a consequence, it should not be used as the default measure of effect size, and more powerful and informative alternatives should be considered – see a few examples here. For comparisons across studies and meta-analyses, nothing will beat data-sharing though.

Here we look at another limitation of Cohen’s *d*: it is biased when we draw small samples. Bias is covered in detail in another post. In short, in the one-sample case, when Cohen’s *d* is estimated from a small sample, in the long run it tends to be larger than the population value. This over-estimation arises because the sample SD is biased: it tends to be lower than the population SD. Because the mean is unbiased, dividing it by an under-estimated SD yields an over-estimated effect size. The bias of SD is explained in intro stat books, in the section describing Student’s *t*. Not surprisingly, it is never mentioned in discussions of small-n studies as a limitation of effect size estimation…

In this demonstration, we sample with replacement 10,000 times from the ex-Gaussian distributions below, for various sample sizes, as explained here:
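An ex-Gaussian random variable is the sum of a Gaussian and an exponential component; the larger the exponential parameter tau, the heavier the right tail. A minimal Python sketch of the sampling step, with illustrative parameters (mu = 300, sigma = 20, tau = 300, chosen so the population mean is 600 as in the table below; these are assumptions, not necessarily the exact parameters of the simulations):

```python
import numpy as np

rng = np.random.default_rng(44)

def rexgauss(n, mu=300.0, sigma=20.0, tau=300.0, rng=rng):
    """Ex-Gaussian sample: Normal(mu, sigma) + Exponential(tau).
    Larger tau -> heavier right tail, i.e. more skewness.
    Population mean = mu + tau; population SD = sqrt(sigma^2 + tau^2)."""
    return rng.normal(mu, sigma, n) + rng.exponential(tau, n)

# 10,000 simulated experiments of n = 10 observations each
nsim, n = 10_000, 10
samples = rexgauss(nsim * n).reshape(nsim, n)
```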

The table below shows the population values for each distribution. For comparison, we also consider a robust equivalent to Cohen’s *d*, in which the mean is replaced by the median, and SD is replaced by the percentage bend mid-variance (`pbvar`, Wilcox, 2017). As we will see, this robust alternative is also biased – there is no magic solution I’m afraid.

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| m | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 600 | 600 |
| md | 509 | 512 | 524 | 528 | 540 | 544 | 555 | 562 | 572 | 579 | 588 | 594 |
| m − md | 92 | 88 | 76 | 72 | 60 | 55 | 45 | 38 | 29 | 21 | 12 | 6 |
| m.den | 301 | 304 | 251 | 255 | 201 | 206 | 151 | 158 | 102 | 112 | 54 | 71 |
| md.den | 216 | 224 | 180 | 190 | 145 | 157 | 110 | 126 | 76 | 95 | 44 | 68 |
| m.es | 2.0 | 2.0 | 2.4 | 2.4 | 3.0 | 2.9 | 4.0 | 3.8 | 5.9 | 5.4 | 11.1 | 8.5 |
| md.es | 2.4 | 2.3 | 2.9 | 2.8 | 3.7 | 3.5 | 5.0 | 4.5 | 7.5 | 6.1 | 13.3 | 8.8 |

*m* = mean; *md* = median; *den* = denominator; *es* = effect size; *m.es* = Cohen’s *d* (*m* / *m.den*); *md.es* = robust effect size (*md* / *md.den*, based on `pbvar`).
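For reference, here is a Python sketch of the percentage bend midvariance, following the definition in Wilcox (2017) with the default beta = 0.2. Note that `pbvar` is a variance-like quantity, so the SD-like denominator *md.den* in the table corresponds to its square root (consistent with *md.es* ≈ *md* / *md.den* in the table); treat this as my reading of the post, not its exact code:

```python
import numpy as np

def pbvar(x, beta=0.2):
    """Percentage bend midvariance (Wilcox, 2017), default beta = 0.2."""
    x = np.asarray(x, dtype=float)
    n = x.size
    med = np.median(x)
    w = np.sort(np.abs(x - med))              # absolute deviations from the median
    m = int(np.floor((1.0 - beta) * n + 0.5))
    omega = w[m - 1]                          # m-th smallest absolute deviation
    y = (x - med) / omega
    psi = np.clip(y, -1.0, 1.0)               # clip large standardized deviations
    a = np.sum(np.abs(y) < 1.0)               # number of non-clipped observations
    return n * omega**2 * np.sum(psi**2) / a**2

def robust_es(x):
    """Robust effect size: median / sqrt(pbvar)."""
    x = np.asarray(x, dtype=float)
    return np.median(x) / np.sqrt(pbvar(x))
```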

Let’s look at the behaviour of *d* as a function of skewness and sample size.

Effect size *d* tends to decrease with increasing skewness, because SD tends to increase with skewness. Effect size also increases with decreasing sample size: this bias is stronger for samples from the least skewed distributions. This is counterintuitive, because one would expect estimation to get worse with increasing skewness. Let’s find out what’s going on.

Computing the bias normalises the effect sizes across skewness levels, revealing large bias differences as a function of skewness. Even with 100 observations, the bias (mean of 10,000 simulation iterations) is still slightly larger than zero for the least skewed distributions. This bias is not due to the mean, because the sample mean is an unbiased estimator of the population mean.

Let’s check to be sure:
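A quick Monte-Carlo check (Python sketch; the ex-Gaussian parameters are illustrative assumptions): the average of the sample means over many simulated experiments stays very close to the population mean, even at n = 10:

```python
import numpy as np

rng = np.random.default_rng(21)
mu, sigma, tau = 300.0, 20.0, 300.0     # assumed ex-Gaussian parameters
pop_mean = mu + tau                      # = 600, as in the table above

nsim, n = 10_000, 10
x = rng.normal(mu, sigma, (nsim, n)) + rng.exponential(tau, (nsim, n))
mean_bias = x.mean(axis=1).mean() - pop_mean   # hovers around zero
```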

So the problem must be with the denominator:

Unlike the mean, the denominator of Cohen’s *d*, SD, is biased. Let’s look at bias directly.

SD is most strongly biased for small sample sizes and bias increases with skewness. Negative values indicate that sample SD tends to under-estimate the population values. This is because the sampling distribution of SD is increasingly skewed with increasing skewness and decreasing sample sizes. This can be seen in this plot of the 80% highest density intervals (HDI) for instance:
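The under-estimation of SD is easy to reproduce by simulation. A Python sketch with assumed ex-Gaussian parameters (for this model the population SD is exactly sqrt(sigma² + tau²)):

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, tau = 300.0, 20.0, 300.0       # assumed ex-Gaussian parameters
pop_sd = np.sqrt(sigma**2 + tau**2)        # exact population SD of this model

nsim = 10_000
sd_bias = {}
for n in (10, 20, 50, 100):
    x = rng.normal(mu, sigma, (nsim, n)) + rng.exponential(tau, (nsim, n))
    sd_bias[n] = x.std(axis=1, ddof=1).mean() - pop_sd
# sd_bias values are negative (under-estimation) and shrink toward 0 as n grows
```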

The sampling distribution of SD is increasingly skewed and variable with increasing skewness and decreasing sample sizes. As a result, the sampling distribution of Cohen’s *d* is also skewed. The bias is strongest in absolute terms for the least skewed distributions because the sample SD is overall smaller for these distributions, resulting in overall larger effect sizes. Although SD is most biased for the most skewed distributions, SD is also much larger for them overall, resulting in much smaller effect sizes than those obtained for less skewed distributions. This strong attenuation of effect sizes with increasing skewness swamps the absolute differences in SD bias, which explains the counterintuitive lower *d* bias for more skewed distributions.

As we saw previously, bias can be corrected using a bootstrap approach. Applied to Cohen’s *d*, this technique does reduce the bias, but it remains a concern:
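The bootstrap bias correction subtracts an estimate of the bias obtained by resampling: if the bootstrap estimates overshoot the sample estimate on average, the estimate is pulled down by the same amount. A minimal Python sketch (not the post’s R code):

```python
import numpy as np

rng = np.random.default_rng(3)

def cohens_d(x):
    """One-sample Cohen's d with the usual n-1 SD."""
    x = np.asarray(x, dtype=float)
    return x.mean() / x.std(ddof=1)

def d_bias_corrected(x, nboot=2000, rng=rng):
    """Bootstrap bias correction: d_bc = 2*d - mean(d*),
    where d* are estimates from bootstrap resamples."""
    x = np.asarray(x, dtype=float)
    d = cohens_d(x)
    idx = rng.integers(0, x.size, size=(nboot, x.size))  # resample with replacement
    boot = x[idx]
    d_star = boot.mean(axis=1) / boot.std(axis=1, ddof=1)
    return 2.0 * d - d_star.mean()
```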

Finally, let’s look at the behaviour of a robust equivalent to Cohen’s *d*, the median normalised by the percentage bend mid-variance.

The median effect size shows a similar profile to the mean effect size. It is overall larger than the mean effect size because it uses a robust measure of spread, which is less sensitive to the long right tails of the skewed distributions we sample from.

The bias disappears quickly with increasing sample sizes, and quicker than for the mean effect size.

However, unlike what we observed for *d*, in this case the bias correction does not work for small samples, because the repetition of the same observations in some bootstrap samples leads to very large values of the denominator. It is ok for n ≥ 15, for which the bias is relatively small anyway, so, at least based on these simulations, I wouldn’t use bias correction for this robust effect size.

# Conclusion

Beware of small sample sizes: they are associated with increased variability (see discussion in a clinical context here) and can accentuate the bias of some effect size estimates. If effect sizes tend to be reported more often if they pass some arbitrary threshold, for instance p < 0.05, then the literature will tend to over-estimate them (see demonstration here), a phenomenon exacerbated by small sample sizes (Button et al. 2013).

Can’t say it enough: small n is bad for science if the goal is to provide accurate estimates of effect sizes.

To determine how the precision and accuracy of your results depend on sample size, the best approach is to perform simulations, providing some assumptions about the shape of the population distributions.
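As a sketch of that approach (Python; the population model and parameters are assumptions you would replace with your own): simulate many experiments at each candidate sample size and look at the spread of the resulting estimates:

```python
import numpy as np

rng = np.random.default_rng(12)
mu, sigma, tau = 300.0, 20.0, 300.0   # assumed population model (ex-Gaussian)

precision = {}
for n in (10, 30, 100):
    x = rng.normal(mu, sigma, (10_000, n)) + rng.exponential(tau, (10_000, n))
    # SD of the 10,000 sample means = how precise an n-observation estimate is
    precision[n] = x.mean(axis=1).std()
```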

# References

Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S. & Munafo, M.R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature reviews. Neuroscience, 14, 365-376.

Lakens, D. (2013) Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Front Psychol, 4, 863.

Wilcox, R.R. (2017) Introduction to Robust Estimation and Hypothesis Testing, 4th edition. Academic Press, San Diego, CA.