Bias is defined as the distance between the mean of the sampling distribution (here estimated using Monte-Carlo simulations) and the population value. In part 1 and part 2, we saw that for small sample sizes, the sample median provides a biased estimation of the population median, which can significantly affect group comparisons. However, this bias disappears with large sample sizes, and it can be corrected using a bootstrap bias correction. In part 3, we look in more detail at the shape of the sampling distributions, which was ignored by Miller (1988).
Let’s consider the sampling distributions of the mean and the median for different sample sizes and ex-Gaussian distributions with skewness ranging from 6 to 92 (Figure 1). When skewness is limited (6, top row), the sampling distributions are symmetric and centred on the population values: there is no bias. As we saw previously, with increasing sample size, variability decreases, which is why studies with larger samples provide more accurate estimations. The flip side is that studies with small samples are much noisier, which is why their results tend not to replicate…
When skewness is large (92, middle row), sampling distributions get more positively skewed with decreasing sample sizes. To better understand how the sampling distributions change with sample size, we turn to the last row of Figure 1, which shows 50% highest-density intervals (HDI). Each horizontal line is a HDI for a particular sample size. The labels contain the values of the interval boundaries. The coloured vertical tick inside the interval marks the median of the distribution. The red vertical line spanning the entire plot is the population value.
Means For small sample sizes, the 50% HDI is offset to the left of the population mean, and so is the median of the sampling distribution. This demonstrates that the typical sample mean tends to under-estimate the population mean – that is to say, the mean sampling distribution is median biased. This offset reduces with increasing sample size, but is still present even for n=100.
Medians With small sample sizes, there is a discrepancy between the 50% HDI, which is shifted to the left of the population median, and the median of the sampling distribution, which is shifted to the right of the population median. This contrasts with the results for the mean, and can be explained by differences in the shapes of the sampling distributions, in particular the larger skewness and kurtosis of the median sampling distribution compared to that of the mean (see code on github for extra figures). The offset between 50% HDI and the population reduces quickly with increasing sample size. For n=10, the median bias is already very small. From n=15, the median sample distribution is not median bias, which means that the typical sample median is not biased.
Another representation of the sampling distributions is provided in Figure 2: 50% HDI are shown as a function of sample size. For both the mean and the median, bias increase with increasing skewness and decreasing sample size. Skewness also increases the asymmetry of the sampling distributions, but more for the mean than median.
So what’s going on here? Is the mean also biased? According to the standard definition of bias, which is based on the distance between the population mean and the average of the sampling distribution of the mean, the mean is not biased. But this definition applies to the long run, after we replicate the same experiment many times. In practice, we never do that. So what happens in practice, when we perform only one experiment? In that case, the median of the sampling distribution provides a better description of the typical experiment than the mean of the distribution. And the median of the sample distribution of the mean is inferior to the population mean when sample size is small. So if you conduct one small n experiment and compute the mean of a skewed distribution, you’re likely to under-estimate the true value.
Is the median biased after all? The median is indeed biased according to the standard definition. However, with small n, the typical median (represented by the median of the sampling distribution of the median) is close to the population median, and the difference disappears for even relatively small sample sizes.
Is it ok to use the median then?
If the goal is to accurately estimate the central tendency of a RT distribution, while protecting against the influence of outliers, the median is far more efficient than the mean (Wilcox & Rousselet, 2018). Providing sample sizes are large enough, bias is actually not a problem, and the typical bias is actually very small, as we’ve just seen. So, if you have to choose between the mean and the median, I would go for the median without hesitation.
It’s more complicated though. In an extensive series of simulations, Ratcliff (1993) demonstrated that when performing standard group ANOVAs, the median can lack power compared to other estimators. Ratcliff’s simulations involved ANOVAs on group means, in which for each participant, very few trials (7 to 12) are available for each condition. Based on the simulations, Ratcliff recommended data transformations or computing the mean after applying specific cut-offs to maximise power. However, these recommendations should be considered with caution because the results could be very different with more realistic sample sizes. Also, standard ANOVA on group means are not robust, and alternative techniques should be considered (Wilcox 2017). Data transformations are not ideal either, because they change the shape of the distributions, which contains important information about the nature of the effects. Also, once data are transformed, inferences are made on the transformed data, not on the original ones, an important caveat that tends to be swept under the carpet in articles’ discussions… Finally, truncating distributions introduce bias too, especially with the mean – see next section (Miller 1991; Ulrich & Miller 1994)!
At this stage, I don’t see much convincing evidence against using the median of RT distributions, if the goal is to use only one measure of location to summarise the entire distribution. Clearly, a better alternative is to not throw away all that information, by studying how entire distributions differ (Rousselet et al. 2017). For instance, explicit modelling of RT distributions can be performed with the excellent
brms R package.
Other problems with the mean
In addition to being median biased, and a poor measure of central tendency for asymmetric distributions, the mean is also associated with several other important problems. Standard procedures using the mean lack of power, offer poor control over false positives, and lead to inaccurate confidence intervals. Detailed explanations of these problems are provided in Field & Wilcox (2017) and Wilcox & Rousselet (2018) for instance. For detailed illustrations of the problems associated with means in the one-sample case, when dealing with skewed distributions, see the companion reproducibility package on figshare.
If that was not enough, common outlier exclusion techniques lead to bias estimation of the mean (Miller, 1991). When applied to skewed distributions, removing any values more than 2 or 3 SD from the mean affects slow responses more than fast ones. As a consequence, the sample mean tends to underestimate the population mean. And this bias increases with sample size because the outlier detection technique does not work for small sample sizes, which results from the lack of robustness of the mean and the SD. The bias also increases with skewness. Therefore, when comparing distributions that differ in sample size, or skewness, or both, differences can be masked or created, resulting in inaccurate quantification of effect sizes.
Truncation using absolute thresholds (for instance RT < 300 ms or RT > 1,200 ms) also leads to potentially severe bias of the mean, median, standard deviation and skewness of RT distributions (Ulrich & Miller 1994). The median is much less affected by truncation bias than the mean though.
In the next and final post of this series, we will explore sampling bias in a real dataset, to see how much of a problem we’re really dealing with. Until then, thanks for reading.
[GO TO POST 4/4]
Field, A.P. & Wilcox, R.R. (2017) Robust statistical methods: A primer for clinical psychology and experimental psychopathology researchers. Behav Res Ther, 98, 19-38.
Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.
Miller, J. (1991) Reaction-Time Analysis with Outlier Exclusion – Bias Varies with Sample-Size. Q J Exp Psychol-A, 43, 907-912.
Ratcliff, R. (1993) Methods for dealing with reaction time outliers. Psychol Bull, 114, 510-532.
Rousselet, G.A., Pernet, C.R. & Wilcox, R.R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. The European journal of neuroscience, 46, 1738-1748.
Ulrich, R. & Miller, J. (1994) Effects of Truncation on Reaction-Time Analysis. Journal of Experimental Psychology-General, 123, 34-80.
Wilcox, R.R. (2017) Introduction to Robust Estimation and Hypothesis Testing. Academic Press, 4th edition., San Diego, CA.
Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8 42 41-48 42 30.