UPDATE: this series of posts, along with much more material, is now part of this article:
Reaction times and other skewed distributions: problems with the mean and the median
[Preprint] [Reproducibility package]
In this series of 4 posts, I replicate, expand and discuss the results from
Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.
Part 1 = replicate Miller’s simulations + apply bootstrap bias correction
Part 2 = expand Miller’s simulations to group comparison
Part 3 = problems with the mean
Part 4 = application to a large dataset
Data & code are available on github. The content of the 4 posts is also described in this article.
Reaction times (RT) and many other quantities in neuroscience & psychology are skewed. This asymmetry tends to differ among experimental conditions, such that a measure of central tendency and a measure of spread are insufficient to capture how conditions differ. Instead, to understand the potentially rich differences among distributions, it is advised to consider multiple quantiles of the distributions (Doksum 1974; Pratte et al. 2010; Rousselet et al. 2017), or to model the shapes of the distributions (Heathcote et al. 1991; Rouder et al. 2005; Palmer et al. 2011; Matzke et al. 2013). Yet, it is still common practice to summarise reaction time distributions using a single number, most often the mean: that one value for each participant and each condition can then be entered into a group ANOVA to make statistical inferences. Because of the skewness of reaction times, the mean is however a poor measure of central tendency: skewness shifts the mean away from the bulk of the distribution, an effect that can be amplified by the presence of outliers or a thick right tail. For instance, in the figure below, the median better represents the typical observation than the mean because it is closer to the bulky part of the distribution.
Mean and median for a very right skewed distribution. The distribution is bounded to the left and has a long right tail. This is an ex-Gaussian distribution, which is popular to model the shape of reaction time distributions, even though the matching between its parameters and mental processes can be debated (Sternberg, 2014).
So the median appears to be a better choice than the mean if the goal is to have a single value to tell us about the location of most observations in a skewed distribution. The choice between the mean and the median is however more complicated. It could be argued that because the mean is sensitive to skewness, outliers and the thickness of the right tail, it is better able to capture changes in the shapes of the distributions among conditions. But the use of a single value to capture shape differences will lead to intractable analyses because the same mean could correspond to various shapes. Instead, a multiple quantile approach or explicit shape modelling should be used.
The mean and the median differ in another important aspect: for small sample sizes, the sample mean is unbiased, whereas the sample median is biased (see illustrations in previous post). Concretely, if we perform many times the same RT experiment, and for each experiment we compute the mean and the median, the average mean will be very close to the population mean. As the number of experiments increases, the average sample mean will converge to the exact population mean. This is not the case for the median when sample size is small.
The reason for this bias is explained by Miller (1988):
“Like all sample statistics, sample medians vary randomly (from sample to sample) around the true population median, with more random variation when the sample size is smaller. Because medians are determined by ranks rather than magnitudes of scores, the population percentiles of sample medians vary symmetrically around the desired value of 50%. For example, a sample median is just as likely to be the score at the 40th percentile in the population as the score at the 60th percentile. If the original distribution is positively skewed, this symmetry implies that the distribution of sample medians will also be positively skewed. Specifically, unusually large sample medians (e.g., 60th percentile) will be farther above the population median than unusually small sample medians (e.g., 40th percentile) will be below it. The average of all possible sample medians, then, will be larger than the true median, because sample medians less than the true value will not be small enough to balance out the sample medians greater than the true value. Naturally, the more the distribution is skewed, the greater will be the bias in the sample median.”
To illustrate the sample median’s bias, Miller (1988) employed 12 ex-Gaussian distributions that differ in skewness. The distributions are illustrated in the next figure, and colour-coded using the difference between the mean and the median as a non-parametric measure of skewness.
Here are the parameters of the 12 distributions with the associated population parameters.
The first figure in this post used the most skewed distribution of the 12, with parameters (300, 20, 300).
To estimate bias, we run a simulation in which we sample with replacement 10,000 times from each of the 12 distributions. We take random samples of sizes 4, 6, 8, 10, 15, 20, 25, 35, 50 and 100, as did Miller. For each random sample, we compute the mean and the median. For each sample size and ex-Gaussian parameter, the bias is then defined as the difference between the mean of the 10,000 sample estimates and the population value.
First, we check that the mean is not biased:
Each line shows the results for one type of ex-Gaussian distribution: the mean of 10,000 simulations for different sample sizes. The grey area marks the 50% highest-density interval (HDI) of the 10,000 simulations for the least skewed distribution (the same interval is shown in the next two figures for comparison). The interval shows the variability across simulations and highlights an important aspect of the results: bias is a long-run property of an estimator; there is no guarantee that one value from a single experiment will be close to the population value. Also, the variability among samples increases with decreasing sample size, which is why results across small n experiments can differ substantially.
Here are the median bias estimates in table format:
Columns = sample sizes; rows = skewness
The values are very close to the values reported in Miller 1998:
An illustration is easier to grasp:
As reported by Miller (1988), bias can be quite large and it gets worse with decreasing sample sizes and increasing skewness.
Based on these results, Miller made this recommendation:
“An important practical consequence of the bias in median reaction time is that sample medians must not be used to compare reaction times across experimental conditions when there are unequal numbers of trials in the conditions.”
According to Google Scholar, Miller (1988) has been cited 172 times. A look at some of the oldest and most recent citations reveals that his advice has been followed.
For instance, Lavie (1995) noted:
“In the following experiments I used the mean RTs for each participant rather than the medians, as the increase in number of go/no-go errors under the high-load conditions resulted in a different number of responses in between conditions (see Miller, 1988).”
Tipper et al. (1992):
Analysis by Miller (1988) shows that for large numbers of trials, differences in the numbers between conditions […] has no impact on the medians obtained.
More recently, Robinson et al. (2018):
“[…] comparing medians among conditions with an unequal number of trials may lead to false positives (Miller, 1988)”
Du et al. (2017):
“We chose the mean rather than median of RT as […] the sample median may provide a biased estimation of RT (Miller, 1988)”
The list goes on. Also, in a review paper, Whelan (2008), cited 324 times, reiterates the advice:
“The median should never be used on RT data to compare conditions with different numbers of trials.”
However, there are several problems with Miller’s advice. In particular, using the mean leads to many issues with estimation and statistical inferences, as we will see in the 3rd post. In this post, we tackle one key omission from Miller’s assessment: the bias of the sample median can be corrected, using a percentile bootstrap bias correction, as described in this previous post.
For each iteration in the simulation, bias correction was performed using 200 bootstrap samples. Here are the bias corrected results:
The bias correction works very well on average, except for the smallest sample sizes. The failure of the bias correction for very small n is not surprising, because the shape of the sampling distribution cannot be properly estimated by the bootstrap from so few observations. From n = 10, the bias values are very close to those observed for the mean. So it seems that in the long-run, we can eliminate the bias of the sample median by using a simple bootstrap procedure. As we will see in the next post, the bootstrap bias correction is also effective when comparing two groups.
Update: 06 Feb 2018
Finally, let’s have a look at how well bias correction works as a function of sample size and skewness. The smaller the sample size, the less the bootstrap can estimate the sampling distribution and its bias. Ideally, after bias correction, we would expect the sample medians to be centred around zero, with limited variability. This ideal situation could look something like that:
Here we’re considering the most skewed distribution with the smallest sample size. The x-axis shows the median bias before bias correction, whereas the y-axis shows the median bias after bias correction. In this ideal case, the correction works extremely well irrespective of the original bias (average bias is the red vertical line). As a result, the average bias after bias correction is very near zero (horizontal green line).
The reality is very different. The bias correction is only partially effective and is inhomogeneous.
Let’s add some markup to help understand what’s going on.
If the original bias is negative, after correction, the median tends to be even more negative, so over corrected in the wrong direction (lower left triangle).
If the original bias is positive, after correction, the median is either:
– over corrected in the right direction (lower right triangle)
– under corrected in the right direction (middle right triangle)
– over corrected in the wrong direction (upper right triangle)
This pattern remains, although attenuated, if we consider the largest sample size.
Or if we consider the least skewed distribution.
We can look at the different patterns as a function of sample size and skewness.
The figure below shows the probability of over correcting in the wrong direction given that the original bias is negative (lower left triangle of the marked up figure).
In the ideal situation illustrated previously, the expected proportion of over correction in the wrong direction, given an original negative bias, is 6.7%. So here we clearly have an overrepresentation of these cases. When the original bias is negative, in most cases the bootstrap is unable to correct in the right direction. The situation gets worse with increasing skewness and smaller sample sizes.
The figure below shows the probability of under correcting in the right direction given that the original bias is positive (middle right triangle of the marked up figure)
In the ideal situation illustrated previously, the expected proportion of under correction in the right direction, given an original positive bias, is 44.7 %. So here, we have an overrepresentation of these cases. When the original bias is positive, in too many cases the bootstrap corrects in the right direction, but it under-corrects. The situation gets worse with increasing skewness and smaller sample sizes.
Ok, so bias correction is imperfect and it varies a lot, it part depending on whether the sample median fell above or below the unknown population median. Think using the mean is safer? There are several strong arguments to the contrary – more in another post.
[GO TO POST 2/4]
References
Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Annals of Statistics, 2, 267-277.
Du, Y., Valentini, N.C., Kim, M.J., Whitall, J. & Clark, J.E. (2017) Children and Adults Both Learn Motor Sequences Quickly, But Do So Differently. Frontiers in Psychology, 8.
Heathcote, A., Popiel, S.J. & Mewhort, D.J.K. (1991) Analysis of Response-Time Distributions – an Example Using the Stroop Task. Psychol Bull, 109, 340-347.
Lavie, N. (1995) Perceptual Load as a Necessary Condition for Selective Attention. J Exp Psychol Human, 21, 451-468.
Matzke, D., Love, J., Wiecki, T.V., Brown, S.D., Logan, G.D. & Wagenmakers, E.J. (2013) Release the BEESTS: Bayesian Estimation of Ex-Gaussian STop Signal reaction time distributions. Front Psychol, 4.
Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.
Palmer, E.M., Horowitz, T.S., Torralba, A. & Wolfe, J.M. (2011) What Are the Shapes of Response Time Distributions in Visual Search? J Exp Psychol Human, 37, 58-71.
Pratte, M.S., Rouder, J.N., Morey, R.D. & Feng, C.N. (2010) Exploring the differences in distributional properties between Stroop and Simon effects using delta plots. Atten Percept Psycho, 72, 2013-2025.
Robinson, M. M., Clevenger, J., & Irwin, D. E. (2018). The action is in the task set, not in the action. Cognitive psychology, 100, 17-42.
Rouder, J.N., Lu, J., Speckman, P., Sun, D.H. & Jiang, Y. (2005) A hierarchical model for estimating response time distributions. Psychon B Rev, 12, 195-223.
Sternberg, S. (2014) Reaction times and the ex-Gaussian distribution: When is it appropriate? Retrieved from http://www.psych.upenn.edu//~saul/
Tipper, S.P., Lortie, C. & Baylis, G.C. (1992) Selective Reaching – Evidence for Action-Centered Attention. J Exp Psychol Human, 18, 891-905.
Whelan, R. (2008) Effective analysis of reaction time data. Psychol Rec, 58, 475-482.