# Reaction times and other skewed distributions: problems with the mean and the median (part 4/4)

This is part 4 of a 4 part series. Part 1 is here.

In this post, I look at median bias in a large dataset of reaction times from participants engaged in a lexical decision task. The dataset was described in a previous post.

After removing a few participants who didn’t pay attention to the task (low accuracy or too many very late responses), we’re left with 959 participants to play with. Each participant had between 996 and 1001 trials for each of two conditions, Word and Non-Word.

Here is an illustration of reaction time distributions from 100 randomly sampled participants in the Word condition: Same in the Non-Word condition: Skewness tended to be larger in the Word than the Non-Word condition. Based on the standard parametric definition of skewness, that was the case in 80% of participants. If we use a non-parametric estimate instead (mean – median), it was the case in 70% of participants.

If we save the median of every individual distribution, we get the two following group distributions, which display positive skewness: The same applies to distributions of means: So we have to worry about skewness at 2 levels:

• individual distributions

• group distributions

Here I’m only going to explore estimation bias as a result of skewness and sample size in individual distributions. From what we learnt in previous posts, we can already make predictions: because skewness tended to be stronger in the Word than in the Non-Word condition, the bias of the median will be stronger in the former than the later for small sample sizes. That is, the median in the Word condition will tend to be more over-estimated than the median in the Non-Word condition. As a consequence, the difference between the median of the Non-Word condition (larger RT) and the median of the Word condition (smaller RT) will tend to be under-estimated. To check this prediction, I estimated bias in every participant using a simulation with 2,000 iterations. I assumed that the full sample was the population, from which we can compute population means and population medians. Because the Non-Word condition is the least skewed, I used it as the reference condition, which always had 200 trials. The Word condition had 10 to 200 trials, with 10 trial increments. In the simulation, single RT were sampled with replacements among the roughly 1,000 trials available per condition and participant, so that each iteration is equivalent to a fake experiment.

Let’s look at the results for the median. The figure below shows the bias in the long run estimation of the difference between medians (Non-Word – Word), as a function of sample size in the Word condition. The Non-Word condition always had 200 trials. All participants are superimposed and shown as coloured traces. The average across participants is shown as a thicker black line. As expected, bias tended to be negative with small sample sizes. For the smallest sample size, the average bias was -11 ms. That’s probably substantial enough to seriously distort estimation in some experiments. Also, variability is high, with a 80% highest density interval of [-17.1, -2.6] ms. Bias decreases rapidly with increasing sample size. For n=60, it is only 1 ms.

But inter-participant variability remains high, so we should be cautious interpreting results with large numbers of trials but few participants. To quantify the group uncertainty, we could measure the probability of being wrong, given a level of desired precision, as demonstrated here for instance.

After bootstrap bias correction (with 200 bootstrap resamples), the average bias drops to roughly zero for all sample sizes: Bias correction also reduced inter-participant variability.

As we saw in the previous post, the sampling distribution of the median is skewed, so the standard measure of bias (taking the mean across simulation iterations) does not provide a good indication of the bias we can expect in a typical experiment. If instead of the mean, we compute the median bias, we get the following results: Now, at the smallest sample size, the average bias is only -2 ms, and it drops to near zero for n=20. This result is consistent with the simulations reported in the previous post and confirms that in the typical experiment, the average bias associated with the median is negligible.

What happens with the mean? The average bias of the mean is near zero for all sample sizes. Individual bias values are also much less variable than median values. This difference in bias variability does not reflect a difference in variability among participants for the two estimators of central tendency. In fact, the distributions of differences between Non-Word and Word conditions are very similar for the mean and the median. Estimates of spread are also similar between distributions:

IQR: mean RT = 78; median RT = 79

MAD: mean RT = 57; median RT = 54

VAR: mean RT = 4507; median RT = 4785

This suggests that the inter-participant bias differences are due to the shape differences observed in the first two figures of this post.

Finally, let’s consider the median bias of the mean. For the smallest sample size, the average bias across participants is 7 ms. This positive bias can be explained easily from the simulation results of post 3: because of the larger skewness in the Word condition, the sampling distribution of the mean was more positively skewed for small samples in that condition compared to the Non-Word condition, with the bulk of the bias estimates being negative. As a result, the mean tended to be more under-estimated in the Word condition, leading to larger Non-Word – Word differences in the typical experiment.

I have done a lot more simulations and was planning even more, using other datasets, but it’s time to move on! Of particular note, it appears that in difficult visual search tasks, skewness can differ dramatically among set size conditions – see for instance data posted here.

# Concluding remarks

The data-driven simulations presented here confirm results from our previous simulations:

• if we use the standard definition of bias, for small sample sizes, mean estimates are not biased, median estimates are biased;
• however, in the typical experiment (median bias), mean estimates can be more biased than median estimates;

• bootstrap bias correction can be an effective tool to reduce bias.

Given the large differences in inter-participant variability between the mean and the median, an important question is how to spend your money: more trials or more participants (Rouder & Haaf 2018)? An answer can be obtained by running simulations, either data-driven or assuming generative distributions (for instance exGaussian distributions for RT data). Simulations that take skewness into account are important to estimate bias and power. Assuming normality can have disastrous consequences.

Despite the potential larger bias and bias variability of the median compared to the mean, for skewed distributions I would still use the median as a measure of central tendency, because it provides a more informative description of the typical observations. Large sample sizes will reduce both bias and estimation variability, such that high-precision single-participant estimation should be easy to obtain in many situations involving non-clinical samples. For group estimations, much larger samples than commonly used are probably required to improve the precision of our inferences.

Although the bootstrap bias correction seems to work very well in the long run, for a single experiment there is no guarantee it will get you closer to the truth. One possibility is to report results with and without bias correction.

For group inferences on the median, traditional techniques use incorrect estimations of the standard error, so consider modern parametric or non-parametric techniques instead (Wilcox & Rousselet, 2018).

# References

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

Rouder, J.N. & Haaf, J.M. (2018) Power, Dominance, and Constraint: A Note on the Appeal of Different Design Traditions. Advances in Methods and Practices in Psychological Science, 1, 19-26.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8 42 41-48 42 30.