Reaction times and other skewed distributions: problems with the mean and the median (part 2/4)

As we saw in the previous post, the sample median is biased when sampling from skewed distributions. The bias increases with decreasing sample size. According to Miller (1988), because of this bias, group comparison can be affected if the two groups differ in skewness or sample size, or both. As a result, real differences can be lowered or increased, and non-existent differences suggested. In Miller’s own words:

“An important practical consequence of the bias in median reaction time is that sample medians must not be used to compare reaction times across experimental conditions when there are unequal numbers of trials in the conditions.”

Here we will see that this advice is wrong for two reasons.

We assess the problem using a simulation in which we draw samples of same or different sizes from populations that differ in skewness or not. We use the same 12 distributions used by Miller (1988), as described previously.

Group 2 has size 200 and is sampled from the distribution with the least skewness.

Group 1 has size 10 to 200, in increments of 10, and is sampled from the 12 distributions.

After 10,000 iterations, here are the results for the mean:


All the bias values are near zero, as expected. The shaded area shows the upper part of the mean’s bias 50% HDI (highest density interval), when group 1 and group 2 have the same, least, skewness (6). This interval shows the location of the bulk of the 10,000 simulations. The same area is shown in the next figures. Again, this is a reminder that bias is defined in the long run: a single experiment (one of our simulation iterations) can be far off the population value, especially with small sample sizes.

Here are the results for the median:


As in the previous figure, the shaded area shows the upper part of the mean’s bias 50% HDI, when group 1 and group 2 have the same skewness. Bias increases with skewness and sample size difference and is particularly large for n = 10. However, if the two groups have the same skewness (skewness 6), there is almost no bias even when group 2 has 200 observations and group 1 only has 10. So it seems Miller’s (1988) warning about differences in sample sizes was inappropriate, because the main factor causing bias is skewness.

Next, let’s find out if we can correct the bias. Bias correction is performed in 2 ways:

  • using the bootstrap

  • using subsamples, following Miller’s suggestion.

Miller (1988) suggested:

“Although it is computationally quite tedious, there is a way to use medians to reduce the effects of outliers without introducing a bias dependent on sample size. One uses the regular median from Condition F and compares it with a special “average median” (Am) from Condition M. To compute Am, one would take from Condition M all the possible subsamples of Size f where f is the number of trials in Condition F. For each subsample one computes the subsample median. Then, Am is the average, across all possible subsamples, of the subsample medians. This procedure does not introduce bias, because all medians are computed on the basis of the same sample (subsample) size.”

Using all possible subsamples would take far too long. For instance, if one group has 5 observations and the other group has 20 observations, there are 15504 (choose(20,5)) subsamples to consider. Slightly larger sample sizes would force us to consider millions of subsamples. So instead we compute K random subsamples. I arbitrarily set K to 1,000. Although this is not what Miller (1988) suggested, the K loop shortcut should reduce bias to some extent if it is due to sample size differences. Here are the results:


The K loop approach has only a small effect on bias. The reason is simple: the main cause of the bias is not a difference in sample size, it is a difference in skewness. This skewness difference can be handled by the bootstrap. Here is what we get using 200 bootstrap resamples for each simulation iteration:


The bootstrap bias correction works very well. The median’s bias is still a bit larger than the mean’s bias for n = 10 in some conditions. For instance, for the most skewed distribution and n = 10, the mean’s bias is -0.51, whereas the median’s bias after bias correction is 0.73. None of them are exactly zero and the absolute values are very similar. At most, for n = 10, the median’s maximum bias across distributions is 1.85 ms, whereas the mean’s is 0.7 ms.


Miller’s (1988) advice was inappropriate because, when comparing two groups, bias originates from a conjunction of differences in skewness and in sample size, not from differences in sample size alone. Although, in practice it might well be that skewed distributions from different conditions/groups tend to differ in skewness, in which case unequal sample sizes will exacerbate the bias. To be cautious, when sample size is relatively small, it would be useful to report median effects with and without bootstrap bias correction.

Finally, data & code are available on github.


Reaction times and other skewed distributions: problems with the mean and the median (part 1/4)

In this series of 4 posts, I replicate, expand and discuss the results from

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

Part 1 = replicate Miller’s simulations + apply bootstrap bias correction

Part 2 = expand Miller’s simulations to group comparison

Part 3 = problems with the mean

Part 4 = application to a large dataset

Data & code are available on github.

Reaction times (RT) and many other quantities in neuroscience & psychology are skewed. This asymmetry tends to differ among experimental conditions, such that a measure of central tendency and a measure of spread are insufficient to capture how conditions differ. Instead, to understand the potentially rich differences among distributions, it is advised to consider multiple quantiles of the distributions (Doksum 1974; Pratte et al. 2010; Rousselet et al. 2017), or to model the shapes of the distributions (Heathcote et al. 1991; Rouder et al. 2005; Palmer et al. 2011; Matzke et al. 2013). Yet, it is still common practice to summarise reaction time distributions using a single number, most often the mean: that one value for each participant and each condition can then be entered into a group ANOVA to make statistical inferences. Because of the skewness of reaction times, the mean is however a poor measure of central tendency: skewness shifts the mean away from the bulk of the distribution, an effect that can be amplified by the presence of outliers or a thick right tail. For instance, in the figure below, the median better represents the typical observation than the mean because it is closer to the bulky part of the distribution.


Mean and median for a very right skewed distribution. The distribution is bounded to the left and has a long right tail. This is an ex-Gaussian distribution, which is popular to model the shape of reaction time distributions, even though the matching between its parameters and mental processes can be debated (Sternberg, 2014).

So the median appears to be a better choice than the mean if the goal is to have a single value to tell us about the location of most observations in a skewed distribution. The choice between the mean and the median is however more complicated. It could be argued that because the mean is sensitive to skewness, outliers and the thickness of the right tail, it is better able to capture changes in the shapes of the distributions among conditions. But the use of a single value to capture shape differences will lead to intractable analyses because the same mean could correspond to various shapes. Instead, a multiple quantile approach or explicit shape modelling should be used.

The mean and the median differ in another important aspect: for small sample sizes, the sample mean is unbiased, whereas the sample median is biased (see illustrations in previous post). Concretely, if we perform many times the same RT experiment, and for each experiment we compute the mean and the median, the average mean will be very close to the population mean. As the number of experiments increases, the average sample mean will converge to the exact population mean. This is not the case for the median when sample size is small.

The reason for this bias is explain by Miller (1988):

“Like all sample statistics, sample medians vary randomly (from sample to sample) around the true population median, with more random variation when the sample size is smaller. Because medians are determined by ranks rather than magnitudes of scores, the population percentiles of sample medians vary symmetrically around the desired value of 50%. For example, a sample median is just as likely to be the score at the 40th percentile in the population as the score at the 60th percentile. If the original distribution is positively skewed, this symmetry implies that the distribution of sample medians will also be positively skewed. Specifically, unusually large sample medians (e.g., 60th percentile) will be farther above the population median than unusually small sample medians (e.g., 40th percentile) will be below it. The average of all possible sample medians, then, will be larger than the true median, because sample medians less than the true value will not be small enough to balance out the sample medians greater than the true value. Naturally, the more the distribution is skewed, the greater will be the bias in the sample median.”

To illustrate the sample median’s bias, Miller (1988) employed 12 ex-Gaussian distributions that differ in skewness. The distributions are illustrated in the next figure, and colour-coded using the difference between the mean and the median as a non-parametric measure of skewness.


Here are the parameters of the 12 distributions with the associated population parameters.



The first figure in this post used the most skewed distribution of the 12, with parameters (300, 20, 300).

To estimate bias, we run a simulation in which we sample with replacement 10,000 times from each of the 12 distributions. We take random samples of sizes 4, 6, 8, 10, 15, 20, 25, 35, 50 and 100, as did Miller. For each random sample, we compute the mean and the median. For each sample size and ex-Gaussian parameter, the bias is then defined as the difference between the mean of the 10,000 sample estimates and the population value.

First, we check that the mean is not biased:


Each line shows the results for one type of ex-Gaussian distribution: the mean of 10,000 simulations for different sample sizes. The grey area marks the 50% highest-density interval (HDI) of the 10,000 simulations for the least skewed distribution (the same interval is shown in the next two figures for comparison). The interval shows the variability across simulations and highlights an important aspect of the results: bias is a long-run property of an estimator; there is no guarantee that one value from a single experiment will be close to the population value. Also, the variability among samples increases with decreasing sample size, which is why results across small n experiments can differ substantially.

Here are the median bias estimates in table format:



Columns = sample sizes; rows = skewness


The values are very close to the values reported in Miller 1998:


An illustration is easier to grasp:


As reported by Miller (1988), bias can be quite large and it gets worse with decreasing sample sizes and increasing skewness.

Based on these results, Miller made this recommendation:

“An important practical consequence of the bias in median reaction time is that sample medians must not be used to compare reaction times across experimental conditions when there are unequal numbers of trials in the conditions.”

According to Google Scholar, Miller (1988) has been cited 172 times. A look at some of the oldest and most recent citations reveals that his advice has been followed.

For instance, Lavie (1995) noted:

“In the following experiments I used the mean RTs for each participant rather than the medians, as the increase in number of go/no-go errors under the high-load conditions resulted in a different number of responses in between conditions (see Miller, 1988).”

Tipper et al. (1992):

Analysis by Miller (1988) shows that for large numbers of trials, differences in the numbers between conditions […] has no impact on the medians obtained.

More recently, Robinson et al. (2018):

“[…] comparing medians among conditions with an unequal number of trials may lead to false positives (Miller, 1988)”

Du et al. (2017):

“We chose the mean rather than median of RT as […] the sample median may provide a biased estimation of RT (Miller, 1988)”

The list goes on. Also, in a review paper, Whelan (2008), cited 324 times, reiterates the advice:

“The median should never be used on RT data to compare conditions with different numbers of trials.”

However, there are several problems with Miller’s advice. In particular, using the mean leads to many issues with estimation and statistical inferences, as we will see in the 3rd post. In this post, we tackle one key omission from Miller’s assessment: the bias of the sample median can be corrected, using a percentile bootstrap bias correction, as described in this previous post.

For each iteration in the simulation, bias correction was performed using 200 bootstrap samples. Here are the bias corrected results:


The bias correction works very well on average, except for the smallest sample sizes. The failure of the bias correction for very small n is not surprising, because the shape of the sampling distribution cannot be properly estimated by the bootstrap from so few observations. From n = 10, the bias values are very close to those observed for the mean. So it seems that in the long-run, we can eliminate the bias of the sample median by using a simple bootstrap procedure. As we will see in the next post, the bootstrap bias correction is also effective when comparing two groups.

Update: 06 Feb 2018

Finally, let’s have a look at how well bias correction works as a function of sample size and skewness. The smaller the sample size, the less the bootstrap can estimate the sampling distribution and its bias. Ideally, after bias correction, we would expect the sample medians to be centred around zero, with limited variability. This ideal situation could look something like that:


Here we’re considering the most skewed distribution with the smallest sample size. The x-axis shows the median bias before bias correction, whereas the y-axis shows the median bias after bias correction. In this ideal case, the correction works extremely well irrespective of the original bias (average bias is the red vertical line). As a result, the average bias after bias correction is very near zero (horizontal green line).

The reality is very different. The bias correction is only partially effective and is inhomogeneous.


Let’s add some markup to help understand what’s going on.


If the original bias is negative, after correction, the median tends to be even more negative, so over corrected in the wrong direction (lower left triangle).

If the original bias is positive, after correction, the median is either:
– over corrected in the right direction (lower right triangle)
– under corrected in the right direction (middle right triangle)
– over corrected in the wrong direction (upper right triangle)

This pattern remains, although attenuated, if we consider the largest sample size.


Or if we consider the least skewed distribution.


We can look at the different patterns as a function of sample size and skewness.

The figure below shows the probability of over correcting in the wrong direction given that the original bias is negative (lower left triangle of the marked up figure).


In the ideal situation illustrated previously, the expected proportion of over correction in the wrong direction, given an original negative bias, is 6.7%. So here we clearly have an overrepresentation of these cases. When the original bias is negative, in most cases the bootstrap is unable to correct in the right direction. The situation gets worse with increasing skewness and smaller sample sizes.

The figure below shows the probability of under correcting in the right direction given that the original bias is positive (middle right triangle of the marked up figure)


In the ideal situation illustrated previously, the expected proportion of under correction in the right direction, given an original positive bias, is 44.7 %. So here, we have an overrepresentation of these cases. When the original bias is positive, in too many cases the bootstrap corrects in the right direction, but it under-corrects. The situation gets worse with increasing skewness and smaller sample sizes.

Ok, so bias correction is imperfect and it varies a lot, it part depending on whether the sample median fell above or below the unknown population median. Think using the mean is safer? There are several strong arguments to the contrary – more in another post.


Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Annals of Statistics, 2, 267-277.

Du, Y., Valentini, N.C., Kim, M.J., Whitall, J. & Clark, J.E. (2017) Children and Adults Both Learn Motor Sequences Quickly, But Do So Differently. Frontiers in Psychology, 8.

Heathcote, A., Popiel, S.J. & Mewhort, D.J.K. (1991) Analysis of Response-Time Distributions – an Example Using the Stroop Task. Psychol Bull, 109, 340-347.

Lavie, N. (1995) Perceptual Load as a Necessary Condition for Selective Attention. J Exp Psychol Human, 21, 451-468.

Matzke, D., Love, J., Wiecki, T.V., Brown, S.D., Logan, G.D. & Wagenmakers, E.J. (2013) Release the BEESTS: Bayesian Estimation of Ex-Gaussian STop Signal reaction time distributions. Front Psychol, 4.

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

Palmer, E.M., Horowitz, T.S., Torralba, A. & Wolfe, J.M. (2011) What Are the Shapes of Response Time Distributions in Visual Search? J Exp Psychol Human, 37, 58-71.

Pratte, M.S., Rouder, J.N., Morey, R.D. & Feng, C.N. (2010) Exploring the differences in distributional properties between Stroop and Simon effects using delta plots. Atten Percept Psycho, 72, 2013-2025.

Robinson, M. M., Clevenger, J., & Irwin, D. E. (2018). The action is in the task set, not in the action. Cognitive psychology, 100, 17-42.

Rouder, J.N., Lu, J., Speckman, P., Sun, D.H. & Jiang, Y. (2005) A hierarchical model for estimating response time distributions. Psychon B Rev, 12, 195-223.

Sternberg, S. (2014) Reaction times and the ex-Gaussian distribution: When is it appropriate? Retrieved from

Tipper, S.P., Lortie, C. & Baylis, G.C. (1992) Selective Reaching – Evidence for Action-Centered Attention. J Exp Psychol Human, 18, 891-905.

Whelan, R. (2008) Effective analysis of reaction time data. Psychol Rec, 58, 475-482.


Can someone tell if a person is alive or dead based on a photograph?

In this post I review the now retracted paper:

Delorme A, Pierce A, Michel L and Radin D (2016) Prediction of Mortality Based on Facial Characteristics. Front. Hum. Neurosci. 10:173. doi: 10.3389/fnhum.2016.00173

In typical Frontiers’ style, the reason for the retraction is obscure.

In December 2016, I made negative comments about the paper on Twitter. Arnaud Delorme (first author, and whom I’ve known for over 20 years), got in touch, asking for clarifications about my points. I said I would write something eventually, so there it goes.

The story is simple: some individuals claim to be able to determine if a person is alive or dead based on a photograph. The authors got hold of 12 such individuals and asked them to perform a dead/alive/don’t know discrimination task. EEG was measured while participants viewed 394 photos of individuals alive or dead (50/50).

Here are some of the methodological problems.


Participants were from California. Some photographs were of US politicians outside California. Participants did not recognise any individuals from the photographs, but unconscious familiarity could still influence behaviour and EEG – who knows?

More importantly, if participants make general claims about their abilities, why not use photographs of individuals from another country altogether? Even better, another culture?


The average group performance of the participants was 53.6%. So as a group, they really can’t do the task. (If you want to argue they can, I challenge you to seek treatment from a surgeon with  a 53.6% success record.) Yet, a t-test is reported with p=0.005. Let’s not pay too much attention to the inappropriateness of t-tests for percent correct data. The crucial point is that the participants did not make a claim about their performance as a group: each one of them claimed to be able to tease apart the dead from the living based on photographs. So participants should be assessed individually. Here are the individual performances:

(52.3, 56.7, 53.3, 56.0, 56.6, 51.8, 61.3, 55.3, 50.0, 51.6, 49.5, 49.4)

5 participants have results flagged as significant. One in particular has a performance of 61.3% correct. So how does it compare to participants without super abilities? Well, astonishingly, there is no control group! (Yep, and that paper was peer-reviewed.)

Given the extra-ordinary claims made by the participants, I would have expected a large sample of control participants, to clearly demonstrate that the “readers” perform well beyond normal. I would also have expected the readers to be tested on multiple occasions, to demonstrate the reliability of the effect.

There are two other problems with the behavioural performance:

  • participants’ responses were biased towards the ‘dead’ response, so a sensitivity analysis, such as a d’ or a non-parametric equivalent should have been used.

  • performance varied a lot across the 3 sets of images that composed the 394 photographs. This suggests that the results are not image independent, which could in part be due to the 3 sets containing different proportions of dead and alive persons.


The ERP analyses were performed at the group level using a 2 x 2 design: alive/dead x correct/incorrect classification. One effect is reported with p<0.05: a larger amplitude for incorrect than correct around 100 ms post-stimulus, only for pictures of dead persons. A proper spatial-temporal cluster correction for multiple comparison was applied. There is no clear interpretation of the effect in the paper, except a quick suggestion that it could be due to low-level image properties or an attention effect. A non-specific attention effect is possible, because sorting ERPs based on behavioural performance can be misleading, as explained here. The effect could also be a false positive – in the absence of replication and comparison to a control group, it’s impossible to tell.

To be frank, I don’t understand why EEG was measured at all. I guess if the self-proclaimed readers could do the task at all, it would be interesting to look at the time-course of the brain activity related to the task. But the literature on face recognition shows very little modulation due to identity, except in priming tasks or using SSVEP protocols – so not likely to show anything with single image presentations. If there was something to exploit, the analysis should be performed at the participant level, perhaps using multivariate logistic regression, with cross-validation, to demonstrate a link between brain activity and image type. Similarly to behaviour, each individual result from the “readers” should be compared to a large set of control results, from participants who cannot perform the behavioural task.


In conclusion, this paper should never have been sent for peer-review. That would have saved everyone involved a lot of time. There is nothing in the paper supporting the authors’ conclusion:

“Our results support claims of individuals who report that some as-yet unknown features of the face predict mortality. The results are also compatible with claims about clairvoyance warrants further investigation.”

If the authors are serious about studying clairvoyance, they should embark on a much more ambitious study. To save time and money, I would suggest to drop EEG from the study, to focus on the creation of a large bank of images from various countries and cultures and repeated measurements of readers and many controls.

What can we learn from 10,000 experiments?

The code and a notebook for this post are available on github.

Before we conduct an experiment, we decide on a number of trials per condition, and a number of participants, and then we hope that whatever we measure comes close to the population values we’re trying to estimate. In this post I’ll use a large dataset to illustrate how close we can get to the truth – at least in a lexical decision task. The data are from the French lexicon project:

Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Meot, A., Augustinova, M. & Pallier, C. (2010) The French Lexicon Project: lexical decision data for 38,840 French words and 38,840 pseudowords. Behav Res Methods, 42, 488-496.

After discarding participants who clearly did not pay attention, we have 967 participants who performed a word/non-word discrimination task. Each participant has about 1000 trials per condition. I only consider the reaction time (RT) data, but correct/incorrect data are also available. Here are RT distributions for 3 random participants (PXXX refers to their ID number in the dataset):




The distributions are positively skewed, as expected for RT data, and participants tend to be slower in the non-word condition compared to the word condition. Usually, a single number is used to summarise each individual RT distribution. From 1000 values to 1, that’s some serious data compression. (For alternative strategies to compare RT distributions, see for instance this paper). In psychology, the mean is often used, but here the median gives a better indication of the location of the typical observation. Let’s use both. Here is the distribution across participants of median RT for the word and non-word conditions:


And the distribution of median differences:


Interestingly, the differences between the medians of the two conditions is also skewed: that’s because the two distributions tend to differ in skewness.

We can do the same for the mean:


Mean differences:


With this large dataset, we can play a very useful game. Let’s pretend that the full dataset is our population that we’re trying to estimate. Across all trials and all participants, the population medians are:

Word = 679.5 ms
Non-word = 764 ms
Difference = 78.5

the population means are:

Word = 767.6 ms
Non-word = 853.2 ms
Difference = 85.5

Now, we can perform experiments by sampling with replacement from our large population. I think we can all agree that a typical experiment does not have 1,000 trials per condition, and certainly does not have almost 1,000 participants! So what happens if we perform experiments in which we collect smaller sample sizes? How close can we get to the truth?

10,000 experiments: random samples of participants only

In the first simulation, we use all the trials (1,000) available for each participant but vary how many participants we sample from 10 to 100, in steps of 10. For each sample size, we draw 10,000 random samples of participants from the population, and we compute the mean and the median. For consistency, we compute group means of individual means, and group medians of individual medians. We have 6 sets of results to consider: for the word condition, the non-word condition, their difference; for the mean and for the median. Here are the results for the group medians in the word condition. The vertical red line marks the population median for the word condition.


With increasing sample size, we gain in precision: on average the result of each experiment is closer to the population value. With a small sample size, the result of a single experiment can be quite far from the population. This is not surprising, and also explains why estimates from small n experiments tend to disagree, especially between an original study and subsequent replications.

To get a clearer sense of how close the estimates are to the population value, it is useful to consider highest density intervals (HDI). A HDI is the shorted interval that contains a certain proportion of observations: it shows the location of the bulk of the observations. Here we consider 50% HDI:


In keeping with the previous figure, the intervals get smaller with increasing sample size. The intervals are also asymmetric: the right side is always closer to the population value than the left side. That’s because the sampling distributions are asymmetric. This means that the typical experiment will tend to under-estimate the population value.

We observe a similar pattern for the non-word condition, with the addition of two peaks in the distribution from n = 40. I’m not sure what’s causing it, or if it is a common shape. One could check by analysing the lexicon project data from other countries. Any volunteers? One thing for sure: the two peaks are caused by the median, because they are absent from the mean distributions (see notebook on github).


Finally, we can consider the distributions of group medians of median differences:


Not surprisingly, it has a similar shape to the previous distributions. It shows a landscape of expected effects. It would be interesting to see where the results of other, smaller, experiments fall in that distribution. The distribution could also be used to plan experiments to achieve a certain level of precision. This is rather unusual given the common obsession for statistical power. But experiments should be predominantly about quantifying effects, so a legitimate concern is to determine how far we’re likely to be from the truth in a given experiment. Using our simulated distribution of experiments, we can determine the probability of the absolute difference between an experiment and the population value to be larger than some cut-offs. In the figure below, we look at cut-offs from 5 ms to 50 ms, in steps of 5.


The probability of being wrong by at least 5 ms is more than 50% for all sample sizes. But 5 ms is probably not a difference folks studying lexical decision would worry about. It might be an important difference in other fields. 10 ms might be more relevant: I certainly remember a conference in which group differences of 10 ms between attention conditions were seriously discussed, and their implications for theories of attention considered. If we care about a resolution of at least 10 ms, n=30 still gives us 44% chance of being wrong…

(If someone is looking for a cool exercise: you could determine the probability of two random experiments to differ by at least certain amounts.)

We can also look at the 50% HDI of the median differences:


As previously noted, the intervals shrink with increasing sample size, but are asymmetric, suggesting that the typical experiment will under-estimate the population difference. This asymmetry disappears for larger samples of group means:


Anyway, these last figures really make me think that we shouldn’t make such a big deal out of any single experiment, given the uncertainty in our measurements.

10,000 experiments: random samples of trials and participants

In the previous simulation, we sampled participants with replacement, using all their 1,000 trials. Typical experiments use fewer trials. Let say I plan to collect data from 20 participants, with 100 trials per condition. What does the sampling distribution look like? Using our large dataset, we can sample trials and participants to find out. Again, for consistency, group means are computed from individual means, and group medians are computed from individual medians. Here are the sampling distributions of the mean and the median in the word condition:


The vertical lines mark the population values. The means are larger than the medians because of the positive skewness of RT distributions.

In the non-word condition:


And the difference distributions:


Both distributions are slightly positively skewed, and their medians fall to the left of the population values. The spread of expected values is quite large.

Similarly to simulation 1, we can determine the probability of being wrong by at least 5 ms, which is 39%. For 10 ms, the probability is 28%, and for 20 ms 14%. But that’s assuming 20 participants and 100 trials per condition. A larger simulation could investigate many combinations of sample sizes…

Finally, we consider the 50% HDI, which are both asymmetric with respect to the population values and with similar lengths:



To answer the question from the title, I think the main things we can learn from the simulated sampling distributions above are:
– to embrace uncertainty
– modesty in our interpretations
– healthy skepticism about published research
One way to highlight uncertainty in published research would be to run simulations using descriptive statistics from the articles, to illustrate the range of potential outcomes that are compatible with the results. That would make for interesting stories I’m sure.

Bias & bootstrap bias correction

The code and a notebook for this post are available on github.

The bootstrap bias correction technique is described in detail in chapter 10 of this classic textbook:

Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC press.

A mathematical summary + R code are available here.

In a typical experiment, we draw samples from an unknown population and compute a summary quantity, or sample statistic, which we hope will be close to the true population value. For instance, in a reaction time (RT) experiment, we might want to estimate how long it takes for a given participant to detect a visual stimulus embedded in noise. Now, to get a good estimate of our RTs, we ask our participant to perform a certain number of trials, say 100. Then, we might compute the mean RT, to get a value that summarises the 100 trials. This mean RT is an estimate of the true, population, value. In that case the population would be all the unknowable reaction times that could be generated by the participant, in the same task, in various situations – for instance after x hours of sleep, y cups of coffee, z pints of beer the night before – typically all these co-variates that we do not control. (The same logic applies across participants). So for that one experiment, the mean RT might under-estimate or over-estimate the population mean RT. But if we ask our participant to come to the lab over and over, each time to perform 100 trials, the average of all these sample estimates will converge to the population mean. We say that the sample mean is an unbiased estimator of the population mean. Certain estimators, in certain situations, are however biased: no matter how many experiments we perform, the average of the estimates is systematically off, either above or below the population value.

Let’s say we’re sampling from this very skewed distribution. It is an ex-Gaussian distribution which looks a bit like a reaction time distribution, but could be another skewed quantity – there are plenty to choose from in nature. The mean is 600, the median is 509.

Now imagine we perform experiments to try to estimate these population values. Let say we take 1,000 samples of 10 observations. For each experiment (sample), we compute the mean. These sample means are shown as grey vertical lines in the next figure. A lot of them fall very near the population mean (black vertical line), but some of them are way off.


The mean of these estimates is shown with the black dashed vertical line. The difference between the mean of the mean estimates and the population value is called bias. Here bias is small (2.5). Increasing the number of experiments will eventually lead to a bias of zero. In other words, the sample mean is an unbiased estimator of the population mean.

For small sample sizes from skewed distributions, this is not the case for the median. In the example below, the bias is 15.1: the average median across 1,000 experiments over-estimates the population median.


Increasing sample size to 100 reduces the bias to 0.7 and improves the precision of our estimates. On average, we get closer to the population median, and the distribution of sample medians has much lower variance.


So bias and measurement precision depend on sample size. Let’s look at sampling distributions as a function of sample size. First, we consider the mean.

The sample mean is not biased

Using our population, let’s do 1000 experiments, in which we take different numbers of samples 1000, 500, 100, 50, 20, 10.

Then, we can illustrate how close all these experiments get to the true (population) value.


As expected, our estimates of the population mean are less and less variable with increasing sample size, and they converge towards the true population value. For small samples, the typical sample mean tends to underestimate the true population value (yellow curve). But despite the skewness of the sampling distribution with small n, the average of the 1000 simulations/experiments is very close to the population value, for all sample sizes:

  • population = 677.8

  • average sample mean for n=10 = 683.8

  • average sample mean for n=20 = 678.3

  • average sample mean for n=50 = 678.1

  • average sample mean for n=100 = 678.7

  • average sample mean for n=500 = 678.0

  • average sample mean for n=1000 = 678.0

The approximation will get closer to the true value with more experiments. With 10,000 experiments of n=10, we get 677.3. The result is very close to the true value. That’s why we say that the sample mean is an unbiased estimator of the population mean.

It remains that with small n, the sample mean tends to underestimate the population mean.

The median of the sampling distribution of the mean in the previous figure is 656.9, which is 21 ms under the population value. A good reminder that small sample sizes lead to bad estimation.

Bias of the median

The sample median is biased when n is small and we sample from skewed distributions:

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

However, the bias can be corrected using the bootstrap. Let’s first look at the sampling distribution of the median for different sample sizes.


Doesn’t look too bad, but for small sample sizes, on average the sample median over-estimates the population median:

  • population = 508.7

  • sample mean for n=10 = 522.1

  • sample mean for n=20 = 518.4

  • sample mean for n=50 = 511.5

  • sample mean for n=100 = 508.9

  • sample mean for n=500 = 508.7

  • sample mean for n=1000 = 508.7

Unlike what happened with the mean, the approximation does not get closer to the true value with more experiments. Let’s try 10,000 experiments of n=10:

  • sample mean = 523.9

There is systematic shift between average sample estimates and the population value: thus the sample median is a biased estimate of the population median. Fortunately, this bias can be corrected using the bootstrap.

Bias estimation and bias correction

A simple technique to estimate and correct sampling bias is the percentile bootstrap. If we have a sample of n observations:

  • sample with replacement n observations from our original sample

  • compute the estimate

  • perform steps 1 and 2 nboot times

  • compute the mean of the nboot bootstrap estimates

The difference between the estimate computed using the original sample and the mean of the bootstrap estimates is a bootstrap estimate of bias.

Let’s consider one sample of 10 observations from our skewed distribution. It’s median is: 535.9

The population median is 508.7, so our sample considerably over-estimate the population value.

Next, we sample with replacement from our sample, and compute bootstrap estimates of the median. The distribution obtained is a bootstrap estimate of the sampling distribution of the median. The idea is this: if the bootstrap distribution approximates, on average, the shape of the sampling distribution of the median, then we can use the bootstrap distribution to estimate the bias and correct our sample estimate. However, as we’re going to see, this works on average, in the long run. There is no guarantee for a single experiment.


Using the current data, the mean of the bootstrap estimates is  722.8.

Therefore, our estimate of bias is the difference between the mean of the bootstrap estimates and the sample median = 187.

To correct for bias, we subtract the bootstrap bias estimate from the sample estimate:

sample median – (mean of bootstrap estimates – sample median)

which is the same as:

2 x sample median – mean of bootstrap estimates.

Here the bias corrected sample median is 348.6. Quite a drop from 535.9. So the sample bias has been reduced dramatically, clearly too much. But bias is a long run property of an estimator, so let’s look at a few more examples. We take 100 samples of n = 10, and compute a bias correction for each of them. The arrows go from the sample median to the bias corrected sample median. The vertical black line shows the population median.


  • population median =  508.7

  • average sample median =  515.1

  • average bias corrected sample median =  498.8

So across these experiments, the bias correction was too strong.

What happens if we perform 1000 experiments, each with n=10, and compute a bias correction for each one? Now the average of the bias corrected median estimates is much closer to the true median.

  • population median =  508.7

  • average sample median =  522.1

  • average bias corrected sample median =  508.6

It works very well!

But that’s not always the case: it depends on the estimator and on the amount of skewness.

I’ll cover median bias correction in more detail in a future post. For now, if you study skewed distributions (most bounded quantities such as time measurements are skewed) and use the median, you should consider correcting for bias. But it’s unclear in which situation to act: clearly, bias decreases with sample size, but with small sample sizes the bias can be poorly estimated, potentially resulting in catastrophic adjustments. Clearly more simulations are needed…

Your power is lower than you think

The previous post considered alpha when sampling from normal and non-normal distributions. Here the simulations are extended to look at power in the one-sample case. Statistical power is the long term probability of returning a significant test when there is an effect, or probability of true positives.

Power depends both on sample size and effect size. This is illustrated in the figure below, which reports simulations with 5,000 iterations, using t-test on means applied to samples from a normal distribution.


Now, let’s look at power in less idealistic conditions, for instance when sampling from a lognormal distribution, which is strongly positively skewed. This power simulation used 10,000 iterations and an effect size of 0.5, i.e. we sample from distributions that are shifted by 0.5 from zero.


Under normality (dashed lines), as expected the mean performs better than the 20% trimmed mean and the median: we need smaller sample sizes to reach 80% power when using the mean. However, when sampling from a lognormal distribution the performance of the three estimators is completely reversed: now the mean performs worse; the 20% trimmed mean performs much better; the median performs even better. So when sampling from a skewed distribution, the choice of statistical test can have large effects on power. In particular, a t-test on the mean can have very low power, whereas a t-test on a trimmed mean, or a test on the median can provide much larger power.

As we did in the previous post, let’s look at power in different situations in which we vary the asymmetry and the tails of the distributions. The effect size is 0.5.

Asymmetry manipulation

A t-test on means performs very well under normality (g=0), as we saw in the previous figure. However, as asymmetry increases, power is strongly affected. With large asymmetry (g>1) the t-test is biased: starting from very low sample sizes, power goes down with increasing sample sizes, before going up again in some situations.


A t-test using a 20% trimmed mean is dramatically less affected by asymmetry than the mean.


The median also performs much better than the mean but it behaves differently from the mean and the 20% trimmed mean: power increases with increasing asymmetry!


Tail manipulation

What happens when we manipulate the tails instead? Remember that samples from distributions with heavy tails tend to contain outliers, which affect disproportionally the mean and the variance compared to robust estimators. Not surprisingly, t-tests on means are strongly affected by heavy tails.


The 20% trimmed mean boosts power significantly, although it is still affected by heavy tails.


The median performs the best, showing very limited power drop with increasing tail thickness.



The simulations presented here are of course limited, but they serve as a reminder that power should be estimated using realistic distributions, for instance if the goal is to estimate well-known skewed distributions such as reaction times. The choice of estimators is also critical, and it would be wise to consider robust estimators whenever appropriate.


Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Wilcox, Rand; Rousselet, Guillaume (2017): A guide to robust statistical methods in neuroscience. figshare.

Your alpha is probably not 0.05

The R code to run the simulations and create the figures is on github.

The alpha level, or type I error rate, or significance level, is the long-term probability of false positives: obtaining a significant test when there is in fact no effect. Alpha is traditionally given the arbitrary value of 0.05, meaning that in the long-run, we will commit 5% of false positives. However, on average, we will be wrong much more often (Colquhoun, 2014). As a consequence, some have advocated to lower alpha to avoid fooling ourselves too often in the long run. Others have suggested to avoid using the term “statistically significant” altogether, and to justify the choice of alpha. Another very sensible suggestion is to not bother with arbitrary thresholds at all:

Colquhoun, D. (2014) An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci, 1, 140216.

Justify Your Alpha: A Response to “Redefine Statistical Significance”

When the statistical tail wags the scientific dog

Here I want to address a related but different problem: assuming we’re happy with setting alpha to a particular value, say 0.05, is alpha actually equal to the expected, nominal, value in realistic situations?

Let’s check using one-sample estimation as an example. First, we assume we live in a place of magic, where unicorns are abundant (Micceri 1989): we draw random samples from a standard normal distribution. For each sample of a given size, we perform a t-test. We do that 5000 times and record the proportions of false positives. The results appear in the figure below:


As expected under normality, alpha is very near 0.05 for all sample sizes. Bradley (1978) suggested that keeping the probability of a type I error between 0.025 and 0.075 is satisfactory if the goal is to achieve 0.05. So we illustrate that satisfactory zone as a grey band in the figure above and subsequent ones.

So far so good, but how many quantities we measure have perfectly symmetric distributions with well-behaved tails extending to infinity? Many (most?) quantities related to time measurements are positively skewed and bounded at zero: behavioural reaction times, onset and peak latencies from recordings of single neurones, EEG, MEG… Percent correct data are bounded [0, 1]. Physiological measurements are naturally bounded, or bounded by our measuring equipment (EEG amplifiers for instance). Brain imaging data can have non-normal distributions (Pernet et al. 2009). The list goes on…

So what happens when we sample from a skewed distribution? Let’s look at an example using a lognormal distribution. This time we run 10,000 iterations sampling from a normal distribution and a lognormal distribution, and each time we apply a t-test:


In the normal case (dashed blue curve), results are similar to those obtained in the previous figure: we’re very near 0.05 at all sample sizes. In the lognormal case (solid blue curve) the type I error rate is much larger than 0.05 for small sample sizes. It goes down with increasing sample size, but is still above 0.05 even with 500 observations. The point is clear: if we sample from skewed distributions, our type I error rate is larger than the nominal level, which means that in the long run, we will commit more false positives than we thought.

Fortunately, there is a solution: the detrimental effects of skewness can be limited by trimming the data. Under normality, t-tests on 20% trimmed means do not perform as well as t-tests on means: they trigger more false positives for small sample sizes (dashed green). Sampling from a lognormal distribution also increases the type I error rate of the 20% trimmed mean (solid green), but clearly not as much as what we observed for the mean. The median (red) performs even better, with estimates very near 0.05 whether we sample from normal or lognormal distributions.

To gain a broader perspective, we compare the mean, the 20% trimmed mean and the median in different situations. To do that, we use g & h distributions (Wilcox 2012). These distributions have a median of zero; the parameter g controls the asymmetry of the distribution, while the parameter h controls the tails.

Here are examples of distributions with h = 0 and g is varied. The lognormal distribution corresponds to g = 1. The standard normal distribution is defined by g = h = 0.


Here are examples of distributions with g=0 and h is varied. The tails get heavier with larger values of h.


Asymmetry manipulation

Let’s first look at the results for t-tests on means:


The type I error rate is strongly affected by asymmetry, and the effect gets worse with smaller sample sizes.

The effect of asymmetry is much less dramatic is we use a 20% trimmed mean:


And a test on the median doesn’t seem to be affected by asymmetry at all, irrespective of sample size:


The median performs better because it provides a more accurate estimation of location than the mean.

Tail manipulation

If instead of manipulating the degree of asymmetry, we manipulate the thickness of the tails, now the type I error rate goes down as tails get heavier. This is because sampling from distributions with heavy tails tends to generate outliers, which tend to inflate the variance. The consequence is an alpha below the nominal level and low statistical power (as seen in next post).


Using a 20% trimmed mean improves matter significantly because trimming tends to get rid of outliers.


Using the median leads to even better results.



The simulations presented here are by far not exhaustive:

  • they are limited to the one-sample case;

  • they only consider three statistical tests;

  • they do not consider conjunctions of g and h values.

But they help make an important point: when sampling from non-normal distributions, the type I error rate is probably not at the nominal level when making inferences on the mean. And the problem is exacerbated with small sample sizes. So, in all the current discussions about alpha levels, we must also consider the types of distributions we are investigating and the estimators used to make inferences about them. See for instance simulations looking at the two independent sample case here.


Bradley, J. V. (1978). Robustness? British Journal of Mathematical & Statistical Psychology, 31, 144-152.

Micceri, T. (1989) The Unicorn, the Normal Curve, and Other Improbable Creatures. Psychol Bull, 105, 156-166.

Pernet, C.R., Poline, J.B., Demonet, J.F. & Rousselet, G.A. (2009) Brain classification reveals the right cerebellum as the best biomarker of dyslexia. BMC Neurosci, 10, 67.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Wilcox, Rand; Rousselet, Guillaume (2017): A guide to robust statistical methods in neuroscience. figshare.