Tag Archives: sample size

Bias & bootstrap bias correction

The code and a notebook for this post are available on github.

The bootstrap bias correction technique is described in detail in chapter 10 of this classic textbook:

Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC press.

A mathematical summary + R code are available here.

In a typical experiment, we draw samples from an unknown population and compute a summary quantity, or sample statistic, which we hope will be close to the true population value. For instance, in a reaction time (RT) experiment, we might want to estimate how long it takes for a given participant to detect a visual stimulus embedded in noise. Now, to get a good estimate of our RTs, we ask our participant to perform a certain number of trials, say 100. Then, we might compute the mean RT, to get a value that summarises the 100 trials. This mean RT is an estimate of the true, population, value. In that case the population would be all the unknowable reaction times that could be generated by the participant, in the same task, in various situations – for instance after x hours of sleep, y cups of coffee, z pints of beer the night before – typically all these co-variates that we do not control. (The same logic applies across participants). So for that one experiment, the mean RT might under-estimate or over-estimate the population mean RT. But if we ask our participant to come to the lab over and over, each time to perform 100 trials, the average of all these sample estimates will converge to the population mean. We say that the sample mean is an unbiased estimator of the population mean. Certain estimators, in certain situations, are however biased: no matter how many experiments we perform, the average of the estimates is systematically off, either above or below the population value.

Let’s say we’re sampling from this very skewed distribution. It is an ex-Gaussian distribution which looks a bit like a reaction time distribution, but could be another skewed quantity – there are plenty to choose from in nature. The mean is 600, the median is 509.
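To make the examples concrete, here is a minimal sketch of how such an ex-Gaussian population could be generated in R. The parameter values mu = 300, sigma = 20 and tau = 300 are my assumptions, chosen so that the mean (mu + tau) is 600; the notebook on github may use different values.

```r
# Minimal sketch: sample from an ex-Gaussian distribution
# (normal + independent exponential). Parameter values are assumptions.
rexgauss <- function(n, mu = 300, sigma = 20, tau = 300) {
  rnorm(n, mean = mu, sd = sigma) + rexp(n, rate = 1 / tau)
}

set.seed(21)
pop <- rexgauss(1e6)   # large sample standing in for the population
round(mean(pop))       # about 600 (= mu + tau)
round(median(pop))     # about 509
```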


Now imagine we perform experiments to try to estimate these population values. Let's say we take 1,000 samples of 10 observations each. For each experiment (sample), we compute the mean. These sample means are shown as grey vertical lines in the next figure. A lot of them fall very near the population mean (black vertical line), but some of them are way off.


The mean of these estimates is shown with the black dashed vertical line. The difference between the mean of the mean estimates and the population value is called bias. Here bias is small (2.5). Increasing the number of experiments will eventually lead to a bias of zero. In other words, the sample mean is an unbiased estimator of the population mean.

For small sample sizes from skewed distributions, this is not the case for the median. In the example below, the bias is 15.1: the average median across 1,000 experiments over-estimates the population median.


Increasing sample size to 100 reduces the bias to 0.7 and improves the precision of our estimates. On average, we get closer to the population median, and the distribution of sample medians has much lower variance.
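Here is a minimal sketch of that simulation, assuming the pop vector defined above. Because the seed and population details differ from the notebook, the bias values will not match exactly.

```r
# Sketch: bias of the sample median for n = 10 and n = 100,
# using the pop vector defined earlier (an assumption).
pop_md <- median(pop)
nsim <- 1000

for (n in c(10, 100)) {
  md <- replicate(nsim, median(sample(pop, n, replace = TRUE)))
  cat("n =", n, ": bias =", round(mean(md) - pop_md, 1), "\n")
}
```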


So bias and measurement precision depend on sample size. Let’s look at sampling distributions as a function of sample size. First, we consider the mean.

The sample mean is not biased

Using our population, let's run 1,000 experiments, each time drawing samples of size n = 10, 20, 50, 100, 500 or 1,000.

Then, we can illustrate how close all these experiments get to the true (population) value.
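A rough sketch of that simulation is shown below, again assuming the pop vector defined earlier. The population used in the notebook is not necessarily the one sketched above, so the averages produced by this code will not match the list that follows.

```r
# Sketch: sampling distribution of the mean for several sample sizes.
# Assumes the pop vector defined earlier.
nsim <- 1000
sizes <- c(10, 20, 50, 100, 500, 1000)

res <- sapply(sizes, function(n) {
  m <- replicate(nsim, mean(sample(pop, n, replace = TRUE)))
  c(n = n, average = mean(m), bias = mean(m) - mean(pop))
})
round(t(res), 1)   # one row per sample size
```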


As expected, our estimates of the population mean are less and less variable with increasing sample size, and they converge towards the true population value. For small samples, the typical sample mean tends to underestimate the true population value (yellow curve). But despite the skewness of the sampling distribution with small n, the average of the 1000 simulations/experiments is very close to the population value, for all sample sizes:

  • population = 677.8

  • average sample mean for n=10 = 683.8

  • average sample mean for n=20 = 678.3

  • average sample mean for n=50 = 678.1

  • average sample mean for n=100 = 678.7

  • average sample mean for n=500 = 678.0

  • average sample mean for n=1000 = 678.0

The approximation gets closer to the true value with more experiments: with 10,000 experiments of n=10, we get 677.3, which is very close to the population value. That's why we say that the sample mean is an unbiased estimator of the population mean.

Nevertheless, with small n, the typical sample mean tends to underestimate the population mean.

The median of the sampling distribution of the mean in the previous figure is 656.9, which is 21 ms under the population value. A good reminder that small sample sizes lead to bad estimation.

Bias of the median

The sample median is biased when n is small and we sample from skewed distributions:

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

However, the bias can be corrected using the bootstrap. Let’s first look at the sampling distribution of the median for different sample sizes.


Doesn’t look too bad, but for small sample sizes, on average the sample median over-estimates the population median:

  • population = 508.7

  • average sample median for n=10 = 522.1

  • average sample median for n=20 = 518.4

  • average sample median for n=50 = 511.5

  • average sample median for n=100 = 508.9

  • average sample median for n=500 = 508.7

  • average sample median for n=1000 = 508.7

Unlike what happened with the mean, the approximation does not get closer to the true value with more experiments. Let’s try 10,000 experiments of n=10:

  • average sample median = 523.9

There is a systematic shift between the average sample estimate and the population value: the sample median is a biased estimator of the population median. Fortunately, this bias can be corrected using the bootstrap.

Bias estimation and bias correction

A simple technique to estimate and correct sampling bias is the percentile bootstrap. If we have a sample of n observations:

  1. sample with replacement n observations from our original sample

  2. compute the estimate

  3. perform steps 1 and 2 nboot times

  4. compute the mean of the nboot bootstrap estimates

The difference between the mean of the bootstrap estimates and the estimate computed using the original sample is a bootstrap estimate of bias.
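A minimal sketch of these steps as an R function; the name boot_bias and the choice of nboot = 1000 are mine, not taken from the notebook.

```r
# Sketch: percentile bootstrap estimate of the bias of an estimator,
# following the numbered steps above.
boot_bias <- function(x, estimator = median, nboot = 1000) {
  boot_est <- replicate(nboot, estimator(sample(x, length(x), replace = TRUE)))
  mean(boot_est) - estimator(x)   # mean of bootstrap estimates minus sample estimate
}
```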

Let's consider one sample of 10 observations from our skewed distribution. Its median is 535.9.

The population median is 508.7, so our sample median considerably over-estimates the population value.

Next, we sample with replacement from our sample, and compute bootstrap estimates of the median. The distribution obtained is a bootstrap estimate of the sampling distribution of the median. The idea is this: if the bootstrap distribution approximates, on average, the shape of the sampling distribution of the median, then we can use the bootstrap distribution to estimate the bias and correct our sample estimate. However, as we’re going to see, this works on average, in the long run. There is no guarantee for a single experiment.


Using the current data, the mean of the bootstrap estimates is 722.8.

Therefore, our estimate of bias is the difference between the mean of the bootstrap estimates and the sample median: 722.8 - 535.9 ≈ 187.

To correct for bias, we subtract the bootstrap bias estimate from the sample estimate:

sample median – (mean of bootstrap estimates – sample median)

which is the same as:

2 x sample median – mean of bootstrap estimates.
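As a sketch, the correction can be wrapped in another hypothetical helper that reuses boot_bias() from above. Applied to a sample of n = 10 drawn from the pop vector defined earlier, it will not reproduce the exact numbers below, because the data and seed differ from the notebook.

```r
# Sketch: bias corrected median = 2 * sample median - mean of bootstrap medians.
# Assumes boot_bias() and pop defined earlier.
bc_median <- function(x, nboot = 1000) {
  median(x) - boot_bias(x, median, nboot)
}

set.seed(21)
samp <- sample(pop, 10)   # one experiment with n = 10
median(samp)              # sample median
bc_median(samp)           # bias corrected sample median
```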

Here the bias corrected sample median is 348.6. Quite a drop from 535.9. The correction has clearly over-shot: the adjustment is far too large. But bias is a long-run property of an estimator, so let's look at a few more examples. We take 100 samples of n = 10, and compute a bias correction for each of them. The arrows go from the sample median to the bias corrected sample median. The vertical black line shows the population median.


  • population median =  508.7

  • average sample median =  515.1

  • average bias corrected sample median =  498.8

So across these experiments, the bias correction was too strong.

What happens if we perform 1000 experiments, each with n=10, and compute a bias correction for each one? Now the average of the bias corrected median estimates is much closer to the true median.

  • population median =  508.7

  • average sample median =  522.1

  • average bias corrected sample median =  508.6

It works very well!
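Here is a hedged sketch of that long-run check, assuming the pop vector and the bc_median() helper defined above. With a different population and seed, the averages will differ slightly from the values reported above.

```r
# Sketch: long-run behaviour of the bias correction,
# 1000 experiments with n = 10 each. Can take a little while to run.
nsim <- 1000
res <- replicate(nsim, {
  samp <- sample(pop, 10, replace = TRUE)
  c(md = median(samp), bc = bc_median(samp))
})
rowMeans(res)   # average sample median and average bias corrected median
median(pop)     # population median, for comparison
```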

But that’s not always the case: it depends on the estimator and on the amount of skewness.

I'll cover median bias correction in more detail in a future post. For now, if you study skewed distributions (most bounded quantities, such as time measurements, are skewed) and use the median, you should consider correcting for bias. But it's unclear when the correction should be applied: bias decreases with sample size, yet with small samples the bias itself can be poorly estimated, potentially resulting in catastrophic adjustments. Clearly more simulations are needed…


Problems with small sample sizes

In psychology and neuroscience, the typical sample size is too small. I’ve recently seen several neuroscience papers with n = 3-6 animals. For instance, this article uses n = 3 mice per group in a one-way ANOVA. This is a real problem because small sample size is associated with:

  • low statistical power

  • inflated false discovery rate

  • inflated effect size estimation

  • low reproducibility

Here is a list of excellent publications covering these points:

Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S. & Munafo, M.R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature reviews. Neuroscience, 14, 365-376.

Colquhoun, D. (2014) An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci, 1, 140216.

Forstmeier, W., Wagenmakers, E.J. & Parker, T.H. (2016) Detecting and avoiding likely false-positive findings – a practical guide. Biol Rev Camb Philos Soc.

Lakens, D., & Albers, C. J. (2017, September 10). When power analyses based on pilot data are biased: Inaccurate effect size estimators and follow-up bias. Retrieved from psyarxiv.com/b7z4q

See also these two blog posts on small n:

When small samples are problematic

Low Power & Effect Sizes

Small sample size also prevents us from properly estimating and modelling the populations we sample from. As a consequence, small n stops us from answering a fundamental, yet often ignored empirical question: how do distributions differ?

This important aspect is illustrated in the figure below. Columns show distributions that differ in four different ways. The rows illustrate samples of different sizes. The scatterplots were jittered using ggforce::geom_sina in R. The vertical black bars indicate the mean of each sample. In row 1, examples 1, 3 and 4 have exactly the same mean. In example 2 the means of the two distributions differ by 2 arbitrary units. The remaining rows illustrate random subsamples of data from row 1. Above each plot, the t value, mean difference and its confidence interval are reported. Even with 100 observations we might struggle to approximate the shape of the parent population. Without additional information, it can be difficult to determine if an observation is an outlier, particularly for skewed distributions. In column 4, samples with n = 20 and n = 5 are very misleading.


Small sample size could be less of a problem in a Bayesian framework, in which information from prior experiments can be incorporated into the analyses. In the blind, significance-obsessed frequentist world, small n is a recipe for disaster.