Tag Archives: sample size

Small n correlations + p values = disaster

Previously, we saw that with small sample sizes, correlation estimation is very uncertain, which implies that small n correlations cannot be trusted: the observed value in any experiment could be very far from the population value, and the sign could be wrong too. In addition to the uncertainty associated with small sample sizes, the selective report of results based on p values < 0.05 (or some other threshold), can lead to massively inflated correlation estimates in the literature (Yarkoni, 2009 ☜ if you haven’t done so, you really should read this excellent paper).

Let’s illustrate the problem (code is on GitHub). First, we consider a population rho = 0. Here is the sampling distribution as a function of sample size, as we saw in an earlier post. 

figure_rpval_ori

Figure 1: Sampling distribution for rho=0.

Now, here is the sampling distribution conditional on p < 0.05. The estimates are massively inflated and the problem gets worse with smaller sample sizes, because the smaller the sample size, the larger the correlations must be by chance for them to be significant.

figure_rpval_cond

Figure 2: Sampling distribution for rho=0, given p<0.05

So no, don’t get too excited when you see a statistically significant correlation in a paper…

Let’s do the same exercise when the population correlation is relatively large. With rho = 0.4, the sampling distribution looks like this:

figure_rpval_ori_04

Figure 3: Sampling distribution for rho=0.4.

If we report only those correlations associated with p < 0.05, the distribution looks like this:

figure_rpval_cond_04

Figure 4: Sampling distribution for rho=0.4, given p<0.05

Again, with small sample sizes, the estimates are inflated, albeit in the correct direction. There is nevertheless a small number of large negative correlations (see small purple bump around -0.6 -0.8). Indeed, in 0.77% of simulations, even though the population value was 0.4, a large and p < 0.05 negative correlation was obtained.

Advertisements

Correlations in neuroscience: are small n, interaction fallacies, lack of illustrations and confidence intervals the norm?

As reviewer, editor and reader of research articles, I’m regularly annoyed by the low standards in correlation analyses. In my experience with such articles, typically:

  • Pearson’s correlation, a non-robust measure of association, is used;
  • R and p values are reported, but not confidence intervals;
  • sample sizes tend to be small, leading to large estimation bias and inflated effect sizes in the literature;
  • R values and confidence intervals are not considered when interpreting the results;
  • instead, most analyses are reported as significant or non-significant (p<0.05), leading to the conclusion that an association exists or not (frequentist fallacy);
  • often figures illustrating the correlations are absent;
  • the explicit or implicit comparison of two correlations is done without a formal test (interaction fallacy).

To find out if my experience was in fact representative of the typical paper, I had a look at all papers published in 2017 in the European Journal of Neuroscience, where I’m a section editor. I care about the quality of the research published in EJN, so this is not an attempt at blaming a journal in particular, rather it’s a starting point to address a general problem. I really hope the results presented below will serve as a wake-up call for all involved and will lead to improvements in correlation analyses. Also, I bet if you look systematically at articles published in other neuroscience journals you’ll find the same problems. If you’re not convinced, go ahead, prove me wrong 😉 

I proceeded like this: for all 2017 articles (volumes 45 and 46), I searched for “correl” and I scanned for figures of scatterplots. If either searches were negative, the article was categorised as not containing a correlation analysis, so I might have missed a few. When at least one correlation was present, I looked for these details: 

  • n
  • estimator
  • confidence interval
  • R
  • p value
  • consideration of effect sizes
  • figure illustrating positive result
  • figure illustrating negative result
  • interaction test.

164 articles reported no correlation.

7 articles used regression analyses, with sample sizes as low as n=6, n=10, n=12 in 3 articles.

48 articles reported correlations.

Sample size

The norm was to not report degrees of freedom or sample size along with the correlation analyses or their illustrations. In 7 articles, the sample sizes were very difficult or impossible to guess. In the others, sample sizes varied a lot, both within and between articles. To confirm sample sizes, I counted the observations in scatterplots when they were available and not too crowded – this was a tedious job and I probably got some estimations and checks wrong. Anyway, I shouldn’t have to do all these checks, so something went wrong during the reviewing process. 

To simplify the presentation of the results, I collapsed the sample size estimates across articles. Here is the distribution: 

figure_ejn_sample_sizes

The figure omits 3 outliers with n= 836, 1397, 1407, all from the same article.

The median sample size is 18, which is far too low to provide sufficiently precise estimation.

Estimator

The issue with low sample sizes is made worse by the predominant use of Pearson’s correlation or the lack of consideration for the type of estimator. Indeed, 21 articles did not mention the estimator used at all, but presumably they used Pearson’s correlation.

Among the 27 articles that did mention which estimator was used:

  • 11 used only Pearson’s correlation;
  • 11 used only Spearman’s correlation;
  • 4 used Pearson’s and Spearman’s correlations;
  • 1 used Spearman’s and Kendall’s correlations.

So the majority of studies used an estimator that is well-known for its lack of robustness and its inaccurate confidence intervals and p values (Pernet, Wilcox & Rousselet, 2012).

R & p values

Most articles reported R and p values. Only 2 articles did not report R values. The same 2 articles also omitted p values, simply mentioning that the correlations were not significant. Another 3 articles did not report p values along with the R values.

Confidence interval

Only 3 articles reported confidence intervals, without mentioning how they were computed. 1 article reported percentile bootstrap confidence intervals for Pearson’s correlations, which is the recommended procedure for this estimator (Pernet, Wilcox & Rousselet, 2012).

Consideration for effect sizes

Given the lack of interest for measurement uncertainty demonstrated by the absence of confidence intervals in most articles, it is not surprising that only 5 articles mentioned the size of the correlation when presenting the results. All other articles simply reported the correlations as significant or not.

Illustrations

In contrast with the absence of confidence intervals and consideration for effect sizes, 23 articles reported illustrations for positive results. 4 articles reported only negative results, which leaves us with 21 articles that failed to illustrate the correlation results. 

Among the 40 articles that reported negative results, only 13 illustrated them, which suggests a strong bias towards positive results.

Interaction test

Finally, I looked for interaction fallacies (Nieuwenhuis, Forstmann & Wagenmakers 2011). In the context of correlation analyses, you commit an interaction fallacy when you present two correlations, one significant, the other not, implying that the 2 differ, but without explicitly testing the interaction. In other versions of the interaction fallacy, two significant correlations with the same sign are presented together, implying either that the 2 are similar, or that one is stronger than the other, without providing a confidence interval for the correlation difference. You can easily guess the other flavours… 

10 articles presented only one correlation, so there was no scope for the interaction fallacy. Among the 38 articles that presented more than one correlation, only one provided an explicit test for the comparison of 2 correlations. However, the authors omitted the explicit test for their next comparison!

Recommendations

In conclusion, at least in 2017 EJN articles, the norm is to estimate associations using small sample sizes and a non-robust estimator, to not provide confidence intervals and to not consider effect sizes and measurement uncertainty when presenting the results. Also, positive results are more likely to be illustrated than negative ones. Finally, interaction fallacies are mainstream.

How can we do a better job?

If you want to do a correlation analysis, consider your sample size carefully to assess statistical power and even better, your long-term estimation precision. If you have a small n, I wouldn’t even look at the correlation. 

Do not use Pearson’s correlation unless you have well-behaved and large samples, and you are only interested in linear relationships; otherwise explore robust measures of associations and techniques that provide valid confidence intervals (Pernet, Wilcox & Rousselet, 2012; Wilcox & Rousselet, 2018).

Reporting

These details are essential in articles reporting correlation analyses:

  • sample size for each correlation;
  • estimator of association;
  • R value;
  • confidence interval;
  • scatterplot illustration of every correlation, irrespective of the p value;
  • explicit comparison test of all correlations explicitly or implicitly compared;
  • consideration of effect sizes (R values) and their uncertainty (confidence intervals) in the interpretation of the results.

 Report p values if you want but they are not essential and should not be given a special status (McShane et al. 2018).

Finally, are you sure you really want to compute a correlation?

“Why then are correlation coefficients so attractive? Only bad reasons seem to come to mind. Worst of all, probably, is the absence of any need to think about units for either variable. Given two perfectly meaningless variables, one is reminded of their meaninglessness when a regression coefficient is given, since one wonders how to interpret its value. A correlation coefficient is less likely to bring up the unpleasant truth—we think we know what r = —.7 means. Do we? How often? Sweeping things under the rug is the enemy of good data analysis. Often, using the correlation coefficient is “sweeping under the rug” with a vengeance. Being so disinterested in our variables that we do not care about their units can hardly be desirable.”
Analyzing data: Sanctification or detective work?

John W. Tukey.
 American Psychologist, Vol 24(2), Feb 1969, 83-91. http://dx.doi.org/10.1037/h0027108

 

References

McShane, B.B., Gal, D., Gelman, A., Robert, C. & Tackett, J.L. (2018) Abandon Statistical Significance. arxiv.

Nieuwenhuis, S., Forstmann, B.U. & Wagenmakers, E.J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci, 14, 1105-1107.

Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.

Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Frontiers in human neuroscience, 6, 119.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8 42 41-48 42 30.

[preprint]

Small n correlations cannot be trusted

This post illustrates two important effects of sample size on the estimation of correlation coefficients: lower sample sizes are associated with increased variability and lower probability of replication. This is not specific to correlations, but here we’re going to have a detailed look at what it means when using the popular Pearson’s correlation (similar results are obtained using Spearman’s correlation, and the same problems arise with regression). The R code is available on github.


UPDATE: 2018-06-02

In the original post, I mentioned non-linearities in some of the figures. Jan Vanhove replied on Twitter that he was not getting any, and suggested a different code snippet. I’ve updated the simulations using his code, and now the non-linearities are gone! So thanks Jan!

Johannes Algermissen mentioned on Twitter that his recent paper covered similar issues. Have a look! He also reminded me about this recent paper that makes points very similar to those in this blog.

Gjalt-Jorn Peters mentioned on Twitter that “you can also use the Pearson distribution in package suppdists. Also see pwr.confintR to compute the required sample size for a given desired accuracy in parameter estimation (AIPE), which can also come in handy when planning studies”.

Wolfgang Viechtbauer‏ mentioned on Twitter “that one can just compute the density of r directly (no need to simulate). For example: link. Then everything is nice and smooth”.


Let’s start with an example, shown in the figure below. Nice scatterplot isn’t it! Sample size is 30, and r is 0.703. It seems we have discovered a relatively strong association between variables 1 and 2: let’s submit to Nature or PPNAS! And pollute the literature with another effect that won’t replicate!

figure_random_correlation

Yep, the data in the scatterplot are due to chance. They were sampled from a population with zero correlation. I suspect a lot of published correlations might well fall into that category. Nothing new here, false positives and inflated effect sizes are a natural outcome of small n experiments, and the problem gets worse with questionable research practices and incentives to publish positive new results. 

To understand the problem with estimation from small n experiments, we can perform a simulation in which we draw samples of different sizes from a normal population with a known Pearson’s correlation (rho) of zero. The sampling distributions of the estimates of rho for different sample sizes look like this: 

figure_sampling_distributions

 

Sampling distributions tell us about the behaviour of a statistics in the long run, if we did many experiments. Here, with increasing sample sizes, the sampling distributions are narrower, which means that in the long run, we get more precise estimates. However, a typical article reports only one correlation estimate, which could be completely off. So what sample size should we use to get a precise estimate? The answer depends on:

  • the shape of the univariate and bivariate distributions (if outliers are common, consider robust methods);

  • the expected effect size (the larger the effect, the fewer trials are needed – see below);

  • the precision we want to afford.

For the sampling distributions in the previous figure, we can ask this question for each sample size:

What is the proportion of correlation estimates that are within +/- a certain number of units from the true population correlation? For instance:

  • for 70% of estimates to be within +/- 0.1 of the true correlation value (between -0.1 and 0.1), we need at least 109 observations;

  • for 90% of estimates to be within +/- 0.2 of the true correlation value (between -0.2 and 0.2), we need at least 70 observations. 

These values are illustrated in the next figure using black lines and arrows. The figure shows the proportion of estimates near the true value, for different sample sizes, and for different levels of precision. The bottom-line is that even if we’re willing to make imprecise measurements (up to 0.2 from the true value), we need a lot of observations to be precise enough and often enough in the long run.  

figure_precision

 

The estimation uncertainty associated with small sample sizes leads to another problem: effects are not likely to replicate. A successful replication can be defined in several ways. Here I won’t consider the relatively trivial case of finding a statistically significant (p<0.05) effect going in the same direction in two experiments. Instead, let’s consider how close two estimates are. We can determine, given a certain level of precision, the probability to observe similar effects in two consecutive experiments. In other words, we can find the probability that two measurements differ by at most a certain amount. Not surprisingly, the results follow the same pattern as those observed in the previous figure: the probability to replicate (y-axis) increases with sample size (x-axis) and with the uncertainty we’re willing to accept (see legend with colour coded difference conditions).  

 

figure_replication

In the figure above, the black lines indicates that for 80% of replications to be at most 0.2 apart, we need at least 83 observations.

So far, we have considered samples from a population with zero correlation, such that large correlations were due to chance. What happens when there is an effect? Let see what happens for a fixed sample size of 30, as illustrated in the next figure. 

figure_sampling_distributions_rho

 

As a sanity check, we can see that the modes of the sampling distributions progressively increase with increasing population correlations. More interestingly, the sampling distributions also get narrower with increasing effect sizes. As a consequence, the larger the true effect we’re trying to estimate, the more precise our estimations. Or put another way, for a given level of desired precision, we need fewer trials to estimate a true large effect. The next figure shows the proportion of estimates close to the true estimate, as a function of the population correlation, and for different levels of precision, given a sample size of 30 observations.

figure_precision_rho

 

Overall, in the long run, we can achieve more precise measurements more often if we’re studying true large effects. The exact values will depend on priors about expected effect sizes, shape of distributions and desired precision or achievable sample size. Let’s look in more detail at the sampling distributions for a generous rho = 0.4.

figure_sampling_distributions_rho04

 

The sampling distributions for n<50 appear to be negatively skewed, which means that in the long run, experiments might tend to give biased estimates of the population value; in particular, experiments with n=10 or n=20 are more likely than others to get the sign wrong (long left tail) and to overestimate the true value (distribution mode shifted to the right). From the same data, we can calculate the proportion of correlation estimates close to the true value, as a function of sample size and for different precision levels.

figure_precision_rho04

 

We get this approximate results:

  • for 70% of estimates to be within +/- 0.1 of the true correlation value (between 0.3 and 0.5), we need at least 78 observations;

  • for 90% of estimates to be within +/- 0.2 of the true correlation value (between 0.2 and 0.6), we need at least 50 observations. 

You could repeat this exercise using the R code to get estimates based on your own priors and the precision you want to afford.

Finally, we can look at the probability to observe similar effects in two consecutive experiments, for a given precision. In other words, what is the probability that two measurements differ by at most a certain amount? The next figure shows results for differences ranging from 0.05 (very precise) to 0.4 (very imprecise). The black arrow illustrates that for 80% of replications to be at most 0.2 apart, we need at least 59 observations.

figure_replication_rho04

 

We could do the same analyses presented in this post for power. However, I don’t really see the point of looking at power if the goal is to quantify an effect. The precision of our measurements and of our estimations should be a much stronger concern than the probability to flag any effect as statistically significant (McShane et al. 2018).

There is a lot more to say about correlation estimation and I would recommend in particular these papers from Ed Vul and Tal Yarkoni, from the voodoo correlation era. More recently, Schönbrodt & Perugini (2013) looked at the effect of sample size on correlation estimation, with a focus on precision, similarly to this post. Finally, this more general paper (Forstmeier, Wagemakers & Parker, 2016) about false positives is well worth reading.

Reaction times and other skewed distributions: problems with the mean and the median (part 3/4)

Bias is defined as the distance between the mean of the sampling distribution (here estimated using Monte-Carlo simulations) and the population value. In part 1 and part 2, we saw that for small sample sizes, the sample median provides a biased estimation of the population median, which can significantly affect group comparisons. However, this bias disappears with large sample sizes, and it can be corrected using a bootstrap bias correction. In part 3, we look in more detail at the shape of the sampling distributions, which was ignored by Miller (1988).

Sampling distributions

Let’s consider the sampling distributions of the mean and the median for different sample sizes and ex-Gaussian distributions with skewness ranging from 6 to 92 (Figure 1). When skewness is limited (6, top row), the sampling distributions are symmetric and centred on the population values: there is no bias. As we saw previously, with increasing sample size, variability decreases, which is why studies with larger samples provide more accurate estimations. The flip side is that studies with small samples are much noisier, which is why their results tend not to replicate…

figure_samp_dist_summary

Figure 1

When skewness is large (92, middle row), sampling distributions get more positively skewed with decreasing sample sizes. To better understand how the sampling distributions change with sample size, we turn to the last row of Figure 1, which shows 50% highest-density intervals (HDI). Each horizontal line is a HDI for a particular sample size. The labels contain the values of the interval boundaries. The coloured vertical tick inside the interval marks the median of the distribution. The red vertical line spanning the entire plot is the population value.

Means For small sample sizes, the 50% HDI is offset to the left of the population mean, and so is the median of the sampling distribution. This demonstrates that the typical sample mean tends to under-estimate the population mean – that is to say, the mean sampling distribution is median biased. This offset reduces with increasing sample size, but is still present even for n=100.

Medians With small sample sizes, there is a discrepancy between the 50% HDI, which is shifted to the left of the population median, and the median of the sampling distribution, which is shifted to the right of the population median. This contrasts with the results for the mean, and can be explained by differences in the shapes of the sampling distributions, in particular the larger skewness and kurtosis of the median sampling distribution compared to that of the mean (see code on github for extra figures). The offset between 50% HDI and the population reduces quickly with increasing sample size. For n=10, the median bias is already very small. From n=15, the median sample distribution is not median bias, which means that the typical sample median is not biased.

Another representation of the sampling distributions is provided in Figure 2: 50% HDI are shown as a function of sample size. For both the mean and the median, bias increase with increasing skewness and decreasing sample size. Skewness also increases the asymmetry of the sampling distributions, but more for the mean than median.

figure_samp_dist_hdi_summary

Figure 2

So what’s going on here? Is the mean also biased?  According to the standard definition of bias, which is based on the distance between the population mean and the average of the sampling distribution of the mean, the mean is not biased. But this definition applies to the long run, after we replicate the same experiment many times. In practice, we never do that. So what happens in practice, when we perform only one experiment? In that case, the median of the sampling distribution provides a better description of the typical experiment than the mean of the distribution. And the median of the sample distribution of the mean is inferior to the population mean when sample size is small. So if you conduct one small n experiment and compute the mean of a skewed distribution, you’re likely to under-estimate the true value.

Is the median biased after all? The median is indeed biased according to the standard definition. However, with small n, the typical median (represented by the median of the sampling distribution of the median) is close to the population median, and the difference disappears for even relatively small sample sizes. 

Is it ok to use the median then?

If the goal is to accurately estimate the central tendency of a RT distribution, while protecting against the influence of outliers, the median is far more efficient than the mean (Wilcox & Rousselet, 2018). Providing sample sizes are large enough, bias is actually not a problem, and the typical bias is actually very small, as we’ve just seen. So, if you have to choose between the mean and the median, I would go for the median without hesitation.

It’s more complicated though. In an extensive series of simulations, Ratcliff (1993) demonstrated that when performing standard group ANOVAs, the median can lack power compared to other estimators. Ratcliff’s simulations involved ANOVAs on group means, in which for each participant, very few trials (7 to 12) are available for each condition. Based on the simulations, Ratcliff recommended data transformations or computing the mean after applying specific cut-offs to maximise power. However, these recommendations should be considered with caution because the results could be very different with more realistic sample sizes. Also, standard ANOVA on group means are not robust, and alternative techniques should be considered (Wilcox 2017). Data transformations are not ideal either, because they change the shape of the distributions, which contains important information about the nature of the effects. Also, once data are transformed, inferences are made on the transformed data, not on the original ones, an important caveat that tends to be swept under the carpet in articles’ discussions… Finally, truncating distributions introduce bias too, especially with the mean – see next section (Miller 1991; Ulrich & Miller 1994)!

At this stage, I don’t see much convincing evidence against using the median of RT distributions, if the goal is to use only one measure of location to summarise the entire distribution. Clearly, a better alternative is to not throw away all that information, by studying how entire distributions differ (Rousselet et al. 2017). For instance, explicit modelling of RT distributions can be performed with the excellent brms R package.

Other problems with the mean

In addition to being median biased, and a poor measure of central tendency for asymmetric distributions, the mean is also associated with several other important problems. Standard procedures using the mean lack of power, offer poor control over false positives, and lead to inaccurate confidence intervals. Detailed explanations of these problems are provided in Field & Wilcox (2017) and Wilcox & Rousselet (2018) for instance. For detailed illustrations of the problems associated with means in the one-sample case, when dealing with skewed distributions, see the companion reproducibility package on figshare.

If that was not enough, common outlier exclusion techniques lead to bias estimation of the mean (Miller, 1991). When applied to skewed distributions, removing any values more than 2 or 3 SD from the mean affects slow responses more than fast ones. As a consequence, the sample mean tends to underestimate the population mean. And this bias increases with sample size because the outlier detection technique does not work for small sample sizes, which results from the lack of robustness of the mean and the SD. The bias also increases with skewness. Therefore, when comparing distributions that differ in sample size, or skewness, or both, differences can be masked or created, resulting in inaccurate quantification of effect sizes.

Truncation using absolute thresholds (for instance RT < 300 ms or RT > 1,200 ms) also leads to potentially severe bias of the mean, median, standard deviation and skewness of RT distributions (Ulrich & Miller 1994). The median is much less affected by truncation bias than the mean though.

In the next and final post of this series, we will explore sampling bias in a real dataset, to see how much of a problem we’re really dealing with. Until then, thanks for reading.

[GO TO POST 4/4]

References

Field, A.P. & Wilcox, R.R. (2017) Robust statistical methods: A primer for clinical psychology and experimental psychopathology researchers. Behav Res Ther, 98, 19-38.

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

Miller, J. (1991) Reaction-Time Analysis with Outlier Exclusion – Bias Varies with Sample-Size. Q J Exp Psychol-A, 43, 907-912.

Ratcliff, R. (1993) Methods for dealing with reaction time outliers. Psychol Bull, 114, 510-532.

Rousselet, G.A., Pernet, C.R. & Wilcox, R.R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. The European journal of neuroscience, 46, 1738-1748.

Ulrich, R. & Miller, J. (1994) Effects of Truncation on Reaction-Time Analysis. Journal of Experimental Psychology-General, 123, 34-80.

Wilcox, R.R. (2017) Introduction to Robust Estimation and Hypothesis Testing. Academic Press, 4th edition., San Diego, CA.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8 42 41-48 42 30.

Bias & bootstrap bias correction


The code and a notebook for this post are available on github.

The bootstrap bias correction technique is described in detail in chapter 10 of this classic textbook:

Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. CRC press.

A mathematical summary + R code are available here.


In a typical experiment, we draw samples from an unknown population and compute a summary quantity, or sample statistic, which we hope will be close to the true population value. For instance, in a reaction time (RT) experiment, we might want to estimate how long it takes for a given participant to detect a visual stimulus embedded in noise. Now, to get a good estimate of our RTs, we ask our participant to perform a certain number of trials, say 100. Then, we might compute the mean RT, to get a value that summarises the 100 trials. This mean RT is an estimate of the true, population, value. In that case the population would be all the unknowable reaction times that could be generated by the participant, in the same task, in various situations – for instance after x hours of sleep, y cups of coffee, z pints of beer the night before – typically all these co-variates that we do not control. (The same logic applies across participants). So for that one experiment, the mean RT might under-estimate or over-estimate the population mean RT. But if we ask our participant to come to the lab over and over, each time to perform 100 trials, the average of all these sample estimates will converge to the population mean. We say that the sample mean is an unbiased estimator of the population mean. Certain estimators, in certain situations, are however biased: no matter how many experiments we perform, the average of the estimates is systematically off, either above or below the population value.

Let’s say we’re sampling from this very skewed distribution. It is an ex-Gaussian distribution which looks a bit like a reaction time distribution, but could be another skewed quantity – there are plenty to choose from in nature. The mean is 600, the median is 509.

figure.kde.pop.m.md

Now imagine we perform experiments to try to estimate these population values. Let say we take 1,000 samples of 10 observations. For each experiment (sample), we compute the mean. These sample means are shown as grey vertical lines in the next figure. A lot of them fall very near the population mean (black vertical line), but some of them are way off.

figure.kde.1000m

The mean of these estimates is shown with the black dashed vertical line. The difference between the mean of the mean estimates and the population value is called bias. Here bias is small (2.5). Increasing the number of experiments will eventually lead to a bias of zero. In other words, the sample mean is an unbiased estimator of the population mean.

For small sample sizes from skewed distributions, this is not the case for the median. In the example below, the bias is 15.1: the average median across 1,000 experiments over-estimates the population median.

figure.kde.1000md

Increasing sample size to 100 reduces the bias to 0.7 and improves the precision of our estimates. On average, we get closer to the population median, and the distribution of sample medians has much lower variance.

figure.kde.1000md_n100

So bias and measurement precision depend on sample size. Let’s look at sampling distributions as a function of sample size. First, we consider the mean.

The sample mean is not biased

Using our population, let’s do 1000 experiments, in which we take different numbers of samples 1000, 500, 100, 50, 20, 10.

Then, we can illustrate how close all these experiments get to the true (population) value.

figure_sample_mean

As expected, our estimates of the population mean are less and less variable with increasing sample size, and they converge towards the true population value. For small samples, the typical sample mean tends to underestimate the true population value (yellow curve). But despite the skewness of the sampling distribution with small n, the average of the 1000 simulations/experiments is very close to the population value, for all sample sizes:

  • population = 677.8

  • average sample mean for n=10 = 683.8

  • average sample mean for n=20 = 678.3

  • average sample mean for n=50 = 678.1

  • average sample mean for n=100 = 678.7

  • average sample mean for n=500 = 678.0

  • average sample mean for n=1000 = 678.0

The approximation will get closer to the true value with more experiments. With 10,000 experiments of n=10, we get 677.3. The result is very close to the true value. That’s why we say that the sample mean is an unbiased estimator of the population mean.

It remains that with small n, the sample mean tends to underestimate the population mean.

The median of the sampling distribution of the mean in the previous figure is 656.9, which is 21 ms under the population value. A good reminder that small sample sizes lead to bad estimation.

Bias of the median

The sample median is biased when n is small and we sample from skewed distributions:

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

However, the bias can be corrected using the bootstrap. Let’s first look at the sampling distribution of the median for different sample sizes.

figure_sample_median

Doesn’t look too bad, but for small sample sizes, on average the sample median over-estimates the population median:

  • population = 508.7

  • sample mean for n=10 = 522.1

  • sample mean for n=20 = 518.4

  • sample mean for n=50 = 511.5

  • sample mean for n=100 = 508.9

  • sample mean for n=500 = 508.7

  • sample mean for n=1000 = 508.7

Unlike what happened with the mean, the approximation does not get closer to the true value with more experiments. Let’s try 10,000 experiments of n=10:

  • sample mean = 523.9

There is systematic shift between average sample estimates and the population value: thus the sample median is a biased estimate of the population median. Fortunately, this bias can be corrected using the bootstrap.

Bias estimation and bias correction

A simple technique to estimate and correct sampling bias is the percentile bootstrap. If we have a sample of n observations:

  • sample with replacement n observations from our original sample

  • compute the estimate

  • perform steps 1 and 2 nboot times

  • compute the mean of the nboot bootstrap estimates

The difference between the estimate computed using the original sample and the mean of the bootstrap estimates is a bootstrap estimate of bias.

Let’s consider one sample of 10 observations from our skewed distribution. It’s median is: 535.9

The population median is 508.7, so our sample considerably over-estimate the population value.

Next, we sample with replacement from our sample, and compute bootstrap estimates of the median. The distribution obtained is a bootstrap estimate of the sampling distribution of the median. The idea is this: if the bootstrap distribution approximates, on average, the shape of the sampling distribution of the median, then we can use the bootstrap distribution to estimate the bias and correct our sample estimate. However, as we’re going to see, this works on average, in the long run. There is no guarantee for a single experiment.

figure_boot_md

Using the current data, the mean of the bootstrap estimates is  722.8.

Therefore, our estimate of bias is the difference between the mean of the bootstrap estimates and the sample median = 187.

To correct for bias, we subtract the bootstrap bias estimate from the sample estimate:

sample median – (mean of bootstrap estimates – sample median)

which is the same as:

2 x sample median – mean of bootstrap estimates.

Here the bias corrected sample median is 348.6. Quite a drop from 535.9. So the sample bias has been reduced dramatically, clearly too much. But bias is a long run property of an estimator, so let’s look at a few more examples. We take 100 samples of n = 10, and compute a bias correction for each of them. The arrows go from the sample median to the bias corrected sample median. The vertical black line shows the population median.

figure_mdbc_examples

  • population median =  508.7

  • average sample median =  515.1

  • average bias corrected sample median =  498.8

So across these experiments, the bias correction was too strong.

What happens if we perform 1000 experiments, each with n=10, and compute a bias correction for each one? Now the average of the bias corrected median estimates is much closer to the true median.

  • population median =  508.7

  • average sample median =  522.1

  • average bias corrected sample median =  508.6

It works very well!

But that’s not always the case: it depends on the estimator and on the amount of skewness.

I’ll cover median bias correction in more detail in a future post. For now, if you study skewed distributions (most bounded quantities such as time measurements are skewed) and use the median, you should consider correcting for bias. But it’s unclear in which situation to act: clearly, bias decreases with sample size, but with small sample sizes the bias can be poorly estimated, potentially resulting in catastrophic adjustments. Clearly more simulations are needed…

Problems with small sample sizes

In psychology and neuroscience, the typical sample size is too small. I’ve recently seen several neuroscience papers with n = 3-6 animals. For instance, this article uses n = 3 mice per group in a one-way ANOVA. This is a real problem because small sample size is associated with:

  • low statistical power

  • inflated false discovery rate

  • inflated effect size estimation

  • low reproducibility

Here is a list of excellent publications covering these points:

Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S. & Munafo, M.R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature reviews. Neuroscience, 14, 365-376.

Colquhoun, D. (2014) An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci, 1, 140216.

Forstmeier, W., Wagenmakers, E.J. & Parker, T.H. (2016) Detecting and avoiding likely false-positive findings – a practical guide. Biol Rev Camb Philos Soc.

Lakens, D., & Albers, C. J. (2017, September 10). When power analyses based on pilot data are biased: Inaccurate effect size estimators and follow-up bias. Retrieved from psyarxiv.com/b7z4q

See also these two blog posts on small n:

When small samples are problematic

Low Power & Effect Sizes

Small sample size also prevents us from properly estimating and modelling the populations we sample from. As a consequence, small n stops us from answering a fundamental, yet often ignored empirical question: how do distributions differ?

This important aspect is illustrated in the figure below. Columns show distributions that differ in four different ways. The rows illustrate samples of different sizes. The scatterplots were jittered using ggforce::geom_sina in R. The vertical black bars indicate the mean of each sample. In row 1, examples 1, 3 and 4 have exactly the same mean. In example 2 the means of the two distributions differ by 2 arbitrary units. The remaining rows illustrate random subsamples of data from row 1. Above each plot, the t value, mean difference and its confidence interval are reported. Even with 100 observations we might struggle to approximate the shape of the parent population. Without additional information, it can be difficult to determine if an observation is an outlier, particularly for skewed distributions. In column 4, samples with n = 20 and n = 5 are very misleading.

figure1

Small sample size could be less of a problem in a Bayesian framework, in which information from prior experiments can be incorporated in the analyses. In the blind and significance obsessed frequentist world, small n is a recipe for disaster.