Why normality tests are great…

…as a teaching example and should be avoided in research.

These statements are common in the psychology and neuroscience literature:

“In order to assess the normal distribution of the population in terms of age, BV% and CSF%, the Lilliefors-corrected Kolmogorov–Smirnov test was performed” (Porcu et al. 2019)

“The Kolmogorov–Smirnov-Test revealed a normal distribution (p = 0.82).” (Knolle et al. 2019)

“The distribution was not normal (P < 0.01 with the Shapiro–Wilk test).” (Beaudu-Lange et al. 2001)

“Assumptions of the one-way anova for normality were also confirmed with the Shapiro–Wilk test.” (Holloway et al. 2015)

“The Shapiro-Wilk-W-test (P < 0.05) revealed that all distributions could be assumed to be Gaussian as a prerequisite for the application of a t-test.” (Dicke et al. 2008)

“Given the non-normal distribution of such data (Shapiro–Wilk’s p < .05), we applied a nonparametric one-sample t test (the one-sample Wilcoxon signed rank test).” (Zapparoli et al. 2019)

A common recipe goes like this:

  • apply a normality test;
  • if p>0.05, conclude that the data are normally distributed and proceed with a parametric test;
  • if p<0.05, conclude that the data are not normally distributed and proceed with a non-parametric test (or transform the data to try to achieve normality).

It is a useful exercise or class activity to consider the statements above with the goal of identifying all the underlying issues. It could take several hours of teaching to do justice to the rich topics we need to cover to properly understand these issues.

Here is a succinct and non-exhaustive list of issues, with references for follow-up readings:

[1] In the general context of linear regression, the normality assumption applies to the residuals, not the marginal distributions. The main solution involves graphical checks of the residuals (Ernst & Albers, 2017; Vanhove, 2018).

Resources for graphical checks:

Visualization of Regression Models Using visreg

Visualizing regression model predictions

Extracting and visualizing tidy residuals from Bayesian models

Other solutions involve model comparison, to contrast models making different assumptions, and using models robust to assumption violations (Bürkner, 2017; Kruschke, 2013; Wilcox & Rousselet, 2018).

[2] The p value from standard frequentist tests, such as normality tests, cannot be used to accept the null (Rouder et al., 2016; Kruschke, 2018). Because the p value is computed assuming that the null is true, it cannot in turn be used to support the null: that would be circular. To find support for the null, we need an alternative hypothesis (to compute a Bayes Factor; Rouder et al., 2016; Wagenmakers et al., 2020) or a Region of Practical Equivalence (ROPE, to compute a test of equivalence; Freedman et al., 1984; Kruschke, 2018; Lakens, 2017; Campbell & Gustafson, 2021). Setting an alternative hypothesis is also crucial to get a consistent test (Rouder et al., 2016; Kruschke & Liddell, 2018). Tests of normality, like all Point Null Hypothesis Significance Tests (PNHST), are inconsistent: given alpha = 0.05, even if normality holds, 5% of tests will be positive no matter how large the sample size is.
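The inconsistency described in [2] can be checked with a small simulation. The sketch below is in Python and uses only the standard library. As an assumption made for simplicity, it uses a two-sided z-test on the mean of normal data (standard deviation known) as a stand-in for a normality test, because the point applies to any point null test. The function name `false_positive_rate` and the sample sizes are illustrative choices.

```python
import random
from statistics import NormalDist, fmean

def false_positive_rate(n, n_sim=2000, alpha=0.05, seed=1):
    """Proportion of z-tests that reject a true null (mean = 0, sd = 1 known)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(n_sim):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = fmean(sample) * n ** 0.5  # z statistic when the sd is known to be 1
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sim

# The null is true in every simulated experiment, whatever the sample size.
rates = {n: false_positive_rate(n) for n in (10, 100, 400)}
for n, rate in rates.items():
    print(f"n = {n:3d}: proportion of positive tests = {rate:.3f}")
```

The proportion of positive tests should hover around alpha at every sample size: collecting more data does not make the test converge on the true null.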

[3] Failure to reject (p>0.05) doesn’t mean data were sampled from a normal distribution. Another function could fit the data equally well (for instance a shifted lognormal distribution). This point follows directly from [2]. Since our alternative hypothesis is extremely vague, the possibility of another distribution being a plausible data generation process is ignored: the typical test considers only a point null hypothesis versus “anything else”. So when we ask a very vague question, we can only get a very vague answer (there is no free lunch in inference – Rouder et al., 2016).

[4] Failure to reject (p>0.05) could be due to low power. This is well known but usually ignored. Here are the results of simulations to illustrate this point. The code is available on GitHub. We sample from g-and-h distributions (Yan & Genton, 2019), which let us vary asymmetry (parameter g) and tail-thickness (parameter h, which also affects how peaky the distribution is). We start by varying g, keeping a constant h=0.
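For readers who want to experiment, here is a minimal Python sketch of sampling from a g-and-h distribution. It is not the simulation code linked above. It assumes the usual definition, in which a standard normal value Z is transformed to (exp(gZ) - 1)/g × exp(hZ²/2), with the first factor replaced by Z when g = 0. The names `ghdist` and `skewness` are illustrative.

```python
import math
import random

def ghdist(n, g=0.0, h=0.0, rng=random):
    """Draw n values from a g-and-h distribution (transformed standard normal)."""
    out = []
    for _ in range(n):
        z = rng.gauss(0, 1)
        core = z if g == 0 else (math.exp(g * z) - 1) / g  # g controls asymmetry
        out.append(core * math.exp(h * z * z / 2))          # h controls tail thickness
    return out

def skewness(x):
    """Sample skewness: third central moment over the cubed standard deviation."""
    m = sum(x) / len(x)
    m2 = sum((v - m) ** 2 for v in x) / len(x)
    m3 = sum((v - m) ** 3 for v in x) / len(x)
    return m3 / m2 ** 1.5

rng = random.Random(44)
skew_by_g = {g: skewness(ghdist(20000, g=g, rng=rng)) for g in (0, 0.2, 0.5, 1)}
for g, s in skew_by_g.items():
    print(f"g = {g}: sample skewness = {s:.2f}")
```

With h = 0, the sample skewness should be near zero for g = 0 and grow as g increases.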

g-and-h populations used in the simulation in which we vary parameter g

Here are results for the Shapiro-Wilk test, based on a simulation with 10,000 iterations.

The Shapiro-Wilk test has low power unless the departure from normality is pronounced, or sample sizes are large. With small departures from normality (say g=0.1, g=0.2), achieving high power won’t be possible with typical sample sizes in psychology and neuroscience. For g=0, the proportion of false positives is at the expected 5% level (false positive rate).

The Kolmogorov-Smirnov test is dramatically less powerful than the Shapiro-Wilk test (Yap & Sim, 2011).

What happens if we sample from symmetric distributions that are more prone to outliers than the normal distribution? By varying the h parameter, keeping a constant g=0, we can consider distributions that are progressively more kurtotic than the normal distribution.

g-and-h populations used in the simulation in which we vary parameter h

Are the tests considered previously able to detect such deviations from normality? Here is how the Shapiro-Wilk test behaves.

And here are the catastrophic results for the Kolmogorov-Smirnov test.

[5] As the sample size increases, progressively smaller deviations from normality can be detected, eventually reaching absurd levels of precision, such that tiny differences of no practical relevance will be flagged. This point applies to all PNHST and again follows from [2]: because in PNHST no alternative is considered, tests are biased against the null (Rouder et al., 2016; Wagenmakers et al., 2020). Even when p<0.05, contrasting two hypotheses could reveal that a normal distribution and a non-normal distribution are equally plausible, given our data. Also, because PNHST is not consistent, even when the null is true, 5% of tests will be positive.

[6] Choosing a model conditional on the outcome of a preliminary check affects sampling distributions and thus p values and confidence intervals. The same problem arises when doing balance tests. If a t-test is conditional on a normality test, the p value of the t-test will differ, by an unknown amount, from the one obtained if a t-test is performed without a preliminary check. That’s because p values depend on sampling distributions of imaginary experiments, which in turn depend on sampling and testing intentions (Wagenmakers, 2007; Kruschke & Liddell, 2018). This dependence can make p values difficult to interpret, because unless we simulate the world of possibilities that led to our p value, the sampling distribution for our statistic (say t statistic) is unknown.

[7] When non-normality is detected or suspected, a classic alternative to the two-sample t-test is the Wilcoxon-Mann-Whitney (WMW) test. However, in general different tests or models address different hypotheses: they are not interchangeable. For instance, the WMW’s U statistic is related to the distribution of all pairwise differences between two independent groups; unlike the t-test, it doesn’t involve a comparison of the marginal means. Similarly, if instead of the mean we use a trimmed mean, a robust measure of central tendency, our inferences are about the population trimmed mean, not the population mean.
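To make point [7] concrete, here is a small Python sketch with made-up data. It computes the U statistic directly from all pairwise comparisons, alongside the means and 20% trimmed means. The helper names `u_statistic` and `trimmed_mean` and the data are illustrative, and no p values are computed.

```python
from itertools import product
from statistics import fmean

def u_statistic(x, y):
    """Number of (x, y) pairs with x > y; ties count one half."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in product(x, y))

def trimmed_mean(x, prop=0.2):
    """Mean after dropping the lowest and highest `prop` of the sorted values."""
    xs = sorted(x)
    k = int(prop * len(xs))  # number of values trimmed from each tail
    return fmean(xs[k:len(xs) - k])

group1 = [1, 2, 3, 4, 100]  # made-up data with one large outlier
group2 = [5, 6, 7, 8, 9]

u = u_statistic(group1, group2)
p_superiority = u / (len(group1) * len(group2))  # estimate of P(X > Y)
print("U =", u, " P(X > Y) =", p_superiority)
print("means:", fmean(group1), fmean(group2))
print("20% trimmed means:", trimmed_mean(group1), trimmed_mean(group2))
```

In this example the mean of group 1 is larger than the mean of group 2, yet only 5 of the 25 pairs have the group 1 value above the group 2 value. The two approaches answer different questions about the same data.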

[8] In most cases, researchers know the answer to the normality question before conducting the experiment. For instance, we know that reaction times, accuracy and questionnaire data are not normally distributed. Testing for normality when we already know the answer is unnecessary and falls into the category of tautological tests. Since we know the answer in most situations, it is better practice to use appropriate models and drop the checks altogether. For instance, accuracy data follow beta-binomial distributions (Jaeger, 2008; Kruschke, 2014); questionnaire data can be modelled using ordinal regression (Liddell & Kruschke, 2018; Bürkner & Vuorre, 2019; Taylor et al., 2022); reaction time data can be modelled using several families of skewed distributions (Lindeløv, 2019).

References

Bürkner, P.-C. (2017). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01

Bürkner, P.-C. & Vuorre, M. (2019) Ordinal Regression Models in Psychology: A Tutorial. Advances in Methods and Practices in Psychological Science, 2, 77–101. https://journals.sagepub.com/doi/full/10.1177/2515245918823199

Campbell, H. & Gustafson, P. (2021) re:Linde et al. (2021): The Bayes factor, HDI-ROPE and frequentist equivalence tests can all be reverse engineered – almost exactly – from one another. https://arxiv.org/abs/2104.07834

Ernst AF, Albers CJ. 2017. Regression assumptions in clinical psychology research practice—a systematic review of common misconceptions. PeerJ 5:e3323 https://doi.org/10.7717/peerj.3323

Freedman, L.S., Lowe, D., & Macaskill, P. (1984) Stopping rules for clinical trials incorporating clinical opinion. Biometrics, 40, 575–586.

Jaeger, T.F. (2008) Categorical Data Analysis: Away from ANOVAs (transformation or not) and towards Logit Mixed Models. J Mem Lang, 59, 434–446.

Kruschke, J.K. (2013) Bayesian estimation supersedes the t test. J Exp Psychol Gen, 142, 573–603.

Kruschke, J.K. (2014) Doing Bayesian Data Analysis, 2nd Edition. Academic Press.

Kruschke, J.K. (2018) Rejecting or Accepting Parameter Values in Bayesian Estimation. Advances in Methods and Practices in Psychological Science, 1, 270–280.

Kruschke, J.K. & Liddell, T.M. (2018) The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychon Bull Rev, 25, 178–206.

Lakens, D. (2017). Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177

Liddell, T.M. & Kruschke, J.K. (2018) Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328–348.

Lindeløv, J.K. (2019) Reaction time distributions: an interactive overview

Rouder, J.N., Morey, R.D., Verhagen, J., Province, J.M. and Wagenmakers, E.-J. (2016), Is There a Free Lunch in Inference?. Top Cogn Sci, 8: 520-547. https://doi.org/10.1111/tops.12214

Taylor, J.E., Rousselet, G.A., Scheepers, C. et al. Rating norms should be calculated from cumulative link mixed effects models. Behav Res (2022). https://doi.org/10.3758/s13428-022-01814-7

Vanhove, J. (2018) Checking model assumptions without getting paranoid. https://janhove.github.io/analysis/2018/04/25/graphical-model-checking

Wagenmakers, E.-J. (2007) A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804.

Wagenmakers, E.-J., Lee, M.D., Rouder, J.N., & Morey, R.D. (2020) The Principle of Predictive Irrelevance or Why Intervals Should Not be Used for Model Comparison Featuring a Point Null Hypothesis. In Gruber, C.W. (ed), The Theory of Statistics in Psychology: Applications, Use, and Misunderstandings. Springer International Publishing, Cham, pp. 111–129.

Wilcox RR, Rousselet GA. A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci. 2018 Jan 22;82:8.42.1-8.42.30. doi: 10.1002/cpns.41. PMID: 29357109.

Yan, Y. & Genton, M.G. (2019) The Tukey g-and-h distribution. Significance, 16, 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x

Yap, B.W. & Sim, C.H. (2011) Comparisons of various types of normality tests. Journal of Statistical Computation and Simulation, 81, 2141–2155. DOI: 10.1080/00949655.2010.520163

Mean or median reaction time? An open review of Miller (2020)

Below is a review of Miller (2020), which is a reply to an article in which Rand Wilcox & I reproduced and built upon Miller (1988). Here are the articles in order:

[1] A Warning About Median Reaction Time

[2] Reaction times and other skewed distributions: problems with the mean and the median

[3] Another Warning about Median Reaction Time

I’ve added a link to [3] and this review to article [2], so readers are aware of the discussions.

The review will only make sense if you at least read [3] first, but [2] contains a lot of simulations, descriptions and references not covered in [3].


To start, I’ve got to say I’ve learnt so much about various sources of bias from your work on reaction time analyses, including Miller (1988) and many subsequent papers. I discovered Miller (1988) by chance a few years ago, while researching a review article on robust measures of effect sizes. Actually, I was so startled by the 1988 results that I dropped the article on effect sizes to replicate Miller’s simulations and explore their consequences. This extensive work has taught me a lot about reaction time data, the mean, the median, their sampling distributions and associated inferential statistics. So I’m really thankful for that.

Overall, I enjoyed your reply to our paper, which provides interesting new simulations and a good summary of the issues. The main apparent discrepancies are much smaller than they seem and can easily be addressed by wording key statements and conclusions more carefully, for instance by highlighting boundary conditions. If anything I wrote below is unclear, feel free to contact me directly. In particular, if needed, we could discuss the g&h simulation code, which I realise I probably could have explained better in the R&W2020 article.

Your article is well written. The illustrations are fine but could be improved by adding colours or grey/black contrasts. In Figure 4 (presenting the most original and interesting results), the symbols could be different for the 3 estimators to improve clarity.

The simulation results are convincing and mostly concur with our own assessment. However, what is missing is a consideration of the effects of outliers and the skewness of the distribution of differences (after pooling across trials using the mean, the median or some other estimator). As we explained in R&W2020, outliers and skewness can essentially destroy the power of inferential tests using the mean, whereas tests on the median are hardly affected.

“Another unusual feature of R&W’s simulations is that they used g&h distributions (Hoaglin, 1985) as models of RT. These distributions are quite different from ex-Gaussians and are not normally considered in RT modelling (e.g., Luce, 1986). This distributional choice may also have contributed to the advantage for medians in their simulations.”

I don’t understand how our simulations using g&h distributions could be the source of the discrepancy, because we didn’t use them to simulate distributions of reaction times. In fact, we used them to systematically manipulate the distribution of differences that is fed to a one-sample test, from normal to very skewed. We also varied the probability that outliers are observed. The shape of the marginal RT distributions should be irrelevant to these simulations: when the RT distributions are identical in every condition and participant, irrespective of their skewness, the distribution of differences is normal (that is, for each participant, compute the mean for each condition then subtract the means, leading to a distribution of pairwise differences). When the distributions differ in skewness, the distribution of differences has skewness equal to the difference in skewness between the original distributions. Thus, our simulations are only concerned with the group level analyses, and only with the shapes of the distributions of differences — whether these distributions resulted from individual means or medians was not considered. We also used different one-sample tests for the different estimators of central tendency of the distributions of differences. This is necessary because the mean, the median and trimmed means have different standard errors. So, our assessments of the median are not comparable between simulations.

To be clear: as I understand, in your simulations, different RT distributions from 2 conditions are summarised using the mean and the median, one for each condition and each participant (stage 1); pairwise differences are computed, resulting in distributions of differences (stage 2); one-sample tests on the mean of the differences are performed.

In R&W2020’s g&h simulations, we ignored stage 1. We only considered the shapes of the distributions of differences, and then computed one-sample tests for 3 different estimators of central tendency. The skewness of these distributions depends on both within- and between-participant variability, but we did not model these sources of variability explicitly, only their end-product. Our simulations demonstrate that, given the same distribution of differences, in some situations the median and the 20% trimmed mean have dramatically more power than the mean.

It might well be that in some (many?) situations, RT distributions do not differ enough in skewness to affect the power of the one-sample t-test. But it remains a fundamental statistical result that one-sample t-tests are strongly affected by skewness, and even much more so by outliers. This is also covered in detail in Wilcox & Rousselet (2018).

To address this discrepancy in simulation results, I don’t think new simulations necessarily need to be added, but the problem should be presented more carefully.

Other apparent disagreements can be addressed by more careful phrasing of the situations under which the problems occur, especially at the start of the conclusion section, which I find somewhat misleading.

“R&W concluded that “there seems to be no rationale for preferring the mean over the median as a measure of central tendency for skewed distributions” (p. 37). On the contrary, when performing hypothesis tests to compare the central tendencies of RTs between experimental conditions, the present simulations show that there may be an extremely clear rationale in terms of both Type I error rate and power.”

As we explain in our paper, it is precisely because the mean is a poor measure of central tendency that in some situations it is better at detecting distribution differences (particularly when conditions differ in skewness, and more specifically in their right tails when dealing with RT distributions). But higher power or nominal type I error rate doesn’t make the mean a better measure of central tendency.

What is needed in this section is a clear distinction between 2 different but complementary goals:
[1] to detect differences between distributions;
[2] to understand how distributions differ.

As we argue in our paper, if the goal is [2], then it makes no sense to rely on the mean or the median alone; much better tools are available, starting with using both the mean and the median to get a richer perspective. The distinction is clearly made in footnote 1 of your article, but should be reiterated in the conclusion and the abstract, so there is no confusion.

“When comparing conditions with unequal numbers of trials, the sample-size-dependent bias of regular medians can lead to clear inflation of the Type I error rate (Fig. 2), so these medians definitely should not be used.”

This statement is only valid when considering low numbers of trials in the condition with the fewest trials. So to be clear, the problem emerges only for a combination of absolute and relative numbers of trials. The problem also occurs when group statistics involve means. In contrast, performing for instance a median one-sample test on group differences between medians does not lead to inflated false positives. I realise most users who collapse RT distributions are most likely to perform group statistics using the mean, but this assumption needs to be explicitly stated. The choice of estimator applies to the two levels of analysis: within-participants and between-participants.
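The dependence of the median's bias on the absolute number of trials can be illustrated with a short Python sketch. It is not the code from any of the three articles. The ex-Gaussian parameters (mu = 300, sigma = 20, tau = 300), the trial counts and the function names are assumptions chosen for illustration, and the population median is approximated from one very large sample.

```python
import random
from statistics import fmean, median

def ex_gaussian(n, rng, mu=300, sigma=20, tau=300):
    """n draws from an ex-Gaussian: normal(mu, sigma) plus exponential(mean tau)."""
    return [rng.gauss(mu, sigma) + rng.expovariate(1 / tau) for _ in range(n)]

def median_bias(n_trials, pop_median, rng, n_sim=5000):
    """Average sample median across simulated participants minus the population median."""
    meds = [median(ex_gaussian(n_trials, rng)) for _ in range(n_sim)]
    return fmean(meds) - pop_median

rng = random.Random(21)
pop_median = median(ex_gaussian(200_000, rng))  # large-sample approximation
bias_small = median_bias(5, pop_median, rng)
bias_large = median_bias(50, pop_median, rng)
print(f"bias with  5 trials: {bias_small:.1f} ms")
print(f"bias with 50 trials: {bias_large:.1f} ms")
```

With these settings the sample median overestimates the population median much more with 5 trials than with 50. Comparing two conditions with such trial counts would therefore produce a spurious difference between medians.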

For this reason, in the text, it would be worth describing what inferential tests were used for the analyses (I presume t-tests). This should bring a bit of nuance to this statement for instance, given that power depends on choices at 2 levels of analysis:

“The choice of central tendency measure would then be determined primarily by comparing the power of these three measures (i.e., means, medians, bias-corrected medians).”

In R&W2020 we also considered tests on medians, which solve the bias issue. We also looked at trimmed means, and other options are available such as the family of M-estimators.

“In view of the fact that means have demonstrably greater power than bias-corrected medians for experiments with unequal trial frequencies (Fig. 3), it is also sensible to compare power levels in experiments with equal trial frequencies.”

This statement is inaccurate because it is too general: as we demonstrate in Figure 15 of R&W2020, at least in some situations, the median can have higher power than the mean. Maybe more importantly, in the presence of skewness and outliers, methods using quantiles (with the median as a special case) or trimmed means can have dramatically more power than mean-based methods.

To go back to the distinction between goals [1] and [2], as we argue in R&W2020, ideally we should focus on richer descriptions of distributions using multiple quantiles to understand how they differ (a point you also make in other articles). Limiting analyses to the mean or the median is really unsatisfactory. Also, standard t-tests and ANOVAs do not account for measurement error, including item variability, which should really be handled using hierarchical models. Whether going the quantile way or the hierarchical modelling way (or both), the number of trials required to make sense of the data would make bias issues naturally go away. So personally I see using the mean RT as a valid recommendation, but only in very specific situations.
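As a rough illustration of what a description based on multiple quantiles looks like, the Python sketch below compares the deciles of two made-up reaction time distributions. The data, the parameters and the variable names are assumptions for illustration only, and the sketch is purely descriptive: it computes no confidence intervals.

```python
import random
from statistics import quantiles

rng = random.Random(7)
# Made-up RTs (ms): same normal component, but condition B has a longer right tail.
cond_a = [rng.gauss(400, 40) + rng.expovariate(1 / 100) for _ in range(2000)]
cond_b = [rng.gauss(400, 40) + rng.expovariate(1 / 200) for _ in range(2000)]

deciles_a = quantiles(cond_a, n=10)  # 9 cut points: 10th to 90th percentiles
deciles_b = quantiles(cond_b, n=10)
decile_diffs = [b - a for a, b in zip(deciles_a, deciles_b)]

for i, d in enumerate(decile_diffs, start=1):
    print(f"decile {i}: B - A = {d:6.1f} ms")
```

Here the differences grow from the first to the ninth decile, which shows that the two conditions differ mostly in their slow responses. A single difference between means or medians would hide that pattern.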

Your last point in the discussion about choosing a measure of central tendency before seeing the data is an important one. Perhaps more importantly, given the large number of options available to make sense of rich reaction time data, probably the prime recommendation should be that authors make their data publicly available, so that others can try alternative, and in most cases, better techniques.

Other points:

“There is unfortunately no consensus about the psychological meanings of changes in these different parameters (e.g., Rieger & Miller, 2019), but the ex-Gaussian distribution nevertheless remains useful as a way of describing changes in the shapes as well as the means of RT distributions.”

This is an excellent point. It would also be worth citing Matzke & Wagenmakers (2009) in addition to Rieger & Miller (2019).

I would also mention that other distributions provide better interpretability in terms of shift and scale effects, such as the shifted lognormal (see Lindeløv, 2019).


“the very between-condition difference” — remove very?

Analog teaching activities about sampling and resampling

This year I’m teaching a new undergraduate course on the bootstrap for 4th year psychology students. Class examples, take-home exercises and the exam use R. I will also use a few analog activities in class. Here I’d like to share some of these activities. (This is also the opportunity to start a new category of posts on teaching.) The course is short, with only 5 sessions of 2 hours, but I think it is important to spend some of that time to try to get key concepts across by providing engaging activities. I’ll report back on how it went.

The 3 main activities involve dice, poker chips and wooden fish, to explore different types of sampling, sampling distributions, the distinction between sample and population, resampling…

Activity 1: dice (hierarchical sampling)

We use dice to simulate sampling with replacement from an infinite population of participants and trials.

This exercise provides an opportunity to learn about:

  • the distinction between population and sample;
  • sampling with replacement;
  • hierarchical sampling;
  • running simulations;
  • estimation;
  • the distinction between finite and infinite populations.

Materials


  • 3 bags of dice
  • 3 trays (optional)

Each bag contains a selection of dice with 4 to 20 facets, forming 3 independent populations. I used a lot of dice in each bag but that’s not necessary. It just makes it harder to guess the content of the bags. I got the dice from TheDiceShopOnline.

Many exercises can be proposed, involving different sampling strategies, with the aim of making some sort of inference about the populations. Here is the setup we will use in class:

  • 3 participants or groups of participants are involved, each working independently with their own bag/population;
  • a dice is randomly picked from a bag (without looking inside the bag!) — this is similar to randomly sampling a participant from the population;
  • the dice is rolled 5 times, and the results written down — this is similar to randomly sampling trials from the participant;
  • perform the two previous steps 10 times, for a total of 10 participants x 5 trials = 50 trials.

These values are then entered into a text file and shared with the rest of the class. The text files are opened in R, and the main question is: is there evidence that our 3 samples of 10 participants x 5 trials were drawn from different populations? To simplify the problem, a first step could involve averaging over trials, so we are left with 10 values from each group. The second step is to produce some graphical representation of the data. Then we can try various inferential statistics.
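The hierarchical sampling scheme of the dice activity can also be sketched digitally. The class uses R; the sketch below is a Python equivalent that needs only the standard library. The contents of the three bags are hypothetical, and `run_experiment` is an illustrative name.

```python
import random
from statistics import fmean

def run_experiment(bag, rng, n_participants=10, n_trials=5):
    """Pick a die from the bag (with replacement), roll it n_trials times, and repeat."""
    data = []
    for _ in range(n_participants):
        n_faces = rng.choice(bag)  # sample a 'participant' from the population
        data.append([rng.randint(1, n_faces) for _ in range(n_trials)])  # sample 'trials'
    return data

# Hypothetical bag contents: each entry is the number of faces of one die.
bags = {
    "A": [4] * 10 + [6] * 10 + [8] * 5,
    "B": [6] * 5 + [10] * 10 + [12] * 10,
    "C": [8] * 5 + [12] * 10 + [20] * 10,
}

rng = random.Random(2024)
participant_means = {
    name: [fmean(rolls) for rolls in run_experiment(bag, rng)]
    for name, bag in bags.items()
}
for name, means in participant_means.items():
    print(name, [round(m, 1) for m in means])
```

Each group ends up with 10 values, one mean per simulated participant, which is the starting point for the graphical and inferential steps described above.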

The exercise can be repeated on different days, to see how much variability we get between simulated experiments. During the last class, the populations and the sampling distributions are revealed.

Also, in this exercise, because the dice are sampled with replacement, the population has an infinite size. The content of each bag defines the probability of sampling each type of dice, but it is not the entire population, unlike in the faux fish activity (see below).

Here is an example of samples after averaging the 5 trials per dice/participant:

Activity 2: poker chips (bootstrap sampling with replacement)

We use poker chips to demonstrate sampling with replacement, as done in the bootstrap.

This exercise provides an opportunity to learn about:

  • sampling with replacement;
  • bootstrap sampling;
  • running simulations.

A bag contains 8 poker chips, representing the outcome of an experiment. Each chip is like an observation.

First, we demonstrate sampling with replacement by getting a random chip from the bag, writing down its value, and replacing the chip in the bag. Second, we demonstrate bootstrap sampling by performing sampling with replacement 8 times, each time writing down the value of the random chip selected from the bag. This should help make bootstrap sampling intuitive.

After this analog exercise, we switch to R to demonstrate the implementation of sampling with replacement using the sample function.
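In R, a bootstrap sample of a vector x is obtained with sample(x, replace = TRUE). For readers who prefer Python, a rough equivalent with the standard library is sketched below. The chip values are hypothetical.

```python
import random

chips = [1, 5, 5, 10, 25, 25, 50, 100]  # hypothetical values of the 8 chips
rng = random.Random(3)

# One draw: pick a chip, note its value, put it back in the bag.
one_chip = rng.choice(chips)

# One bootstrap sample: 8 draws with replacement, so chips can repeat or be missing.
boot_sample = rng.choices(chips, k=len(chips))
print(one_chip)
print(boot_sample)
```

Repeating the last step many times, and computing a statistic for each bootstrap sample, gives a bootstrap distribution of that statistic.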

Activity 3: faux fish (sampling distributions)

We sample with replacement from a finite population of faux fish to demonstrate the effect of sample size on the shape of sampling distributions.

The faux fish activity is mentioned in Steel, Liermann & Guttorp (2019), with pictures of class results. The activity is described in detail in Kelsey & Steel (2001).

Steel, Liermann & Guttorp (2019)

This exercise provides an opportunity to learn about:

  • the distinction between population and sample;
  • sampling with replacement;
  • running simulations;
  • estimation;
  • sampling distributions.

Materials


  • two sets of 97 faux fish = fish-shaped bits of paper or other material
  • two containers = ponds
  • two large blank sheets of paper
  • x axis = ‘Mean weight (g)’
  • y axis = ‘Number of experiments’
  • titles = ‘n=3 replicates’ / ‘n=10 replicates’

I got faux fish made of wood from Wood Craft Shapes.

Each faux fish has a weight in grams written on it.

The frequencies of the weights are given in Kelsey & Steel (2001).

The fish population is stored in a box. I made 2 identical populations, so that two groups can work in parallel.

The first goal of the exercise is to produce sampling distributions by sampling with replacement from a population. The second goal is to evaluate the effect of the sample size on the shape of the sampling distribution. The third goal is to experiment with a digital version of the analog task, to gain familiarity with simulations.

Unlike the dice activity, this activity involves a finite size population: each box contains the full population under study.


  • two groups of participants;
  • each group is assigned a box;
  • participants from each group take turns sampling n=3 or n=10 faux fish from the box (depending on the group), without looking inside the box;
  • each participant averages the numbers, writes down the answer and marks it on the large sheet of paper assigned to each group;
  • this is repeated until a sufficient number of simulated experiments have been carried out to assess the shape of the resulting sampling distribution.

To speed up the exercise, a participant picks n fish, writes down the weights, puts all the fish back in the box, and passes the box to the next participant. While the next participant is sampling fish from the box, the previous participant computes the mean and marks the result on the group graph.

Once done, the class discusses the results:

  • the sampling distributions are compared;
  • the population mean is revealed;
  • the population is revealed by showing the handout from the book and opening the boxes.

Then we do the same in R, but much quicker!
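Here is what a digital version could look like, written as a Python sketch rather than the R code used in class. The population of 97 weights is hypothetical (the real frequencies are in Kelsey & Steel, 2001) and `sampling_distribution` is an illustrative name. As in the analog procedure, each sample of n fish is drawn in one go, and all the fish go back in the box before the next sample.

```python
import random
from statistics import fmean, stdev

rng = random.Random(97)
# Hypothetical right-skewed population of 97 fish weights (g).
pond = [round(rng.lognormvariate(6, 0.5)) for _ in range(97)]
pop_mean = fmean(pond)

def sampling_distribution(pond, n, rng, n_experiments=5000):
    """Means of repeated samples of n fish; the fish are returned after each sample."""
    return [fmean(rng.sample(pond, n)) for _ in range(n_experiments)]

means_n3 = sampling_distribution(pond, 3, rng)
means_n10 = sampling_distribution(pond, 10, rng)

print(f"population mean = {pop_mean:.1f} g")
print(f"n = 3:  mean of means = {fmean(means_n3):.1f}, sd = {stdev(means_n3):.1f}")
print(f"n = 10: mean of means = {fmean(means_n10):.1f}, sd = {stdev(means_n10):.1f}")
```

Both sampling distributions are centred near the population mean, but the one for n=10 is narrower. Histograms of `means_n3` and `means_n10` would be the digital counterparts of the two large sheets of paper.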

Here is an example of simulated results for n=3 (the vertical line marks the population mean):

References


Kelsey, Kathryn, and Ashley Steel. The Truth about Science: A Curriculum for Developing Young Scientists. NSTA Press, 2001.

Steel, E. Ashley, Martin Liermann, and Peter Guttorp. Beyond Calculations: A Course in Statistical Thinking. The American Statistician 73, no. sup1 (29 March 2019): 392–401. https://doi.org/10.1080/00031305.2018.1505657.

The bootstrap-t technique

There are many types of bootstrap methods, but for most applications, two methods are most common: the percentile bootstrap, presented in an earlier post, and the bootstrap-t technique, also known as the percentile-t bootstrap or the studentized bootstrap (Efron & Tibshirani, 1994; Wilcox, 2017). For inferences on the population mean, the standard t-test and the percentile bootstrap can give unsatisfactory results when sampling from skewed distributions, especially when sample size is small. To illustrate the problem with the t-test, imagine that we sample from populations of increasing skewness.

Probability density functions for g&h distributions. Parameter g varies from 0 to 1. Parameter h=0.

Here we use g&h distributions, in which parameter g controls the skewness, and parameter h controls the thickness of the tails; a normal distribution is obtained by setting g=h=0 (Hoaglin, 1985; Yan & Genton, 2019). If we take many samples of size n=30 from these distributions, and for each sample we compute a T value, using the population mean as the null value, we obtain progressively more negatively skewed T value sampling distributions.

Sampling distributions of T values for different g values. Results are based on a simulation with 50,000 iterations and samples of size n=30.
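A reduced version of this simulation can be sketched in Python with the standard library. It is not the code used for the figure: it uses 5,000 iterations instead of 50,000, and the function name `t_values` is illustrative. It assumes the usual g-and-h transformation with h=0, for which the population mean is (exp(g²/2) - 1)/g.

```python
import math
import random
from statistics import fmean, quantiles, stdev

def t_values(g, n=30, n_sim=5000, seed=5):
    """T values for samples from a g-and-h population (h = 0); null = population mean."""
    rng = random.Random(seed)
    # Population mean of (exp(g*Z) - 1) / g for standard normal Z; 0 when g = 0.
    pop_mean = 0.0 if g == 0 else (math.exp(g * g / 2) - 1) / g
    out = []
    for _ in range(n_sim):
        z = [rng.gauss(0, 1) for _ in range(n)]
        x = z if g == 0 else [(math.exp(g * v) - 1) / g for v in z]
        out.append((fmean(x) - pop_mean) / (stdev(x) / math.sqrt(n)))
    return out

# quantiles(..., n=40) returns 39 cut points in steps of 2.5%.
results = {g: quantiles(t_values(g), n=40) for g in (0, 0.5, 1)}
for g, q in results.items():
    print(f"g = {g}: 2.5% quantile = {q[0]:.2f}, 97.5% quantile = {q[-1]:.2f}")
```

For g=0 the two quantiles should be roughly symmetric around zero. As g increases, both quantiles shift towards negative values, reflecting the negative skewness of the T distributions.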

However, when we perform a t-test, the distribution of T values is assumed to be symmetric, irrespective of sample size. This assumption leads to incorrect confidence intervals (CIs). The idea behind the bootstrap-t technique is to use the bootstrap (sampling with replacement) to compute a data-driven T distribution. In the presence of skewness, this T distribution could be skewed, as suggested by the data. Then, the appropriate quantile of the bootstrap T distribution is plugged into the standard CI equation to obtain a parametric bootstrap CI.

Bootstrap-t procedure

Let’s illustrate the procedure for a CI for the population mean. We start with a sample of 30 observations from a ​g&h​ distribution with ​g​=1 and ​h=​ 0.

Sample of size n=30 from a ​g&h​ distribution with ​g=1 and ​h​=0. The vertical line indicates the sample mean.

In a first step, we centre the distribution: for inferences on the mean, we subtract the mean from each observation in the sample, so that the mean of the centred distribution is now zero. This is a way to create a data-driven null distribution, in which there is no effect (the mean is zero), but the shape of the distribution and the absolute distance among observations are unaffected, as shown in the next figure. For inferences on a trimmed mean, we subtract the trimmed mean from each observation, so that the centred distribution now has a trimmed mean of zero.

Same distribution as in the previous figure, but the distribution has been mean centred, so that the sample mean is now zero.

In the next step, we sample with replacement from the centred distribution many times, and for each random sample we compute a ​T​ value. That way, we obtain a bootstrap distribution of ​T​ values expected by random sampling, under the hypothesis that the population has a mean (or trimmed mean) of zero, given the distribution of the data. Then, we use some quantile of the bootstrap ​T distribution in the standard CI equation. (Note that for trimmed means, the T-test equation is adjusted—see Tukey & McLaughlin, 1963).

5,000 bootstrap T values obtained by sampling with replacement from the mean centred data. In the asymmetric bootstrap-t technique, the quantiles (red vertical lines) of that distribution of T values are used to define the CI bounds. The insets contain the formulas for the lower (CI_lo) and upper (CI_up) bounds of the CI. Note that the lower T quantile is used to compute the upper bound (this is not an error). In the symmetric bootstrap-t technique, one quantile of the distribution of absolute T values is used to define the CI bounds.

Because the bootstrap distribution is potentially asymmetric, we have two choices of quantiles: for a 95% CI, either we use the 0.025 and the 0.975 quantiles of the signed ​T​ values to obtain a potentially asymmetric CI, also called an equal-tailed CI, or we use the 0.95 quantile of the absolute ​T​ values, thus leading to a symmetric CI.
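The steps described so far (centre the sample, resample with replacement, compute bootstrap T values, plug quantiles into the CI equation) can be sketched in a few lines of Python. This is an illustration under simple assumptions, not the implementation behind the figures: the names `mean_se` and `bootstrap_t_ci` are hypothetical, the quantiles are plain order statistics without interpolation, and the example sample is drawn as exp(Z) - 1, which corresponds to g = 1 and h = 0.

```python
import math
import random


def mean_se(x):
    """Return the sample mean and its estimated standard error."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x) / (n - 1)
    return m, math.sqrt(var / n)


def bootstrap_t_ci(x, nboot=1000, alpha=0.05, seed=None):
    """Asymmetric (equal-tailed) and symmetric bootstrap-t CIs for the mean."""
    rng = random.Random(seed)
    n = len(x)
    m, se = mean_se(x)
    centred = [v - m for v in x]  # data-driven null distribution: mean of zero
    tvals = []
    for _ in range(nboot):
        boot = rng.choices(centred, k=n)  # sample with replacement
        bm, bse = mean_se(boot)
        if bse > 0:  # skip degenerate resamples made of identical values
            tvals.append(bm / bse)
    tvals.sort()
    nb = len(tvals)
    k = int(alpha / 2 * nb)  # number of T values cut from each tail
    t_lo, t_hi = tvals[k], tvals[nb - 1 - k]
    # The lower T quantile gives the upper bound, and vice versa.
    asym = (m - t_hi * se, m - t_lo * se)
    # Symmetric version: one quantile of the absolute T values.
    abs_t = sorted(abs(t) for t in tvals)
    t_abs = abs_t[nb - 1 - int(alpha * nb)]
    sym = (m - t_abs * se, m + t_abs * se)
    return asym, sym


rng = random.Random(44)
sample = [math.exp(rng.gauss(0.0, 1.0)) - 1.0 for _ in range(30)]  # g=1, h=0
asym_ci, sym_ci = bootstrap_t_ci(sample, seed=1)
print("asymmetric:", asym_ci)
print("symmetric: ", sym_ci)
```

The symmetric interval is centred on the sample mean by construction, whereas the asymmetric interval can extend further on one side when the bootstrap T distribution is skewed.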

In our example, for the mean the symmetric CI is [-0.4, 1.62] and the asymmetric CI is [0.08, 1.87]. If instead we use the 20% trimmed mean, the symmetric CI is [-0.36, 0.59] and the asymmetric CI is [-0.3, 0.67] (see Rousselet, Pernet & Wilcox, 2019). So clearly, confidence intervals can differ a lot depending on the estimator and method we use. In other words, a 20% trimmed mean is not a substitute for the mean: it asks a different question about the data.

Bootstrap samples

Why does the bootstrap-t approach work better than the standard ​T-test CI? Imagine we take multiple samples of size n=30 from a ​g&h​ distribution with ​g=​1 and ​h​=0.

Comparison of ​T​ distributions for ​g​=1 & h=0: the theoretical ​T distribution in red​ is the one used in the T-​test, the empirical ​T​ distribution in black was obtained by sampling with replacement multiple times from the g&h distribution. The red and black vertical lines indicate the ​T​ quantiles for a 95% CI. The grey lines show examples of 20 bootstrap sampling distributions, based on samples of size n=30 and 5000 bootstrap samples.

In the figure above, the standard ​T​-test assumes the sampling distribution in red, symmetric around zero. As we considered above, the sampling distribution is actually asymmetric, with negative skewness, as shown in black. However, the black empirical distribution is unobservable, unless we can perform thousands of experiments. So, with the bootstrap, we try to estimate this correct, yet unobservable, sampling distribution. The grey curves show examples of 20 simulated experiments: in each experiment, a sample of 30 observations is drawn, and then 5,000 bootstrap ​T​ values are computed. The resulting bootstrap sampling distributions are negatively skewed and are much closer to the empirical distribution in black than the theoretical symmetric distribution in red. Thus, it seems that using data-driven ​T​ distributions could help achieve better CIs than if we assumed symmetry.

How do these different methods perform? To find out we carry out simulations in which we draw samples from​ g&h distributions with the​ g​ parameter varying from 0 to 1, keeping ​h=0. For each sample, we compute a one-sample CI using the standard ​T-​ test, the two bootstrap-t methods just described (asymmetric and symmetric), and the percentile bootstrap. When estimating the population mean, for all four methods, coverage goes down with skewness.

Confidence interval coverage for the 4 methods applied to the mean. Results of a simulation with 20,000 iterations, sample sizes of n=30, and 599 bootstrap samples.​ You can see what happens for the 10% trimmed mean and the 20% trimmed mean in Rousselet, Pernet & Wilcox, 2019.
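For readers who have not seen the earlier post, the percentile bootstrap CI used in this comparison can be sketched as follows, next to the standard t-test CI. This is a minimal sketch under stated assumptions: the function names are hypothetical, the quantiles are plain order statistics, and the critical value 2.0452 is the 0.975 quantile of Student's t distribution with 29 degrees of freedom, so `t_ci` as written only applies to samples of size n = 30.

```python
import math
import random

T_CRIT_29 = 2.0452  # 0.975 quantile of Student's t with 29 df (n = 30 only)


def t_ci(x, crit=T_CRIT_29):
    """Standard t-test CI for the mean: mean +/- crit * standard error."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x) / (n - 1)
    half = crit * math.sqrt(var / n)
    return m - half, m + half


def percentile_boot_ci(x, nboot=599, alpha=0.05, seed=None):
    """Percentile bootstrap CI: quantiles of the bootstrap sample means."""
    rng = random.Random(seed)
    n = len(x)
    boot_means = sorted(sum(rng.choices(x, k=n)) / n for _ in range(nboot))
    k = int(alpha / 2 * nboot)  # number of bootstrap means cut from each tail
    return boot_means[k], boot_means[nboot - 1 - k]


rng = random.Random(7)
sample = [math.exp(rng.gauss(0.0, 1.0)) - 1.0 for _ in range(30)]  # g=1, h=0
print("t-test CI:              ", t_ci(sample))
print("percentile bootstrap CI:", percentile_boot_ci(sample, seed=1))
```

Unlike the bootstrap-t, the percentile bootstrap never computes a T value: the CI bounds are read directly from the sorted bootstrap estimates, so they always fall within the range of the observed data.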

Among the parametric methods, the standard ​T-​test is the most affected by skewness, with coverage less than 90% for the most skewed condition. The asymmetric bootstrap-t CI seems to perform the best. The percentile bootstrap performs the worst in all situations, and has coverage systematically below 95%, including for normal distributions.

In addition to coverage, it is useful to consider the width of the CIs from the different techniques.

Confidence interval median width, based on the same simulation reported in the previous figure.

The width of a CI is its upper bound minus its lower bound. For each combination of parameters, the results are summarised by the median width across simulations. At low levels of asymmetry, for which the three parametric methods have roughly 95% coverage, the CIs also tend to be of similar widths. As asymmetry increases, all methods tend to produce larger CIs, but the t-test produces CIs that are too short, a problem that stems from the symmetric theoretical T distribution, which assumes T values that are too small. Compared to the parametric approaches, the percentile bootstrap produces the shortest CIs for all g values.

Confidence intervals: a closer look

We now have a closer look at the confidence intervals in the different situations considered above. We use a simulation with 20,000 iterations, sample size n=30, and 599 bootstrap samples.

Under normality

As we saw above, under normality the coverage is close to nominal (95%) for every method, although coverage for the percentile bootstrap is slightly too low, at 93.5%. Out of 20,000 simulated experiments, about 1,000 CIs (roughly 5%) did not include the population value. About the same number of CIs were shifted to the left and to the right of the population value for all methods, and the CIs were of similar sizes.

We observed the same behaviour for several parametric methods in a previous post. Now, what happens when we sample from a skewed population?

In the presence of skewness (g=1, h=0)

Coverage is lower than the expected 95% for all methods. Coverage is about 88% for the standard and percentile bootstrap CIs, 92.3% for the asymmetric bootstrap-t CIs, and 91% for the symmetric bootstrap-t CIs. As we saw above, CIs are larger for the bootstrap-t CIs relative to the standard and percentile bootstrap CIs. CIs that did not include the population value tended to be shifted to the left of the population value, and more so for the standard CIs and the bootstrap-t symmetric CIs.

So when making inferences about the mean using the standard T-test, our CI coverage is lower than expected, and we are likely to underestimate the population value (the sample mean is median biased—Rousselet & Wilcox, 2019).

Relative to the other methods, the asymmetric bootstrap-t CIs are more evenly distributed on either side of the population value, and the right shifted CIs tend to be much larger and more variable. The difference with the symmetric CIs is particularly striking and suggests that the asymmetric CIs could be misleading in certain situations. This intuition is confirmed by a simulation in which outliers are likely (h=0.2).

In the presence of skewness and outliers (g=1, h=0.2)

In the presence of outliers, the patterns observed in the previous figure are exacerbated. Some of the percentile bootstrap and asymmetric bootstrap-t intervals are ridiculously wide (x axis is truncated).

In such situations, inferences on trimmed means would greatly improve performance over the mean.


As we saw in a previous post, a good way to handle skewness and outliers is to make inferences about the population trimmed means. For instance, trimming 20% is efficient in many situations, even when using parametric methods that do not rely on the bootstrap. So what’s the point of the bootstrap-t? From the examples above, the bootstrap-t can perform much better than the standard Student’s approach and the percentile bootstrap when making inferences about the mean. So, when skewness is present and the population mean is of interest, the bootstrap-t is highly recommended. Whether to use the symmetric or asymmetric approach is not completely clear based on the literature (Wilcox, 2017). Intuition suggests that the asymmetric approach is preferable, but our last example suggests that it could be a bad idea when making inferences about the mean.

Symmetric or not, the bootstrap-t confidence intervals combined with the mean do not necessarily deal with skewness as well as other methods combined with trimmed means. But the bootstrap-t can be used to make inferences about trimmed means too! So which combination of approaches should we use? For instance, we could make inferences about the mean, the 10% trimmed mean or the 20% trimmed mean, in conjunction with a non-bootstrap parametric method, the percentile bootstrap or the bootstrap-t. We saw that for the mean, the bootstrap-t method is preferable in the presence of skewness. For inferences about trimmed means, the percentile bootstrap works well when trimming 20%. If we trim less, then the other methods should be considered, but a blanket recommendation cannot be provided. The choice of combination can also depend on the application. For instance, to correct for multiple comparisons in brain imaging analyses, cluster-based statistics are strongly recommended, in which case a bootstrap-t approach is very convenient. And the bootstrap-t is easily extended to factorial ANOVAs (Wilcox, 2017; Field & Wilcox, 2017).

What about the median? The bootstrap-t should not be used to make inferences about the median (50% trimming), because the standard error is not estimated correctly. Special parametric techniques have been developed for the median (Wilcox, 2017). The percentile bootstrap also works well for the median and other quantiles in some situations, provided sample sizes are sufficiently large (Rousselet, Pernet & Wilcox, 2019).


Efron, Bradley, and Robert Tibshirani. An Introduction to the Bootstrap. Chapman and Hall/CRC, 1994.

Field, Andy P., and Rand R. Wilcox. ‘Robust Statistical Methods: A Primer for Clinical Psychology and Experimental Psychopathology Researchers’. Behaviour Research and Therapy 98 (November 2017): 19–38. https://doi.org/10.1016/j.brat.2017.05.013.

Hesterberg, Tim C. ‘What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum’. The American Statistician 69, no. 4 (2 October 2015): 371–86. https://doi.org/10.1080/00031305.2015.1089789.

Hoaglin, David C. ‘Summarizing Shape Numerically: The g-and-h Distributions’. In Exploring Data Tables, Trends, and Shapes, 461–513. John Wiley & Sons, Ltd, 1985. https://doi.org/10.1002/9781118150702.ch11.

Rousselet, Guillaume A., Cyril R. Pernet, and Rand R. Wilcox. ‘A Practical Introduction to the Bootstrap: A Versatile Method to Make Inferences by Using Data-Driven Simulations’. Preprint. PsyArXiv, 27 May 2019. https://doi.org/10.31234/osf.io/h8ft7.

Rousselet, Guillaume A., and Rand R. Wilcox. ‘Reaction Times and Other Skewed Distributions: Problems with the Mean and the Median’. Preprint. PsyArXiv, 17 January 2019. https://doi.org/10.31234/osf.io/3y54r.

Tukey, John W., and Donald H. McLaughlin. ‘Less Vulnerable Confidence and Significance Procedures for Location Based on a Single Sample: Trimming/Winsorization 1’. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002) 25, no. 3 (1963): 331–52.

Wilcox, Rand R. Introduction to Robust Estimation and Hypothesis Testing. 4th edition. Academic Press, 2017.

Wilcox, Rand R., and Guillaume A. Rousselet. ‘A Guide to Robust Statistical Methods in Neuroscience’. Current Protocols in Neuroscience 82, no. 1 (2018): 8.42.1-8.42.30. https://doi.org/10.1002/cpns.41.

Yan, Yuan, and Marc G. Genton. ‘The Tukey G-and-h Distribution’. Significance 16, no. 3 (2019): 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x.

Comparing two independent Pearson’s correlations: confidence interval coverage

This post looks at the coverage of confidence intervals for the difference between two independent correlation coefficients. Previously, we saw how the standard Fisher’s r-to-z transform can lead to inflated false positive rates when sampling from non-normal bivariate distributions and the population correlation differs from zero. In this post, we look at a complementary perspective: using simulations, we’re going to determine how often confidence intervals include the population difference. As we saw in our previous post, computing, say, 95% confidence intervals does not mean that, over the long run, 95% of such confidence intervals will include the population value we’re trying to estimate. In some situations, the coverage is much lower than expected, which means we might fool ourselves more often than we thought (although in practice, in most discussions I’ve ever read, authors behave as if their 95% confidence intervals were very narrow 100% confidence intervals, but that’s another story).

We look at confidence interval coverage for the difference between Pearson’s correlations using Zou’s method (2007) and a modified percentile bootstrap method (Wilcox, 2009). We do the same for the comparison of Spearman’s correlations using the standard percentile bootstrap. We used simulations with 4,000 iterations. Sampling is from bivariate g & h distributions (see illustrations here).
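As a reminder of what Zou’s method computes, here is a minimal Python sketch, written from the description in Zou (2007) rather than taken from the simulation code. It builds a Fisher r-to-z confidence interval for each correlation and then combines the distances between each estimate and its own bounds to get an interval for the difference. The function names `fisher_ci` and `zou_ci`, and the example values, are illustrative.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, alpha=0.05):
    """CI for one Pearson correlation using Fisher's r-to-z transform."""
    z = math.atanh(r)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    half = crit / math.sqrt(n - 3)
    return math.tanh(z - half), math.tanh(z + half)


def zou_ci(r1, n1, r2, n2, alpha=0.05):
    """CI for r1 - r2, two independent groups, following Zou (2007)."""
    l1, u1 = fisher_ci(r1, n1, alpha)
    l2, u2 = fisher_ci(r2, n2, alpha)
    diff = r1 - r2
    lower = diff - math.sqrt((r1 - l1) ** 2 + (u2 - r2) ** 2)
    upper = diff + math.sqrt((u1 - r1) ** 2 + (r2 - l2) ** 2)
    return lower, upper


# Hypothetical sample correlations, n = 100 per group.
print(zou_ci(0.3, 100, 0.5, 100))
```

Because each individual interval comes from Fisher’s transform, the method inherits the normality assumption of that transform, which is why its coverage is examined under skewness below.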

We consider 4 cases:

  • g = h = 0, difference = 0.1, vary rho
  • g = 1, h = 0, difference = 0.1, vary rho
  • rho = 0.3, difference = 0.2, vary g, h = 0
  • rho = 0.3, difference = 0.2, vary g, h = 0.2

g = h = 0, difference = 0.1, vary rho

That’s the standard normal bivariate distribution. Group 1 has values of rho1 = 0 to 0.8, in steps of 0.1. Group 2 has values of rho2 = rho1 + 0.1.

For normal bivariate distributions, coverage is at the nominal level for all methods, sample sizes and population correlations. (Here I only considered sample sizes of 50+ because otherwise power is far too low, so there is no point.)

The width of the CIs (upper bound minus lower bound) decreases with rho and with sample size. That’s expected from the sampling distributions of correlation coefficients.

When CIs do not include the population value, are they located to the left or the right of the population? In the figure below, negative values indicate a preponderance of left shifts, positive values a preponderance of right shifts. A value of 1 = 100% right shifts, -1 = 100% left shifts. For Pearson, CIs not including the population value tend to be located evenly to the left and right of the population. For Spearman, there is a preponderance of left shifted CIs for rho1 = 0.8. This left shift implies a tendency to over-estimate the difference (the difference group 1 minus group 2 is negative).

g = 1, h = 0, difference = 0.1, vary rho

What happens when we sample from a skewed distribution?

The coverage is lower than the expected 95% for Zou’s method and the discrepancy worsens with increasing rho1 and with increasing sample size. The percentile bootstrap does a much better job. Spearman’s combined with the percentile bootstrap is spot on.
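For reference, a standard percentile bootstrap CI for the difference between two independent Spearman’s correlations can be sketched as below. This is a plain-Python illustration, not the code used in the simulations: pairs of observations are resampled with replacement within each group, the difference between correlations is computed for each bootstrap sample, and the CI bounds are read from the sorted differences. All function names are hypothetical, and the example data are bivariate normal. Passing `corfun=pearson` gives the plain percentile version for Pearson’s correlation, which is not Wilcox’s modified method.

```python
import math
import random


def ranks(values):
    """Average ranks (tied values share the mean of their positions)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return r


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def boot_diff_ci(x1, y1, x2, y2, corfun=spearman, nboot=599, alpha=0.05,
                 seed=None):
    """Percentile bootstrap CI for corfun(group 1) - corfun(group 2)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(nboot):
        i1 = rng.choices(range(len(x1)), k=len(x1))  # resample pairs, group 1
        i2 = rng.choices(range(len(x2)), k=len(x2))  # resample pairs, group 2
        c1 = corfun([x1[i] for i in i1], [y1[i] for i in i1])
        c2 = corfun([x2[i] for i in i2], [y2[i] for i in i2])
        diffs.append(c1 - c2)
    diffs.sort()
    k = int(alpha / 2 * nboot)
    return diffs[k], diffs[nboot - 1 - k]


def bivariate_normal(n, rho, rng):
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1 - rho * rho) * z2)
    return xs, ys


rng = random.Random(21)
x1, y1 = bivariate_normal(100, 0.3, rng)
x2, y2 = bivariate_normal(100, 0.5, rng)
print(boot_diff_ci(x1, y1, x2, y2, seed=1))
```

Working on ranks is what makes this approach insensitive to the shape of the marginal distributions: a monotonic transformation of either variable leaves the ranks, and therefore the bootstrap differences, unchanged.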

For CIs that did not include the population value, the pattern of shifts varies as a function of rho. For Pearson, CIs are more likely to be located to the right of the population (under-estimation of the population value or wrong sign) for rho = 0, whereas for rho = 0.8, CIs are more likely to be located to the left. Spearman + bootstrap produces much more balanced results.

To investigate the asymmetry, we look at CIs for g=1, a sample size of n = 200 and the extremes of the distributions, rho1 = 0 and rho1 = 0.8. The figure below shows the preponderance of right shifted CIs for the two Pearson methods when rho1 = 0. The vertical line marks the population difference of -0.1.

For rho1 = 0.8, the pattern changes to a preponderance of left shifts for all methods, which means that the CIs tended to over-estimate the population difference. CIs for differences between Spearman’s correlations were considerably narrower than those for Pearson’s correlations though, thus showing less bias and less uncertainty.

rho = 0.3, difference = 0.2, vary g, h = 0

For another perspective on the three methods, we now look at a case with:

  • group 1: rho1 = 0.3
  • group 2: rho2 = 0.5
  • we vary g from 0 to 1.

For Pearson + Zou, coverage progressively decreases with increasing g, and to a much more limited extent with increasing sample size. Pearson + bootstrap is much more resilient to changes in g. And Spearman + bootstrap just doesn’t care about asymmetry!

The better coverage of Pearson + bootstrap seems to be achieved by producing wider CIs.

Matters only get worse for Pearson + Zou when outliers are likely (see notebook on GitHub).


Based on this new comparison of the 3 methods, I’d argue again that Spearman + bootstrap should be preferred over the two Pearson methods. But if the goal is to assess linear relationships, then Pearson + bootstrap is preferable to Zou’s method. I’ll report on other methods in another post.


Comparison of correlation coefficients

Zou, Guang Yong. Toward Using Confidence Intervals to Compare Correlations. Psychological Methods 12, no. 4 (2007): 399–413. https://doi.org/10.1037/1082-989X.12.4.399.

Wilcox, Rand R. Comparing Pearson Correlations: Dealing with Heteroscedasticity and Nonnormality. Communications in Statistics – Simulation and Computation 38, no. 10 (1 November 2009): 2220–34. https://doi.org/10.1080/03610910903289151.

Baguley, Thom. Comparing correlations: independent and dependent (overlapping or non-overlapping) https://seriousstats.wordpress.com/2012/02/05/comparing-correlations/

Diedenhofen, Birk, and Jochen Musch. Cocor: A Comprehensive Solution for the Statistical Comparison of Correlations. PLoS ONE 10, no. 4 (2 April 2015). https://doi.org/10.1371/journal.pone.0121945.

g & h distributions

Hoaglin, David C. Summarizing Shape Numerically: The g-and-h Distributions. In Exploring Data Tables, Trends, and Shapes, 461–513. John Wiley & Sons, Ltd, 1985. https://doi.org/10.1002/9781118150702.ch11.

Yan, Yuan, and Marc G. Genton. The Tukey G-and-h Distribution. Significance 16, no. 3 (2019): 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x.

When is a 95% confidence interval not a 95% confidence interval?

In previous posts, we saw how skewness and outliers can affect false positives (type I errors) and true positives (power) in one-sample tests. In particular, when making inferences about the population mean, skewness tends to inflate false positives, and skewness and outliers can destroy power. Here we investigate a complementary perspective, looking at how confidence intervals are affected by skewness and outliers.

Spoiler alert: 95% confidence intervals most likely do not have a coverage of 95%. In fact, I’ll show you an example in which a 95% CI for the mean has an 80% coverage…

The R code for this post is on GitHub.

Back to the title of the post. Seems like a weird question? Not if we consider the definition of a confidence interval (CI). Let’s say we conduct an experiment to estimate quantity x from a sample, where x could be the median or the mean for instance. Then a 95% CI for the population value of x refers to a procedure whose behaviour is defined in the long run: CIs computed in the same way should contain the population value in 95% of exact replications of the experiment. For a single experiment, the particular CI does or does not contain the population value; there is no probability associated with it. A CI can also be described as the interval compatible with the data given our model (see definitions and common misinterpretations in Greenland et al., 2016).

So 95% refers to the (long-term) coverage of the CI; the exact values of the CI bounds vary across experiments. The CI procedure is associated with a certain coverage probability, in the long-run, given the model. Here the model refers to how we collected data, data cleaning procedures (e.g. outlier removal), assumptions about data distribution, and the methods used to compute the CI. Coverage can differ from the expected one if model assumptions are violated or the model is just plain wrong.
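Long-run coverage is easy to estimate by simulation. The Python sketch below is only an illustration of the idea, not the R code that accompanies this post. It assumes samples of size n = 30 from g & h distributions with h = 0, for which the population mean is (exp(g²/2) - 1)/g (and 0 when g = 0), and it hard-codes 2.0452, the 0.975 quantile of Student’s t distribution with 29 degrees of freedom. The function names and the number of iterations are arbitrary choices.

```python
import math
import random

T_CRIT_29 = 2.0452  # 0.975 quantile of Student's t with 29 df (n = 30 only)


def gh_sample(n, g, rng):
    """n draws from a g & h distribution with h = 0."""
    zs = (rng.gauss(0.0, 1.0) for _ in range(n))
    return [z if g == 0 else (math.exp(g * z) - 1.0) / g for z in zs]


def t_ci(x, crit=T_CRIT_29):
    """Standard 95% CI for the mean, valid as coded for n = 30."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x) / (n - 1)
    half = crit * math.sqrt(var / n)
    return m - half, m + half


def coverage(g, nsim=5000, n=30, seed=1):
    """Proportion of simulated experiments whose CI contains the population mean."""
    rng = random.Random(seed)
    pop_mean = 0.0 if g == 0 else (math.exp(g * g / 2.0) - 1.0) / g
    hits = 0
    for _ in range(nsim):
        lo, hi = t_ci(gh_sample(n, g, rng))
        hits += lo <= pop_mean <= hi
    return hits / nsim


for g in (0.0, 0.5, 1.0):
    print(f"g={g}: estimated coverage = {coverage(g, nsim=2000):.3f}")
```

A short run like this one is noisy, but the estimate for g = 0 should be close to the nominal 0.95, and the estimate for g = 1 noticeably lower.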

Wrong models are extremely common, for instance when applying a standard t-test CI to percent correct data (Kruschke, 2014; Jaeger, 2008) or Likert scale data (Bürkner & Vuorre, 2019; Liddell & Kruschke, 2018).

For continuous data, CI coverage is not at the expected, nominal level, for instance when the model expects symmetric distributions and we’re actually sampling from skewed populations (which is the norm, not the exception, when we measure sizes, durations, latencies etc.). Here we explore this issue using g & h distributions that let us manipulate asymmetry.

Illustrate g & h distributions

All g & h distributions have a median of zero. The parameter g controls the asymmetry of the distribution, while the parameter h controls the thickness of the tails (Hoaglin, 1985; Yan & Genton, 2019). Let’s look at some illustrations to make things clear.

Examples in which we vary g from 0 to 1.

As g increases, the asymmetry of the distributions increases. Using negative g values would produce distributions with negative skewness.

Examples in which we vary h from 0 to 0.2.

As h increases, the tails are getting thicker, which means that outliers are more likely. 

Test with normal (g=h=0) distribution

Let’s run simulations to look at coverage probability in different situations and for different estimators. First, we sample with replacement from a normal population (g=h=0) 20,000 times (that’s 20,000 simulated experiments). Each sample has size n=30. Confidence intervals are computed for the mean, the 10% trimmed mean (tm), the 20% trimmed mean and the median using standard parametric methods (see details in the code on GitHub, and references for equations in Wilcox & Rousselet, 2018). The trimmed mean and the median are robust measures of central tendency. To compute a 10% trimmed mean, observations are sorted, the 10% lowest and 10% largest values are discarded (20% in total), and the remaining values are averaged. In this context, the mean is a 0% trimmed mean and the median is a 50% trimmed mean. Trimming the data attenuates the influence of the tails of the distributions and thus the effects of asymmetry and outliers on confidence intervals.
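To make the trimming step concrete, here is a small Python sketch of a trimmed mean and of a Tukey-McLaughlin style confidence interval built from the Winsorized variance. It is an illustration, not the code used for the simulations. The function names are hypothetical, the functions are meant for trimming proportions below 0.5, and the caller has to supply `crit`, the t quantile with n - 2g - 1 degrees of freedom, where g is the number of observations trimmed from each tail. In the example, n = 10 and 20% trimming give 5 degrees of freedom, for which the 0.975 quantile is about 2.5706.

```python
import math


def trimmed_mean(x, trim=0.2):
    """Sort, drop the lowest and highest `trim` proportion, average the rest."""
    xs = sorted(x)
    g = math.floor(trim * len(xs))
    kept = xs[g:len(xs) - g]
    return sum(kept) / len(kept)


def winsorized_var(x, trim=0.2):
    """Variance after replacing trimmed values by the nearest kept values."""
    xs = sorted(x)
    n = len(xs)
    g = math.floor(trim * n)
    w = [xs[g]] * g + xs[g:n - g] + [xs[n - g - 1]] * g
    m = sum(w) / n
    return sum((v - m) ** 2 for v in w) / (n - 1)


def trimmed_se(x, trim=0.2):
    """Standard error of the trimmed mean, based on the Winsorized variance."""
    n = len(x)
    return math.sqrt(winsorized_var(x, trim)) / ((1 - 2 * trim) * math.sqrt(n))


def trimmed_mean_ci(x, trim, crit):
    """CI for the trimmed mean; crit is the t quantile for n - 2g - 1 df."""
    tm = trimmed_mean(x, trim)
    half = crit * trimmed_se(x, trim)
    return tm - half, tm + half


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # one extreme value
print("mean:           ", trimmed_mean(data, 0.0))
print("10% trimmed mean:", trimmed_mean(data, 0.1))
print("20% trimmed mean:", trimmed_mean(data, 0.2))
print("95% CI, 20% tm: ", trimmed_mean_ci(data, 0.2, crit=2.5706))
```

In this toy example the single extreme value pulls the mean up to 14.5, whereas both trimmed means equal 5.5, because the extreme value is discarded before averaging.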

First we look at coverage for the 4 estimators: we look at the proportion of simulated experiments in which the CIs included the population value for each estimator. As expected for the special case of a normal distribution, the coverage is close to nominal (95%) for every method:

| Mean | 10% tm | 20% tm | Median |
| --- | --- | --- | --- |
| 0.949 | 0.948 | 0.943 | 0.947 |

In addition to coverage, we also look at the width of the CIs (upper bound minus lower bound). Across simulations, we summarise the results using the median width. CIs tend to be larger for trimmed means and the median relative to the mean, which implies lower power under normality for these methods (Wilcox & Rousselet, 2018).

| Mean | 10% tm | 20% tm | Median |
| --- | --- | --- | --- |
| 0.737 | 0.761 | 0.793 | 0.889 |

For CIs that did not include the population value, the distribution is fairly balanced between the left and the right of the population value. To see this, I computed a shift index: if the CI was located to the left of the population value, it receives a score of -1; if it was located to the right, it receives a score of 1. The shift index was then computed by averaging the scores only for those CIs excluding the population value.

| Mean | 10% tm | 20% tm | Median |
| --- | --- | --- | --- |
| 0.046 | 0.043 | 0.009 | 0.013 |
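The shift index is simple enough to write down as code. The sketch below is a Python illustration with hypothetical names, not the original implementation: each CI is a (lower, upper) pair, CIs that contain the population value are ignored, and the remaining ones are scored -1 (entirely to the left) or 1 (entirely to the right) before averaging.

```python
def shift_index(cis, pop_value):
    """Average of -1 (left) / +1 (right) scores over CIs missing pop_value."""
    scores = []
    for lo, hi in cis:
        if hi < pop_value:
            scores.append(-1)  # CI entirely to the left of the population value
        elif lo > pop_value:
            scores.append(1)   # CI entirely to the right
    return sum(scores) / len(scores) if scores else float("nan")


# Toy example: 3 CIs miss on the left, 1 on the right, 2 contain the value.
cis = [(-3, -1), (-2, -0.5), (-4, -2), (0.5, 2), (-1, 1), (-0.2, 0.3)]
print(shift_index(cis, 0.0))  # (1 - 3) / 4 = -0.5
```

An index of 0 means left and right misses are equally frequent, while -1 and 1 mean that all misses are on the left or on the right, respectively.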

Illustrate CIs that did not include the population

Out of 20,000 simulated experiments, about 1,000 CIs (roughly 5%) did not include the population value for each estimator. About the same number of CIs were shifted to the left and to the right of the population value, which is illustrated in the next figure. In each panel, the vertical line marks the population value (here it’s zero in all conditions because the population is symmetric). The CIs are plotted in the order of occurrence in the simulation. So the figure shows that if we miss the population value, we’re as likely to overshoot as to undershoot it.

Across panels, the figure also shows that the more we trim (10%, 20%, median) the larger the CIs get. So for a strictly normal population, we more precisely estimate the mean than trimmed means and the median.

Test with g=1 & h=0 distribution

What happens for a skewed population? Three things happen for the mean:

  • coverage goes down
  • width increases
  • CIs not including the population value tend to be shifted to the left (negative average shift values)

The same effects are observed for the trimmed means, but less so the more we trim, because trimming alleviates the effects of the tails.

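| Measure | Mean | 10% tm | 20% tm | Median |
| --- | --- | --- | --- | --- |
| Coverage | 0.880 | 0.936 | 0.935 | 0.947 |
| Width | 1.253 | 0.956 | 0.879 | 0.918 |
| Shift | -0.962 | -0.708 | -0.661 | 0.017 |
| # left | 2350 | 1101 | 1084 | 521 |
| # right | 45 | 188 | 221 | 539 |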

Illustrate CIs that did not include the population

The figure illustrates the strong imbalance between left and right CI shifts. If we try to estimate the mean of a skewed population, our CIs are likely to miss it more than 5% of the time, and when that happens, the CIs are most likely to be shifted towards the bulky part of the distribution (here the left for a right skewed distribution). Also, the right shifted CIs vary a lot in width and can be very large.

As we trim, the imbalance is progressively resolved. With 20% trimming, when CIs do not contain the population value, the distribution of left and right shifts is more balanced, although with still far more left shifts. With the median we have roughly 50% left / 50% right shifts and CIs are narrower than for the mean.

Test with g=1 & h=0.2 distribution

What happens if we sample from a skewed distribution (g=1) in which outliers are likely (h=0.2)?

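| Measure | Mean | 10% tm | 20% tm | Median |
| --- | --- | --- | --- | --- |
| Coverage | 0.801 | 0.934 | 0.936 | 0.947 |
| Width | 1.729 | 1.080 | 0.934 | 0.944 |
| Shift | -0.995 | -0.797 | -0.709 | 0.018 |
| # left | 3967 | 1194 | 1086 | 521 |
| # right | 9 | 135 | 185 | 540 |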

The results are similar to those observed for h=0, only exacerbated. Coverage for the mean is even lower, CIs are larger, and the shift imbalance is even more severe. I have no idea how often such a situation occurs, but I suspect that if you study clinical populations it might be rather common. Anyway, the point is that it is a very bad idea to assume the distributions we study are normal, apply standard tools, and hope for the best. Reporting CIs as 95% or some other value, without checking, can be very misleading.

Simulations in which we vary g

We now explore CI properties as a function of g, which we vary from 0 to 1, in steps of 0.1. The parameter h is set to 0 (left column of next figure) or 0.2 (right column). Let’s look at column A first (h=0). For the median, coverage is unaffected by g. For the other estimators, there is a monotonic decrease in coverage with increasing g. The effect is much stronger for the mean than the trimmed means.

For all estimators, increasing g leads to monotonic increases in CI width. The effect is very subtle for the median and more pronounced the less we trim. Under normality, g=0, CIs are the shortest for the mean, explaining the larger power of mean based methods relative to trimmed means in this unusual situation.

In the third panel, the zero line represents an equal proportion of left and right shifts, relative to the population, for CIs that did not include the population value. The values are consistently above zero for the median, with a few more right shifts than left shifts for all values of g. For the other estimators, the preponderance of left shifts increases markedly with g.

Now we look at results in column B (h=0.2). When outliers are likely, coverage drops faster with g for the mean. Other estimators are resistant to outliers.

When outliers are common, CIs for the population mean are larger than for all other estimators, irrespective of g.

Again, there is a constant over-representation of right shifted CIs for the median. For the other estimators, the left shifted CIs dominate more and more with increasing g. The trend is more pronounced for the mean relative to the h=0 situation, with a sharper monotonic downward trajectory.


The answer to the question in the title is: most of the time! Simply because our models are wrong most of the time. So I would take all published confidence intervals with a pinch of salt. [Some would actually go further and say that if the sampling and analysis plans for an experiment were not clearly stipulated before running the experiment, then confidence intervals, like P values, are not even defined (Wagenmakers, 2007). That is, we can compute a CI, but the coverage is meaningless, because exact repeated sampling might be impossible or contingent on external factors that would need to be simulated.] The best way forward is probably not to advocate for the use of trimmed means or the median over the mean in all cases, because different estimators address different questions about the data. And there are more estimators of central tendency than means, trimmed means and medians. There are also more interesting questions to ask about the data than their central tendencies (Rousselet, Pernet & Wilcox, 2017). For these reasons, we need data sharing to be the default, so that other users can ask different questions using different tools. The idea that the one approach used in a paper is the best to address the problem at hand is just silly.

To see what happens when we use the percentile bootstrap or the bootstrap-t to build confidence intervals for the mean, see this more recent post.


Bürkner, Paul-Christian, and Matti Vuorre. ‘Ordinal Regression Models in Psychology: A Tutorial’. Advances in Methods and Practices in Psychological Science 2, no. 1 (1 March 2019): 77–101. https://doi.org/10.1177/2515245918823199.

Greenland, Sander, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. ‘Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations’. European Journal of Epidemiology 31, no. 4 (1 April 2016): 337–50. https://doi.org/10.1007/s10654-016-0149-3.

Hoaglin, David C. ‘Summarizing Shape Numerically: The g-and-h Distributions’. In Exploring Data Tables, Trends, and Shapes, 461–513. John Wiley & Sons, Ltd, 1985. https://doi.org/10.1002/9781118150702.ch11.

Jaeger, T. Florian. ‘Categorical Data Analysis: Away from ANOVAs (Transformation or Not) and towards Logit Mixed Models’. Journal of Memory and Language 59, no. 4 (November 2008): 434–46. https://doi.org/10.1016/j.jml.2007.11.007.

Kruschke, John K. Doing Bayesian Data Analysis. 2nd Edition. Academic Press, 2014.

Liddell, Torrin M., and John K. Kruschke. ‘Analyzing Ordinal Data with Metric Models: What Could Possibly Go Wrong?’ Journal of Experimental Social Psychology 79 (1 November 2018): 328–48. https://doi.org/10.1016/j.jesp.2018.08.009.

Rousselet, Guillaume A., Cyril R. Pernet, and Rand R. Wilcox. ‘Beyond Differences in Means: Robust Graphical Methods to Compare Two Groups in Neuroscience’. European Journal of Neuroscience 46, no. 2 (1 July 2017): 1738–48. https://doi.org/10.1111/ejn.13610.

Rousselet, Guillaume A., and Rand R. Wilcox. ‘Reaction Times and Other Skewed Distributions: Problems with the Mean and the Median’. Preprint. PsyArXiv, 17 January 2019. https://doi.org/10.31234/osf.io/3y54r.

Wagenmakers, Eric-Jan. ‘A Practical Solution to the Pervasive Problems of p Values’. Psychonomic Bulletin & Review 14, no. 5 (1 October 2007): 779–804. https://doi.org/10.3758/BF03194105.

Wilcox, Rand R., and Guillaume A. Rousselet. ‘A Guide to Robust Statistical Methods in Neuroscience’. Current Protocols in Neuroscience 82, no. 1 (2018): 8.42.1-8.42.30. https://doi.org/10.1002/cpns.41.

Yan, Yuan, and Marc G. Genton. ‘The Tukey G-and-h Distribution’. Significance 16, no. 3 (2019): 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x.

“Amends might be made in the interest of the new generation of students by printing in leaded type in future editions of existing text-books and in all new text-books:

Normality is a myth; there never was, and never will be, a normal distribution.

This is an over-statement from the practical point of view, but it represents a safer initial mental attitude than any in fashion during the past two decades.”

Geary, Biometrika, 1947


Comparing two independent Pearson’s correlations

This post demonstrates two important facts: 

(1) the Fisher’s z method to compare two independent correlations can give very inaccurate results when sampling from distributions that are skewed (asymmetric) or heavy tailed (high probability of outliers) and the population correlation rho differs from zero;

(2) even when sampling from normal distributions, Fisher’s z as well as bootstrap methods require very large sample sizes to detect differences, much larger than commonly used in neuroscience & psychology.

We start with an example and then consider the results of a few simulations. The R code is available on GitHub. It is part of a larger project and I’ll report in other blog posts on the rest of the simulations I’m currently running.

UPDATE: the bootcorci R package implements bootstrap methods to compare correlation coefficients.


Let’s look at an example of correlation analysis reported in Davis et al. (2008). This article had 898 citations on June 11th 2019 according to Google Scholar. In their figure 3, the authors reported 4 correlations: 2 correlations in 2 independent groups of participants. Across panels A and B, the frontal activity variable appears twice, so for each group of participants (old and young), the correlations in A and B are dependent and overlapping. Code on how to handle this case was covered in a previous post. Here we’re going to focus on the comparison of two independent correlations, which is like considering A or B on its own.


There are several problems with the analysis from Davis et al.:

  • within-participant variability was removed by averaging over trials and other variables, so equal weight is wrongly attributed to every observation;

  • the analyses are split into different correlations instead of attempting mixed-effect modelling with the groups of participants considered together;

  • association is assessed using Pearson’s correlation, which is not robust and suffers from many issues (it is sensitive to outliers, the magnitude of the slope around which points are clustered, curvature, the magnitude of the residuals, restriction of range, and heteroscedasticity — see details in Wilcox 2017);

  • sample sizes are very small (far too small to say anything meaningful actually, as described in this previous post);

  • if a correction for multiple comparisons is applied, nothing passes the magic 0.05 threshold;

  • the authors commit a classic interaction fallacy by reporting one P<0.05 correlation, the other not, without directly comparing them (Gelman & Stern 2006; Nieuwenhuis et al. 2011).

There is nothing special about the correlation analysis in Davis et al. (2008): in neuroscience and psychology these problems are very common. In the rest of this post we’re going to tackle only two of them: how to compare 2 independent Pearson’s correlations, and what sample sizes are required to tease them apart in the long-run.


The most common approach to compare 2 independent correlations is to use the Fisher’s r-to-z approach. Here is a snippet of R code for Fisher’s z, given r1 and r2 the correlations in group 1 and group 2, and n1 and n2 the corresponding sample sizes:

z1 <- atanh(r1)

z2 <- atanh(r2)

zobs <- (z1-z2) / sqrt( 1 / (n1-3) + 1 / (n2-3) )

pval <- 2 * pnorm(-abs(zobs))

This approach is implemented in the cocor R package for instance (Diedenhofen & Musch, 2015). The problem with this approach is its normality assumption: it works well when sampling from a bivariate normal distribution or when the two distributions have a population correlation rho of zero, but it misbehaves when sampling from asymmetric distributions, or from distributions with heavy tails, when rho differs from zero (e.g. Duncan & Layard, 1973; Berry & Mielke, 2000). And all methods based on some variant of the z transform suffer the same issues.


Simulations can be used to understand how a statistical procedure works in the long-run, over many identical experiments. Let’s imagine that we sample from bivariate g & h distributions, in which parameter g controls the asymmetry of the distributions and h controls the thickness of the tails  (Hoaglin, 1985). You can see univariate g & h distributions illustrated here. Below you can see bivariate samples with n = 5,000, for combinations of g and h parameters, for a population correlation rho of 0:

and for a population correlation rho of 0.5:

Using g=h=0 gives a normal bivariate distribution.

When h=0.2, the tails are thicker, which means that outliers are more likely.
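To make the role of g and h concrete, here is a minimal base R sketch of the g-and-h transform applied to correlated normal variables. The function names `ghtrans()` and `rbivgh()` are hypothetical, and this is not necessarily how the samples in this post were generated. One assumption to keep in mind: rho is the correlation between the underlying normal variables, so when g or h differs from zero the Pearson correlation of the transformed variables will not be exactly rho.

```r
# g-and-h transform of standard normal deviates z (Hoaglin, 1985):
# g controls asymmetry, h controls tail thickness; g = h = 0 returns z unchanged.
ghtrans <- function(z, g = 0, h = 0) {
  x <- if (g == 0) z else (exp(g * z) - 1) / g
  x * exp(h * z^2 / 2)
}

# Hypothetical helper: generate two normal variables with correlation rho,
# then apply the same g-and-h transform to each marginal.
# Assumption: no correction for the change in correlation due to the transform.
rbivgh <- function(n, rho = 0, g = 0, h = 0) {
  z1 <- rnorm(n)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)
  cbind(ghtrans(z1, g, h), ghtrans(z2, g, h))
}

set.seed(1)
x  <- rbivgh(5000, rho = 0.5)          # bivariate normal sample
xh <- rbivgh(5000, rho = 0.5, h = 0.2) # heavier tails: outliers more likely
cor(x[, 1], x[, 2])                    # close to 0.5 in the normal case
range(x)                               # compare with range(xh)
range(xh)
```

Plotting `x` and `xh` side by side gives scatterplots of the kind described above.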

There is a large parameter space to cover, so here we only look at a subset of results, using simulations with 4,000 iterations. We’ll see examples in which we vary rho, the population correlation, or we vary g for a given rho.

Vary rho

First, we consider false positives, before turning to true positives (power). 

False positives

For false positives, 2 independent random samples are taken from the same bivariate population, so on average the two correlations should not differ. Here are the results for different rhos and sample sizes for g=h=0, using Fisher’s z test and an alpha of 0.05:


So all is fine when sampling from normal distributions: the type I error rate, or proportion of false positives, is at the nominal 0.05 level (marked by a horizontal black line). The values are not exactly 0.05 because of the limited number of simulations, but they’re close enough.
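For readers who want to check this kind of result themselves, here is a small, self-contained sketch of a false positive simulation for Fisher's z. It is not the code used for the figures: the function names are hypothetical, the number of iterations is lower, and rho is (by assumption) the correlation of the normal variables before the g-and-h transform.

```r
# g-and-h transform of standard normal deviates (g = h = 0 leaves z unchanged)
ghtrans <- function(z, g = 0, h = 0) {
  x <- if (g == 0) z else (exp(g * z) - 1) / g
  x * exp(h * z^2 / 2)
}

# Pearson correlation from one simulated group of n observations
rcor_gh <- function(n, rho, g, h) {
  z1 <- rnorm(n)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)
  cor(ghtrans(z1, g, h), ghtrans(z2, g, h))
}

# Two-sided p value from Fisher's r-to-z comparison of two independent correlations
fisher_p <- function(r1, r2, n1, n2) {
  zobs <- (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
  2 * pnorm(-abs(zobs))
}

# Proportion of false positives: both groups come from the same population
sim_fp <- function(nsim, n, rho, g = 0, h = 0, alpha = 0.05) {
  pvals <- replicate(nsim, fisher_p(rcor_gh(n, rho, g, h),
                                    rcor_gh(n, rho, g, h), n, n))
  mean(pvals < alpha)
}

set.seed(21)
fp_normal <- sim_fp(nsim = 2000, n = 50, rho = 0.5) # g = h = 0
fp_normal # should be near the nominal 0.05 level
```

Changing the `g` and `h` arguments of `sim_fp()` lets you explore the non-normal cases in the same way.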

What happens if we sample from slightly asymmetric populations instead (g=0.2)? Relative to the g=0 condition, the number of false positives increases a bit over 0.05 on average, and more so with increasing rho.


Things get worse if we sample from symmetric distributions in which outliers are likely (g=0, h=0.2):


If rho = 0 then the proportion of false positives is still around 0.05, but it increases rapidly with rho, an effect exacerbated by sample size (a pattern already reported in Duncan & Layard, 1973). That’s a very bad feature, because, as we will see below, very large sample sizes are required to detect differences between correlation coefficients. Unless of course our experimental samples are from normal distributions, but that’s unlikely and usually unknown, so why make such a gamble? (A cursory look at the literature suggests that skewed distributions and outliers are frequent in correlation analyses; see Rousselet & Pernet, 2012.)

If we sample from asymmetric distributions in which outliers are likely (g=h=0.2), it just gets slightly worse than in the previous example:


What’s the alternative? According to Wilcox (2009), a percentile bootstrap can be used to compare Pearson’s correlations, leading to satisfactory proportions of false positives. If instead of Fisher’s z we use a percentile bootstrap to compare Pearson’s correlations when g=h=0.2, we get the following results:


Still above the expected 0.05 but much better than Fisher’s z!

To compare the two correlation coefficients using the percentile bootstrap, we proceed like this:

  • sample participants with replacement, independently in each group;

  • compute the two correlation coefficients based on the bootstrap samples;

  • save the difference between correlations;

  • execute the previous steps many times;

  • use the distribution of bootstrap differences to derive a confidence interval. 
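The steps above can be sketched in a few lines of base R. This is an illustration only: `boot_cor_diff()` is a hypothetical name, not a function from Rand Wilcox's code, and it uses the plain 2.5th and 97.5th quantiles with Spearman's correlation by default. It does not reproduce any sample-size-dependent adjustment of the quantiles.

```r
# Percentile bootstrap confidence interval for the difference between
# two independent correlations (hypothetical helper, unadjusted quantiles).
boot_cor_diff <- function(x1, y1, x2, y2, nboot = 2000, alpha = 0.05,
                          method = "spearman") {
  n1 <- length(x1)
  n2 <- length(x2)
  bootdiff <- replicate(nboot, {
    id1 <- sample.int(n1, n1, replace = TRUE) # resample participants, group 1
    id2 <- sample.int(n2, n2, replace = TRUE) # resample participants, group 2
    cor(x1[id1], y1[id1], method = method) -
      cor(x2[id2], y2[id2], method = method)  # bootstrap difference
  })
  list(estimate = cor(x1, y1, method = method) - cor(x2, y2, method = method),
       ci = quantile(bootdiff, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE))
}

# Simulated example data (not from any real experiment)
set.seed(44)
n <- 50
x1 <- rnorm(n); y1 <- rnorm(n)                  # group 1: Pearson rho = 0
x2 <- rnorm(n); y2 <- 0.6 * x2 + 0.8 * rnorm(n) # group 2: Pearson rho = 0.6
res <- boot_cor_diff(x1, y1, x2, y2)
res$estimate # observed difference between correlations
res$ci       # the difference is declared significant if the CI excludes zero
```

Participants are resampled as pairs of observations, which is why the same index vector is applied to both x and y within a group.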

Usually a 95% confidence interval is defined by the 2.5th and 97.5th quantiles of the bootstrap distribution, but for Pearson’s correlation different quantiles are used depending on the sample size. This adjusted procedure is implemented in the R function twopcor() from Rand Wilcox. When comparing two robust correlation coefficients, the adjustment is not necessary; that version is implemented in twocor(). If for instance we use Spearman’s correlation when g=h=0.2, we get these results:


The proportion of false positives is at the nominal level and doesn’t change with rho! I’ll report simulations using the percentage bend correlation and the winsorised correlation, two other robust estimators, in another post.

True positives

To assess true positives, we first consider differences of 0.1: that is, if rho1 is 0, rho2 is 0.1, and so on for different rhos. Here are the results for g=h=0 using Fisher’s z:


For a difference of 0.1, power is overall very low, and it depends strongly on rho. The dependence of power on rho follows logically from the reduced sampling variability with increasing rho, which is demonstrated in a previous post. That post also illustrates the asymmetry of the sampling distributions when rho is different from zero. More generally, bounded distributions tend to be asymmetric, which implies that normality assumptions are routinely violated in psychology.

Changing g or h to 0.2 lowers power. Here is what happens with h=0.2 for instance:


But it’s difficult to make comparisons across figures. I would need to make plots of power differences, which is getting tedious for a blog post. So instead we can directly investigate the effect of g and h for fixed rho values. Here we only consider g.

Vary g

False positives

First, we consider the case where rho1 = rho2 = 0.3, we vary g and keep h=0. Again, asymmetry has a strong effect on false positives from Fisher’s z test.


Using the bootstrap gives better results, but still with inflated false positives.


Spearman + bootstrap is better behaved:


True positives

What about power? Let’s consider rho1 = 0.3 and rho2 = 0.5 for Fisher’s z. Increasing asymmetry reduces power.


The percentile bootstrap gives even worse results:


Spearman + bootstrap is not affected by asymmetry at all it seems:


But again, notice the large number of trials required to detect an effect. Here we have a population difference of 0.2, and even assuming normality, 80% power requires well over 250 observations! And that’s if you’re ok to be wrong 20% of the time when there is an effect — seems quite a gamble. For 90% power you will need over 350 observations! But again that’s assuming normality… I haven’t looked at the effect of h on power in this scenario yet.
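As a rough cross-check of these sample sizes under normality, power for Fisher's z can also be approximated analytically, without simulation. The sketch below is only an approximation: it treats atanh(r) as normal with variance 1/(n-3), and it assumes two groups of equal size, with n counted per group (my assumption about how observations are counted here). The function names are hypothetical.

```r
# Approximate power of Fisher's z test for two independent correlations,
# assuming bivariate normality and the usual normal approximation for atanh(r).
fisher_power <- function(rho1, rho2, n1, n2 = n1, alpha = 0.05) {
  delta <- abs(atanh(rho1) - atanh(rho2)) # population difference on the z scale
  se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3)) # standard error of z1 - z2
  crit <- qnorm(1 - alpha / 2)            # two-sided critical value
  pnorm(delta / se - crit) + pnorm(-delta / se - crit)
}

# Smallest per-group sample size reaching the target power (equal group sizes)
min_n <- function(rho1, rho2, power = 0.8, alpha = 0.05, nmax = 1e5) {
  for (n in 4:nmax) {
    if (fisher_power(rho1, rho2, n, alpha = alpha) >= power) return(n)
  }
  NA_integer_
}

min_n(0.3, 0.5, power = 0.8) # population difference of 0.2
min_n(0.3, 0.5, power = 0.9)
min_n(0, 0.6, power = 0.8)   # a much larger difference
min_n(0, 0.6, power = 0.9)
```

For rho1 = 0.3 and rho2 = 0.5 the approximation gives sample sizes well above 250 and 350 for 80% and 90% power. None of this applies once normality is violated, which is where simulations are needed.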

Finally, let’s consider a simulation in which the rhos are set to the values reported in Figure 3A of Davis et al. (2008): rho1 = 0, rho2 = 0.6. That’s a massive and unlikely difference, but let’s look at how many observations we need even in this improbable case.


To achieve at least 80% power given an expected population difference of 0.6 and g=0, the minimum sample size is 36 observations. For 90% power, the minimum sample size is 47 observations.

Now with g=1, to achieve at least 80% power the minimum sample size is 61 observations; to achieve at least 90% power the minimum sample size is 91 observations.

Contrast these results with the results obtained using a combination of Spearman + bootstrap:


To achieve at least 80% power given an expected population difference of 0.6 and g=0, the minimum sample size is 43 observations. For 90% power, the minimum sample size is 56 observations.

Now with g=1, to achieve at least 80% power the minimum sample size is 42 observations; to achieve at least 90% power the minimum sample size is 55 observations.

So under normality, Pearson + Fisher is more powerful than Spearman + bootstrap, but power can drop a lot for asymmetric distributions. Spearman is very resilient to asymmetry.

Pearson + bootstrap doesn’t fare well:


A good reminder that the bootstrap is not a magic recipe that can handle all situations well (Rousselet, Pernet & Wilcox, 2019).

UPDATE: 2019-07-23

For the normal case (g=h=0), we look at the proportion of true positives (power) for the difference between Pearson’s correlations using Fisher’s r-to-z transform. We systematically vary the sample size n, rho1 and the difference between rho1 and rho2. The title of each facet indicates rho1. The difference between rho1 and rho2 is colour coded. For instance, for rho1=0, a 0.1 difference means that rho2=0.1. For rho1=0.8, only a difference of 0.1 is considered, meaning that rho2=0.9. So, unless we deal with large differences, we need very large numbers of observations to detect differences between independent correlation coefficients.


Given the many problems associated with Pearson’s correlation, it’s probably wise to choose some other, robust alternative (Pernet et al. 2012). Whatever methods you choose, it seems that very large sample sizes are required to detect effects, certainly much larger than I expected before I ran the simulations. I’ll report on the behaviour of other methods in another post. Meanwhile, take published correlation analyses from small samples with a grain of salt!

Given the present results, I would recommend using Spearman + bootstrap to compare correlation coefficients. You could also consider a skipped Spearman, which is robust to multivariate outliers, but with long computation times relative to the other techniques mentioned above (Pernet et al. 2012).


Berry, K.J. & Mielke, P.W. (2000) A Monte Carlo investigation of the Fisher Z transformation for normal and nonnormal distributions. Psychol Rep, 87, 1101-1114.

Davis, S.W., Dennis, N.A., Daselaar, S.M., Fleck, M.S. & Cabeza, R. (2008) Que PASA? The posterior-anterior shift in aging. Cereb Cortex, 18, 1201-1209.

Diedenhofen, B. & Musch, J. (2015) cocor: a comprehensive solution for the statistical comparison of correlations. PLoS One, 10, e0121945.

Duncan, G.T. & Layard, M.W.J. (1973) Monte-Carlo Study of Asymptotically Robust Tests for Correlation-Coefficients. Biometrika, 60, 551-558.

Gelman, A. & Stern, H. (2006) The difference between “significant” and “not significant” is not itself statistically significant. Am Stat, 60, 328-331.

Hoaglin, David C. ‘Summarizing Shape Numerically: The g-and-h Distributions’. In Exploring Data Tables, Trends, and Shapes, 461–513. John Wiley & Sons, Ltd, 1985. https://doi.org/10.1002/9781118150702.ch11.

Nieuwenhuis, S., Forstmann, B.U. & Wagenmakers, E.J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci, 14, 1105-1107.

Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.

Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Frontiers in human neuroscience, 6, 119.

Rousselet, G.A., Pernet, C.R., & Wilcox, R.R. (2019) A practical introduction to the bootstrap: a versatile method to make inferences by using data-driven simulations (Preprint). PsyArXiv. https://doi.org/10.31234/osf.io/h8ft7

Wilcox, R.R. (2009) Comparing Pearson Correlations: Dealing with Heteroscedasticity and Nonnormality. Commun Stat-Simul C, 38, 2220-2234.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8.42.1-8.42.30.

“Forget about getting definitive results from a single experiment; instead embrace variation, accept uncertainty, and learn what you can.”

Andrew Gelman 2018

Gelman, A. (2018) The Failure of Null Hypothesis Significance Testing When Studying Incremental Changes, and What to Do About It. Pers Soc Psychol B, 44, 16-23.


R functions for the hierarchical shift function

The hierarchical shift function presented in the previous post is now available in the `rogme` R package. Here is a short demo.

Get the latest version of `rogme`:

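# install.packages("devtools")
devtools::install_github("GRousselet/rogme") # install from GitHub
library(rogme) # provides the flp data, hsf() and hsf_pb()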

Load data and compute hierarchical shift function:

df <- flp # get reaction time data - for details `help(flp)`
# Compute shift functions for all participants
out <- hsf(df, rt ~ condition + participant)


Because of the large number of participants, the confidence intervals are too narrow to be visible. So let’s subset a random sample of participants to see what can happen with a smaller sample size:

set.seed(22) # subset random sample of participants
id <- unique(df$participant) 
df <- subset(df, participant %in% sample(id, 50, replace = FALSE)) # keep 50 random participants
out <- hsf(df, rt ~ condition + participant) 


Want to estimate the quartiles only?

out <- hsf(df, rt ~ condition + participant, qseq = c(.25, .5, .75))


Want to reverse the comparison?

out <- hsf(df, rt ~ condition + participant, todo = c(2,1))


P values are here:


P values adjusted for multiple comparisons using Hochberg’s method:


Percentile bootstrap version:

out <- hsf_pb(df, rt ~ condition + participant)

Plot bootstrap highest density intervals – default:



Plot distributions of bootstrap samples of group differences. Bootstrap distributions are shown in orange. Black dot marks the mode. Vertical black lines mark the 50% and 90% highest density intervals.




For more examples, a vignette is available on GitHub.

Feedback would be much appreciated: don’t hesitate to leave a comment or to get in touch directly.