Tag Archives: false positive

Mean or median reaction time? An open review of Miller (2020)

Below is a review of Miller (2020), which is a reply to an article in which Rand Wilcox & I reproduced and built upon Miller (1988). Here are the articles in order:

[1] A Warning About Median Reaction Time

[2] Reaction times and other skewed distributions: problems with the mean and the median

[3] Another Warning about Median Reaction Time

I’ve added a link to [3] and this review to article [2], so readers are aware of the discussions.

The review will only make sense if you at least read [3] first, but [2] contains a lot of simulations, descriptions and references not covered in [3].


To start, I’ve got to say I’ve learnt so much about various sources of bias from your work on reaction time analyses, including Miller (1988) and many subsequent papers. I discovered Miller (1988) by chance a few years ago, while researching a review article on robust measures of effect sizes. Actually, I was so startled by the 1988 results that I dropped the article on effect sizes to replicate Miller’s simulations and explore their consequences. This extensive work has taught me a lot about reaction time data, the mean, the median, their sampling distributions and associated inferential statistics. So I’m really thankful for that.

Overall, I enjoyed your reply to our paper, which provides interesting new simulations and a good summary of the issues. The main apparent discrepancies are much smaller than they seem and can easily be addressed by wording key statements and conclusions more carefully, for instance by highlighting boundary conditions. If anything I wrote below is unclear, feel free to contact me directly. In particular, if needed, we could discuss the g&h simulation code, which I realise I probably could have explained better in the R&W2020 article.

Your article is well written. The illustrations are fine but could be improved by adding colours or grey/black contrasts. In Figure 4 (presenting the most original and interesting results), the symbols could be different for the 3 estimators to improve clarity.

The simulation results are convincing and mostly concur with our own assessment. However, what is missing is a consideration of the effects of outliers and the skewness of the distribution of differences (after pooling across trials using the mean, the median or some other estimator). As we explained in R&W2020, outliers and skewness can essentially destroy the power of inferential tests using the mean, whereas tests on the median are hardly affected.

“Another unusual feature of R&W’s simulations is that they used g&h distributions (Hoaglin, 1985) as models of RT. These distributions are quite different from ex-Gaussians and are not normally considered in RT modelling (e.g., Luce, 1986). This distributional choice may also have contributed to the advantage for medians in their simulations.”

I don’t understand how our simulations using g&h distributions could be the source of the discrepancy, because we didn’t use them to simulate distributions of reaction times. In fact, we used them to systematically manipulate the distribution of differences that is fed to a one-sample test, from normal to very skewed. We also varied the probability that outliers are observed. The shape of the marginal RT distributions should be irrelevant to these simulations: when the RT distributions are identical in every condition and participant, irrespective of their skewness, the distribution of differences is normal (that is, for each participant, compute the mean for each condition then subtract the means, leading to a distribution of pairwise differences). When the distributions differ in skewness, the distribution of differences has skewness equal to the difference in skewness between the original distributions. Thus, our simulations are only concerned with the group level analyses, and only with the shapes of the distributions of differences — whether these distributions resulted from individual means or medians was not considered. We also used different one-sample tests for the different estimators of central tendency of the distributions of differences. This is necessary because the mean, the median and trimmed means have different standard errors. So, our assessments of the median are not comparable between simulations.

To be clear: as I understand, in your simulations, different RT distributions from 2 conditions are summarised using the mean and the median, one for each condition and each participant (stage 1); pairwise differences are computed, resulting in distributions of differences (stage 2); one-sample tests on the mean of the differences are performed.

In R&W2020’s g&h simulations, we ignored stage 1. We only considered the shapes of the distributions of differences, and then computed one-sample tests for 3 different estimators of central tendency. The skewness of these distributions depend on both within- and between-participant variability, but we did not model these sources of variability explicitly, only their end-product. Our simulations demonstrate that, given the same distribution of differences, in some situations the median and the 20% trimmed mean have dramatically more power than the mean.

It might well be that in some (many?) situations, RT distributions do not differ enough in skewness to affect the power of the one-sample t-test. But it remains a fundamental statistical result that one-sample t-tests are strongly affected by skewness, and even much more so by outliers. This is also covered in detail in Wilcox & Rousselet (2018).

To address this discrepancy in simulation results, I don’t think new simulations necessarily need to be added, but the problem should be presented more carefully.

Other apparent disagreements can be addressed by more careful phrasing of the situations under which the problems occur, especially at the start of the conclusion section, which I find somewhat misleading.

“R&W concluded that “there seems to be no rationale for preferring the mean over the median as a measure of central tendency for skewed distributions” (p. 37). On the contrary, when performing hypothesis tests to compare the central tendencies of RTs between experimental conditions, the present simulations show that there may be an extremely clear rationale in terms of both Type I error rate and power.”

As we explain in our paper, it is precisely because the mean is a poor measure of central tendency that in some situations it is better at detecting distribution differences (particularly when conditions differ in skewness, and more specifically in their right tails when dealing with RT distributions). But higher power or nominal type I error rate doesn’t make the mean a better measure of central tendency.

What is needed in this section is a clear distinction between 2 different but complementary goals:
[1] to detect differences between distributions;
[2] to understand how distributions differ.

As we argue in our paper, if the goal is [2], then it makes no sense to use the mean or the median; much better tools are available, starting by using the mean and the median, to get a richer perspective. The distinction is clearly made in footnote 1 of your article, but should be reiterated in the conclusion and the abstract, so there is no confusion.

“When comparing conditions with unequal numbers of trials, the sample-size-dependent bias of regular medians can lead to clear inflation of the Type I error rate (Fig. 2), so these medians definitely should not be used.”

This statement is only valid when considering low numbers of trials in the condition with the least trials. So to be clear, the problem emerges only for a combination of absolute and relative numbers of trials. The problem also occurs when group statistics involve means. In contrast, performing for instance a median one-sample test on group differences between medians does not lead to inflated false positives. I realise most users who collapse RT distributions are most likely to perform group statistics using the mean, but this assumption needs to be explicitly stated. The choice of estimator applies to the two levels of analysis: within-participants and between-participants.

For this reason, in the text, it would be worth describing what inferential tests were used for the analyses (I presume t-tests). This should bring a bit of nuance to this statement for instance, given that power depends on choices at 2 levels of analysis:

“The choice of central tendency measure would then be determined primarily by comparing the power of these three measures (i.e., means, medians, bias-corrected medians).”

In R&W2020 we also considered tests on medians, which solve the bias issue. We also looked at trimmed means, and other options are available such as the family of M-estimators.

“In view of the fact that means have demonstrably greater power than bias-corrected medians for experiments with unequal trial frequencies (Fig. 3), it is also sensible to compare power levels in experiments with equal trial frequencies.”

This statement is inaccurate because too general: as we demonstrate in Figure 15 of R&W2020, at least in some situations, the median can have higher power than the mean. Maybe more importantly, in the presence of skewness and outliers, methods using quantiles (with the median as a special case) or trimmed means can have dramatically more power than mean-based methods.

To go back to the distinction between goals [1] and [2], as we argue in R&W2020, ideally we should focus on richer descriptions of distributions using multiple quantiles to understand how they differ (a point you also make in other articles). Limiting analyses to the mean or the median is really unsatisfactory. Also, standard t-tests and ANOVAs do not account for measurement error, including item variability, which should really be handled using hierarchical models. Whether going the quantile way or the hierarchical modelling way (or both, the number of trials required to make sense of the data would make bias issues naturally go away. So personally I see using the mean RT as a valid recommendation, but only in very specific situations.

Your last point in the discussion about choosing a measure of central tendency before seeing the data is an important one. Perhaps more importantly, given the large number of options available to make sense of rich reaction time data, probably the prime recommendation should be that authors make their data publicly available, so that others can try alternative, and in most cases, better techniques.

Other points:

“There is unfortunately no consensus about the psychological meanings of changes in these different parameters (e.g., Rieger & Miller, 2019), but the ex-Gaussian distribution nevertheless remains useful as a way of describing changes in the shapes as well as the means of RT distributions.”

This is an excellent point. It would be worth also to cite Matzke & Wagenmakers (2009) in addition to Rieger & Miller (2019).

I would also mention that other distributions provide better interpretability in terms of shift and scale effects (shifted lognormal) — see Lindelov (2019).


“the very between-condition difference” — remove very?

Comparing two independent Pearson’s correlations

This post demonstrates two important facts: 

(1) the Fisher’s z method to compare two independent correlations can give very inaccurate results when sampling from distributions that are skewed (asymmetric) or heavy tailed (high probability of outliers) and the population correlation rho differs from zero;

(2) even when sampling from normal distributions, Fisher’s z as well as bootstrap methods require very large sample sizes to detect differences, much larger than commonly used in neuroscience & psychology.

We start with an example and then consider the results of a few simulations. The R code is available on GitHub. It is part of a larger project and I’ll report in other blog posts on the rest of the simulations I’m currently running.


Let’s look at an example of correlation analysis reported in Davis et al. (2008). This article had 898 citations on June 11th 2019 according to Google scholar. In their figure 3, the authors reported 4 correlations: 2 correlations in 2 independent groups of participants. Across panels A and B, the frontal activity variable appears twice, so for each group of participants (old and young), the correlations in A and B are dependent and overlapping. Code on how to handle this case was covered in a previous post. Here we’re going to focus on the comparison of two independent correlations, which is like considering A or B and its own.


There are several problems with the analysis from Davis et al.:

  • within-participant variability was removed by averaging over trials and other variables, so equal weight is wrongly attributed to every observation;

  • the analyses are split into different correlations instead of attempting mixed-effect modelling with the groups of participants considered together;

  • association is assessed using Pearson’s correlation, which is not robust and suffers from many issues (it is sensitive to outliers, the magnitude of the slope around which points are clustered, curvature, the magnitude of the residuals, restriction of range, and heteroscedasticity — see details in Wilcox 2017);

  • sample sizes are very small (far too small to say anything meaningful actually, as described in this previous post);

  • if a correction for multiple comparisons is applied, nothing passes the magic 0.05 threshold;

  • the authors commit a classic interaction fallacy by reporting one P<0.05 correlation, the other not, without directly comparing them (Gelman & Stern 2006; Nieuwenhuis et al. 2011).

There is nothing special about the correlation analysis in Davis et al. (2008): in neuroscience and psychology these problems are very common. In the rest of this post we’re going to tackle only two of them: how to compare 2 independent Pearson’s correlations, and what sample sizes are required to tease them apart in the long-run.


The most common approach to compare 2 independent correlations is to use the Fisher’s r-to-z approach. Here is a snippet of R code for Fisher’s z, given r1 and r2 the correlations in group 1 and group 2, and n1 and n2 the corresponding sample sizes:

z1 <- atanh(r1)

z2 <- atanh(r2)

zobs <- (z1-z2) / sqrt( 1 / (n1-3) + 1 / (n2-3) )

pval <- 2 * pnorm(-abs(zobs))

This approach is implemented in the cocor R package for instance (Diedenhofen & Musch, 2015). The problem with this approach is its normality assumption: it works well when sampling from a bivariate normal distribution or when the two distributions have a population correlation rho of zero, but it misbehaves when sampling from asymmetric distributions, or from distributions with heavy tails, when rho differs from zero (e.g. Duncan & Layard, 1973; Berry & Mielke, 2000). And all methods based on some variant of the z transform suffer the same issues.


Simulations can be used to understand how a statistical procedure works in the long-run, over many identical experiments. Let’s imagine that we sample from bivariate g & h distributions, in which parameter g controls the asymmetry of the distributions and h controls the thickness of the tails  (Hoaglin, 1985). You can see univariate g & h distributions illustrated here. Below you can see bivariate samples with n = 5,000, for combinations of g and h parameters, for a population correlation rho of 0:

and for a population correlation rho of 0.5:

Using g=h=0 gives a normal bivariate distribution.

When h=0.2, the tails are thicker, which means that outliers are more likely.

There is a large parameter space to cover, so here we only look at a subset of results, using simulations with 4,000 iterations. We’ll see examples in which we vary rho, the population correlation, or we vary g for a given rho.

Vary rho

First, we consider false positives, before turning to true positives (power). 

False positives

For false positives, 2 independent random samples are taken from the same bivariate population, so on average the two correlations should not differ. Here are the results for different rhos and sample sizes for g=h=0, using Fisher’s z test and an alpha of 0.05:


So all is fine when sampling from normal distributions: the type I error rate, or proportion of false positives, is at the nominal 0.05 level (marked by a horizontal black line). The values are not exactly 0.05 because of the limited number of simulations, but they’re close enough.

What happens if we sample from slightly asymmetric populations instead (g=0.2)? Relative to the g=0 condition, the number of false positives increases a bit over 0.05 on average, and more so with increasing rho.


Things get worse if we sample from symmetric distributions in which outliers are likely (g=0, h=0.2):


If rho = 0 then the proportion of false positives is still around 0.05, bit it increases rapidly with rho, an effect exacerbated by sample size (a pattern already reported in Duncan & Layard, 1973). That’s a very bad feature, because, as we will see below, very large sample sizes are required to detect differences between correlation coefficients. Unless of course our experimental samples are from normal distributions, but that’s unlikely and usually unknown, so why make such a gamble? (A cursory look at the literature suggests that skewed distributions and outliers are frequent in correlation analyses —Rousselet & Pernet, 2012).

If we sample from asymmetric distributions in which outliers are likely (g=h=0.2), it just gets slightly worse than in the previous example:


What’s the alternative? According to Wilcox (2009), a percentile bootstrap can be used to compare Pearson’s correlations, leading to satisfactory proportions of false positives. If instead of Fisher’s z we use a percentile bootstrap to compare Pearson’s correlations when g=h=0.2, we get the following results:


Still above the expected 0.05 but much better than Fisher’s z!

To compare the two correlation coefficients using the percentile bootstrap, we proceed like this:

  • sample participants with replacement, independently in each group;

  • compute the two correlation coefficients based on the bootstrap samples;

  • save the difference between correlations;

  • execute the previous steps many times;

  • use the distribution of bootstrap differences to derive a confidence interval. 

Usually a 95% confidence interval is defined as the 2.5th and 97.5th quantiles of the bootstrap distribution, but for Pearson’s correlation different quantiles are used depending on the sample size. This adjusted procedure is implemented in the R function twopcor() from Rand Wilcox. To compare two robust correlation coefficients, the adjustment is not necessary, a procedure implemented in twocor(). If for instance we use Spearman’s correlation when g=h=0.2, we get these results:


The proportion of false positives is at the nominal level and doesn’t change with rho! I’ll report simulations using the percentage bend correlation and the winsorised correlation, two other robust estimators, in another post.

True positives

To assess true positives, we first consider differences of 0.1: that is, if rho1 is 0, rho2 is 0.1, and so on for different rhos. Here are the results for g=h=0 using Fisher’s z:


For a difference of 0.1, power is overall very low, and it depends strongly on rho. The dependence of power on rho follows logically from the reduced sampling variability with increasing rho, which is demonstrated in a previous post. That post also illustrates the asymmetry of the sampling distributions when rho is different from zero. More generally, bounded distributions tend to be asymmetric, which implies that normality assumptions are routinely violated in psychology.

Changing g or h to 0.2 lowers power. Here is what happens with h=0.2 for instance:


But it’s difficult to make comparisons across figures. I would need to make plots of power differences, which is getting tedious for a blog post. So instead we can directly investigate the effect of g and h for fixed rho values. Here we only consider g.

Vary g

False positives

First, we consider the case where rho1 = rho2 = 0.3, we vary g and keep h=0. Again, asymmetry has a strong effect on false positives from Fisher’s z test.


Using the bootstrap gives better results, but still with inflated false positives.


Spearman + bootstrap is better behaved:


True positives

What about power? Let’s consider rho1 = 0.3 and rho2 = 0.5 for Fisher’s z. Increasing asymmetry reduces power.


The percentile bootstrap gives even worse results:


Spearman + bootstrap is not affected by asymmetry at all it seems:


But again, notice the large number of trials required to detect an effect. Here we have a population difference of 0.2, and even assuming normality, 80% power requires well over 250 observations! And that’s if you’re ok to be wrong 20% of the time when there is an effect — seems quite a gamble. For 90% power you will need over 350 observations! But again that’s assuming normality… I haven’t looked at the effect of h on power in this scenario yet.

Finally, let’s consider a simulation in which the rhos are set to the values reported in Figure 3A of Davis et al. (2008): rho1 = 0, rho2 = 0.6. That’s a massive and unlikely difference, but let’s look at how many observations we need even in this improbable case.


To achieve at least 80% power given an expected population difference of 0.6 and g=0, the minimum sample size is 36 observations. For 90% power, the minimum sample size is 47 observations.

Now with g=1, to achieve at least 80% power the minimum sample size is 61 observations; to achieve at least 90% power the minimum sample size is 91 observations.

Contrast these results with the results obtained using a combination of Spearman + bootstrap:


To achieve at least 80% power given an expected population difference of 0.6 and g=0, the minimum sample size is 43 observations. For 90% power, the minimum sample size is 56 observations.

Now with g=1, to achieve at least 80% power the minimum sample size is 42 observations; to achieve at least 90% power the minimum sample size is 55 observations.

So under normality, Pearson + Fisher is more powerful than Spearman + bootstrap, but power can drop a lot for asymmetric distributions. Spearman is very resilient to asymmetry.

Pearson + bootstrap doesn’t fare well:


A good reminder that the bootstrap is not a magic recipe that can handle all situations well (Rousselet, Pernet & Wilcox, 2019).

UPDATE: 2019-07-23

For the normal case (g=h=0), we look at the proportion of true positives (power) for the difference between Pearsons’ correlations using Fisher’s r-to-z transform. We vary systematically the sampling size n, rho1 and the difference between rho1 and rho2. The title of each facet indicates rho1. The difference between rho1 and rho2 is colour coded. For instance, for rho1=0, a 0.1 difference means that rho2=0.1. For rho1=0.8, only a difference of 0.1 is considered, meaning that rho2=0.9. So, unless we deal with large differences, we need very large numbers of trials to detect differences between independent correlation coefficients.


Given the many problems associated with Pearson’s correlation, it’s probably wise to choose some other, robust alternative (Pernet et al. 2012). Whatever methods you choose, it seems that very large sample sizes are required to detect effects, certainly much larger than I expected before I ran the simulations. I’ll report on the behaviour of other methods in another post. Meanwhile, consider published correlation analyses from small samples with a grain of salt!

Given the present results, I would recommend to use Spearman + bootstrap to compare correlation coefficients. You could also consider a skipped Spearman, which is robust to multivariate outliers, but with long computation times relative to the other techniques mentioned above (Pernet et al. 2012).


Berry, K.J. & Mielke, P.W. (2000) A Monte Carlo investigation of the Fisher Z transformation for normal and nonnormal distributions. Psychol Rep, 87, 1101-1114.

Davis, S.W., Dennis, N.A., Daselaar, S.M., Fleck, M.S. & Cabeza, R. (2008) Que PASA? The posterior-anterior shift in aging. Cereb Cortex, 18, 1201-1209.

Diedenhofen, B. & Musch, J. (2015) cocor: a comprehensive solution for the statistical comparison of correlations. PLoS One, 10, e0121945.

Duncan, G.T. & Layard, M.W.J. (1973) Monte-Carlo Study of Asymptotically Robust Tests for Correlation-Coefficients. Biometrika, 60, 551-558.

Gelman, A. & Stern, H. (2006) The difference between “significant” and “not significant” is not itself statistically significant. Am Stat, 60, 328-331.

Hoaglin, David C. ‘Summarizing Shape Numerically: The g-and-h Distributions’. In Exploring Data Tables, Trends, and Shapes, 461–513. John Wiley & Sons, Ltd, 1985. https://doi.org/10.1002/9781118150702.ch11.

Nieuwenhuis, S., Forstmann, B.U. & Wagenmakers, E.J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci, 14, 1105-1107.

Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.

Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Frontiers in human neuroscience, 6, 119.

Rousselet, G.A., Pernet, C.R., & Wilcox, R.R. (2019) A practical introduction to the bootstrap: a versatile method to make inferences by using data-driven simulations (Preprint). PsyArXiv. https://doi.org/10.31234/osf.io/h8ft7

Wilcox, R.R. (2009) Comparing Pearson Correlations: Dealing with Heteroscedasticity and Nonnormality. Commun Stat-Simul C, 38, 2220-2234.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8 42 41-48 42 30.

Cluster correction for multiple dependent comparisons

In this post I explain the benefits of applying cluster based statistics, developed for brain imaging applications, to other experimental designs, in which tests are correlated. Here are some examples of such designs:

  • different groups of rats are tested with increasing concentrations of a molecule;
  • different groups of humans or the same humans are tested with stimulations of different intensities or durations (e.g. in neuro/psych it could be TMS, contrast, luminance, priming, masking, SOA);
  • pain thresholds are measured on contiguous patches of skin;
  • insects are sampled from neighbouring fields;
  • participants undergo a series of training sessions. 

In these examples, whatever is measured leads to statistical tests that are correlated in one or a combination of factors: time, space, stimulus parameters. In the frequentist framework, if the outcome of the family of tests is corrected for multiple comparisons using standard procedures (Bonferroni, Hochberg etc.), power will decrease with the number of tests. Cluster based correction for multiple comparison methods can keep false positives at the nominal level (say 0.05), without compromising power. 

These types of dependencies can also be explicitly modelled using Gaussian processes (for a Bayesian example, see McElreath, 2018, chapter 13). Cluster-based statistics are much simpler to use, but they do not provide the important shrinkage afforded by hierarchical methods…  

Cluster-based statistics

To get started, let’s consider an example involving a huge number of correlated tests. In this example, measurements are made at contiguous points in space (y axis) and time (x axis). The meaning of the axes is actually irrelevant – what matters is that the measurements are contiguous. In the figure below, left panel, we define our signal, which is composed of 2 clusters of high intensities among a sea of points with no effect (dark blue = 0). Fake measurements are then simulated by adding white noise to each point. By doing that 100 times, we obtain 100 noisy maps. The mean of these noisy maps is shown in the right  panel.


We also create 100 maps composed entirely of noise. Then we perform a t-test for independent groups at each point in the map (n=100 per group). 


What do we get? If we use a threshold of 0.05, we get two nice clusters of statistically significant tests where they are supposed to be. But we also get many false positives. If we try to get rid off the false positives by changing the thresholds, it works to some extent, but at the cost of removing true effects. Even with a threshold of 0.0005, there are still many false positives, and the clusters of true signal have been seriously truncated. 


The problem is that lowering the alpha is a brute force technique that does not take into account information we have about the data: measurement points are correlated. There is a family of techniques that can correct for multiple comparisons by taking these dependencies into account: cluster based statistics (for an introduction, see Maris & Oostenveld, 2007). These techniques control the family-wise error rate but maintain high power. The family-wise error rate (FWER) is the probably to obtain at least one significant test among a family of tests, when the null hypothesis is true.

When we use a frequentist approach and perform a family of tests, we increase the probably of reporting false positives. The multiple comparison problem is difficult to tackle in many situations because of the need to balance false positives and false negatives. Probably the best known and most widely applied correction for multiple comparison technique is Bonferroni, in which the alpha threshold is divided by the number of comparisons. However, this procedure is notoriously conservative, as it comes at the cost of lower power. Many other techniques have been proposed (I don’t know of a good review paper on this topic – please add a comment if you do).

In the example below, two time-courses are compared point-by-point. Panel a shows the mean time-courses across participants. Panel b shows the time-course of the t-test for 2 dependent groups (the same participants were tested in each condition). Panel c shows time-points at which significant t-tests were observed. Without correction, a large cluster of significant points is observed, but also a collection of smaller clusters. We know from physiology that some of these clusters are too small to be true so they are very likely false positives.


Figure 1 from Maris & Oostenveld, 2007.

If we change the significance threshold using the Bonferroni correction for multiple comparisons, in these examples we remove all significant clusters but the largest one. Good job?! The problem is that our large cluster has been truncated: it now looks like the effect starts later and ends sooner. The cluster-based inferences do not suffer from this problem.

Applied to our 2D example with two clusters embedded in noise, the clustering technique identifies 17,044 clusters of significant t-tests. After correction, only 2 clusters are significant!


So how do we compute cluster-based statistics? The next figure illustrates the different steps. At the top, we start with a time-course of F-values, from a series of point-by-point ANOVAs. Based on some threshold, say the critical F values for alpha = 0.05, we identify 3 clusters. The clusters are formed based on contiguity. For each cluster we then compute a summary statistics: it could be its duration (length), its height (maximum), or its sum. Here we use the sum. Now we ask a different question: for each cluster, is it likely to obtain that cluster sum by chance? To answer this question, we use non-parametric statistics to estimate the distribution expected by chance. 


There are several ways to achieve this goal using permutation, percentile bootstrap or bootstrap-t methods (Pernet et al., 2015). Whatever technique we use, we simulate time-courses of F values expected by chance, given the data. For each of these simulated time-courses, we apply a threshold, identify clusters, take the sum of each cluster and save the maximum sum across clusters. If we do that 1,000 times, we end up with a collection of 1,000 cluster sums (shown in the top right corner of the figure). We then sort these values and identify a quantile of interest, say the 0.95 quantile. Finally, we use this quantile as our cluster-based threshold: each original cluster sum is then compared to that threshold. In our example, out of the 3 original clusters, the largest 2 are significant after cluster-based correction for multiple comparisons, whereas the smallest one is not. 


From the description above, it is clear that using cluster-based statistics require a few choices:

  • a method to estimate the null distribution;
  • a method to form clusters;
  • a choice of cluster statistics;
  • a choice of statistic to form the null distribution (max across clusters for instance);
  • a number of resamples…

Given a set of choices, we need to check that our method does what it’s supposed to do. So let’s run a few simulations…

5 dependent groups

First we consider the case of 5 dependent groups. The 5 measurements are correlated in time or space or some other factor, such that clusters can be formed by simple proximity: 2 significant tests are grouped in a cluster if they are next to each other. Data are normally distributed, the population SD is 1, and the correlation between repeated measures is 0.75. Here is the FWER after 10,000 simulations, in which we perform 5 one-sample t-tests on means.


With correction for multiple comparisons, the probability to get at least one false positive is well above the nominal level (here 0.05). The grey area marks Bradley’s (1978) satisfactory range of false positives (between 0.025 and 0.075). Bonferroni’s and Hochberg’s corrections dramatically reduce the FWER, as expected. For n = 10, the FWER remains quite high, but drops within the acceptable range for higher sample sizes. But these corrections tend to be conservative, leading to FWER systematically under 0.05 from n = 30. Using a cluster-based correction, the FWER is near the nominal level at all sample sizes. 

The cluster correction was done using a bootstrap-t procedure, in which the original data are first mean-centred, so that the null hypothesis is true, and t distributions expected by chance are estimated by sampling the centred data with replacement 1,000 times, and each time computing a series of t-test. For each bootstrap, a max cluster sum statistics was saved and the 95th quantile of this distribution was used to threshold the original clusters.

Next we consider power. We sample from a population with 5 dependent conditions: there is no effect in conditions 1 and 5 (mean = 0), the mean is 1 for condition 3, and the mean is 0.5 for conditions 2 and 4. We could imagine a TMS experiment   where participants first receive a sham stimulation, then stimulation of half intensity, full, half, and sham again… Below is an illustration of a random sample of data from 30 participants.


If we define power as the probability to observe a significant t-test simultaneously in conditions 3, 4 and 5, we get these results:


Maximum power is always obtain in the condition without correction, by definition. The cluster correction always reaches maximum possible power, except for n = 10. In contrast, Bonferroni and Hochberg lead to lower power, with Bonferroni being the most conservative. For a desired long run power value, we can use interpolation to find out the matching sample size. To achieve at least 80% power, the minimum sample size is:

  • 39 observations for the cluster test;
  • 50 observations for Hochberg;
  • 57 observations for Bonferroni.

7 dependent groups

If we run the same simulation but with 7 dependent groups instead of 5, the pattern of results does not change, but the FWER increases if we do not apply any correction for multiple comparisons.


As for power, if we keep a cluster of effects with means 0.5, 1, 0.5 for conditions 3, 4 and 5, and zero effect for conditions 1, 2, 6 and 7, the power advantage of the cluster test increases. Now, to achieve at least 80% power, the minimum sample size is:

  • 39 observations for the cluster test;
  • 56 observations for Hochberg;
  • 59 observations for Bonferroni.


7 independent groups

Finally, we consider a situation with 7 independent groups. For instance, measurements were made in 7 contiguous fields. So the measurements are independent (done at different times), but there is spatial dependence between fields, so that we would expect that if a measurement is high in one field, it is likely to be high in the next field too. Here are the FWER results, showing a pattern similar to that in the previous examples:


The cluster correction does the best job at minimising false positives, whereas Bonferroni and Hochberg are too liberal for sample sizes 10 and 20.

To look at power, I created a simulation with a linear pattern: there is no effect in position 1, then a linear increase from 0 to a maximum effect size of 2 at position 7. Here is the sequence of effect sizes:

c(0, 0, 0.4, 0.8, 1.2, 1.6, 2)

And here is an example of a random sample with n = 70 measurements per group:


In this situation, again the cluster correction dominates the other methods in terms of power. For instance, to achieve at least 80% power, the minimum sample size is:

  • 50 observations for the cluster test;
  • 67 observations for Hochberg;
  • 81 observations for Bonferroni.



I hope the examples above have convinced you that cluster-based statistics could dramatically increase your statistical power relative to standard techniques used to correct for multiple comparisons. Let me know if you use a different correction method and would like to see how they compare. Or you could re-use the simulation code and give it a go yourself. 

Limitations: cluster-based methods make inferences about clusters, not about individual tests. Also, these methods require a threshold to form clusters, which is arbitrary and not convenient if you use non-parametric tests that do not come with p values. An alternative technique eliminates this requirement, instead forming a statistic that integrates across many potential cluster thresholds (TFCE, Smith & Nichols, 2009; Pernet et al. 2015). See a clear explanation in this blog post by Benedikt Ehinger. However, TFCE like the cluster methods presented here suffer from non-trivial inference issues. In the words of @ten_photos:

“I’m instead rather drawn to a pragmatic approach […] using a concrete interpretation of the conclusions drawn from rejecting one individual test:

– voxel-wise inference, reject null at a voxel, conclude signal present at that voxel;

– cluster-wise, reject null for that cluster, concluding that signal present at one or more voxels in the cluster;

– TFCE inference, reject the null at a voxel, conclude there exists a cluster containing that voxel for which we reject the null, concluding that signal present at one or more voxels in that contributing cluster.”


Matlab code for ERP analyses is available on figshare and as part of the LIMO EEG toolbox. The code can be used for other purposes – just pretend you’re dealing with one EEG electrode and Bob’s your uncle.

R code to reproduce the simulations is available on github. I’m planning to develop an R package to cover different experimental designs, using t-tests on means and trimmed means. In the meantime, if you’d like to apply the method but can’t make sense of my code, don’t hesitate to get in touch and I’ll try to help.


Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x. 

Maris, E. & Oostenveld, R. (2007) Nonparametric statistical testing of EEG- and MEG-data. Journal of neuroscience methods, 164, 177-190.

McElreath, R. (2018) Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.

Oostenveld, R., Fries, P., Maris, E. & Schoffelen, J.M. (2011) FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput Intell Neurosci, 2011, 156869.

Pernet, C.R., Chauveau, N., Gaspar, C. & Rousselet, G.A. (2011) LIMO EEG: a toolbox for hierarchical LInear MOdeling of ElectroEncephaloGraphic data. Comput Intell Neurosci, 2011, 831409.

Pernet, C.R., Latinus, M., Nichols, T.E. & Rousselet, G.A. (2015) Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of neuroscience methods, 250, 85-93.

Rousselet, Guillaume (2016): Introduction to robust estimation of ERP data. figshare. Fileset. 


Smith, S.M. & Nichols, T.E. (2009) Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. Neuroimage, 44, 83-98.