In this post I explain the benefits of applying cluster-based statistics, developed for brain imaging applications, to other experimental designs in which tests are correlated. Here are some examples of such designs:
different groups of rats are tested with increasing concentrations of a molecule;
different groups of humans or the same humans are tested with stimulations of different intensities or durations (e.g. in neuro/psych it could be TMS, contrast, luminance, priming, masking, SOA);
pain thresholds are measured on contiguous patches of skin;
insects are sampled from neighbouring fields;
participants undergo a series of training sessions.
In these examples, whatever is measured leads to statistical tests that are correlated in one or a combination of factors: time, space, stimulus parameters. In the frequentist framework, if the outcome of the family of tests is corrected for multiple comparisons using standard procedures (Bonferroni, Hochberg etc.), power will decrease with the number of tests. Cluster-based corrections for multiple comparisons can keep false positives at the nominal level (say 0.05), without compromising power.
These types of dependencies can also be explicitly modelled using Gaussian processes (for a Bayesian example, see McElreath, 2018, chapter 13). Cluster-based statistics are much simpler to use, but they do not provide the important shrinkage afforded by hierarchical methods…
Cluster-based statistics
To get started, let’s consider an example involving a huge number of correlated tests. In this example, measurements are made at contiguous points in space (y axis) and time (x axis). The meaning of the axes is actually irrelevant – what matters is that the measurements are contiguous. In the figure below, left panel, we define our signal, which is composed of 2 clusters of high intensities among a sea of points with no effect (dark blue = 0). Fake measurements are then simulated by adding white noise to each point. By doing that 100 times, we obtain 100 noisy maps. The mean of these noisy maps is shown in the right panel.
We also create 100 maps composed entirely of noise. Then we perform a t-test for independent groups at each point in the map (n=100 per group).
What do we get? If we use a threshold of 0.05, we get two nice clusters of statistically significant tests where they are supposed to be. But we also get many false positives. If we try to get rid of the false positives by lowering the threshold, it works to some extent, but at the cost of removing true effects. Even with a threshold of 0.0005, there are still many false positives, and the clusters of true signal have been seriously truncated.
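To make the problem concrete, here is a scaled-down sketch in Python (not the code used for the figures in this post). It assumes a one-dimensional "map" of 200 points instead of a 2D one, a hypothetical signal amplitude of 1, white noise with SD 1, and a hard-coded critical t value in place of a p value threshold.

```python
import math
import random

rng = random.Random(1)
n_points, n_maps = 200, 100
# Hypothetical 1D signal: two clusters of amplitude 1 in a sea of zeros
signal = [1.0 if 40 <= i < 60 or 120 <= i < 140 else 0.0 for i in range(n_points)]


def t_independent(x, y):
    """Two-sample t statistic with pooled variance (equal group sizes assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((v - mx) ** 2 for v in x) / (n - 1)
    vy = sum((v - my) ** 2 for v in y) / (n - 1)
    return (mx - my) / math.sqrt((vx + vy) / n)


t_crit = 1.972  # two-sided critical t for alpha = 0.05 and df = 198
significant = []
for s in signal:
    # 100 noisy measurements of the signal versus 100 measurements of pure noise
    group_signal = [s + rng.gauss(0, 1) for _ in range(n_maps)]
    group_noise = [rng.gauss(0, 1) for _ in range(n_maps)]
    significant.append(abs(t_independent(group_signal, group_noise)) > t_crit)

hits = sum(sig for sig, s in zip(significant, signal) if s > 0)
false_positives = sum(sig for sig, s in zip(significant, signal) if s == 0)
print(f"{hits} of 40 signal points and {false_positives} of 160 null points are significant")
```

With an uncorrected threshold, roughly 5% of the null points are expected to come out significant, and they tend to be scattered rather than grouped in large clusters.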
The problem is that lowering the alpha is a brute force technique that does not take into account information we have about the data: measurement points are correlated. There is a family of techniques that can correct for multiple comparisons by taking these dependencies into account: cluster-based statistics (for an introduction, see Maris & Oostenveld, 2007). These techniques control the familywise error rate while maintaining high power. The familywise error rate (FWER) is the probability of obtaining at least one significant test among a family of tests when the null hypothesis is true.
When we use a frequentist approach and perform a family of tests, we increase the probability of reporting false positives. The multiple comparison problem is difficult to tackle in many situations because of the need to balance false positives and false negatives. Probably the best known and most widely applied technique to correct for multiple comparisons is Bonferroni, in which the alpha threshold is divided by the number of comparisons. However, this procedure is notoriously conservative, as it comes at the cost of lower power. Many other techniques have been proposed (I don’t know of a good review paper on this topic; please add a comment if you do).
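For reference, here is a minimal Python sketch of the two standard corrections named in this post, Bonferroni and Hochberg's step-up procedure. The function names and the example p values are made up for illustration.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0 for every test with p <= alpha / m, where m is the number of tests."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def hochberg(pvals, alpha=0.05):
    """Hochberg's step-up procedure.

    With p values sorted in ascending order, find the largest rank k (1-based)
    such that p_(k) <= alpha / (m - k + 1), then reject the k smallest p values.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    n_reject = 0
    for rank in range(m, 0, -1):  # start from the largest p value
        if pvals[order[rank - 1]] <= alpha / (m - rank + 1):
            n_reject = rank
            break
    reject = [False] * m
    for i in order[:n_reject]:
        reject[i] = True
    return reject


pvals = [0.001, 0.012, 0.02, 0.03, 0.04]  # made-up p values
print(bonferroni(pvals))  # [True, False, False, False, False]
print(hochberg(pvals))    # [True, True, True, True, True]
```

In this made-up family every p value is below 0.05, so Hochberg rejects all five tests, whereas Bonferroni keeps only the one below 0.05 / 5 = 0.01.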
In the example below, two timecourses are compared point-by-point. Panel a shows the mean timecourses across participants. Panel b shows the timecourse of the t-test for 2 dependent groups (the same participants were tested in each condition). Panel c shows timepoints at which significant t-tests were observed. Without correction, a large cluster of significant points is observed, but also a collection of smaller clusters. We know from physiology that some of these clusters are too small to be true so they are very likely false positives.
If we change the significance threshold using the Bonferroni correction for multiple comparisons, in this example we remove all significant clusters but the largest one. Good job?! The problem is that our large cluster has been truncated: it now looks like the effect starts later and ends sooner. Cluster-based inferences do not suffer from this problem.
Applied to our 2D example with two clusters embedded in noise, the clustering technique identifies 17,044 clusters of significant t-tests. After correction, only 2 clusters are significant!
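In 2D, forming clusters means grouping significant points that touch each other. Here is a minimal Python sketch of that step, assuming 4-connectivity (points are neighbours if they share an edge); the connectivity rule, the toy map and the threshold are my assumptions for illustration, not necessarily what was used for the figure.

```python
def label_clusters_2d(mask):
    """Group True cells into 4-connected clusters.

    Returns a list of clusters, each a list of (row, col) coordinates.
    """
    n_rows, n_cols = len(mask), len(mask[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    clusters = []
    for r in range(n_rows):
        for c in range(n_cols):
            if mask[r][c] and not seen[r][c]:
                # Flood fill from this cell using an explicit stack
                stack, cells = [(r, c)], []
                seen[r][c] = True
                while stack:
                    i, j = stack.pop()
                    cells.append((i, j))
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        a, b = i + di, j + dj
                        if (0 <= a < n_rows and 0 <= b < n_cols
                                and mask[a][b] and not seen[a][b]):
                            seen[a][b] = True
                            stack.append((a, b))
                clusters.append(cells)
    return clusters


# Toy map of t values and a hypothetical cluster-forming threshold
tmap = [
    [0.5, 2.5, 3.0, 0.1, 0.2],
    [0.3, 2.25, 0.4, 0.0, 2.25],
    [0.1, 0.2, 0.3, 0.5, 2.5],
    [2.5, 0.0, 0.1, 0.2, 0.6],
]
threshold = 2.0
mask = [[t > threshold for t in row] for row in tmap]
clusters = label_clusters_2d(mask)
sums = [sum(tmap[i][j] for i, j in cells) for cells in clusters]
print(len(clusters), sums)  # 3 [7.75, 4.75, 2.5]
```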
So how do we compute cluster-based statistics? The next figure illustrates the different steps. At the top, we start with a timecourse of F values, from a series of point-by-point ANOVAs. Based on some threshold, say the critical F value for alpha = 0.05, we identify 3 clusters. The clusters are formed based on contiguity. For each cluster we then compute a summary statistic: it could be its duration (length), its height (maximum), or its sum. Here we use the sum. Now we ask a different question: for each cluster, how likely is it to obtain that cluster sum by chance? To answer this question, we use nonparametric statistics to estimate the distribution expected by chance.
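The cluster-forming step can be sketched in a few lines of Python. The F values and the threshold below are invented for illustration; the function simply finds runs of contiguous points above the threshold and reports the three candidate summary statistics.

```python
def find_clusters(stats, threshold):
    """Return (start, end) index pairs, end exclusive, for each run of
    contiguous points whose statistic exceeds the threshold."""
    clusters = []
    start = None
    for i, value in enumerate(stats):
        if value > threshold:
            if start is None:
                start = i  # a new cluster begins here
        elif start is not None:
            clusters.append((start, i))  # the current cluster just ended
            start = None
    if start is not None:  # a cluster runs up to the last point
        clusters.append((start, len(stats)))
    return clusters


# Made-up timecourse of F values and a hypothetical critical F value
f_values = [0.5, 1.0, 4.5, 6.0, 5.0, 2.0, 0.5, 4.25, 1.0, 7.5, 9.75, 8.5, 5.0, 1.0]
threshold = 4.0
for start, end in find_clusters(f_values, threshold):
    values = f_values[start:end]
    print(f"points {start}-{end - 1}: length={len(values)}, "
          f"max={max(values)}, sum={sum(values)}")
```

For this toy timecourse the three clusters have sums of 15.5, 4.25 and 30.75.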
There are several ways to achieve this goal using permutation, percentile bootstrap or bootstrap-t methods (Pernet et al., 2015). Whatever technique we use, we simulate timecourses of F values expected by chance, given the data. For each of these simulated timecourses, we apply a threshold, identify clusters, take the sum of each cluster and save the maximum sum across clusters. If we do that 1,000 times, we end up with a collection of 1,000 maximum cluster sums (shown in the top right corner of the figure). We then sort these values and identify a quantile of interest, say the 0.95 quantile. Finally, we use this quantile as our cluster-based threshold: each original cluster sum is then compared to that threshold. In our example, out of the 3 original clusters, the largest 2 are significant after cluster-based correction for multiple comparisons, whereas the smallest one is not.
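Here is a minimal Python sketch of the thresholding logic described above. To keep it short, the resampling step is replaced by a stand-in: the 1,000 "null" timecourses are white noise (squared standard normal values), which is an assumption made purely for illustration. In a real analysis they would come from permuting or bootstrapping the data. The observed values and the cluster-forming threshold are also made up.

```python
import random
from itertools import groupby


def cluster_sums(stats, threshold):
    """Sum the statistic within each run of contiguous supra-threshold points."""
    return [sum(run) for above, run in groupby(stats, key=lambda v: v > threshold) if above]


def max_cluster_sum(stats, threshold):
    """Largest cluster sum in one timecourse (0 if no cluster is formed)."""
    return max(cluster_sums(stats, threshold), default=0.0)


def cluster_sum_cutoff(null_timecourses, threshold, q=0.95):
    """q quantile of the max cluster sums across simulated null timecourses.

    Uses a simple sorted-index quantile; other quantile definitions exist.
    """
    maxima = sorted(max_cluster_sum(tc, threshold) for tc in null_timecourses)
    return maxima[max(0, round(q * len(maxima)) - 1)]


rng = random.Random(3)
threshold = 3.84  # hypothetical cluster-forming threshold
# Stand-in for the resampling step: 1,000 white-noise timecourses of 14 points
null_timecourses = [[rng.gauss(0, 1) ** 2 for _ in range(14)] for _ in range(1000)]
cutoff = cluster_sum_cutoff(null_timecourses, threshold)

observed = [0.5, 1.0, 4.5, 6.0, 5.0, 2.0, 0.5, 4.25, 1.0, 7.5, 9.75, 8.5, 5.0, 1.0]
for s in cluster_sums(observed, threshold):
    print(s, "significant" if s > cutoff else "not significant")
```

Saving only the maximum cluster sum from each simulated timecourse is what provides the familywise control: an observed cluster is declared significant only if it beats what the largest chance cluster typically looks like.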
Simulations
From the description above, it is clear that using cluster-based statistics requires a few choices:
a method to estimate the null distribution;
a method to form clusters;
a choice of cluster statistics;
a choice of statistic to form the null distribution (max across clusters for instance);
a number of resamples…
Given a set of choices, we need to check that our method does what it’s supposed to do. So let’s run a few simulations…
5 dependent groups
First we consider the case of 5 dependent groups. The 5 measurements are correlated in time or space or some other factor, such that clusters can be formed by simple proximity: 2 significant tests are grouped in a cluster if they are next to each other. Data are normally distributed, the population SD is 1, and the correlation between repeated measures is 0.75. Here is the FWER after 10,000 simulations, in which we perform 5 one-sample t-tests on means.
Without correction for multiple comparisons, the probability to get at least one false positive is well above the nominal level (here 0.05). The grey area marks Bradley’s (1978) satisfactory range of false positives (between 0.025 and 0.075). Bonferroni’s and Hochberg’s corrections dramatically reduce the FWER, as expected. For n = 10, the FWER remains quite high, but drops within the acceptable range for higher sample sizes. But these corrections tend to be conservative, leading to FWER systematically under 0.05 from n = 30. Using a cluster-based correction, the FWER is near the nominal level at all sample sizes.
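The uncorrected and Bonferroni parts of such a simulation can be sketched in Python as follows (this is not the R code used for the post). I assume here that every pair of conditions has the same correlation of 0.75, generated with a shared random component, and I use 2,000 simulations at n = 30 with hard-coded critical t values to keep the example fast.

```python
import math
import random


def simulate_sample(rng, n, means, rho=0.75):
    """n participants x len(means) conditions: normal data with SD 1 and
    correlation rho between every pair of conditions (shared component)."""
    data = []
    for _ in range(n):
        shared = rng.gauss(0, 1)
        data.append([m + math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1)
                     for m in means])
    return data


def t_one_sample(x):
    """One-sample t statistic against a null value of 0."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x) / (n - 1)
    return m / math.sqrt(var / n)


rng = random.Random(42)
max_abs_t = []
for _ in range(2000):
    data = simulate_sample(rng, 30, [0.0] * 5)  # the null hypothesis is true everywhere
    max_abs_t.append(max(abs(t_one_sample(col)) for col in zip(*data)))

# Two-sided critical t values for df = 29: alpha = 0.05 and alpha = 0.05 / 5
fwer_uncorrected = sum(t > 2.045 for t in max_abs_t) / len(max_abs_t)
fwer_bonferroni = sum(t > 2.756 for t in max_abs_t) / len(max_abs_t)
print(fwer_uncorrected, fwer_bonferroni)
```

A family of tests contains at least one false positive exactly when its largest |t| exceeds the critical value, which is why tracking the maximum is enough to estimate the FWER.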
The cluster correction was done using a bootstrap-t procedure, in which the original data are first mean-centred, so that the null hypothesis is true, and t distributions expected by chance are estimated by sampling the centred data with replacement 1,000 times, each time computing a series of t-tests. For each bootstrap sample, a max cluster sum statistic was saved and the 0.95 quantile of this distribution was used to threshold the original clusters.
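Here is a minimal Python sketch of this bootstrap-t procedure, written from the description above rather than copied from the R code. Several details are assumptions on my part: clusters are formed from |t| above a hard-coded critical value, the cluster statistic is the sum of squared t values, whole participants are resampled to preserve the dependence between conditions, and the example data are simulated with made-up condition means.

```python
import math
import random
from itertools import groupby


def t_one_sample(x):
    """One-sample t statistic against a null value of 0."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x) / (n - 1)
    return m / math.sqrt(var / n)


def cluster_sums(tvals, crit):
    """Sums of squared t values over runs of contiguous significant tests."""
    sq = [t * t for t in tvals]
    return [sum(run) for above, run in groupby(sq, key=lambda v: v > crit ** 2) if above]


def bootstrap_cutoff(data, crit, n_boot=1000, q=0.95, rng=random):
    """q quantile of the bootstrap-t distribution of max cluster sums."""
    means = [sum(col) / len(col) for col in zip(*data)]
    # Centre each condition so that the null hypothesis is true
    centred = [[x - m for x, m in zip(row, means)] for row in data]
    maxima = []
    for _ in range(n_boot):
        sample = rng.choices(centred, k=len(centred))  # resample participants
        tvals = [t_one_sample(col) for col in zip(*sample)]
        maxima.append(max(cluster_sums(tvals, crit), default=0.0))
    maxima.sort()
    return maxima[max(0, round(q * n_boot) - 1)]


# Simulated example: 30 participants, 5 correlated conditions (rho = 0.75)
rng = random.Random(7)
means = [0.0, 0.5, 1.0, 0.5, 0.0]
data = []
for _ in range(30):
    shared = rng.gauss(0, 1)
    data.append([m + math.sqrt(0.75) * shared + 0.5 * rng.gauss(0, 1) for m in means])

crit = 2.045  # two-sided critical t for alpha = 0.05 and df = 29
t_obs = [t_one_sample(col) for col in zip(*data)]
cutoff = bootstrap_cutoff(data, crit, rng=rng)
print("observed t values:", [round(t, 2) for t in t_obs])
print("cluster sum cutoff:", round(cutoff, 2))
print("significant clusters:", [s > cutoff for s in cluster_sums(t_obs, crit)])
```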
Next we consider power. We sample from a population with 5 dependent conditions: there is no effect in conditions 1 and 5 (mean = 0), the mean is 1 for condition 3, and the mean is 0.5 for conditions 2 and 4. We could imagine a TMS experiment where participants first receive a sham stimulation, then stimulation of half intensity, full, half, and sham again… Below is an illustration of a random sample of data from 30 participants.
If we define power as the probability to observe a significant t-test simultaneously in conditions 2, 3 and 4 (the three conditions with an effect), we get these results:
Maximum power is always obtained in the condition without correction, by definition. The cluster correction always reaches maximum possible power, except for n = 10. In contrast, Bonferroni and Hochberg lead to lower power, with Bonferroni being the most conservative. For a desired long run power value, we can use interpolation to find out the matching sample size. To achieve at least 80% power, the minimum sample size is:
39 observations for the cluster test;
50 observations for Hochberg;
57 observations for Bonferroni.
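The interpolation step can be sketched in Python as below. I assume linear interpolation between the simulated sample sizes, and the power values in the example are invented, so the result does not correspond to any of the numbers reported in this post.

```python
import math


def min_sample_size(sample_sizes, power, target=0.8):
    """Smallest integer n at which linearly interpolated power reaches the target.

    sample_sizes must be increasing; returns None if the target is never reached.
    """
    if power[0] >= target:
        return sample_sizes[0]
    points = list(zip(sample_sizes, power))
    for (n0, p0), (n1, p1) in zip(points, points[1:]):
        if p0 < target <= p1:
            # Linear interpolation between the two bracketing sample sizes
            return math.ceil(n0 + (target - p0) * (n1 - n0) / (p1 - p0))
    return None


# Made-up power curve for illustration
sizes = [10, 25, 40, 55, 70]
power = [0.15, 0.48, 0.74, 0.90, 0.97]
print(min_sample_size(sizes, power))  # 46
```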
7 dependent groups
If we run the same simulation but with 7 dependent groups instead of 5, the pattern of results does not change, but the FWER increases if we do not apply any correction for multiple comparisons.
As for power, if we keep a cluster of effects with means 0.5, 1, 0.5 for conditions 3, 4 and 5, and zero effect for conditions 1, 2, 6 and 7, the power advantage of the cluster test increases. Now, to achieve at least 80% power, the minimum sample size is:
39 observations for the cluster test;
56 observations for Hochberg;
59 observations for Bonferroni.
7 independent groups
Finally, we consider a situation with 7 independent groups. For instance, measurements were made in 7 contiguous fields. So the measurements are independent (done at different times), but there is spatial dependence between fields, so that we would expect that if a measurement is high in one field, it is likely to be high in the next field too. Here are the FWER results, showing a pattern similar to that in the previous examples:
The cluster correction does the best job at minimising false positives, whereas Bonferroni and Hochberg are too liberal for sample sizes 10 and 20.
To look at power, I created a simulation with a linear pattern: there is no effect in positions 1 and 2, then a linear increase in steps of 0.4 up to a maximum effect size of 2 at position 7. Here is the sequence of effect sizes:
c(0, 0, 0.4, 0.8, 1.2, 1.6, 2)
And here is an example of a random sample with n = 70 measurements per group:
In this situation, again the cluster correction dominates the other methods in terms of power. For instance, to achieve at least 80% power, the minimum sample size is:
50 observations for the cluster test;
67 observations for Hochberg;
81 observations for Bonferroni.
Conclusion
I hope the examples above have convinced you that cluster-based statistics could dramatically increase your statistical power relative to standard techniques used to correct for multiple comparisons. Let me know if you use a different correction method and would like to see how it compares. Or you could reuse the simulation code and give it a go yourself.
Limitations: cluster-based methods make inferences about clusters, not about individual tests. Also, these methods require a threshold to form clusters, which is arbitrary and not convenient if you use nonparametric tests that do not come with p values. An alternative technique eliminates this requirement, instead forming a statistic that integrates across many potential cluster thresholds (TFCE, Smith & Nichols, 2009; Pernet et al., 2015). TFCE also affords inferences for each test, not the cluster of tests. But it is computationally much more demanding than the standard cluster test demonstrated in this post.
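To give a flavour of how TFCE integrates across thresholds, here is a rough one-dimensional Python sketch of the score itself. It assumes non-negative statistics, a step size of 0.1, and the extent and height exponents E = 0.5 and H = 2 suggested by Smith & Nichols (2009). It is an illustration only, not the LIMO EEG implementation, and it leaves out the resampling needed to get p values.

```python
from itertools import groupby


def tfce_1d(stats, dh=0.1, E=0.5, H=2.0):
    """Approximate TFCE scores for a 1D series of non-negative statistics.

    For each threshold h = dh, 2*dh, ..., every point inside a cluster of
    extent e (number of contiguous points >= h) receives e**E * h**H * dh.
    """
    scores = [0.0] * len(stats)
    n_steps = int(max(stats) / dh)
    for k in range(1, n_steps + 1):
        h = k * dh
        pos = 0
        for above, run in groupby(stats, key=lambda v: v >= h):
            extent = len(list(run))
            if above:
                for i in range(pos, pos + extent):
                    scores[i] += extent ** E * h ** H * dh
            pos += extent
    return scores


# A wide cluster and an isolated peak of the same height (made-up values)
scores = tfce_1d([3.0, 3.0, 3.0, 0.0, 3.0, 0.0])
print([round(s, 2) for s in scores])
```

Points in the wide cluster get a higher score than the isolated peak, even though all four have the same height, and no single cluster-forming threshold had to be chosen.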
Code
Matlab code for ERP analyses is available on figshare and as part of the LIMO EEG toolbox. The code can be used for other purposes – just pretend you’re dealing with one EEG electrode and Bob’s your uncle.
R code to reproduce the simulations is available on GitHub. I’m planning to develop an R package to cover different experimental designs, using t-tests on means and trimmed means. In the meantime, if you’d like to apply the method but can’t make sense of my code, don’t hesitate to get in touch and I’ll try to help.
References
Bradley, J.V. (1978) Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x
Maris, E. & Oostenveld, R. (2007) Nonparametric statistical testing of EEG- and MEG-data. Journal of Neuroscience Methods, 164, 177–190.
McElreath, R. (2018) Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
Oostenveld, R., Fries, P., Maris, E. & Schoffelen, J.-M. (2011) FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput Intell Neurosci, 2011, 156869.
Pernet, C.R., Chauveau, N., Gaspar, C. & Rousselet, G.A. (2011) LIMO EEG: a toolbox for hierarchical LInear MOdeling of ElectroEncephaloGraphic data. Comput Intell Neurosci, 2011, 831409.
Pernet, C.R., Latinus, M., Nichols, T.E. & Rousselet, G.A. (2015) Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85–93.
Rousselet, G. (2016) Introduction to robust estimation of ERP data. figshare. Fileset. https://doi.org/10.6084/m9.figshare.3501728.v1
Smith, S.M. & Nichols, T.E. (2009) Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. Neuroimage, 44, 83–98.