In this post I explain the benefits of applying cluster based statistics, developed for brain imaging applications, to other experimental designs, in which tests are correlated. Here are some examples of such designs:
pain thresholds are measured on contiguous patches of skin;
insects are sampled from neighbouring fields;
participants undergo a series of training sessions.
In these examples, whatever is measured leads to statistical tests that are correlated in one or a combination of factors: time, space, stimulus parameters. In the frequentist framework, if the outcome of the family of tests is corrected for multiple comparisons using standard procedures (Bonferroni, Hochberg etc.), power will decrease with the number of tests. Cluster based correction for multiple comparison methods can keep false positives at the nominal level (say 0.05), without compromising power.
These types of dependencies can also be explicitly modelled using Gaussian processes (for a Bayesian example, see McElreath, 2018, chapter 13). Cluster-based statistics are much simpler to use, but they do not provide the important shrinkage afforded by hierarchical methods…
To get started, let’s consider an example involving a huge number of correlated tests. In this example, measurements are made at contiguous points in space (y axis) and time (x axis). The meaning of the axes is actually irrelevant – what matters is that the measurements are contiguous. In the figure below, left panel, we define our signal, which is composed of 2 clusters of high intensities among a sea of points with no effect (dark blue = 0). Fake measurements are then simulated by adding white noise to each point. By doing that 100 times, we obtain 100 noisy maps. The mean of these noisy maps is shown in the right panel.
We also create 100 maps composed entirely of noise. Then we perform a t-test for independent groups at each point in the map (n=100 per group).
What do we get? If we use a threshold of 0.05, we get two nice clusters of statistically significant tests where they are supposed to be. But we also get many false positives. If we try to get rid off the false positives by changing the thresholds, it works to some extent, but at the cost of removing true effects. Even with a threshold of 0.0005, there are still many false positives, and the clusters of true signal have been seriously truncated.
The problem is that lowering the alpha is a brute force technique that does not take into account information we have about the data: measurement points are correlated. There is a family of techniques that can correct for multiple comparisons by taking these dependencies into account: cluster based statistics (for an introduction, see Maris & Oostenveld, 2007). These techniques control the family-wise error rate but maintain high power. The family-wise error rate (FWER) is the probably to obtain at least one significant test among a family of tests, when the null hypothesis is true.
When we use a frequentist approach and perform a family of tests, we increase the probably of reporting false positives. The multiple comparison problem is difficult to tackle in many situations because of the need to balance false positives and false negatives. Probably the best known and most widely applied correction for multiple comparison technique is Bonferroni, in which the alpha threshold is divided by the number of comparisons. However, this procedure is notoriously conservative, as it comes at the cost of lower power. Many other techniques have been proposed (I don’t know of a good review paper on this topic – please add a comment if you do).
In the example below, two time-courses are compared point-by-point. Panel a shows the mean time-courses across participants. Panel b shows the time-course of the t-test for 2 dependent groups (the same participants were tested in each condition). Panel c shows time-points at which significant t-tests were observed. Without correction, a large cluster of significant points is observed, but also a collection of smaller clusters. We know from physiology that some of these clusters are too small to be true so they are very likely false positives.
If we change the significance threshold using the Bonferroni correction for multiple comparisons, in these examples we remove all significant clusters but the largest one. Good job?! The problem is that our large cluster has been truncated: it now looks like the effect starts later and ends sooner. The cluster-based inferences do not suffer from this problem.
Applied to our 2D example with two clusters embedded in noise, the clustering technique identifies 17,044 clusters of significant t-tests. After correction, only 2 clusters are significant!
So how do we compute cluster-based statistics? The next figure illustrates the different steps. At the top, we start with a time-course of F-values, from a series of point-by-point ANOVAs. Based on some threshold, say the critical F values for alpha = 0.05, we identify 3 clusters. The clusters are formed based on contiguity. For each cluster we then compute a summary statistics: it could be its duration (length), its height (maximum), or its sum. Here we use the sum. Now we ask a different question: for each cluster, is it likely to obtain that cluster sum by chance? To answer this question, we use non-parametric statistics to estimate the distribution expected by chance.
There are several ways to achieve this goal using permutation, percentile bootstrap or bootstrap-t methods (Pernet et al., 2015). Whatever technique we use, we simulate time-courses of F values expected by chance, given the data. For each of these simulated time-courses, we apply a threshold, identify clusters, take the sum of each cluster and save the maximum sum across clusters. If we do that 1,000 times, we end up with a collection of 1,000 cluster sums (shown in the top right corner of the figure). We then sort these values and identify a quantile of interest, say the 0.95 quantile. Finally, we use this quantile as our cluster-based threshold: each original cluster sum is then compared to that threshold. In our example, out of the 3 original clusters, the largest 2 are significant after cluster-based correction for multiple comparisons, whereas the smallest one is not.
From the description above, it is clear that using cluster-based statistics require a few choices:
a method to form clusters;
a choice of cluster statistics;
a choice of statistic to form the null distribution (max across clusters for instance);
a number of resamples…
Given a set of choices, we need to check that our method does what it’s supposed to do. So let’s run a few simulations…
First we consider the case of 5 dependent groups. The 5 measurements are correlated in time or space or some other factor, such that clusters can be formed by simple proximity: 2 significant tests are grouped in a cluster if they are next to each other. Data are normally distributed, the population SD is 1, and the correlation between repeated measures is 0.75. Here is the FWER after 10,000 simulations, in which we perform 5 one-sample t-tests on means.
With correction for multiple comparisons, the probability to get at least one false positive is well above the nominal level (here 0.05). The grey area marks Bradley’s (1978) satisfactory range of false positives (between 0.025 and 0.075). Bonferroni’s and Hochberg’s corrections dramatically reduce the FWER, as expected. For n = 10, the FWER remains quite high, but drops within the acceptable range for higher sample sizes. But these corrections tend to be conservative, leading to FWER systematically under 0.05 from n = 30. Using a cluster-based correction, the FWER is near the nominal level at all sample sizes.
The cluster correction was done using a bootstrap-t procedure, in which the original data are first mean-centred, so that the null hypothesis is true, and t distributions expected by chance are estimated by sampling the centred data with replacement 1,000 times, and each time computing a series of t-test. For each bootstrap, a max cluster sum statistics was saved and the 95th quantile of this distribution was used to threshold the original clusters.
Next we consider power. We sample from a population with 5 dependent conditions: there is no effect in conditions 1 and 5 (mean = 0), the mean is 1 for condition 3, and the mean is 0.5 for conditions 2 and 4. We could imagine a TMS experiment where participants first receive a sham stimulation, then stimulation of half intensity, full, half, and sham again… Below is an illustration of a random sample of data from 30 participants.
If we define power as the probability to observe a significant t-test simultaneously in conditions 3, 4 and 5, we get these results:
Maximum power is always obtain in the condition without correction, by definition. The cluster correction always reaches maximum possible power, except for n = 10. In contrast, Bonferroni and Hochberg lead to lower power, with Bonferroni being the most conservative. For a desired long run power value, we can use interpolation to find out the matching sample size. To achieve at least 80% power, the minimum sample size is:
50 observations for Hochberg;
57 observations for Bonferroni.
If we run the same simulation but with 7 dependent groups instead of 5, the pattern of results does not change, but the FWER increases if we do not apply any correction for multiple comparisons.
As for power, if we keep a cluster of effects with means 0.5, 1, 0.5 for conditions 3, 4 and 5, and zero effect for conditions 1, 2, 6 and 7, the power advantage of the cluster test increases. Now, to achieve at least 80% power, the minimum sample size is:
56 observations for Hochberg;
59 observations for Bonferroni.
Finally, we consider a situation with 7 independent groups. For instance, measurements were made in 7 contiguous fields. So the measurements are independent (done at different times), but there is spatial dependence between fields, so that we would expect that if a measurement is high in one field, it is likely to be high in the next field too. Here are the FWER results, showing a pattern similar to that in the previous examples:
The cluster correction does the best job at minimising false positives, whereas Bonferroni and Hochberg are too liberal for sample sizes 10 and 20.
To look at power, I created a simulation with a linear pattern: there is no effect in position 1, then a linear increase from 0 to a maximum effect size of 2 at position 7. Here is the sequence of effect sizes:
c(0, 0, 0.4, 0.8, 1.2, 1.6, 2)
And here is an example of a random sample with n = 70 measurements per group:
In this situation, again the cluster correction dominates the other methods in terms of power. For instance, to achieve at least 80% power, the minimum sample size is:
67 observations for Hochberg;
81 observations for Bonferroni.
I hope the examples above have convinced you that cluster-based statistics could dramatically increase your statistical power relative to standard techniques used to correct for multiple comparisons. Let me know if you use a different correction method and would like to see how they compare. Or you could re-use the simulation code and give it a go yourself.
Limitations: cluster-based methods make inferences about clusters, not about individual tests. Also, these methods require a threshold to form clusters, which is arbitrary and not convenient if you use non-parametric tests that do not come with p values. An alternative technique eliminates this requirement, instead forming a statistic that integrates across many potential cluster thresholds (TFCE, Smith & Nichols, 2009; Pernet et al. 2015). TFCE also affords inferences for each test, not the cluster of tests. But it is computationally much more demanding than the standard cluster test demonstrated in this post.
Matlab code for ERP analyses is available on figshare and as part of the LIMO EEG toolbox. The code can be used for other purposes – just pretend you’re dealing with one EEG electrode and Bob’s your uncle.
R code to reproduce the simulations is available on github. I’m planning to develop an R package to cover different experimental designs, using t-tests on means and trimmed means. In the meantime, if you’d like to apply the method but can’t make sense of my code, don’t hesitate to get in touch and I’ll try to help.
Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x.
Maris, E. & Oostenveld, R. (2007) Nonparametric statistical testing of EEG- and MEG-data. Journal of neuroscience methods, 164, 177-190.
McElreath, R. (2018) Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
Oostenveld, R., Fries, P., Maris, E. & Schoffelen, J.M. (2011) FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput Intell Neurosci, 2011, 156869.
Pernet, C.R., Chauveau, N., Gaspar, C. & Rousselet, G.A. (2011) LIMO EEG: a toolbox for hierarchical LInear MOdeling of ElectroEncephaloGraphic data. Comput Intell Neurosci, 2011, 831409.
Pernet, C.R., Latinus, M., Nichols, T.E. & Rousselet, G.A. (2015) Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of neuroscience methods, 250, 85-93.
Rousselet, Guillaume (2016): Introduction to robust estimation of ERP data. figshare. Fileset.
https://doi.org/10.6084/m9.figshare.3501728.v1
Smith, S.M. & Nichols, T.E. (2009) Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. Neuroimage, 44, 83-98.
In this post I’m going to show you a few simple steps to illustrate continuous distributions. As an example, we consider reaction time data, which are typically positively skewed and can differ in different ways. Reaction time distributions are also a rich source of information to constrain cognitive theories and models. So unless the distributions are at least illustrated, this information is lost (which is typically the case when distributions are summarised using a single value like the mean). Other approaches not covered here include explicit mathematical models of decision making and fitting functions to model the shape of the distributions (Balota & Yap, 2011).
For our current example, I made up data for 2 independent groups with four patterns of differences:
mostly late differences;
mostly early differences.
The R code is on GitHub.
For our first visualisation, we use geom_jitter()
from ggplot2
. The 1D scatterplots give us a good idea of how the groups differ but they’re not the easiest to read. The main reason is probably that we need to estimate local densities of points in different regions and compare them between groups.
For the purpose of this exercise, each group (g1 and g2) is composed of 1,000 observations, so the differences in shapes are quite striking. With smaller sample sizes the evaluation of these graphs could be much more challenging.
Relative to scatterplots, I find that kernel density plots make the comparisons between groups much easier.
Scatterplots and kernel density plots can be combined by using beeswarm plots. Here we create scatterplots shaped by local density using the geom_quasirandom()
function from the ggbeeswarm
package. Essentially, the function creates violin plots in which the constituent points are visible.
To make the plots even more informative, I’ve superimposed quantiles – here deciles computed using the Harrell-Davis quantile estimator. The deciles are represented by vertical black lines, with medians shown with thicker lines. Medians are informative about the location of the bulk of the observations and comparing the lower to upper quantiles let us appreciate the amount of asymmetry within distributions. Comparing quantiles between groups give us a sense of the amount of relative compression/expansion on each side of the distributions. This information would be lost if we only compared the medians.
If we remove the scatterplots and only show the quantiles, we obtain quantile plots, which provide a compact description of how distributions differ (please post a comment if you know of older references using quantile plots). Because the quantiles are superimposed, they are easier to compare than in the previous scatterplots. To help with the group comparisons, I’ve also added plots of the quantile differences, which emphasise the different patterns of group differences.
An alternative to quantiles are Vincentiles, which are computed by sorting the data and splitting them in equi-populated bins (there is the same number of observations in each bin). Then the mean is computed for each bin (Balota et al. 2008; Jiang et al. 2004). Below means were computed for 9 equi-populated bins. As expected from the way they are computed, quantile plots and Vincentile plots look very similar for our large samples from continuous variables.
Group quantile and Vincentile plots can be created by averaging quantiles and Vincentiles across participants (Balota & Yap, 2011; Ratcliff, 1979). This will be the topic of another post.
Related to quantile plots and Vincentile plots, delta plots show the difference between conditions, bin by bin (for each Vincentile) along the y-axis, as a function of the mean across conditions for each bin along the x-axis (De Jong et al., 1994). Not surprisingly, these plots have very similar shapes to the quantile difference plots we considered earlier.
Negative delta plots (nDP, delta plots with a negative slope) have received particular attention because of their theoretical importance (Ellinghaus & Miller, 2018; Schwarz & Miller, 2012).
Delta plots are related to the shift function, a powerful tool introduced in the 1970s: it consists in plotting the difference between the quantiles of two groups as a function of the quantiles in one group, with some measure of uncertainty around the difference (Doksum, 1974; Doksum & Sievers, 1976; Doksum, 1977). It was later refined by Rand Wilcox (Rousselet et al. 2017). This modern version is shown below, with deciles estimated using the Harrell-Davis quantile estimator, and percentile bootstrap confidence intervals of the quantile differences. The sign of the difference is colour-coded (purple for negative, orange for positive).
Unlike other graphical quantile techniques presented here, the shift function affords statistical inferences because of it’s use of confidence intervals (the shift function also comes in a few Bayesian flavours). It is probably one of the easiest ways to compare entire distributions, without resorting to explicit models of the distributions. But the shift function and the other graphical methods demonstrated in this post are not meant to compete with hierarchical models. Instead, they can be used to better understand data patterns within and between participants, before modelling attempts. They also provide powerful alternatives to the mindless application of t-tests and bar graphs, helping to nudge researchers away from the unique use of the mean (or the median) and towards considering the rich information available in continuous distributions.
Balota, D.A. & Yap, M.J. (2011) Moving Beyond the Mean in Studies of Mental Chronometry: The Power of Response Time Distributional Analyses. Curr Dir Psychol Sci, 20, 160-166.
Balota, D.A., Yap, M.J., Cortese, M.J. & Watson, J.M. (2008) Beyond mean response latency: Response time distributional analyses of semantic priming. J Mem Lang, 59, 495-523.
Clarke, E. & Sherrill-Mix, S. (2016) ggbeeswarm: Categorical Scatter (Violin Point) Plots.
De Jong, R., Liang, C.C. & Lauber, E. (1994) Conditional and Unconditional Automaticity – a Dual-Process Model of Effects of Spatial Stimulus – Response Correspondence. J Exp Psychol Human, 20, 731-750.
Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Ann Stat, 2, 267-277.
Doksum, K.A. (1977) Some graphical methods in statistics. A review and some extensions. Statistica Neerlandica, 31, 53-68.
Doksum, K.A. & Sievers, G.L. (1976) Plotting with Confidence – Graphical Comparisons of 2 Populations. Biometrika, 63, 421-434.
Ellinghaus, R. & Miller, J. (2018) Delta plots with negative-going slopes as a potential marker of decreasing response activation in masked semantic priming. Psychol Res, 82, 590-599.
Jiang, Y., Rouder, J.N. & Speckman, P.L. (2004) A note on the sampling properties of the Vincentizing (quantile averaging) procedure. J Math Psychol, 48, 186-195.
Ratcliff, R. (1979) Group Reaction-Time Distributions and an Analysis of Distribution Statistics. Psychol Bull, 86, 446-461.
Rousselet, G.A., Pernet, C.R. & Wilcox, R.R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. The European journal of neuroscience, 46, 1738-1748.
Schwarz, W. & Miller, J. (2012) Response time models of delta plots with negative-going slopes. Psychon B Rev, 19, 555-574.
When we estimate power curves, we ask this question: given some priors about the data generating process, the nature of the effect and measurement variance, what is the probability to detect an effect for a given statistical test (say using an arbitrary p<0.05 threshold) for various sample sizes and effect sizes. While there are very good reasons to focus on power estimation, this is not the only or the most important aspect of an experimental procedure to consider (Gelman & Carlin, 2014). Indeed, finding the number of observations needed so that we get p<0.05 in say 87% of experiments, is not the most exciting part of designing an experiment.
The relevant question is not “What is the power of a test?” but rather is “What might be expected to happen in studies of this size?” (Gelman & Carlin, 2014)
A related but more important question is that of measurement precision: given some priors and a certain number of participants, how close can we get to the unknown population value (Maxwell et al., 2008; Schönbrodt & Perugini, 2013; Peters & Crutzen, 2018)? Not surprisingly, measurement precision depends on sample size. As we saw in previous posts, sampling distributions get narrower with increasing sample sizes:
And with narrower sampling distributions, measurement precision increases. To illustrate, let’s consider an example from a lexical decision task – hundreds of reaction times (RT) were measured in hundreds of participants who had to distinguish between words and non-words presented on a computer screen.
Here are examples of RT distributions from 100 participants for each condition:
If we save the median of each distribution, for each participant and condition, we get these positively skewed group level distributions:
The distribution of pairwise differences between medians is also positively skewed:
Notably, most participants have a positive difference: 96.4% of participants are faster in the Word than the Non-Word condition – a potential case of stochastic dominance (Rouder & Haaf, 2018; see also this summary blog post).
Now let say we want to estimate the group difference between conditions. Because of the skewness at each level of analysis (within and across participants), we estimate the central tendency at each level using the median: that is, we compute the median for each participant and each condition, then compute the medians of medians across participants (a more detailed assessment could be obtained by performing hierarchical modelling or multiple quantile estimation for instance).
Then we can assess measurement precision at the group level by performing a multi-level simulation. In this simulation, we can ask, for instance, how often the group estimate is no more than 10 ms from the population value across many experiments. To simplify, in each iteration of the simulation, we draw 200 trials per condition and participant, compute the median and save the Non-Word – Word difference. Group estimation of the difference is then based on a random sample of 10 to 300 participants, with the group median computed across participants’ differences between medians. Because the dataset is very large at the two level of analysis, we can pretend we have access to the population values, and define them by first computing, for each condition, the median across all available trials for each participant, second by computing across all participants the median of the pairwise differences.
Having defined population values (the truth we’re trying to estimate, here a group difference of 78 ms), we can calculate measurement precision as the proportion of experiments in which the group estimate is no more than X ms from the population value, with X varying from 5 to 40 ms. Here are the results:
Not surprisingly, the proportion of estimates close to the population value increases with the number of participants. More interestingly, the relationship is non-linear, such that a larger gain in precision can be achieved by increasing sample size for instance from 10 to 20 compared to from 90 to 100.
The results also let us answer useful questions for planning experiments (see the black arrows in the above figure):
• So that in 70% of experiments the group estimate of the median is no more than 10 ms from the population value, we need to test at least 56 participants.
• So that in 90% of experiments the group estimate of the median is no more than 20 ms from the population value, we need to test at least 38 participants.
Obviously, this is just an example, about a narrow problem related to lexical decisions. Other aspects could be considered too, for instance the width of the confidence intervals (Maxwell, Kelley & Rausch, 2008; Peters & Crutzen, 2017; Rothman & Greenland, 2018). And for your particular case, most likely, you won’t have access to a large dataset from which to perform a data driven simulation. In this case, you can get estimates about plausible effect sizes and their variability from various sources (Gelman & Carlin 2014):
(systematic) literature review;
meta-analysis;
outputs of a hierarchical model;
modelling.
To model a range of plausible effect sizes and their consequences on repeated measurements, you need priors about a data generating process and how distributions differ between conditions. For instance, you could use exGaussian distributions to simulate RT data. For research on new effects, it is advised to consider a large range of potential effects, with their plausibility informed by the literature and psychological/biological constraints.
Although relying on the literature alone can lead to over-optimistic expectations because of the dominance of small n studies and a bias towards significant results (Yarkoni 2009; Button et al. 2013), methods are being developed to overcome these limitations (Anderson, Kelley & Maxwell, 2017). In the end, the best cure against effect size over-estimation is a combination of pre-registration/registered reports (to diminish literature bias) and data sharing (to let anyone do their own calculations and meta-analyses).
The code is on figshare: the simulation can be reproduced using the flp_sim_precision
notebook, the illustrations of the distributions can be reproduced using flp_illustrate_dataset
.
Anderson, S.F., Kelley, K. & Maxwell, S.E. (2017) Sample-Size Planning for More Accurate Statistical Power: A Method Adjusting Sample Effect Sizes for Publication Bias and Uncertainty. Psychol Sci, 28, 1547-1562.
The tyranny of power: is there a better way to calculate sample size? https://www.bmj.com/content/339/bmj.b3985)
Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S. & Munafo, M.R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature reviews. Neuroscience, 14, 365-376.
Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Meot, A., Augustinova, M. & Pallier, C. (2010) The French Lexicon Project: lexical decision data for 38,840 French words and 38,840 pseudowords. Behav Res Methods, 42, 488-496.
Gelman, A. & Carlin, J. (2014) Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspect Psychol Sci, 9, 641-651.
Maxwell, S.E., Kelley, K. & Rausch, J.R. (2008) Sample size planning for statistical power and accuracy in parameter estimation. Annu Rev Psychol, 59, 537-563.
Peters, G.-J.Y. & Crutzen, R. (2017) Knowing exactly how effective an intervention, treatment, or manipulation is and ensuring that a study replicates: accuracy in parameter estimation as a partial solution to the replication crisis. PsyArXiv. doi:10.31234/osf.io/cjsk2.
Rothman, K.J. & Greenland, S. (2018) Planning Study Size Based on Precision Rather Than Power. Epidemiology, 29, 599-603.
Rouder, J.N. & Haaf, J.M. (2018) Power, Dominance, and Constraint: A Note on the Appeal of Different Design Traditions. Advances in Methods and Practices in Psychological Science, 1, 19-26.
Rousselet, G.A. & Wilcox, R.R. (2018) Reaction times and other skewed distributions: problems with the mean and the median. bioRxiv. doi: https://doi.org/10.1101/383935
Rousselet, G.; Wilcox, R. (2018): Reaction times and other skewed distributions: problems with the mean and the median. figshare. Fileset. https://doi.org/10.6084/m9.figshare.6911924.v1
Schönbrodt, F.D. & Perugini, M. (2013) At what sample size do correlations stabilize? J Res Pers, 47, 609-612.
Yarkoni, T. (2009) Big Correlations in Little Studies: Inflated fMRI Correlations Reflect Low Statistical Power‚ Commentary on Vul et al. (2009). Perspectives on Psychological Science, 4, 294-298.
Following the previous posts on small n correlations [post 1][post 2][post 3], in this post we’re going to consider power estimation (if you do not care about power, but you’d rather focus on estimation, this post is for you).
To get started, let’s look at examples of n=1000 samples from bivariate populations with known correlations (rho), with rho increasing from 0.1 to 0.9 in steps of 0.1. For each rho, we draw a random sample and plot Y as a function of X. The variances of the two correlated variables are independent – there is homoscedasticity. Later we will look at heteroscedasticity, when the variance of Y varies with X.
For the same distributions illustrated in the previous figure, we compute the proportion of positive Pearson’s correlation tests for different sample sizes. This gives us power curves (here based on simulations with 50,000 samples). We also include rho = 0 to determine the proportion of false positives.
Power increases with sample size and with rho. When rho = 0, the proportion of positive tests is the proportion of false positives. It should be around 0.05 for a test with alpha = 0.05. This is the case here, as Pearson’s correlation is well behaved for bivariate normal data.
For a given expected population correlation and a desired long run power value, we can use interpolation to find out the matching sample size.
To achieve at least 80% power given an expected population rho of 0.4, the minimum sample size is 46 observations.
To achieve at least 90% power given an expected population rho of 0.3, the minimum sample size is 118 observations.
Alternatively, for a given sample size and a desired power, we can determine the minimum effect size we can hope to detect. For instance, given n = 40 and a desired power of at least 90%, the minimum effect size we can detect is 0.49.
So far, we have only considered situations where we sample from bivariate normal distributions. However, Wilcox (2012 p. 444-445) describes 6 aspects of data that affect Pearson’s r:
curvature
the magnitude of the residuals
restriction of range
heteroscedasticity
The effect of outliers on Pearson’s and Spearman’s correlations is described in detail in Pernet et al. (2012) and Rousselet et al. (2012).
Next we focus on heteroscedasticity. Let’s look at Wilcox’s heteroscedasticity example (2012, p. 445). If we correlate variable X with variable Y, heteroscedasticity means that the variance of Y depends on X. Wilcox considers this example:
“X and Y have normal distributions with both means equal to zero. […] X and Y have variance 1 unless |X|>0.5, in which case Y has standard deviation |X|.”
Here is an example of such data:
Next, Wilcox (2012) considers the effect of this heteroscedastic situation on false positives. We superimpose results for the homoscedastic case for comparison. In the homoscedastic case, as expected for a test with alpha = 0.05, the proportion of false positives is very close to 0.05 at every sample size. In the heteroscedastic case, instead of 5%, the number of false positives is between 12% and 19%. The number of false positives actually increases with sample size! That’s because the standard T statistics associated with Pearson’s correlation assumes homoscedasticity, so the formula is incorrect when there is heteroscedasticity.
As a consequence, when Pearson’s test is positive, it doesn’t always imply the existence of a correlation. There could be dependence due to heteroscedasticity, in the absence of a correlation.
Let’s consider another heteroscedastic situation, in which the variance of Y increases linearly with X. This could correspond for instance to situations in which cognitive performance or income are correlated with age – we might expect the variance amongst participants to increase with age.
We keep rho constant at 0.4 and increase the maximum variance from 1 (homoscedastic case) to 9. That is, the variance of Y linear increases from 1 to the maximum variance as a function of X.
For rho = 0, we can compute the proportion of false positives as a function of both sample size and heteroscedasticity. In the next figure, variance refers to the maximum variance.
From 0.05 for the homoscedastic case (max variance = 1), the proportion of false positives increases to 0.07-0.08 for a max variance of 9. This relatively small increase in the number of false positives could have important consequences if 100’s of labs are engaged in fishing expeditions and they publish everything with p<0.05. However, it seems we shouldn’t worry much about linear heteroscedasticity as long as sample sizes are sufficiently large and we report estimates with appropriate confidence intervals. An easy way to build confidence intervals when there is heteroscedasticity is to use the percentile bootstrap (see Pernet et al. 2012 for illustrations and Matlab code).
Finally, we can run the same simulation for rho = 0.4. Power progressively decreases with increasing heteroscedasticity. Put another way, with larger heteroscedasticity, larger sample sizes are needed to achieve the same power.
We can zoom in:
The vertical bars mark approximately a 13 observation increase to keep power at 0.8 between a max variance of 0 and 9. This decrease in power can be avoided by using the percentile bootstrap or robust correlation techniques, or both (Wilcox, 2012).
The results presented in this post are based on simulations. You could also use a sample size calculator for correlation analyses – for instance this one.
But running simulations has huge advantages. For instance, you can compare multiple estimators of association in various situations. In a simulation, you can also include as much information as you have about your target populations. For instance, if you want to correlate brain measurements with response times, there might be large datasets you could use to perform data-driven simulations (e.g. UK biobank), or you could estimate the shape of the sampling distributions to draw samples from appropriate theoretical distributions (maybe a gamma distribution for brain measurements and an exGaussian distribution for response times).
Simulations also put you in charge, instead of relying on a black box, which most likely will only cover Pearson’s correlation in ideal conditions, and not robust alternatives when there are outliers or heteroscedasticity or other potential issues.
The R code to reproduce the simulations and the figures is on GitHub.
Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.
Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Frontiers in human neuroscience, 6, 119.
Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.
Previously, we saw that with small sample sizes, correlation estimation is very uncertain, which implies that small n correlations cannot be trusted: the observed value in any experiment could be very far from the population value, and the sign could be wrong too. In addition to the uncertainty associated with small sample sizes, the selective report of results based on p values < 0.05 (or some other threshold), can lead to massively inflated correlation estimates in the literature (Yarkoni, 2009 ☜ if you haven’t done so, you really should read this excellent paper).
Let’s illustrate the problem (code is on GitHub). First, we consider a population rho = 0. Here is the sampling distribution as a function of sample size, as we saw in an earlier post.
Now, here is the sampling distribution conditional on p < 0.05. The estimates are massively inflated and the problem gets worse with smaller sample sizes, because the smaller the sample size, the larger the correlations must be by chance for them to be significant.
So no, don’t get too excited when you see a statistically significant correlation in a paper…
Let’s do the same exercise when the population correlation is relatively large. With rho = 0.4, the sampling distribution looks like this:
If we report only those correlations associated with p < 0.05, the distribution looks like this:
Again, with small sample sizes, the estimates are inflated, albeit in the correct direction. There is nevertheless a small number of large negative correlations (see small purple bump around -0.6 -0.8). Indeed, in 0.77% of simulations, even though the population value was 0.4, a large and p < 0.05 negative correlation was obtained.
As reviewer, editor and reader of research articles, I’m regularly annoyed by the low standards in correlation analyses. In my experience with such articles, typically:
To find out if my experience was in fact representative of the typical paper, I had a look at all papers published in 2017 in the European Journal of Neuroscience, where I’m a section editor. I care about the quality of the research published in EJN, so this is not an attempt at blaming a journal in particular, rather it’s a starting point to address a general problem. I really hope the results presented below will serve as a wake-up call for all involved and will lead to improvements in correlation analyses. Also, I bet if you look systematically at articles published in other neuroscience journals you’ll find the same problems. If you’re not convinced, go ahead, prove me wrong 😉
I proceeded like this: for all 2017 articles (volumes 45 and 46), I searched for “correl” and I scanned for figures of scatterplots. If either searches were negative, the article was categorised as not containing a correlation analysis, so I might have missed a few. When at least one correlation was present, I looked for these details:
164 articles reported no correlation.
7 articles used regression analyses, with sample sizes as low as n=6, n=10, n=12 in 3 articles.
48 articles reported correlations.
The norm was to not report degrees of freedom or sample size along with the correlation analyses or their illustrations. In 7 articles, the sample sizes were very difficult or impossible to guess. In the others, sample sizes varied a lot, both within and between articles. To confirm sample sizes, I counted the observations in scatterplots when they were available and not too crowded – this was a tedious job and I probably got some estimations and checks wrong. Anyway, I shouldn’t have to do all these checks, so something went wrong during the reviewing process.
To simplify the presentation of the results, I collapsed the sample size estimates across articles. Here is the distribution:
The figure omits 3 outliers with n= 836, 1397, 1407, all from the same article.
The median sample size is 18, which is far too low to provide sufficiently precise estimation.
The issue with low sample sizes is made worse by the predominant use of Pearson’s correlation or the lack of consideration for the type of estimator. Indeed, 21 articles did not mention the estimator used at all, but presumably they used Pearson’s correlation.
Among the 27 articles that did mention which estimator was used:
So the majority of studies used an estimator that is well-known for its lack of robustness and its inaccurate confidence intervals and p values (Pernet, Wilcox & Rousselet, 2012).
Most articles reported R and p values. Only 2 articles did not report R values. The same 2 articles also omitted p values, simply mentioning that the correlations were not significant. Another 3 articles did not report p values along with the R values.
Only 3 articles reported confidence intervals, without mentioning how they were computed. 1 article reported percentile bootstrap confidence intervals for Pearson’s correlations, which is the recommended procedure for this estimator (Pernet, Wilcox & Rousselet, 2012).
Given the lack of interest for measurement uncertainty demonstrated by the absence of confidence intervals in most articles, it is not surprising that only 5 articles mentioned the size of the correlation when presenting the results. All other articles simply reported the correlations as significant or not.
In contrast with the absence of confidence intervals and consideration for effect sizes, 23 articles reported illustrations for positive results. 4 articles reported only negative results, which leaves us with 21 articles that failed to illustrate the correlation results.
Among the 40 articles that reported negative results, only 13 illustrated them, which suggests a strong bias towards positive results.
Finally, I looked for interaction fallacies (Nieuwenhuis, Forstmann & Wagenmakers 2011). In the context of correlation analyses, you commit an interaction fallacy when you present two correlations, one significant, the other not, implying that the 2 differ, but without explicitly testing the interaction. In other versions of the interaction fallacy, two significant correlations with the same sign are presented together, implying either that the 2 are similar, or that one is stronger than the other, without providing a confidence interval for the correlation difference. You can easily guess the other flavours…
10 articles presented only one correlation, so there was no scope for the interaction fallacy. Among the 38 articles that presented more than one correlation, only one provided an explicit test for the comparison of 2 correlations. However, the authors omitted the explicit test for their next comparison!
In conclusion, at least in 2017 EJN articles, the norm is to estimate associations using small sample sizes and a non-robust estimator, to not provide confidence intervals and to not consider effect sizes and measurement uncertainty when presenting the results. Also, positive results are more likely to be illustrated than negative ones. Finally, interaction fallacies are mainstream.
How can we do a better job?
If you want to do a correlation analysis, consider your sample size carefully to assess statistical power and even better, your long-term estimation precision. If you have a small n, I wouldn’t even look at the correlation.
Do not use Pearson’s correlation unless you have well-behaved and large samples, and you are only interested in linear relationships; otherwise explore robust measures of associations and techniques that provide valid confidence intervals (Pernet, Wilcox & Rousselet, 2012; Wilcox & Rousselet, 2018).
These details are essential in articles reporting correlation analyses:
Report p values if you want but they are not essential and should not be given a special status (McShane et al. 2018).
Finally, are you sure you really want to compute a correlation?
“Why then are correlation coefficients so attractive? Only bad reasons seem to come to mind. Worst of all, probably, is the absence of any need to think about units for either variable. Given two perfectly meaningless variables, one is reminded of their meaninglessness when a regression coefficient is given, since one wonders how to interpret its value. A correlation coefficient is less likely to bring up the unpleasant truth—we think we know what r = —.7 means. Do we? How often? Sweeping things under the rug is the enemy of good data analysis. Often, using the correlation coefficient is “sweeping under the rug” with a vengeance. Being so disinterested in our variables that we do not care about their units can hardly be desirable.”
Analyzing data: Sanctification or detective work?John W. Tukey. American Psychologist, Vol 24(2), Feb 1969, 83-91. http://dx.doi.org/10.1037/h0027108
McShane, B.B., Gal, D., Gelman, A., Robert, C. & Tackett, J.L. (2018) Abandon Statistical Significance. arxiv.
Nieuwenhuis, S., Forstmann, B.U. & Wagenmakers, E.J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci, 14, 1105-1107.
Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.
Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Frontiers in human neuroscience, 6, 119.
Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8 42 41-48 42 30.
[preprint]