This post looks at the coverage of confidence intervals for the difference between two independent correlation coefficients. Previously, we saw how the standard Fisher’s r-to-z transform can lead to inflated false positive rates when sampling from non-normal bivariate distributions and the population correlation differs from zero. In this post, we look at a complementary perspective: using simulations, we’re going to determine how often confidence intervals include the population difference. As we saw in our previous post, because we compute say 95% confidence intervals does not mean that over the long run, 95% of such confidence intervals will include the population we’re trying to estimate. In some situations, the coverage is much lower than expected, which means we might fool ourselves more often that we thought (although in practice in most discussions I’ve ever read, authors behave as if their 95% confidence intervals were very narrow 100% confidence intervals — but that’s another story).
We look at confidence interval coverage for the difference between Pearsons’ correlations using Zou’s method (2007) and a modified percentile bootstrap method (Wilcox, 2009). We do the same for the comparison of Spearmans’ correlations using the standard percentile bootstrap. We used simulations with 4,000 iterations. Sampling is from bivariate g & h distributions (see illustrations here).
We consider 4 cases:
g = h = 0, difference = 0.1, vary rho
g = 1, h = 0, difference = 0.1, vary rho
rho = 0.3, difference = 0.2, vary g, h = 0
rho = 0.3, difference = 0.2, vary g, h = 0.2
g = h = 0, difference = 0.1, vary rho
That’s the standard normal bivariate distribution. Group 1 has values of rho1 = 0 to 0.8, in steps of 0.1. Group 2 has values of rho2 = rho1 + 0.1.
For normal bivariate distributions, coverage is at the nominal level for all methods, sample sizes and population correlations. (Here I only considered sample sizes of 50+ because otherwise power is far too low, so there is no point.)
When CIs do not include the population value, are they located to the left or the right of the population? In the figure below, negative values indicate a preponderance of left shifts, positive values a preponderance of right shifts. A value of 1 = 100% right shifts, -1 = 100% left shifts. For Pearson, CIs not including the population value tend to be located evenly to the left and right of the population. For Spearman, there is a preponderance of left shifted CIs for rho1 = 0.8. This left shift implies a tendency to over-estimate the difference (the difference group 1 minus group 2 is negative).
g = 1, h = 0, vary rho
What happens when we sample from a skewed distribution?
The coverage is lower than the expected 95% for Zou’s method and the discrepancy worsens with increasing rho1 and with increasing sample size. The percentile bootstrap does a much better job. Spearman’s combined with the percentile bootstrap is spot on.
For CIs that did not include the population value, the pattern of shifts varies as a function of rho. For Pearson, CIs are more likely to be located to the right of the population (under-estimation of the population value or wrong sign) for rho = 0, whereas for rho = 0.8, CIs are more likely to be located to the left. Spearman + bootstrap produces much more balanced results.
To investigate the asymmetry, we look at CIs for g=1, a sample size of n = 200 and the extremes of the distributions, rho1 = 0 and rho2 = 0.8. The figure below shows the preponderance of right shifted CIs for the two Pearson methods. The vertical line marks the population difference of -0.1.
For rho1 = 0.8, the pattern changes to a preponderance of left shifts for all methods, which means that the CIs tended to over-estimate the population difference. CIs for differences between Spearman’s correlations were quite smaller than Pearson’s ones though, thus showing less bias and less uncertainty.
rho=0.3, diff=0.2, vary g, h = 0
For another perspective on the three methods, we now look at a case with:
group 1: rho1 = 0.3
group 2: rho2 = 0.5
we vary g from 0 to 1.
For Pearson + Zou, coverage progressively decreases with increasing g, and to a much more limited extent with increasing sample size. Pearson + bootstrap is much more resilient to changes in g. And Spearman + bootstrap just doesn’t care about asymmetry!
The better coverage of Pearson + bootstrap seems to be achieved by producing wider CIs.
Matters only get’s worse for Pearson + Zou when outliers are likely (see notebook on GitHub).
Based on this new comparison of the 3 methods, I’d argue again that Spearman + bootstrap should be preferred over the two Pearson methods. But if the goal is to assess linear relationships, then Pearson + bootstrap is preferable to Zou’s method. I’ll report on other methods in another post.
Wilcox, Rand R. Comparing Pearson Correlations: Dealing with Heteroscedasticity and Nonnormality. Communications in Statistics – Simulation and Computation 38, no. 10 (1 November 2009): 2220–34. https://doi.org/10.1080/03610910903289151.
(1) the Fisher’s z method to compare two independent correlations can give very inaccurate results when sampling from distributions that are skewed (asymmetric) or heavy tailed (high probability of outliers) and the population correlation rho differs from zero;
(2) even when sampling from normal distributions, Fisher’s z as well as bootstrap methods require very large sample sizes to detect differences, much larger than commonly used in neuroscience & psychology.
We start with an example and then consider the results of a few simulations. The R code is available on GitHub. It is part of a larger project and I’ll report in other blog posts on the rest of the simulations I’m currently running.
Let’s look at an example of correlation analysis reported in Davis et al. (2008). This article had 898 citations on June 11th 2019 according to Google scholar. In their figure 3, the authors reported 4 correlations: 2 correlations in 2 independent groups of participants. Across panels A and B, the frontal activity variable appears twice, so for each group of participants (old and young), the correlations in A and B are dependent and overlapping. Code on how to handle this case was covered in a previous post. Here we’re going to focus on the comparison of two independent correlations, which is like considering A or B and its own.
There are several problems with the analysis from Davis et al.:
within-participant variability was removed by averaging over trials and other variables, so equal weight is wrongly attributed to every observation;
the analyses are split into different correlations instead of attempting mixed-effect modelling with the groups of participants considered together;
association is assessed using Pearson’s correlation, which is not robust and suffers from many issues (it is sensitive to outliers, the magnitude of the slope around which points are clustered, curvature, the magnitude of the residuals, restriction of range, and heteroscedasticity — see details in Wilcox 2017);
sample sizes are very small (far too small to say anything meaningful actually, as described in this previous post);
if a correction for multiple comparisons is applied, nothing passes the magic 0.05 threshold;
the authors commit a classic interaction fallacy by reporting one P<0.05 correlation, the other not, without directly comparing them (Gelman & Stern 2006; Nieuwenhuis et al. 2011).
There is nothing special about the correlation analysis in Davis et al. (2008): in neuroscience and psychology these problems are very common. In the rest of this post we’re going to tackle only two of them: how to compare 2 independent Pearson’s correlations, and what sample sizes are required to tease them apart in the long-run.
The most common approach to compare 2 independent correlations is to use the Fisher’s r-to-z approach. Here is a snippet of R code for Fisher’s z, given r1 and r2 the correlations in group 1 and group 2, and n1 and n2 the corresponding sample sizes:
z1 <- atanh(r1)
z2 <- atanh(r2)
zobs <- (z1-z2) / sqrt( 1 / (n1-3) + 1 / (n2-3) )
pval <- 2 * pnorm(-abs(zobs))
This approach is implemented in the cocor R package for instance (Diedenhofen & Musch, 2015). The problem with this approach is its normality assumption: it works well when sampling from a bivariate normal distribution or when the two distributions have a population correlation rho of zero, but it misbehaves when sampling from asymmetric distributions, or from distributions with heavy tails, when rho differs from zero (e.g. Duncan & Layard, 1973; Berry & Mielke, 2000). And all methods based on some variant of the z transform suffer the same issues.
Simulations can be used to understand how a statistical procedure works in the long-run, over many identical experiments. Let’s imagine that we samplefrom bivariate g & h distributions, in which parameter g controls the asymmetry of the distributions and h controls the thickness of the tails(Hoaglin, 1985). You can see univariate g & h distributions illustrated here. Below you can see bivariate samples with n = 5,000, for combinations of g and h parameters, for a population correlation rho of 0:
and for a population correlation rho of 0.5:
Using g=h=0 gives a normal bivariate distribution.
When h=0.2, the tails are thicker, which means that outliers are more likely.
There is a large parameter space to cover, so here we only look at a subset of results, using simulations with 4,000 iterations. We’ll see examples in which we vary rho, the population correlation, or we vary g for a given rho.
First, we consider false positives, before turning to true positives (power).
For false positives, 2 independent random samples are taken from the same bivariate population, so on average the two correlations should not differ. Here are the results for different rhos and sample sizes for g=h=0, using Fisher’s z test and an alpha of 0.05:
So all is fine when sampling from normal distributions: the type I error rate, or proportion of false positives, is at the nominal 0.05 level (marked by a horizontal black line). The values are not exactly 0.05 because of the limited number of simulations, but they’re close enough.
What happens if we sample from slightly asymmetric populations instead (g=0.2)? Relative to the g=0 condition, the number of false positives increases a bit over 0.05 on average, and more so with increasing rho.
Things get worse if we sample from symmetric distributions in which outliers are likely (g=0, h=0.2):
If rho = 0 then the proportion of false positives is still around 0.05, bit it increases rapidly with rho, an effect exacerbated by sample size (a pattern already reported in Duncan & Layard, 1973). That’s a very bad feature, because, as we will see below, very large sample sizes are required to detect differences between correlation coefficients. Unless of course our experimental samples are from normal distributions, but that’s unlikely and usually unknown, so why make such a gamble? (A cursory look at the literature suggests that skewed distributions and outliers are frequent in correlation analyses —Rousselet & Pernet, 2012).
If we sample from asymmetric distributions in which outliers are likely (g=h=0.2), it just gets slightly worse than in the previous example:
What’s the alternative? According to Wilcox (2009), a percentile bootstrap can be used to compare Pearson’s correlations, leading to satisfactory proportions of false positives. If instead of Fisher’s z we use a percentile bootstrap to compare Pearson’s correlations when g=h=0.2, we get the following results:
Still above the expected 0.05 but much better than Fisher’s z!
To compare the two correlation coefficients using the percentile bootstrap, we proceed like this:
sample participants with replacement, independently in each group;
compute the two correlation coefficients based on the bootstrap samples;
save the difference between correlations;
execute the previous steps many times;
use the distribution of bootstrap differences to derive a confidence interval.
Usually a 95% confidence interval is defined as the 2.5th and 97.5th quantiles of the bootstrap distribution, but for Pearson’s correlation different quantiles are used depending on the sample size. This adjusted procedure is implemented in the R function twopcor() from Rand Wilcox. To compare two robust correlation coefficients, the adjustment is not necessary, a procedure implemented in twocor(). If for instance we use Spearman’s correlation when g=h=0.2, we get these results:
The proportion of false positives is at the nominal level and doesn’t change with rho! I’ll report simulations using the percentage bend correlation and the winsorised correlation, two other robust estimators, in another post.
To assess true positives, we first consider differences of 0.1: that is, if rho1 is 0, rho2 is 0.1, and so on for different rhos. Here are the results for g=h=0 using Fisher’s z:
For a difference of 0.1, power is overall very low, and it depends strongly on rho. The dependence of power on rho follows logically from the reduced sampling variability with increasing rho, which is demonstrated in a previous post. That post also illustrates the asymmetry of the sampling distributions when rho is different from zero. More generally, bounded distributions tend to be asymmetric, which implies that normality assumptions are routinely violated in psychology.
Changing g or h to 0.2 lowers power. Here is what happens with h=0.2 for instance:
But it’s difficult to make comparisons across figures. I would need to make plots of power differences, which is getting tedious for a blog post. So instead we can directly investigate the effect of g and h for fixed rho values. Here we only consider g.
First, we consider the case where rho1 = rho2 = 0.3, we vary g and keep h=0. Again, asymmetry has a strong effect on false positives from Fisher’s z test.
Using the bootstrap gives better results, but still with inflated false positives.
Spearman + bootstrap is better behaved:
What about power? Let’s consider rho1 = 0.3 and rho2 = 0.5 for Fisher’s z. Increasing asymmetry reduces power.
The percentile bootstrap gives even worse results:
Spearman + bootstrap is not affected by asymmetry at all it seems:
But again, notice the large number of trials required to detect an effect. Here we have a population difference of 0.2, and even assuming normality, 80% power requires well over 250 observations! And that’s if you’re ok to be wrong 20% of the time when there is an effect — seems quite a gamble. For 90% power you will need over 350 observations! But again that’s assuming normality… I haven’t looked at the effect of h on power in this scenario yet.
Finally, let’s consider a simulation in which the rhos are set to the values reported in Figure 3A of Davis et al. (2008): rho1 = 0, rho2 = 0.6. That’s a massive and unlikely difference, but let’s look at how many observations we need even in this improbable case.
To achieve at least 80% power given an expected population difference of 0.6 and g=0, the minimum sample size is 36 observations. For 90% power, the minimum sample size is 47 observations.
Now with g=1, to achieve at least 80% power the minimum sample size is 61 observations; to achieve at least 90% power the minimum sample size is 91 observations.
Contrast these results with the results obtained using a combination of Spearman + bootstrap:
To achieve at least 80% power given an expected population difference of 0.6 and g=0, the minimum sample size is 43 observations. For 90% power, the minimum sample size is 56 observations.
Now with g=1, to achieve at least 80% power the minimum sample size is 42 observations; to achieve at least 90% power the minimum sample size is 55 observations.
So under normality, Pearson + Fisher is more powerful than Spearman + bootstrap, but power can drop a lot for asymmetric distributions. Spearman is very resilient to asymmetry.
Pearson + bootstrap doesn’t fare well:
A good reminder that the bootstrap is not a magic recipe that can handle all situations well (Rousselet, Pernet & Wilcox, 2019).
For the normal case (g=h=0), we look at the proportion of true positives (power) for the difference between Pearsons’ correlations using Fisher’s r-to-z transform. We vary systematically the sampling size n, rho1 and the difference between rho1 and rho2. The title of each facet indicates rho1. The difference between rho1 and rho2 is colour coded. For instance, for rho1=0, a 0.1 difference means that rho2=0.1. For rho1=0.8, only a difference of 0.1 is considered, meaning that rho2=0.9. So, unless we deal with large differences, we need very large numbers of trials to detect differences between independent correlation coefficients.
Given the many problems associated with Pearson’s correlation, it’s probably wise to choose some other, robust alternative (Pernet et al. 2012). Whatever methods you choose, it seems that very large sample sizes are required to detect effects, certainly much larger than I expected before I ran the simulations. I’ll report on the behaviour of other methods in another post. Meanwhile, consider published correlation analyses from small samples with a grain of salt!
Given the present results, I would recommend to use Spearman + bootstrap to compare correlation coefficients. You could also consider a skipped Spearman, which is robust to multivariate outliers, but with long computation times relative to the other techniques mentioned above (Pernet et al. 2012).
Berry, K.J. & Mielke, P.W. (2000) A Monte Carlo investigation of the Fisher Z transformation for normal and nonnormal distributions. Psychol Rep, 87, 1101-1114.
Davis, S.W., Dennis, N.A., Daselaar, S.M., Fleck, M.S. & Cabeza, R. (2008) Que PASA? The posterior-anterior shift in aging. Cereb Cortex, 18, 1201-1209.
Diedenhofen, B. & Musch, J. (2015) cocor: a comprehensive solution for the statistical comparison of correlations. PLoS One, 10, e0121945.
Duncan, G.T. & Layard, M.W.J. (1973) Monte-Carlo Study of Asymptotically Robust Tests for Correlation-Coefficients. Biometrika, 60, 551-558.
Gelman, A. & Stern, H. (2006) The difference between “significant” and “not significant” is not itself statistically significant. Am Stat, 60, 328-331.
Nieuwenhuis, S., Forstmann, B.U. & Wagenmakers, E.J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci, 14, 1105-1107.
Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.
Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Frontiers in human neuroscience, 6, 119.
Rousselet, G.A., Pernet, C.R., & Wilcox, R.R. (2019) A practical introduction to the bootstrap: a versatile method to make inferences by using data-driven simulations (Preprint). PsyArXiv. https://doi.org/10.31234/osf.io/h8ft7
Wilcox, R.R. (2009) Comparing Pearson Correlations: Dealing with Heteroscedasticity and Nonnormality. Commun Stat-Simul C, 38, 2220-2234.
Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8 42 41-48 42 30.