# Power estimation for correlation analyses

Following the previous posts on small n correlations [post 1][post 2][post 3], in this post we’re going to consider power estimation (if you do not care about power and would rather focus on estimation, an earlier post is for you instead).

To get started, let’s look at examples of n = 1000 samples from bivariate populations with known correlations (rho), with rho increasing from 0.1 to 0.9 in steps of 0.1. For each rho, we draw a random sample and plot Y as a function of X. The variance of Y does not depend on X: there is homoscedasticity. Later we will look at heteroscedasticity, when the variance of Y varies with X.

For the same distributions illustrated in the previous figure, we compute the proportion of positive Pearson’s correlation tests for different sample sizes. This gives us power curves (here based on simulations with 50,000 samples). We also include rho = 0 to determine the proportion of false positives.

Power increases with sample size and with rho. When rho = 0, the proportion of positive tests is the proportion of false positives. It should be around 0.05 for a test with alpha = 0.05. This is the case here, as Pearson’s correlation is well behaved for bivariate normal data.
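A stripped-down version of this kind of power simulation can be sketched in Python (the figures above come from R code that is not reproduced here). The helper names `pearson_r`, `t_crit` and `positive_rate` are hypothetical, the number of simulations is far smaller than the 50,000 used above, and the critical value of Student's t is obtained by numerical integration so that only the standard library is needed.

```python
import math
import random


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_crit(df, alpha=0.05, steps=2000):
    """Two-sided critical value of Student's t (Simpson's rule + bisection)."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    const /= math.sqrt(df * math.pi)

    def pdf(t):
        return const * (1 + t * t / df) ** (-(df + 1) / 2)

    def area(upper):
        # Integral of the density from 0 to `upper` (steps must be even).
        h = upper / steps
        total = pdf(0.0) + pdf(upper)
        total += sum((4 if i % 2 else 2) * pdf(i * h) for i in range(1, steps))
        return total * h / 3

    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if area(mid) < 0.5 - alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def positive_rate(rho, n, nsim=2000, alpha=0.05, seed=1):
    """Proportion of positive Pearson tests for bivariate normal samples."""
    rng = random.Random(seed)
    tc = t_crit(n - 2, alpha)
    r_crit = tc / math.sqrt(n - 2 + tc * tc)  # |t| > tc  <=>  |r| > r_crit
    resid_sd = math.sqrt(1 - rho * rho)
    hits = 0
    for _ in range(nsim):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rho * a + resid_sd * rng.gauss(0, 1) for a in x]
        hits += abs(pearson_r(x, y)) > r_crit
    return hits / nsim


if __name__ == "__main__":
    for rho in (0.0, 0.4):
        for n in (20, 50):
            print(f"rho={rho}, n={n}: {positive_rate(rho, n):.3f}")
```

With rho = 0 the returned proportion estimates the false positive rate; with any other rho it estimates power. Looping over a finer grid of sample sizes gives a rough power curve.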

For a given expected population correlation and a desired long run power value, we can use interpolation to find out the matching sample size.

To achieve at least 80% power given an expected population rho of 0.4, the minimum sample size is 46 observations.

To achieve at least 90% power given an expected population rho of 0.3, the minimum sample size is 118 observations.

Alternatively, for a given sample size and a desired power, we can determine the minimum effect size we can hope to detect. For instance, given n = 40 and a desired power of at least 90%, the minimum effect size we can detect is 0.49.
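The interpolation step can be sketched as follows. To keep the example short, the power curve is not simulated: it comes from the Fisher z approximation to the power of a correlation test, which is an assumption of this sketch and a stand-in for the simulated curves. The resulting sample sizes should therefore be close to, but not necessarily identical to, the simulation-based values quoted above. `approx_power` and `min_n_for_power` are hypothetical names.

```python
import math
from statistics import NormalDist


def approx_power(rho, n, alpha=0.05):
    """Fisher z approximation to the two-sided power of a correlation test."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    shift = math.atanh(rho) * math.sqrt(n - 3)
    return std.cdf(shift - z_crit) + std.cdf(-shift - z_crit)


def min_n_for_power(ns, powers, target):
    """Linearly interpolate a power curve sampled at sizes `ns`.

    Returns the smallest integer n whose interpolated power reaches
    `target`, or None if the curve never gets there.
    """
    if powers[0] >= target:
        return ns[0]
    for i in range(1, len(ns)):
        n0, p0 = ns[i - 1], powers[i - 1]
        n1, p1 = ns[i], powers[i]
        if p0 < target <= p1:
            return math.ceil(n0 + (n1 - n0) * (target - p0) / (p1 - p0))
    return None


if __name__ == "__main__":
    ns = list(range(10, 201, 10))
    for rho, target in ((0.4, 0.8), (0.3, 0.9)):
        curve = [approx_power(rho, n) for n in ns]
        print(f"rho={rho}, power>={target}: n={min_n_for_power(ns, curve, target)}")
```

Linear interpolation on a coarse grid is itself approximate, so a finer grid of sample sizes around the crossing point gives a more trustworthy answer.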

So far, we have only considered situations where we sample from bivariate normal distributions. However, Wilcox (2012, p. 444-445) describes 6 aspects of data that affect Pearson’s r:

• outliers

• the magnitude of the slope around which points are clustered

• curvature

• the magnitude of the residuals

• restriction of range

• heteroscedasticity

The effect of outliers on Pearson’s and Spearman’s correlations is described in detail in Pernet et al. (2012) and Rousselet & Pernet (2012).

Next we focus on heteroscedasticity. Let’s look at Wilcox’s heteroscedasticity example (2012, p. 445). If we correlate variable X with variable Y, heteroscedasticity means that the variance of Y depends on X. Wilcox considers this example:

“X and Y have normal distributions with both means equal to zero. […] X and Y have variance 1 unless |X|>0.5, in which case Y has standard deviation |X|.”

Here is an example of such data:

Next, Wilcox (2012) considers the effect of this heteroscedastic situation on false positives. We superimpose results for the homoscedastic case for comparison. In the homoscedastic case, as expected for a test with alpha = 0.05, the proportion of false positives is very close to 0.05 at every sample size. In the heteroscedastic case, instead of 5%, the proportion of false positives is between 12% and 19%, and it actually increases with sample size! That’s because the standard t statistic associated with Pearson’s correlation assumes homoscedasticity, so the formula is incorrect when there is heteroscedasticity.

As a consequence, when Pearson’s test is positive, it doesn’t always imply the existence of a correlation. There could be dependence due to heteroscedasticity, in the absence of a correlation.
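Wilcox’s example is easy to simulate. The sketch below, in Python rather than R, generates X and Y as described in the quote and counts positive tests at n = 30 only. The value 2.048 is the tabled two-sided 5% critical value of Student's t with 28 degrees of freedom; the function names and the number of simulations are arbitrary choices of this sketch.

```python
import math
import random

N = 30
T_CRIT = 2.048  # two-sided 5% critical value of Student's t, df = N - 2 = 28


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def is_positive(x, y):
    """Standard test: T = r * sqrt((n - 2) / (1 - r^2)) compared to T_CRIT."""
    r = pearson_r(x, y)
    t = r * math.sqrt((len(x) - 2) / (1 - r * r))
    return abs(t) > T_CRIT


def false_positive_rate(heteroscedastic, nsim=4000, seed=2):
    """Proportion of positive tests when the population correlation is 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(nsim):
        x = [rng.gauss(0, 1) for _ in range(N)]
        if heteroscedastic:
            # SD of Y is 1 when |X| <= 0.5 and |X| otherwise; rho is still 0.
            y = [rng.gauss(0, 1) * (abs(a) if abs(a) > 0.5 else 1.0) for a in x]
        else:
            y = [rng.gauss(0, 1) for _ in x]
        hits += is_positive(x, y)
    return hits / nsim


if __name__ == "__main__":
    print("homoscedastic:  ", false_positive_rate(False))
    print("heteroscedastic:", false_positive_rate(True))
```

In both conditions X and Y are uncorrelated, so every positive test is a false positive; only the dependence of the spread of Y on X differs.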

Let’s consider another heteroscedastic situation, in which the variance of Y increases linearly with X. This could correspond for instance to situations in which cognitive performance or income are correlated with age – we might expect the variance amongst participants to increase with age.

We keep rho constant at 0.4 and increase the maximum variance from 1 (homoscedastic case) to 9. That is, the variance of Y increases linearly from 1 to the maximum variance as a function of X.

For rho = 0, we can compute the proportion of false positives as a function of both sample size and heteroscedasticity. In the next figure, variance refers to the maximum variance.

From 0.05 for the homoscedastic case (max variance = 1), the proportion of false positives increases to 0.07-0.08 for a max variance of 9. This relatively small increase in the proportion of false positives could have important consequences if hundreds of labs are engaged in fishing expeditions and they publish everything with p<0.05. However, it seems we shouldn’t worry much about linear heteroscedasticity as long as sample sizes are sufficiently large and we report estimates with appropriate confidence intervals. An easy way to build confidence intervals when there is heteroscedasticity is to use the percentile bootstrap (see Pernet et al. 2012 for illustrations and Matlab code).
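For readers who have not met it, the percentile bootstrap amounts to resampling (X, Y) pairs with replacement, recomputing the correlation each time, and reading the confidence limits off the sorted bootstrap estimates. Here is a minimal Python sketch of the plain version. Some published variants adjust the quantiles for Pearson’s correlation in small samples; that refinement is not included here. The toy data generator (variance of the noise rising linearly from 1 to 9 over the range of X) is only an assumption made for illustration, not the generator used for the figures.

```python
import math
import random


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def toy_heteroscedastic_sample(n=100, slope=2.0, max_var=9.0, seed=3):
    """Illustrative data: X uniform on [0, 1], noise variance 1 + (max_var - 1) * X."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    y = [slope * a + math.sqrt(1 + (max_var - 1) * a) * rng.gauss(0, 1) for a in x]
    return x, y


def percentile_bootstrap_ci(x, y, nboot=1000, alpha=0.05, seed=3):
    """Plain percentile bootstrap CI for Pearson's r, resampling pairs."""
    rng = random.Random(seed)
    n = len(x)
    boot = []
    for _ in range(nboot):
        idx = [rng.randrange(n) for _ in range(n)]
        boot.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    boot.sort()
    lo = boot[int(round(alpha / 2 * nboot))]
    hi = boot[int(round((1 - alpha / 2) * nboot)) - 1]
    return lo, hi


if __name__ == "__main__":
    x, y = toy_heteroscedastic_sample()
    print("r =", round(pearson_r(x, y), 3))
    print("95% percentile bootstrap CI:", percentile_bootstrap_ci(x, y))
```

Because pairs are resampled together, whatever relationship exists between the spread of Y and X is preserved in each bootstrap sample, which is why the approach does not rely on the homoscedasticity assumption.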

Finally, we can run the same simulation for rho = 0.4. Power progressively decreases with increasing heteroscedasticity. Put another way, with larger heteroscedasticity, larger sample sizes are needed to achieve the same power.

We can zoom in:

The vertical bars mark an increase of approximately 13 observations needed to keep power at 0.8 when the max variance goes from 1 to 9. This decrease in power can be avoided by using the percentile bootstrap or robust correlation techniques, or both (Wilcox, 2012).

# Conclusion

The results presented in this post are based on simulations. You could also use a sample size calculator for correlation analyses – for instance this one.

But running simulations has huge advantages. For instance, you can compare multiple estimators of association in various situations. In a simulation, you can also include as much information as you have about your target populations. For instance, if you want to correlate brain measurements with response times, there might be large datasets you could use to perform data-driven simulations (e.g. UK biobank), or you could estimate the shape of the sampling distributions to draw samples from appropriate theoretical distributions (maybe a gamma distribution for brain measurements and an exGaussian distribution for response times).

Simulations also put you in charge, instead of relying on a black box, which most likely will only cover Pearson’s correlation in ideal conditions, and not robust alternatives when there are outliers or heteroscedasticity or other potential issues.

The R code to reproduce the simulations and the figures is on GitHub.

# References

Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source Matlab toolbox. Front Psychol, 3, 606.

Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Front Hum Neurosci, 6, 119.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

# Small n correlations + p values = disaster

Previously, we saw that with small sample sizes, correlation estimation is very uncertain, which implies that small n correlations cannot be trusted: the observed value in any experiment could be very far from the population value, and the sign could be wrong too. In addition to the uncertainty associated with small sample sizes, the selective reporting of results based on p values < 0.05 (or some other threshold) can lead to massively inflated correlation estimates in the literature (Yarkoni, 2009 ☜ if you haven’t done so, you really should read this excellent paper).

Let’s illustrate the problem (code is on GitHub). First, we consider a population rho = 0. Here is the sampling distribution as a function of sample size, as we saw in an earlier post.

Figure 1: Sampling distribution for rho=0.

Now, here is the sampling distribution conditional on p < 0.05. The estimates are massively inflated and the problem gets worse with smaller sample sizes, because the smaller the sample size, the larger the correlations must be by chance for them to be significant.

Figure 2: Sampling distribution for rho=0, given p<0.05.

So no, don’t get too excited when you see a statistically significant correlation in a paper…
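The inflation is easy to reproduce. Here is a small Python sketch (not the GitHub code mentioned above) that simulates experiments with rho = 0, keeps the estimates with p < 0.05, and compares their typical magnitude with that of all estimates. The critical values in `T_CRIT` are the tabled two-sided 5% points of Student's t for 8 and 28 degrees of freedom; the function names and simulation settings are arbitrary.

```python
import math
import random
import statistics

# Tabled two-sided 5% critical values of Student's t, keyed by n (df = n - 2).
T_CRIT = {10: 2.306, 30: 2.048}


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_estimates(rho, n, nsim=4000, seed=4):
    """Return (all estimates, estimates with p < 0.05) over nsim experiments."""
    rng = random.Random(seed)
    tc = T_CRIT[n]
    r_crit = tc / math.sqrt(n - 2 + tc * tc)  # smallest |r| reaching p < 0.05
    resid_sd = math.sqrt(1 - rho * rho)
    all_r, sig_r = [], []
    for _ in range(nsim):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rho * a + resid_sd * rng.gauss(0, 1) for a in x]
        r = pearson_r(x, y)
        all_r.append(r)
        if abs(r) > r_crit:
            sig_r.append(r)
    return all_r, sig_r


if __name__ == "__main__":
    for n in (10, 30):
        all_r, sig_r = simulate_estimates(0.0, n)
        med_all = statistics.median(abs(r) for r in all_r)
        med_sig = statistics.median(abs(r) for r in sig_r)
        print(f"n={n}: median |r| all={med_all:.2f}, given p<0.05={med_sig:.2f}")
```

The selection threshold `r_crit` depends only on n, which is the whole problem: with n = 10 no estimate smaller than about 0.63 in absolute value can ever be reported as significant.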

Let’s do the same exercise when the population correlation is relatively large. With rho = 0.4, the sampling distribution looks like this:

Figure 3: Sampling distribution for rho=0.4.

If we report only those correlations associated with p < 0.05, the distribution looks like this:

Figure 4: Sampling distribution for rho=0.4, given p<0.05.

Again, with small sample sizes, the estimates are inflated, albeit in the correct direction. There is nevertheless a small number of large negative correlations (see the small purple bump around -0.6 to -0.8). Indeed, in 0.77% of simulations, even though the population value was 0.4, a large negative correlation with p < 0.05 was obtained.

# Correlations in neuroscience: are small n, interaction fallacies, lack of illustrations and confidence intervals the norm?

As a reviewer, editor and reader of research articles, I’m regularly annoyed by the low standards in correlation analyses. In my experience with such articles, typically:

• Pearson’s correlation, a non-robust measure of association, is used;
• R and p values are reported, but not confidence intervals;
• sample sizes tend to be small, leading to large estimation bias and inflated effect sizes in the literature;
• R values and confidence intervals are not considered when interpreting the results;
• instead, most analyses are reported as significant or non-significant (p<0.05), leading to the conclusion that an association exists or not (frequentist fallacy);
• often figures illustrating the correlations are absent;
• the explicit or implicit comparison of two correlations is done without a formal test (interaction fallacy).

To find out if my experience was in fact representative of the typical paper, I had a look at all papers published in 2017 in the European Journal of Neuroscience, where I’m a section editor. I care about the quality of the research published in EJN, so this is not an attempt at blaming a journal in particular, rather it’s a starting point to address a general problem. I really hope the results presented below will serve as a wake-up call for all involved and will lead to improvements in correlation analyses. Also, I bet if you look systematically at articles published in other neuroscience journals you’ll find the same problems. If you’re not convinced, go ahead, prove me wrong 😉

I proceeded like this: for all 2017 articles (volumes 45 and 46), I searched for “correl” and I scanned for figures of scatterplots. If either search was negative, the article was categorised as not containing a correlation analysis, so I might have missed a few. When at least one correlation was present, I looked for these details:

• n
• estimator
• confidence interval
• R
• p value
• consideration of effect sizes
• figure illustrating positive result
• figure illustrating negative result
• interaction test.

164 articles reported no correlation.

7 articles used regression analyses, with sample sizes as low as n=6, n=10, n=12 in 3 articles.

48 articles reported correlations.

# Sample size

The norm was to not report degrees of freedom or sample size along with the correlation analyses or their illustrations. In 7 articles, the sample sizes were very difficult or impossible to guess. In the others, sample sizes varied a lot, both within and between articles. To confirm sample sizes, I counted the observations in scatterplots when they were available and not too crowded – this was a tedious job and I probably got some estimations and checks wrong. Anyway, I shouldn’t have to do all these checks, so something went wrong during the reviewing process.

To simplify the presentation of the results, I collapsed the sample size estimates across articles. Here is the distribution:

The figure omits 3 outliers with n = 836, 1397, 1407, all from the same article.

The median sample size is 18, which is far too low to provide sufficiently precise estimation.

# Estimator

The issue with low sample sizes is made worse by the predominant use of Pearson’s correlation or the lack of consideration for the type of estimator. Indeed, 21 articles did not mention the estimator used at all, but presumably they used Pearson’s correlation.

Among the 27 articles that did mention which estimator was used:

• 11 used only Pearson’s correlation;
• 11 used only Spearman’s correlation;
• 4 used Pearson’s and Spearman’s correlations;
• 1 used Spearman’s and Kendall’s correlations.

So the majority of studies used an estimator that is well-known for its lack of robustness and its inaccurate confidence intervals and p values (Pernet, Wilcox & Rousselet, 2012).

# R & p values

Most articles reported R and p values. Only 2 articles did not report R values. The same 2 articles also omitted p values, simply mentioning that the correlations were not significant. Another 3 articles did not report p values along with the R values.

# Confidence interval

Only 3 articles reported confidence intervals, without mentioning how they were computed. 1 article reported percentile bootstrap confidence intervals for Pearson’s correlations, which is the recommended procedure for this estimator (Pernet, Wilcox & Rousselet, 2012).

# Consideration for effect sizes

Given the lack of interest in measurement uncertainty demonstrated by the absence of confidence intervals in most articles, it is not surprising that only 5 articles mentioned the size of the correlation when presenting the results. All other articles simply reported the correlations as significant or not.

# Illustrations

In contrast with the absence of confidence intervals and consideration for effect sizes, 23 articles reported illustrations for positive results. 4 articles reported only negative results, which leaves us with 21 articles that failed to illustrate the correlation results.

Among the 40 articles that reported negative results, only 13 illustrated them, which suggests a strong bias towards positive results.

# Interaction test

Finally, I looked for interaction fallacies (Nieuwenhuis, Forstmann & Wagenmakers 2011). In the context of correlation analyses, you commit an interaction fallacy when you present two correlations, one significant, the other not, implying that the 2 differ, but without explicitly testing the interaction. In other versions of the interaction fallacy, two significant correlations with the same sign are presented together, implying either that the 2 are similar, or that one is stronger than the other, without providing a confidence interval for the correlation difference. You can easily guess the other flavours…

10 articles presented only one correlation, so there was no scope for the interaction fallacy. Among the 38 articles that presented more than one correlation, only one provided an explicit test for the comparison of 2 correlations. However, the authors omitted the explicit test for their next comparison!
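What would an explicit comparison look like? One option, in keeping with the bootstrap approach recommended in this post, is a percentile bootstrap confidence interval for the difference between two correlations. The Python sketch below assumes the two correlations come from independent groups; when they share participants or a variable, the resampling has to be done on participants jointly, which is not shown here. All names and the simulated data are made up for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_r(x, y, rng):
    """Correlation computed on one bootstrap resample of (x, y) pairs."""
    n = len(x)
    idx = [rng.randrange(n) for _ in range(n)]
    return pearson_r([x[i] for i in idx], [y[i] for i in idx])


def difference_ci(x1, y1, x2, y2, nboot=1000, alpha=0.05, seed=5):
    """Percentile bootstrap CI for r1 - r2, two independent groups."""
    rng = random.Random(seed)
    diffs = sorted(
        bootstrap_r(x1, y1, rng) - bootstrap_r(x2, y2, rng) for _ in range(nboot)
    )
    lo = diffs[int(round(alpha / 2 * nboot))]
    hi = diffs[int(round((1 - alpha / 2) * nboot)) - 1]
    return lo, hi


def simulated_group(rho, n, rng):
    """Made-up bivariate normal data for the demonstration."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * a + math.sqrt(1 - rho * rho) * rng.gauss(0, 1) for a in x]
    return x, y


if __name__ == "__main__":
    rng = random.Random(50)
    x1, y1 = simulated_group(0.5, 30, rng)
    x2, y2 = simulated_group(0.1, 30, rng)
    print("r1 =", round(pearson_r(x1, y1), 2), " r2 =", round(pearson_r(x2, y2), 2))
    print("95% CI for r1 - r2:", difference_ci(x1, y1, x2, y2))
```

The interval is what licenses a statement about the two correlations being different or similar; one significant and one non-significant p value do not.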

# Recommendations

In conclusion, at least in 2017 EJN articles, the norm is to estimate associations using small sample sizes and a non-robust estimator, to not provide confidence intervals and to not consider effect sizes and measurement uncertainty when presenting the results. Also, positive results are more likely to be illustrated than negative ones. Finally, interaction fallacies are mainstream.

How can we do a better job?

If you want to do a correlation analysis, consider your sample size carefully to assess statistical power and even better, your long-term estimation precision. If you have a small n, I wouldn’t even look at the correlation.

Do not use Pearson’s correlation unless you have well-behaved and large samples, and you are only interested in linear relationships; otherwise explore robust measures of associations and techniques that provide valid confidence intervals (Pernet, Wilcox & Rousselet, 2012; Wilcox & Rousselet, 2018).

## Reporting

These details are essential in articles reporting correlation analyses:

• sample size for each correlation;
• estimator of association;
• R value;
• confidence interval;
• scatterplot illustration of every correlation, irrespective of the p value;
• explicit comparison test of all correlations explicitly or implicitly compared;
• consideration of effect sizes (R values) and their uncertainty (confidence intervals) in the interpretation of the results.

Report p values if you want but they are not essential and should not be given a special status (McShane et al. 2018).

Finally, are you sure you really want to compute a correlation?

“Why then are correlation coefficients so attractive? Only bad reasons seem to come to mind. Worst of all, probably, is the absence of any need to think about units for either variable. Given two perfectly meaningless variables, one is reminded of their meaninglessness when a regression coefficient is given, since one wonders how to interpret its value. A correlation coefficient is less likely to bring up the unpleasant truth – we think we know what r = -.7 means. Do we? How often? Sweeping things under the rug is the enemy of good data analysis. Often, using the correlation coefficient is “sweeping under the rug” with a vengeance. Being so disinterested in our variables that we do not care about their units can hardly be desirable.”

John W. Tukey (1969) Analyzing data: Sanctification or detective work? American Psychologist, 24(2), 83-91. http://dx.doi.org/10.1037/h0027108

# References

McShane, B.B., Gal, D., Gelman, A., Robert, C. & Tackett, J.L. (2018) Abandon Statistical Significance. arXiv.

Nieuwenhuis, S., Forstmann, B.U. & Wagenmakers, E.J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci, 14, 1105-1107.

Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source Matlab toolbox. Front Psychol, 3, 606.

Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Front Hum Neurosci, 6, 119.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8.42.1-8.42.30.

[preprint]

# Small n correlations cannot be trusted

This post illustrates two important effects of sample size on the estimation of correlation coefficients: lower sample sizes are associated with increased variability and lower probability of replication. This is not specific to correlations, but here we’re going to have a detailed look at what it means when using the popular Pearson’s correlation (similar results are obtained using Spearman’s correlation, and the same problems arise with regression). The R code is available on GitHub.

UPDATE: 2018-06-02

In the original post, I mentioned non-linearities in some of the figures. Jan Vanhove replied on Twitter that he was not getting any, and suggested a different code snippet. I’ve updated the simulations using his code, and now the non-linearities are gone! So thanks Jan!

Johannes Algermissen mentioned on Twitter that his recent paper covered similar issues. Have a look! He also reminded me about this recent paper that makes points very similar to those in this blog.

Gjalt-Jorn Peters mentioned on Twitter that “you can also use the Pearson distribution in package `suppdists`. Also see `pwr.confintR` to compute the required sample size for a given desired accuracy in parameter estimation (AIPE), which can also come in handy when planning studies”.

Wolfgang Viechtbauer mentioned on Twitter “that one can just compute the density of r directly (no need to simulate). For example: link. Then everything is nice and smooth”.

UPDATE: 2018-06-30

Frank Harrell wrote on Twitter: “I’ll also push the use of precision of correlation coefficient estimates in justifying sample sizes. Need n > 300 to estimate r. BBR Chapter 8”

Let’s start with an example, shown in the figure below.

Nice scatterplot, isn’t it! Sample size is 30, and r is 0.703. It seems we have discovered a relatively strong association between variables 1 and 2: let’s submit to Nature or PPNAS! And pollute the literature with another effect that won’t replicate!

Yep, the data in the scatterplot are due to chance. They were sampled from a population with zero correlation. I suspect a lot of published correlations might well fall into that category. Nothing new here, false positives and inflated effect sizes are a natural outcome of small n experiments, and the problem gets worse with questionable research practices and incentives to publish positive new results.

To understand the problem with estimation from small n experiments, we can perform a simulation in which we draw samples of different sizes from a normal population with a known Pearson’s correlation (rho) of zero. The sampling distributions of the estimates of rho for different sample sizes look like this:

Sampling distributions tell us about the behaviour of a statistic in the long run, if we did many experiments. Here, with increasing sample sizes, the sampling distributions are narrower, which means that in the long run, we get more precise estimates. However, a typical article reports only one correlation estimate, which could be completely off. So what sample size should we use to get a precise estimate? The answer depends on:

• the shape of the univariate and bivariate distributions (if outliers are common, consider robust methods);

• the expected effect size (the larger the effect, the fewer trials are needed – see below);

• the precision we want to afford.

For the sampling distributions in the previous figure, we can ask this question for each sample size:

What is the proportion of correlation estimates that are within +/- a certain number of units from the true population correlation? For instance:

• for 70% of estimates to be within +/- 0.1 of the true correlation value (between -0.1 and 0.1), we need at least 109 observations;

• for 90% of estimates to be within +/- 0.2 of the true correlation value (between -0.2 and 0.2), we need at least 70 observations.

These values are illustrated in the next figure using black lines and arrows. The figure shows the proportion of estimates near the true value, for different sample sizes, and for different levels of precision.

The bottom line is that even if we’re willing to make imprecise measurements (up to 0.2 from the true value), we need a lot of observations to be precise enough and often enough in the long run.

The estimation uncertainty associated with small sample sizes leads to another problem: effects are not likely to replicate. A successful replication can be defined in several ways. Here I won’t consider the relatively trivial case of finding a statistically significant (p<0.05) effect going in the same direction in two experiments. Instead, let’s consider how close two estimates are. We can determine, given a certain level of precision, the probability of observing similar effects in two consecutive experiments. In other words, we can find the probability that two measurements differ by at most a certain amount.

Not surprisingly, the results follow the same pattern as those observed in the previous figure: the probability of replication (y-axis) increases with sample size (x-axis) and with the uncertainty we’re willing to accept (see legend with colour coded difference conditions). In the figure above, the black lines indicate that for 80% of replications to be at most 0.2 apart, we need at least 83 observations.
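Both quantities, the proportion of estimates close to the population value and the probability that two consecutive experiments give similar estimates, can be obtained from the same simulation. The Python sketch below is a reduced version of the idea (far fewer simulations than needed for smooth curves, a handful of sample sizes, hypothetical function names); it is not the R code behind the figures.

```python
import math
import random


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sample_r(rho, n, rng):
    """Correlation estimate from one bivariate normal sample of size n."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * a + math.sqrt(1 - rho * rho) * rng.gauss(0, 1) for a in x]
    return pearson_r(x, y)


def precision_and_replication(rho, n, half_width, max_diff, nsim=2000, seed=6):
    """Return (P(|r - rho| <= half_width), P(|r1 - r2| <= max_diff))."""
    rng = random.Random(seed)
    near = close = 0
    for _ in range(nsim):
        r1 = sample_r(rho, n, rng)
        r2 = sample_r(rho, n, rng)  # an independent replication
        near += abs(r1 - rho) <= half_width
        close += abs(r1 - r2) <= max_diff
    return near / nsim, close / nsim


if __name__ == "__main__":
    for n in (30, 70, 110):
        near, close = precision_and_replication(0.0, n, 0.2, 0.2)
        print(f"n={n}: within 0.2 of rho={near:.2f}, replications <=0.2 apart={close:.2f}")
```

Changing `rho`, `half_width` and `max_diff` lets you explore other scenarios, including non-zero population correlations.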

So far, we have considered samples from a population with zero correlation, such that large correlations were due to chance. What happens when there is an effect? Let’s see what happens for a fixed sample size of 30, as illustrated in the next figure.

As a sanity check, we can see that the modes of the sampling distributions progressively increase with increasing population correlations. More interestingly, the sampling distributions also get narrower with increasing effect sizes. As a consequence, the larger the true effect we’re trying to estimate, the more precise our estimations. Or put another way, for a given level of desired precision, we need fewer trials to estimate a true large effect.

The next figure shows the proportion of estimates close to the true value, as a function of the population correlation, and for different levels of precision, given a sample size of 30 observations.

Overall, in the long run, we can achieve more precise measurements more often if we’re studying true large effects. The exact values will depend on priors about expected effect sizes, shape of distributions and desired precision or achievable sample size.

Let’s look in more detail at the sampling distributions for a generous rho = 0.4.

The sampling distributions for n<50 appear to be negatively skewed, which means that in the long run, experiments might tend to give biased estimates of the population value; in particular, experiments with n=10 or n=20 are more likely than others to get the sign wrong (long left tail) and to overestimate the true value (distribution mode shifted to the right).

From the same data, we can calculate the proportion of correlation estimates close to the true value, as a function of sample size and for different precision levels. We get these approximate results:

• for 70% of estimates to be within +/- 0.1 of the true correlation value (between 0.3 and 0.5), we need at least 78 observations;

• for 90% of estimates to be within +/- 0.2 of the true correlation value (between 0.2 and 0.6), we need at least 50 observations.

You could repeat this exercise using the R code to get estimates based on your own priors and the precision you want to afford.

Finally, we can look at the probability of observing similar effects in two consecutive experiments, for a given precision. In other words, what is the probability that two measurements differ by at most a certain amount? The next figure shows results for differences ranging from 0.05 (very precise) to 0.4 (very imprecise). The black arrow illustrates that for 80% of replications to be at most 0.2 apart, we need at least 59 observations.

We could do the same analyses presented in this post for power. However, I don’t really see the point of looking at power if the goal is to quantify an effect. The precision of our measurements and of our estimations should be a much stronger concern than the probability of flagging any effect as statistically significant (McShane et al. 2018).

There is a lot more to say about correlation estimation and I would recommend in particular these papers from Ed Vul and Tal Yarkoni, from the voodoo correlation era. More recently, Schönbrodt & Perugini (2013) looked at the effect of sample size on correlation estimation, with a focus on precision, similarly to this post. Finally, this more general paper (Forstmeier, Wagenmakers & Parker, 2016) about false positives is well worth reading.