# Small n correlations cannot be trusted

This post illustrates two important effects of sample size on the estimation of correlation coefficients: lower sample sizes are associated with increased variability and lower probability of replication. This is not specific to correlations, but here we’re going to have a detailed look at what it means when using the popular Pearson’s correlation (similar results are obtained using Spearman’s correlation, and the same problems arise with regression). The R code is available on github.

UPDATE: 2018-06-02

In the original post, I mentioned non-linearities in some of the figures. Jan Vanhove replied on Twitter that he was not getting any, and suggested a different code snippet. I’ve updated the simulations using his code, and now the non-linearities are gone! So thanks Jan!

Johannes Algermissen mentioned on Twitter that his recent paper covered similar issues. Have a look! He also reminded me about this recent paper that makes points very similar to those in this blog.

Gjalt-Jorn Peters mentioned on Twitter that “you can also use the Pearson distribution in package `suppdists`. Also see `pwr.confintR` to compute the required sample size for a given desired accuracy in parameter estimation (AIPE), which can also come in handy when planning studies”.

Wolfgang Viechtbauer‏ mentioned on Twitter “that one can just compute the density of r directly (no need to simulate). For example: link. Then everything is nice and smooth”.

UPDATE: 2018-06-30

Frank Harrell wrote on Twitter: “I’ll also push the use of precision of correlation coefficient estimates in justifying sample sizes. Need n > 300 to estimate r. BBR Chapter 8″

Let’s start with an example, shown in the figure below. Nice scatterplot isn’t it! Sample size is 30, and r is 0.703. It seems we have discovered a relatively strong association between variables 1 and 2: let’s submit to Nature or PPNAS! And pollute the literature with another effect that won’t replicate!

Yep, the data in the scatterplot are due to chance. They were sampled from a population with zero correlation. I suspect a lot of published correlations might well fall into that category. Nothing new here, false positives and inflated effect sizes are a natural outcome of small n experiments, and the problem gets worse with questionable research practices and incentives to publish positive new results.

To understand the problem with estimation from small n experiments, we can perform a simulation in which we draw samples of different sizes from a normal population with a known Pearson’s correlation (rho) of zero. The sampling distributions of the estimates of rho for different sample sizes look like this:

Sampling distributions tell us about the behaviour of a statistics in the long run, if we did many experiments. Here, with increasing sample sizes, the sampling distributions are narrower, which means that in the long run, we get more precise estimates. However, a typical article reports only one correlation estimate, which could be completely off. So what sample size should we use to get a precise estimate? The answer depends on:

• the shape of the univariate and bivariate distributions (if outliers are common, consider robust methods);

• the expected effect size (the larger the effect, the fewer trials are needed – see below);

• the precision we want to afford.

For the sampling distributions in the previous figure, we can ask this question for each sample size:

What is the proportion of correlation estimates that are within +/- a certain number of units from the true population correlation? For instance:

• for 70% of estimates to be within +/- 0.1 of the true correlation value (between -0.1 and 0.1), we need at least 109 observations;

• for 90% of estimates to be within +/- 0.2 of the true correlation value (between -0.2 and 0.2), we need at least 70 observations.

These values are illustrated in the next figure using black lines and arrows. The figure shows the proportion of estimates near the true value, for different sample sizes, and for different levels of precision. The bottom-line is that even if we’re willing to make imprecise measurements (up to 0.2 from the true value), we need a lot of observations to be precise enough and often enough in the long run.

The estimation uncertainty associated with small sample sizes leads to another problem: effects are not likely to replicate. A successful replication can be defined in several ways. Here I won’t consider the relatively trivial case of finding a statistically significant (p<0.05) effect going in the same direction in two experiments. Instead, let’s consider how close two estimates are. We can determine, given a certain level of precision, the probability to observe similar effects in two consecutive experiments. In other words, we can find the probability that two measurements differ by at most a certain amount. Not surprisingly, the results follow the same pattern as those observed in the previous figure: the probability to replicate (y-axis) increases with sample size (x-axis) and with the uncertainty we’re willing to accept (see legend with colour coded difference conditions).

In the figure above, the black lines indicates that for 80% of replications to be at most 0.2 apart, we need at least 83 observations.

So far, we have considered samples from a population with zero correlation, such that large correlations were due to chance. What happens when there is an effect? Let see what happens for a fixed sample size of 30, as illustrated in the next figure.

As a sanity check, we can see that the modes of the sampling distributions progressively increase with increasing population correlations. More interestingly, the sampling distributions also get narrower with increasing effect sizes. As a consequence, the larger the true effect we’re trying to estimate, the more precise our estimations. Or put another way, for a given level of desired precision, we need fewer trials to estimate a true large effect. The next figure shows the proportion of estimates close to the true estimate, as a function of the population correlation, and for different levels of precision, given a sample size of 30 observations.

Overall, in the long run, we can achieve more precise measurements more often if we’re studying true large effects. The exact values will depend on priors about expected effect sizes, shape of distributions and desired precision or achievable sample size. Let’s look in more detail at the sampling distributions for a generous rho = 0.4.

The sampling distributions for n<50 appear to be negatively skewed, which means that in the long run, experiments might tend to give biased estimates of the population value; in particular, experiments with n=10 or n=20 are more likely than others to get the sign wrong (long left tail) and to overestimate the true value (distribution mode shifted to the right). From the same data, we can calculate the proportion of correlation estimates close to the true value, as a function of sample size and for different precision levels.

We get this approximate results:

• for 70% of estimates to be within +/- 0.1 of the true correlation value (between 0.3 and 0.5), we need at least 78 observations;

• for 90% of estimates to be within +/- 0.2 of the true correlation value (between 0.2 and 0.6), we need at least 50 observations.

You could repeat this exercise using the R code to get estimates based on your own priors and the precision you want to afford.

Finally, we can look at the probability to observe similar effects in two consecutive experiments, for a given precision. In other words, what is the probability that two measurements differ by at most a certain amount? The next figure shows results for differences ranging from 0.05 (very precise) to 0.4 (very imprecise). The black arrow illustrates that for 80% of replications to be at most 0.2 apart, we need at least 59 observations.

We could do the same analyses presented in this post for power. However, I don’t really see the point of looking at power if the goal is to quantify an effect. The precision of our measurements and of our estimations should be a much stronger concern than the probability to flag any effect as statistically significant (McShane et al. 2018).

There is a lot more to say about correlation estimation and I would recommend in particular these papers from Ed Vul and Tal Yarkoni, from the voodoo correlation era. More recently, Schönbrodt & Perugini (2013) looked at the effect of sample size on correlation estimation, with a focus on precision, similarly to this post. Finally, this more general paper (Forstmeier, Wagemakers & Parker, 2016) about false positives is well worth reading.