Planning for measurement precision, an alternative to power analyses

When we estimate power curves, we ask this question: given some priors about the data generating process, the nature of the effect and measurement variance, what is the probability to detect an effect for a given statistical test (say using an arbitrary p<0.05 threshold) for various sample sizes and effect sizes. While there are very good reasons to focus on power estimation, this is not the only or the most important aspect of an experimental procedure to consider (Gelman & Carlin, 2014). Indeed, finding the number of observations needed so that we get p<0.05 in say 87% of experiments, is not the most exciting part of designing an experiment.

The relevant question is not “What is the power of a test?” but rather is “What might be expected to happen in studies of this size?” (Gelman & Carlin, 2014)

A related but more important question is that of measurement precision: given some priors and a certain number of participants, how close can we get to the unknown population value (Maxwell et al., 2008; Schönbrodt & Perugini, 2013; Peters & Crutzen, 2018; Trafimow, 2019)? Not surprisingly, measurement precision depends on sample size. As we saw in previous posts, sampling distributions get narrower with increasing sample sizes:

And with narrower sampling distributions, measurement precision increases. To illustrate, let’s consider an example from a lexical decision task – hundreds of reaction times (RT) were measured in hundreds of participants who had to distinguish between words and non-words presented on a computer screen.

Here are examples of RT distributions from 100 participants for each condition:

If we save the median of each distribution, for each participant and condition, we get these positively skewed group level distributions:

The distribution of pairwise differences between medians is also positively skewed:

Notably, most participants have a positive difference: 96.4% of participants are faster in the Word than the Non-Word condition – a potential case of stochastic dominance (Rouder & Haaf, 2018; see also this summary blog post).

Now let say we want to estimate the group difference between conditions. Because of the skewness at each level of analysis (within and across participants), we estimate the central tendency at each level using the median: that is, we compute the median for each participant and each condition, then compute the medians of medians across participants (a more detailed assessment could be obtained by performing hierarchical modelling or multiple quantile estimation for instance).

Then we can assess measurement precision at the group level by performing a multi-level simulation. In this simulation, we can ask, for instance, how often the group estimate is no more than 10 ms from the population value across many experiments. To simplify, in each iteration of the simulation, we draw 200 trials per condition and participant, compute the median and save the Non-Word – Word difference. Group estimation of the difference is then based on a random sample of 10 to 300 participants, with the group median computed across participants’ differences between medians. Because the dataset is very large at the two level of analysis, we can pretend we have access to the population values, and define them by first computing, for each condition, the median across all available trials for each participant, second by computing across all participants the median of the pairwise differences.

Having defined population values (the truth we’re trying to estimate, here a group difference of 78 ms), we can calculate measurement precision as the proportion of experiments in which the group estimate is no more than X ms from the population value, with X varying from 5 to 40 ms. Here are the results:

Not surprisingly, the proportion of estimates close to the population value increases with the number of participants. More interestingly, the relationship is non-linear, such that a larger gain in precision can be achieved by increasing sample size for instance from 10 to 20 compared to from 90 to 100.

The results also let us answer useful questions for planning experiments (see the black arrows in the above figure):

• So that in 70% of experiments the group estimate of the median is no more than 10 ms from the population value, we need to test at least 56 participants.

• So that in 90% of experiments the group estimate of the median is no more than 20 ms from the population value, we need to test at least 38 participants.

Obviously, this is just an example, about a narrow problem related to lexical decisions. Other aspects could be considered too, for instance the width of the confidence intervals (Maxwell, Kelley & Rausch, 2008; Peters & Crutzen, 2017; Rothman & Greenland, 2018). And for your particular case, most likely, you won’t have access to a large dataset from which to perform a data driven simulation. In this case, you can get estimates about plausible effect sizes and their variability from various sources (Gelman & Carlin 2014):

• related data;

• (systematic) literature review;

• meta-analysis;

• outputs of a hierarchical model;

• modelling.

To model a range of plausible effect sizes and their consequences on repeated measurements, you need priors about a data generating process and how distributions differ between conditions. For instance, you could use exGaussian distributions to simulate RT data. For research on new effects, it is advised to consider a large range of potential effects, with their plausibility informed by the literature and psychological/biological constraints.

Although relying on the literature alone can lead to over-optimistic expectations because of the dominance of small n studies and a bias towards significant results (Yarkoni 2009; Button et al. 2013), methods are being developed to overcome these limitations (Anderson, Kelley & Maxwell, 2017). In the end, the best cure against effect size over-estimation is a combination of pre-registration/registered reports (to diminish literature bias) and data sharing (to let anyone do their own calculations and meta-analyses).

Code

The code is on figshare: the simulation can be reproduced using the `flp_sim_precision` notebook, the illustrations of the distributions can be reproduced using `flp_illustrate_dataset`.

Shiny app by Malcolm Barrett (@malco_barrett)

https://malcolmbarrett.shinyapps.io/precisely/

References

Anderson, S.F., Kelley, K. & Maxwell, S.E. (2017) Sample-Size Planning for More Accurate Statistical Power: A Method Adjusting Sample Effect Sizes for Publication Bias and Uncertainty. Psychol Sci, 28, 1547-1562.

Bland J.M.. The tyranny of power: is there a better way to calculate sample size? https://www.bmj.com/content/339/bmj.b3985)

Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S. & Munafo, M.R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature reviews. Neuroscience, 14, 365-376.

Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Meot, A., Augustinova, M. & Pallier, C. (2010) The French Lexicon Project: lexical decision data for 38,840 French words and 38,840 pseudowords. Behav Res Methods, 42, 488-496.

Gelman, A. & Carlin, J. (2014) Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspect Psychol Sci, 9, 641-651.

Maxwell, S.E., Kelley, K. & Rausch, J.R. (2008) Sample size planning for statistical power and accuracy in parameter estimation. Annu Rev Psychol, 59, 537-563.

Peters, G.-J.Y. & Crutzen, R. (2017) Knowing exactly how effective an intervention, treatment, or manipulation is and ensuring that a study replicates: accuracy in parameter estimation as a partial solution to the replication crisis. PsyArXiv. doi:10.31234/osf.io/cjsk2.

Rothman, K.J. & Greenland, S. (2018) Planning Study Size Based on Precision Rather Than Power. Epidemiology, 29, 599-603.

Rouder, J.N. & Haaf, J.M. (2018) Power, Dominance, and Constraint: A Note on the Appeal of Different Design Traditions. Advances in Methods and Practices in Psychological Science, 1, 19-26.

Rousselet, G.A. & Wilcox, R.R. (2018) Reaction times and other skewed distributions: problems with the mean and the median. bioRxiv. doi: https://doi.org/10.1101/383935

Rousselet, G.; Wilcox, R. (2018): Reaction times and other skewed distributions: problems with the mean and the median. figshare. Fileset. https://doi.org/10.6084/m9.figshare.6911924.v1

Schönbrodt, F.D. & Perugini, M. (2013) At what sample size do correlations stabilize? J Res Pers, 47, 609-612.

Trafimow, D. (2019) Five Nonobvious Changes in Editorial Practice for Editors and Reviewers to Consider When Evaluating Submissions in a Post p < 0.05 Universe, The American Statistician, 73:sup1, 340-345,

Yarkoni, T. (2009) Big Correlations in Little Studies: Inflated fMRI Correlations Reflect Low Statistical Power‚ Commentary on Vul et al. (2009). Perspectives on Psychological Science, 4, 294-298.

Power estimation for correlation analyses

Following the previous posts on small n correlations [post 1][post 2][post 3], in this post we’re going to consider power estimation (if you do not care about power, but you’d rather focus on estimation, this post is for you).

To get started, let’s look at examples of n=1000 samples from bivariate populations with known correlations (rho), with rho increasing from 0.1 to 0.9 in steps of 0.1. For each rho, we draw a random sample and plot Y as a function of X. The variances of the two correlated variables are independent – there is homoscedasticity. Later we will look at heteroscedasticity, when the variance of Y varies with X.

For the same distributions illustrated in the previous figure, we compute the proportion of positive Pearson’s correlation tests for different sample sizes. This gives us power curves (here based on simulations with 50,000 samples). We also include rho = 0 to determine the proportion of false positives.

Power increases with sample size and with rho. When rho = 0, the proportion of positive tests is the proportion of false positives. It should be around 0.05 for a test with alpha = 0.05. This is the case here, as Pearson’s correlation is well behaved for bivariate normal data.

For a given expected population correlation and a desired long run power value, we can use interpolation to find out the matching sample size.

To achieve at least 80% power given an expected population rho of 0.4, the minimum sample size is 46 observations.

To achieve at least 90% power given an expected population rho of 0.3, the minimum sample size is 118 observations.

Alternatively, for a given sample size and a desired power, we can determine the minimum effect size we can hope to detect. For instance, given n = 40 and a desired power of at least 90%, the minimum effect size we can detect is 0.49.

So far, we have only considered situations where we sample from bivariate normal distributions. However, Wilcox (2012 p. 444-445) describes 6 aspects of data that affect Pearson’s r:

• outliers

• the magnitude of the slope around which points are clustered

• curvature

• the magnitude of the residuals

• restriction of range

• heteroscedasticity

The effect of outliers on Pearson’s and Spearman’s correlations is described in detail in Pernet et al. (2012) and Rousselet et al. (2012).

Next we focus on heteroscedasticity. Let’s look at Wilcox’s heteroscedasticity example (2012, p. 445). If we correlate variable X with variable Y, heteroscedasticity means that the variance of Y depends on X. Wilcox considers this example:

X and Y have normal distributions with both means equal to zero. […] X and Y have variance 1 unless |X|>0.5, in which case Y has standard deviation |X|.”

Here is an example of such data:

Next, Wilcox (2012) considers the effect of this heteroscedastic situation on false positives. We superimpose results for the homoscedastic case for comparison. In the homoscedastic case, as expected for a test with alpha = 0.05, the proportion of false positives is very close to 0.05 at every sample size. In the heteroscedastic case, instead of 5%, the number of false positives is between 12% and 19%. The number of false positives actually increases with sample size! That’s because the standard T statistics associated with Pearson’s correlation assumes homoscedasticity, so the formula is incorrect when there is heteroscedasticity.

As a consequence, when Pearson’s test is positive, it doesn’t always imply the existence of a correlation. There could be dependence due to heteroscedasticity, in the absence of a correlation.

Let’s consider another heteroscedastic situation, in which the variance of Y increases linearly with X. This could correspond for instance to situations in which cognitive performance or income are correlated with age – we might expect the variance amongst participants to increase with age.

We keep rho constant at 0.4 and increase the maximum variance from 1 (homoscedastic case) to 9. That is, the variance of Y linear increases from 1 to the maximum variance as a function of X.

For rho = 0, we can compute the proportion of false positives as a function of both sample size and heteroscedasticity. In the next figure, variance refers to the maximum variance.

From 0.05 for the homoscedastic case (max variance = 1), the proportion of false positives increases to 0.07-0.08 for a max variance of 9. This relatively small increase in the number of false positives could have important consequences if 100’s of labs are engaged in fishing expeditions and they publish everything with p<0.05. However, it seems we shouldn’t worry much about linear heteroscedasticity as long as sample sizes are sufficiently large and we report estimates with appropriate confidence intervals. An easy way to build confidence intervals when there is heteroscedasticity is to use the percentile bootstrap (see Pernet et al. 2012 for illustrations and Matlab code).

Finally, we can run the same simulation for rho = 0.4. Power progressively decreases with increasing heteroscedasticity. Put another way, with larger heteroscedasticity, larger sample sizes are needed to achieve the same power.

We can zoom in:

The vertical bars mark approximately a 13 observation increase to keep power at 0.8 between a max variance of 0 and 9. This decrease in power can be avoided by using the percentile bootstrap or robust correlation techniques, or both (Wilcox, 2012).

Conclusion

The results presented in this post are based on simulations. You could also use a sample size calculator for correlation analyses – for instance this one.

But running simulations has huge advantages. For instance, you can compare multiple estimators of association in various situations. In a simulation, you can also include as much information as you have about your target populations. For instance, if you want to correlate brain measurements with response times, there might be large datasets you could use to perform data-driven simulations (e.g. UK biobank), or you could estimate the shape of the sampling distributions to draw samples from appropriate theoretical distributions (maybe a gamma distribution for brain measurements and an exGaussian distribution for response times).

Simulations also put you in charge, instead of relying on a black box, which most likely will only cover Pearson’s correlation in ideal conditions, and not robust alternatives when there are outliers or heteroscedasticity or other potential issues.

The R code to reproduce the simulations and the figures is on GitHub.

References

Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.

Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Frontiers in human neuroscience, 6, 119.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Small n correlations + p values = disaster

Previously, we saw that with small sample sizes, correlation estimation is very uncertain, which implies that small n correlations cannot be trusted: the observed value in any experiment could be very far from the population value, and the sign could be wrong too. In addition to the uncertainty associated with small sample sizes, the selective report of results based on p values < 0.05 (or some other threshold), can lead to massively inflated correlation estimates in the literature (Yarkoni, 2009 ☜ if you haven’t done so, you really should read this excellent paper).

Let’s illustrate the problem (code is on GitHub). First, we consider a population rho = 0. Here is the sampling distribution as a function of sample size, as we saw in an earlier post.

Figure 1: Sampling distribution for rho=0.

Now, here is the sampling distribution conditional on p < 0.05. The estimates are massively inflated and the problem gets worse with smaller sample sizes, because the smaller the sample size, the larger the correlations must be by chance for them to be significant.

Figure 2: Sampling distribution for rho=0, given p<0.05

So no, don’t get too excited when you see a statistically significant correlation in a paper…

Let’s do the same exercise when the population correlation is relatively large. With rho = 0.4, the sampling distribution looks like this:

Figure 3: Sampling distribution for rho=0.4.

If we report only those correlations associated with p < 0.05, the distribution looks like this:

Figure 4: Sampling distribution for rho=0.4, given p<0.05

Again, with small sample sizes, the estimates are inflated, albeit in the correct direction. There is nevertheless a small number of large negative correlations (see small purple bump around -0.6 -0.8). Indeed, in 0.77% of simulations, even though the population value was 0.4, a large and p < 0.05 negative correlation was obtained.

Small n correlations cannot be trusted

This post illustrates two important effects of sample size on the estimation of correlation coefficients: lower sample sizes are associated with increased variability and lower probability of replication. This is not specific to correlations, but here we’re going to have a detailed look at what it means when using the popular Pearson’s correlation (similar results are obtained using Spearman’s correlation, and the same problems arise with regression). The R code is available on github.

UPDATE: 2018-06-02

In the original post, I mentioned non-linearities in some of the figures. Jan Vanhove replied on Twitter that he was not getting any, and suggested a different code snippet. I’ve updated the simulations using his code, and now the non-linearities are gone! So thanks Jan!

Johannes Algermissen mentioned on Twitter that his recent paper covered similar issues. Have a look! He also reminded me about this recent paper that makes points very similar to those in this blog.

Gjalt-Jorn Peters mentioned on Twitter that “you can also use the Pearson distribution in package `suppdists`. Also see `pwr.confintR` to compute the required sample size for a given desired accuracy in parameter estimation (AIPE), which can also come in handy when planning studies”.

Wolfgang Viechtbauer‏ mentioned on Twitter “that one can just compute the density of r directly (no need to simulate). For example: link. Then everything is nice and smooth”.

UPDATE: 2018-06-30

Frank Harrell wrote on Twitter: “I’ll also push the use of precision of correlation coefficient estimates in justifying sample sizes. Need n > 300 to estimate r. BBR Chapter 8″

Let’s start with an example, shown in the figure below. Nice scatterplot isn’t it! Sample size is 30, and r is 0.703. It seems we have discovered a relatively strong association between variables 1 and 2: let’s submit to Nature or PPNAS! And pollute the literature with another effect that won’t replicate!

Yep, the data in the scatterplot are due to chance. They were sampled from a population with zero correlation. I suspect a lot of published correlations might well fall into that category. Nothing new here, false positives and inflated effect sizes are a natural outcome of small n experiments, and the problem gets worse with questionable research practices and incentives to publish positive new results.

To understand the problem with estimation from small n experiments, we can perform a simulation in which we draw samples of different sizes from a normal population with a known Pearson’s correlation (rho) of zero. The sampling distributions of the estimates of rho for different sample sizes look like this:

Sampling distributions tell us about the behaviour of a statistics in the long run, if we did many experiments. Here, with increasing sample sizes, the sampling distributions are narrower, which means that in the long run, we get more precise estimates. However, a typical article reports only one correlation estimate, which could be completely off. So what sample size should we use to get a precise estimate? The answer depends on:

• the shape of the univariate and bivariate distributions (if outliers are common, consider robust methods);

• the expected effect size (the larger the effect, the fewer trials are needed – see below);

• the precision we want to afford.

For the sampling distributions in the previous figure, we can ask this question for each sample size:

What is the proportion of correlation estimates that are within +/- a certain number of units from the true population correlation? For instance:

• for 70% of estimates to be within +/- 0.1 of the true correlation value (between -0.1 and 0.1), we need at least 109 observations;

• for 90% of estimates to be within +/- 0.2 of the true correlation value (between -0.2 and 0.2), we need at least 70 observations.

These values are illustrated in the next figure using black lines and arrows. The figure shows the proportion of estimates near the true value, for different sample sizes, and for different levels of precision. The bottom-line is that even if we’re willing to make imprecise measurements (up to 0.2 from the true value), we need a lot of observations to be precise enough and often enough in the long run.

The estimation uncertainty associated with small sample sizes leads to another problem: effects are not likely to replicate. A successful replication can be defined in several ways. Here I won’t consider the relatively trivial case of finding a statistically significant (p<0.05) effect going in the same direction in two experiments. Instead, let’s consider how close two estimates are. We can determine, given a certain level of precision, the probability to observe similar effects in two consecutive experiments. In other words, we can find the probability that two measurements differ by at most a certain amount. Not surprisingly, the results follow the same pattern as those observed in the previous figure: the probability to replicate (y-axis) increases with sample size (x-axis) and with the uncertainty we’re willing to accept (see legend with colour coded difference conditions).

In the figure above, the black lines indicates that for 80% of replications to be at most 0.2 apart, we need at least 83 observations.

So far, we have considered samples from a population with zero correlation, such that large correlations were due to chance. What happens when there is an effect? Let see what happens for a fixed sample size of 30, as illustrated in the next figure.

As a sanity check, we can see that the modes of the sampling distributions progressively increase with increasing population correlations. More interestingly, the sampling distributions also get narrower with increasing effect sizes. As a consequence, the larger the true effect we’re trying to estimate, the more precise our estimations. Or put another way, for a given level of desired precision, we need fewer trials to estimate a true large effect. The next figure shows the proportion of estimates close to the true estimate, as a function of the population correlation, and for different levels of precision, given a sample size of 30 observations.

Overall, in the long run, we can achieve more precise measurements more often if we’re studying true large effects. The exact values will depend on priors about expected effect sizes, shape of distributions and desired precision or achievable sample size. Let’s look in more detail at the sampling distributions for a generous rho = 0.4.

The sampling distributions for n<50 appear to be negatively skewed, which means that in the long run, experiments might tend to give biased estimates of the population value; in particular, experiments with n=10 or n=20 are more likely than others to get the sign wrong (long left tail) and to overestimate the true value (distribution mode shifted to the right). From the same data, we can calculate the proportion of correlation estimates close to the true value, as a function of sample size and for different precision levels.

We get this approximate results:

• for 70% of estimates to be within +/- 0.1 of the true correlation value (between 0.3 and 0.5), we need at least 78 observations;

• for 90% of estimates to be within +/- 0.2 of the true correlation value (between 0.2 and 0.6), we need at least 50 observations.

You could repeat this exercise using the R code to get estimates based on your own priors and the precision you want to afford.

Finally, we can look at the probability to observe similar effects in two consecutive experiments, for a given precision. In other words, what is the probability that two measurements differ by at most a certain amount? The next figure shows results for differences ranging from 0.05 (very precise) to 0.4 (very imprecise). The black arrow illustrates that for 80% of replications to be at most 0.2 apart, we need at least 59 observations.

We could do the same analyses presented in this post for power. However, I don’t really see the point of looking at power if the goal is to quantify an effect. The precision of our measurements and of our estimations should be a much stronger concern than the probability to flag any effect as statistically significant (McShane et al. 2018).

There is a lot more to say about correlation estimation and I would recommend in particular these papers from Ed Vul and Tal Yarkoni, from the voodoo correlation era. More recently, Schönbrodt & Perugini (2013) looked at the effect of sample size on correlation estimation, with a focus on precision, similarly to this post. Finally, this more general paper (Forstmeier, Wagemakers & Parker, 2016) about false positives is well worth reading.

Reaction times and other skewed distributions: problems with the mean and the median (part 4/4)

This is part 4 of a 4 part series. Part 1 is here.

In this post, I look at median bias in a large dataset of reaction times from participants engaged in a lexical decision task. The dataset was described in a previous post.

After removing a few participants who didn’t pay attention to the task (low accuracy or too many very late responses), we’re left with 959 participants to play with. Each participant had between 996 and 1001 trials for each of two conditions, Word and Non-Word.

Here is an illustration of reaction time distributions from 100 randomly sampled participants in the Word condition:

Same in the Non-Word condition:

Skewness tended to be larger in the Word than the Non-Word condition. Based on the standard parametric definition of skewness, that was the case in 80% of participants. If we use a non-parametric estimate instead (mean – median), it was the case in 70% of participants.

If we save the median of every individual distribution, we get the two following group distributions, which display positive skewness:

The same applies to distributions of means:

So we have to worry about skewness at 2 levels:

• individual distributions

• group distributions

Here I’m only going to explore estimation bias as a result of skewness and sample size in individual distributions. From what we learnt in previous posts, we can already make predictions: because skewness tended to be stronger in the Word than in the Non-Word condition, the bias of the median will be stronger in the former than the later for small sample sizes. That is, the median in the Word condition will tend to be more over-estimated than the median in the Non-Word condition. As a consequence, the difference between the median of the Non-Word condition (larger RT) and the median of the Word condition (smaller RT) will tend to be under-estimated. To check this prediction, I estimated bias in every participant using a simulation with 2,000 iterations. I assumed that the full sample was the population, from which we can compute population means and population medians. Because the Non-Word condition is the least skewed, I used it as the reference condition, which always had 200 trials. The Word condition had 10 to 200 trials, with 10 trial increments. In the simulation, single RT were sampled with replacements among the roughly 1,000 trials available per condition and participant, so that each iteration is equivalent to a fake experiment.

Let’s look at the results for the median. The figure below shows the bias in the long run estimation of the difference between medians (Non-Word – Word), as a function of sample size in the Word condition. The Non-Word condition always had 200 trials. All participants are superimposed and shown as coloured traces. The average across participants is shown as a thicker black line.

As expected, bias tended to be negative with small sample sizes. For the smallest sample size, the average bias was -11 ms. That’s probably substantial enough to seriously distort estimation in some experiments. Also, variability is high, with a 80% highest density interval of [-17.1, -2.6] ms. Bias decreases rapidly with increasing sample size. For n=60, it is only 1 ms.

But inter-participant variability remains high, so we should be cautious interpreting results with large numbers of trials but few participants. To quantify the group uncertainty, we could measure the probability of being wrong, given a level of desired precision, as demonstrated here for instance.

After bootstrap bias correction (with 200 bootstrap resamples), the average bias drops to roughly zero for all sample sizes:

Bias correction also reduced inter-participant variability.

As we saw in the previous post, the sampling distribution of the median is skewed, so the standard measure of bias (taking the mean across simulation iterations) does not provide a good indication of the bias we can expect in a typical experiment. If instead of the mean, we compute the median bias, we get the following results:

Now, at the smallest sample size, the average bias is only -2 ms, and it drops to near zero for n=20. This result is consistent with the simulations reported in the previous post and confirms that in the typical experiment, the average bias associated with the median is negligible.

What happens with the mean?

The average bias of the mean is near zero for all sample sizes. Individual bias values are also much less variable than median values. This difference in bias variability does not reflect a difference in variability among participants for the two estimators of central tendency. In fact, the distributions of differences between Non-Word and Word conditions are very similar for the mean and the median.

Estimates of spread are also similar between distributions:

IQR: mean RT = 78; median RT = 79

MAD: mean RT = 57; median RT = 54

VAR: mean RT = 4507; median RT = 4785

This suggests that the inter-participant bias differences are due to the shape differences observed in the first two figures of this post.

Finally, let’s consider the median bias of the mean.

For the smallest sample size, the average bias across participants is 7 ms. This positive bias can be explained easily from the simulation results of post 3: because of the larger skewness in the Word condition, the sampling distribution of the mean was more positively skewed for small samples in that condition compared to the Non-Word condition, with the bulk of the bias estimates being negative. As a result, the mean tended to be more under-estimated in the Word condition, leading to larger Non-Word – Word differences in the typical experiment.

I have done a lot more simulations and was planning even more, using other datasets, but it’s time to move on! Of particular note, it appears that in difficult visual search tasks, skewness can differ dramatically among set size conditions – see for instance data posted here.

Concluding remarks

The data-driven simulations presented here confirm results from our previous simulations:

• if we use the standard definition of bias, for small sample sizes, mean estimates are not biased, median estimates are biased;
• however, in the typical experiment (median bias), mean estimates can be more biased than median estimates;

• bootstrap bias correction can be an effective tool to reduce bias.

Given the large differences in inter-participant variability between the mean and the median, an important question is how to spend your money: more trials or more participants (Rouder & Haaf 2018)? An answer can be obtained by running simulations, either data-driven or assuming generative distributions (for instance exGaussian distributions for RT data). Simulations that take skewness into account are important to estimate bias and power. Assuming normality can have disastrous consequences.

Despite the potential larger bias and bias variability of the median compared to the mean, for skewed distributions I would still use the median as a measure of central tendency, because it provides a more informative description of the typical observations. Large sample sizes will reduce both bias and estimation variability, such that high-precision single-participant estimation should be easy to obtain in many situations involving non-clinical samples. For group estimations, much larger samples than commonly used are probably required to improve the precision of our inferences.

Although the bootstrap bias correction seems to work very well in the long run, for a single experiment there is no guarantee it will get you closer to the truth. One possibility is to report results with and without bias correction.

For group inferences on the median, traditional techniques use incorrect estimations of the standard error, so consider modern parametric or non-parametric techniques instead (Wilcox & Rousselet, 2018).

References

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

Rouder, J.N. & Haaf, J.M. (2018) Power, Dominance, and Constraint: A Note on the Appeal of Different Design Traditions. Advances in Methods and Practices in Psychological Science, 1, 19-26.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8 42 41-48 42 30.

Cohen’s d is biased

The R notebook associated with this post is available on github.

Cohen’s d is a popular measure of effect size. In the one-sample case, d is simply computed as the mean divided by the standard deviation (SD). For repeated measures, the same formula is applied to difference scores (see detailed presentation and explanation of variants in Lakens, 2013).

Because d relies on a non-robust measure of central tendency (the mean), and a non-robust measure of dispersion (SD), it is a non-robust measure of effect size, meaning that a single observation can have a dramatic effect on its value, as explained here. Cohen’s d also makes very strict assumptions about the data, so it is only appropriate in certain contexts. As a consequence, it should not be used as the default measure of effect size, and more powerful and informative alternatives should be considered – see a few examples here. For comparisons across studies and meta-analyses, nothing will beat data-sharing though.

Here we look at another limitation of Cohen’s d: it is biased when we draw small samples. Bias is covered in detail in another post. In short, in the one-sample case, when Cohen’s d is estimated from a small sample, in the long run it tends to be larger than the population value. This over-estimation is due to a bias of SD, which tends to be lower than the population’s SD. Because the mean is not biased, when divided by an under-estimated SD, it leads to an over-estimated measure of effect size. The bias of SD is explained in intro stat books, in the section describing Student’s t. Not surprisingly it is never mentioned in the discussions of small n studies, as a limitation of effect size estimation…

In this demonstration, we sample with replacement 10,000 times from the ex-Gaussian distributions below, for various sample sizes, as explained here:

The table below shows the population values for each distribution. For comparison, we also consider a robust equivalent to Cohen’s d, in which the mean is replaced by the median, and SD is replaced by the percentage bend mid-variance (`pbvar`, Wilcox, 2017). As we will see, this robust alternative is also biased – there is no magic solution I’m afraid.

m:          600  600  600  600  600  600  600  600  600  600  600  600

md:        509  512  524  528  540  544  555  562  572  579  588  594

m-md:    92    88     76    72    60    55     45    38    29    21    12     6

m.den:   301  304  251  255  201  206  151  158  102  112  54     71

md.den: 216  224  180  190  145  157  110  126  76    95    44     68

m.es:       2.0   2.0    2.4   2.4   3.0   2.9   4.0   3.8   5.9   5.4   11.1  8.5

md.es:     2.4   2.3    2.9   2.8   3.7   3.5   5.0   4.5   7.5   6.1   13.3  8.8

m = mean

md = median

den = denominator

es = effect size

m.es = Cohen’s d

md.es = md / pbvar

Let’s look at the behaviour of d as a function of skewness and sample size.

Effect size d tends to decrease with increasing skewness, because SD tends to increase with skewness. Effect size also increases with decreasing sample size. This bias is stronger for samples from the least skewed distributions. This is counterintuitive, because one would think estimation tends to get worse with increased skewness. Let’s find out what’s going on.

Computing the bias normalises the effect sizes across skewness levels, revealing large bias differences as a function of skewness. Even with 100 observations, the bias (mean of 10,000 simulation iterations) is still slightly larger than zero for the least skewed distributions. This bias is not due to the mean, because we the sample mean is an unbiased estimator of the population mean.

Let’s check to be sure:

So the problem must be with the denominator:

Unlike the mean, the denominator of Cohen’s d, SD, is biased. Let’s look at bias directly.

SD is most strongly biased for small sample sizes and bias increases with skewness. Negative values indicate that sample SD tends to under-estimate the population values. This is because the sampling distribution of SD is increasingly skewed with increasing skewness and decreasing sample sizes. This can be seen in this plot of the 80% highest density intervals (HDI) for instance:

The sampling distribution of SD is increasingly skewed and variable with increasing skewness and decreasing sample sizes. As a result, the sampling distribution of Cohen’s d is also skewed. The bias is strongest in absolute term for the least skewed distributions because the sample SD is overall smaller for these distributions, resulting in overall larger effect sizes. Although SD is most biased for the most skewed distributions, SD is also overall much larger for them, resulting in much smaller effect sizes than those obtained for less skewed distributions. This strong attenuation of effect sizes with increasing skewness swamps the absolute differences in SD bias. This explains the counter-intuitive lower d bias for more skewed distributions.

As we saw previously, bias can be corrected using a bootstrap approach. Applied, to Cohen’s d, this technique does reduce bias, but it still remains a concern:

Finally, let’s look at the behaviour of a robust equivalent to Cohen’s d, the median normalised by the percentage bend mid-variance.

The median effect size shows a similar profile to the mean effect size. It is overall larger than the mean effect size because it uses a robust measure of spread, which is less sensitive to the long right tails of the skewed distributions we sample from.

The bias disappears quickly with increasing sample sizes, and quicker than for the mean effect size.

However, unlike what we observed for d, in this case the bias correction does not work for small samples, because the repetition of the same observations in some bootstrap samples leads to very large values of the denominator. It’s ok for n>=15, for which bias is relatively small anyway, so at least based on these simulations, I wouldn’t use bias correction for this robust effect size.

Conclusion

Beware of small sample sizes: they are associated with increased variability (see discussion in a clinical context here) and can accentuate the bias of some effect size estimates. If effect sizes tend to be reported more often if they pass some arbitrary threshold, for instance p < 0.05, then the literature will tend to over-estimate them (see demonstration here), a phenomenon exacerbated by small sample sizes (Button et al. 2013).

Can’t say it enough: small n is bad for science if the goal is to provide accurate estimates of effect sizes.

To determine how the precision and accuracy of your results depend on sample size, the best approach is to perform simulations, providing some assumptions about the shape of the population distributions.

References

Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S. & Munafo, M.R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature reviews. Neuroscience, 14, 365-376.

Lakens, D. (2013) Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Front Psychol, 4, 863.

Wilcox, R.R. (2017) Introduction to Robust Estimation and Hypothesis Testing. Academic Press, 4th edition., San Diego, CA.

Reaction times and other skewed distributions: problems with the mean and the median (part 3/4)

Bias is defined as the distance between the mean of the sampling distribution (here estimated using Monte-Carlo simulations) and the population value. In part 1 and part 2, we saw that for small sample sizes, the sample median provides a biased estimation of the population median, which can significantly affect group comparisons. However, this bias disappears with large sample sizes, and it can be corrected using a bootstrap bias correction. In part 3, we look in more detail at the shape of the sampling distributions, which was ignored by Miller (1988).

Sampling distributions

Let’s consider the sampling distributions of the mean and the median for different sample sizes and ex-Gaussian distributions with skewness ranging from 6 to 92 (Figure 1). When skewness is limited (6, top row), the sampling distributions are symmetric and centred on the population values: there is no bias. As we saw previously, with increasing sample size, variability decreases, which is why studies with larger samples provide more accurate estimations. The flip side is that studies with small samples are much noisier, which is why their results tend not to replicate…

Figure 1

When skewness is large (92, middle row), sampling distributions get more positively skewed with decreasing sample sizes. To better understand how the sampling distributions change with sample size, we turn to the last row of Figure 1, which shows 50% highest-density intervals (HDI). Each horizontal line is a HDI for a particular sample size. The labels contain the values of the interval boundaries. The coloured vertical tick inside the interval marks the median of the distribution. The red vertical line spanning the entire plot is the population value.

Means For small sample sizes, the 50% HDI is offset to the left of the population mean, and so is the median of the sampling distribution. This demonstrates that the typical sample mean tends to under-estimate the population mean – that is to say, the mean sampling distribution is median biased. This offset reduces with increasing sample size, but is still present even for n=100.

Medians With small sample sizes, there is a discrepancy between the 50% HDI, which is shifted to the left of the population median, and the median of the sampling distribution, which is shifted to the right of the population median. This contrasts with the results for the mean, and can be explained by differences in the shapes of the sampling distributions, in particular the larger skewness and kurtosis of the median sampling distribution compared to that of the mean (see code on github for extra figures). The offset between 50% HDI and the population reduces quickly with increasing sample size. For n=10, the median bias is already very small. From n=15, the median sample distribution is not median bias, which means that the typical sample median is not biased.

Another representation of the sampling distributions is provided in Figure 2: 50% HDI are shown as a function of sample size. For both the mean and the median, bias increase with increasing skewness and decreasing sample size. Skewness also increases the asymmetry of the sampling distributions, but more for the mean than median.

Figure 2

So what’s going on here? Is the mean also biased?  According to the standard definition of bias, which is based on the distance between the population mean and the average of the sampling distribution of the mean, the mean is not biased. But this definition applies to the long run, after we replicate the same experiment many times. In practice, we never do that. So what happens in practice, when we perform only one experiment? In that case, the median of the sampling distribution provides a better description of the typical experiment than the mean of the distribution. And the median of the sample distribution of the mean is inferior to the population mean when sample size is small. So if you conduct one small n experiment and compute the mean of a skewed distribution, you’re likely to under-estimate the true value.

Is the median biased after all? The median is indeed biased according to the standard definition. However, with small n, the typical median (represented by the median of the sampling distribution of the median) is close to the population median, and the difference disappears for even relatively small sample sizes.

Is it ok to use the median then?

If the goal is to accurately estimate the central tendency of a RT distribution, while protecting against the influence of outliers, the median is far more efficient than the mean (Wilcox & Rousselet, 2018). Providing sample sizes are large enough, bias is actually not a problem, and the typical bias is actually very small, as we’ve just seen. So, if you have to choose between the mean and the median, I would go for the median without hesitation.

It’s more complicated though. In an extensive series of simulations, Ratcliff (1993) demonstrated that when performing standard group ANOVAs, the median can lack power compared to other estimators. Ratcliff’s simulations involved ANOVAs on group means, in which for each participant, very few trials (7 to 12) are available for each condition. Based on the simulations, Ratcliff recommended data transformations or computing the mean after applying specific cut-offs to maximise power. However, these recommendations should be considered with caution because the results could be very different with more realistic sample sizes. Also, standard ANOVA on group means are not robust, and alternative techniques should be considered (Wilcox 2017). Data transformations are not ideal either, because they change the shape of the distributions, which contains important information about the nature of the effects. Also, once data are transformed, inferences are made on the transformed data, not on the original ones, an important caveat that tends to be swept under the carpet in articles’ discussions… Finally, truncating distributions introduce bias too, especially with the mean – see next section (Miller 1991; Ulrich & Miller 1994)!

At this stage, I don’t see much convincing evidence against using the median of RT distributions, if the goal is to use only one measure of location to summarise the entire distribution. Clearly, a better alternative is to not throw away all that information, by studying how entire distributions differ (Rousselet et al. 2017). For instance, explicit modelling of RT distributions can be performed with the excellent `brms` R package.

Other problems with the mean

In addition to being median biased, and a poor measure of central tendency for asymmetric distributions, the mean is also associated with several other important problems. Standard procedures using the mean lack of power, offer poor control over false positives, and lead to inaccurate confidence intervals. Detailed explanations of these problems are provided in Field & Wilcox (2017) and Wilcox & Rousselet (2018) for instance. For detailed illustrations of the problems associated with means in the one-sample case, when dealing with skewed distributions, see the companion reproducibility package on figshare.

If that was not enough, common outlier exclusion techniques lead to bias estimation of the mean (Miller, 1991). When applied to skewed distributions, removing any values more than 2 or 3 SD from the mean affects slow responses more than fast ones. As a consequence, the sample mean tends to underestimate the population mean. And this bias increases with sample size because the outlier detection technique does not work for small sample sizes, which results from the lack of robustness of the mean and the SD. The bias also increases with skewness. Therefore, when comparing distributions that differ in sample size, or skewness, or both, differences can be masked or created, resulting in inaccurate quantification of effect sizes.

Truncation using absolute thresholds (for instance RT < 300 ms or RT > 1,200 ms) also leads to potentially severe bias of the mean, median, standard deviation and skewness of RT distributions (Ulrich & Miller 1994). The median is much less affected by truncation bias than the mean though.

In the next and final post of this series, we will explore sampling bias in a real dataset, to see how much of a problem we’re really dealing with. Until then, thanks for reading.

[GO TO POST 4/4]

References

Field, A.P. & Wilcox, R.R. (2017) Robust statistical methods: A primer for clinical psychology and experimental psychopathology researchers. Behav Res Ther, 98, 19-38.

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

Miller, J. (1991) Reaction-Time Analysis with Outlier Exclusion – Bias Varies with Sample-Size. Q J Exp Psychol-A, 43, 907-912.

Ratcliff, R. (1993) Methods for dealing with reaction time outliers. Psychol Bull, 114, 510-532.

Rousselet, G.A., Pernet, C.R. & Wilcox, R.R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. The European journal of neuroscience, 46, 1738-1748.

Ulrich, R. & Miller, J. (1994) Effects of Truncation on Reaction-Time Analysis. Journal of Experimental Psychology-General, 123, 34-80.

Wilcox, R.R. (2017) Introduction to Robust Estimation and Hypothesis Testing. Academic Press, 4th edition., San Diego, CA.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8 42 41-48 42 30.