When we estimate power curves, we ask the following question: given some priors about the data generating process, the nature of the effect and the measurement variance, what is the probability of detecting an effect with a given statistical test (say, using an arbitrary p<0.05 threshold) for various sample sizes and effect sizes? While there are very good reasons to focus on power estimation, it is not the only, or the most important, aspect of an experimental procedure to consider (Gelman & Carlin, 2014). Indeed, finding the number of observations needed to get p<0.05 in, say, 87% of experiments is not the most exciting part of designing an experiment.
The relevant question is not “What is the power of a test?” but rather is “What might be expected to happen in studies of this size?” (Gelman & Carlin, 2014)
A related but more important question is that of measurement precision: given some priors and a certain number of participants, how close can we get to the unknown population value (Maxwell et al., 2008; Schönbrodt & Perugini, 2013; Peters & Crutzen, 2017)? Not surprisingly, measurement precision depends on sample size. As we saw in previous posts, sampling distributions get narrower with increasing sample sizes:
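This narrowing is easy to check with a quick simulation. Here is a minimal sketch in Python; the lognormal parameters are arbitrary stand-ins for RT-like (positively skewed) data, not estimates from any real dataset:

```python
import numpy as np

rng = np.random.default_rng(42)

# RT-like skewed population: lognormal, values roughly in ms.
# For several sample sizes, simulate the sampling distribution of
# the median and measure its spread.
sd_of_medians = {}
for n in (10, 50, 200):
    medians = np.array([np.median(rng.lognormal(6, 0.4, size=n))
                        for _ in range(5000)])
    sd_of_medians[n] = medians.std()
    print(f"n = {n:3d}: SD of sample medians = {sd_of_medians[n]:.1f} ms")
```

The standard deviation of the sample medians shrinks as n grows, which is exactly the narrowing of the sampling distribution illustrated in the figure.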
And with narrower sampling distributions, measurement precision increases. To illustrate, let’s consider an example from the French Lexicon Project (Ferrand et al., 2010), a lexical decision task in which hundreds of reaction times (RT) were measured in hundreds of participants who had to distinguish between words and non-words presented on a computer screen.
Here are examples of RT distributions from 100 participants for each condition:
If we save the median of each distribution, for each participant and condition, we get these positively skewed group level distributions:
The distribution of pairwise differences between medians is also positively skewed:
Notably, most participants have a positive difference: 96.4% of participants are faster in the Word than the Non-Word condition – a potential case of stochastic dominance (Rouder & Haaf, 2018; see also this summary blog post).
Now let’s say we want to estimate the group difference between conditions. Because of the skewness at each level of analysis (within and across participants), we estimate the central tendency at each level using the median: that is, we compute the median for each participant and each condition, then compute the median of the medians across participants (a more detailed assessment could be obtained by performing hierarchical modelling or multiple quantile estimation, for instance).
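To make the two-level estimation concrete, here is a minimal sketch with synthetic data. All parameters are made up for illustration (the real analysis uses the lexical decision trials); the lognormal distributions simply mimic skewed RT distributions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical RT samples (ms): 20 participants x 100 trials per condition,
# with the Non-Word condition shifted to be slower on average.
n_part, n_trials = 20, 100
word = rng.lognormal(6.3, 0.4, size=(n_part, n_trials))
nonword = rng.lognormal(6.4, 0.4, size=(n_part, n_trials))

# Level 1: median per participant and condition.
md_word = np.median(word, axis=1)
md_nonword = np.median(nonword, axis=1)

# Level 2: median across participants of the pairwise differences.
diffs = md_nonword - md_word
group_diff = np.median(diffs)
prop_pos = np.mean(diffs > 0)  # echoes the stochastic dominance check

print(f"group median difference: {group_diff:.1f} ms")
print(f"proportion of positive differences: {prop_pos:.2f}")
```

The proportion of positive pairwise differences is the same quantity used above to flag potential stochastic dominance.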
Then we can assess measurement precision at the group level by performing a multi-level simulation. In this simulation we can ask, for instance, how often the group estimate lands within 10 ms of the population value across many experiments. To simplify, in each iteration of the simulation we draw 200 trials per condition and participant, compute the medians and save the Non-Word – Word difference. Group estimation of the difference is then based on a random sample of 10 to 300 participants, with the group median computed across participants’ differences between medians. Because the dataset is very large at both levels of analysis, we can pretend we have access to the population values, and define them in two steps: first, for each participant and condition, compute the median across all available trials; second, compute across participants the median of the pairwise differences.
Having defined population values (the truth we’re trying to estimate, here a group difference of 78 ms), we can calculate measurement precision as the proportion of experiments in which the group estimate is no more than X ms from the population value, with X varying from 5 to 40 ms. Here are the results:
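The structure of that simulation can be sketched as follows. Since the real dataset is not included here, a synthetic stand-in population is used, with made-up gamma-distributed true differences and Gaussian trial noise; only the logic (sample participants, estimate each difference from trials, take the group median, compare to the truth) mirrors the actual simulation:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical stand-in population: each of 1000 "participants" has a
# true Non-Word - Word difference drawn from a skewed distribution (ms).
n_pop = 1000
true_diffs = 50 + rng.gamma(shape=2.0, scale=15.0, size=n_pop)
pop_value = np.median(true_diffs)  # the "truth" the simulation targets

def one_experiment(n_participants, n_trials=200, trial_sd=30.0):
    """One simulated experiment: sample participants, estimate each
    participant's difference from noisy trials, return the group median."""
    idx = rng.choice(n_pop, size=n_participants, replace=False)
    estimates = [np.median(rng.normal(true_diffs[i], trial_sd, n_trials))
                 for i in idx]
    return np.median(estimates)

# Proportion of experiments landing within 10 ms of the population value.
precision = {}
for n in (10, 50, 100):
    hits = [abs(one_experiment(n) - pop_value) <= 10 for _ in range(200)]
    precision[n] = np.mean(hits)
    print(f"n = {n:3d}: P(|estimate - truth| <= 10 ms) = {precision[n]:.2f}")
```

With more participants, a larger proportion of simulated experiments land within 10 ms of the population value, which is the pattern shown in the figure.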
Not surprisingly, the proportion of estimates close to the population value increases with the number of participants. More interestingly, the relationship is non-linear: a larger gain in precision is achieved by increasing the sample size from, say, 10 to 20 participants than from 90 to 100.
The results also let us answer useful questions for planning experiments (see the black arrows in the above figure):
• For the group estimate of the median to be no more than 10 ms from the population value in 70% of experiments, we need to test at least 56 participants.
• For the group estimate of the median to be no more than 20 ms from the population value in 90% of experiments, we need to test at least 38 participants.
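Reading such numbers off a precision curve can be automated with a small helper. The curve values below are made-up illustrative numbers, not the ones from the figure:

```python
import numpy as np

# Hypothetical precision curve: proportion of experiments within 10 ms
# of the population value, for each tested sample size (illustrative).
sizes = np.arange(10, 110, 10)
prop_within_10ms = np.array([0.28, 0.42, 0.52, 0.60, 0.66,
                             0.71, 0.75, 0.79, 0.82, 0.84])

def min_n_for_target(sizes, props, target):
    """Smallest tested sample size whose precision reaches the target,
    or None if the target is never reached."""
    ok = sizes[props >= target]
    return int(ok[0]) if ok.size else None

print(min_n_for_target(sizes, prop_within_10ms, 0.70))
```

For a finer answer, the same lookup can be applied after interpolating the simulated curve over a denser grid of sample sizes.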
Obviously, this is just an example, addressing a narrow problem related to lexical decisions. Other aspects could be considered too, for instance the width of the confidence intervals (Maxwell, Kelley & Rausch, 2008; Peters & Crutzen, 2017; Rothman & Greenland, 2018). And in your particular case, most likely, you won’t have access to a large dataset from which to run a data-driven simulation. In that case, you can get estimates of plausible effect sizes and their variability from various sources (Gelman & Carlin, 2014):
- related data;
- (systematic) literature review;
- outputs of a hierarchical model.
To model a range of plausible effect sizes and their consequences on repeated measurements, you need priors about a data generating process and how distributions differ between conditions. For instance, you could use exGaussian distributions to simulate RT data. For research on new effects, it is advised to consider a large range of potential effects, with their plausibility informed by the literature and psychological/biological constraints.
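For instance, here is a minimal sketch of exGaussian RT generation in Python. An exGaussian variate is the sum of a Gaussian and an exponential component; all parameter values below are hypothetical, chosen only to produce plausible-looking RTs:

```python
import numpy as np

rng = np.random.default_rng(3)

def rexgauss(n, mu, sigma, tau, rng):
    """exGaussian samples: Gaussian(mu, sigma) + Exponential(mean tau)."""
    return rng.normal(mu, sigma, n) + rng.exponential(tau, n)

# Hypothetical conditions (ms): a shift in mu plus a change in tau,
# i.e. a slower and more skewed Non-Word condition.
word = rexgauss(200, mu=500, sigma=50, tau=100, rng=rng)
nonword = rexgauss(200, mu=520, sigma=50, tau=130, rng=rng)
print(f"median difference: {np.median(nonword) - np.median(word):.1f} ms")
```

By sweeping mu and tau over grids of plausible values, the same two-level precision simulation can be run on fully synthetic data when no pilot dataset is available.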
Although relying on the literature alone can lead to over-optimistic expectations because of the dominance of small n studies and a bias towards significant results (Yarkoni 2009; Button et al. 2013), methods are being developed to overcome these limitations (Anderson, Kelley & Maxwell, 2017). In the end, the best cure against effect size over-estimation is a combination of pre-registration/registered reports (to diminish literature bias) and data sharing (to let anyone do their own calculations and meta-analyses).
The code is on figshare: the simulation can be reproduced using the flp_sim_precision notebook; the illustrations of the distributions can be reproduced using
Anderson, S.F., Kelley, K. & Maxwell, S.E. (2017) Sample-Size Planning for More Accurate Statistical Power: A Method Adjusting Sample Effect Sizes for Publication Bias and Uncertainty. Psychol Sci, 28, 1547-1562.
Bland, J.M. (2009) The tyranny of power: is there a better way to calculate sample size? BMJ, 339, b3985. https://www.bmj.com/content/339/bmj.b3985
Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S. & Munafò, M.R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365-376.
Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Meot, A., Augustinova, M. & Pallier, C. (2010) The French Lexicon Project: lexical decision data for 38,840 French words and 38,840 pseudowords. Behav Res Methods, 42, 488-496.
Gelman, A. & Carlin, J. (2014) Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspect Psychol Sci, 9, 641-651.
Maxwell, S.E., Kelley, K. & Rausch, J.R. (2008) Sample size planning for statistical power and accuracy in parameter estimation. Annu Rev Psychol, 59, 537-563.
Peters, G.-J.Y. & Crutzen, R. (2017) Knowing exactly how effective an intervention, treatment, or manipulation is and ensuring that a study replicates: accuracy in parameter estimation as a partial solution to the replication crisis. PsyArXiv. doi:10.31234/osf.io/cjsk2.
Rothman, K.J. & Greenland, S. (2018) Planning Study Size Based on Precision Rather Than Power. Epidemiology, 29, 599-603.
Rouder, J.N. & Haaf, J.M. (2018) Power, Dominance, and Constraint: A Note on the Appeal of Different Design Traditions. Advances in Methods and Practices in Psychological Science, 1, 19-26.
Rousselet, G.A. & Wilcox, R.R. (2018) Reaction times and other skewed distributions: problems with the mean and the median. bioRxiv. doi: https://doi.org/10.1101/383935
Rousselet, G.A. & Wilcox, R.R. (2018) Reaction times and other skewed distributions: problems with the mean and the median. figshare. Fileset. doi: https://doi.org/10.6084/m9.figshare.6911924.v1
Schönbrodt, F.D. & Perugini, M. (2013) At what sample size do correlations stabilize? J Res Pers, 47, 609-612.
Yarkoni, T. (2009) Big Correlations in Little Studies: Inflated fMRI Correlations Reflect Low Statistical Power‚ Commentary on Vul et al. (2009). Perspectives on Psychological Science, 4, 294-298.