Author Archives: garstats

About garstats

Senior Lecturer at the University of Glasgow, in the Institute of Neuroscience and Psychology. I study perception, ageing, and brain dynamics using EEG. Section editor at the European Journal of Neuroscience. I'm not a statistician but when I review and edit papers, I mostly comment on statistics. So here it goes.

“Forget about getting definitive results from a single experiment; instead embrace variation, accept uncertainty, and learn what you can.”

Andrew Gelman 2018

Gelman, A. (2018) The Failure of Null Hypothesis Significance Testing When Studying Incremental Changes, and What to Do About It. Pers Soc Psychol B, 44, 16-23.

[article] [preprint]


R functions for the hierarchical shift function

The hierarchical shift function presented in the previous post is now available in the `rogme` R package. Here is a short demo.

Get the latest version of `rogme`:

# install.packages("devtools")

Load data and compute hierarchical shift function:

df <- flp # get reaction time data - for details `help(flp)`
# Compute shift functions for all participants
out <- hsf(df, rt ~ condition + participant)


Because of the large number of participants, the confidence intervals are too narrow to be visible. So let’s subset a random sample of participants to see what can happen with a more smaller sample size:

set.seed(22) # subset random sample of participants
id <- unique(df$participant) 
df <- subset(df, flp$participant %in% sample(id, 50, replace = FALSE))
out <- hsf(df, rt ~ condition + participant) 


Want to estimate the quartiles only?

out <- hsf(df, rt ~ condition + participant, qseq = c(.25, .5, .75))


Want to reverse the comparison?

out <- hsf(df, rt ~ condition + participant, todo = c(2,1))


P values are here:


P values adjusted for multiple comparisons using Hochberg’s method:


Percentile bootstrap version:

out <- hsf_pb(df, rt ~ condition + participant)

Plot bootstrap highest density intervals – default:



Plot distributions of bootstrap samples of group differences. Bootstrap distributions are shown in orange. Black dot marks the mode. Vertical black lines mark the 50% and 90% highest density intervals.




For more examples, a vignette is available on GitHub.

Feedback would be much appreciated: don’t hesitate to leave a comment or to get in touch directly.

Hierarchical shift function: a powerful alternative to the t-test

In this post I introduce a simple yet powerful method to compare two dependent groups: the hierarchical shift function. The code is on GitHub. More details are in Rousselet & Wilcox (2019), with a reproducibility package on figshare.

Let’s consider different situations in a hierarchical setting: we’ve got trials from 2 conditions in several participants. Imagine we collected data from one participant and the results look like this:


These fake reaction time data were created by sampling from ex-Gaussian distributions. Here the two populations are shifted by a constant, so we expect a uniform shift between the two samples. Later we’ll look at examples showing  differences most strongly in early responses, late responses, and in spread.

To better understand how the distributions differ, let’s look at a shift function, in which the difference between the deciles of the two conditions are plotted as a function of the deciles in condition 1 – see details in Rousselet et al. (2017). The decile differences are all negative, showing stochastic dominance of condition 2 over condition 1. The function is not flat because of random sampling and limited sample size. 


Now, let’s say we collected 100 trials per condition from 30 participants. How do we proceed? There are a variety of approaches available to quantify distribution differences. Ideally, such data would be analysed using a multi-level model, including for instance ex-Gaussian fits, random slopes and intercepts for participants, item analyses… This can be done using the lme4 or brms R packages. However, in my experience, in neuroscience and psychology articles, the most common approach is to collapse the variability across trials into a single number per participant and condition to be able to perform a paired t-test: typically, the mean is computed across trials for each condition and participant, then the means are subtracted, and the distribution of mean differences is entered into a one-sample t-test. Obviously, this strategy throws away a huge amount of information! And the results of such second-tier t-tests are difficult to interpret: a positive test leaves us wondering exactly how the distributions differ; a negative test is ambiguous – beside avoiding the ‘absence of evidence is not evidence of absence’ classic error, we also need to check if the distributions do not differ in other aspects than the mean. So what can we do?

Depending on how conditions differ, looking at other aspects of the data than the mean can be more informative. For instance, in Rousselet & Wilcox (2019), we consider group comparisons of individual medians. Considering that the median is the second quartile, looking at the other quartiles can be of theoretical interest to investigate effects in early or later parts of distributions. This could be done in several ways, for instance by making inferences on the first quartile (Q1) or the third quartile (Q3). If the goal is to detect differences anywhere in the distributions, a more systematic approach consists in quantifying differences at multiple quantiles. Here we consider the case of the deciles, but other quantiles could be used. First, for each participant and each condition, the sample deciles are computed over trials. Second, for each participant, condition 2 deciles are subtracted from condition 1 deciles – we’re dealing with a within-subject (repeated-measure) design. Third, for each decile, the distribution of differences is subjected to a one-sample test. Fourth, a correction for multiple comparisons is applied across the 9 one-sample tests. I call this procedure a hierarchical shift function. There are many options available to implement this procedure and the example used here is not the definitive answer: the goal is simply to demonstrate that a relatively simple procedure can be much more powerful and informative than standard approaches.

In creating a hierarchical shift function we need to make three choices: a quantile estimator, a statistical test to assess quantile differences across participants, and a correction for multiple comparisons technique. The deciles were estimated using type 8 from the base R quantile() function (see justification in Rousselet & Wilcox, 2019). The group comparisons were performed using a one-sample t-test on 20% trimmed means, which performs well in many situations, including in the presence of outliers. The correction for multiple comparisons employed Hochberg’s strategy (Hochberg, 1988), which guarantees that the probability of at least one false positive will not exceed the nominal level as long as the nominal level is not exceeded for each quantile. 

In Rousselet & Wilcox (2019), we consider power curves for the hierarchical shift function (HSF) and contrast them to other approaches: by design, HSF is sensitive to more types of differences than any standard approach using the mean or a single quantile. Another advantage of HSF is that the location of the distribution difference can be interrogated, which is impossible if inferences are limited to a single value.

Here is what the hierarchical shift function looks like for our uniform shift example:


The decile differences between conditions are plotted for each participant (colour coded) and the group 20% trimmed means are superimposed in black. Differences are pretty constant across deciles, suggesting a uniform shift. Most participants have shift functions entirely negative – a case of stochastic dominance of one condition over the other. There is growing uncertainty as we consider higher deciles, which is expected from measurements of right skewed distributions.

We can add confidence intervals:


P values are available in the GitHub code.

Instead of standard parametric confidence intervals, we can also consider percentile bootstrap confidence intervals (or highest density intervals), as done here:


Distributions of bootstrap estimates can be considered cheap Bayesian posterior distributions. They also contain useful information not captured by simply reporting confidence intervals.

Here we plot them using geom_halfeyeh() from tidybayes. 


The distributions of bootstrap estimates of the group 20% trimmed means are shown in orange, one for each decile. Along the base of each distribution, the black dot marks the mode and the vertical lines mark the 50% and 90% highest density intervals.

Nice hey?! Reporting a figure like that is dramatically more informative than reporting a P value and a confidence interval from a t-test!

A bootstrap approach can also be used to perform a cluster correction for multiple comparisons – see details on GitHub. Preliminary simulations suggest that the approach can provide substantial increase in power over the Hochberg’s correction – more on that in another post.

Let’s look at 3 more examples, just for fun…

Example 2: early difference

Example participant:


Shift function:


Hierarchical shift function with confidence intervals:


Percentile bootstrap estimate densities:


Example 3: difference in spread

Example participant:


Shift function:


Hierarchical shift function with confidence intervals:


Percentile bootstrap estimate densities:


Example 4: late difference

Example participant:


Shift function:


Hierarchical shift function with confidence intervals:


Percentile bootstrap estimate densities:



The hierarchical shift function can be used to achieve two goals: 

  • to screen data for potential distribution differences using p values, without limiting the exploration to a single statistics like the mean;
  • to illustrate and quantify how distributions differ.

I think of the hierarchical shift function as the missing link between t-tests and multi-level models. I hope it will help a few people make sense of their data and maybe nudge them towards proper hierarchical modelling.

R functions for the parametric hierarchical shift function are available in the rogme package. I also plan bootstrap functions. Then I’ll tackle the case of 2 independent groups, which requires a third level quantifying differences of differences.


Cluster correction for multiple dependent comparisons

In this post I explain the benefits of applying cluster based statistics, developed for brain imaging applications, to other experimental designs, in which tests are correlated. Here are some examples of such designs:

  • different groups of rats are tested with increasing concentrations of a molecule;

  • different groups of humans or the same humans are tested with stimulations of different intensities or durations (e.g. in neuro/psych it could be TMS, contrast, luminance, priming, masking, SOA);

  • pain thresholds are measured on contiguous patches of skin;

  • insects are sampled from neighbouring fields;

  • participants undergo a series of training sessions. 

In these examples, whatever is measured leads to statistical tests that are correlated in one or a combination of factors: time, space, stimulus parameters. In the frequentist framework, if the outcome of the family of tests is corrected for multiple comparisons using standard procedures (Bonferroni, Hochberg etc.), power will decrease with the number of tests. Cluster based correction for multiple comparison methods can keep false positives at the nominal level (say 0.05), without compromising power. 

These types of dependencies can also be explicitly modelled using Gaussian processes (for a Bayesian example, see McElreath, 2018, chapter 13). Cluster-based statistics are much simpler to use, but they do not provide the important shrinkage afforded by hierarchical methods…  

Cluster-based statistics

To get started, let’s consider an example involving a huge number of correlated tests. In this example, measurements are made at contiguous points in space (y axis) and time (x axis). The meaning of the axes is actually irrelevant – what matters is that the measurements are contiguous. In the figure below, left panel, we define our signal, which is composed of 2 clusters of high intensities among a sea of points with no effect (dark blue = 0). Fake measurements are then simulated by adding white noise to each point. By doing that 100 times, we obtain 100 noisy maps. The mean of these noisy maps is shown in the right  panel.


We also create 100 maps composed entirely of noise. Then we perform a t-test for independent groups at each point in the map (n=100 per group). 


What do we get? If we use a threshold of 0.05, we get two nice clusters of statistically significant tests where they are supposed to be. But we also get many false positives. If we try to get rid off the false positives by changing the thresholds, it works to some extent, but at the cost of removing true effects. Even with a threshold of 0.0005, there are still many false positives, and the clusters of true signal have been seriously truncated. 


The problem is that lowering the alpha is a brute force technique that does not take into account information we have about the data: measurement points are correlated. There is a family of techniques that can correct for multiple comparisons by taking these dependencies into account: cluster based statistics (for an introduction, see Maris & Oostenveld, 2007). These techniques control the family-wise error rate but maintain high power. The family-wise error rate (FWER) is the probably to obtain at least one significant test among a family of tests, when the null hypothesis is true.

When we use a frequentist approach and perform a family of tests, we increase the probably of reporting false positives. The multiple comparison problem is difficult to tackle in many situations because of the need to balance false positives and false negatives. Probably the best known and most widely applied correction for multiple comparison technique is Bonferroni, in which the alpha threshold is divided by the number of comparisons. However, this procedure is notoriously conservative, as it comes at the cost of lower power. Many other techniques have been proposed (I don’t know of a good review paper on this topic – please add a comment if you do).

In the example below, two time-courses are compared point-by-point. Panel a shows the mean time-courses across participants. Panel b shows the time-course of the t-test for 2 dependent groups (the same participants were tested in each condition). Panel c shows time-points at which significant t-tests were observed. Without correction, a large cluster of significant points is observed, but also a collection of smaller clusters. We know from physiology that some of these clusters are too small to be true so they are very likely false positives.


Figure 1 from Maris & Oostenveld, 2007.

If we change the significance threshold using the Bonferroni correction for multiple comparisons, in these examples we remove all significant clusters but the largest one. Good job?! The problem is that our large cluster has been truncated: it now looks like the effect starts later and ends sooner. The cluster-based inferences do not suffer from this problem.

Applied to our 2D example with two clusters embedded in noise, the clustering technique identifies 17,044 clusters of significant t-tests. After correction, only 2 clusters are significant!


So how do we compute cluster-based statistics? The next figure illustrates the different steps. At the top, we start with a time-course of F-values, from a series of point-by-point ANOVAs. Based on some threshold, say the critical F values for alpha = 0.05, we identify 3 clusters. The clusters are formed based on contiguity. For each cluster we then compute a summary statistics: it could be its duration (length), its height (maximum), or its sum. Here we use the sum. Now we ask a different question: for each cluster, is it likely to obtain that cluster sum by chance? To answer this question, we use non-parametric statistics to estimate the distribution expected by chance. 


There are several ways to achieve this goal using permutation, percentile bootstrap or bootstrap-t methods (Pernet et al., 2015). Whatever technique we use, we simulate time-courses of F values expected by chance, given the data. For each of these simulated time-courses, we apply a threshold, identify clusters, take the sum of each cluster and save the maximum sum across clusters. If we do that 1,000 times, we end up with a collection of 1,000 cluster sums (shown in the top right corner of the figure). We then sort these values and identify a quantile of interest, say the 0.95 quantile. Finally, we use this quantile as our cluster-based threshold: each original cluster sum is then compared to that threshold. In our example, out of the 3 original clusters, the largest 2 are significant after cluster-based correction for multiple comparisons, whereas the smallest one is not. 


From the description above, it is clear that using cluster-based statistics require a few choices:

  • a method to estimate the null distribution;
  • a method to form clusters;

  • a choice of cluster statistics;

  • a choice of statistic to form the null distribution (max across clusters for instance);

  • a number of resamples…

Given a set of choices, we need to check that our method does what it’s supposed to do. So let’s run a few simulations…

5 dependent groups

First we consider the case of 5 dependent groups. The 5 measurements are correlated in time or space or some other factor, such that clusters can be formed by simple proximity: 2 significant tests are grouped in a cluster if they are next to each other. Data are normally distributed, the population SD is 1, and the correlation between repeated measures is 0.75. Here is the FWER after 10,000 simulations, in which we perform 5 one-sample t-tests on means.


With correction for multiple comparisons, the probability to get at least one false positive is well above the nominal level (here 0.05). The grey area marks Bradley’s (1978) satisfactory range of false positives (between 0.025 and 0.075). Bonferroni’s and Hochberg’s corrections dramatically reduce the FWER, as expected. For n = 10, the FWER remains quite high, but drops within the acceptable range for higher sample sizes. But these corrections tend to be conservative, leading to FWER systematically under 0.05 from n = 30. Using a cluster-based correction, the FWER is near the nominal level at all sample sizes. 

The cluster correction was done using a bootstrap-t procedure, in which the original data are first mean-centred, so that the null hypothesis is true, and t distributions expected by chance are estimated by sampling the centred data with replacement 1,000 times, and each time computing a series of t-test. For each bootstrap, a max cluster sum statistics was saved and the 95th quantile of this distribution was used to threshold the original clusters.

Next we consider power. We sample from a population with 5 dependent conditions: there is no effect in conditions 1 and 5 (mean = 0), the mean is 1 for condition 3, and the mean is 0.5 for conditions 2 and 4. We could imagine a TMS experiment   where participants first receive a sham stimulation, then stimulation of half intensity, full, half, and sham again… Below is an illustration of a random sample of data from 30 participants.


If we define power as the probability to observe a significant t-test simultaneously in conditions 3, 4 and 5, we get these results:


Maximum power is always obtain in the condition without correction, by definition. The cluster correction always reaches maximum possible power, except for n = 10. In contrast, Bonferroni and Hochberg lead to lower power, with Bonferroni being the most conservative. For a desired long run power value, we can use interpolation to find out the matching sample size. To achieve at least 80% power, the minimum sample size is:

  • 39 observations for the cluster test;
  • 50 observations for Hochberg;

  • 57 observations for Bonferroni.

7 dependent groups

If we run the same simulation but with 7 dependent groups instead of 5, the pattern of results does not change, but the FWER increases if we do not apply any correction for multiple comparisons.


As for power, if we keep a cluster of effects with means 0.5, 1, 0.5 for conditions 3, 4 and 5, and zero effect for conditions 1, 2, 6 and 7, the power advantage of the cluster test increases. Now, to achieve at least 80% power, the minimum sample size is:

  • 39 observations for the cluster test;
  • 56 observations for Hochberg;

  • 59 observations for Bonferroni.


7 independent groups

Finally, we consider a situation with 7 independent groups. For instance, measurements were made in 7 contiguous fields. So the measurements are independent (done at different times), but there is spatial dependence between fields, so that we would expect that if a measurement is high in one field, it is likely to be high in the next field too. Here are the FWER results, showing a pattern similar to that in the previous examples:


The cluster correction does the best job at minimising false positives, whereas Bonferroni and Hochberg are too liberal for sample sizes 10 and 20.

To look at power, I created a simulation with a linear pattern: there is no effect in position 1, then a linear increase from 0 to a maximum effect size of 2 at position 7. Here is the sequence of effect sizes:

c(0, 0, 0.4, 0.8, 1.2, 1.6, 2)

And here is an example of a random sample with n = 70 measurements per group:


In this situation, again the cluster correction dominates the other methods in terms of power. For instance, to achieve at least 80% power, the minimum sample size is:

  • 50 observations for the cluster test;
  • 67 observations for Hochberg;

  • 81 observations for Bonferroni.



I hope the examples above have convinced you that cluster-based statistics could dramatically increase your statistical power relative to standard techniques used to correct for multiple comparisons. Let me know if you use a different correction method and would like to see how they compare. Or you could re-use the simulation code and give it a go yourself. 

Limitations: cluster-based methods make inferences about clusters, not about individual tests. Also, these methods require a threshold to form clusters, which is arbitrary and not convenient if you use non-parametric tests that do not come with p values. An alternative technique eliminates this requirement, instead forming a statistic that integrates across many potential cluster thresholds (TFCE, Smith & Nichols, 2009; Pernet et al. 2015). TFCE also affords inferences for each test, not the cluster of tests. But it is computationally much more demanding than the standard cluster test demonstrated in this post. 


Matlab code for ERP analyses is available on figshare and as part of the LIMO EEG toolbox. The code can be used for other purposes – just pretend you’re dealing with one EEG electrode and Bob’s your uncle.

R code to reproduce the simulations is available on github. I’m planning to develop an R package to cover different experimental designs, using t-tests on means and trimmed means. In the meantime, if you’d like to apply the method but can’t make sense of my code, don’t hesitate to get in touch and I’ll try to help.


Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x. 

Maris, E. & Oostenveld, R. (2007) Nonparametric statistical testing of EEG- and MEG-data. Journal of neuroscience methods, 164, 177-190.

McElreath, R. (2018) Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.

Oostenveld, R., Fries, P., Maris, E. & Schoffelen, J.M. (2011) FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput Intell Neurosci, 2011, 156869.

Pernet, C.R., Chauveau, N., Gaspar, C. & Rousselet, G.A. (2011) LIMO EEG: a toolbox for hierarchical LInear MOdeling of ElectroEncephaloGraphic data. Comput Intell Neurosci, 2011, 831409.

Pernet, C.R., Latinus, M., Nichols, T.E. & Rousselet, G.A. (2015) Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of neuroscience methods, 250, 85-93.

Rousselet, Guillaume (2016): Introduction to robust estimation of ERP data. figshare. Fileset.

Smith, S.M. & Nichols, T.E. (2009) Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. Neuroimage, 44, 83-98.

Illustration of continuous distributions using quantiles

In this post I’m going to show you a few simple steps to illustrate continuous distributions. As an example, we consider reaction time data, which are typically positively skewed and can differ in different ways. Reaction time distributions are also a rich source of information to constrain cognitive theories and models. So unless the distributions are at least illustrated, this information is lost (which is typically the case when distributions are summarised using a single value like the mean). Other approaches not covered here include explicit mathematical models of decision making and fitting functions to model the shape of the distributions (Balota & Yap, 2011).

For our current example, I made up data for 2 independent groups with four patterns of differences:

  • no clear differences;

  • uniform shift between distributions;

  • mostly late differences;

  • mostly early differences.

The R code is on GitHub.


For our first visualisation, we use geom_jitter() from ggplot2. The 1D scatterplots give us a good idea of how the groups differ but they’re not the easiest to read. The main reason is probably that we need to estimate local densities of points in different regions and compare them between groups.


For the purpose of this exercise, each group (g1 and g2) is composed of 1,000 observations, so the differences in shapes are quite striking. With smaller sample sizes the evaluation of these graphs could be much more challenging.

Kernel density plots

Relative to scatterplots, I find that kernel density plots make the comparisons between groups much easier.


Improved scatterplots

Scatterplots and kernel density plots can be combined by using beeswarm plots. Here we create scatterplots shaped by local density using the geom_quasirandom() function from the ggbeeswarm package. Essentially, the function creates violin plots in which the constituent points are visible. 


To make the plots even more informative, I’ve superimposed quantiles – here deciles computed using the Harrell-Davis quantile estimator. The deciles are represented by vertical black lines, with medians shown with thicker lines. Medians are informative about the location of the bulk of the observations and comparing the lower to upper quantiles let us appreciate the amount of asymmetry within distributions. Comparing quantiles between groups give us a sense of the amount of relative compression/expansion on each side of the distributions. This information would be lost if we only compared the medians. 

Quantile plots

If we remove the scatterplots and only show the quantiles, we obtain quantile plots, which provide a compact description of how distributions differ (please post a comment if you know of older references using quantile plots). Because the quantiles are superimposed, they are easier to compare than in the previous scatterplots. To help with the group comparisons, I’ve also added plots of the quantile differences, which emphasise the different patterns of group differences.


Vincentile plots

An alternative to quantiles are Vincentiles, which are computed by sorting the data and splitting them in equi-populated bins (there is the same number of observations in each bin). Then the mean is computed for each bin (Balota et al. 2008; Jiang et al. 2004). Below means were computed for 9 equi-populated bins. As expected from the way they are computed, quantile plots and Vincentile plots look very similar for our large samples from continuous variables.


Group quantile and Vincentile plots can be created by averaging quantiles and Vincentiles across participants (Balota & Yap, 2011; Ratcliff, 1979). This will be the topic of another post.

Delta plots

Related to quantile plots and Vincentile plots, delta plots show the difference between conditions, bin by bin (for each Vincentile) along the y-axis, as a function of the mean across conditions for each bin along the x-axis (De Jong et al., 1994). Not surprisingly, these plots have very similar shapes to the quantile difference plots we considered earlier. 


Negative delta plots (nDP, delta plots with a negative slope) have received particular attention because of their theoretical importance (Ellinghaus & Miller, 2018; Schwarz & Miller, 2012).

Shift function

Delta plots are related to the shift function, a powerful tool introduced in the 1970s: it consists in plotting the difference between the quantiles of two groups as a function of the quantiles in one group, with some measure of uncertainty around the difference (Doksum, 1974; Doksum & Sievers, 1976; Doksum, 1977). It was later refined by Rand Wilcox (Rousselet et al. 2017). This modern version is shown below, with deciles estimated using the Harrell-Davis quantile estimator, and percentile bootstrap confidence intervals of the quantile differences. The sign of the difference is colour-coded (purple for negative, orange for positive).


Unlike other graphical quantile techniques presented here, the shift function affords statistical inferences because of it’s use of confidence intervals (the shift function also comes in a few Bayesian flavours). It is probably one of the easiest ways to compare entire distributions, without resorting to explicit models of the distributions. But the shift function and the other graphical methods demonstrated in this post are not meant to compete with hierarchical models. Instead, they can be used to better understand data patterns within and between participants, before modelling attempts. They also provide powerful alternatives to the mindless application of t-tests and bar graphs, helping to nudge researchers away from the unique use of the mean (or the median) and towards considering the rich information available in continuous distributions.


Balota, D.A. & Yap, M.J. (2011) Moving Beyond the Mean in Studies of Mental Chronometry: The Power of Response Time Distributional Analyses. Curr Dir Psychol Sci, 20, 160-166.

Balota, D.A., Yap, M.J., Cortese, M.J. & Watson, J.M. (2008) Beyond mean response latency: Response time distributional analyses of semantic priming. J Mem Lang, 59, 495-523.

Clarke, E. & Sherrill-Mix, S. (2016) ggbeeswarm: Categorical Scatter (Violin Point) Plots.

De Jong, R., Liang, C.C. & Lauber, E. (1994) Conditional and Unconditional Automaticity – a Dual-Process Model of Effects of Spatial Stimulus – Response Correspondence. J Exp Psychol Human, 20, 731-750.

Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Ann Stat, 2, 267-277.

Doksum, K.A. (1977) Some graphical methods in statistics. A review and some extensions. Statistica Neerlandica, 31, 53-68.

Doksum, K.A. & Sievers, G.L. (1976) Plotting with Confidence – Graphical Comparisons of 2 Populations. Biometrika, 63, 421-434.

Ellinghaus, R. & Miller, J. (2018) Delta plots with negative-going slopes as a potential marker of decreasing response activation in masked semantic priming. Psychol Res, 82, 590-599.

Jiang, Y., Rouder, J.N. & Speckman, P.L. (2004) A note on the sampling properties of the Vincentizing (quantile averaging) procedure. J Math Psychol, 48, 186-195.

Ratcliff, R. (1979) Group Reaction-Time Distributions and an Analysis of Distribution Statistics. Psychol Bull, 86, 446-461.

Rousselet, G.A., Pernet, C.R. & Wilcox, R.R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. The European journal of neuroscience, 46, 1738-1748.

Schwarz, W. & Miller, J. (2012) Response time models of delta plots with negative-going slopes. Psychon B Rev, 19, 555-574.

Planning for measurement precision, an alternative to power analyses

When we estimate power curves, we ask this question: given some priors about the data generating process, the nature of the effect and measurement variance, what is the probability to detect an effect for a given statistical test (say using an arbitrary p<0.05 threshold) for various sample sizes and effect sizes. While there are very good reasons to focus on power estimation, this is not the only or the most important aspect of an experimental procedure to consider (Gelman & Carlin, 2014). Indeed, finding the number of observations needed so that we get p<0.05 in say 87% of experiments, is not the most exciting part of designing an experiment. 

The relevant question is not “What is the power of a test?” but rather is “What might be expected to happen in studies of this size?” (Gelman & Carlin, 2014)

A related but more important question is that of measurement precision: given some priors and a certain number of participants, how close can we get to the unknown population value (Maxwell et al., 2008; Schönbrodt & Perugini, 2013; Peters & Crutzen, 2018; Trafimow, 2019)? Not surprisingly, measurement precision depends on sample size. As we saw in previous posts, sampling distributions get narrower with increasing sample sizes:

And with narrower sampling distributions, measurement precision increases. To illustrate, let’s consider an example from a lexical decision task – hundreds of reaction times (RT) were measured in hundreds of participants who had to distinguish between words and non-words presented on a computer screen.

Here are examples of RT distributions from 100 participants for each condition:


Reaction time distributions from 100 participants. Participants were randomly selected among 959. Distributions are shown for the same participants (colour coded) in the Word (A) and Non-Word (B) conditions.

If we save the median of each distribution, for each participant and condition, we get these positively skewed group level distributions:


The distribution of pairwise differences between medians is also positively skewed:


Notably, most participants have a positive difference: 96.4% of participants are faster in the Word than the Non-Word condition – a potential case of stochastic dominance (Rouder & Haaf, 2018; see also this summary blog post).

Now let say we want to estimate the group difference between conditions. Because of the skewness at each level of analysis (within and across participants), we estimate the central tendency at each level using the median: that is, we compute the median for each participant and each condition, then compute the medians of medians across participants (a more detailed assessment could be obtained by performing hierarchical modelling or multiple quantile estimation for instance).

Then we can assess measurement precision at the group level by performing a multi-level simulation. In this simulation, we can ask, for instance, how often the group estimate is no more than 10 ms from the population value across many experiments. To simplify, in each iteration of the simulation, we draw 200 trials per condition and participant, compute the median and save the Non-Word – Word difference. Group estimation of the difference is then based on a random sample of 10 to 300 participants, with the group median computed across participants’ differences between medians. Because the dataset is very large at the two level of analysis, we can pretend we have access to the population values, and define them by first computing, for each condition, the median across all available trials for each participant, second by computing across all participants the median of the pairwise differences. 

Having defined population values (the truth we’re trying to estimate, here a group difference of 78 ms), we can calculate measurement precision as the proportion of experiments in which the group estimate is no more than X ms from the population value, with X varying from 5 to 40 ms. Here are the results:


Group measurement precision for the difference between the Non-Word and Word conditions. Measurement precision was estimated by using a simulation with 10,000 iterations, 200 trials per condition and participant, and varying numbers of participants.

Not surprisingly, the proportion of estimates close to the population value increases with the number of participants. More interestingly, the relationship is non-linear, such that a larger gain in precision can be achieved by increasing sample size for instance from 10 to 20 compared to from 90 to 100. 

The results also let us answer useful questions for planning experiments (see the black arrows in the above figure):

 • So that in 70% of experiments the group estimate of the median is no more than 10 ms from the population value, we need to test at least 56 participants. 

• So that in 90% of experiments the group estimate of the median is no more than 20 ms from the population value, we need to test at least 38 participants.

Obviously, this is just an example, about a narrow problem related to lexical decisions. Other aspects could be considered too, for instance the width of the confidence intervals (Maxwell, Kelley & Rausch, 2008; Peters & Crutzen, 2017; Rothman & Greenland, 2018). And for your particular case, most likely, you won’t have access to a large dataset from which to perform a data driven simulation. In this case, you can get estimates about plausible effect sizes and their variability from various sources (Gelman & Carlin 2014):

  • related data;

  • (systematic) literature review;

  • meta-analysis;

  • outputs of a hierarchical model;

  • modelling.

To model a range of plausible effect sizes and their consequences on repeated measurements, you need priors about a data generating process and how distributions differ between conditions. For instance, you could use exGaussian distributions to simulate RT data. For research on new effects, it is advised to consider a large range of potential effects, with their plausibility informed by the literature and psychological/biological constraints.  

Although relying on the literature alone can lead to over-optimistic expectations because of the dominance of small n studies and a bias towards significant results (Yarkoni 2009; Button et al. 2013), methods are being developed to overcome these limitations (Anderson, Kelley & Maxwell, 2017). In the end, the best cure against effect size over-estimation is a combination of pre-registration/registered reports (to diminish literature bias) and data sharing (to let anyone do their own calculations and meta-analyses).


The code is on figshare: the simulation can be reproduced using the flp_sim_precision notebook, the illustrations of the distributions can be reproduced using flp_illustrate_dataset.


Anderson, S.F., Kelley, K. & Maxwell, S.E. (2017) Sample-Size Planning for More Accurate Statistical Power: A Method Adjusting Sample Effect Sizes for Publication Bias and Uncertainty. Psychol Sci, 28, 1547-1562.

Bland J.M.. The tyranny of power: is there a better way to calculate sample size?

Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S. & Munafo, M.R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature reviews. Neuroscience, 14, 365-376.

Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Meot, A., Augustinova, M. & Pallier, C. (2010) The French Lexicon Project: lexical decision data for 38,840 French words and 38,840 pseudowords. Behav Res Methods, 42, 488-496.

Gelman, A. & Carlin, J. (2014) Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspect Psychol Sci, 9, 641-651.

Maxwell, S.E., Kelley, K. & Rausch, J.R. (2008) Sample size planning for statistical power and accuracy in parameter estimation. Annu Rev Psychol, 59, 537-563.

Peters, G.-J.Y. & Crutzen, R. (2017) Knowing exactly how effective an intervention, treatment, or manipulation is and ensuring that a study replicates: accuracy in parameter estimation as a partial solution to the replication crisis. PsyArXiv. doi:10.31234/

Rothman, K.J. & Greenland, S. (2018) Planning Study Size Based on Precision Rather Than Power. Epidemiology, 29, 599-603.

Rouder, J.N. & Haaf, J.M. (2018) Power, Dominance, and Constraint: A Note on the Appeal of Different Design Traditions. Advances in Methods and Practices in Psychological Science, 1, 19-26.

Rousselet, G.A. & Wilcox, R.R. (2018) Reaction times and other skewed distributions: problems with the mean and the median. bioRxiv. doi:

Rousselet, G.; Wilcox, R. (2018): Reaction times and other skewed distributions: problems with the mean and the median. figshare. Fileset.

Schönbrodt, F.D. & Perugini, M. (2013) At what sample size do correlations stabilize? J Res Pers, 47, 609-612.

Trafimow, D. (2019) Five Nonobvious Changes in Editorial Practice for Editors and Reviewers to Consider When Evaluating Submissions in a Post p < 0.05 Universe, The American Statistician, 73:sup1, 340-345, DOI: 10.1080/00031305.2018.1537888

Yarkoni, T. (2009) Big Correlations in Little Studies: Inflated fMRI Correlations Reflect Low Statistical Power‚ Commentary on Vul et al. (2009). Perspectives on Psychological Science, 4, 294-298.