Category Archives: summary

How to illustrate a 2×2 mixed ERP design

Let’s consider a simple mixed ERP design with 2 repeated measures (2 tasks) and 2 independent groups of participants (young and older participants). The Matlab code and the data are available on github. The data are time-courses of mutual information, with one vector time-course per participant and task. These results are preliminary and have not been published yet, but you can get an idea of how we use mutual information in the lab in recent publications (Ince et al. 2016a, 2016b; Rousselet et al. 2014). The code and illustrations presented in the rest of the post are not specific to mutual information.

Our 2 x 2 experimental design could be analysed using the LIMO EEG toolbox for instance, by computing a 2 x 2 ANOVA at every time point, and correcting for multiple comparisons using cluster based bootstrap statistics (Pernet et al. 2011, 2015). LIMO EEG has been used to investigate task effects for instance (Rousselet et al. 2011). But here, instead of ANOVAs, I’d like to focus on graphical representations and non-parametric assessment of our simple group design, to focus on effect sizes and to demonstrate how a few figures can tell a rich data-driven story.

First, we illustrate the 4 cells of our design. Figure 1 shows separately each group and each task: in each cell all participants are superimposed using thin coloured lines. We can immediately see large differences among participants and between groups, with overall smaller effects (mutual information) in older participants. There also seems to be task differences, in particular in young participants, which tend to present more sustained effects past 200 ms in the expressive task than the gender task.


Figure 1

To complement the individual traces, we can add measures of central tendency. The mean is shown with a thick green line, the median with a thick black line. See how the mean can be biased compared to the median in the presence of extreme values. The median was calculated using the Harrell-Davis estimator of the 50th quantile. To illustrate the group median with a measure of uncertainty, we can add a 95% percentile bootstrap confidence interval for instance (Figure 2).


We can immediately see discrepancies between the median time-courses and their confidence intervals on the one hand, and the individual time-courses on the other hand. There are indeed many distributions of participants that can lead to the same average time-course. That’s why it is essential to show individual results, at least in some illustrations.

In our 2 x 2 design, we now have 3 aspects to consider: group differences, task differences and their interactions. We illustrate them in turn.

Age group differences for each task

We can look at the group differences in each task separately, as shown in Figure 3. The medians of each group is shown with 95% percentile bootstrap confidence intervals. On average, older participants tend to have weaker mutual information than young participants – less than half around 100-200 ms post-stimulus. This will need to be better quantified, for instance by reporting the median of all pairwise differences.


Figure 3

Under each panel showing the median + CI for each group, we plot the time-course of the group differences (young-older), with a confidence interval. For group comparisons we cannot illustrate individuals, because participants are be paired. However, we can illustrate all the bootstrap samples, shown in grey. Each sample was obtained by:

  • sampling with replacement Ny observations among Ny young observers
  • sampling with replacement No observations among No older observers
  • compute the median of each group
  • subtract the two medians

It is particularly important to illustrate the bootstrap distributions if they are skewed or contain outliers, or both, to check that the confidence intervals provide a good summary. If the bootstrap samples are very skewed, highest density intervals might be a good alternative to classic confidence intervals.

The lower panels of Figure 3 reveal relatively large group differences in a narrow window within 200 ms. The effect also appears to be stronger in the expressive task. Technically, one could also say that the effects are statistically significant, in a frequentist sense, when the 95% confidence intervals do not include zero. But not much is gained from that because some effects are large and others are small. Correction for multiple comparisons would also be required.

Task differences for each group

Figure 4 has a similar layout to Figure 3, now focusing on the task differences. The top panels suggest that the group medians don’t differ much between tasks, except maybe in young participants around 300-500 ms.


Figure 4

Because task effects are paired, we are not limited to the comparison of the medians between tasks; we can also illustrate the individual task differences and the medians of these differences [1]. These are shown in the bottom panels of Figure 4. In both groups, the individual differences are large and the time-courses of the task differences are scattered around zero, except in the young group starting around 300 ms, where most participants have positive differences (expressive > gender).

[1] When the mean is used as a measure of central tendency, these two perspectives are identical, because the difference between two means is the same as the mean of the pairwise differences. However, this is not the case for the median: the difference between medians is not the same as the median of the differences. Because we are interested in effect sizes, it is more informative to report descriptive statistics of the pairwise differences. The advantage of the Matlab code provided with this post is that instead of looking at the median, we can also look at other quantiles, thus getting a better picture of the strength of the effects.

Interaction between tasks and groups

Finally, in Figure 5 we consider the interactions between task and group factors. To do that we first superimpose the medians of the task differences with their confidence intervals (top panel). These traces are the same shown in the bottom panels of Figure 4. I can’t say I’m very happy with the top panel of Figure 5 because the two traces are difficult to compare. Essentially the don’t seem to differ much, except maybe for the late effect in young participants being higher than what is observed in older participants.


In the lower panel of Figure 5 we illustrate the age group differences (young – older) between the medians of the pairwise task differences. Again confidence intervals are also provided, along with the original bootstrap samples. Overall, there is very little evidence for a 2 x 2 interaction, suggesting that the age group differences are fairly stable across tasks. Put another way, the weak task effects don’t appear to change much in the two age groups.


Ince, R.A., Jaworska, K., Gross, J., Panzeri, S., van Rijsbergen, N.J., Rousselet, G.A. & Schyns, P.G. (2016a) The Deceptively Simple N170 Reflects Network Information Processing Mechanisms Involving Visual Feature Coding and Transfer Across Hemispheres. Cereb Cortex.

Ince, R.A., Giordano, B.L., Kayser, C., Rousselet, G.A., Gross, J. & Schyns, P.G. (2016b) A statistical framework for neuroimaging data analysis based on mutual information estimated via a gaussian copula. Hum Brain Mapp.

Pernet, C.R., Chauveau, N., Gaspar, C. & Rousselet, G.A. (2011) LIMO EEG: a toolbox for hierarchical LInear MOdeling of ElectroEncephaloGraphic data. Comput Intell Neurosci, 2011, 831409.

Pernet, C.R., Latinus, M., Nichols, T.E. & Rousselet, G.A. (2015) Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of neuroscience methods, 250, 85-93.

Rousselet, G.A., Gaspar, C.M., Wieczorek, K.P. & Pernet, C.R. (2011) Modeling Single-Trial ERP Reveals Modulation of Bottom-Up Face Visual Processing by Top-Down Task Constraints (in Some Subjects). Front Psychol, 2, 137.

Rousselet, G.A., Ince, R.A., van Rijsbergen, N.J. & Schyns, P.G. (2014) Eye coding mechanisms in early human face event-related potentials. J Vis, 14, 7.

Problems with small sample sizes

In psychology and neuroscience, the typical sample size is too small. I’ve recently seen several neuroscience papers with n = 3-6 animals. For instance, this article uses n = 3 mice per group in a one-way ANOVA. This is a real problem because small sample size is associated with:

  • low statistical power

  • inflated false discovery rate

  • inflated effect size estimation

  • low reproducibility

Here is a list of excellent publications covering these points:

Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S. & Munafo, M.R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature reviews. Neuroscience, 14, 365-376.

Colquhoun, D. (2014) An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci, 1, 140216.

Forstmeier, W., Wagenmakers, E.J. & Parker, T.H. (2016) Detecting and avoiding likely false-positive findings – a practical guide. Biol Rev Camb Philos Soc.

Small sample size also prevents us from properly estimating and modelling the populations we sample from. As a consequence, small n stops us from answering a fundamental, yet often ignored empirical question: how do distributions differ?

This important aspect is illustrated in the figure below. Columns show distributions that differ in four different ways. The rows illustrate samples of different sizes. The scatterplots were jittered using ggforce::geom_sina in R. The vertical black bars indicate the mean of each sample. In row 1, examples 1, 3 and 4 have exactly the same mean. In example 2 the means of the two distributions differ by 2 arbitrary units. The remaining rows illustrate random subsamples of data from row 1. Above each plot, the t value, mean difference and its confidence interval are reported. Even with 100 observations we might struggle to approximate the shape of the parent population. Without additional information, it can be difficult to determine if an observation is an outlier, particularly for skewed distributions. In column 4, samples with n = 20 and n = 5 are very misleading.


Small sample size could be less of a problem in a Bayesian framework, in which information from prior experiments can be incorporated in the analyses. In the blind and significance obsessed frequentist world, small n is a recipe for disaster.