I read, review and edit a lot of ERP papers. Many of these papers share shockingly poor figures. Here I’d like to go over a few simple steps that can help produce much more informative figures. The data and the code to reproduce all the examples are available on github.
Let’s first consider what I would call the standard ERP figure, the one available in so many ERP papers (Figure 1). It presents two paired group averages for one of the largest ERP effects on the market: the contrast between ERPs to noise textures (in black) and ERPs to face images (in grey). This standard figure is essentially equivalent to a bar graph without error bars: it is simply unacceptable. At least in this one, positive values are plotted up, not down, as can still be seen in some papers.
Figure 1. Standard ERP figure.
How can we improve this figure? As a first step, one could add some symbols to indicate at which time points the two ERPs differ significantly. So in Figure 2 I’ve added red dots marking time points at which a paired t-test gave p<0.05. The red dots appear along the x-axis so their timing is easy to read. This is equivalent to a bar graph without error bars but with little stars to mark p<0.05.
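To make this concrete, here is a minimal Python sketch of the procedure (the original examples use Matlab; the data here are simulated and all names are mine): run one paired t-test per time point and keep a boolean mask of the points you would mark with red dots.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n_participants, n_times = 20, 100

# simulated ERPs (participants x time points); condition B gets an
# added effect between samples 40 and 60
cond_a = rng.normal(0, 1, (n_participants, n_times))
cond_b = cond_a + rng.normal(0, 1, (n_participants, n_times))
cond_b[:, 40:60] += 1.0

# one paired t-test per time point
t_vals, p_vals = ttest_rel(cond_a, cond_b, axis=0)
sig = p_vals < 0.05  # mask of "significant" time points, ready to plot as dots
```

The mask `sig` is what gets drawn along the x-axis; the rest of the post explains why this on/off summary is a poor substitute for showing the effect itself.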
Figure 2. Standard figure with significant time points.
You know where this is going: next we will add confidence intervals, and then more. But it’s important to consider why Figure 2 is not good enough.
First, are significant effects that interesting? We can generate noise in Matlab or R, perform t-tests, and find significant results – that doesn’t mean we should write papers about these effects. Although no one would question that significant effects can be obtained by chance, I have yet to see a single paper in which an effect is described as a potential false positive. Anyway, more information is required about significant effects:
- do they make sense physiologically? For instance, you might find a significant ERP difference between 2 object categories at 20 ms, but that does not mean that the retina performs object categorisation;
- how many participants actually show the group effect? It is possible to get significant group effects with very few individual participants showing a significant effect themselves. Actually, with large enough sample sizes you can pretty much guarantee significant group effects;
- what is the group effect size, e.g. how large is the difference between two conditions?
- how large are effect sizes in individual participants?
- how do effect sizes compare to other known effects, or to effects observed at other time points, such as in the baseline, before stimulus presentation?
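The point about large samples is easy to check numerically. This hypothetical simulation (not from the original post) injects a tiny true effect of 0.1 standard deviations: with 5000 participants, a paired t-test flags it as significant even though the effect is negligible.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(42)
n = 5000  # a very large sample

a = rng.normal(0, 1, n)
b = a + rng.normal(0.1, 1, n)  # tiny true effect: 0.1 standard deviations

t_val, p_val = ttest_rel(a, b)
# p_val is (almost surely) below 0.05, despite a trivial effect size
```

This is why a significance marker alone, without effect sizes and individual results, says very little.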
Second, just because an effect is not statistically significant (p≥0.05) does not mean it is not there, or that you have evidence for the lack of an effect. Similarly to the previous point, we should be able to answer these questions about seemingly non-significant effects:
- how many participants do not show the effect?
- how many participants actually show an effect?
- how large are the effects in individual participants?
- is the group effect non-significant because of the lack of statistical power, e.g. due to skewness, outliers, heavy tails?
Third, most ERP papers report inferences on means using non-robust statistics. Typically, results are then discussed in very general terms as showing effects or not, following a p<0.05 cutoff. What is assumed, at least implicitly, is that the lack of significant mean differences implies that the distributions do not differ. This is clearly unwarranted, because distributions can differ in aspects other than the mean, e.g. in dispersion or in the tails, and the mean is not a robust estimator of central tendency. Thus, interpretations should be limited to what was measured: group differences in means, probably using a non-robust statistical test. That’s right, if you read an ERP paper in which the authors report:
“condition A did not differ from condition B”
the subtitle really is:
“we only measured a few time windows or peaks of interest, and we only tested group means using non-robust statistics and used poor illustrations, so there could well be interesting effects in the data, but we don’t know”.
Some of the points raised above can be addressed by making more informative figures. A first step is to add confidence intervals, which is done in Figure 3. Confidence intervals can provide a useful indication of the dispersion around the average given the inter-participant variability. But be careful with the classic confidence interval formula: it uses mean and standard deviation and is therefore not robust. I’ll demonstrate Bayesian highest density intervals in another post.
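For reference, the classic formula the warning refers to is mean ± t(0.975, n−1) × SD/√n, computed independently at every time point. A sketch with simulated data (variable names are mine, not from the original code); both the mean and the standard deviation are sensitive to outliers, which is exactly why this interval is not robust:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
n, n_times = 20, 100
erps = rng.normal(0, 1, (n, n_times))  # participants x time points

# classic 95% confidence interval: mean +/- t_crit * SD / sqrt(n),
# computed at every time point
m = erps.mean(axis=0)
sem = erps.std(axis=0, ddof=1) / np.sqrt(n)
crit = t.ppf(0.975, df=n - 1)
ci_lo, ci_hi = m - crit * sem, m + crit * sem
```

Plotting `ci_lo` and `ci_hi` around the mean gives the envelopes shown in Figures 3 and 4.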
Figure 3. ERPs with confidence intervals.
Ok, Figure 3 would look nicer with shaded areas, an example of which is provided in Figure 4 – but this is rather cosmetic. The important point is that Figures 3 and 4 are not sufficient because the difference is sometimes difficult to assess from the original conditions.
Figure 4. ERPs with nicer confidence intervals.
So in Figure 5 we present the time-course of the average difference, along with a confidence interval. This is a much more useful representation of the results. I learnt that trick in 1997, when I first visited the lab of Michele Fabre-Thorpe & Simon Thorpe in Toulouse. In that lab, we mostly looked at differences – ERP peaks were deemed uninterpretable and not really worth looking at…
Figure 5. ERP timecourses for each condition and their difference.
In Figure 5, the two vertical red lines mark the latency of the two difference peaks. They coincide with a peak from one of the two ERP conditions, which might be reassuring for folks measuring peaks. However, between the two difference peaks, there is a discrepancy between the top and bottom representations: whereas the top plot suggests small differences between the two conditions around ~180 ms, the bottom plot reveals a strong difference with a narrow confidence interval. The apparent discrepancy is due to the difficulty of mentally subtracting two time-courses. It seems that in the presence of large peaks, we tend to focus on them and neglect other aspects of the data. Figure 6 uses fake data to illustrate the relationship between two ERPs and their difference in several situations. In row 1, try to imagine the time-course of the difference from the two conditions, without looking at the solution in row 2 – it’s not as trivial as it seems.
Figure 6. Fake ERP time-courses and their differences.
Because it can be difficult to mentally subtract two time-courses, it is critical to always plot the time-course of the difference. More generally, you should plot the time-course of the effect you are trying to quantify, whatever that is.
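One more reason the difference deserves its own panel: in a paired design, the inter-participant variability that inflates each condition's confidence interval cancels out in the subtraction, so the CI of the difference can be far narrower than the two overlapping condition CIs suggest. A simulated illustration (all the numbers are invented):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(2)
n, n_times = 20, 100

# large participant-specific background shared by both conditions,
# plus a small condition effect of 0.5
shared = rng.normal(0, 2, (n, n_times))
cond_a = shared + rng.normal(0, 0.2, (n, n_times))
cond_b = shared + rng.normal(0, 0.2, (n, n_times)) + 0.5

diff = cond_b - cond_a  # paired difference, per participant
crit = t.ppf(0.975, df=n - 1)
sem_diff = diff.std(axis=0, ddof=1) / np.sqrt(n)
ci_lo = diff.mean(axis=0) - crit * sem_diff
ci_hi = diff.mean(axis=0) + crit * sem_diff

# for comparison: the (much larger) standard error of a single condition
sem_a = cond_a.std(axis=0, ddof=1) / np.sqrt(n)
```

Here the shared background dominates `sem_a`, while `sem_diff` reflects only the condition-specific noise – the situation behind the top/bottom discrepancy discussed for Figure 5.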
We can make another important observation from Figure 5: there are large differences before the ERP peaks at ~140-180 ms shown in the top plot. Without showing the time-course of the difference, it is easy to underestimate potentially large effects occurring before or after peaks.
So, are we done? Well, as much as Figure 5 is a great improvement on the standard figure, in a lot of situations it is not sufficient, because it does not portray individual results. This is essential to interpret significant and non-significant results. For instance, in Figure 5, there is a non-significant negative group difference ~100 ms, and a large positive difference ~120 to 280 ms. What do they mean? The answer is in Figure 7: a small number of participants seem to have clear differences ~100 ms despite the lack of a significant group effect, and all participants have a positive difference ~120 to 250 ms post-stimulus. There are also large individual differences at most time points. So Figure 7 presents a much richer and more compelling story than the group averages on their own.
Figure 7. A more detailed look at the group results. In the middle panel, individual differences are shown in grey and the group mean and its confidence interval are superimposed in red. The lower panel shows at every time point the proportion of participants with a positive difference.
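The proportion shown in the lower panel of Figure 7 is cheap to compute: at each time point, count the participants whose difference is above zero. A hypothetical sketch on simulated differences:

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_times = 20, 100

# simulated paired differences; every participant is shifted positive
# between samples 30 and 70
diff = rng.normal(0, 1, (n, n_times))
diff[:, 30:70] += 1.5

# proportion of participants with a positive difference, per time point
prop_pos = (diff > 0).mean(axis=0)
```

A flat line near 0.5 indicates no consistent effect across participants; values near 1 (or 0) indicate an effect that nearly everyone shows.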
Given the presence of a few participants with differences ~100 ms but the lack of significant group effects, it is interesting to consider participants individually, as shown in Figure 8. There, we can see that participants 6, 13, 16, 17 and 19 have a negative difference ~100 ms, unlike the rest of the participants. These individual differences are wiped out by the group statistics. Of course, in this example we cannot conclude that there is something special about these participants, because we only looked at one electrode: other participants could show similar effects at other electrodes. I’ll demonstrate how to assess effects potentially spread across electrodes in another post.
Figure 8. ERP differences with 95% confidence intervals for every participant.
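Unlike the group CIs above, the per-participant intervals in Figure 8 are computed within each participant, across trials. A sketch of that structure with simulated single-trial differences (the trial count and effect size are invented):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(5)
n_participants, n_trials, n_times = 5, 50, 100
crit = t.ppf(0.975, df=n_trials - 1)

results = []
for p in range(n_participants):
    # single-trial condition differences for one participant
    trials = rng.normal(0.3, 1.0, (n_trials, n_times))
    m = trials.mean(axis=0)
    sem = trials.std(axis=0, ddof=1) / np.sqrt(n_trials)
    results.append((m, m - crit * sem, m + crit * sem))
```

Each tuple (mean, lower, upper) is one participant's panel: a difference time-course with its own 95% confidence envelope.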
To conclude: in my own research, I have seen numerous examples of large discrepancies between plots of individual results and plots of group results, such that in certain cases group averages do not represent any particular participant. For this reason, and because most ERP papers do not illustrate individual participants and use non-robust statistics, I simply do not trust them.
Finally, I do not see the point of measuring ERP peaks. It is trivial to perform analyses at all time points and sensors to map the full spatial-temporal distribution of the effects. Limiting analyses to peaks is a waste of data and defeats the purpose of using EEG or MEG for their temporal resolution.
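To back up the claim that analysing every time point and sensor is trivial: the per-time-point t-test vectorises directly over a sensors × time grid in a single call (this sketch uses simulated data and deliberately ignores the multiple-comparison correction you would need in practice):

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(7)
n_participants, n_sensors, n_times = 20, 64, 200

# simulated data: participants x sensors x time points, no true effect
cond_a = rng.normal(0, 1, (n_participants, n_sensors, n_times))
cond_b = cond_a + rng.normal(0, 1, (n_participants, n_sensors, n_times))

# one paired t-test per sensor and time point, in a single vectorised call
t_vals, p_vals = ttest_rel(cond_a, cond_b, axis=0)
```

With no true effect, roughly 5% of the 64 × 200 tests come out "significant" at p<0.05 – a reminder of why mass-univariate maps need correction for multiple comparisons.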
References
Allen et al. 2012 is a very good reference for making better figures in general, and it includes an ERP example, although they do not make the most crucial recommendation: to plot the time-course of the difference.
For one of the best examples of clear ERP figures, including figures showing individual participants, check out Kovalenko, Chaumon & Busch 2012.
I have discussed issues with ERP figures and analyses here and here. And here are probably some of the most detailed figures of ERP results you can find in the literature – brace yourself for figure overkill.