There are many changes necessary to improve the quality of neuroscience & psychology research. Suggestions abound to increase science openness, promote better experimental designs, and educate researchers about statistical inferences. These changes are necessary and will take time to implement. As part of this process, here, I’d like to propose one simple step to dramatically improve the assessment of statistical results in psychology & neuroscience: to ban bar graphs.
The benefits of illustrating data distributions have been emphasised in many publications, and the topic is often covered in one of the first chapters of introductory statistics books. One of the most striking examples is provided by Anscombe’s quartet, in which very different distributions are associated with the same summary statistics.
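The quartet is easy to check numerically. The sketch below uses the published Anscombe (1973) values and only the Python standard library; `pearson_r` and `summary` are names chosen for this illustration.

```python
from statistics import mean, variance

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(x, y):
    """Pearson correlation computed from sums of squares and cross-products."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Mean of y, sample variance of y, and correlation for each dataset.
summary = {
    name: (mean(y), variance(y), pearson_r(x, y))
    for name, (x, y) in quartet.items()
}

for name, (m, v, r) in summary.items():
    print(f"{name}: mean(y)={m:.2f}  var(y)={v:.2f}  r={r:.2f}")
```

The four rows of summary statistics are nearly identical, yet a scatterplot of each dataset looks completely different: that is exactly the information a bar graph throws away.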
Moving away from bar graphs can achieve a badly needed departure from current statistical standards. Indeed, using, for instance, scatterplots instead of bar graphs can help shift the emphasis from the unproductive significant vs. non-significant dichotomy to a focus on what really matters: effect sizes and individual differences. By effect size, here, I do not mean Cohen’s d and other normalised non-robust equivalents (Wilcox, 2006); I mean, literally, how big the effect is. Why does it matter? Say you have a significant group effect. It would be more informative to answer these questions as well:
- how many participants actually show an effect in the same direction as the group?
- how many participants show no effect, or an effect in the opposite direction to the group?
- is there a smooth continuum of effects across participants, or can we identify sub-clusters of participants who appear to behave differently from the rest?
- exactly how big are the individual results? For instance, what does it mean for a participant to be 20 ms faster in one condition than another? What if someone else is 40 ms faster?
Our incapacity to answer these last questions in many situations simply reflects our lack of knowledge and the poverty of our models and predictions. But this is not an excuse to hide the richness of our data behind bar graphs.
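The first questions in the list can be answered with a few lines of code once the individual differences are available. The sketch below is only an illustration: the data are invented, `describe_effects` is a hypothetical helper, and the `near_zero` tolerance of 5 ms is an arbitrary choice that would need justification in a real study.

```python
from statistics import median

# Invented paired differences (condition A minus condition B, in ms),
# one value per participant. For illustration only.
differences = [22, 35, 18, -4, 41, 2, 27, -15, 30, 1, 24, 38]


def describe_effects(diffs, near_zero=5):
    """Count participants by direction of effect relative to the group median.

    `near_zero` is an arbitrary tolerance (same units as the data): smaller
    absolute differences are counted as showing no clear effect.
    """
    group_median = median(diffs)
    group_sign = 1 if group_median >= 0 else -1
    same = sum(1 for d in diffs if d * group_sign > near_zero)
    opposite = sum(1 for d in diffs if d * group_sign < -near_zero)
    return {
        "group_median": group_median,
        "same_direction": same,
        "near_zero": len(diffs) - same - opposite,
        "opposite_direction": opposite,
        "sorted_differences": sorted(diffs),  # inspect for gaps or sub-clusters
    }


result = describe_effects(differences)
for key, value in result.items():
    print(f"{key}: {value}")
```

The sorted differences are a crude stand-in for the scatterplot: gaps in the sequence hint at sub-clusters, but a plot of the individual points remains the better tool.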
Let’s consider an example from a published paper, which I will not identify. The left panel shows the results as a bar graph alone, whereas the right panel contains both bars and scatterplots. The graphs show results from two independent groups: participants in each group were tested in two conditions, and the pairwise differences are illustrated here. For paired designs, illustrating each condition of the pair separately is inadequate to portray effect sizes because one doesn’t know which points are part of a pair. So here the authors selected the best option: to plot the differences, so that readers can appreciate effect sizes and their distributions across participants. Then they performed two mixed linear analyses, one per group, and found a significant effect for controls, and a non-significant effect in patients. These results seem well supported by the bar graph on the left, and the authors concluded that unlike controls, patients did not demonstrate the effect.
We can immediately flag two problems with this conclusion. First, the authors did not test the group interaction, which is a common fallacy (Nieuwenhuis et al. 2011). Second, the lack of significance (p>0.05) does not provide evidence for the lack of effect, again a common fallacy (see e.g. Kruschke 2013). And obviously there is a third problem: without showing the content of the bars, I would argue that no conclusion can be drawn at all. Well, in fact the authors did report the graph on the right in the above figure! Strangely, they based their conclusions on the statistical tests instead of simply looking at the data.
The data show large individual differences and overlap between the two distributions. In patients, except for 2 outliers showing large negative effects, the remaining observations are within the range observed in controls. Six patients have results suggesting an effect in the same direction as controls, 2 are near zero, 3 go in the opposite direction. So, clearly, the lack of a significant group effect in patients is not supported by the data, and arises from the use of a statistical test that is not robust to outliers.
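To see how a couple of extreme values can dominate a test on the mean, here is a minimal sketch with made-up differences. These are not the data from the paper, and the 20% trimmed mean is just one robust estimator among several; it is not a reconstruction of the authors' analysis. The function names are chosen for this illustration.

```python
from math import floor, sqrt
from statistics import mean, median, stdev

# Made-up differences, NOT the data from the paper discussed above.
typical = [8, 12, 10, 15, 9, 11, 1, -1, -4, -5, -3]
outliers = [-40, -55]
everyone = typical + outliers


def t_statistic(x):
    """One-sample t statistic against zero: mean divided by its standard error."""
    return mean(x) / (stdev(x) / sqrt(len(x)))


def trimmed_mean(x, proportion=0.2):
    """Mean after dropping the lowest and highest `proportion` of the values."""
    g = floor(proportion * len(x))
    return mean(sorted(x)[g:len(x) - g])


for label, data in (("without outliers", typical), ("with outliers", everyone)):
    print(
        f"{label}: mean={mean(data):.2f}  t={t_statistic(data):.2f}  "
        f"median={median(data)}  trimmed mean={trimmed_mean(data):.2f}"
    )
```

In this toy example, two observations are enough to flip the sign of the mean and of the t statistic, while the median and the trimmed mean stay positive. A scatterplot would reveal the same thing at a glance.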
Here is what I would conclude about this dataset: both groups show an effect, but the effect sizes tend to be larger in controls than in patients. There are large individual differences, and in both groups, not all participants seem to show an effect. Because of these inter-participant differences, larger sample sizes need to be tested to properly quantify the effect. In light of the current data, there is evidence that patients do show an effect. Finally, the potential lack of effect in certain control participants, and the rather large effects in some patients, question the use of this particular effect as a diagnostic tool.
I will describe how I would go about analysing this dataset in another post. At the moment, I would just point out that group analyses are highly questionable when groups are small and heterogeneous. In the example above, depending on the goals of the experiment, it might suffice to report the scatterplots and a verbal description, as I provided in the previous paragraph. I would definitely favour that option over reporting a single statistical test of central tendency, whether it is robust or not.
The example of the non-significant statistical test in patients illustrates a critical point: if a paper reports bar graphs and non-significant statistical analyses of the mean, not much can be concluded! There might be differences in aspects other than the mean; central tendency differences might exist, but the assumptions of the test could have been violated because of skewness or outliers, for instance. Without informative illustrations of the results, it is impossible to tell.
In my experience as reviewer and editor, once bar graphs are replaced by scatterplots (or boxplots etc.) the story can get much more interesting, subtle, convincing, or the opposite… It depends on what surprises the bars are holding. So show your data, and ask others to do the same.
“But what if I have clear effects, low within-group dispersion, and I know what I’m doing? Why can’t I use bar graphs?”
This is rather circular: unless you show the results using, for instance, scatterplots, no one knows for sure that you have clear effects and low within-group dispersion. So, if you have nothing to hide and you want to convince your readers, show your results. And honestly, how often do we get clear effects with low intra-group variability? Showing scatterplots is the start of a discussion about the nature of the results, an invitation to go beyond the significant vs. non-significant dichotomy.
“But scatterplots are ugly, they make my results look messy!”
First, your results are messy – scatterplots do not introduce messiness. Second, there is nothing stopping you from adding information to your scatterplots, for instance lines marking the quartiles of the distributions, or superimposing boxplots or many of the other options available.
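As one possible way to do this, here is a rough sketch of a scatterplot with quartile lines superimposed. It assumes matplotlib is installed; the data are invented, and `quartiles`, `jitter_positions` and `plot_groups` are hypothetical helpers written for this illustration, not a recommended recipe.

```python
from statistics import quantiles


def quartiles(values):
    """Return (Q1, median, Q3) using the standard library's default method."""
    q1, q2, q3 = quantiles(values, n=4)
    return q1, q2, q3


def jitter_positions(n, centre, width=0.15):
    """Spread n x-positions evenly around `centre` so points overlap less."""
    if n == 1:
        return [centre]
    step = 2 * width / (n - 1)
    return [centre - width + i * step for i in range(n)]


def plot_groups(groups, filename="scatter_quartiles.png"):
    """Plot every observation per group and overlay the three quartiles.

    Requires matplotlib; it is imported here so that the helpers above
    depend only on the standard library.
    """
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for i, (label, values) in enumerate(groups.items()):
        ax.scatter(jitter_positions(len(values), i), values, alpha=0.7)
        # Dotted lines for Q1 and Q3, a solid line for the median.
        for q, style in zip(quartiles(values), ("dotted", "solid", "dotted")):
            ax.hlines(q, i - 0.3, i + 0.3, colors="black", linestyles=style)
    ax.axhline(0, color="grey", linewidth=0.8)  # reference line: no effect
    ax.set_xticks(range(len(groups)))
    ax.set_xticklabels(list(groups))
    ax.set_ylabel("Difference between conditions")
    fig.savefig(filename)
    plt.close(fig)


# Invented differences for two groups, for illustration only.
groups = {
    "controls": [22, 35, 18, -4, 41, 2, 27, 30, 24, 38],
    "patients": [8, 12, 10, 15, 9, 11, 1, -1, -4, -5, -3, -40, -55],
}
for label, values in groups.items():
    print(label, quartiles(values))

# plot_groups(groups)  # uncomment to write scatter_quartiles.png
```

Every observation remains visible, and the quartile lines give the eye the same anchor a bar would, without hiding the outliers or the overlap between groups.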
Examples of more informative figures
Wilcox, R.R. (2006) Graphical methods for assessing effect size: Some alternatives to Cohen’s d. Journal of Experimental Education, 74, 353-367.
Allen, E.A., Erhardt, E.B. & Calhoun, V.D. (2012) Data visualization in the neurosciences: overcoming the curse of dimensionality. Neuron, 74, 603-608.
Weissgerber, T.L., Milic, N.M., Winham, S.J. & Garovic, V.D. (2015) Beyond bar and line graphs: time for a new data presentation paradigm. PLoS Biol, 13, e1002128.
Overview of robust methods, to go beyond ANOVAs on means
Wilcox, R.R. & Keselman, H.J. (2003) Modern Robust Data Analysis Methods: Measures of Central Tendency. Psychological Methods, 8, 254-274.
Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press.
Extremely useful resources to go Bayesian
Kruschke, J.K. (2015) Doing Bayesian data analysis: a tutorial with R, JAGS, and Stan. Academic Press, San Diego, CA.
Kruschke, J.K. (2013) Bayesian estimation supersedes the t test. J Exp Psychol Gen, 142, 573-603.
Nieuwenhuis, S., Forstmann, B.U. & Wagenmakers, E.J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci, 14, 1105-1107.