One simple step to improve statistical inferences

There are many changes necessary to improve the quality of neuroscience & psychology research. Suggestions abound to increase science openness, promote better experimental designs, and educate researchers about statistical inferences. These changes are necessary and will take time to implement. As part of this process, here, I’d like to propose one simple step to dramatically improve the assessment of statistical results in psychology & neuroscience: to ban bar graphs.


The benefits of illustrating data distributions has been emphasised in many publications and is often the topic of one of the first chapters of introductory statistics books. One of the most striking example is provided by Anscombe’s quartet, in which very different distributions are associated with the same summary statistics:


Moving away from bar graphs can achieve a badly needed departure from current statistical standards. Indeed, using for instance scatterplots instead of bar graphs can help shift the emphasis from the unproductive significant vs. non-significant dichotomy to a focus on what really matters: effect sizes and individual differences. By effect size, here, I do not mean Cohen’s d and other normalised non-robust equivalents (Wilcox, 2006); I mean, literally how big the effect is. Why does it matter? Say you have a significant group effect, it would be (more) informative to answer these questions as well:

  • how many participants actually show an effect in the same direction as the group?
  • how many participants show no effect, or an effect in the opposite direction as the group?
  • is there a smooth continuum of effects across participants, or can we identify sub-clusters of participants who appear to behave differently from the rest?
  • exactly how big are the individual results? For instance, what does it mean for a participant to be 20 ms faster in one condition than another? What if someone else is 40 ms faster? Our incapacity to answer these last questions in many situations simply reflects our lack of knowledge and the poverty of our models and predictions. But this is not an excuse to hide the richness of our data behind bar graphs.

Let’s consider an example from a published paper, which I will not identify. On the left is the bar graph alone representation, whereas the right panel contains both bars and scatterplots. The graphs show results from two independent groups: participants in each group were tested in two conditions, and the pairwise differences are illustrated here. For paired designs, illustrating each condition of the pair separately is inadequate to portray effect sizes because one doesn’t know which points are part of a pair. So here the authors selected the best option: to plot the differences, so that readers can appreciate effect sizes and their distributions across participants. Then they performed two mixed linear analyses, one per group, and found a significant effect for controls, and a non-significant effect in patients. These results seem well supported by the bar graph on the left, and the authors concluded that unlike controls, patients did not demonstrate the effect.


We can immediately flag two problems with this conclusion. First, the authors did not test the group interaction, which is a common fallacy (Nieuwenhuis et al. 2011). Second, the lack of significance (p<0.05) does not provide evidence for the lack of effect, again a common fallacy (see e.g. Kruschke 2013). And obviously there is a third problem: without showing the content of the bars, I would argue that no conclusion can be drawn at all. Well, in fact the authors did report the graph on the right in the above figure! Strangely, they based their conclusions on the statistical tests instead of simply looking at the data.

The data show large individual differences and overlap between the two distributions. In patients, except for 2 outliers showing large negative effects, the remaining observations are within the range observed in controls. Six patients have results suggesting an effect in the same direction as controls, 2 are near zero, 3 go in the opposite direction. So, clearly, the lack of significant group effect in patients is not supported by the data, and arises from the use of a statistical test non-robust to outliers.

Here is what I would conclude about this dataset: both groups show an effect, but the effect sizes tend to be larger in controls than in patients. There are large individual differences, and in both groups, not all participants seem to show an effect. Because of these inter-participant differences, larger sample sizes need to be tested to properly quantify the effect. In light of the current data, there is evidence that patients do show an effect. Finally, the potential lack of effect in certain control participants, and the rather large effects in some patients, questions the use of this particular effect as a diagnostic tool.

I will describe how I would go about analysing this dataset in another post. At the moment, I would just point out that group analyses are highly questionable when groups are small and heterogenous. In the example above, depending on the goals of the experiment, it might suffice to report the scatterplots and a verbal description, as I provided in the previous paragraph. I would definitely favour that option to reporting a single statistical test of central tendency, whether it is robust or not.

The example of the non-significant statistical test in patients illustrate a critical point: if a paper reports bar graphs and non-significant statistical analyses of the mean, not much can be concluded! There might be differences in other aspects than the mean; central tendency differences might exist, but the assumptions of the test could have been violated because of skewness or outliers for instance. Without informative illustrations of the results, it is impossible to tell.

In my experience as reviewer and editor, once bar graphs are replaced by scatterplots (or boxplots etc.) the story can get much more interesting, subtle, convincing, or the opposite… It depends what surprises the bars are holding. So show your data, and ask others to do the same.

“But what if I have clear effects, low within-group dispersion, and I know what I’m doing? Why can’t I use bar graphs?”

This is rather circular: unless you show the results using, for instance, scatterplots, no one knows for sure that you have clear effects and low within-group dispersion. So, if you have nothing to hide and you want to convince your readers, show your results. And honestly, how often do we get clear effects with low intra-group variability? Showing scatterplots is the start of a discussion about the nature of the results, an invitation to go beyond the significant vs. non-significant dichotomy.

“But scatterplots are ugly, they make my results look messy!”

First, your results are messy – scatterplots do not introduce messiness. Second, there is nothing stopping you from adding information to your scatterplots, for instance lines marking the quartiles of the distributions, or superimposing boxplots or many of the other options available.


Examples of more informative figures

Wilcox, R.R. (2006) Graphical methods for assessing effect size: Some alternatives to Cohen’s d. Journal of Experimental Education, 74, 353-367.

Allen, E.A., Erhardt, E.B. & Calhoun, V.D. (2012) Data visualization in the neurosciences: overcoming the curse of dimensionality. Neuron, 74, 603-608.

Weissgerber, T.L., Milic, N.M., Winham, S.J. & Garovic, V.D. (2015) Beyond bar and line graphs: time for a new data presentation paradigm. PLoS Biol, 13, e1002128.

Overview of robust methods, to go beyond ANOVAs on means

Wilcox, R.R. & Keselman, H.J. (2003) Modern Robust Data Analysis Methods: Measures of Central Tendency. Psychological Methods, 8, 254-274.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press.

Extremely useful resources to go Bayesian

Kruschke, J.K. (2015) Doing Bayesian data analysis : a tutorial with R, JAGS, and Stan. Academic Press, San Diego, CA.

Understanding Bayes

Other references

Kruschke, J.K. (2013) Bayesian estimation supersedes the t test. J Exp Psychol Gen, 142, 573-603.

Nieuwenhuis, S., Forstmann, B.U. & Wagenmakers, E.J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci, 14, 1105-1107.


10 thoughts on “One simple step to improve statistical inferences

  1. curious neuro

    I wonder if the problem lies in expecting too much perfection from human studies. Most psychology and/or neuroscience behavior research is testing very subtle effects and this need not (and maybe should not) be consistent across populations depending on their background, history, context etc. In this example as well, throwing out those two ‘outliers’ leads to a sampling bias. Maybe that is their response to the treatment. I am not sure how exactly to go about this but I think the possible solution is to move away from group statistics and do individual statistics.


    1. garstats Post author

      I agree, with small samples, focusing on the group perspective can lead to the wrong conclusions about individual differences. In this case, I don’t think there is enough information to properly qualify the 2 seemingly outlier observations. If data from patients are much more difficult to obtain, a good strategy would be to test a large number of controls to establish a representative and informative benchmark. That way, patients could be classified as falling within, below or above the bulk of the controls – a strategy we used to study dyslexia for instance.

      I completely agree with your last idea: it would be very valuable to make statistical inferences in every participant, and then report group descriptions of these analyses. I will make some suggestions on how to go about it in a future post. Before I do that I will need to introduce a few tools – more soon. In the lab, we already use this approach to quantify effects in each participant and then perform group aggregates or comparisons of more meaningful quantities. Here are some examples:

      Rousselet, G.A., Gaspar, C.M., Wieczorek, K.P. & Pernet, C.R. (2011) Modeling Single-Trial ERP Reveals Modulation of Bottom-Up Face Visual Processing by Top-Down Task Constraints (in Some Subjects). Front Psychol, 2, 137.

      Bieniek, M.M., Bennett, P.J., Sekuler, A.B. & Rousselet, G.A. (2015) A robust and representative lower bound on object processing speed in humans. The European journal of neuroscience.

      Liked by 1 person

      1. curious neuro

        Thanks for the links, yes I am aware of your papers and toolbox (use and refer to them, great work). So then my follow up question would be: lets say there is a large N (no measurement error), and say 70% show effects in one particular direction, 10% show no effect and 20% show an effect in the opposite direction, is it incorrect then to average across all and make population level inferences? With small N, this is clearly an issue as this blog post shows. This is by far the most common approach in nearly all neuroscience research. I feel it is validated as end of the day the results should speak to the population at large (this is how drug testing is also done). Also assumes when comparing effects that errors in the linear model are normally distributed which in my experience is almost never the case, but we keep chugging along nevertheless. I have switched the past few years to non parametric stats or ML approaches and am picking up robust stats based on your recent paper.


      2. garstats Post author

        The case you describe clearly highlights that we have a huge problem with common group statistics in neuroscience. A simple way forward is to stop caring about p values and binary group decisions, and instead describe the results faithfully. Concretely, that means less story telling and more detailed graphs. If you can perform statistical analyses in every participant, then it would be valuable to report maybe more qualitative, but more accurate group descriptions. So, to answer your question: I would report group tendencies, but clearly highlight the existence of sub-groups, and report the breakdown in percentages, as you did. With larger samples, it also become possible to introduce co-variates to try to understand individual differences.


  2. Pingback: Robust effect sizes for 2 independent groups | basic statistics

  3. Pingback: how to fix erroneous error bars for percent correct data | basic statistics

  4. Pingback: How to chase ERP monsters hiding behind bars | basic statistics

  5. Pingback: A few simple steps to improve the description of neuroscience group results | basic statistics

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s