Have you ever seen accurate bar graphs of percent correct data? Or of other bounded quantities, such as average scores from an ordinal scale (for instance a 1-9 Likert scale)? It is entirely possible that you have never seen accurate bar graphs of these quantities, because most of these graphs rely on the wrong tools: typically, the mean +/- SD or SEM is shown, or a classic confidence interval of the mean. Why are these techniques wrong? First, they use the mean, which is a non-robust estimator of central tendency; second, they use the variance, a non-robust estimator of dispersion; third, they assume symmetry; fourth, the results are not bounded, so the error bars can span impossible values, for instance percent correct beyond 100%. Participants cannot be more than 100% correct. Yet I regularly see articles with error bars beyond 100% correct, and authors, reviewers and editors seem to be ok with that.
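To make the fourth point concrete, here is a minimal sketch in Python. The scores are made up for illustration, and the use of NumPy and SciPy is my assumption, not something the tools above require. It shows how both mean + SD and a classic t-based confidence interval can cross 100% when performance is near ceiling.

```python
import numpy as np
from scipy import stats

# Hypothetical percent-correct scores from six participants near ceiling
scores = np.array([80, 95, 98, 100, 100, 100], dtype=float)

mean = scores.mean()
sd = scores.std(ddof=1)              # sample standard deviation
sem = sd / np.sqrt(scores.size)      # standard error of the mean

# Classic t-based 95% confidence interval of the mean (assumes symmetry)
t_crit = stats.t.ppf(0.975, df=scores.size - 1)
ci_upper = mean + t_crit * sem

print(f"mean + SD    = {mean + sd:.1f}")
print(f"CI upper end = {ci_upper:.1f}")
```

Both printed values land above 100 for these data, even though no participant can score above 100%.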
How do we fix the problem? There are four simple answers, and one more elaborate one:
Do not use bar graphs, use scatterplots instead. There is absolutely no reason why you should have to report means + error bars and hide your data.

Use a percentile bootstrap confidence interval – it will not produce boundaries with impossible values and will accommodate asymmetric distributions. If there is skewness or outliers, the mean will produce misleading results – use a robust estimator of central tendency instead, for instance the median or a trimmed mean (Wilcox & Keselman, 2003).
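
A percentile bootstrap of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the function name `percentile_bootstrap_ci`, the 20% trimming, the 2000 resamples and the example scores are all my own choices.

```python
import numpy as np
from scipy import stats

def percentile_bootstrap_ci(x, estimator, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for any estimator."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Resample participants with replacement, one row per bootstrap sample
    samples = rng.choice(x, size=(n_boot, x.size), replace=True)
    boot = np.apply_along_axis(estimator, 1, samples)
    # The interval bounds are simply quantiles of the bootstrap estimates
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return estimator(x), lo, hi

def trimmed_mean_20(x):
    # 20% trimmed mean: drop the lowest and highest 20% of values
    return stats.trim_mean(x, 0.2)

# Hypothetical percent-correct scores, skewed towards the ceiling
scores = np.array([72, 85, 90, 93, 95, 96, 97, 98, 99, 100, 100, 100],
                  dtype=float)

est, lo, hi = percentile_bootstrap_ci(scores, trimmed_mean_20)
print(f"trimmed mean = {est:.1f}, 95% CI = [{lo:.1f}, {hi:.1f}]")
```

Because every bootstrap estimate is computed from observed scores, the bounds cannot fall outside the range of the data, and the interval is free to be asymmetric around the estimate.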

Use a binomial proportion confidence interval such as the Jeffreys interval. A quick Google search indicates it is available in several R packages.
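
For readers who do not use R, the Jeffreys interval is easy to compute directly: it is made of quantiles of a Beta(k + 0.5, n - k + 0.5) distribution, where k is the number of correct trials out of n. The sketch below assumes Python with SciPy; the function name `jeffreys_interval` and the trial counts are hypothetical. Note that it applies to a single proportion, for instance one participant's trials.

```python
from scipy.stats import beta

def jeffreys_interval(k, n, alpha=0.05):
    """Jeffreys interval for a binomial proportion (k successes out of n)."""
    a, b = k + 0.5, n - k + 0.5
    # Usual convention: pin the bound to 0 or 1 when k is at the extreme
    lower = 0.0 if k == 0 else float(beta.ppf(alpha / 2, a, b))
    upper = 1.0 if k == n else float(beta.ppf(1 - alpha / 2, a, b))
    return lower, upper

# Hypothetical participant: 48 correct trials out of 50
lo, hi = jeffreys_interval(48, 50)
print(f"48/50 correct: 95% interval = [{100 * lo:.1f}%, {100 * hi:.1f}%]")

# At ceiling the upper bound is exactly 100%, never beyond
print(jeffreys_interval(50, 50))
```

If you would rather not write this yourself, I believe `proportion_confint` in statsmodels offers the same interval via `method="jeffreys"`, but check the documentation of your installed version.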

Compute d’ instead of percent correct: you will get a measure of sensitivity independent of bias, and on a continuous scale amenable to regular confidence interval calculations.
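
As a rough sketch, for a yes/no task d’ is the z-transform of the hit rate minus the z-transform of the false alarm rate. The Python code below assumes SciPy, and it applies a log-linear correction (adding 0.5 to each count and 1 to each total) so that perfect performance does not give an infinite d’. That correction is my assumption for the example, one of several possible choices, and the trial counts are made up.

```python
import numpy as np
from scipy import stats

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' for a yes/no task, with a log-linear correction (an assumption)."""
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return stats.norm.ppf(hit_rate) - stats.norm.ppf(fa_rate)

# Hypothetical counts per participant:
# (hits, misses, false alarms, correct rejections)
counts = [(45, 5, 8, 42), (50, 0, 3, 47), (38, 12, 10, 40), (47, 3, 5, 45)]
dprimes = np.array([d_prime(*c) for c in counts])

# Regular t-based 95% confidence interval across participants
mean = dprimes.mean()
sem = stats.sem(dprimes)
lo, hi = stats.t.interval(0.95, dprimes.size - 1, loc=mean, scale=sem)
print(f"mean d' = {mean:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

The second participant has a 100% hit rate, which would give an infinite d’ without some correction of this kind.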

Use a generalised mixed model, for instance a logit mixed model (Jaeger, 2008).
References
Jaeger, T.F. (2008) Categorical Data Analysis: Away from ANOVAs (transformation or not) and towards Logit Mixed Models. J Mem Lang, 59, 434-446.
Wilcox, R.R. & Keselman, H.J. (2003) Modern Robust Data Analysis Methods: Measures of Central Tendency. Psychological Methods, 8, 254-274.
A Wilson or Jeffreys interval is a reasonable choice and probably a simpler fix than bootstrapping.
Don’t get me wrong, I completely agree, but insisting on these simpler fixes would probably lead to less resistance in the review process.
Thanks for the suggestion Chris. There is a good selection of alternatives here, including Jeffreys interval.
Wilcox 2012 also has a few options for binomial fits, but I’ve never experimented with them, so I don’t feel confident making a recommendation.
At least in the case of percent correct, an even better alternative is to report d’ instead. Unlike percent correct, d’ is a linear measure of capacity where d’=3 is actually 3 times better than d’=1. In any case, all of your points still stand.
Very good point! Thanks for the reminder.
What a strange comment about d’ – I am not at all clear on what you mean. In what sense do you mean d’ is linear? For a yes/no task d’ is the z-transform of hits minus the z-transform of false alarms. That is a nonlinear transform and relies heavily on assumptions about the observer (e.g., equal variances) and the task. Lastly, at high levels of performance d’ has an even more troublesome issue – at 100% it is infinite. I am an advocate for SDT but I don’t know how it is the solution here…
d’ is not without problems, and it rests on strong underlying assumptions. But across participants, d’ is continuous and unbounded, so a confidence interval method that assumes continuous variables should be fine.