# how to fix erroneous error bars for percent correct data

Have you ever seen accurate bar graphs portrayed for percent correct data? For other bounded quantities, such as average scores from an ordinal scale (for instance a 1-9 Likert scale)? It is entirely possible that you have never seen accurate bar graphs of these quantities, because most of these graphs rely on the wrong tools: typically, the mean +/- SD or SEM is shown, or a classic confidence interval of the mean. Why are these techniques wrong? First, they use the mean, which is a non-robust estimator of central tendency; second, they use the variance, a non-robust estimator of dispersion; third, they assume symmetry; fourth, the results are not bounded, such that they can span impossible values, for instance percent correct beyond 100%. This is simply impossible: participants cannot be more than 100% correct. Yet, I regularly see articles with error bars beyond 100% correct, and authors, reviewers and editors seem to be ok with that.

How do we fix the problem? They are four simple answers, and one more elaborate:

1. Do not use bar graphs, use scatterplots instead. There is absolutely no reason why you should have to report means + error bars and hide your data.

2. Use a percentile bootstrap confidence interval – it will not produce boundaries with impossible values and will accommodate asymmetric distributions. If there is skewness or outliers, the mean will produce misleading results – use a robust estimator of central tendency instead, for instance the median or a trimmed mean (Wilcox & Keselman, 2003).

3. Use a binomial proportion confidence interval such as the Jeffreys interval. A quick google search indicates it is available in several R packages.

4. Compute d’ instead of percent correct: you will get a measure of sensitivity independent of bias, and on a continuous scale amenable to regular confidence interval calculations.

5. Use a generalised mixed model, for instance a logit mixed model (Jaeger, 2008).

## References

Jaeger, T.F. (2008) Categorical Data Analysis: Away from ANOVAs (transformation or not) and towards Logit Mixed Models. J Mem Lang, 59, 434-446.

Wilcox, R.R. & Keselman, H.J. (2003) Modern Robust Data Analysis Methods: Measures of Central Tendency. Psychological Methods, 8, 254-274.

# the percentile bootstrap

Update: more in depth coverage is available in this tutorial, including R code:

A practical introduction to the bootstrap: a versatile method to make inferences by using data-driven simulations

For a very short introduction focused on the R implementation of the bootstrap, see this other tutorial:

The percentile bootstrap: a primer with step-by-step instructions in R

To make inferences about the population mean, the percentile bootstrap can perform poorly; instead use the bootstrap-t technique.

“The bootstrap is a computer-based method for assigning measures of accuracy to statistical estimates.” Efron & Tibshirani, An introduction to the bootstrap, 1993

“The central idea is that it may sometimes be better to draw conclusions about the characteristics of a population strictly from the sample at hand, rather than by making perhaps unrealistic assumptions about the population.” Mooney & Duval, Bootstrapping, 1993

Like all bootstrap methods, the percentile bootstrap relies on a simple & intuitive idea: instead of making assumptions about the underlying distributions from which our observations could have been sampled, we use the data themselves to estimate sampling distributions. In turn, we can use these estimated sampling distributions to compute confidence intervals, estimate standard errors, estimate bias, and test hypotheses (Efron & Tibshirani, 1993; Mooney & Duval, 1993; Wilcox, 2012). The core principle to estimate sampling distributions is resampling, a technique pioneered in the 1960’s by Julian Simon (particularly inspiring is how he used dice and cards to teach resampling in statistics classes). The technique was developed & popularised by Brad Efron as the bootstrap.

Let’s consider an example, starting with this small set of 10 observations:

1.2 1.1 0.1 0.8 2.6 0.7 0.2 0.3 1.9 0.4

To take a bootstrap sample, we sample n observations with replacement. That is, given the 10 original observations above, we sample with replacement 10 observations from the 10 available. For instance, one bootstrap sample from the example above could be (sorted for convenience):

0.4 0.4 0.4 0.8 0.8 1.1 1.2 2.6 2.6 2.6

a second one:

0.1 0.3 0.4 0.8 1.1 1.2 1.2 1.9 1.9 1.9

a third one:

0.1 0.4 0.7 0.7 1.1 1.1 1.1 1.1 1.9 2.6

etc.

As you can see, in some bootstrap samples, certain observations were sampled once, others more than once, and yet others not at all. The resampling process is akin to running many experiments.

Figure 1. Bootstrap philosophy.

Essentially, we are doing fake experiments using only the observations from our sample. And for each of these fake experiments, or bootstrap sample, we can compute any estimate of interest, for instance the median. Because of random sampling, we get different medians from different draws, with some values more likely than other. After repeating the process above many times, we get a distribution of bootstrap estimates, let say 1,000 bootstrap estimates of the sample median. That distribution of bootstrap estimates is a data driven estimation of the sampling distribution of the sample median. Similarly, we can use resampling to estimate the sampling distribution of any statistics, without requiring any analytical formula. This is the major appeal of the bootstrap.

Let’s consider another example, using data from figure 5 of Harvey Motulsky’s 2014 article. We’re going to reproduce his very useful figure and add a 95% percentile bootstrap confidence interval. The data and Matlab code + pointers to R code are available on github. The file `pb_demo.m` will walk you through the different steps of bootstrap estimation, and can be used to recreate the figures from the rest of this post.

With the bootstrap, we estimate how likely we are, given the data, to obtain medians of different values. In other words, we estimate the sampling distribution of the sample median. Here is an example of a distribution of 1,000 bootstrap medians.

Figure 2. Kernel density distribution of the percentile bootstrap distribution of the sample median.

The distribution is skewed and rather rough, because of the particular data we used and the median estimator of central tendency. The Matlab code let you estimate other quantities, so for instance using the mean as a measure of central tendency would produce a much smoother and symmetric distribution. This is an essential feature of the bootstrap: it will suggest sampling distributions given the data at hand and a particular estimator, without assumptions about the underlying distribution. Thus, bootstrap sampling distributions can take many unusual shapes.

The interval, in the middle of the bootstrap distribution, that contains 95% of medians constitutes a percentile bootstrap confidence interval of the median.

Figure 3. Percentile bootstrap confidence interval of the median. CI = confidence interval.

Because the bootstrap sample distribution above is skewed, it might be more informative to report a highest-density interval – a topic for another post.

To test hypotheses, we can reject a point hypothesis if it is not included in the 95% confidence interval (a p value can also be obtained – see online code). Instead of testing a point hypothesis, or in addition, it can be informative to report the bootstrap distribution in a paper, to illustrate likely sample estimates given the data.

Now that we’ve got a 95% percentile bootstrap confidence interval, how do we know that it is correct? In particular, how many bootstrap samples do we need? The answer to this question depends on your goal. One goal might be to achieve stable results: if you repeatedly compute a confidence interval using the same data and the same bootstrap technique, you should obtain very similar confidence intervals. Going back to our example, if we take a sub-sample of the data, and compute many confidence intervals of the median, we sometimes get very different results. The figure below illustrates 7 confidence intervals of the median using the same small dataset. The upper boundaries of the different confidence intervals vary far too much:

Figure 4. Repeated calculations of the percentile bootstrap confidence interval of the median for the same dataset.

The variability is due in part to the median estimator, which introduces strong non-linearities. This point is better illustrated by looking at 1,000 sorted bootstrap median estimates:

Figure 5. Sorted bootstrap median estimates.

If we take another series of 1,000 bootstrap samples, the non-linearities will appear at slightly different locations, which will affect confidence interval boundaries. In that particular case, one way to solve the variability problem is to increase the number of bootstrap samples – for instance using 10,000 samples produces much more stable confidence intervals (see code). Using more observations also improves matters significantly.

If we get back to the question of the number of bootstrap samples needed, another goal is to achieve accurate probability coverage. That is, if you build a 95% confidence interval, you want the interval to contain the population value 95% of the time in the long run. Concretely, if you repeat the same experiment over and over, and for each experiment you build a 95% confidence interval, 95% of these intervals should contain the population value you are trying to estimate if the sample size is large enough. This can be achieved by using a conjunction of 2 techniques: a technique to form the confidence interval (for instance a percentile bootstrap), and a technique to estimate a particular quantity (for instance the median to estimate the central tendency of the distribution). The only way to find out which combo of techniques work is to run simulations covering a lot of hypothetical scenarios – this is what statisticians do for a living, and this is why every time you ask one of them what you should do with your data, the answer will inevitably be “it depends”. And it depends on the shape of the distributions we are sampling from and the number of observations available in a typical experiment in your field. Needless to say, the best approach to use in one particular case is not straightforward: there is no one-size-fits-all technique to build confidence intervals; so any sweeping recommendation should be regarded suspiciously.

The percentile bootstrap works very well, and in certain situations is the only (frequentist) technique known to perform satisfactorily to build confidence intervals of or to compare for instance medians and other quantiles, trimmed means, M estimators, regression slopes estimates, correlation coefficients (Wilcox 2012). However, the percentile bootstrap

does not perform well with all quantities, in particular with the mean (Wilcox & Keselman, 2003). You can still use the percentile bootstrap to illustrate the variability in the sample at hand, without making inferences about the underlying population. We do this in the figure below to see how the percentile bootstrap confidence interval compares to other ways to summarise the data.

Figure 6. Updated version of Motulsky’s 2014 figure 5.

This is a replication of Motulsky’s 2014 figure 5, to which I’ve added a 95% percentile bootstrap confidence interval of the mean. This figure makes a critical point: there is no substitute for a scatterplot, at least for relatively small sample sizes. Also, using the mean +/- SD, +/- SEM, with a classic confidence interval (using t formula) or with a percentile bootstrap confidence interval can provide very different impressions about the spread in the data (although it is not their primary objective). The worst representation clearly is mean +/- SEM, because it provides a very misleading impression of low variability. Here, because the sample is skewed, mean +/- SEM does not even include the median, thus providing a wrong estimation of the location of the bulk of the observations. It follows that results in an article reporting only mean +/- SEM cannot be assessed unless  scatterplots are provided, or at least estimates of skewness, bi-modality and complementary measures of uncertainty for comparison. Reporting a boxplot or the quartiles does a much better job at conveying the shape of the distribution than any of the other techniques. These representations are also robust to outliers. In the next figure, we consider a subsample of the observations from Figure 6, to which we add an outlier of increasing size: the quartiles do not move.

Figure 7. Outlier effect on the quartiles. The y-axis is truncated.

Contrary to the quartiles, the classic confidence interval of the mean is not robust, so it provides very inaccurate results. In particular, it assumes symmetry, so even though the outlier is on the right side of the distribution, both sides of the confidence interval get larger. The mean is also  pulled towards the outlier, to the point where it is completely outside the bulk of the observations. I cannot stress this enough: you cannot trust mean estimates if scatterplots are not provided.

Figure 8. Outlier effect on the classic confidence interval of the mean.

In comparison, the percentile bootstrap confidence interval of the mean performs better: only its right side, the side affected by the outlier, expends as the outlier gets larger.

Figure 9. Outlier effect on the percentile bootstrap confidence interval of the mean.

Of course, we do not have to use the mean as a measure of central tendency. It is trivial to compute a percentile bootstrap confidence interval of the median instead, which, as expected, does not change with outlier size:

Figure 10. Outlier effect on the percentile bootstrap confidence interval of the median.

## Conclusion

The percentile bootstrap can be used to build a confidence interval for any quantity, whether its sampling distribution can be estimated analytically or not. However, there is no guarantee that the confidence interval obtained will be accurate. In fact, in many situations alternative methods outperform the percentile bootstrap (such as percentile-t, bias corrected, bias corrected & accelerated (BCa), wild bootstraps). With this caveat in mind, I think the percentile bootstrap remains an amazingly simple yet powerful tool to summarise the accuracy of an estimate given the variability in the data. It is also

the only frequentist tool that performs well in many situations – see Wilcox 2012 for an extensive coverage of these situations.

Finally, it is important to realise that the bootstrap does make a very strong & unwarranted assumption: only the observations in the sample can ever be observed. For this reason, for small samples the bootstrap can produce rugged sampling distributions, as illustrated above. Rasmus Bååth wrote about the limitations of the percentile bootstrap and its link to Bayesian estimation in a blog post I highly recommend; he also provided R code for the bootstrap and the Bayesian bootstrap in another post.

## References

Efron, B. & Tibshirani Robert, J. (1993) An introduction to the bootstrap. Chapman & Hall, London u.a.

Mooney, C.Z. & Duval, R.D. (1993) Bootstrapping : a nonparametric approach to statistical inference. Sage Publications, Newbury Park, Calif. ; London.

Motulsky, H.J. (2014) Common misconceptions about data analysis and statistics. J Pharmacol Exp Ther, 351, 200-205.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, Amsterdam ; Boston.

Wilcox, R.R. & Keselman, H.J. (2003) Modern Robust Data Analysis Methods: Measures of Central Tendency. Psychological Methods, 8, 254-274.

# Robust effect sizes for 2 independent groups

When I was an undergrad, I was told that beyond a certain sample size (n=30 if I recall correctly), t-tests and ANOVAs are fine. This was a lie. I wished I had been taught robust methods and that t-tests and ANOVAs on means are only a few options among many alternatives. Indeed, t-tests and ANOVAs on means are not robust to outliers, skewness, heavy-tails, and for independent groups, differences in skewness, variance (heteroscedasticity) and combinations of these factors (Wilcox & Keselman, 2003; Wilcox, 2012). The main consequence is a lack of statistical power. For this reason, it is often advised to report a measure of effect size to determine, for instance, if a non-significant effect (based on some arbitrary p value threshold) could be due to lack of power, or reflect a genuine lack of effect. The rationale is that an effect could be associated with a sufficiently large effect size but yet fail to trigger the arbitrary p value threshold. However, this advise is pointless, because classic measures of effect size, such as Cohen’s d, its variants, and its extensions to ANOVA are not robust.

To illustrate the problem, first, let’s consider a simple situation in which we compare 2 independent groups of 20 observations, each sampled from a normal distribution with mean = 0 and standard deviation = 1. We then add a constant of progressively larger value to one of the samples, to progressively shift it away from the other. As illustrated in Figure 1, as the difference between the two groups increases, so does Cohen’s d. The Matlab code to reproduce all the examples is available here, along with a list of matching R functions from Rand Wilcox’s toolbox.

Figure 1. Examples of Cohen’s d as a function of group differences. For simplicity, I report the absolute value of Cohen’s d, here and in subsequent figures.

We can map the relationship between group mean differences and d systematically, by running a simulation in which we repeatedly generate two random samples and progressively shift one away from the other by a small amount. We get a nice linear relationship (Figure 2).

Figure 2. Linear relationship between Cohen’s d and group mean differences.

Cohen’s d appears to behave nicely, so what’s the problem? Let’s consider another example, in which we generate 2 samples of 20 observations from a normal distribution, and shift their means by a fixed amount of 2. Then, we replace the largest observation from group 2 by progressively larger values. As we do so, the difference between the means of group 1 and group 2 increases, but Cohen’s d decreases (Figure 3).

Figure 3. Cohen’s d is not robust to outliers.

Figure 4 provides a more systematic illustration of the effect of extreme values on Cohen’s d for the case of 2 groups of 20 observations. As the group difference increases, Cohen’s d wrongly suggests progressively lower effect sizes.

Figure 4. Cohen’s d as a function of group mean differences in the presence of one outlier. There is an inverse and slightly non-linear relationship between the two variables.

What is going on? Remember that Cohen’s d is the difference between the two group means divided by the pooled standard deviation. As such, neither the numerator nor the denominator are robust, so that even one unusual value can potentially significantly alter d and lead to the wrong conclusions about  effect size. In the example provided in Figure 4, d gets smaller as the mean difference increases because the denominator of d is composed of a non-robust estimator of dispersion, the variance, such that the outlier increases variability, which leads to an increase of the denominator, and thus a lower d. The outlier also has a strong effect on the mean, which leads to an increase of the numerator, and thus larger d. However, the outlier has a stronger effect on the variance than the mean: this imbalance explains the overall decrease of d with increasing outlier size. I leave it as an exercise to understand the origin of the non-linearity in Figure 4. It has to do with the differential effect of the outlier on the mean and the variance.

One could argue that the outlier value added to one of the groups could be removed, which would solve the problem. There are 3 objections to this argument:

• there are situations in which extreme values are not outliers but expected and plausible observations from a skewed or heavy tail distribution, and thus physiologically or psychologically meaningful values. In other words, what looks like an outlier in a sample of 20 observations could well look very natural in a sample of 200 observations;
• for small sample sizes, relatively small outliers could go unnoticed but still affect effect size estimation;
• outliers are not the only problem: skewness & heavy tails can affect the mean and the variance and thus d.

For instance, in some cases, two groups can differ in skewness, as illustrated in Figure 5. In the left panel, the two kernel density estimates illustrate two samples of 100 observations from a normal distribution. The two groups overlap only moderately, and Cohen’s d is high. In the right panel, group 1, with a mean of zero, is the same as in the previous panel; group 2, with a mean of 2, is almost identical to the one in the left panel, except that its largest 10% observations were replaced with slightly larger observations. As a result, the overlap between the two distributions is the same in the two panels – yet Cohen’s d is quite smaller in the second example.

Figure 5. Cohen’s d for normal & skewed distributions.

The point of this example is to illustrate the potential for discrepancies between a visual inspection of two distributions and Cohen’s d. Clearly, in Figure 5, a useful measure of effect size should provide the same estimates for the two examples. Fortunately, several robust alternatives have this desirable property, including Cliff’s delta, the Kolmogorov-Smirnov test statistic, Wilcox & Muska’s Q, and mutual information.

## Robust versions of Cohen’s d

Before going over the 4 robust alternatives listed above, it is useful to consider that Cohen’s d is part of a large family of estimators of effect size, which can be described as the ratio of a difference between two measures of central tendency (CT), over some measure of variability:

(CT1 – CT2) / variability

From this expression, it follows that robust effect size estimators can be derived by plugging in robust estimators of central tendency in the numerator and robust estimators of variability in the denominator. Several examples of such robust alternatives are available, for instance using trimmed means and Winsorised variances (Keselman et al. 2008; Wilcox 2012). R users might want to check these functions from Wilcox for instance:

• `akp.effect`
• `yuenv2`
• `med.effect`

There are also extensions of these quantities to the comparison of more than one group (Wilcox 2012).

## Robust & intuitive measures of effect sizes

In many situations, the robust effect sizes presented above can bring a great improvement over Cohen’s d and its derivatives. However, they provide only a limited perspective on the data. First, I don’t find this family of effect sizes the easiest to interpret: having to think of effects in standard deviation (or robust equivalent) units is not the most intuitive. Second, this type of effect sizes does not always answer the questions we’re most interested in (Cliff, 1996; Wilcox, 2006).

### The simplest measure of effect size: the difference

Fortunately, effect sizes don’t have to be expressed as the ratio difference / variability. The simplest effect size is simply a difference. For instance, when reporting that group A differs from group B, typically people report the mean for each group. It is also very useful to report the difference, without normalisation, but with a confidence or credible interval around it, or some other estimate of uncertainty. This simple measure of effect size can be very informative, particularly if you care about the units. It is also trivial to make it robust by using robust estimators, such as the median when dealing with reaction times and other skewed distributions.

### Probabilistic effect size and the Wilcoxon-Mann-Whitney U statistic

For two independent groups, asking by how much the central tendencies of the two groups differ is useful, but this certainly does not exhaust all the potential differences between the two groups. Another perspective relates to a probabilistic description: for instance, given two groups of observations, what is the probability that one random observation from group 1 is larger than a random observation from group 2? Given two independent variables X and Y, this probability can be defined as P(X > Y). Such probability gives a very useful indication of the amount of overlap between the two groups, in a way that is not limited to and dependent on measures of central tendency. More generally, we could consider these 3 probabilities:

• P(X > Y)
• P(X = Y)
• P(X < Y)

These probabilities are worth reporting in conjunction with illustrations of the group distributions. Also, there is a direct relationship between these probabilities and the Wilcoxon-Mann-Whitney U statistic (Birnbaum, 1956; Wilcox 2006). Given sample sizes Nx and Ny:

U / NxNy = P(X > Y) + 0.5 x P(X = Y)

In the case of two strictly continuous distributions, for which ties do not occur:

U / NxNy = P(X > Y)

### Cliff’s delta

Cliff suggested to use P(X > Y) and P(X < Y) to compute a new measure of effect size. He defined what is now called Cliff’s delta as:

delta = P(X > Y) – P(X < Y)

Cliff’s delta estimates the probability that a randomly selected observation from one group is larger than a randomly selected observation from another group, minus the reverse probability (Cliff, 1996). It is estimated as:

delta = (sum(x > y) – sum(x < y)) / NxNy

In this equation, each observation from one group is compared to each observation in the other group, and we count how many times the observations from one group are higher or lower than in the other group. The difference between these two counts is then divided by the total number of observations, the product of their sample sizes NxNy. This statistic ranges from 1 when all values from one group are higher than the values from the other group, to -1 when the reverse is true. Completely overlapping distributions have a Cliff’s delta of 0. Because delta is a statistic based on ordinal properties of the data, it is unaffected by rank preserving data transformations. Its non-parametric nature reduces the impact of extreme values or distribution shape. For instance, Cliff’s delta is not affected by the outlier or the difference in skewness in the examples from Figure 3 & 5.

For an MEEG application, we’ve used Cliff’s delta to quantify effect sizes in single-trial ERP distributions (Bieniek et al. 2015). We also used Q, presented later on in this post, but it behaved so similarly to delta that it does not feature in the paper.

An estimate of the standard error of delta can be used to compute a confidence interval for delta. When conditions differ, the statistical test associated with delta can be more  powerful than the Wilcoxon-Mann-Whitney test, which uses the wrong standard error (Cliff, 1996; Wilcox, 2006). Also, contrary to U, delta is a direct measure of effect size, with an intuitive interpretation. There are also some attempts at extending delta to handle more than two groups (e.g. Wilcox, 2011). Finally, Joachim Goedhart has provided an Excel macro to compute Cliff’s delta.

Update: Cliff’s delta is also related to the later introduced “common-language effect size” – see this post from Jan Vanhove.

### All pairwise differences

Cliff’s delta is a robust and informative measure of effect size. Because it relies on probabilities, it normalises effect sizes onto a common scale useful for comparisons across experiments. However, the normalisation gets rid of the original units. So, what if the units matter? A complementary perspective to that provided by delta can be gained by considering all the pairwise differences between individual observations from the two groups (Figure 6). Such distribution can be used to answer a very useful question: given that we randomly select one observation from each group, what is the typical difference we can expect? This can be obtained by computing for instance the  median of the pairwise differences. An illustration of the full distribution provides a lot more information: we can see how far away the bulk of the distribution is from zero, get a sense of how large differences can be in the tails…

Figure 6. Illustration of all pairwise differences. Left panel: scatterplots of the two groups of observations. One observation from group 1 (in red) is compared to all the observations from group 2 (in orange). The difference between all the pairs of observations is saved and the same process is applied to all the observations from group 1. Right panel: kernel density estimate of the distribution of all the pairwise differences between the two groups. The median of these differences is indicated by the continuous vertical line; the 1st & 3rd quartiles are indicated by the dashed vertical lines.

Something like Figure 6, in conjunction with Cliff’s delta and associated probabilities, would provide a very useful summary of the data.

### When Cohen’s d & Cliff’s delta fail

Although robust alternatives to Cohen’s d considered so far, including Cliff’s delta, can handle well situations in which 2 conditions differ in central tendency, they fail completely to describe situations like the one in Figure 7. In this example, the two distributions are dramatically different from each other, yet Cohen’s d is exactly zero, and Cliff’s delta is very close to zero.

Figure 7. Measures of effect size for two distributions that differ in spread, not in location. Cd = Cohen’s d, delta = Cliff’s delta, MI = mutual information, KS = Kolmogorov-Smirnov test statistics, Q = Wilcox & Muska’s Q.

Here the two distributions differ in spread, not in central tendency, so it would wise to estimate spread instead. This is indeed one possibility. But it would also be nice to have an estimator of effect size that can handle special cases like this one as well. Three estimators fit the bill, as suggested by the title of Figure 7.

### The Kolmogorov-Smirnov statistic

It’s time to introduce a powerful all-rounder: the Kolmogorov-Smirnov test statistic. The KS test is often mentioned to compare one distribution to a normal distribution. It can also be used to compare two independent samples. In that context, the KS test statistic is defined as the maximum of the absolute differences between the empirical cumulative distribution functions (ecdf) of the two groups. As such KS is not limited to differences in central tendency; it is also robust, independent of the shape of distributions, and provides a measure of effect size bounded between 0 and 1. Figure 8 illustrates the statistic using the example from Figure 7. The KS statistic is quite large, suggesting correctly that the two distributions differ. More generally, because it is robust and sensitive to differences located anywhere in the distributions, the KS test is a solid candidate for a default test for two independent samples. However, the KS test is more sensitive to differences in the middle of the distributions than in the tails. To correct this problem, there is also a weighted version of the KS test which provides increased sensitivity to differences in the tails of the distributions – check out the `ks` R function from Wilcox.

Figure 8. Illustration of the KS statistic for two independent samples. The top panel shows the kernel density estimates for the two groups. The lower panel shows the matching empirical cumulative distribution functions. The thick black line marks the maximum absolute difference between the two ecdfs – the KS statistic. Figure 8 is the output of the `ksstat_fig` Matlab function written for this post.

The KS statistic non-linearly increases as the difference in variance between two samples of 100 observations progressively increases (Figure 9). The two samples were drawn from a standard normal distribution and do not differ in mean.

Figure 9. Relationship between effect sizes and variance differences. The 3 measures of effect size illustrated here are sensitive to distribution differences other than central tendency, and are therefore better able to handle a variety of cases compared to traditional effect size estimates.

### Wilcox & Muska’s Q

Similarly to KS, the Q statistic is also a non-parametric measure of effect size. It ranges from 0 to 1, with chance level at 0.5. It is the probability of correctly deciding whether a randomly selected observation from one of two groups belongs to the first group, based on the kernel density estimates of the two groups (Wilcox & Muska, 1999). Essentially, it reflects the degree of separation between two groups. Again, similarly to KS, in situations in which two distributions differ in other aspects than central tendency, Q might suggest that a difference exists, whereas other methods such as Cohen’s d or Cliff’s delta would not (Figure 9).

### Mutual information

In addition to the KS statistic and Q, a third estimator can be used to quantify many sorts of differences between two or more independent samples: mutual information (MI). MI is a non-parametric measure of association between distributions. As shown in Figure 9, it is sensitive to group differences in spread. MI is expressed in bits and is quite popular in neuroscience – much more so than in psychology. MI is a powerful and much more versatile quantity than any of the tools we have considered so far. To learn more about MI, check out Robin Ince’s tutorial with Matlab & Python code and examples, with special applications to brain imaging. There is also a clear illustration of MI calculation using bins in Figure S3 of Schyns et al. 2010.

In the lab, we use MI to quantify the relationship between stimulus variability and behaviour or brain activity (e.g. Rousselet et al. 2014). This is done using single-trial distributions in every participant. Then, at the group level, we compare distributions of MI between conditions or groups of participants. We thus use MI as a robust measure of within-participant effect size, applicable to many situations. This quantity can then be illustrated and tested across participants. This strategy is particularly fruitful to compare brain activity between groups of participants, such as younger and older participants. Cliff’s delta for instance could then be used to quantify the MI difference between groups.

## Comparisons of effect sizes

We’ve covered several useful robust measures of effect size, with different properties. So, which one should be used? In statistics, the answer to this sort of questions often is “it depends”. Indeed, it depends on your needs and on the sort of data you’re dealing with. It also depends on which measure makes more sense to you. The code provided with this post will let you explore the different options using simulated data or your own data. For now, we can get a sense of the behaviour of delta, MI, KS and Q for relatively large samples of observations from a normal distribution. In Figure 10, two distributions are progressively shifted from each other.

Figure 10. Examples of effect size estimates for different distribution shifts.

Figure 11 provides a more systematic mapping of the relationship between effect size estimates and the difference between the means of two groups of 100 observations. The KS statistic and Q appear to have similar profiles, with a linear rise for small differences, before progressively reaching a plateau. In contrast, Cliff’s delta appears to be less variable and to reach a maximum earlier than KS and Q. MI differs from the other 3 quantities with its non-linear rise for small mean differences.

Figure 11. Relationship between effect sizes and mean differences.

To more clearly contrast the 4 effect sizes, all their pairwise comparisons are provided in Figure 12. From these comparisons, it seems that KS and Q are almost completely linearly related. If this is the case, then there isn’t much advantage in using Q given that it is much slower to compute than KS. Other comparisons reveal different non-linearities between estimators. These differences would certainly be worth exploring in particular experimental contexts… But enough for this post.

Figure 12. Relationship between effect sizes.

## Final notes

Given that Cohen’s d and related estimators of effect size are not robust suggests that they should be abandoned in favour of robust methods. This is not to say that Cohen’s d is of no value – for instance in the case of single-trial ERP distributions of 100s of trials, it would be appropriate (Bieniek et al. 2015). But for typical group level analyses, I see no reason to use non-robust methods such as Cohen’s d. And defending the use of Cohen’s d and related measures for the sake of continuity in the literature, so that readers can compare them across studies is completely misguided: non-robust measures cannot be compared because the same value can be obtained for different amounts of overlap between distributions. For this reason, I am highly suspicious of any attempt to perform meta-analysis or to quantify effect sizes in the literature using published values, without access to the raw data. To allow true comparisons across studies, there is only one necessary and sufficient step: to share your data.

In the literature, there is a rampant misconception assuming that statistical tests and measures of effect size are different  entities. The Kolmogorov-Smirnov test and Cliff’s delta demonstrate that both aspects can be combined elegantly. Other useful measures of effect size, such as mutual information, can be used to test hypotheses by combining them with a bootstrap or permutation approach.

Which technique to use in which situation is something best worked out by yourself, given your own data and extensive tests. Essentially, you want to find measures that are informative and intuitive to use, and that you can trust in the long run. The alternatives described in this post are not the only ones on the market, but they are robust, informative, intuitive, and they cover a lot of useful situations. For instance, if the fields of neuroscience and psychology were to use the Kolmogorov-Smirnov test as default test when comparing two independent groups, I would expect a substantial reduction in the number of false negatives reported in the literature. The Kolmogorov-Smirnov test statistic is also a useful measure of effect size on its own. But because the KS test does not tell us how two distributions differ, it requires the very beneficial addition of detailed illustrations to understand how two groups differ.  This comment applies to all the techniques described in this post, which, although useful, do not provide a full picture of the effects. Notably, they do not tell us how two distributions differ. But detailed illustrations can be combined with robust estimation to compare 2 entire distributions.

## References

Bieniek, M.M., Bennett, P.J., Sekuler, A.B. & Rousselet, G.A. (2015) A robust and representative lower bound on object processing speed in humans. The European journal of neuroscience.

Birnbaum ZW. 1955. On a use of the Mann-Whitney statistic

Cliff N. 1996. Ordinal methods for behavioral data analysis. Mahwah, N.J.: Erlbaum

Keselman HJ, Algina J, Lix LM, Wilcox RR, Deering KN. 2008. A generally robust approach for testing hypotheses and setting confidence intervals for effect sizes. Psychol Methods 13: 110-29

Rousselet, G.A., Ince, R.A., van Rijsbergen, N.J. & Schyns, P.G. (2014) Eye coding mechanisms in early human face event-related potentials. J Vis, 14, 7.

Wilcox RR. 2006. Graphical methods for assessing effect size: Some alternatives to Cohen’s d. Journal of Experimental Education 74: 353-67

Wilcox, R.R. (2011) Inferences about a Probabilistic Measure of Effect Size When Dealing with More Than Two Groups. Journal of Data Science, 9, 471-486.

Wilcox RR. 2012. Introduction to robust estimation and hypothesis testing. Amsterdam ; Boston: Academic Press

Wilcox RR, Keselman HJ. 2003. Modern Robust Data Analysis Methods: Measures of Central Tendency. Psychological Methods 8: 254-74

Wilcox RR, Muska J. 2010. Measuring effect size: A non-parametric analogue of omega(2). The British journal of mathematical and statistical psychology 52: 93-110