# Trimmed means

The R code for this post is on github.

Trimmed means are robust estimators of central tendency. To compute a trimmed mean, we remove a predetermined proportion of observations on each side of a distribution, and average the remaining observations. If you think you’re not familiar with trimmed means, you already know one famous member of this family: the median. Indeed, the median is an extreme trimmed mean, in which all observations are removed except one or two.

Using trimmed means confers two advantages:

• trimmed means provide a better estimation of the location of the bulk of the observations than the mean when sampling from asymmetric distributions;
• the standard error of the trimmed mean is less affected by outliers and asymmetry than the mean, so that tests using trimmed means can have more power than tests using the mean.

Important point: if we use a trimmed mean in an inferential test (see below), we make inferences about the population trimmed mean, not the population mean. The same is true for the median or any other measure of central tendency. So each robust estimator is a tool to answer a specific question, and this is why different estimators can return different answers…

Here is how we compute a 20% trimmed mean.

Let’s consider a sample of 20 observations:

39 92 75 61 45 87 59 51 87 12  8 93 74 16 32 39 87 12 47 50

First we sort them:

8 12 12 16 32 39 39 45 47 50 51 59 61 74 75 87 87 87 92 93

The number of observations to remove is `floor(0.2 * 20) = 4`. So we trim 4 observations from each end:

(8 12 12 16) 32 39 39 45 47 50 51 59 61 74 75 87 (87 87 92 93)

And we take the mean of the remaining observations, such that our 20% trimmed mean is `mean(c(32,39,39,45,47,50,51,59,61,74,75,87)) = 54.92`.
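The post’s code is in R, but the steps above are easy to sketch in any language; here is a minimal Python version (the function name is mine):

```python
import math

def trimmed_mean(x, tr=0.2):
    """Sort, trim floor(tr * n) observations from each end, average the rest."""
    xs = sorted(x)
    g = math.floor(tr * len(xs))   # number of observations trimmed on each side
    kept = xs[g:len(xs) - g]       # remaining observations
    return sum(kept) / len(kept)

sample = [39, 92, 75, 61, 45, 87, 59, 51, 87, 12,
          8, 93, 74, 16, 32, 39, 87, 12, 47, 50]
print(round(trimmed_mean(sample, 0.2), 2))  # 54.92
```

With `tr=0` this reduces to the ordinary mean, and as `tr` approaches 0.5 it approaches the median.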

Let’s illustrate the trimming process with a normal distribution and 20% trimming. Trimming gets rid of the tails of the distribution, to focus on the bulk of the observations.

This behaviour is particularly useful when dealing with skewed distributions. In a skewed distribution (for instance an F distribution), there is more variability on the right side, which appears stretched compared to the left side. Because we trim the same amount on each side, trimming removes a longer chunk of the distribution on the right side than on the left side. As a consequence, the mean of the remaining points is more representative of the location of the bulk of the observations.

This can be seen in the following examples. Panel A shows the kernel density estimate of 100 observations sampled from a standard normal distribution (MCT stands for measure of central tendency). By chance, the distribution is not perfectly symmetric, but the mean, 20% trimmed mean and median give very similar estimates, as expected. In panel B, however, the sample is from a lognormal distribution. Because of the asymmetry of the distribution, the mean is dragged towards the right side of the distribution, away from the bulk of the observations. The 20% trimmed mean is to the left of the mean, and the median further to the left, closer to the location of most observations. Thus, for asymmetric distributions, trimmed means provide more accurate information about central tendency than the mean.

**Q: “By trimming, don’t we lose information?”**

I have heard that question over and over. The answer depends on your goal: statistical methods are only tools to answer specific questions. I have never met anyone with a true interest in the mean: the mean is always used, implicitly or explicitly, as a tool to indicate the location of the bulk of the observations. Thus, if your goal is to estimate central tendency, then no, trimming doesn’t discard information; it actually increases the quality of the information about central tendency.

I have also heard this criticism: “I’m interested in the tails of the distributions and that’s why I use the mean; trimming gets rid of them”. Tails certainly have interesting stories to tell, but the mean is absolutely not the tool to study them, because it mingles all observations into one value, so we have no way to tell why means differ among samples. If you want to study entire distributions, there are fantastic graphical tools available (Rousselet, Pernet & Wilcox 2017).

## Implementation

Base R has trimmed means built in:

`mean` can be used by setting the `trim` argument to the desired proportion of trimming:

`mean(x, trim = 0.2)` gives a 20% trimmed mean.

In Matlab, try the `tm` function available here.

In Python, try the `scipy.stats.tmean` function. More Python functions are listed here.
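One caveat worth noting: in scipy, `scipy.stats.trim_mean` trims a fixed proportion from each tail, like R’s `mean(x, trim = 0.2)`, whereas `scipy.stats.tmean` excludes values outside explicit limits. A quick sketch using the sample from above:

```python
import numpy as np
from scipy import stats

x = np.array([39, 92, 75, 61, 45, 87, 59, 51, 87, 12,
              8, 93, 74, 16, 32, 39, 87, 12, 47, 50])

# trims 20% of the observations from each tail, like R's mean(x, trim = 0.2)
print(stats.trim_mean(x, 0.2))           # ~54.92

# excludes values outside the limits instead of trimming by proportion
print(stats.tmean(x, limits=(20, 90)))   # mean of the observations in [20, 90]
```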

## Inferences

There are plenty of R functions using trimmed means on Rand Wilcox’s website.

We can use trimmed means instead of means in t-tests. However, the calculation of the standard error is different from the traditional t-test formula, because after trimming, the remaining observations are no longer independent. The formula for the adjusted standard error was originally proposed by Karen Yuen in 1974, and it involves winsorization. To winsorize a sample, instead of removing the extreme observations, we replace them with the most extreme remaining values. So in our example, the 20% winsorized sample is:

32 32 32 32 32 39 39 45 47 50 51 59 61 74 75 87 87 87 87 87

Taking the mean of the winsorized sample gives a winsorized mean; taking the variance of the winsorized sample gives a winsorized variance etc. I’ve never seen anyone using winsorized means, however the winsorized variance is used to compute the standard error of the trimmed mean (Yuen 1974). There is also a full mathematical explanation in Wilcox (2012).
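To make the link between winsorization and the standard error concrete, here is a Python transcription of the standard formulas (Wilcox’s R functions `winvar` and `trimse` implement the same quantities; the Python names mirror them but are my own):

```python
import math

def winsorize(x, tr=0.2):
    """Replace the g = floor(tr * n) lowest values by the next lowest remaining
    value, and the g highest by the next highest remaining value."""
    xs = sorted(x)
    g = math.floor(tr * len(xs))
    lo, hi = xs[g], xs[len(xs) - g - 1]
    return [min(max(v, lo), hi) for v in xs]

def winvar(x, tr=0.2):
    """Winsorized variance, with denominator n - 1."""
    w = winsorize(x, tr)
    m = sum(w) / len(w)
    return sum((v - m) ** 2 for v in w) / (len(w) - 1)

def trimse(x, tr=0.2):
    """Standard error of the trimmed mean (Yuen, 1974)."""
    return math.sqrt(winvar(x, tr)) / ((1 - 2 * tr) * math.sqrt(len(x)))

sample = [39, 92, 75, 61, 45, 87, 59, 51, 87, 12,
          8, 93, 74, 16, 32, 39, 87, 12, 47, 50]
print(winsorize(sample))  # the 20% winsorized sample listed above
```

With `tr=0`, `trimse` reduces to the usual standard error of the mean.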

You can use all the functions below to make inferences about means too, by setting `tr=0`. How much trimming to use is an empirical question, depending on the type of distributions you deal with. By default, all functions set `tr=0.2`, 20% trimming, which has been studied a lot and seems to provide a good compromise. Most functions will return an error with an alternative function suggestion if you set `tr=0.5`: the standard error calculation is inaccurate for the median and often the only satisfactory solution is to use a percentile bootstrap.

**Q: “With trimmed means, isn’t there a danger of users trying different amounts of trimming and reporting the one that gives them significant results?”**

This is indeed a possibility, but dishonesty is a property of the user, not a property of the tool. In fact, trying different amounts of trimming could be very informative about the nature of the effects. Reporting the different results, along with graphical representations, could help provide a more detailed description of the effects.

The Yuen t-test performs better than the t-test on means in many situations. For even better results, Wilcox recommends using trimmed means with a percentile-t bootstrap or a percentile bootstrap. With small amounts of trimming, the percentile-t bootstrap performs better; with at least 20% trimming, the percentile bootstrap is preferable. Details about these choices are available for instance in Wilcox (2012) and Wilcox & Rousselet (2017).
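To make the mechanics explicit, here is a bare-bones Python sketch of Yuen’s test for two independent groups, following the 1974 formulas (the winsorized sum of squares enters the squared standard error of each trimmed mean; the p value would follow from Student’s t distribution with `df` degrees of freedom):

```python
import math

def trim_stats(x, tr=0.2):
    """Trimmed mean, winsorized sum of squared deviations, effective sample size."""
    xs = sorted(x)
    n = len(xs)
    g = math.floor(tr * n)
    h = n - 2 * g                                        # observations left after trimming
    tmean = sum(xs[g:n - g]) / h
    w = [min(max(v, xs[g]), xs[n - g - 1]) for v in xs]  # winsorized sample
    wm = sum(w) / n
    ssd = sum((v - wm) ** 2 for v in w)                  # winsorized sum of squares
    return tmean, ssd, h

def yuen(x, y, tr=0.2):
    """Yuen's test statistic and Welch-type degrees of freedom (Yuen, 1974)."""
    tx, ssdx, hx = trim_stats(x, tr)
    ty, ssdy, hy = trim_stats(y, tr)
    dx = ssdx / (hx * (hx - 1))  # squared SE of the trimmed mean, group 1
    dy = ssdy / (hy * (hy - 1))
    t = (tx - ty) / math.sqrt(dx + dy)
    df = (dx + dy) ** 2 / (dx ** 2 / (hx - 1) + dy ** 2 / (hy - 1))
    return t, df

g1 = [39, 92, 75, 61, 45, 87, 59, 51, 87, 12]
g2 = [8, 93, 74, 16, 32, 39, 87, 12, 47, 50]
t, df = yuen(g1, g2)
```

If I remember the API correctly, recent versions of scipy expose an equivalent via `scipy.stats.ttest_ind(x, y, trim=0.2)`.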

### Yuen’s approach

1-alpha confidence interval for the trimmed mean: `trimci(x,tr=.2,alpha=0.05)`

Yuen t-test for 2 independent groups: `yuen(x,y,tr=.2)`

Yuen t-test for 2 dependent groups: `yuend(x,y,tr=.2)`

### Bootstrap percentile-t method

One group: `trimcibt(x,tr=.2,alpha=.05,nboot=599)`

Two independent groups: `yuenbt(x,y,tr=.2,alpha=.05,nboot=599)`

Two dependent groups: `ydbt(x,y,tr=.2,alpha=.05,nboot=599)`

### Percentile bootstrap approach

One group: `trimpb(x,tr=.2,alpha=.05,nboot=2000)`

Two independent groups: `trimpb2(x,y,tr=.2,alpha=.05,nboot=2000)`

Two dependent groups: `dtrimpb(x,y=NULL,alpha=.05,con=0,est=mean)`
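The logic behind `trimpb` can be sketched in a few lines of Python (a hypothetical analogue for illustration, not Wilcox’s implementation):

```python
import math
import random

def trimmed_mean(x, tr=0.2):
    xs = sorted(x)
    g = math.floor(tr * len(xs))
    kept = xs[g:len(xs) - g]
    return sum(kept) / len(kept)

def trimpb(x, tr=0.2, alpha=0.05, nboot=2000, seed=1):
    """Percentile bootstrap confidence interval for the trimmed mean:
    resample with replacement, recompute the trimmed mean each time,
    and read the CI off the percentiles of the bootstrap distribution."""
    rng = random.Random(seed)
    boot = sorted(trimmed_mean(rng.choices(x, k=len(x)), tr) for _ in range(nboot))
    lo = boot[math.floor(alpha / 2 * nboot)]
    hi = boot[math.floor((1 - alpha / 2) * nboot) - 1]
    return lo, hi

sample = [39, 92, 75, 61, 45, 87, 59, 51, 87, 12,
          8, 93, 74, 16, 32, 39, 87, 12, 47, 50]
lo, hi = trimpb(sample)
```

Because no standard error is estimated, the same recipe works with `tr=0.5`, which is why the percentile bootstrap is the recommended route for the median.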

### Matlab

There are some Matlab functions here:

`tm` – trimmed mean

`yuen` – t-test for 2 independent groups

`yuend` – t-test for 2 dependent groups

`winvar` – winsorized variance

`winsample` – winsorized sample

`wincov` – winsorized covariance

These functions can be used with several estimators, including trimmed means:

`pb2dg` – percentile bootstrap for 2 dependent groups

`pb2ig` – percentile bootstrap for 2 independent groups

`pbci` – percentile bootstrap for 1 group

Several functions for trimming large arrays and computing confidence intervals are available in the LIMO EEG toolbox.

## References

Karen K. Yuen. The two-sample trimmed t for unequal population variances, Biometrika, Volume 61, Issue 1, 1 April 1974, Pages 165–170, https://doi.org/10.1093/biomet/61.1.165

Rousselet, Guillaume; Pernet, Cyril; Wilcox, Rand (2017): Beyond differences in means: robust graphical methods to compare two groups in neuroscience. figshare. https://doi.org/10.6084/m9.figshare.4055970.v7

Rand R. Wilcox, Guillaume A. Rousselet. A guide to robust statistical methods in neuroscience bioRxiv 151811; doi: https://doi.org/10.1101/151811

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

# Matlab code for the shift function: a powerful tool to compare two entire marginal distributions

Recently, I presented R code for the shift function, a powerful tool to compare two entire marginal distributions.

The Matlab code is now available on github.

`shifthd` has the same name as its R version, which was originally programmed by Rand Wilcox and first documented in 1995 (see details). It computes a shift function for independent groups, using a percentile bootstrap estimation of the SE of the quantiles to compute confidence intervals.

`shiftdhd` is the version for dependent groups.

More recently, Wilcox introduced a new version of the shift function in which a straightforward percentile bootstrap is used to compute the confidence intervals, without estimation of the SE of the quantiles. This is implemented in Matlab as `shifthd_pbci` for independent groups (equivalent to `qcomhd` in R); as `shiftdhd_pbci` for dependent groups (equivalent to `Dqcomhd` in R).

A demo file `shift_function_demo` is available here, along with the function `shift_fig` and dependencies `cmu` and `UnivarScatter`.

For instance, if we use the ozone data covered in the previous shift function post, a call to `shifthd` looks like this:

`[xd, yd, delta, deltaCI] = shifthd(control,ozone,200,1);`

The output of `shifthd`, or of any of the other three shift function variants, can be used as input to `shift_fig`:

`shift_fig(xd, yd, delta, deltaCI,control,ozone,1,5);`

This is obviously work in progress, and `shift_fig` is meant as a starting point.

Have fun exploring how your distributions differ!

And if you have any question, don’t hesitate to get in touch.

# A few simple steps to improve the description of neuroscience group results

This post is a draft of an editorial letter I’m writing for the European Journal of Neuroscience. It builds on previous posts on visualisation of behavioural and ERP data.

Update 2016-09-16: the editorial is now accepted:

Rousselet, G. A., Foxe, J. J. and Bolam, J. P. (2016), A few simple steps to improve the description of group results in neuroscience. Eur J Neurosci. Accepted Author Manuscript. doi:10.1111/ejn.13400

The final illustrations are available on Figshare: Rousselet, G.A. (2016): A few simple steps to improve the description of group results in neuroscience. figshare. https://dx.doi.org/10.6084/m9.figshare.3806487

There are many changes necessary to improve the quality of neuroscience research. Suggestions abound to increase openness, promote better experimental designs and analyses, and educate researchers about statistical inferences. These changes are necessary and will take time to implement. As part of this process, here, we would like to propose a few simple steps to improve the assessment of statistical results in neuroscience, by focusing on detailed graphical representations.

Despite a potentially sophisticated experimental design, in a typical neuroscience experiment, raw continuous data tend to undergo drastic simplifications. As a result, it is common for the main results of an article to be summarised in a few figures and a few statistical tests. Unfortunately, graphical representations in many scientific journals, including neuroscience journals, tend to hide underlying distributions, with their excessive use of line and bar graphs (Allen et al., 2012; Weissgerber et al., 2015). This is problematic because common basic summary statistics, such as the mean and standard deviation, are not robust, do not provide enough information about a distribution, and can thus give misleading impressions about a dataset, particularly for the small sample sizes we are accustomed to in neuroscience (Anscombe, 1973; Wilcox, 2012). As a consequence of poor data representation, there can be a mismatch between the outcome of statistical tests, their interpretations, and the information available in the raw data distributions.

Let’s consider a general and familiar scenario in which observations from two groups of participants are summarised using a bar graph, and compared using a t-test on means. If the p value is less than 0.05, we might conclude that we have a significant effect, with one group having larger values than the other one; if the p value is not less than 0.05, we might conclude that the two distributions do not differ. What is wrong with this description? In addition to the potentially irrational use of p values (Gigerenzer, 2004; Wagenmakers, 2007; Wetzels et al., 2011), the situation above highlights many caveats in current practices. Indeed, using bar graphs and an arbitrary p<0.05 cut-off turns a potentially rich pattern of results into a simplistic, binary outcome, in which effect sizes and individual differences are ignored. For instance, a more fruitful approach to describing a seemingly significant group effect would be to answer these questions as well:

• how many participants show an effect in the same direction as the group? It is possible to get significant group effects with very few individual participants showing a significant effect themselves. Actually, with large enough sample sizes you can pretty much guarantee significant group effects (Wagenmakers, 2007);

• how many participants show no effect, or an effect in the opposite direction as the group?

• is there a smooth continuum of effects across participants, or can we identify sub-clusters of participants who appear to behave differently from the rest?

• how large are the individual effects?

These questions can only be answered by using scatterplots or other detailed graphical representations of the results, and by reporting quantities other than the mean and standard deviation of each group. Essentially, a significant t-test is neither necessary nor sufficient to understand how two distributions differ (Wilcox, 2006). And because t-tests and ANOVAs on means are not robust (for instance to skewness and outliers), failure to reach the 0.05 cut-off should not be used to claim that distributions do not differ: first, the lack of significance (p ≥ 0.05) is not the same as evidence for the lack of effect (Kruschke, 2013); second, robust statistical tests should be considered (Wilcox, 2012); third, distributions can potentially differ in their left or right tails, but not in their central tendency, for instance when only weaker animals respond to a treatment (Doksum, 1974; Doksum & Sievers, 1976; Wilcox, 2006; Wilcox et al., 2014). Essentially, if an article reports bar graphs and non-significant statistical analyses of the mean, not much can be concluded at all. Without detailed and informative illustrations of the results, it is impossible to tell if the distributions truly do not differ.

Let’s consider the example presented in Figure 1, in which two groups of participants were tested in two conditions (2 independent x 2 dependent factor design). Panel A illustrates the results using a mean +/- SEM bar graph. An ANOVA on these data reveals a non-significant group effect, a significant main effect of condition, and a significant group x condition interaction. Follow-up paired t-tests reveal a significant condition effect in group 1, but not in group 2. These results seem well supported by the bar graph in Figure 1A. Based on this evidence, it is very common to conclude that group 1 is sensitive to the experimental manipulation, but not group 2. The discussion of the article might even pitch the results in more general terms, making claims about the brain in general.

Figure 1. Different representations of the same behavioural data. Results are in arbitrary units. A Bar graph with mean +/- SEM. B Stripcharts (1D scatterplots) of difference scores. C Stripcharts of linked observations. D Scatterplot of paired observations. The diagonal line has slope 1 and intercept 0. This figure is licensed CC-BY and available on Figshare, along with data and R code to reproduce it (Rousselet 2016a).

Although the scenario just described is very common in the literature, the conclusions are unwarranted. First, the lack of significance (p ≥ 0.05) does not necessarily provide evidence for the lack of effect (Wetzels et al., 2011; Kruschke, 2013). Second, without showing the content of the bars, no conclusion should be drawn at all. So let’s look inside the bars. Figure 1B shows the results from the two independent groups: participants in each group were tested in two conditions, so the pairwise differences are illustrated to reveal the effect sizes and their distributions across participants. The data show large individual differences and overlap between the two distributions. In group 2, except for 2 potential outliers showing large negative effects, the remaining observations are within the range observed in group 1. Six participants from group 2 have differences suggesting an effect in the same direction as group 1, two are near zero, three go in the opposite direction. So, clearly, the lack of significant difference in group 2 is not supported by the data: yes, group 2 has overall smaller differences than group 1, but if group 1 is used as a control group, then most participants in group 2 appear to have standard effects. Or so it seems, until we explore the nature of the difference scores by visualising paired observations in each group (Figure 1C). In group 1, as already observed, results in condition 2 are overall larger than in condition 1. In addition, participants with larger scores in condition 1 tend to have proportionally larger differences between conditions 1 and 2. Such a relationship seems to be absent in group 2, which suggests that the two groups differ not only in the overall sensitivity to the experimental manipulation, but that other factors could be at play in group 1, and not in group 2. Thus, the group differences might actually be much subtler than suggested by our first analyses.
The group dichotomy is easier to appreciate in Figure 1D, which shows a scatterplot of the paired observations in the two groups. In group 1, the majority of paired observations are above the unity line, demonstrating an overall group effect; there is also a positive relationship between the scores in condition 2 and the scores in condition 1. Again, no such relationship seems to be present in group 2. In particular, the two larger negative scores in group 2 are not associated with participants who scored particularly high or low in condition 1, giving us no clue as to the origin of these seemingly outlier scores.

At this stage, we’ve learnt a great deal more about our dataset using detailed graphical representations than relying only on a bar graph and an ANOVA. However, we would need many more than n = 11 participants in both groups to quantify the effects and understand how they differ across groups. We have also not exhausted all the representations that could help us make sense of the results. There is also potentially more to the data, because we haven’t considered the full distribution of single-trials/repetitions. For instance, it is very common to summarise a reaction time distribution of potentially hundreds of trials using a single number, which is then used to perform group analyses. An alternative is to study these distributions in each participant, to understand exactly how they differ between conditions. This single-participant approach would be necessary here to understand how the two groups of participants respond to the experimental manipulation.

In sum, there is much more to the data than what we could conclude from the bar graphs and the ANOVA and t-tests. Once bar graphs and their equivalents are replaced by scatterplots (or boxplots etc.) the story can get much more interesting, subtle, convincing, or the opposite… It depends what surprises the bars are holding. Showing scatterplots is the start of a discussion about the nature of the results, an invitation to go beyond the significant vs. non-significant dichotomy. For the particular results presented in Figure 1, it is rather unclear what is gained by the ANOVA at all compared to detailed graphical representations. Instead of blind statistical significance testing, it would of course be beneficial to properly model the data to make predictions (Kuhn & Johnson, 2013), and to allow integration across subsequent experiments and replication attempts – a critical step that requires Bayesian inference (Verhagen & Wagenmakers, 2014).

The problems described so far are not limited to relatively simple one dimensional data: they are present in more complex datasets as well, such as EEG and MEG time-series. For instance, it is common to see EEG and MEG evoked responses illustrated using solely the mean across participants (Figure 2A). Although common, this representation is equivalent to a bar graph without error bars/whiskers, and is therefore unacceptable. At a minimum, some measure of uncertainty should be provided, for instance so-called confidence intervals (Figure 2B). Also, because it can be difficult to mentally subtract two time-courses, it is important to illustrate the time-course of the difference as well (Figure 2C). In particular, showing the difference helps to consider all the data, not just large peaks, to avoid underestimating potentially large effects occurring before or after the main peaks. In addition, Figure 2C illustrates ERP differences for every participant – an ERP version of a scatterplot. This more detailed illustration is essential to allow readers to assess effect sizes, inter-participant differences, and ultimately to interpret significant and non-significant results. For instance, in Figure 2C, there is a non-significant negative group difference around 100 ms, and a large positive difference from 120 to 280 ms. What do they mean? The individual traces reveal a small number of participants with relatively large differences around 100 ms despite the lack of significant group effect, and all participants have a positive difference from 120 to 250 ms post-stimulus. There are also large individual differences at most time points. So Figure 2C, although certainly not the ultimate representation, offers a much richer and compelling description than the group averages on their own; Figure 2C also suggests that more detailed group analyses would be beneficial, as well as single-participant analyses (Pernet et al., 2011; Rousselet & Pernet, 2011).

Figure 2. Different representations of the same ERP data. Paired design in which the same participants saw two image categories. A Standard ERP figure showing the mean across participants for two conditions. B Mean ERPs with 95% confidence intervals. The black dots along the x-axis mark time points at which there is a significant paired t-test (p<0.05). C Time course of the ERP differences. Differences from individual participants are shown in grey. The mean difference is superimposed using a thick black curve. The thinner black curves mark the mean’s 95% confidence interval. This figure is licensed CC-BY and available on Figshare, along with data and Matlab code to reproduce it (Rousselet 2016b).

To conclude, we urge authors, reviewers and editors to promote and implement these guidelines to achieve higher standards in reporting neuroscience research:

• as much as possible, do not use line and bar graphs; use scatterplots instead, or, if you have large sample sizes, histograms, kernel density plots, or boxplots;

• for paired designs, show distributions of pairwise differences, so that readers can assess how many comparisons go in the same direction as the group, their size, and their variability; this recommendation also applies to brain imaging data, for instance MEEG and fMRI BOLD time-courses;

• report how many participants show an effect in the same direction as the group;

• only draw conclusions about what was assessed: for instance, if you perform a t-test on means, you should only conclude about differences in means, not about group differences in general;

• don’t use a star system to dichotomise p values: p values do not measure effect sizes or the amount of evidence against or in favour of the null hypothesis (Wagenmakers, 2007);

• don’t agonise over p values: focus on detailed graphical representations and robust effect sizes instead (Wilcox, 2006; Wickham, 2009; Allen et al., 2012; Wilcox, 2012; Weissgerber et al., 2015);

• consider Bayesian statistics, to get the tools to align statistical and scientific reasoning (Cohen, 1994; Goodman, 1999; 2016).

Finally, we cannot ignore that using detailed illustrations for potentially complex designs, or designs involving many group comparisons, is not straightforward: research in that direction, including the creation of open-access toolboxes, is of great value to the community, and should be encouraged by funding agencies.

## References

Allen, E.A., Erhardt, E.B. & Calhoun, V.D. (2012) Data visualization in the neurosciences: overcoming the curse of dimensionality. Neuron, 74, 603-608.

Anscombe, F.J. (1973) Graphs in Statistical Analysis. Am Stat, 27, 17-21.

Cohen, D. (1994) The earth is round (p<.05). American Psychologist, 49, 997-1003.

Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Annals of Statistics, 2, 267-277.

Doksum, K.A. & Sievers, G.L. (1976) Plotting with Confidence – Graphical Comparisons of 2 Populations. Biometrika, 63, 421-434.

Gigerenzer, G. (2004) Mindless statistics. Journal of Behavioral and Experimental Economics (formerly The Journal of Socio-Economics), 33, 587-606.

Goodman, S.N. (1999) Toward evidence-based medical statistics. 1: The P value fallacy. Ann Intern Med, 130, 995-1004.

Goodman, S.N. (2016) Aligning statistical and scientific reasoning. Science, 352, 1180-1181.

Kruschke, J.K. (2013) Bayesian estimation supersedes the t test. J Exp Psychol Gen, 142, 573-603.

Kuhn, M. & Johnson, K. (2013) Applied predictive modeling. Springer, New York.

Pernet, C.R., Sajda, P. & Rousselet, G.A. (2011) Single-trial analyses: why bother? Frontiers in psychology, 2, doi: 10.3389/fpsyg.2011.00322.

Rousselet, G.A. & Pernet, C.R. (2011) Quantifying the Time Course of Visual Object Processing Using ERPs: It’s Time to Up the Game. Front Psychol, 2, 107.

Rousselet, G. (2016a). Different representations of the same behavioural data. figshare.
https://dx.doi.org/10.6084/m9.figshare.3504539

Rousselet, G. (2016b). Different representations of the same ERP data. figshare.
https://dx.doi.org/10.6084/m9.figshare.3504566

Verhagen, J. & Wagenmakers, E.J. (2014) Bayesian tests to quantify the result of a replication attempt. J Exp Psychol Gen, 143, 1457-1475.

Wagenmakers, E.J. (2007) A practical solution to the pervasive problems of p values. Psychonomic bulletin & review, 14, 779-804.

Weissgerber, T.L., Milic, N.M., Winham, S.J. & Garovic, V.D. (2015) Beyond bar and line graphs: time for a new data presentation paradigm. PLoS Biol, 13, e1002128.

Wetzels, R., Matzke, D., Lee, M.D., Rouder, J.N., Iverson, G.J. & Wagenmakers, E.J. (2011) Statistical Evidence in Experimental Psychology: An Empirical Comparison Using 855 t Tests. Perspectives on Psychological Science, 6, 291-298.

Wickham, H. (2009) ggplot2 : elegant graphics for data analysis. Springer, New York ; London.

Wilcox, R.R. (2006) Graphical methods for assessing effect size: Some alternatives to Cohen’s d. Journal of Experimental Education, 74, 353-367.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Wilcox, R.R., Erceg-Hurn, D.M., Clark, F. & Carlson, M. (2014) Comparing two independent groups via the lower and upper quantiles. J Stat Comput Sim, 84, 1543-1551.

# How to quantify typical differences between distributions

In this post, I describe two complementary lines of enquiry for group comparisons:

(1) How do typical levels compare between groups?

(2.1) For independent groups: what is the typical difference between randomly selected members of the two groups?

(2.2) For dependent groups: what is the typical pairwise difference?

These two questions can be answered by exploring entire distributions, not just one measure of central tendency.

The R code for this post is available on github, and is based on Rand Wilcox’s WRS R package, with extra visualisation functions written using `ggplot2`. I will describe Matlab code in another post.

## Independent groups

When comparing two independent groups, the typical approach consists in comparing the marginal distributions using a proxy: each distribution is summarised using one value, usually the non-robust mean. The difference between means is then normalised by some measure of variability – usually involving the non-robust variance, in which case we get the usual t-test. There is of course no reason to use only the mean as a measure of central tendency: robust alternatives such as trimmed means and M-estimators are more appropriate in many situations (Wilcox, 2012a). However, whether we compare the means or the medians or the 20% trimmed means of two groups, we focus on one question:

“How does the typical level/participant in one group compare to the typical level/participant in the other group?” Q1

There is no reason to limit our questioning of the data to the average Joe in each distribution: to go beyond differences in central tendency, we can perform systematic group comparisons using shift functions. Nevertheless, shift functions are still based on a comparison of the two marginal distributions, even if a more complete one.

An interesting alternative approach consists in asking:

“What is the typical difference between any member of group 1 and any member of group 2?” Q2

This approach involves computing all the pairwise differences between groups, as covered previously.
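With numpy, the full matrix of pairwise differences is a single call (a sketch using simulated data; variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(42)
group1 = rng.normal(0, 1, 50)   # narrow distribution
group2 = rng.normal(0, 3, 50)   # same centre, larger spread

# every member of group 1 minus every member of group 2: 50 x 50 = 2,500 differences
diffs = np.subtract.outer(group1, group2).ravel()
typical_diff = np.median(diffs)  # the "typical difference" addressed by Q2
```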

Let’s look at an example. Figure 1A illustrates two independent samples. The scatterplots indicate large differences in spread between the two groups, and also suggest larger differences in the right than the left tails of the distributions. The medians of the two groups appear very similar, so the two distributions do not seem to differ in central tendency. In keeping with these observations, a t-test and a Mann-Whitney-Wilcoxon test are non-significant, but a Kolmogorov-Smirnov test is.

Figure 1. Independent groups: non-uniform shift. A Stripcharts of marginal distributions. Vertical lines mark the deciles, with a thick line for the median. B Kernel density representation of the distribution of difference scores. Vertical lines mark the deciles, with a thick line for the median. C Shift function. Group 1 – group 2 is plotted along the y-axis for each decile (white disks), as a function of group 1 deciles. For each decile difference, the vertical line indicates its 95% bootstrap confidence interval. When a confidence interval does not include zero, the difference is considered significant in a frequentist sense. The 95% confidence intervals are controlled for multiple comparisons. D Difference asymmetry plot with 95% confidence intervals. The family-wise error is controlled by adjusting the critical p values using Hochberg’s method; the confidence intervals are not adjusted.

This discrepancy between tests highlights an important point: if a t-test is not significant, one cannot conclude that the two distributions do not differ. A shift function helps us understand how the two distributions differ (Figure 1C): the overall profile corresponds to two centred distributions that differ in spread; for each decile, we can estimate by how much they differ, and with what uncertainty; finally, the differences appear asymmetric, with larger differences in the right tails.

Is this the end of the story? No, because so far we have only considered Q1, how the two marginal distributions compare. We can get a different but complementary perspective by considering Q2, the typical difference between any member of group 1 and any member of group 2. To address Q2, we compute all the pairwise differences between members of the two groups. In this case each group has n=50, so we end up with 2,500 differences. Figure 1B shows a kernel density representation of these differences. So what does the typical difference look like? The median of the differences is very near zero, so it seems that, on average, if we randomly select one observation from each group, they will differ very little. However, the differences can be quite substantial, and with real data we would need to put these differences in context, to understand how large they are, and their physiological/psychological interpretation. The differences are also asymmetrically distributed, with negative skewness: negative scores extend to -10, whereas positive scores don’t even reach +5. This asymmetry relates to our earlier observation of asymmetric differences in the shift function.

Recently, Wilcox (2012b) suggested a new approach to quantify asymmetries in difference distributions. To understand his approach, we first need to consider how difference scores are usually characterised. It helps to remember that for continuous distributions, the Mann-Whitney-Wilcoxon U statistic is sum(X>Y) across all pairwise comparisons, i.e. the number of times observations in group X are larger than observations in group Y. This calculation requires computing all pairwise differences between X and Y, and then counting the number of positive differences. So the MWW test assesses the hypothesis that P(X>Y) = 0.5. Essentially, the MWW test is a non-parametric test of the hypothesis that the distributions are identical. The MWW test does not compare the medians of the marginal distributions, as often stated; it also estimates the wrong standard error (Cliff, 1996). A more powerful alternative is Cliff’s method, which uses delta = P(X>Y) – P(X<Y) as a measure of effect size. As expected, in our current example Cliff’s delta is not significant, because the difference distribution has a median very near zero.
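Cliff's delta is straightforward to estimate from the pairwise comparisons. A minimal Python sketch of the point estimate (the inferential machinery, with its corrected standard error, is in Wilcox's WRS package):

```python
import numpy as np

def cliffs_delta(x, y):
    """Cliff's delta: P(X > Y) - P(X < Y), estimated from all
    pairwise comparisons between the two groups."""
    x, y = np.asarray(x)[:, None], np.asarray(y)[None, :]
    gt = np.mean(x > y)  # estimate of P(X > Y)
    lt = np.mean(x < y)  # estimate of P(X < Y)
    return gt - lt

print(cliffs_delta([4, 5, 6], [1, 2, 3]))  # 1.0: complete separation
print(cliffs_delta([1, 2, 3], [1, 2, 3]))  # 0.0: identical samples
```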

Wilcox’s approach is an extension of the MWW test: the idea is to get a sense of the asymmetry of the difference distribution by computing sums of quantiles, q + (1-q), for various quantiles estimated using the Harrell-Davis estimator. A percentile bootstrap technique is used to derive confidence intervals. Figure 1D shows the resulting difference asymmetry plot (Wilcox has not given a clear name to that new function, so I made one up). In this plot, 0.05 stands for the sum of quantile 0.05 + quantile 0.95; 0.10 stands for the sum of quantile 0.10 + quantile 0.90; and so on… The approach is not limited to these quantiles, so sparser or denser functions could be tested too. Figure 1D reveals negative sums of the extreme quantiles (0.05 + 0.95), and progressively smaller sums, converging to zero, as we get closer to the centre of the distribution. So the q+(1-q) plot suggests that the two groups differ, with maximum differences in the tails, and no significant differences in central tendency. Contrary to the shift function, the q+(1-q) plot lets us conclude that the difference distribution is asymmetric, based on the 95% confidence intervals. Other alpha levels can be assessed too.
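The two building blocks can be sketched as follows (a simplified Python illustration, with function names of my own; `hd` mimics the Harrell-Davis estimator and `quantile_sums` the q + (1-q) idea — the post itself relies on Wilcox's R implementations, which add the bootstrap confidence intervals):

```python
import numpy as np
from scipy.stats import beta

def hd(x, q):
    """Harrell-Davis quantile estimator: a Beta-weighted mean of the
    order statistics, rather than a single order statistic."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    a, b = (n + 1) * q, (n + 1) * (1 - q)
    i = np.arange(1, n + 1)
    w = beta.cdf(i / n, a, b) - beta.cdf((i - 1) / n, a, b)
    return np.sum(w * x)

def quantile_sums(diff, quantiles=(0.05, 0.10, 0.15, 0.20, 0.25)):
    """Sums hd(q) + hd(1-q) of a difference distribution: each sum is
    close to zero when the distribution is symmetric about zero."""
    return [hd(diff, q) + hd(diff, 1 - q) for q in quantiles]
```

Negative sums in the extreme quantiles, as in Figure 1D, signal the asymmetry discussed above.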

In the case of two random samples from a normal population, one shifted by a constant compared to the other, the shift function and the difference asymmetry function should both be about flat, as illustrated in Figure 2. In this case, because of random sampling and limited sample size, the two approaches provide different perspectives on the results: the shift function suggests a uniform shift, but fails to reject for the three highest deciles; the difference asymmetry function more strongly suggests a uniform shift, with all sums at about the same value. This shows that all estimated pairs of quantiles are asymmetric about zero, because the difference distribution is uniformly shifted away from zero.

Figure 2. Independent groups: uniform shift. Two random samples of 50 observations were generated using `rnorm`. A constant of 1 was added to group 2.

When two distributions do not differ, both the shift function and the difference asymmetry function should be about flat and centred around zero – however this is not necessarily the case, as shown in Figure 3.

Figure 3. Independent groups: no shift – example 1. Two random samples of 50 observations were generated using `rnorm`.

Figure 4 shows another example in which no shift is present, with n=100 in each group, instead of n=50 in the previous example.

Figure 4. Independent groups: no shift – example 2. Two random samples of 100 observations were generated using `rnorm`.

In practice, the asymmetry plot will often not be flat. Actually, it took me several attempts to generate two random samples associated with such flat asymmetry plots. So, before getting too excited about your results, it really pays to run a few simulations to get an idea of what random fluctuations can look like. This can’t be stressed enough: you might be looking at noise!

## Dependent groups

Wilcox & Erceg-Hurn (2012) described a difference asymmetry function for dependent groups. We’re going to apply the technique to the dataset presented in Figure 5. Panel A shows the two marginal distributions. However, we’re dealing with a paired design, and the marginal distributions alone cannot tell us how observations are linked between conditions. This association is revealed in two different ways in panels B & C, which demonstrate a striking pattern: for participants with weak scores in condition 1, differences tend to be small and centred about zero; beyond a certain level, with increasing scores in condition 1, the differences get progressively larger. Finally, panel D shows the distribution of differences, which is shifted up from zero, with only 6 out of 35 differences below zero.

At this stage, we’ve learnt a lot about our dataset – certainly much more than would be possible from current standard figures. What else do we need? Statistical tests?! I don’t think they are absolutely necessary. Certainly, providing a t-test is of no interest whatsoever if Figure 5 is provided, because it cannot provide information beyond what we already have.

Figure 5. Dependent groups: data visualisation. A Stripcharts of the two distributions. Horizontal lines mark the deciles, with a thick line for the median. B Stripcharts of paired observations. Scatter was introduced along the x axis to reveal overlapping observations. C Scatterplot of paired observations. The diagonal black reference line of no effect has slope one and intercept zero. The dashed grey lines mark the quartiles of the two conditions. In panel C, it would also be useful to plot the pairwise differences as a function of condition 1 results. D Stripchart of difference scores. Horizontal lines mark the deciles, with a thick line for the median.

Figure 6 provides quantifications and visualisations of the effects using the same layout as Figure 5. The shift function (Figure 6C) shows a non-uniform shift between the marginal distributions: the first three deciles do not differ significantly, the remaining deciles do, and there is an overall trend of growing differences as we progress towards the right tails of the distributions. The difference asymmetry function provides a different perspective. The function is positive and almost flat, demonstrating that the distribution of differences is uniformly shifted away from zero, a result that cannot be obtained by only looking at the marginal distributions. Of course, when using means, comparing the marginals or assessing the difference scores gives the same results, because the difference of the means is the same as the mean of the differences. That’s why a paired t-test is the same as a one-sample test on the pairwise differences. With robust estimators the two approaches differ: for instance, the difference between the medians of the marginals is not the same as the median of the differences.

Figure 6. Dependent groups: uniform difference shift. A Stripcharts of marginal distributions. Vertical lines mark the deciles, with a thick line for the median. B Kernel density representation of the distribution of difference scores. Horizontal lines mark the deciles, with a thick line for the median. C Shift function. D Difference asymmetry plot with 95% confidence intervals.
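This asymmetry between estimators is easy to verify numerically. A small Python check with hypothetical paired data:

```python
import numpy as np

rng = np.random.default_rng(1)  # toy paired data, skewed differences
condition1 = rng.lognormal(0, 1, 35)
condition2 = condition1 + rng.lognormal(0, 1, 35)
diff = condition2 - condition1

# The mean commutes with the difference: these two quantities are identical.
print(np.mean(condition2) - np.mean(condition1), np.mean(diff))

# The median does not: these two quantities generally differ.
print(np.median(condition2) - np.median(condition1), np.median(diff))
```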

As fancy as Figure 6 can be, it still misses an important point: nowhere do we see the relationship between condition 1 and condition 2 results, as shown in panels B & C of Figure 5. This is why detailed illustrations are absolutely necessary to make sense of even the simplest datasets.

If you want to make more inferences about the distribution of differences, as shown in Figure 6B, Figure 7 shows a complementary description of all the deciles with their 95% confidence intervals. These could be substituted with highest density intervals or credible intervals for instance.

Figure 7. Dependent groups: deciles of the difference distribution. Each disk marks a difference decile, and the horizontal green line marks its 95% percentile bootstrap confidence interval. The reference line of no effect appears as a continuous black line. The dashed black line marks the difference median.

Finally, in Figure 8 we look at an example of a non-uniform difference shift. Essentially, I took the data used in Figure 6, and multiplied the four largest differences by 1.5. Now we see that the 9th decile does not respect the linear progression suggested by the previous deciles (Figure 8, panels A & B), and the difference asymmetry function suggests an asymmetric shift of the difference distribution, with larger discrepancies between extreme quantiles.

Figure 8. Dependent groups: non-uniform difference shift. A Stripchart of difference scores. B Deciles of the difference distribution. C Difference asymmetry function.

## Conclusion

The techniques presented here provide a very useful perspective on group differences, by combining detailed illustrations and quantifications of the effects. The different techniques address different questions, so which technique to use depends on the question you want to ask. This choice should be guided by experience: to get a good sense of the behaviour of these techniques will require a lot of practice with various datasets, both real and simulated. If you follow that path, you will soon realise that classic approaches such as t-tests on means combined with bar graphs are far too limited, and can hide rich information about a dataset.

I see three important developments for the approach outlined here:

• to make it Bayesian, or at least p value free using highest density intervals;

• to extend it to multiple group comparisons (the current illustrations don’t scale up very easily);

• to extend it to ANOVA type designs with interaction terms.

## References

Cliff, N. (1996) Ordinal methods for behavioral data analysis. Erlbaum, Mahwah, N.J.

Wilcox, R.R. (2012a) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Wilcox, R.R. (2012b) Comparing Two Independent Groups Via a Quantile Generalization of the Wilcoxon-Mann-Whitney Test. Journal of Modern Applied Statistical Methods, 11, 296-302.

Wilcox, R.R. & Erceg-Hurn, D.M. (2012) Comparing two dependent groups via quantiles. J Appl Stat, 39, 2655-2664.

# the shift function: a powerful tool to compare two entire distributions

The R code for this post is available on github, and is based on Rand Wilcox’s WRS R package, with extra visualisation functions written using `ggplot2`. The R code for the 2013 percentile bootstrap version of the shift function was also covered here and here. Matlab code is described in another post.

UPDATE: The shift function and its cousin the difference asymmetry function are described in a review article with many examples. And a Bayesian shift function is now available! The hierarchical shift function provides a powerful alternative to the t-test.

In neuroscience & psychology, group comparison is usually an exercise that involves comparing two typical observations. This is most of the time achieved using a t-test on means. This standard procedure makes very strong assumptions:

• the distributions differ only in central tendency, not in other aspects;
• the typical observation in each distribution can be summarised by the mean;
• the t-test is sufficient to detect changes in location.

As we saw previously, t-tests on means are not robust. In addition, there is no reason a priori to assume that two distributions differ only in the location of the bulk of the observations. Effects can occur in the tails of the distributions too: for instance a particular intervention could have an effect only in animals with a certain hormonal level at baseline; a drug could help participants with severe symptoms, but not others with milder symptoms… Because effects are not necessarily homogeneous among participants, it is useful to have appropriate tools at hand, to determine how, and by how much, two distributions differ. Here we’re going to consider a powerful family of tools that are robust and let us compare entire distributions: shift functions.

A more systematic way to characterise how two independent distributions differ was originally proposed by Doksum (Doksum, 1974; Doksum & Sievers, 1976; Doksum, 1977): to plot the difference between the quantiles of two distributions as a function of the quantiles of one group. The original shift function approach is implemented in the functions `sband` and `wband` in Rand Wilcox’s WRS R package.
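The core idea can be sketched in a few lines (Python for illustration; `sband`, `wband` and `shifthd` in the WRS package add the crucial confidence bands, which this sketch omits):

```python
import numpy as np

def shift_function(g1, g2, quantiles=np.arange(1, 10) / 10):
    """Doksum-style shift function: group 1 quantiles minus group 2
    quantiles, expressed as a function of group 1 quantiles.
    Point estimates only -- no confidence bands in this sketch."""
    q1 = np.quantile(g1, quantiles)
    q2 = np.quantile(g2, quantiles)
    return q1, q1 - q2

rng = np.random.default_rng(0)
g1 = rng.normal(0, 1, 1000)
g2 = rng.normal(0, 1, 1000) + 1  # pure location shift of +1

deciles, decile_diffs = shift_function(g1, g2)
# With a pure location shift, every decile difference should be close to -1.
```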

In 1995, Wilcox proposed an alternative technique which has better probability coverage and potentially more power than Doksum & Sievers’ approach. Wilcox’s technique:

• uses the Harrell-Davis quantile estimator;
• computes confidence intervals of the decile differences with a bootstrap estimation of the standard error of the deciles;
• controls for multiple comparisons so that the type I error rate remains around 0.05 across the 9 confidence intervals. This means that the confidence intervals are a bit larger than what they would be if only one decile was compared, so that the long-run probability of a type I error across all 9 comparisons remains near 0.05;
• is implemented in the `shifthd` function.

Let’s start with an extreme and probably unusual example, in which two distributions differ in spread, not in location (Figure 1). In that case, any test of central tendency will fail to reject, but it would be wrong to conclude that the two distributions do not differ. In fact, a Kolmogorov-Smirnov test reveals a significant effect, and several measures of effect size would suggest non-trivial effects. However, a significant KS test just tells us that the two distributions differ, not how.

Figure 1. Two distributions that differ in spread. A Kernel density estimates for the groups. B Shift function. Group 1 – group 2 is plotted along the y-axis for each decile (white disks), as a function of group 1 deciles. For each decile difference, the vertical line indicates its 95% bootstrap confidence interval. When a confidence interval does not include zero, the difference is considered significant in a frequentist sense.

The shift function can help us understand and quantify how the two distributions differ. The shift function describes how one distribution should be re-arranged to match the other one: it estimates how and by how much one distribution must be shifted. In Figure 1, I’ve added annotations to help understand the link between the KDE in panel A and the shift function in panel B. The shift function shows the decile differences between group 1 and group 2, as a function of group 1 deciles. The deciles for each group are marked by coloured vertical lines in panel A. The first decile of group 1 is slightly under 5, which can be read in the top KDE of panel A, and on the x-axis of panel B. The first decile of group 2 is lower. As a result, the first decile difference between group 1 and group 2 is positive, as indicated by a positive value around 0.75 in panel B, as marked by an upward arrow and a `+` symbol. The same symbol appears in panel A, linking the deciles from the two groups: it shows that to match the first deciles, group 2’s first decile needs to be shifted up. Deciles 2, 3 & 4 show the same pattern, but with progressively weaker effect sizes. Decile 5 is well centred, suggesting that the two distributions do not differ in central tendency. As we move away from the median, we observe progressively larger negative differences, indicating that to match the right tails of the two groups, group 2 needs to be shifted to the left, towards smaller values – hence the negative sign.

To get a good understanding of the shift function, let’s look at its behaviour in several other clear-cut situations. First, let’s consider a situation in which two distributions differ in location (Figure 2). In that case, a t-test is significant, but again, it’s not the full story. The shift function looks like this:

Figure 2. Complete shift between two distributions.

What’s happening? All the differences between deciles are negative and around -0.45. Wilcox (2012) defines such a systematic effect as the hallmark of a completely effective method. In other words, there is a complete and seemingly uniform shift between the two distributions.

In the next example (Figure 3), only the right tails differ, which is captured by significant differences for deciles 6 to 9. This is a case described by Wilcox (2012) as involving a partially effective experimental manipulation.

Figure 3. Positive right tail shift.

Figure 4 also shows a right tail shift, this time in the negative direction. I’ve also scaled the distributions so they look a bit like reaction time distributions. It would be much more informative to use shift functions in individual participants to study how RT distributions differ between conditions, instead of summarising each distribution by its mean (sigh)!

Figure 4. Negative right tail shift.

Figure 5 shows two large samples drawn from a standard normal population. As expected, the shift function suggests that we do not have enough evidence to conclude that the two distributions differ. The shift function does look bumpy though, potentially suggesting local differences – so keep that in mind when you plug in your own data.

Figure 5. No difference?

And be careful not to over-interpret the shift function: the lack of significant differences should not be used to conclude that we have evidence for the lack of effect; indeed, failure to reject in the frequentist sense can still be associated with non-trivial evidence against the null – it depends on prior results (Wagenmakers, 2007).

So far, we’ve looked at simulated examples involving large sample sizes. We now turn to a few real-data examples.

Doksum & Sievers (1976) describe an example in which two groups of rats were kept in an environment with or without ozone for 7 days and their weight gains measured (Figure 6). The shift function suggests two results: overall, ozone reduces weight gain; ozone might promote larger weight gains in animals gaining the most weight. However, these conclusions are only tentative given the small sample size, which explains the large confidence intervals.

Figure 6. Weight gains. A Because the sample sizes are much smaller than in the previous examples, the distributions are illustrated using 1D scatterplots. The deciles are marked by grey vertical lines, with thicker lines for the 0.5 quantiles. B Shift function.

Let’s consider another example used in (Doksum, 1974; Doksum, 1977), concerning the survival time in days of 107 control guinea pigs and 61 guinea pigs treated with a heavy dose of tubercle bacilli (Figure 7). Relative to controls, the animals that died the earliest tended to live longer in the treatment group, suggesting that the treatment was beneficial to the weaker animals (decile 1). However, the treatment was harmful to animals with control survival times larger than about 200 days (deciles 4-9). Thus, this is a case where the treatment has very different effects on different animals. As noted by Doksum, the same experiment was actually performed 4 times, each time giving similar results.

Figure 7. Survival time.

## Shift function for dependent groups

All the previous examples were concerned with independent groups. There is a version of the shift function for dependent groups implemented in `shiftdhd`. We’re going to apply it to ERP onsets from an object detection task (Bieniek et al., 2015). In that study, 74 of our 120 participants were tested twice, to assess the test-retest reliability of different measurements, including onsets. Typically, test-retest assessment is performed using a correlation. However, we care about the units (ms), which a correlation would get rid of, and we had a more specific hypothesis, which a correlation cannot test; so we used a shift function (Figure 8). If you look at the distributions of onsets across participants, you will see that they are overall positively skewed, with a few participants having particularly early or late onsets. With the shift function, we wanted to test for the overall reliability of the results, but also in particular the reliability of the left and right tails: if early onsets in session 1 were due to chance, we would expect session 2 estimates to be overall larger (shifted to the right); similarly, if late onsets in session 1 were due to chance, we would expect session 2 estimates to be overall smaller (shifted to the left). The shift function does not provide enough evidence to suggest a uniform or non-uniform shift – but we would probably need many more observations to make a strong claim.

Figure 8. ERP onsets.

Because we’re dealing with a paired design, the illustration of the marginal distributions in Figure 8 is insufficient: we should illustrate the distribution of pairwise differences too, as shown in Figure 9.

Figure 9. ERP onsets with KDE of pairwise differences.

Figure 10 provides an alternative representation of the distribution of pairwise differences using a violin plot.

Figure 10. ERP onsets with violin plot of pairwise differences.

Figure 11 uses a 1D scatterplot (strip chart).

Figure 11. ERP onsets with 1D scatterplot of pairwise differences.

## Shift function for other quantiles

Although powerful, Wilcox’s 1995 technique is not perfect, because it:

• is limited to the deciles;
• can only be used with alpha = 0.05;
• does not work well with tied values.

More recently, Wilcox proposed a new version of the shift function that uses a straightforward percentile bootstrap (Wilcox & Erceg-Hurn, 2012; Wilcox et al., 2014). This new approach:

• allows tied values;
• can be applied to any quantile;
• can have more power when looking at extreme quantiles (<=0.1 or >=0.9);
• is implemented in `qcomhd` for independent groups;
• is implemented in `Dqcomhd` for dependent groups.

Examples are provided in the R script for this post.

In the percentile bootstrap version of the shift function, p values are corrected, but not the confidence intervals. For dependent variables, Wilcox & Erceg-Hurn (2012) recommend at least 30 observations to compare the .1 or .9 quantiles. To compare the quartiles, 20 observations appear to be sufficient. For independent variables, Wilcox et al. (2014) make the same recommendations made for dependent groups; in addition, to compare the .95 quantiles, they suggest at least 50 observations per group.
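The percentile bootstrap logic behind `qcomhd` can be sketched like this (a Python illustration using plain sample quantiles instead of the Harrell-Davis estimator, to keep it short; treat it as a sketch of the idea, not a substitute for `qcomhd`):

```python
import numpy as np

def pb_quantile_diff(x, y, q=0.9, nboot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the difference between the q-th
    quantiles of two independent groups."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    # Resample each group independently, nboot times.
    bx = rng.choice(x, size=(nboot, x.size), replace=True)
    by = rng.choice(y, size=(nboot, y.size), replace=True)
    boot = np.quantile(bx, q, axis=1) - np.quantile(by, q, axis=1)
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return np.quantile(x, q) - np.quantile(y, q), (lo, hi)
```

When several quantiles are compared at once, the p values would need to be adjusted for multiple comparisons, as noted above.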

## Conclusion

The shift function is a powerful tool that can help you better understand how two distributions differ, and by how much. It provides much more information than the standard t-test approach.

Although currently the shift function only applies to two groups, it can in theory be extended to more complex designs, for instance to quantify interaction effects.

Finally, it would be valuable to make a Bayesian version of the shift function, to focus on effect sizes, model the data, and integrate them with other results.

## References

Bieniek, M.M., Bennett, P.J., Sekuler, A.B. & Rousselet, G.A. (2015) A robust and representative lower bound on object processing speed in humans. The European journal of neuroscience.

Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Annals of Statistics, 2, 267-277.

Doksum, K.A. (1977) Some graphical methods in statistics. A review and some extensions. Statistica Neerlandica, 31, 53-68.

Doksum, K.A. & Sievers, G.L. (1976) Plotting with Confidence – Graphical Comparisons of 2 Populations. Biometrika, 63, 421-434.

Wagenmakers, E.J. (2007) A practical solution to the pervasive problems of p values. Psychonomic bulletin & review, 14, 779-804.

Wilcox, R.R. (1995) Comparing Two Independent Groups Via Multiple Quantiles. Journal of the Royal Statistical Society. Series D (The Statistician), 44, 91-99.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, Amsterdam; Boston.

Wilcox, R.R. & Erceg-Hurn, D.M. (2012) Comparing two dependent groups via quantiles. J Appl Stat, 39, 2655-2664.

Wilcox, R.R., Erceg-Hurn, D.M., Clark, F. & Carlson, M. (2014) Comparing two independent groups via the lower and upper quantiles. J Stat Comput Sim, 84, 1543-1551.

# Robust effect sizes for 2 independent groups

When I was an undergrad, I was told that beyond a certain sample size (n=30 if I recall correctly), t-tests and ANOVAs are fine. This was a lie. I wish I had been taught robust methods, and that t-tests and ANOVAs on means are only a few options among many alternatives. Indeed, t-tests and ANOVAs on means are not robust to outliers, skewness, heavy tails, and, for independent groups, differences in skewness, variance (heteroscedasticity) and combinations of these factors (Wilcox & Keselman, 2003; Wilcox, 2012). The main consequence is a lack of statistical power. For this reason, it is often advised to report a measure of effect size to determine, for instance, whether a non-significant effect (based on some arbitrary p value threshold) could be due to lack of power, or reflect a genuine lack of effect. The rationale is that an effect could be associated with a sufficiently large effect size and yet fail to trigger the arbitrary p value threshold. However, this advice is pointless, because classic measures of effect size, such as Cohen’s d, its variants, and its extensions to ANOVA, are not robust.

To illustrate the problem, let’s first consider a simple situation in which we compare 2 independent groups of 20 observations, each sampled from a normal distribution with mean = 0 and standard deviation = 1. We then add a constant of progressively larger value to one of the samples, to progressively shift it away from the other. As illustrated in Figure 1, as the difference between the two groups increases, so does Cohen’s d. The Matlab code to reproduce all the examples is available here, along with a list of matching R functions from Rand Wilcox’s toolbox.

Figure 1. Examples of Cohen’s d as a function of group differences. For simplicity, I report the absolute value of Cohen’s d, here and in subsequent figures.

We can map the relationship between group mean differences and d systematically, by running a simulation in which we repeatedly generate two random samples and progressively shift one away from the other by a small amount. We get a nice linear relationship (Figure 2).

Figure 2. Linear relationship between Cohen’s d and group mean differences.

Cohen’s d appears to behave nicely, so what’s the problem? Let’s consider another example, in which we generate 2 samples of 20 observations from a normal distribution, and shift their means by a fixed amount of 2. Then, we replace the largest observation from group 2 by progressively larger values. As we do so, the difference between the means of group 1 and group 2 increases, but Cohen’s d decreases (Figure 3).

Figure 3. Cohen’s d is not robust to outliers.

Figure 4 provides a more systematic illustration of the effect of extreme values on Cohen’s d for the case of 2 groups of 20 observations. As the group difference increases, Cohen’s d wrongly suggests progressively lower effect sizes.

Figure 4. Cohen’s d as a function of group mean differences in the presence of one outlier. There is an inverse and slightly non-linear relationship between the two variables.
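The effect is easy to reproduce. A Python sketch of the simulation, with an arbitrary outlier value:

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: mean difference over the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(3)
group1 = rng.normal(0, 1, 20)
group2 = rng.normal(2, 1, 20)
d_clean = cohens_d(group2, group1)

# Replace the largest observation of group 2 with an extreme value:
# the mean difference grows, but the pooled SD grows faster, so d shrinks.
group2_outlier = group2.copy()
group2_outlier[np.argmax(group2_outlier)] = 20
d_outlier = cohens_d(group2_outlier, group1)

print(d_clean, d_outlier)  # d_outlier < d_clean
```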

What is going on? Remember that Cohen’s d is the difference between the two group means divided by the pooled standard deviation. As such, neither the numerator nor the denominator is robust, so that even one unusual value can substantially alter d and lead to the wrong conclusions about effect size. In the example provided in Figure 4, d gets smaller as the mean difference increases because the denominator of d relies on a non-robust estimator of dispersion, the variance: the outlier inflates the variability estimate, which increases the denominator, and thus lowers d. The outlier also has a strong effect on the mean, which increases the numerator, and thus d. However, the outlier has a stronger effect on the variance than on the mean: this imbalance explains the overall decrease of d with increasing outlier size. I leave it as an exercise to work out the origin of the non-linearity in Figure 4 – it has to do with the differential effect of the outlier on the mean and the variance.

One could argue that the outlier value added to one of the groups could be removed, which would solve the problem. There are 3 objections to this argument:

• there are situations in which extreme values are not outliers but expected and plausible observations from a skewed or heavy tail distribution, and thus physiologically or psychologically meaningful values. In other words, what looks like an outlier in a sample of 20 observations could well look very natural in a sample of 200 observations;
• for small sample sizes, relatively small outliers could go unnoticed but still affect effect size estimation;
• outliers are not the only problem: skewness & heavy tails can affect the mean and the variance and thus d.

For instance, in some cases, two groups can differ in skewness, as illustrated in Figure 5. In the left panel, the two kernel density estimates illustrate two samples of 100 observations from a normal distribution. The two groups overlap only moderately, and Cohen’s d is high. In the right panel, group 1, with a mean of zero, is the same as in the previous panel; group 2, with a mean of 2, is almost identical to the one in the left panel, except that its largest 10% of observations were replaced with slightly larger observations. As a result, the overlap between the two distributions is the same in the two panels – yet Cohen’s d is considerably smaller in the second example.

Figure 5. Cohen’s d for normal & skewed distributions.

The point of this example is to illustrate the potential for discrepancies between a visual inspection of two distributions and Cohen’s d. Clearly, in Figure 5, a useful measure of effect size should provide the same estimates for the two examples. Fortunately, several robust alternatives have this desirable property, including Cliff’s delta, the Kolmogorov-Smirnov test statistic, Wilcox & Muska’s Q, and mutual information.

## Robust versions of Cohen’s d

Before going over the 4 robust alternatives listed above, it is useful to consider that Cohen’s d is part of a large family of estimators of effect size, which can be described as the ratio of a difference between two measures of central tendency (CT), over some measure of variability:

(CT1 – CT2) / variability

From this expression, it follows that robust effect size estimators can be derived by plugging in robust estimators of central tendency in the numerator and robust estimators of variability in the denominator. Several examples of such robust alternatives are available, for instance using trimmed means and Winsorised variances (Keselman et al. 2008; Wilcox 2012). R users might want to check these functions from Wilcox for instance:

• `akp.effect`
• `yuenv2`
• `med.effect`

There are also extensions of these quantities to the comparison of more than one group (Wilcox 2012).
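As a rough illustration of the plug-in idea, here is a sketch in Python of a 20% trimmed mean / Winsorised variance version of d. It only captures the general recipe: Wilcox’s `akp.effect` adds further details, such as a rescaling constant chosen so that the estimate matches Cohen’s d under normality.

```python
import math
import statistics

def trim_mean(x, prop=0.2):
    """Mean after removing floor(prop * n) observations from each tail."""
    xs = sorted(x)
    g = math.floor(prop * len(xs))
    return statistics.mean(xs[g:len(xs) - g])

def win_var(x, prop=0.2):
    """Winsorised variance: tail values are clamped to the nearest retained value."""
    xs = sorted(x)
    g = math.floor(prop * len(xs))
    w = [min(max(v, xs[g]), xs[-g - 1]) for v in xs]
    return statistics.variance(w)

def robust_d(x, y, prop=0.2):
    """Plug-in sketch: trimmed mean difference / pooled Winsorised SD."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * win_var(x, prop) +
              (ny - 1) * win_var(y, prop)) / (nx + ny - 2)
    return (trim_mean(x, prop) - trim_mean(y, prop)) / pooled ** 0.5

g1 = [float(i) for i in range(20)]
g2 = [v + 6.0 for v in g1]
g2_out = g2[:-1] + [100.0]  # largest value replaced by an extreme value

d_ref = robust_d(g2, g1)
d_out = robust_d(g2_out, g1)
print(d_ref == d_out)  # True: trimming and Winsorising absorb the outlier entirely
```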

## Robust & intuitive measures of effect sizes

In many situations, the robust effect sizes presented above can bring a great improvement over Cohen’s d and its derivatives. However, they provide only a limited perspective on the data. First, I don’t find this family of effect sizes the easiest to interpret: having to think of effects in standard deviation (or robust equivalent) units is not the most intuitive. Second, this type of effect size does not always answer the questions we’re most interested in (Cliff, 1996; Wilcox, 2006).

### The simplest measure of effect size: the difference

Fortunately, effect sizes don’t have to be expressed as the ratio difference / variability. The simplest effect size is simply a difference. For instance, when reporting that group A differs from group B, people typically report the mean of each group. It is also very useful to report the difference itself, without normalisation, but with a confidence or credible interval around it, or some other estimate of uncertainty. This simple measure of effect size can be very informative, particularly if you care about the units. It is also trivial to make it robust by using robust estimators, such as the median when dealing with reaction times and other skewed distributions.

### Probabilistic effect size and the Wilcoxon-Mann-Whitney U statistic

For two independent groups, asking by how much the central tendencies of the two groups differ is useful, but this certainly does not exhaust all the potential differences between the two groups. Another perspective relates to a probabilistic description: for instance, given two groups of observations, what is the probability that one random observation from group 1 is larger than a random observation from group 2? Given two independent variables X and Y, this probability can be defined as P(X > Y). Such a probability gives a very useful indication of the amount of overlap between the two groups, in a way that is not limited to, or dependent on, measures of central tendency. More generally, we could consider these 3 probabilities:

• P(X > Y)
• P(X = Y)
• P(X < Y)

These probabilities are worth reporting in conjunction with illustrations of the group distributions. Also, there is a direct relationship between these probabilities and the Wilcoxon-Mann-Whitney U statistic (Birnbaum, 1956; Wilcox 2006). Given sample sizes Nx and Ny:

U / NxNy = P(X > Y) + 0.5 × P(X = Y)

In the case of two strictly continuous distributions, for which ties do not occur:

U / NxNy = P(X > Y)
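These relationships are easy to check numerically. Here is a small Python sketch, estimating the three probabilities from all NxNy pairwise comparisons and recovering U from them:

```python
from itertools import product

def wmw_probabilities(x, y):
    """Estimate P(X > Y), P(X = Y) and P(X < Y) from all Nx*Ny pairs."""
    n = len(x) * len(y)
    gt = sum(xi > yi for xi, yi in product(x, y))
    eq = sum(xi == yi for xi, yi in product(x, y))
    return gt / n, eq / n, (n - gt - eq) / n

x = [4, 7, 9, 10]
y = [1, 2, 9, 11]
p_gt, p_eq, p_lt = wmw_probabilities(x, y)
u = (p_gt + 0.5 * p_eq) * len(x) * len(y)  # Mann-Whitney U for group x
print(p_gt, p_eq, p_lt, u)  # 0.5625 0.0625 0.375 9.5
```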

### Cliff’s delta

Cliff suggested using P(X > Y) and P(X < Y) to compute a new measure of effect size. He defined what is now called Cliff’s delta as:

delta = P(X > Y) – P(X < Y)

Cliff’s delta estimates the probability that a randomly selected observation from one group is larger than a randomly selected observation from another group, minus the reverse probability (Cliff, 1996). It is estimated as:

delta = (sum(x > y) – sum(x < y)) / NxNy

In this equation, each observation from one group is compared to each observation in the other group, and we count how many times the observations from one group are higher or lower than those in the other group. The difference between these two counts is then divided by the total number of comparisons, the product of the two sample sizes NxNy. This statistic ranges from 1, when all values from one group are higher than the values from the other group, to -1, when the reverse is true. Completely overlapping distributions have a Cliff’s delta of 0. Because delta is a statistic based on the ordinal properties of the data, it is unaffected by rank-preserving data transformations. Its non-parametric nature reduces the impact of extreme values and distribution shape. For instance, Cliff’s delta is not affected by the outlier or the difference in skewness in the examples from Figures 3 & 5.
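A minimal Python version of this estimator, with a toy example showing that delta is unchanged when the largest observation is replaced by an extreme value:

```python
from itertools import product

def cliffs_delta(x, y):
    """delta = P(X > Y) - P(X < Y), estimated over all Nx*Ny pairs."""
    gt = sum(xi > yi for xi, yi in product(x, y))
    lt = sum(xi < yi for xi, yi in product(x, y))
    return (gt - lt) / (len(x) * len(y))

x = [2.0, 3.0, 5.0, 9.0]
y = [1.0, 4.0, 6.0, 8.0]
d1 = cliffs_delta(x, y)
d2 = cliffs_delta(x[:-1] + [900.0], y)  # 9.0 replaced by 900.0: ranks unchanged
print(d1, d2)  # 0.0 0.0
```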

For an MEEG application, we’ve used Cliff’s delta to quantify effect sizes in single-trial ERP distributions (Bieniek et al. 2015). We also used Q, presented later on in this post, but it behaved so similarly to delta that it does not feature in the paper.

An estimate of the standard error of delta can be used to compute a confidence interval for delta. When conditions differ, the statistical test associated with delta can be more powerful than the Wilcoxon-Mann-Whitney test, which uses the wrong standard error (Cliff, 1996; Wilcox, 2006). Also, contrary to U, delta is a direct measure of effect size, with an intuitive interpretation. There are also attempts at extending delta to handle more than two groups (e.g. Wilcox, 2011). Finally, Joachim Goedhart has provided an Excel macro to compute Cliff’s delta.

Update: Cliff’s delta is also related to the later introduced “common-language effect size” – see this post from Jan Vanhove.

### All pairwise differences

Cliff’s delta is a robust and informative measure of effect size. Because it relies on probabilities, it normalises effect sizes onto a common scale useful for comparisons across experiments. However, the normalisation gets rid of the original units. So, what if the units matter? A complementary perspective to that provided by delta can be gained by considering all the pairwise differences between individual observations from the two groups (Figure 6). Such a distribution can be used to answer a very useful question: given that we randomly select one observation from each group, what is the typical difference we can expect? This can be obtained by computing, for instance, the median of the pairwise differences. An illustration of the full distribution provides a lot more information: we can see how far the bulk of the distribution lies from zero, and get a sense of how large the differences in the tails can be.

Figure 6. Illustration of all pairwise differences. Left panel: scatterplots of the two groups of observations. One observation from group 1 (in red) is compared to all the observations from group 2 (in orange). The difference between all the pairs of observations is saved, and the same process is applied to all the observations from group 1. Right panel: kernel density estimate of the distribution of all the pairwise differences between the two groups. The median of these differences is indicated by the continuous vertical line; the 1st & 3rd quartiles are indicated by the dashed vertical lines.

Something like Figure 6, in conjunction with Cliff’s delta and associated probabilities, would provide a very useful summary of the data.
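To try this on your own data, here is a minimal Python sketch of the pairwise-difference approach (the figures in this post were made in R):

```python
import statistics
from itertools import product

def pairwise_differences(x, y):
    """All Nx*Ny differences xi - yi between the two groups, sorted."""
    return sorted(xi - yi for xi, yi in product(x, y))

x = [5.0, 6.0, 8.0, 11.0]
y = [1.0, 2.0, 4.0]
diffs = pairwise_differences(x, y)
typical = statistics.median(diffs)             # typical difference for a random pair
q1, q2, q3 = statistics.quantiles(diffs, n=4)  # quartiles, as in Figure 6
print(typical)  # 4.5
```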

### When Cohen’s d & Cliff’s delta fail

Although the robust alternatives to Cohen’s d considered so far, including Cliff’s delta, can handle situations in which two conditions differ in central tendency, they fail completely to describe situations like the one in Figure 7. In this example, the two distributions are dramatically different from each other, yet Cohen’s d is exactly zero, and Cliff’s delta is very close to zero.

Figure 7. Measures of effect size for two distributions that differ in spread, not in location. Cd = Cohen’s d, delta = Cliff’s delta, MI = mutual information, KS = Kolmogorov-Smirnov test statistic, Q = Wilcox & Muska’s Q.

Here the two distributions differ in spread, not in central tendency, so it would be wise to estimate spread instead. This is indeed one possibility. But it would also be nice to have an estimator of effect size that can handle special cases like this one as well. Three estimators fit the bill, as suggested by the caption of Figure 7.

### The Kolmogorov-Smirnov statistic

It’s time to introduce a powerful all-rounder: the Kolmogorov-Smirnov test statistic. The KS test is often mentioned as a way to compare one distribution to a normal distribution, but it can also be used to compare two independent samples. In that context, the KS test statistic is defined as the maximum of the absolute differences between the empirical cumulative distribution functions (ecdf) of the two groups. As such, KS is not limited to differences in central tendency; it is also robust, independent of the shape of the distributions, and provides a measure of effect size bounded between 0 and 1. Figure 8 illustrates the statistic using the example from Figure 7. The KS statistic is quite large, suggesting correctly that the two distributions differ. More generally, because it is robust and sensitive to differences located anywhere in the distributions, the KS test is a solid candidate for a default test for two independent samples. However, the KS test is more sensitive to differences in the middle of the distributions than in the tails. To correct this problem, there is also a weighted version of the KS test which provides increased sensitivity to differences in the tails of the distributions – check out the `ks` R function from Wilcox.

Figure 8. Illustration of the KS statistic for two independent samples. The top panel shows the kernel density estimates for the two groups. The lower panel shows the matching empirical cumulative distribution functions. The thick black line marks the maximum absolute difference between the two ecdfs – the KS statistic. Figure 8 is the output of the `ksstat_fig` Matlab function written for this post.
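The computation itself is short. A minimal Python sketch, evaluating the ecdf difference at every observed value – the only points at which it can change:

```python
import bisect

def ecdf(sorted_sample, t):
    """Proportion of observations <= t."""
    return bisect.bisect_right(sorted_sample, t) / len(sorted_sample)

def ks_statistic(x, y):
    """Maximum absolute difference between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    return max(abs(ecdf(xs, t) - ecdf(ys, t)) for t in xs + ys)

# Same location, different spread: KS still detects the difference.
narrow = [-0.2, -0.1, 0.0, 0.1, 0.2]
wide = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(ks_statistic(narrow, wide))  # 0.4
```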

The KS statistic increases non-linearly as the difference in variance between two samples of 100 observations progressively increases (Figure 9). The two samples were drawn from a standard normal distribution and do not differ in mean.

Figure 9. Relationship between effect sizes and variance differences. The 3 measures of effect size illustrated here are sensitive to distribution differences other than central tendency, and are therefore better able to handle a variety of cases compared to traditional effect size estimates.

### Wilcox & Muska’s Q

Similarly to KS, the Q statistic is a non-parametric measure of effect size. It ranges from 0 to 1, with chance level at 0.5. It is the probability of correctly deciding whether a randomly selected observation from one of two groups belongs to the first group, based on the kernel density estimates of the two groups (Wilcox & Muska, 1999). Essentially, it reflects the degree of separation between two groups. Again like KS, in situations in which two distributions differ in aspects other than central tendency, Q might suggest that a difference exists, whereas methods such as Cohen’s d or Cliff’s delta would not (Figure 9).
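Wilcox & Muska’s estimator involves refinements that I won’t reproduce here; the Python sketch below only illustrates the core idea, using a fixed-bandwidth Gaussian kernel and classifying each observation to the group with the higher density estimate:

```python
import math

def kde(sample, t, bw=1.0):
    """Gaussian kernel density estimate of sample, evaluated at t."""
    z = len(sample) * bw * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((t - s) / bw) ** 2) for s in sample) / z

def q_statistic(x, y, bw=1.0):
    """Proportion of observations assigned to their own group by the two KDEs."""
    hits = sum(kde(x, t, bw) >= kde(y, t, bw) for t in x)
    hits += sum(kde(y, t, bw) > kde(x, t, bw) for t in y)
    return hits / (len(x) + len(y))

# Well-separated groups: every observation is classified correctly.
x = [0.0, 0.1, 0.2, 0.3]
y = [10.0, 10.1, 10.2, 10.3]
print(q_statistic(x, y))  # 1.0
```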

### Mutual information

In addition to the KS statistic and Q, a third estimator can be used to quantify many sorts of differences between two or more independent samples: mutual information (MI). MI is a non-parametric measure of association between distributions. As shown in Figure 9, it is sensitive to group differences in spread. MI is expressed in bits and is quite popular in neuroscience – much more so than in psychology. MI is a powerful and much more versatile quantity than any of the tools we have considered so far. To learn more about MI, check out Robin Ince’s tutorial with Matlab & Python code and examples, with special applications to brain imaging. There is also a clear illustration of MI calculation using bins in Figure S3 of Schyns et al. 2010.
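To give a flavour of the bin-based calculation, here is a small Python sketch of MI between a discrete label (e.g. condition or group) and a continuous variable; the equal-width binning is a deliberate simplification of what a careful analysis would use:

```python
import math
from collections import Counter

def mutual_information(labels, values, n_bins=4):
    """MI in bits between a discrete label and a binned continuous variable."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against all-equal values
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    n = len(values)
    p_l, p_b = Counter(labels), Counter(bins)
    p_lb = Counter(zip(labels, bins))
    return sum(c / n * math.log2((c / n) / ((p_l[l] / n) * (p_b[b] / n)))
               for (l, b), c in p_lb.items())

# Perfect separation between two conditions yields 1 bit of information.
labels = ['A'] * 4 + ['B'] * 4
values = [0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3]
mi = mutual_information(labels, values, n_bins=2)
print(round(mi, 3))  # 1.0
```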

In the lab, we use MI to quantify the relationship between stimulus variability and behaviour or brain activity (e.g. Rousselet et al. 2014). This is done using single-trial distributions in every participant. Then, at the group level, we compare distributions of MI between conditions or groups of participants. We thus use MI as a robust measure of within-participant effect size, applicable to many situations. This quantity can then be illustrated and tested across participants. This strategy is particularly fruitful to compare brain activity between groups of participants, such as younger and older participants. Cliff’s delta for instance could then be used to quantify the MI difference between groups.

## Comparisons of effect sizes

We’ve covered several useful robust measures of effect size, with different properties. So, which one should be used? In statistics, the answer to this sort of question often is “it depends”. Indeed, it depends on your needs and on the sort of data you’re dealing with. It also depends on which measure makes more sense to you. The code provided with this post will let you explore the different options using simulated data or your own data. For now, we can get a sense of the behaviour of delta, MI, KS and Q for relatively large samples of observations from a normal distribution. In Figure 10, two distributions are progressively shifted from each other.

Figure 10. Examples of effect size estimates for different distribution shifts.

Figure 11 provides a more systematic mapping of the relationship between effect size estimates and the difference between the means of two groups of 100 observations. The KS statistic and Q appear to have similar profiles, with a linear rise for small differences, before progressively reaching a plateau. In contrast, Cliff’s delta appears to be less variable and to reach a maximum earlier than KS and Q. MI differs from the other 3 quantities with its non-linear rise for small mean differences.

Figure 11. Relationship between effect sizes and mean differences.

To more clearly contrast the 4 effect sizes, all their pairwise comparisons are provided in Figure 12. From these comparisons, it seems that KS and Q are almost completely linearly related. If this is the case, then there isn’t much advantage in using Q, given that it is much slower to compute than KS. Other comparisons reveal different non-linearities between estimators. These differences would certainly be worth exploring in particular experimental contexts… But enough for this post.

Figure 12. Relationship between effect sizes.

## Final notes

That Cohen’s d and related estimators of effect size are not robust suggests that they should be abandoned in favour of robust methods. This is not to say that Cohen’s d is of no value – for instance, in the case of single-trial ERP distributions of 100s of trials, it would be appropriate (Bieniek et al. 2015). But for typical group-level analyses, I see no reason to use non-robust methods such as Cohen’s d. And defending the use of Cohen’s d and related measures for the sake of continuity in the literature, so that readers can compare values across studies, is completely misguided: non-robust measures cannot be compared, because the same value can be obtained for different amounts of overlap between distributions. For this reason, I am highly suspicious of any attempt to perform meta-analyses or to quantify effect sizes in the literature using published values, without access to the raw data. To allow true comparisons across studies, there is only one necessary and sufficient step: to share your data.

In the literature, there is a rampant misconception that statistical tests and measures of effect size are different entities. The Kolmogorov-Smirnov test and Cliff’s delta demonstrate that both aspects can be combined elegantly. Other useful measures of effect size, such as mutual information, can be used to test hypotheses by combining them with a bootstrap or permutation approach.

Which technique to use in which situation is something best worked out yourself, given your own data and extensive tests. Essentially, you want to find measures that are informative and intuitive to use, and that you can trust in the long run. The alternatives described in this post are not the only ones on the market, but they are robust, informative, intuitive, and they cover a lot of useful situations. For instance, if the fields of neuroscience and psychology were to use the Kolmogorov-Smirnov test as the default test when comparing two independent groups, I would expect a substantial reduction in the number of false negatives reported in the literature. The Kolmogorov-Smirnov test statistic is also a useful measure of effect size on its own. However, the KS test does not tell us how two distributions differ, so it greatly benefits from detailed illustrations. This comment applies to all the techniques described in this post: although useful, none of them provides a full picture of the effects, and notably, none tells us how two distributions differ. But detailed illustrations can be combined with robust estimation to compare two entire distributions.

## References

Bieniek MM, Bennett PJ, Sekuler AB, Rousselet GA. 2015. A robust and representative lower bound on object processing speed in humans. The European Journal of Neuroscience.

Birnbaum ZW. 1956. On a use of the Mann-Whitney statistic.

Cliff N. 1996. Ordinal methods for behavioral data analysis. Mahwah, NJ: Erlbaum.

Keselman HJ, Algina J, Lix LM, Wilcox RR, Deering KN. 2008. A generally robust approach for testing hypotheses and setting confidence intervals for effect sizes. Psychological Methods 13: 110-129.

Rousselet GA, Ince RA, van Rijsbergen NJ, Schyns PG. 2014. Eye coding mechanisms in early human face event-related potentials. Journal of Vision 14: 7.

Wilcox RR. 2006. Graphical methods for assessing effect size: some alternatives to Cohen’s d. Journal of Experimental Education 74: 353-367.

Wilcox RR. 2011. Inferences about a probabilistic measure of effect size when dealing with more than two groups. Journal of Data Science 9: 471-486.

Wilcox RR. 2012. Introduction to robust estimation and hypothesis testing. Amsterdam; Boston: Academic Press.

Wilcox RR, Keselman HJ. 2003. Modern robust data analysis methods: measures of central tendency. Psychological Methods 8: 254-274.

Wilcox RR, Muska J. 1999. Measuring effect size: a non-parametric analogue of omega². The British Journal of Mathematical and Statistical Psychology 52: 93-110.