Category Archives: tools

A clearer explanation of the shift function

The shift function is a powerful tool to compare two marginal distributions. It’s covered in detail in this previous post. Below is a new illustration which might help better understand the graphical representation of the shift function. The R code to generate the figure is available in the README of the rogme package.

Panel A illustrates two distributions, both n = 1000, that differ in spread. The observations in the scatterplots were jittered based on their local density, as implemented in ggforce::geom_sina.

Panel B illustrates the same data from panel A. The dark vertical lines mark the deciles of the distributions. The thicker vertical line in each distribution is the median. Between distributions, matching deciles are joined by coloured lines. If the decile difference between group 1 and group 2 is positive, the line is orange; if it is negative, the line is purple. The values of the differences for deciles 1 and 9 are indicated in the superimposed labels.

Panel C focuses on the portion of the x-axis marked by the grey shaded area at the bottom of panel B. It shows the deciles of group 1 on the x-axis – the same values that are shown for group 1 in panel B. The y-axis shows the differences between deciles: the difference is large and positive for decile 1; it then progressively decreases to reach almost zero for decile 5 (the median); it becomes progressively more negative for higher deciles. Thus, for each decile the shift function illustrates by how much one distribution needs to be shifted to match another one. In our example, we illustrate by how much we need to shift deciles from group 2 to match deciles from group 1.

More generally, a shift function shows quantile differences as a function of quantiles in one group. It estimates how and by how much two distributions differ. It is thus a powerful alternative to the traditional t-test on means, which focuses on only one, non-robust, quantity. Quantiles are robust, intuitive and informative.
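For readers who prefer code, here is a minimal base R sketch of that idea, using simulated data and the standard quantile function (the rogme package instead relies on the Harrell-Davis estimator and adds bootstrap confidence intervals):

set.seed(1)
# two hypothetical groups that differ in spread
g1 <- rnorm(1000, mean = 50, sd = 5)
g2 <- rnorm(1000, mean = 50, sd = 10)

probs <- seq(0.1, 0.9, 0.1)   # deciles
d1 <- quantile(g1, probs)     # group 1 deciles
d2 <- quantile(g2, probs)     # group 2 deciles

# shift function: decile differences as a function of group 1 deciles
plot(d1, d1 - d2, type = "b",
     xlab = "Group 1 deciles",
     ylab = "Group 1 - group 2 decile differences")
abline(h = 0, lty = 2)        # reference line of no difference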

[figure]

Matlab code for the shift function: a powerful tool to compare two entire marginal distributions

Recently, I presented R code for the shift function, a powerful tool to compare two entire marginal distributions.

The Matlab code is now available on github.

shifthd has the same name as its R version, which was originally programmed by Rand Wilcox and first documented in 1995 (see details). It computes a shift function for independent groups, using a percentile bootstrap estimation of the SE of the quantiles to compute confidence intervals.

shiftdhd is the version for dependent groups.

More recently, Wilcox introduced a new version of the shift function in which a straightforward percentile bootstrap is used to compute the confidence intervals, without estimation of the SE of the quantiles. This is implemented in Matlab as shifthd_pbci for independent groups (equivalent to qcomhd in R); as shiftdhd_pbci for dependent groups (equivalent to Dqcomhd in R).

A demo file shift_function_demo is available here, along with the function shift_fig and dependencies cmu and UnivarScatter.

For instance, if we use the ozone data covered in the previous shift function post, a call to shifthd looks like this:

[xd, yd, delta, deltaCI] = shifthd(control,ozone,200,1);

producing this figure:

[figure]

The output of shifthd, or any of the other 3 sf functions, can be used as input into shift_fig:

shift_fig(xd, yd, delta, deltaCI,control,ozone,1,5);

producing this figure:

[figure]

This is obviously work in progress, and shift_fig is meant as a starting point.

Have fun exploring how your distributions differ!

And if you have any question, don’t hesitate to get in touch.

How to quantify typical differences between distributions

In this post, I describe two complementary lines of enquiry for group comparisons:

(1) How do typical levels compare between groups?

(2.1) For independent groups: what is the typical difference between randomly selected members of the two groups?

(2.2) For dependent groups: what is the typical pairwise difference?

These two questions can be answered by exploring entire distributions, not just one measure of central tendency.


The R code for this post is available on github, and is based on Rand Wilcox’s WRS R package, with extra visualisation functions written using ggplot2. I will describe Matlab code in another post.


Independent groups

When comparing two independent groups, the typical approach consists in comparing the marginal distributions using a proxy: each distribution is summarised using one value, usually the non-robust mean. The difference between means is then normalised by some measure of variability – usually involving the non-robust variance, in which case we get the usual t-test. There is of course no reason to use only the mean as a measure of central tendency: robust alternatives such as trimmed means and M-estimators are more appropriate in many situations (Wilcox, 2012a). However, whether we compare the means or the medians or the 20% trimmed means of two groups, we focus on one question:

“How does the typical level/participant in one group compare to the typical level/participant in the other group?” Q1

There is no reason to limit our questioning of the data to the average Joe in each distribution: to go beyond differences in central tendency, we can perform systematic group comparisons using shift functions. Nevertheless, shift functions are still based on a comparison of the two marginal distributions, even if a more complete one.

An interesting alternative approach consists in asking:

“What is the typical difference between any member of group 1 and any member of group 2?” Q2

This approach involves computing all the pairwise differences between groups, as covered previously.
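As a small illustration with made-up data, all pairwise differences can be obtained in one line of base R:

set.seed(42)
g1 <- rnorm(50)                # hypothetical group 1
g2 <- rnorm(50)                # hypothetical group 2

# n1 x n2 matrix of all pairwise differences, flattened into a vector
all_diff <- as.vector(outer(g1, g2, "-"))
length(all_diff)               # 2,500 differences for two groups of n = 50
median(all_diff)               # typical difference between any two members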

Let’s look at an example. Figure 1A illustrates two independent samples. The scatterplots indicate large differences in spread between the two groups, and also suggest larger differences in the right than the left tails of the distributions. The medians of the two groups appear very similar, so the two distributions do not seem to differ in central tendency. In keeping with these observations, a t-test and a Mann-Whitney-Wilcoxon test are non-significant, but a Kolmogorov-Smirnov test is.


Figure 1. Independent groups: non-uniform shift. A Stripcharts of marginal distributions. Vertical lines mark the deciles, with a thick line for the median. B Kernel density representation of the distribution of difference scores. Vertical lines mark the deciles, with a thick line for the median. C Shift function. Group 1 – group 2 is plotted along the y-axis for each decile (white disks), as a function of group 1 deciles. For each decile difference, the vertical line indicates its 95% bootstrap confidence interval. When a confidence interval does not include zero, the difference is considered significant in a frequentist sense. The 95% confidence intervals are controlled for multiple comparisons. D Difference asymmetry plot with 95% confidence intervals. The family-wise error is controlled by adjusting the critical p values using Hochberg’s method; the confidence intervals are not adjusted.

This discrepancy between tests highlights an important point: if a t-test is not significant, one cannot conclude that the two distributions do not differ. A shift function helps us understand how the two distributions differ (Figure 1C): the overall profile corresponds to two centred distributions that differ in spread; for each decile, we can estimate by how much they differ, and with what uncertainty; finally, the differences appear asymmetric, with larger differences in the right tails.

Is this the end of the story? No, because so far we have only considered Q1, how the two marginal distributions compare. We can get a different but complementary perspective by considering Q2, the typical difference between any member of group 1 and any member of group 2. To address Q2, we compute all the pairwise differences between members of the two groups. In this case each group has n=50, so we end up with 2,500 differences. Figure 1B shows a kernel density representation of these differences. So what does the typical difference looks like? The median of the differences is very near zero, so it seems on average, if we randomly select one observation from each group, they will differ very little. However, the differences can be quite substantial, and with real data we would need to put these differences in context, to understand how large they are, and their physiological/psychological interpretation. The differences are also asymmetrically distributed, with negative skewness: negative scores extend to -10, whereas positive scores don’t even reach +5. This asymmetry relates to our earlier observation of asymmetric differences in the shift function.

Is this the end of the story? No, because so far we have only considered Q1, how the two marginal distributions compare. We can get a different but complementary perspective by considering Q2, the typical difference between any member of group 1 and any member of group 2. To address Q2, we compute all the pairwise differences between members of the two groups. In this case each group has n=50, so we end up with 2,500 differences. Figure 1B shows a kernel density representation of these differences. So what does the typical difference look like? The median of the differences is very near zero, so it seems that, on average, if we randomly select one observation from each group, they will differ very little. However, the differences can be quite substantial, and with real data we would need to put these differences in context, to understand how large they are, and their physiological/psychological interpretation. The differences are also asymmetrically distributed, with negative skewness: negative scores extend to -10, whereas positive scores don’t even reach +5. This asymmetry relates to our earlier observation of asymmetric differences in the shift function.

Recently, Wilcox (2012b) suggested a new approach to quantify asymmetries in difference distributions. To understand his approach, we first need to consider how difference scores are usually characterised. It helps to remember that for continuous distributions, the Mann-Whitney-Wilcoxon U statistic is the number of times observations in group X are larger than observations in group Y, across all pairwise comparisons: U = sum(X>Y). Computing U therefore requires computing all pairwise differences between X and Y, and then counting the number of positive differences. The MWW test thus assesses the hypothesis that P(X>Y) = 0.5; essentially, it is a non-parametric test of the hypothesis that the two distributions are identical. Contrary to what is often stated, the MWW test does not compare the medians of the marginal distributions; it also estimates the wrong standard error (Cliff, 1996). A more powerful alternative is Cliff’s delta, which uses P(X>Y) – P(X<Y) as a measure of effect size. As expected, in our current example Cliff’s delta is not significant, because the difference distribution has a median very near zero.
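The relationship between U, P(X>Y) and Cliff’s delta can be made concrete in a few lines of base R (a sketch with simulated data; Wilcox’s toolbox provides functions for inferences based on Cliff’s delta):

set.seed(7)
x <- rnorm(50)                      # hypothetical group X
y <- rnorm(50)                      # hypothetical group Y

pairwise <- outer(x, y, "-")        # all pairwise differences x - y
U <- sum(pairwise > 0)              # MWW U statistic: number of times X > Y
p_xy <- mean(pairwise > 0)          # estimate of P(X > Y)
p_yx <- mean(pairwise < 0)          # estimate of P(X < Y)
cliffs_delta <- p_xy - p_yx         # Cliff's delta effect size

Wilcox’s approach is an extension of the MWW test: the idea is to get a sense of the asymmetry of the difference distribution by computing sums of quantiles, quantile q + quantile (1-q), for various quantiles estimated using the Harrell-Davis estimator. A percentile bootstrap technique is used to derive confidence intervals. Figure 1D shows the resulting difference asymmetry plot (Wilcox has not given a clear name to this new function, so I made one up). In this plot, 0.05 stands for the sum of quantile 0.05 + quantile 0.95; 0.10 stands for the sum of quantile 0.10 + quantile 0.90; and so on. The approach is not limited to these quantiles, so sparser or denser functions could be tested too. Figure 1D reveals negative sums for the extreme quantiles (0.05 + 0.95), and progressively smaller sums, converging to zero, as we get closer to the centre of the distribution. So the q + (1-q) plot suggests that the two groups differ, with maximum differences in the tails, and no significant differences in central tendency. Contrary to the shift function, the q + (1-q) plot lets us conclude that the difference distribution is asymmetric, based on the 95% confidence intervals. Other alpha levels can be assessed too.

Here is a rough base R sketch of the quantile sums at the heart of the difference asymmetry function. For simplicity it uses the default quantile estimator instead of the Harrell-Davis estimator, and it omits the percentile bootstrap confidence intervals:

set.seed(3)
g1 <- rnorm(50)                             # hypothetical group 1
g2 <- rnorm(50, sd = 2)                     # hypothetical group 2, larger spread

all_diff <- as.vector(outer(g1, g2, "-"))   # distribution of pairwise differences

qseq <- seq(0.05, 0.40, 0.05)               # lower quantiles to consider
# sum of quantile q and quantile 1-q of the difference distribution:
# values near zero for all q suggest a difference distribution symmetric about zero
qsum <- quantile(all_diff, qseq) + quantile(all_diff, 1 - qseq)
round(qsum, 2)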

In the case of two random samples from a normal population, one shifted by a constant compared to the other, the shift function and the difference asymmetry function should be about flat, as illustrated in Figure 2. In this case, because of random sampling and limited sample size, the two approaches provide different perspectives on the results: the shift function suggests a uniform shift, but fails to reject for the three highest deciles; the difference asymmetry function more strongly suggests a uniform shift, with all sums at about the same value. This shows that all estimated pairs of quantiles are asymmetric about zero, because the difference distribution is uniformly shifted away from zero.


Figure 2. Independent groups: uniform shift. Two random samples of 50 observations were generated using rnorm. A constant of 1 was added to group 2.

When two distributions do not differ, both the shift function and the difference asymmetry function should be about flat and centred around zero – however this is not necessarily the case, as shown in Figure 3.


Figure 3. Independent groups: no shift – example 1. Two random samples of 50 observations were generated using rnorm.

Figure 4 shows another example in which no shift is present, and with n=100 in each group, instead of n=50 in the previous example.


Figure 4. Independent groups: no shift – example 2.  Two random samples of 100 observations were generated using rnorm.

In practice, the asymmetry plot will often not be flat. Actually, it took me several attempts to generate two random samples associated with such flat asymmetry plots. So, before getting too excited about your results, it really pays to run a few simulations to get an idea of what random fluctuations can look like. This can’t be stressed enough: you might be looking at noise!

Dependent groups

Wilcox & Erceg-Hurn (2012) described a difference asymmetry function for dependent groups. We’re going to apply the technique to the dataset presented in Figure 5. Panel A shows the two marginal distributions. However, because we’re dealing with a paired design, the marginal distributions alone cannot show how observations are linked between conditions. This association is revealed in two different ways in panels B & C, which demonstrate a striking pattern: for participants with weak scores in condition 1, differences tend to be small and centred about zero; beyond a certain level, with increasing scores in condition 1, the differences get progressively larger. Finally, panel D shows the distribution of differences, which is shifted up from zero, with only 6 out of 35 differences below zero.

At this stage, we’ve learnt a lot about our dataset – certainly much more than would be possible from current standard figures. What else do we need? Statistical tests?! I don’t think they are absolutely necessary. Certainly, providing a t-test is of no interest whatsoever if Figure 5 is provided, because it cannot add to the information we already have.


Figure 5. Dependent groups: data visualisation. A Stripcharts of the two distributions. Horizontal lines mark the deciles, with a thick line for the median. B Stripcharts of paired observations. Scatter was introduced along the x axis to reveal overlapping observations. C Scatterplot of paired observations. The diagonal black reference line of no effect has slope one and intercept zero. The dashed grey lines mark the quartiles of the two conditions. In panel C, it would also be useful to plot the pairwise differences as a function of condition 1 results. D Stripchart of difference scores. Horizontal lines mark the deciles, with a thick line for the median.

Figure 6 provides quantifications and visualisations of the effects using the same layout as Figure 5. The shift function (Figure 6C) shows a non-uniform shift between the marginal distributions: the first three deciles do not differ significantly, the remaining deciles do, and there is an overall trend of growing differences as we progress towards the right tails of the distributions. The difference asymmetry function provides a different perspective. The function is positive and almost flat, demonstrating that the distribution of differences is uniformly shifted away from zero, a result that cannot be obtained by only looking at the marginal distributions. Of course, when using means, comparing the marginals or assessing the difference scores gives the same result, because the difference of the means is the same as the mean of the differences. That’s why a paired t-test is the same as a one-sample test on the pairwise differences. With robust estimators the two approaches differ: for instance, the difference between the medians of the marginals is not the same as the median of the differences.
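A quick numerical check of this last point, using made-up paired data:

set.seed(11)
cond1 <- rnorm(35)
cond2 <- cond1 + rexp(35)          # paired scores with skewed differences

mean(cond1) - mean(cond2)          # difference of means...
mean(cond1 - cond2)                # ...equals the mean of the differences

median(cond1) - median(cond2)      # difference of medians...
median(cond1 - cond2)              # ...generally differs from the median of the differences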


Figure 6. Dependent groups: uniform difference shift. A Stripcharts of marginal distributions. Vertical lines mark the deciles, with a thick line for the median. B Kernel density representation of the distribution of difference scores. Horizontal lines mark the deciles, with a thick line for the median. C Shift function. D Difference asymmetry plot with 95% confidence intervals.

As fancy as Figure 6 can be, it still misses an important point: nowhere do we see the relationship between condition 1 and condition 2 results, as shown in panels B & C of Figure 5. This is why detailed illustrations are absolutely necessary to make sense of even the simplest datasets.

To make finer-grained inferences about the distribution of differences shown in Figure 6B, Figure 7 provides a complementary description of all its deciles with their 95% confidence intervals. These could be substituted with highest density intervals or credible intervals, for instance.


Figure 7. Dependent groups: deciles of the difference distribution. Each disk marks a difference decile, and the horizontal green line marks its 95% percentile bootstrap confidence interval. The reference line of no effect appears as a continuous black line. The dashed black line marks the difference median.

Finally, in Figure 8 we look at an example of a non-uniform difference shift. Essentially, I took the data used in Figure 6, and multiplied the four largest differences by 1.5. Now we see that the 9th decile does not respect the linear progression suggested by the previous deciles (Figure 8, panels A & B), and the difference asymmetry function suggests an asymmetric shift of the difference distribution, with larger discrepancies between extreme quantiles.


Figure 8. Dependent groups: non-uniform difference shift. A Stripchart of difference scores. B Deciles of the difference distribution. C Difference asymmetry function.

Conclusion

The techniques presented here provide a very useful perspective on group differences, by combining detailed illustrations and quantifications of the effects. The different techniques address different questions, so which technique to use depends on the question you want to ask. This choice should be guided by experience: to get a good sense of the behaviour of these techniques will require a lot of practice with various datasets, both real and simulated. If you follow that path, you will soon realise that classic approaches such as t-tests on means combined with bar graphs are far too limited, and can hide rich information about a dataset.

I see three important developments for the approach outlined here:

  • to make it Bayesian, or at least p value free using highest density intervals;

  • to extend it to multiple group comparisons (the current illustrations don’t scale up very easily);

  • to extend it to ANOVA type designs with interaction terms.

References

Cliff, N. (1996) Ordinal methods for behavioral data analysis. Erlbaum, Mahwah, N.J.

Wilcox, R.R. (2012a) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Wilcox, R.R. (2012b) Comparing Two Independent Groups Via a Quantile Generalization of the Wilcoxon-Mann-Whitney Test. Journal of Modern Applied Statistical Methods, 11, 296-302.

Wilcox, R.R. & Erceg-Hurn, D.M. (2012) Comparing two dependent groups via quantiles. J Appl Stat, 39, 2655-2664.

the shift function: a powerful tool to compare two entire distributions

 


The R code for this post is available on github, and is based on Rand Wilcox’s WRS R package, with extra visualisation functions written using ggplot2. The R code for the 2013 percentile bootstrap version of the shift function was also covered here and here. Matlab code is described in another post.


In neuroscience & psychology, group comparison is usually an exercise that involves comparing two typical observations. This is most of the time achieved using a t-test on means. This standard procedure makes very strong assumptions:

  • the distributions differ only in central tendency, not in other aspects;
  • the typical observation in each distribution can be summarised by the mean;
  • the t-test is sufficient to detect changes in location.

As we saw previously, t-tests on means are not robust. In addition, there is no reason a priori to assume that two distributions differ only in the location of the bulk of the observations. Effects can occur in the tails of the distributions too: for instance a particular intervention could have an effect only in animals with a certain hormonal level at baseline; a drug could help participants with severe symptoms, but not others with milder symptoms… Because effects are not necessarily homogeneous among participants, it is useful to have appropriate tools at hand, to determine how, and by how much, two distributions differ. Here we’re going to consider a powerful family of tools that are robust and let us compare entire distributions: shift functions.

A more systematic way to characterise how two independent distributions differ was originally proposed by Doksum (Doksum, 1974; Doksum & Sievers, 1976; Doksum, 1977): to plot the difference between the quantiles of two distributions as a function of the quantiles of one group. The original shift function approach is implemented in the functions sband and wband in Rand Wilcox’s WRS R package.

In 1995, Wilcox proposed an alternative technique which has better probability coverage and potentially more power than Doksum & Sievers’ approach. Wilcox’s technique:

  • uses the Harrell-Davis quantile estimator;
  • computes confidence intervals of the decile differences with a bootstrap estimation of the standard error of the deciles;
  • controls for multiple comparisons so that the type I error rate remains around 0.05 across the 9 confidence intervals. This means that the confidence intervals are a bit larger than what they would be if only one decile was compared, so that the long-run probability of a type I error across all 9 comparisons remains near 0.05;
  • is implemented in the shifthd function (see the usage sketch after this list).
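Here is a minimal usage sketch, assuming Rand Wilcox’s functions have already been sourced into R (for instance from his Rallfun file or the WRS package); the argument names are from my recollection of the documentation, so check your version:

# first source Rand Wilcox's functions, e.g. his Rallfun file or the WRS package
set.seed(21)
g1 <- rnorm(100)              # hypothetical group 1
g2 <- rnorm(100, sd = 2)      # hypothetical group 2, larger spread

# decile differences (group 1 - group 2) with simultaneous 95% confidence intervals
out <- shifthd(g1, g2, nboot = 200, plotit = TRUE)
out   # group 1 deciles, decile differences and confidence bounds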

Let’s start with an extreme and probably unusual example, in which two distributions differ in spread, not in location (Figure 1). In that case, any test of central tendency will fail to reject, but it would be wrong to conclude that the two distributions do not differ. In fact, a Kolmogorov-Smirnov test reveals a significant effect, and several measures of effect sizes would suggest non-trivial effects. However, a significant KS test just tells us that the two distributions differ, not how.


Figure 1. Two distributions that differ in spread A Kernel density estimates for the groups. B Shift function. Group 1 – group 2 is plotted along the y-axis for each decile (white disks), as a function of group 1 deciles. For each decile difference, the vertical line indicates its 95% bootstrap confidence interval. When a confidence interval does not include zero, the difference is considered significant in a frequentist sense.

The shift function can help us understand and quantify how the two distributions differ. The shift function describes how one distribution should be re-arranged to match the other one: it estimates how and by how much one distribution must be shifted. In Figure 1, I’ve added annotations to help understand the link between the KDE in panel A and the shift function in panel B. The shift function shows the decile differences between group 1 and group 2, as a function of group 1 deciles. The deciles for each group are marked by coloured vertical lines in panel A. The first decile of group 1 is slightly under 5, which can be read in the top KDE of panel A, and on the x-axis of panel B. The first decile of group 2 is lower. As a result, the first decile difference between group 1 and group 2 is positive, as indicated by a positive value around 0.75 in panel B, as marked by an upward arrow and a + symbol. The same symbol appears in panel A, linking the deciles from the two groups: it shows that to match the first deciles, group 2’s first decile needs to be shifted up. Deciles 2, 3 & 4 show the same pattern, but with progressively weaker effect sizes. Decile 5 is well centred, suggesting that the two distributions do not differ in central tendency. As we move away from the median, we observe progressively larger negative differences, indicating that to match the right tails of the two groups, group 2 needs to be shifted to the left, towards smaller values – hence the negative sign.

To get a good understanding of the shift function, let’s look at its behaviour in several other clear-cut situations. First, let’s consider a  situation in which two distributions differ in location (Figure 2). In that case, a t-test is significant, but again, it’s not the full story. The shift function looks like this:


Figure 2. Complete shift between two distributions

What’s happening? All the differences between deciles are negative and around -0.45. Wilcox (2012) defines such a systematic effect as the hallmark of a completely effective method. In other words, there is a complete and seemingly uniform shift between the two distributions.

In the next example (Figure 3), only the right tails differ, which is captured by significant differences for deciles 6 to 9. This is a case described by Wilcox (2012) as involving a partially effective experimental manipulation.


Figure 3. Positive right tail shift

Figure 4 also shows a right tail shift, this time in the negative direction. I’ve also scaled the distributions so they look a bit like reaction time distributions. It would be much more informative to use shift functions in individual participants to study how RT distributions differ between conditions, instead of summarising each distribution by its mean (sigh)!


Figure 4. Negative right tail shift

Figure 5 shows two large samples drawn from a standard normal population. As expected, the shift function suggests that we do not have enough evidence to conclude that the two distributions differ. The shift function does look bumpy though, potentially suggesting local differences – so keep that in mind when you plug in your own data.


Figure 5. No difference?

And be careful not to over-interpret the shift function: the lack of significant differences should not be used to conclude that we have evidence for the lack of effect; indeed, failure to reject in the frequentist sense can still be associated with non-trivial evidence against the null – it depends on prior results (Wagenmakers, 2007).

So far, we’ve looked at simulated examples involving large sample sizes. We now turn to a few real-data examples.

Doksum & Sievers (1976) describe an example in which two groups of rats were kept in an environment with or without ozone for 7 days and their weight gains measured (Figure 6). The shift function suggests two results: overall, ozone reduces weight gain; ozone might promote larger weight gains in animals gaining the most weight. However, these conclusions are only tentative given the small sample size, which explains the large confidence intervals.


Figure 6. Weight gains A Because the sample sizes are much smaller than in the previous examples, the distributions are illustrated using 1D scatterplots. The deciles are marked by grey vertical lines, with thicker lines marking the 0.5 quantiles (medians). B Shift function.

Let’s consider another example used in (Doksum, 1974; Doksum, 1977), concerning the survival time in days of 107 control guinea pigs and 61 guinea pigs treated with a heavy dose of tubercle bacilli (Figure 7). Relative to controls, the animals that died the earliest tended to live longer in the treatment group, suggesting that the treatment was beneficial to the weaker animals (decile 1). However, the treatment was harmful to animals with control survival times larger than about 200 days (deciles 4-9). Thus, this is a case where the treatment has very different effects on different animals. As noted by Doksum, the same experiment was actually performed 4 times, each time giving similar results.


Figure 7. Survival time

Shift function for dependent groups

All the previous examples were concerned with independent groups. There is a version of the shift function for dependent groups implemented in shiftdhd. We’re going to apply it to ERP onsets from an object detection task (Bieniek et al., 2015). In that study, 74 of our 120 participants were tested twice, to assess the test-retest reliability of different measurements, including onsets. Typically, test-retest assessment is performed using a correlation. However, we care about the units (ms), which a correlation would get rid of, and we had a more specific hypothesis, which a correlation cannot test; so we used a shift function (Figure 8). If you look at the distribution of onsets across participants, you will see that it is overall positively skewed, with a few participants showing particularly early or late onsets. With the shift function, we wanted to test for the overall reliability of the results, but also in particular the reliability of the left and right tails: if early onsets in session 1 were due to chance, we would expect session 2 estimates to be overall larger (shifted to the right); similarly, if late onsets in session 1 were due to chance, we would expect session 2 estimates to be overall smaller (shifted to the left). The shift function does not provide enough evidence to suggest a uniform or non-uniform shift – but we would probably need many more observations to make a strong claim.


Figure 8. ERP onsets

Because we’re dealing with a paired design, the illustration of the marginal distributions in Figure 8 is insufficient: we should illustrate the distribution of pairwise differences too, as shown in Figure 9.


Figure 9. ERP onsets with KDE of pairwise differences

Figure 10 provides an alternative representation of the distribution of pairwise differences using a violin plot.


Figure 10. ERP onsets with violin plot of pairwise differences

Figure 11 uses a 1D scatterplot (strip chart).


Figure 11. ERP onsets with 1D scatterplot of pairwise differences

Shift function for other quantiles

Although powerful, Wilcox’s 1995 technique is not perfect, because it:

  • is limited to the deciles;
  • can only be used with alpha = 0.05;
  • does not work well with tied values.

More recently, Wilcox proposed a new version of the shift function that uses a straightforward percentile bootstrap (Wilcox & Erceg-Hurn, 2012; Wilcox et al., 2014). This new approach:

  • allows tied values;
  • can be applied to any quantile;
  • can have more power when looking at extreme quantiles (<=0.1, or >=0.9);
  • is implemented in qcomhd for independent groups;
  • is implemented in Dqcomhd for dependent groups.

Examples are provided in the R script for this post.

In the percentile bootstrap version of the shift function, p values are corrected, but not the confidence intervals. For dependent groups, Wilcox & Erceg-Hurn (2012) recommend at least 30 observations to compare the .1 or .9 quantiles. To compare the quartiles, 20 observations appear to be sufficient. For independent groups, Wilcox et al. (2014) make the same recommendations as for dependent groups; in addition, to compare the .95 quantiles, they suggest at least 50 observations per group.
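For reference, here is a hedged usage sketch of these two functions, with simulated data (argument names are from my recollection of the WRS documentation – check the help files of your version, and source Wilcox’s functions first):

set.seed(8)
g1 <- rnorm(60)                  # hypothetical independent groups
g2 <- rnorm(60, sd = 2)

# independent groups: percentile bootstrap shift function at chosen quantiles
qcomhd(g1, g2, q = c(0.1, 0.25, 0.5, 0.75, 0.9), nboot = 2000)

cond1 <- rnorm(40)               # hypothetical paired conditions
cond2 <- cond1 + rnorm(40)

# dependent groups: same idea applied to paired observations
Dqcomhd(cond1, cond2, q = seq(0.1, 0.9, 0.1), nboot = 1000)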

Conclusion

The shift function is a powerful tool that can help you better understand how two distributions differ, and by how much. It provides much more information than the standard t-test approach.

Although currently the shift function only applies to two groups, it can in theory be extended to more complex designs, for instance to quantify interaction effects.

Finally, it would be valuable to make a Bayesian version of the shift function, to focus on effect sizes, model the data, and integrate them with other results.

References

Bieniek, M.M., Bennett, P.J., Sekuler, A.B. & Rousselet, G.A. (2015) A robust and representative lower bound on object processing speed in humans. The European journal of neuroscience.

Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Annals of Statistics, 2, 267-277.

Doksum, K.A. (1977) Some graphical methods in statistics. A review and some extensions. Statistica Neerlandica, 31, 53-68.

Doksum, K.A. & Sievers, G.L. (1976) Plotting with Confidence – Graphical Comparisons of 2 Populations. Biometrika, 63, 421-434.

Wagenmakers, E.J. (2007) A practical solution to the pervasive problems of p values. Psychonomic bulletin & review, 14, 779-804.

Wilcox, R.R. (1995) Comparing Two Independent Groups Via Multiple Quantiles. Journal of the Royal Statistical Society. Series D (The Statistician), 44, 91-99.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, Amsterdam; Boston.

Wilcox, R.R. & Erceg-Hurn, D.M. (2012) Comparing two dependent groups via quantiles. J Appl Stat, 39, 2655-2664.

Wilcox, R.R., Erceg-Hurn, D.M., Clark, F. & Carlson, M. (2014) Comparing two independent groups via the lower and upper quantiles. J Stat Comput Sim, 84, 1543-1551.

The Harrell-Davis quantile estimator

Quantiles are robust and useful descriptive statistics. They belong to the family of L-estimators, which is to say that they are based on a linear combination of order statistics. There are several ways to compute quantiles. For instance, in R, the function quantile has 9 options. In Matlab, the quantile & prctile functions offer only 1 option. Here I’d like to introduce briefly yet another option: the Harrell-Davis quantile estimator (Harrell & Davis, 1982). It is a weighted average of all the order statistics (Figure 2). And, in combination with the percentile bootstrap, it is a useful tool to derive confidence intervals of quantiles (Wilcox 2012), as we will see quickly in this post. It is also a useful tool to derive confidence intervals of the difference between quantiles of two groups, as we will see in another post. As discussed previously in the percentile bootstrap post, to make accurate confidence intervals, we need to combine an estimator with a particular confidence interval building procedure, and the right combination is not obvious: it depends on the data at hand.
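To make “weighted average of all the order statistics” concrete, here is a small base R sketch of the estimator, following the beta weights given by Harrell & Davis (1982); Wilcox’s toolbox provides an equivalent function, so treat this only as an illustration:

# Harrell-Davis estimate of the qth quantile: a weighted sum of the order statistics
hd_quantile <- function(x, q = 0.5) {
  n <- length(x)
  xs <- sort(x)                                  # order statistics
  i <- 1:n
  # weights: Beta((n+1)q, (n+1)(1-q)) probability mass over each interval [(i-1)/n, i/n]
  w <- pbeta(i / n, (n + 1) * q, (n + 1) * (1 - q)) -
       pbeta((i - 1) / n, (n + 1) * q, (n + 1) * (1 - q))
  sum(w * xs)
}

# example: Harrell-Davis estimate of the median of a small sample
hd_quantile(c(18, 19, 21, 25, 30, 42, 55, 63, 70), q = 0.5)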

Before we motor on, a quick google search suggests that there is recent work to try to improve the Harrell-Davis estimator, so this is not to say that this estimator is the best in all situations. But according to Rand Wilcox it works well in many situations, and we do use it a lot in the lab…

Let’s look at data from a paper on visual processing speed estimation (Bieniek et al. 2015). We consider ERP onsets from 120 participants aged 18 to 81.

The sorted ages are:

18 18 19 19 19 19 20 20 21 21 21 21 21 21 21 22 22 22 22 22 22 22 22 23 23 23 24 24 24 25 26 28 28 29 29 30 30 31 31 32 32 32 33 34 34 35 35 36 37 38 40 40 41 41 42 42 43 43 44 45 45 45 45 48 49 49 50 51 54 54 55 56 58 59 59 60 60 61 62 62 62 63 63 63 64 64 64 64 65 65 66 66 66 66 66 66 67 67 67 67 68 68 68 68 68 69 70 70 70 71 72 72 72 75 76 77 78 79 81 81


Figure 1. Age distribution.

The Matlab code to reproduce all the figures in this post is available on github. There is also a list of R functions from Rand Wilcox’s toolbox.

How do we compute Harrell-Davis quantiles of the age distribution? Figure 2 shows the Harrell-Davis weights for the deciles of the age distribution.


Figure 2. Decile weights.

Each decile is obtained by summing the sorted ages multiplied by the corresponding weights in Figure 2, which gives us:

21.1, 23.3, 29.7, 37.0, 45.3, 56.1, 63.3, 66.6, 70.4

For comparison, the age deciles from Matlab’s prctile function are:

21, 23, 30, 36, 45, 57, 64, 66, 70

Now, we can update the scatterplot in Figure 1 with the deciles:


Figure 3. Scatterplot + age deciles. The thick vertical black line marks the median (0.5 quantile).

We can also compute a confidence interval for a Harrell-Davis quantile. There are two ways to do that:

  • using a percentile bootstrap of the quantile (pbci approach);
  • using a percentile bootstrap estimate of the standard error of the quantile, which is then plugged into a confidence interval formula (pbse approach). Both approaches are sketched in code after this list.
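Here is a rough base R sketch of the two approaches, using the median for simplicity (the Harrell-Davis estimator could be plugged in instead); the 1.96 multiplier in the pbse version assumes a normal approximation, which is only one way to build that interval:

set.seed(1)
x <- rchisq(50, df = 10)        # made-up sample
nboot <- 2000
boot_est <- replicate(nboot, median(sample(x, replace = TRUE)))

# pbci approach: percentile bootstrap confidence interval
quantile(boot_est, probs = c(0.025, 0.975))

# pbse approach: bootstrap estimate of the standard error,
# plugged into a normal-approximation confidence interval
se <- sd(boot_est)
median(x) + c(-1.96, 1.96) * se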

Using the code available with this post, we can try the two approaches on the median:

  • pbci approach gives 45.31 [35.89, 54.73]
  • pbse approach gives 45.31 [38.49, 54.40]

The two methods return similar upper bounds, but quite different lower bounds. Because they are both based on random resampling with replacement, running the same analysis several times will also give slightly different results each time. Actually, this is one important criterion for selecting a good bootstrap confidence interval technique: despite random sampling, using the same technique many times should provide overall similar results. Another important criterion is the probability coverage: if we build a 95% confidence interval, we want that confidence interval to contain the population value we’re trying to estimate 95% of the time. That’s right, the probability attached to a confidence interval is a long-run coverage: assuming a population with a certain median, if we perform the same experiment over and over, every time drawing a sample of n observations and computing a (1-alpha)% confidence interval using the same technique, (1-alpha)% of these confidence intervals will contain the population median. So, if everything is fine (n is large enough, the number of bootstrap samples is large enough, the combination of bootstrap technique and estimator is appropriate), alpha% of the time (usually 5%), a confidence interval WILL NOT include the population parameter of interest. This implies that, given the 1,000s of neuroscience & psychology experiments performed every year, 100s of papers report confidence intervals that miss the population value they are trying to estimate – but this possibility is never considered in the articles’ conclusions…

In many situations, the long run probability coverage can actually be much lower or much higher than (1-alpha). So can we check that we’re building accurate confidence intervals, at least in the long run? For that, we’ve got to run simulations. Here is an example. First, we create a fake population, for instance with a skewed distribution, which could reflect our belief of the nature of the population we’re studying:


Figure 4. Population of 1,000,000 values with a 10 degrees of freedom chi2 distribution.

Second, we compute benchmark values, e.g. median, mean…

Third, we run simulations in which we perform fake experiments with a given sample size, and then compute confidence intervals of certain quantities. Finally, we check how often the different confidence intervals actually contain the population parameters (probability coverage):

  • pbse(hd) = 0.9530
  • pbci(hd) = 0.9473
  • pbci(median) = 0.9452
  • pbci(mean) = 0.9394

They’re all very close to 95%. However, the confidence intervals of hd created using the pbse approach tended to be larger than those created using the pbci approach. The confidence intervals for the mean missed the population mean about 1% more often than the expected 5% – that’s because they tended to be shorter than the other 3. The bootstrap estimates of the sampling distribution of hd, the median and the mean, as well as the width of the confidence intervals, can be explored using the code on github.
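As a rough illustration of what such a coverage check looks like, here is a base R sketch for the percentile bootstrap confidence interval of the median, assuming a chi-square population with 10 degrees of freedom and n = 30 observations per simulated experiment (the numbers of iterations are kept modest so it runs in a reasonable time):

set.seed(44)
pop <- rchisq(1e6, df = 10)          # simulated population
pop_median <- median(pop)            # benchmark value to recover

nexp  <- 2000                        # number of simulated experiments
nboot <- 1000                        # bootstrap samples per experiment
n     <- 30                          # observations per experiment

covered <- replicate(nexp, {
  x <- sample(pop, n)                                        # one fake experiment
  boot_med <- replicate(nboot, median(sample(x, replace = TRUE)))
  ci <- quantile(boot_med, probs = c(0.025, 0.975))          # 95% percentile bootstrap CI
  ci[1] <= pop_median && pop_median <= ci[2]                 # does the CI contain the population median?
})
mean(covered)                        # long-run probability coverage, ideally close to 0.95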

Of course, no one is ever going to run the same experiment 10,000 times! And these results assume a certain population, a certain number of observations per experiment, and a certain number of bootstrap samples. We would need a more systematic exploration of the different combinations of options to be sure the present results are not special cases.

To be clear: there is absolutely no guarantee that any particular confidence interval contains the population parameter you’re trying to estimate. So be humble, and don’t make such a big deal about your confidence intervals, especially if you have small sample sizes.

Personally, I use confidence intervals more and more to try to describe the variability in the sample at hand. For that purpose, and to avoid potential inferential problems associated with confidence intervals, I think it is more satisfactory to use highest density intervals (HDI). I will post R & Matlab functions to compute the HDI of the bootstrap quantiles on github at some stage. By reporting HDI, there are no associated p values and we minimise the temptation to cross proton streams (i.e. dichotomise a continuous variable to make a binary decision – MacCallum et al. 2002).

Finally, we consider something a bit more interesting than the age of our participants: the distribution of ERP onsets.

Here are the onsets in milliseconds:


Figure 5. Onsets.

And the deciles with their confidence intervals, which provide a very nice summary of the distribution:


Figure 6. Onset deciles with confidence intervals.

If you’re interested, I’ve also attempted a Bayesian estimation of the onset data using R and JAGS. However, I don’t know yet how to perform quantile estimation – please get in touch if you can help.

Conclusion

Now you’ve got the tools to describe a distribution in detail. There is no particular reason why we should be obsessed with the mean, especially when robust and more informative statistics are available. Next, I will show you how to compare all the deciles of two distributions using a mighty tool: the shift function. This will, of course, rely on the Harrell-Davis estimator and the bootstrap.

References

Bieniek, M.M., Bennett, P.J., Sekuler, A.B. & Rousselet, G.A. (2015) A robust and representative lower bound on object processing speed in humans. The European journal of neuroscience.

Harrell, F.E. & Davis, C.E. (1982) A new distribution-free quantile estimator. Biometrika, 69, 635-640.

MacCallum RC, Zhang S, Preacher KJ, Rucker DD. 2002. On the practice of dichotomization of quantitative variables. Psychological Methods 7: 19-40

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press.

the percentile bootstrap

 

“The bootstrap is a computer-based method for assigning measures of accuracy to statistical estimates.” Efron & Tibshirani, An introduction to the bootstrap, 1993

“The central idea is that it may sometimes be better to draw conclusions about the characteristics of a population strictly from the sample at hand, rather than by making perhaps unrealistic assumptions about the population.” Mooney & Duval, Bootstrapping, 1993

Like all bootstrap methods, the percentile bootstrap relies on a simple & intuitive idea: instead of making assumptions about the underlying distributions from which our observations could have been sampled, we use the data themselves to estimate sampling distributions. In turn, we can use these estimated sampling distributions to compute confidence intervals, estimate standard errors, estimate bias, and test hypotheses (Efron & Tibshirani, 1993; Mooney & Duval, 1993; Wilcox, 2012). The core principle used to estimate sampling distributions is resampling, a technique pioneered in the 1960s by Julian Simon (particularly inspiring is how he used dice and cards to teach resampling in statistics classes). The technique was developed & popularised by Brad Efron as the bootstrap.

Let’s consider an example, starting with this small set of 10 observations:

1.2 1.1 0.1 0.8 2.6 0.7 0.2 0.3 1.9 0.4

To take a bootstrap sample, we sample n observations with replacement. That is, given the 10 original observations above, we sample with replacement 10 observations from the 10 available. For instance, one bootstrap sample from the example above could be (sorted for convenience):

0.4 0.4 0.4 0.8 0.8 1.1 1.2 2.6 2.6 2.6

a second one:

0.1 0.3 0.4 0.8 1.1 1.2 1.2 1.9 1.9 1.9

a third one:

0.1 0.4 0.7 0.7 1.1 1.1 1.1 1.1 1.9 2.6

etc.

As you can see, in some bootstrap samples, certain observations were sampled once, others more than once, and yet others not at all. The resampling process is akin to running many experiments.
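In R, taking a bootstrap sample is a one-liner; here is a quick sketch using the 10 observations above:

x <- c(1.2, 1.1, 0.1, 0.8, 2.6, 0.7, 0.2, 0.3, 1.9, 0.4)

set.seed(2)
sort(sample(x, replace = TRUE))   # one bootstrap sample (sorted for convenience)
sort(sample(x, replace = TRUE))   # another one, generally different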


Figure 1. Bootstrap philosophy.

Essentially, we are doing fake experiments using only the observations from our sample. And for each of these fake experiments, or bootstrap samples, we can compute any estimate of interest, for instance the median. Because of random sampling, we get different medians from different draws, with some values more likely than others. After repeating the process above many times, we get a distribution of bootstrap estimates, let’s say 1,000 bootstrap estimates of the sample median. That distribution of bootstrap estimates is a data-driven estimate of the sampling distribution of the sample median. Similarly, we can use resampling to estimate the sampling distribution of any statistic, without requiring any analytical formula. This is the major appeal of the bootstrap.

Let’s consider another example, using data from figure 5 of Harvey Motulsky’s 2014 article. We’re going to reproduce his very useful figure and add a 95% percentile bootstrap confidence interval. The data and Matlab code + pointers to R code are available on github. The file pb_demo.m will walk you through the different steps of bootstrap estimation, and can be used to recreate the figures from the rest of this post.

With the bootstrap, we estimate how likely we are, given the data, to obtain medians of different values. In other words, we estimate the sampling distribution of the sample median. Here is an example of a distribution of 1,000 bootstrap medians.


Figure 2. Kernel density distribution of the percentile bootstrap distribution of the sample median.

The distribution is skewed and rather rough, because of the particular data we used and the median estimator of central tendency. The Matlab code lets you estimate other quantities, so for instance using the mean as a measure of central tendency would produce a much smoother and more symmetric distribution. This is an essential feature of the bootstrap: it will suggest sampling distributions given the data at hand and a particular estimator, without assumptions about the underlying distribution. Thus, bootstrap sampling distributions can take many unusual shapes.

The interval, in the middle of the bootstrap distribution, that contains 95% of medians constitutes a percentile bootstrap confidence interval of the median.
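Continuing the base R sketch from above, and repeating the resampling 1,000 times, the interval is simply the central 95% of the bootstrap medians (I use the quantile function for brevity; other conventions exist for picking the bounds):

x <- c(1.2, 1.1, 0.1, 0.8, 2.6, 0.7, 0.2, 0.3, 1.9, 0.4)   # the small sample from above

set.seed(5)
nboot <- 1000
boot_medians <- replicate(nboot, median(sample(x, replace = TRUE)))

# 95% percentile bootstrap confidence interval of the median
quantile(boot_medians, probs = c(0.025, 0.975))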


Figure 3. Percentile bootstrap confidence interval of the median. CI = confidence interval.

Because the bootstrap distribution above is skewed, it might be more informative to report a highest-density interval – a topic for another post.

To test hypotheses, we can reject a point hypothesis if it is not included in the 95% confidence interval (a p value can also be obtained – see online code). Instead of testing a point hypothesis, or in addition, it can be informative to report the bootstrap distribution in a paper, to illustrate likely sample estimates given the data.

Now that we’ve got a 95% percentile bootstrap confidence interval, how do we know that it is correct? In particular, how many bootstrap samples do we need? The answer to this question depends on your goal. One goal might be to achieve stable results: if you repeatedly compute a confidence interval using the same data and the same bootstrap technique, you should obtain very similar confidence intervals. Going back to our example, if we take a sub-sample of the data, and compute many confidence intervals of the median, we sometimes get very different results. The figure below illustrates 7 confidence intervals of the median using the same small dataset. The upper boundaries of the different confidence intervals vary far too much:


Figure 4. Repeated calculations of the percentile bootstrap confidence interval of the median for the same dataset.

The variability is due in part to the median estimator, which introduces strong non-linearities. This point is better illustrated by looking at 1,000 sorted bootstrap median estimates:


Figure 5. Sorted bootstrap median estimates.

If we take another series of 1,000 bootstrap samples, the non-linearities will appear at slightly different locations, which will affect confidence interval boundaries. In that particular case, one way to solve the variability problem is to increase the number of bootstrap samples – for instance using 10,000 samples produces much more stable confidence intervals (see code). Using more observations also improves matters significantly.

If we get back to the question of the number of bootstrap samples needed, another goal is to achieve accurate probability coverage. That is, if you build a 95% confidence interval, you want the interval to contain the population value 95% of the time in the long run. Concretely, if you repeat the same experiment over and over, and for each experiment you build a 95% confidence interval, 95% of these intervals should contain the population value you are trying to estimate if the sample size is large enough. This can be achieved by using a conjunction of 2 techniques: a technique to form the confidence interval (for instance a percentile bootstrap), and a technique to estimate a particular quantity (for instance the median to estimate the central tendency of the distribution). The only way to find out which combination of techniques works is to run simulations covering a lot of hypothetical scenarios – this is what statisticians do for a living, and this is why every time you ask one of them what you should do with your data, the answer will inevitably be “it depends”. And it depends on the shape of the distributions we are sampling from and the number of observations available in a typical experiment in your field. Needless to say, the best approach to use in a particular case is not obvious: there is no one-size-fits-all technique to build confidence intervals, so any sweeping recommendation should be regarded suspiciously.

The percentile bootstrap works very well, and in certain situations it is the only (frequentist) technique known to perform satisfactorily for building confidence intervals of, or comparing, quantities such as medians and other quantiles, trimmed means, M-estimators, regression slope estimates and correlation coefficients (Wilcox 2012). However, the percentile bootstrap does not perform well with all quantities, in particular with the mean (Wilcox & Keselman 2003). You can still use the percentile bootstrap to illustrate the variability in the sample at hand, without making inferences about the underlying population. We do this in the figure below to see how the percentile bootstrap confidence interval compares to other ways to summarise the data.

Figure 6. Updated version of Motulsky’s 2014 figure 5.

This is a replication of Motulsky’s 2014 figure 5, to which I’ve added a 95% percentile bootstrap confidence interval of the mean. This figure makes a critical point: there is no substitute for a scatterplot, at least for relatively small sample sizes. Also, using the mean +/- SD, +/- SEM, with a classic confidence interval (using t formula) or with a percentile bootstrap confidence interval can provide very different impressions about the spread in the data (although it is not their primary objective). The worst representation clearly is mean +/- SEM, because it provides a very misleading impression of low variability. Here, because the sample is skewed, mean +/- SEM does not even include the median, thus providing a wrong estimation of the location of the bulk of the observations. It follows that results in an article reporting only mean +/- SEM cannot be assessed unless  scatterplots are provided, or at least estimates of skewness, bi-modality and complementary measures of uncertainty for comparison. Reporting a boxplot or the quartiles does a much better job at conveying the shape of the distribution than any of the other techniques. These representations are also robust to outliers. In the next figure, we consider a subsample of the observations from Figure 6, to which we add an outlier of increasing size: the quartiles do not move.


Figure 7. Outlier effect on the quartiles. The y-axis is truncated.

Contrary to the quartiles, the classic confidence interval of the mean is not robust, so it provides very inaccurate results. In particular, it assumes symmetry, so even though the outlier is on the right side of the distribution, both sides of the confidence interval get larger. The mean is also  pulled towards the outlier, to the point where it is completely outside the bulk of the observations. I cannot stress this enough: you cannot trust mean estimates if scatterplots are not provided.


Figure 8. Outlier effect on the classic confidence interval of the mean.

In comparison, the percentile bootstrap confidence interval of the mean performs better: only its right side, the side affected by the outlier, expands as the outlier gets larger.


Figure 9. Outlier effect on the percentile bootstrap confidence interval of the mean.

Of course, we do not have to use the mean as a measure of central tendency. It is trivial to compute a percentile bootstrap confidence interval of the median instead, which, as expected, does not change with outlier size:


Figure 10. Outlier effect on the percentile bootstrap confidence interval of the median.

Conclusion

The percentile bootstrap can be used to build a confidence interval for any quantity, whether its sampling distribution can be estimated analytically or not. However, there is no guarantee that the confidence interval obtained will be accurate. In fact, in many situations alternative methods outperform the percentile bootstrap (such as the percentile-t, bias corrected, bias corrected & accelerated (BCa), and wild bootstraps). With this caveat in mind, I think the percentile bootstrap remains an amazingly simple yet powerful tool to summarise the accuracy of an estimate given the variability in the data. It is also the only frequentist tool that performs well in many situations – see Wilcox 2012 for an extensive coverage of these situations.

Finally, it is important to realise that the bootstrap does make a very strong & unwarranted assumption: only the observations in the sample can ever be observed. For this reason, for small samples the bootstrap can produce rugged sampling distributions, as illustrated above. Rasmus Bååth wrote about the limitations of the percentile bootstrap and its link to Bayesian estimation in a blog post I highly recommend; he also provided R code for the bootstrap and the Bayesian bootstrap in another post.

References

Efron, B. & Tibshirani Robert, J. (1993) An introduction to the bootstrap. Chapman & Hall, London u.a.

Mooney, C.Z. & Duval, R.D. (1993) Bootstrapping : a nonparametric approach to statistical inference. Sage Publications, Newbury Park, Calif. ; London.

Motulsky, H.J. (2014) Common misconceptions about data analysis and statistics. J Pharmacol Exp Ther, 351, 200-205.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, Amsterdam ; Boston.

Wilcox, R.R. & Keselman, H.J. (2003) Modern Robust Data Analysis Methods: Measures of Central Tendency. Psychological Methods, 8, 254-274.

Robust effect sizes for 2 independent groups

When I was an undergrad, I was told that beyond a certain sample size (n=30 if I recall correctly), t-tests and ANOVAs are fine. This was a lie. I wish I had been taught robust methods, and that t-tests and ANOVAs on means are only a few options among many alternatives. Indeed, t-tests and ANOVAs on means are not robust to outliers, skewness, heavy tails, and, for independent groups, differences in skewness, variance (heteroscedasticity) and combinations of these factors (Wilcox & Keselman, 2003; Wilcox, 2012). The main consequence is a lack of statistical power. For this reason, it is often advised to report a measure of effect size to determine, for instance, whether a non-significant effect (based on some arbitrary p value threshold) could be due to a lack of power, or reflect a genuine lack of effect. The rationale is that an effect could be associated with a sufficiently large effect size and yet fail to reach the arbitrary p value threshold. However, this advice is pointless, because classic measures of effect size, such as Cohen’s d, its variants, and its extensions to ANOVA, are not robust.

To illustrate the problem, first, let’s consider a simple situation in which we compare 2 independent groups of 20 observations, each sampled from a normal distribution with mean = 0 and standard deviation = 1. We then add a progressively larger constant to one of the samples, to shift it away from the other. As illustrated in Figure 1, as the difference between the two groups increases, so does Cohen’s d. The Matlab code to reproduce all the examples is available here, along with a list of matching R functions from Rand Wilcox’s toolbox.

fig1-cohend_3ex

Figure 1. Examples of Cohen’s d as a function of group differences. For simplicity, I report the absolute value of Cohen’s d, here and in subsequent figures.
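To make the computation concrete, here is a minimal Matlab sketch of Cohen’s d for two independent groups – an illustration only, not the code used to generate the figures; the group sizes and the shift are placeholders.

% Cohen's d for two independent groups -- minimal sketch
rng(2);
g1 = randn(20,1);                         % group 1: normal, mean 0, sd 1
g2 = randn(20,1) + 1;                     % group 2: shifted by a placeholder constant
n1 = numel(g1); n2 = numel(g2);
sp = sqrt(((n1-1)*var(g1) + (n2-1)*var(g2)) / (n1+n2-2));   % pooled standard deviation
d  = (mean(g2) - mean(g1)) / sp;          % Cohen's d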

We can map the relationship between group mean differences and d systematically, by running a simulation in which we repeatedly generate two random samples and progressively shift one away from the other by a small amount. We get a nice linear relationship (Figure 2).

fig2-cohend_sysmap

Figure 2. Linear relationship between Cohen’s d and group mean differences.

Cohen’s d appears to behave nicely, so what’s the problem? Let’s consider another example, in which we generate 2 samples of 20 observations from a normal distribution, and shift their means by a fixed amount of 2. Then, we replace the largest observation from group 2 by progressively larger values. As we do so, the difference between the means of group 1 and group 2 increases, but Cohen’s d decreases (Figure 3).

fig3-cohend_outliers

Figure 3. Cohen’s d is not robust to outliers.

Figure 4 provides a more systematic illustration of the effect of extreme values on Cohen’s d for the case of 2 groups of 20 observations. As the group difference increases, Cohen’s d wrongly suggests progressively lower effect sizes.

fig4-cohend_sysout

Figure 4. Cohen’s d as a function of group mean differences in the presence of one outlier. There is an inverse and slightly non-linear relationship between the two variables.

What is going on? Remember that Cohen’s d is the difference between the two group means divided by the pooled standard deviation. Neither the numerator nor the denominator is robust, so that even one unusual value can substantially alter d and lead to the wrong conclusions about effect size. In the example provided in Figure 4, d gets smaller as the mean difference increases because the denominator of d is based on a non-robust estimator of dispersion, the variance: the outlier increases variability, which inflates the denominator, and thus lowers d. The outlier also has a strong effect on the mean, which inflates the numerator, and thus increases d. However, the outlier has a stronger effect on the variance than on the mean: this imbalance explains the overall decrease of d with increasing outlier size. I leave it as an exercise to understand the origin of the non-linearity in Figure 4. It has to do with the differential effect of the outlier on the mean and the variance.
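A quick way to see this imbalance is to recompute d, the mean difference and the pooled standard deviation while one observation is replaced by increasingly extreme values – a sketch under the same assumptions as above, not the code behind Figure 4.

% effect of a single outlier on Cohen's d -- minimal sketch
rng(3);
g1 = randn(20,1);
g2 = randn(20,1) + 2;                     % true shift of 2
for outlier = [5 10 20 40]                % replace the largest value in group 2
    g2o = g2;
    [~, imax] = max(g2o);
    g2o(imax) = outlier;
    sp = sqrt((var(g1) + var(g2o)) / 2);  % pooled SD (equal group sizes)
    d  = (mean(g2o) - mean(g1)) / sp;
    fprintf('outlier = %2d: mean diff = %.2f, pooled SD = %.2f, d = %.2f\n', ...
        outlier, mean(g2o) - mean(g1), sp, d);
end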

One could argue that the outlier value added to one of the groups could be removed, which would solve the problem. There are 3 objections to this argument:

  • there are situations in which extreme values are not outliers but expected and plausible observations from a skewed or heavy-tailed distribution, and thus physiologically or psychologically meaningful values. In other words, what looks like an outlier in a sample of 20 observations could well look very natural in a sample of 200 observations;
  • for small sample sizes, relatively small outliers could go unnoticed but still affect effect size estimation;
  • outliers are not the only problem: skewness & heavy tails can affect the mean and the variance and thus d.

For instance, in some cases, two groups can differ in skewness, as illustrated in Figure 5. In the left panel, the two kernel density estimates illustrate two samples of 100 observations from a normal distribution. The two groups overlap only moderately, and Cohen’s d is high. In the right panel, group 1, with a mean of zero, is the same as in the previous panel; group 2, with a mean of 2, is almost identical to the one in the left panel, except that its largest 10% of observations were replaced with slightly larger observations. As a result, the overlap between the two distributions is the same in the two panels – yet Cohen’s d is substantially smaller in the second example.

fig5-cohend_mixed

Figure 5. Cohen’s d for normal & skewed distributions.

The point of this example is to illustrate the potential for discrepancies between a visual inspection of two distributions and Cohen’s d. Clearly, in Figure 5, a useful measure of effect size should provide the same estimates for the two examples. Fortunately, several robust alternatives have this desirable property, including Cliff’s delta, the Kolmogorov-Smirnov test statistic, Wilcox & Muska’s Q, and mutual information.

Robust versions of Cohen’s d

Before going over the 4 robust alternatives listed above, it is useful to consider that Cohen’s d is part of a large family of estimators of effect size, which can be described as the ratio of a difference between two measures of central tendency (CT), over some measure of variability:

(CT1 – CT2) / variability

From this expression, it follows that robust effect size estimators can be derived by plugging in robust estimators of central tendency in the numerator and robust estimators of variability in the denominator. Several examples of such robust alternatives are available, for instance using trimmed means and Winsorised variances (Keselman et al. 2008; Wilcox 2012). R users might want to check these functions from Wilcox for instance:

  • akp.effect
  • yuenv2
  • med.effect

There are also extensions of these quantities to the comparison of more than two groups (Wilcox 2012).
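As an illustration of this plug-in recipe, here is a minimal Matlab sketch using 20% trimmed means and 20% Winsorised variances. It follows the general formula above but is not a port of Wilcox’s akp.effect (in particular, the rescaling constant used in that function is omitted), and the data are placeholders.

% robust analogue of Cohen's d: trimmed means over Winsorised SD -- minimal sketch
rng(4);
g1 = randn(30,1);  g2 = randn(30,1) + 1;  % placeholder groups
gamma = 0.2;                              % proportion trimmed/Winsorised in each tail
tm1 = trimmean(g1, 100*2*gamma);          % 20% trimmed means (Statistics Toolbox)
tm2 = trimmean(g2, 100*2*gamma);
groups = {g1, g2};  wv = zeros(2,1);      % Winsorised variances
for k = 1:2
    xs = sort(groups{k});
    n  = numel(xs);
    g  = floor(gamma * n);
    xw = min(max(groups{k}, xs(g+1)), xs(n-g));   % clamp the tails to the nearest retained values
    wv(k) = var(xw);
end
es = (tm2 - tm1) / sqrt(mean(wv));        % robust effect size (no rescaling constant)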

Robust & intuitive measures of effect sizes

In many situations, the robust effect sizes presented above can bring a great improvement over Cohen’s d and its derivatives. However, they provide only a limited perspective on the data. First, I don’t find this family of effect sizes the easiest to interpret: having to think of effects in standard deviation (or robust equivalent) units is not the most intuitive. Second, this type of effect sizes does not always answer the questions we’re most interested in (Cliff, 1996; Wilcox, 2006).

The simplest measure of effect size: the difference

Fortunately, effect sizes don’t have to be expressed as the ratio difference / variability. The simplest effect size is simply a difference. For instance, when reporting that group A differs from group B, typically people report the mean for each group. It is also very useful to report the difference, without normalisation, but with a confidence or credible interval around it, or some other estimate of uncertainty. This simple measure of effect size can be very informative, particularly if you care about the units. It is also trivial to make it robust by using robust estimators, such as the median when dealing with reaction times and other skewed distributions.

Probabilistic effect size and the Wilcoxon-Mann-Whitney U statistic

For two independent groups, asking by how much the central tendencies of the two groups differ is useful, but this certainly does not exhaust all the potential differences between the two groups. Another perspective relates to a probabilistic description: for instance, given two groups of observations, what is the probability that one random observation from group 1 is larger than a random observation from group 2? Given two independent variables X and Y, this probability can be defined as P(X > Y). Such a probability gives a very useful indication of the amount of overlap between the two groups, in a way that is not limited to, or dependent on, measures of central tendency. More generally, we could consider these 3 probabilities:

  • P(X > Y)
  • P(X = Y)
  • P(X < Y)

These probabilities are worth reporting in conjunction with illustrations of the group distributions. Also, there is a direct relationship between these probabilities and the Wilcoxon-Mann-Whitney U statistic (Birnbaum, 1956; Wilcox 2006). Given sample sizes Nx and Ny:

U / (NxNy) = P(X > Y) + 0.5 × P(X = Y)

In the case of two strictly continuous distributions, for which ties do not occur:

U / (NxNy) = P(X > Y)
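These probabilities are easy to estimate from all pairwise comparisons between the two samples. A minimal Matlab sketch, with placeholder data:

% estimate P(X>Y), P(X=Y) and P(X<Y) from all pairwise comparisons -- minimal sketch
rng(5);
x = randn(30,1);  y = randn(25,1) + 0.5;  % placeholder samples
D = bsxfun(@minus, x, y');                % Nx by Ny matrix of differences x - y
pXgtY = mean(D(:) > 0);                   % P(X > Y)
pXeqY = mean(D(:) == 0);                  % P(X = Y), zero for continuous data
pXltY = mean(D(:) < 0);                   % P(X < Y)
% relation to the Wilcoxon-Mann-Whitney U statistic:
% U / (Nx*Ny) = P(X > Y) + 0.5 * P(X = Y)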

Cliff’s delta

Cliff suggested using P(X > Y) and P(X < Y) to compute a new measure of effect size. He defined what is now called Cliff’s delta as:

delta = P(X > Y) – P(X < Y)

Cliff’s delta estimates the probability that a randomly selected observation from one group is larger than a randomly selected observation from another group, minus the reverse probability (Cliff, 1996). It is estimated as:

delta = (sum(x > y) – sum(x < y)) / (NxNy)

In this equation, each observation from one group is compared to each observation in the other group, and we count how many times the observations from one group are higher or lower than those in the other group. The difference between these two counts is then divided by the total number of comparisons, the product of the two sample sizes NxNy. This statistic ranges from 1, when all values from one group are higher than the values from the other group, to -1, when the reverse is true. Completely overlapping distributions have a Cliff’s delta of 0. Because delta is a statistic based on ordinal properties of the data, it is unaffected by rank-preserving data transformations. Its non-parametric nature reduces the impact of extreme values or distribution shape. For instance, Cliff’s delta is not affected by the outlier or the difference in skewness in the examples from Figures 3 & 5.
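In Matlab, the estimator can be computed directly from the matrix of all pairwise comparisons – a minimal sketch with placeholder data, not Wilcox’s implementation.

% Cliff's delta from all pairwise comparisons -- minimal sketch
rng(6);
x = randn(30,1);  y = randn(25,1) + 0.5;              % placeholder samples
D = bsxfun(@minus, x, y');                            % all pairwise differences
delta = (sum(D(:) > 0) - sum(D(:) < 0)) / numel(D);   % ranges from -1 to 1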

For an MEEG application, we’ve used Cliff’s delta to quantify effect sizes in single-trial ERP distributions (Bieniek et al. 2015). We also used Q, presented later on in this post, but it behaved so similarly to delta that it does not feature in the paper.

An estimate of the standard error of delta can be used to compute a confidence interval for delta. When conditions differ, the statistical test associated with delta can be more powerful than the Wilcoxon-Mann-Whitney test, which uses the wrong standard error (Cliff, 1996; Wilcox, 2006). Also, contrary to U, delta is a direct measure of effect size, with an intuitive interpretation. There are also some attempts at extending delta to handle more than two groups (e.g. Wilcox, 2011). Finally, Joachim Goedhart has provided an Excel macro to compute Cliff’s delta.

Update: Cliff’s delta is also related to the later introduced “common-language effect size” – see this post from Jan Vanhove.

All pairwise differences

Cliff’s delta is a robust and informative measure of effect size. Because it relies on probabilities, it normalises effect sizes onto a common scale useful for comparisons across experiments. However, the normalisation gets rid of the original units. So, what if the units matter? A complementary perspective to that provided by delta can be gained by considering all the pairwise differences between individual observations from the two groups (Figure 6). Such a distribution can be used to answer a very useful question: given that we randomly select one observation from each group, what is the typical difference we can expect? This can be obtained by computing, for instance, the median of the pairwise differences. An illustration of the full distribution provides a lot more information: we can see how far away the bulk of the distribution is from zero, and get a sense of how large differences can be in the tails.

fig6-all_pairwise_differences

Figure 6. Illustration of all pairwise differences. Left panel: scatterplots of the two groups of observations. One observation from group 1 (in red) is compared to all the observations from group 2 (in orange). The difference between all the pairs of observations is saved and the same process is applied to all the observations from group 1. Right panel: kernel density estimate of the distribution of all the pairwise differences between the two groups. The median of these differences is indicated by the continuous vertical line; the 1st & 3rd quartiles are indicated by the dashed vertical lines.

Something like Figure 6, in conjunction with Cliff’s delta and associated probabilities, would provide a very useful summary of the data.
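Here is a minimal Matlab sketch of the quantities underlying Figure 6 – placeholder data, with the kernel density estimate of the right panel left out.

% all pairwise differences between two independent groups -- minimal sketch
rng(7);
g1 = randn(30,1);  g2 = randn(30,1) + 1;  % placeholder groups
alldiff = bsxfun(@minus, g1, g2');        % matrix of all pairwise differences
alldiff = alldiff(:);
typdiff = median(alldiff);                % typical difference between two random observations
quarts  = prctile(alldiff, [25 75]);      % 1st & 3rd quartiles, as in Figure 6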

When Cohen’s d & Cliff’s delta fail

Although the robust alternatives to Cohen’s d considered so far, including Cliff’s delta, handle well situations in which two conditions differ in central tendency, they completely fail to describe situations like the one in Figure 7. In this example, the two distributions are dramatically different from each other, yet Cohen’s d is exactly zero, and Cliff’s delta is very close to zero.

fig7-vardiffexample

Figure 7. Measures of effect size for two distributions that differ in spread, not in location. Cd = Cohen’s d, delta = Cliff’s delta, MI = mutual information, KS = Kolmogorov-Smirnov test statistic, Q = Wilcox & Muska’s Q.

Here the two distributions differ in spread, not in central tendency, so it would be wise to estimate spread instead. This is indeed one possibility. But it would also be nice to have an estimator of effect size that can handle special cases like this one as well. Three estimators fit the bill, as suggested by the title of Figure 7.

The Kolmogorov-Smirnov statistic

It’s time to introduce a powerful all-rounder: the Kolmogorov-Smirnov test statistic. The KS test is often mentioned as a way to compare one distribution to a normal distribution, but it can also be used to compare two independent samples. In that context, the KS test statistic is defined as the maximum of the absolute differences between the empirical cumulative distribution functions (ecdf) of the two groups. As such, KS is not limited to differences in central tendency; it is also robust, independent of the shape of the distributions, and provides a measure of effect size bounded between 0 and 1. Figure 8 illustrates the statistic using the example from Figure 7. The KS statistic is quite large, suggesting correctly that the two distributions differ. More generally, because it is robust and sensitive to differences located anywhere in the distributions, the KS test is a solid candidate for a default test for two independent samples. However, the KS test is more sensitive to differences in the middle of the distributions than in the tails. To correct this problem, there is also a weighted version of the KS test which provides increased sensitivity to differences in the tails of the distributions – check out the ks R function from Wilcox.

fig8-vardiff_ks_illustration

Figure 8. Illustration of the KS statistic for two independent samples. The top panel shows the kernel density estimates for the two groups. The lower panel shows the matching empirical cumulative distribution functions. The thick black line marks the maximum absolute difference between the two ecdfs – the KS statistic. Figure 8 is the output of the ksstat_fig Matlab function written for this post.
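The statistic itself is simple to compute. Here is a minimal Matlab sketch using kstest2 from the Statistics Toolbox, with the same quantity also computed by hand from the two ecdfs – placeholder data, not the ksstat_fig code.

% two-sample Kolmogorov-Smirnov statistic -- minimal sketch
rng(8);
x = randn(100,1);  y = 2 * randn(100,1);  % same mean, different spread
[~, pval, ksstat] = kstest2(x, y);        % Statistics Toolbox
% by hand: maximum absolute difference between the two ecdfs,
% evaluated at the pooled observations
pts   = sort([x; y]);
ecdfx = arrayfun(@(t) mean(x <= t), pts);
ecdfy = arrayfun(@(t) mean(y <= t), pts);
ks_by_hand = max(abs(ecdfx - ecdfy));     % matches ksstat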

The KS statistic increases non-linearly as the difference in variance between two samples of 100 observations progressively increases (Figure 9). The two samples were drawn from a standard normal distribution and do not differ in mean.

fig9-vardiff_map

Figure 9. Relationship between effect sizes and variance differences. The 3 measures of effect size illustrated here are sensitive to distribution differences other than central tendency, and are therefore better able to handle a variety of cases compared to traditional effect size estimates.

Wilcox & Muska’s Q

Similarly to KS, the Q statistic is also a non-parametric measure of effect size. It ranges from 0 to 1, with chance level at 0.5. It is the probability of correctly deciding whether a randomly selected observation from one of two groups belongs to the first group, based on the kernel density estimates of the two groups (Wilcox & Muska, 1999). Essentially, it reflects the degree of separation between two groups. Again, similarly to KS, in situations in which two distributions differ in other aspects than central tendency, Q might suggest that a difference exists, whereas other methods such as Cohen’s d or Cliff’s delta would not (Figure 9).

Mutual information

In addition to the KS statistic and Q, a third estimator can be used to quantify many sorts of differences between two or more independent samples: mutual information (MI). MI is a non-parametric measure of association between distributions. As shown in Figure 9, it is sensitive to group differences in spread. MI is expressed in bits and is quite popular in neuroscience – much more so than in psychology. MI is a powerful and much more versatile quantity than any of the tools we have considered so far. To learn more about MI, check out Robin Ince’s tutorial with Matlab & Python code and examples, with special applications to brain imaging. There is also a clear illustration of MI calculation using bins in Figure S3 of Schyns et al. 2010.
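To give a flavour of the computation, here is a crude Matlab sketch of a binned mutual information between group membership and the observed values. This illustrates the binning idea only; it is not the estimator used for the figures or in our papers – for proper estimators see Robin Ince’s tutorial.

% binned mutual information between group membership and observations -- crude sketch
rng(9);
g1 = randn(200,1);  g2 = 2 * randn(200,1);        % placeholder groups differing in spread
allobs = [g1; g2];
labels = [ones(size(g1)); 2*ones(size(g2))];
nbins  = 10;                                      % placeholder number of bins
[~, ~, binidx] = histcounts(allobs, linspace(min(allobs), max(allobs), nbins+1));
joint = accumarray([labels binidx], 1, [2 nbins]) / numel(allobs);   % joint probabilities
indep = sum(joint, 2) * sum(joint, 1);            % product of the marginal probabilities
nz    = joint > 0;
MI    = sum(joint(nz) .* log2(joint(nz) ./ indep(nz)));   % mutual information in bits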

In the lab, we use MI to quantify the relationship between stimulus variability and behaviour or brain activity (e.g. Rousselet et al. 2014). This is done using single-trial distributions in every participant. Then, at the group level, we compare distributions of MI between conditions or groups of participants. We thus use MI as a robust measure of within-participant effect size, applicable to many situations. This quantity can then be illustrated and tested across participants. This strategy is particularly fruitful to compare brain activity between groups of participants, such as younger and older participants. Cliff’s delta for instance could then be used to quantify the MI difference between groups.

Comparisons of effect sizes

We’ve covered several useful robust measures of effect size, with different properties. So, which one should be used? In statistics, the answer to this sort of question often is “it depends”. Indeed, it depends on your needs and on the sort of data you’re dealing with. It also depends on which measure makes more sense to you. The code provided with this post will let you explore the different options using simulated data or your own data. For now, we can get a sense of the behaviour of delta, MI, KS and Q for relatively large samples of observations from a normal distribution. In Figure 10, two distributions are progressively shifted away from each other.

fig10-escomp_kde

Figure 10. Examples of effect size estimates for different distribution shifts.

Figure 11 provides a more systematic mapping of the relationship between effect size estimates and the difference between the means of two groups of 100 observations. The KS statistic and Q appear to have similar profiles, with a linear rise for small differences, before progressively reaching a plateau. In contrast, Cliff’s delta appears to be less variable and to reach a maximum earlier than KS and Q. MI differs from the other 3 quantities with its non-linear rise for small mean differences.

fig11-escomp_diffmean

Figure 11. Relationship between effect sizes and mean differences.

To more clearly contrast the 4 effect sizes, all their pairwise comparisons are provided in Figure 12. From these comparisons, it seems that KS and Q are almost completely linearly related. If this is the case, then there isn’t much advantage in using Q given that it is much slower to compute than KS. Other comparisons reveal different non-linearities between estimators. These differences would certainly be worth exploring in particular experimental contexts… But enough for this post.

fig12-escomp_sys

Figure 12. Relationship between effect sizes.

Final notes

Because Cohen’s d and related estimators of effect size are not robust, they should be abandoned in favour of robust methods. This is not to say that Cohen’s d is of no value – for instance, in the case of single-trial ERP distributions of 100s of trials, it would be appropriate (Bieniek et al. 2015). But for typical group level analyses, I see no reason to use non-robust methods such as Cohen’s d. And defending the use of Cohen’s d and related measures for the sake of continuity in the literature, so that readers can compare them across studies, is completely misguided: non-robust measures cannot be compared, because the same value can be obtained for different amounts of overlap between distributions. For this reason, I am highly suspicious of any attempt to perform meta-analyses or to quantify effect sizes in the literature using published values, without access to the raw data. To allow true comparisons across studies, there is only one necessary and sufficient step: to share your data.

In the literature, there is a rampant misconception that statistical tests and measures of effect size are distinct entities. The Kolmogorov-Smirnov test and Cliff’s delta demonstrate that both aspects can be combined elegantly. Other useful measures of effect size, such as mutual information, can be used to test hypotheses by combining them with a bootstrap or permutation approach.
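As a minimal sketch of the permutation approach, here the group labels are shuffled many times and an effect size – Cliff’s delta in this example, but MI or any of the other measures could be plugged in – is recomputed on each shuffle to build a null distribution. Placeholder data and number of permutations; this illustrates the idea rather than a recommended off-the-shelf procedure.

% permutation test using an effect size as the test statistic -- minimal sketch
rng(10);
g1 = randn(30,1);  g2 = randn(30,1) + 1;          % placeholder groups
cliffd = @(a,b) (sum(reshape(bsxfun(@minus, a, b'), [], 1) > 0) - ...
                 sum(reshape(bsxfun(@minus, a, b'), [], 1) < 0)) / (numel(a) * numel(b));
obs    = cliffd(g1, g2);                          % observed effect size
pooled = [g1; g2];
n1     = numel(g1);
nperm  = 2000;                                    % placeholder number of permutations
permd  = zeros(nperm, 1);
for p = 1:nperm
    idx = randperm(numel(pooled));                % shuffle the group labels
    permd(p) = cliffd(pooled(idx(1:n1)), pooled(idx(n1+1:end)));
end
pval = mean(abs(permd) >= abs(obs));              % two-sided permutation p value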

Which technique to use in which situation is something best worked out by yourself, given your own data and extensive tests. Essentially, you want to find measures that are informative and intuitive to use, and that you can trust in the long run. The alternatives described in this post are not the only ones on the market, but they are robust, informative, intuitive, and they cover a lot of useful situations. For instance, if the fields of neuroscience and psychology were to use the Kolmogorov-Smirnov test as the default test when comparing two independent groups, I would expect a substantial reduction in the number of false negatives reported in the literature. The Kolmogorov-Smirnov test statistic is also a useful measure of effect size on its own. But because the KS test does not tell us how two distributions differ, it greatly benefits from the addition of detailed illustrations. This comment applies to all the techniques described in this post: although useful, none of them provides a full picture of how two distributions differ. For that, detailed illustrations can be combined with robust estimation to compare two entire distributions.

References

Bieniek, M.M., Bennett, P.J., Sekuler, A.B. & Rousselet, G.A. (2015) A robust and representative lower bound on object processing speed in humans. The European Journal of Neuroscience.

Birnbaum, Z.W. (1956) On a use of the Mann-Whitney statistic. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability.

Cliff, N. (1996) Ordinal methods for behavioral data analysis. Erlbaum, Mahwah, NJ.

Keselman, H.J., Algina, J., Lix, L.M., Wilcox, R.R. & Deering, K.N. (2008) A generally robust approach for testing hypotheses and setting confidence intervals for effect sizes. Psychological Methods, 13, 110-129.

Rousselet, G.A., Ince, R.A., van Rijsbergen, N.J. & Schyns, P.G. (2014) Eye coding mechanisms in early human face event-related potentials. Journal of Vision, 14, 7.

Wilcox, R.R. (2006) Graphical methods for assessing effect size: Some alternatives to Cohen’s d. Journal of Experimental Education, 74, 353-367.

Wilcox, R.R. (2011) Inferences about a probabilistic measure of effect size when dealing with more than two groups. Journal of Data Science, 9, 471-486.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, Amsterdam.

Wilcox, R.R. & Keselman, H.J. (2003) Modern robust data analysis methods: Measures of central tendency. Psychological Methods, 8, 254-274.

Wilcox, R.R. & Muska, J. (1999) Measuring effect size: A non-parametric analogue of omega(2). The British Journal of Mathematical and Statistical Psychology, 52, 93-110.