# Bayesian shift function

Two distributions can differ in many ways, yet the standard approach in neuroscience & psychology is to assume differences in means. That’s why the first step in exploratory data analysis should always be detailed graphical representations (Rousselet et al. 2016, 2017). To help quantify how two distributions differ, a fantastic tool is the shift function – an example is provided below. It consists in plotting the difference between group quantiles, as a function of the quantiles in one group. The technique was first described in the 1970s by Doksum (1974; Doksum & Sievers, 1976), and later refined by Wilcox using the Harrell-Davis quantile estimator (Harrell & Davis, 1982) in conjunction with two percentile bootstrap methods (Wilcox 1995; Wilcox et al. 2014). The technique is related to delta plots and relative distribution methods (see details in Rousselet et al. 2017).

The goal of this post is to get feedback on my first attempt to make a Bayesian version of the shift function. Below I describe three potential strategies. My main motivation for making a Bayesian version is that the shift function comes with frequentist confidence intervals and p values. Although I still use confidence intervals to describe sampling variability, they are inherently linked to p values and tend to be associated with major flaws in interpretation (Morey et al. 2016). And my experience is that if p values are available, most researchers will embark on lazy and irrational decision making (Wagenmakers 2007). P values would not be a problem if they were just used as one of many pieces of evidence, without any special status (McShane et al. 2017).

Let’s consider a toy example: two independent exGaussian distributions, with n = 100 in each group.

The two groups clearly differ, as expected from the generative process (see online code). A t-test on means suggests a large uncertainty about the group difference:

difference = – 65 ms [-166, 37]

t = -1.26

p = 0.21

The x-axis shows the quantiles of group 1 (here only the 9 deciles, which is a good default). The y-axis shows the difference between deciles of group 1 and group 2. Intuitively, the difference shows by how much group 2 would need to be shifted to match group 1, for each decile. The coloured labels show the quantile differences. I let you go back and forth between the density estimates and the shift function to understand what’s going on. Another detailed description is provided here if needed.

# Strategy 1: Bayesian bootstrap

A simple strategy to make a Bayesian shift function is to use the Bayesian bootstrap instead of a percentile bootstrap. The percentile bootstrap can already be considered a very cheap way to create Bayesian posterior distributions, if we make the strong (and wrong) assumption that our observations are the only possible ones. Rasmus Bååth provides a detailed introduction to the Bayesian bootstrap, and it’s R implementation on his blog. There is also a video.

Using the Bayesian bootstrap, and 95% highest density intervals (HDI) of the posterior distributions, the shift function looks very similar to the original version, as expected.

Except now we’re dealing with credible intervals, and there are no p values, so users have to focus on quantification!

# Strategy 2: Bayesian quantile regression

Another strategy is to use quantile regression, which comes in a Bayesian flavour using the asymmetric Laplacian likelihood family. To do that, I’m using the amazing `brms` R package by Paul Bürkner.

To get started with Bayesian statistics and the `brms` package, I recommend this excellent blog post by Matti Vuorre. Many other great posts are available here.

Using the default priors from `brms`, we can fit a quantile regression line for each group and for each decile. The medians (or means, or modes) and 95% HDIs of the posterior distributions can then be used to create a shift function.

Again, the results are quite similar to the original ones, although this time the quantiles do differ a bit from those in the previous versions because we use the medians of the posterior distributions to estimate them. Also, I haven’t looked in much detail at how well the model fits the data. The posterior predictive samples suggest the fits could be improved, but I have too little experience to make a call.

# Strategy 3: Bayesian model with exGaussians

The third strategy is to fit a descriptive model to the distributions, generate samples from the posterior distributions, and compute quantiles from these predicted values. Here, since our toy model simulates reaction time data using exGaussian distributions, it makes sense to fit an exGaussian family to the data. More generally, exGaussian distributions are very good at capturing the shape of RT data (Matzke & Wagenmakers, 2009).

Again, the results look similar to the original ones. This strategy has the advantage to require fitting only one model, instead of a new model for each quantile in strategy 2. In addition, we get an exGaussian fit of the data, which is very useful in itself. So that would probably be my favourite strategy.

Does this make even remotely sense?

Which strategy seems more promising?

The clear benefit of strategies 2 and 3 is that they can easily be extended to create hierarchical shift functions, by fitting simultaneously observations from multiple participants. Whereas for the original shift function using the percentile bootstrap, I don’t see any obvious way to make a hierarchical version.

You might want to ask, why not simply focus on modelling the distributions instead of looking at the quantiles? Modelling the shape of the distributions is great of course, but I don’t think it can achieve the same fine-grained quantification achieved by the shift function. Another approach of course would be to fit a more psychologically motivated model, such as a diffusion model. I think these three approaches are complementary.

# References

Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Annals of Statistics, 2, 267-277.

Doksum, K.A. & Sievers, G.L. (1976) Plotting with Confidence – Graphical Comparisons of 2 Populations. Biometrika, 63, 421-434.

Harrell, F.E. & Davis, C.E. (1982) A new distribution-free quantile estimator. Biometrika, 69, 635-640.

Blakeley B. McShane, David Gal, Andrew Gelman, Christian Robert, Jennifer L. Tackett (2017) Abandon Statistical Significance. arXiv:1709.07588

Matzke, D. & Wagenmakers, E.J. (2009) Psychological interpretation of the ex-Gaussian and shifted Wald parameters: a diffusion model analysis. Psychon Bull Rev, 16, 798-817.

Morey, R.D., Hoekstra, R., Rouder, J.N., Lee, M.D. & Wagenmakers, E.J. (2016) The fallacy of placing confidence in confidence intervals. Psychon Bull Rev, 23, 103-123.

Rousselet, G.A., Foxe, J.J. & Bolam, J.P. (2016) A few simple steps to improve the description of group results in neuroscience. The European journal of neuroscience, 44, 2647-2651.

Rousselet, G., Pernet, C. & Wilcox, R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience, figshare.

Wagenmakers, E.J. (2007) A practical solution to the pervasive problems of p values. Psychonomic bulletin & review, 14, 779-804.

Wilcox, R.R. (1995) Comparing Two Independent Groups Via Multiple Quantiles. Journal of the Royal Statistical Society. Series D (The Statistician), 44, 91-99.

Wilcox, R.R., Erceg-Hurn, D.M., Clark, F. & Carlson, M. (2014) Comparing two independent groups via the lower and upper quantiles. J Stat Comput Sim, 84, 1543-1551.

# How to illustrate a 2×2 mixed ERP design

Let’s consider a simple mixed ERP design with 2 repeated measures (2 tasks) and 2 independent groups of participants (young and older participants). The Matlab code and the data are available on github. The data are time-courses of mutual information, with one vector time-course per participant and task. These results are preliminary and have not been published yet, but you can get an idea of how we use mutual information in the lab in recent publications (Ince et al. 2016a, 2016b; Rousselet et al. 2014). The code and illustrations presented in the rest of the post are not specific to mutual information.

Our 2 x 2 experimental design could be analysed using the LIMO EEG toolbox for instance, by computing a 2 x 2 ANOVA at every time point, and correcting for multiple comparisons using cluster based bootstrap statistics (Pernet et al. 2011, 2015). LIMO EEG has been used to investigate task effects for instance (Rousselet et al. 2011). But here, instead of ANOVAs, I’d like to focus on graphical representations and non-parametric assessment of our simple group design, to focus on effect sizes and to demonstrate how a few figures can tell a rich data-driven story.

First, we illustrate the 4 cells of our design. Figure 1 shows separately each group and each task: in each cell all participants are superimposed using thin coloured lines. We can immediately see large differences among participants and between groups, with overall smaller effects (mutual information) in older participants. There also seems to be task differences, in particular in young participants, which tend to present more sustained effects past 200 ms in the expressive task than the gender task.

Figure 1

To complement the individual traces, we can add measures of central tendency. The mean is shown with a thick green line, the median with a thick black line. See how the mean can be biased compared to the median in the presence of extreme values. The median was calculated using the Harrell-Davis estimator of the 50th quantile. To illustrate the group median with a measure of uncertainty, we can add a 95% percentile bootstrap confidence interval for instance (Figure 2).

We can immediately see discrepancies between the median time-courses and their confidence intervals on the one hand, and the individual time-courses on the other hand. There are indeed many distributions of participants that can lead to the same average time-course. That’s why it is essential to show individual results, at least in some illustrations.

In our 2 x 2 design, we now have 3 aspects to consider: group differences, task differences and their interactions. We illustrate them in turn.

## Age group differences for each task

We can look at the group differences in each task separately, as shown in Figure 3. The medians of each group is shown with 95% percentile bootstrap confidence intervals. On average, older participants tend to have weaker mutual information than young participants – less than half around 100-200 ms post-stimulus. This will need to be better quantified, for instance by reporting the median of all pairwise differences.

Figure 3

Under each panel showing the median + CI for each group, we plot the time-course of the group differences (young-older), with a confidence interval. For group comparisons we cannot illustrate individuals, because participants are be paired. However, we can illustrate all the bootstrap samples, shown in grey. Each sample was obtained by:

• sampling with replacement Ny observations among Ny young observers
• sampling with replacement No observations among No older observers
• compute the median of each group
• subtract the two medians

It is particularly important to illustrate the bootstrap distributions if they are skewed or contain outliers, or both, to check that the confidence intervals provide a good summary. If the bootstrap samples are very skewed, highest density intervals might be a good alternative to classic confidence intervals.

The lower panels of Figure 3 reveal relatively large group differences in a narrow window within 200 ms. The effect also appears to be stronger in the expressive task. Technically, one could also say that the effects are statistically significant, in a frequentist sense, when the 95% confidence intervals do not include zero. But not much is gained from that because some effects are large and others are small. Correction for multiple comparisons would also be required.

## Task differences for each group

Figure 4 has a similar layout to Figure 3, now focusing on the task differences. The top panels suggest that the group medians don’t differ much between tasks, except maybe in young participants around 300-500 ms.

Figure 4

Because task effects are paired, we are not limited to the comparison of the medians between tasks; we can also illustrate the individual task differences and the medians of these differences [1]. These are shown in the bottom panels of Figure 4. In both groups, the individual differences are large and the time-courses of the task differences are scattered around zero, except in the young group starting around 300 ms, where most participants have positive differences (expressive > gender).

[1] When the mean is used as a measure of central tendency, these two perspectives are identical, because the difference between two means is the same as the mean of the pairwise differences. However, this is not the case for the median: the difference between medians is not the same as the median of the differences. Because we are interested in effect sizes, it is more informative to report descriptive statistics of the pairwise differences. The advantage of the Matlab code provided with this post is that instead of looking at the median, we can also look at other quantiles, thus getting a better picture of the strength of the effects.

## Interaction between tasks and groups

Finally, in Figure 5 we consider the interactions between task and group factors. To do that we first superimpose the medians of the task differences with their confidence intervals (top panel). These traces are the same shown in the bottom panels of Figure 4. I can’t say I’m very happy with the top panel of Figure 5 because the two traces are difficult to compare. Essentially the don’t seem to differ much, except maybe for the late effect in young participants being higher than what is observed in older participants.

In the lower panel of Figure 5 we illustrate the age group differences (young – older) between the medians of the pairwise task differences. Again confidence intervals are also provided, along with the original bootstrap samples. Overall, there is very little evidence for a 2 x 2 interaction, suggesting that the age group differences are fairly stable across tasks. Put another way, the weak task effects don’t appear to change much in the two age groups.

## References

Ince, R.A., Jaworska, K., Gross, J., Panzeri, S., van Rijsbergen, N.J., Rousselet, G.A. & Schyns, P.G. (2016a) The Deceptively Simple N170 Reflects Network Information Processing Mechanisms Involving Visual Feature Coding and Transfer Across Hemispheres. Cereb Cortex.

Ince, R.A., Giordano, B.L., Kayser, C., Rousselet, G.A., Gross, J. & Schyns, P.G. (2016b) A statistical framework for neuroimaging data analysis based on mutual information estimated via a gaussian copula. Hum Brain Mapp.

Pernet, C.R., Chauveau, N., Gaspar, C. & Rousselet, G.A. (2011) LIMO EEG: a toolbox for hierarchical LInear MOdeling of ElectroEncephaloGraphic data. Comput Intell Neurosci, 2011, 831409.

Pernet, C.R., Latinus, M., Nichols, T.E. & Rousselet, G.A. (2015) Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of neuroscience methods, 250, 85-93.

Rousselet, G.A., Gaspar, C.M., Wieczorek, K.P. & Pernet, C.R. (2011) Modeling Single-Trial ERP Reveals Modulation of Bottom-Up Face Visual Processing by Top-Down Task Constraints (in Some Subjects). Front Psychol, 2, 137.

Rousselet, G.A., Ince, R.A., van Rijsbergen, N.J. & Schyns, P.G. (2014) Eye coding mechanisms in early human face event-related potentials. J Vis, 14, 7.

# How to quantify typical differences between distributions

In this post, I describe two complementary lines of enquiry for group comparisons:

(1) How do typical levels compare between groups?

(2.1) for independent groups What is the typical difference between randomly selected members of the two groups?

(2.2) for dependent groups What is the typical pairwise difference?

These two questions can be answered by exploring entire distributions, not just one measure of central tendency.

The R code for this post is available on github, and is based on Rand Wilcox’s WRS R package, with extra visualisation functions written using `ggplot2`. I will describe Matlab code in another post.

## Independent groups

When comparing two independent groups, the typical approach consists in comparing the marginal distributions using a proxy: each distribution is summarised using one value, usually the non-robust mean. The difference between means is then normalised by some measure of variability – usually involving the non-robust variance, in which case we get the usual t-test. There is of course no reason to use only the mean as a measure of central tendency: robust alternatives such as trimmed means and M-estimators are more appropriate in many situations (Wilcox, 2012a). However, whether we compare the means or the medians or the 20% trimmed means of two groups, we focus on one question:

“How does the typical level/participant in one group compares to the typical level/participant in the other group?” Q1

There is no reason to limit our questioning of the data to the average Joe in each distribution: to go beyond differences in central tendency, we can perform systematic group comparisons using shift functions. Nevertheless, shift functions are still based on a comparison of the two marginal distributions, even if a more complete one.

An interesting alternative approach consists in asking:

“What is the typical difference between any member of group 1 and any member of group 2?” Q2

This approach involves computing all the pairwise differences between groups, as covered previously.

Let’s look at an example. Figure 1A illustrates two independent samples. The scatterplots indicate large differences in spread between the two groups, and also suggest larger differences in the right than the left tails of the distributions. The medians of the two groups appear very similar, so the two distributions do not seem to differ in central tendency. In keeping with these observations, a t-test and a Mann-Whitney-Wilcoxon test are non-significant, but a Kolmogorov-Smirnov test is.

Figure 1. Independent groups: non-uniform shift. A Stripcharts of marginal distributions. Vertical lines mark the deciles, with a thick line for the median. B Kernel density representation of the distribution of difference scores. Vertical lines mark the deciles, with a thick line for the median. C Shift function. Group 1 – group 2 is plotted along the y-axis for each decile (white disks), as a function of group 1 deciles. For each decile difference, the vertical line indicates its 95% bootstrap confidence interval. When a confidence interval does not include zero, the difference is considered significant in a frequentist sense. The 95% confidence intervals are controlled for multiple comparisons. D Difference asymmetry plot with 95% confidence intervals. The family-wise error is controlled by adjusting the critical p values using Hochberg’s method; the confidence intervals are not adjusted.

This discrepancy between tests highlights an important point: if a t-test is not significant, one cannot conclude that the two distributions do not differ. A shift function helps us understand how the two distributions differ (Figure 1C): the overall profile corresponds to two centred distributions that differ in spread; for each decile, we can estimate by how much they differ, and with what uncertainty; finally, the differences appear asymmetric, with larger differences in the right tails.

Is this the end of the story? No, because so far we have only considered Q1, how the two marginal distributions compare. We can get a different but complementary perspective by considering Q2, the typical difference between any member of group 1 and any member of group 2. To address Q2, we compute all the pairwise differences between members of the two groups. In this case each group has n=50, so we end up with 2,500 differences. Figure 1B shows a kernel density representation of these differences. So what does the typical difference looks like? The median of the differences is very near zero, so it seems on average, if we randomly select one observation from each group, they will differ very little. However, the differences can be quite substantial, and with real data we would need to put these differences in context, to understand how large they are, and their physiological/psychological interpretation. The differences are also asymmetrically distributed, with negative skewness: negative scores extend to -10, whereas positive scores don’t even reach +5. This asymmetry relates to our earlier observation of asymmetric differences in the shift function.

Recently, Wilcox (2012) suggested a new approach to quantify asymmetries in difference distributions. To understand his approach, we first need to consider how difference scores are usually characterised. It helps to remember that for continuous distributions, the Mann—Whitney-Wilcoxon U statistics = sum(X>Y) for all pairwise comparisons, i.e. the sum of the number of times observations in group X are larger than observations in group Y. Concretely, to compute U we sum the number of times observations in group X are larger than observations on group Y. This calculation requires to compute all pairwise differences between X and Y, and then count the number of positive differences. So the MWW test assesses P(X>Y) = 0.5. Essentially, the MWW test is a non- parametric test of the hypothesis that the distributions are identical. The MWW test does not compare the medians of the marginal distributions as often stated; also, it estimates the wrong standard error (Cliff, 1996). A more powerful test is Cliff’s delta, which uses P(X>Y) – P(X<Y) as a measure of effect size. As expected, in our current example Cliff’s delta is not significant, because the difference distribution has a median very near zero.

Wilcox’s approach is an extension of the MWW test: the idea is to get a sense of the asymmetry of the difference distribution by computing a sum of quantiles = q + (1-q), for various quantiles estimated using the Harrell-Davis estimator. A percentile bootstrap technique is used to derive confidence intervals. Figure 1D shows the resulting difference asymmetry plot  (Wilcox has not given a clear name to that new function, so I made one up). In this plot, 0.05 stands for the sum of quantile 0.05 + quantile 0.95; 0.10 stands for the sum of quantile 0.10 + quantile 0.90; and so on… The approach is not limited to these quantiles, so sparser or denser functions could be tested too. Figure 1D reveals negative sums of the extreme quantiles (0.05 + 0.95), and progressively smaller, converging to zero sums as we get closer to the centre of the distribution. So the q+(1-q) plot suggests that the two groups differ, with maximum differences in the tails, and no significant differences in central tendency. Contrary to the shift function, the q+(1-q) plot let us conclude that the difference distribution is asymmetric, based on the 95% confidence intervals. Other alpha levels can be assessed too.

In the case of two random samples from a normal population, one shifted by a constant compared to the other, the shift function and the difference asymmetry function should be about flat, as illustrated in Figure 2. In this case, because of random sampling and limited sample size, the two approaches provide different perspectives on the results: the shift function suggests a uniform shift, but fails to reject for the three highest deciles; the difference asymmetry function more strongly suggests a uniform shift, with all sums at about the same value. This shows that all estimated pairs of quantiles are asymmetric about zero, because the difference function is uniformly shifted away from zero.

Figure 2. Independent groups: uniform shift. Two random samples of 50 observations were generated using `rnorm`. A constant of 1 was added to group 2.

When two distributions do not differ, both the shift function and the difference asymmetry function should be about flat and centred around zero – however this is not necessarily the case, as shown in Figure 3.

Figure 3. Independent groups: no shift – example 1. Two random samples of 50 observations were generated using `rnorm`.

Figure 4 shows another example in which no shift is present, and with n=100 in each group, instead of n=50 in the previous example.

Figure 4. Independent groups: no shift – example 2.  Two random samples of 100 observations were generated using `rnorm`.

In practice, the asymmetry plot will often not be flat. Actually, it took me several attempts to generate two random samples associated with such flat asymmetry plots. So, before getting too excited about your results, it really pays to run a few simulations to get an idea of what random fluctuations can look like. This can’t be stressed enough: you might be looking at noise!

## Dependent groups

Wilcox & Erceg-Hurn (2012) described a difference asymmetry function for dependent group. We’re going to apply the technique to the dataset presented in Figure 5. Panel A shows the two marginal distributions. However, we’re dealing with a paired design, so it is impossible to tell how observations are linked between conditions. This association is revealed in two different ways in panels B & C, which demonstrate a striking pattern: for participants with weak scores in condition 1, differences tend to be small and centred about zero; beyond a certain level, with increasing scores in condition 1, the differences get progressively larger. Finally, panel D shows the distribution of differences, which is shifted up from zero, with only 6 out of 35 differences inferior to zero.

At this stage, we’ve learnt a lot about our dataset – certainly much more than would be possible from current standard figures. What else do we need? Statistical tests?! I don’t think they are absolutely necessary. Certainly, providing a t-test is of no interest whatsoever if Figure 5 is provided, because it cannot provide information we already have.

Figure 5. Dependent groups: data visualisation. A Stripcharts of the two distributions. Horizontal lines mark the deciles, with a thick line for the median. B Stripcharts of paired observations. Scatter was introduced along the x axis to reveal overlapping observations. C Scatterplot of paired observations. The diagonal black reference line of no effect has slope one and intercept zero. The dashed grey lines mark the quartiles of the two conditions. In panel C, it would also be useful to plot the pairwise differences as a function of condition 1 results. D Stripchart of difference scores. Horizontal lines mark the deciles, with a thick line for the median.

Figure 6 provides quantifications and visualisations of the effects using the same layout as Figure 5. The shift function (Figure 6C) shows a non-uniform shift between the marginal distributions: the first three deciles do not differ significantly, the remaining deciles do, and there is an overall trend of growing differences as we progress towards the right tails of the distributions. The difference asymmetry function provides a different perspective. The function is positive and almost flat, demonstrating that the distribution of differences is uniformly shifted away from zero, a result that cannot be obtained by only looking at the marginal distributions. Of course, when using means comparing the marginals or assessing the difference scores give the same results, because the difference of the means is the same as the mean of the differences. That’s why a paired t-test is the same as a one-sample test on the pairwise differences. With robust estimators the two approaches differ: for instance the difference between the medians of the marginals is not the same as the median of the differences.

Figure 6. Dependent groups: uniform difference shift. A Stripcharts of marginal distributions. Vertical lines mark the deciles, with a thick line for the median. B Kernel density representation of the distribution of difference scores. Horizontal lines mark the deciles, with a thick line for the median. C Shift function. D Difference asymmetry plot with 95% confidence intervals.

As fancy as Figure 6 can be, it still misses an important point: nowhere do we see the relationship between condition 1 and condition 2 results, as shown in panels B & C of Figure 5. This is why detailed illustrations are absolutely necessary to make sense of even the simplest datasets.

If you want to make more inferences about the distribution of differences, as shown in Figure 6B, Figure 7 shows a complementary description of all the deciles with their 95% confidence intervals. These could be substituted with highest density intervals or credible intervals for instance.

Figure 7. Dependent groups: deciles of the difference distribution. Each disk marks a difference decile, and the horizontal green line makes its 95% percentile bootstrap confidence interval. The reference line of no effect appears as a continuous black line. The dashed black line marks the difference median.

Finally, in Figure 8 we look at an example of a non-uniform difference shift. Essentially, I took the data used in Figure 6, and multiplied the four largest differences by 1.5. Now we see that the 9th decile does not respect the linear progression suggested by previous deciles, (Figure 8, panels A & B), and the difference asymmetry function suggests an asymmetric shift of the difference distribution, with larger discrepancies between extreme quantiles.

Figure 8. Dependent groups: non-uniform difference shift. A Stripchart of difference scores. B Deciles of the difference distribution. C Difference asymmetry function.

## Conclusion

The techniques presented here provide a very useful perspective on group differences, by combining detailed illustrations and quantifications of the effects. The different techniques address different questions, so which technique to use depends on the question you want to ask. This choice should be guided by experience: to get a good sense of the behaviour of these techniques will require a lot of practice with various datasets, both real and simulated. If you follow that path, you will soon realise that classic approaches such as t-tests on means combined with bar graphs are far too limited, and can hide rich information about a dataset.

I see three important developments for the approach outlined here:

• to make it Bayesian, or at least p value free using highest density intervals;

• to extend it to multiple group comparisons (the current illustrations don’t scale up very easily);

• to extend it to ANOVA type designs with interaction terms.

## References

Cliff, N. (1996) Ordinal methods for behavioral data analysis. Erlbaum, Mahwah, N.J.

Wilcox, R.R. (2012a) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Wilcox, R.R. (2012b) Comparing Two Independent Groups Via a Quantile Generalization of the Wilcoxon-Mann-Whitney Test. Journal of Modern Applied Statistical Methods, 11, 296-302.

Wilcox, R.R. & Erceg-Hurn, D.M. (2012) Comparing two dependent groups via quantiles. J Appl Stat, 39, 2655-2664.

# the shift function: a powerful tool to compare two entire distributions

The R code for this post is available on github, and is based on Rand Wilcox’s WRS R package, with extra visualisation functions written using `ggplot2`. The R code for the 2013 percentile bootstrap version of the shift function was also covered here and here. Matlab code is described in another post.

UPDATE: The shift function and its cousin the difference asymmetry function are described in a review article with many examples. And a Bayesian shift function is now available!

In neuroscience & psychology, group comparison is usually an exercise that involves comparing two typical observations. This is most of the time achieved using a t-test on means. This standard procedure makes very strong assumptions:

• the distributions differ only in central tendency, not in other aspects;
• the typical observation in each distribution can be summarised by the mean;
• the t-test is sufficient to detect changes in location.

As we saw previously, t-tests on means are not robust. In addition, there is no reason a priori to assume that two distributions differ only in the location of the bulk of the observations. Effects can occur in the tails of the distributions too: for instance a particular intervention could have an effect only in animals with a certain hormonal level at baseline; a drug could help participants with severe symptoms, but not others with milder symptoms… Because effects are not necessarily homogenous among participants, it is useful to have appropriate tools at hand, to determine how, and by how much, two distributions differ. Here we’re going to consider a powerful family of tools that are robust and let us compare entire distributions: shift functions.

A more systematic way to characterise how two independent distributions differ was originally proposed by Doksum (Doksum, 1974; Doksum & Sievers, 1976; Doksum, 1977): to plot the difference between the quantiles of two distributions as a function of the quantiles of one group. The original shift function approach is implemented in the functions `sband` and `wband` in Rand Wilcox’s WRS R package.

In 1995, Wilcox proposed an alternative technique which has better probability coverage and potentially more power than Doksum & Sievers’ approach. Wilcox’s technique:

• uses the Harrell-Davis quantile estimator;
• computes confidence intervals of the decile differences with a bootstrap estimation of the standard error of the deciles;
• controls for multiple comparisons so that the type I error rate remains around 0.05 across the 9 confidence intervals. This means that the confidence intervals are a bit larger than what they would be if only one decile was compared, so that the long-run probability of a type I error across all 9 comparisons remains near 0.05;
• is implemented in the `shifthd` function.

Let’s start with an extreme and probably unusual example, in which two distributions differ in spread, not in location (Figure 1). In that case, any test of central tendency will fail to reject, but it would be wrong to conclude that the two distributions do not differ. In fact, a Kolmogorov-Smirnov test reveals a significant effect, and several measures of effect sizes would suggest non-trivial effects. However, a significant KS test just tells us that the two distributions differ, not how.

Figure 1. Two distributions that differ in spread A Kernel density estimates for the groups. B Shift function. Group 1 – group 2 is plotted along the y-axis for each decile (white disks), as a function of group 1 deciles. For each decile difference, the vertical line indicates its 95% bootstrap confidence interval. When a confidence interval does not include zero, the difference is considered significant in a frequentist sense.

The shift function can help us understand and quantify how the two distributions differ. The shift function describes how one distribution should be re-arranged to match the other one: it estimates how and by how much one distribution must be shifted. In Figure 1, I’ve added annotations to help understand the link between the KDE in panel A and the shift function in panel B. The shift function shows the decile differences between group 1 and group 2, as a function of group 1 deciles. The deciles for each group are marked by coloured vertical lines in panel A. The first decile of group 1 is slightly under 5, which can be read in the top KDE of panel A, and on the x-axis of panel B. The first decile of group 2 is lower. As a result, the first decile difference between group 1 and group 2 is positive, as indicated by a positive value around 0.75 in panel B, as marked by an upward arrow and a `+` symbol. The same symbol appears in panel A, linking the deciles from the two groups: it shows that to match the first deciles, group 2’s first decile needs to be shifted up. Deciles 2, 3 & 4 show the same pattern, but with progressively weaker effect sizes. Decile 5 is well centred, suggesting that the two distributions do not differ in central tendency. As we move away from the median, we observe progressively larger negative differences, indicating that to match the right tails of the two groups, group 2 needs to be shifted to the left, towards smaller values – hence the negative sign.

To get a good understanding of the shift function, let’s look at its behaviour in several other clear-cut situations. First, let’s consider a  situation in which two distributions differ in location (Figure 2). In that case, a t-test is significant, but again, it’s not the full story. The shift function looks like this:

Figure 2. Complete shift between two distributions

What’s happening? All the differences between deciles are negative and around -0.45. Wilcox (2012) defines such systematic effect has the hallmark of a completely effective method. In other words, there is a complete and seemingly uniform shift between the two distributions.

In the next example (Figure 3), only the right tails differ, which is captured by significant differences for deciles 6 to 9. This is a case described by Wilcox (2012) as involving a partially effective experimental manipulation.

Figure 3. Positive right tail shift

Figure 4 also shows a right tail shift, this time in the negative direction. I’ve also scaled the distributions so they look a bit like reaction time distributions. It would be much more informative to use shift functions in individual participants to study how RT distributions differ between conditions, instead of summarising each distribution by its mean (sigh)!

Figure 4. Negative right tail shift

Figure 5 shows two large samples drawn from a standard normal population. As expected, the shift function suggests that we do not have enough evidence to conclude that the two distributions differ. The shift function does look bumpy tough, potentially suggesting local differences – so keep that in mind when you plug-in your own data.

Figure 5. No difference?

And be careful not to over-interpret the shift function: the lack of significant differences should not be used to conclude that we have evidence for the lack of effect; indeed, failure to reject in the frequentist sense can still be associated with non-trivial evidence against the null – it depends on prior results (Wagenmakers, 2007).

So far, we’ve looked at simulated examples involving large sample sizes. We now turn to a few real-data examples.

Doksum & Sievers (1976) describe an example in which two groups of rats were kept in an environment with or without ozone for 7 days and their weight gains measured (Figure 6). The shift function suggests two results: overall, ozone reduces weight gain; ozone might promote larger weight gains in animals gaining the most weight. However, these conclusions are only tentative given the small sample size, which explains the large confidence intervals.

Figure 6. Weight gains A Because the sample sizes are much smaller than in the previous examples, the distributions are illustrated using 1D scatterplots. The deciles are marked by grey vertical lines, with lines for the 0.5 quantiles. B Shift function.

Let’s consider another example used in (Doksum, 1974; Doksum, 1977), concerning the survival time in days of 107 control guinea pigs and 61 guinea pigs treated with a heavy dose of tubercle bacilli (Figure 7). Relative to controls, the animals that died the earliest tended to live longer in the treatment group, suggesting that the treatment was beneficial to the weaker animals (decile 1). However, the treatment was harmful to animals with control survival times larger than about 200 days (deciles 4-9). Thus, this is a case where the treatment has very different effects on different animals. As noted by Doksum, the same experiment was actually performed 4 times, each time giving similar results.

Figure 7. Survival time

## Shift function for dependent groups

All the previous examples were concerned with independent groups. There is a version of the shift function for dependent groups implemented in `shiftdhd`. We’re going to apply it to ERP onsets from an object detection task (Bieniek et al., 2015). In that study, 74 of our 120 participants were tested twice, to assess the test-retest reliability of different measurements, including onsets. Typically, test-retest assessment is performed using a correlation. However, we care about the units (ms), which a correlation would get rid of, and we had a more specific hypothesis, which a correlation cannot test; so we used a shift function (Figure 8). If you look at the distributions of onsets across participants, you will see that it is overall positively skewed, and with a few participants with particularly early or late onsets. With the shift function, we wanted to test for the overall reliability of the results, but also in particular the reliability of the left and right tails: if early onsets in session 1 were due to chance, we would expect session 2 estimates to be overall larger (shifted to the right); similarly, if late onsets in session 1 were due to chance, we would expect session 2 estimates to be overall smaller (shifted to the left). The shift function does not provide enough evidence to suggest a uniform or non-uniform shift – but we would probably need many more observations to make a strong claim.

Figure 8. ERP onsets

Because we’re dealing with a paired design, the illustration of the marginal distributions in Figure 8 is insufficient: we should illustrate the distribution of pairwise differences too, as shown in Figure 9.

Figure 9. ERP onsets with KDE of pairwise differences

Figure 10 provides an alternative representation of the distribution of pairwise differences using a violin plot.

Figure 10. ERP onsets with violin plot of pairwise differences

Figure 11 uses a 1D scatterplot (strip chart).

Figure 11. ERP onsets with 1D scatterplot of pairwise differences

## Shift function for other quantiles

Although powerful, Wilcox’s 1995 technique is not perfect, because it:

• is limited to the deciles;
• can only be used with alpha = 0.05;
• does not work well with tied values.

More recently, Wilcox’s proposed a new version of the shift function that uses a straightforward percentile bootstrap (Wilcox & Erceg-Hurn, 2012; Wilcox et al., 2014). This new approach:

• allows tied values;
• can be applied to any quantile;
• can have more power when looking at extreme quantiles (<=0.1, or >=0.9).
• is implemented in `qcomhd` for independent groups;
• is implemented in `Dqcomhd` for dependent groups.

Examples are provided in the R script for this post.

In the percentile bootstrap version of the shift function, p values are corrected, but not the confidence intervals. For dependent variables, Wilcox & Erceg-Hurn (2012) recommend at least 30 observations to compare the .1 or .9 quantiles. To compare the quartiles, 20 observations appear to be sufficient. For independent variables, Wilcox et al. (2014) make the same recommendations made for dependent groups; in addition, to compare the .95 quantiles, they suggest at least 50 observations per group.

## Conclusion

The shift function is a powerful tool that can help you better understand how two distributions differ, and by how much. It provides much more information than the standard t-test approach.

Although currently the shift function only applies to two groups, it can in theory be extended to more complex designs, for instance to quantify interaction effects.

Finally, it would be valuable to make a Bayesian version of the shift function, to focus on effect sizes, model the data, and integrate them with other results.

## References

Bieniek, M.M., Bennett, P.J., Sekuler, A.B. & Rousselet, G.A. (2015) A robust and representative lower bound on object processing speed in humans. The European journal of neuroscience.

Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Annals of Statistics, 2, 267-277.

Doksum, K.A. (1977) Some graphical methods in statistics. A review and some extensions. Statistica Neerlandica, 31, 53-68.

Doksum, K.A. & Sievers, G.L. (1976) Plotting with Confidence – Graphical Comparisons of 2 Populations. Biometrika, 63, 421-434.

Wagenmakers, E.J. (2007) A practical solution to the pervasive problems of p values. Psychonomic bulletin & review, 14, 779-804.

Wilcox, R.R. (1995) Comparing Two Independent Groups Via Multiple Quantiles. Journal of the Royal Statistical Society. Series D (The Statistician), 44, 91-99.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, Amsterdam; Boston.

Wilcox, R.R. & Erceg-Hurn, D.M. (2012) Comparing two dependent groups via quantiles. J Appl Stat, 39, 2655-2664.

Wilcox, R.R., Erceg-Hurn, D.M., Clark, F. & Carlson, M. (2014) Comparing two independent groups via the lower and upper quantiles. J Stat Comput Sim, 84, 1543-1551.