the shift function: a powerful tool to compare two entire distributions

The R code for this post is available on github, and is based on Rand Wilcox’s WRS R package, with extra visualisation functions written using ggplot2. The R code for the 2013 percentile bootstrap version of the shift function was also covered here and here. Matlab code is described in another post.

UPDATE: The shift function and its cousin the difference asymmetry function are described in a review article with many examples. And a Bayesian shift function is now available! The hierarchical shift function provides a powerful alternative to the t-test.

In neuroscience & psychology, group comparison is usually an exercise that involves comparing two typical observations. This is most of the time achieved using a t-test on means. This standard procedure makes very strong assumptions:

• the distributions differ only in central tendency, not in other aspects;
• the typical observation in each distribution can be summarised by the mean;
• the t-test is sufficient to detect changes in location.

As we saw previously, t-tests on means are not robust. In addition, there is no reason a priori to assume that two distributions differ only in the location of the bulk of the observations. Effects can occur in the tails of the distributions too: for instance a particular intervention could have an effect only in animals with a certain hormonal level at baseline; a drug could help participants with severe symptoms, but not others with milder symptoms… Because effects are not necessarily homogenous among participants, it is useful to have appropriate tools at hand, to determine how, and by how much, two distributions differ. Here we’re going to consider a powerful family of tools that are robust and let us compare entire distributions: shift functions.

A more systematic way to characterise how two independent distributions differ was originally proposed by Doksum (Doksum, 1974; Doksum & Sievers, 1976; Doksum, 1977): to plot the difference between the quantiles of two distributions as a function of the quantiles of one group. The original shift function approach is implemented in the functions sband and wband in Rand Wilcox’s WRS R package.

In 1995, Wilcox proposed an alternative technique which has better probability coverage and potentially more power than Doksum & Sievers’ approach. Wilcox’s technique:

• uses the Harrell-Davis quantile estimator;
• computes confidence intervals of the decile differences with a bootstrap estimation of the standard error of the deciles;
• controls for multiple comparisons so that the type I error rate remains around 0.05 across the 9 confidence intervals. This means that the confidence intervals are a bit larger than what they would be if only one decile was compared, so that the long-run probability of a type I error across all 9 comparisons remains near 0.05;
• is implemented in the shifthd function.

Let’s start with an extreme and probably unusual example, in which two distributions differ in spread, not in location (Figure 1). In that case, any test of central tendency will fail to reject, but it would be wrong to conclude that the two distributions do not differ. In fact, a Kolmogorov-Smirnov test reveals a significant effect, and several measures of effect sizes would suggest non-trivial effects. However, a significant KS test just tells us that the two distributions differ, not how. Figure 1. Two distributions that differ in spread A Kernel density estimates for the groups. B Shift function. Group 1 – group 2 is plotted along the y-axis for each decile (white disks), as a function of group 1 deciles. For each decile difference, the vertical line indicates its 95% bootstrap confidence interval. When a confidence interval does not include zero, the difference is considered significant in a frequentist sense.

The shift function can help us understand and quantify how the two distributions differ. The shift function describes how one distribution should be re-arranged to match the other one: it estimates how and by how much one distribution must be shifted. In Figure 1, I’ve added annotations to help understand the link between the KDE in panel A and the shift function in panel B. The shift function shows the decile differences between group 1 and group 2, as a function of group 1 deciles. The deciles for each group are marked by coloured vertical lines in panel A. The first decile of group 1 is slightly under 5, which can be read in the top KDE of panel A, and on the x-axis of panel B. The first decile of group 2 is lower. As a result, the first decile difference between group 1 and group 2 is positive, as indicated by a positive value around 0.75 in panel B, as marked by an upward arrow and a + symbol. The same symbol appears in panel A, linking the deciles from the two groups: it shows that to match the first deciles, group 2’s first decile needs to be shifted up. Deciles 2, 3 & 4 show the same pattern, but with progressively weaker effect sizes. Decile 5 is well centred, suggesting that the two distributions do not differ in central tendency. As we move away from the median, we observe progressively larger negative differences, indicating that to match the right tails of the two groups, group 2 needs to be shifted to the left, towards smaller values – hence the negative sign.

To get a good understanding of the shift function, let’s look at its behaviour in several other clear-cut situations. First, let’s consider a  situation in which two distributions differ in location (Figure 2). In that case, a t-test is significant, but again, it’s not the full story. The shift function looks like this: Figure 2. Complete shift between two distributions

What’s happening? All the differences between deciles are negative and around -0.45. Wilcox (2012) defines such systematic effect has the hallmark of a completely effective method. In other words, there is a complete and seemingly uniform shift between the two distributions.

In the next example (Figure 3), only the right tails differ, which is captured by significant differences for deciles 6 to 9. This is a case described by Wilcox (2012) as involving a partially effective experimental manipulation. Figure 3. Positive right tail shift

Figure 4 also shows a right tail shift, this time in the negative direction. I’ve also scaled the distributions so they look a bit like reaction time distributions. It would be much more informative to use shift functions in individual participants to study how RT distributions differ between conditions, instead of summarising each distribution by its mean (sigh)! Figure 4. Negative right tail shift

Figure 5 shows two large samples drawn from a standard normal population. As expected, the shift function suggests that we do not have enough evidence to conclude that the two distributions differ. The shift function does look bumpy tough, potentially suggesting local differences – so keep that in mind when you plug-in your own data. Figure 5. No difference?

And be careful not to over-interpret the shift function: the lack of significant differences should not be used to conclude that we have evidence for the lack of effect; indeed, failure to reject in the frequentist sense can still be associated with non-trivial evidence against the null – it depends on prior results (Wagenmakers, 2007).

So far, we’ve looked at simulated examples involving large sample sizes. We now turn to a few real-data examples.

Doksum & Sievers (1976) describe an example in which two groups of rats were kept in an environment with or without ozone for 7 days and their weight gains measured (Figure 6). The shift function suggests two results: overall, ozone reduces weight gain; ozone might promote larger weight gains in animals gaining the most weight. However, these conclusions are only tentative given the small sample size, which explains the large confidence intervals. Figure 6. Weight gains A Because the sample sizes are much smaller than in the previous examples, the distributions are illustrated using 1D scatterplots. The deciles are marked by grey vertical lines, with lines for the 0.5 quantiles. B Shift function.

Let’s consider another example used in (Doksum, 1974; Doksum, 1977), concerning the survival time in days of 107 control guinea pigs and 61 guinea pigs treated with a heavy dose of tubercle bacilli (Figure 7). Relative to controls, the animals that died the earliest tended to live longer in the treatment group, suggesting that the treatment was beneficial to the weaker animals (decile 1). However, the treatment was harmful to animals with control survival times larger than about 200 days (deciles 4-9). Thus, this is a case where the treatment has very different effects on different animals. As noted by Doksum, the same experiment was actually performed 4 times, each time giving similar results. Figure 7. Survival time

Shift function for dependent groups

All the previous examples were concerned with independent groups. There is a version of the shift function for dependent groups implemented in shiftdhd. We’re going to apply it to ERP onsets from an object detection task (Bieniek et al., 2015). In that study, 74 of our 120 participants were tested twice, to assess the test-retest reliability of different measurements, including onsets. Typically, test-retest assessment is performed using a correlation. However, we care about the units (ms), which a correlation would get rid of, and we had a more specific hypothesis, which a correlation cannot test; so we used a shift function (Figure 8). If you look at the distributions of onsets across participants, you will see that it is overall positively skewed, and with a few participants with particularly early or late onsets. With the shift function, we wanted to test for the overall reliability of the results, but also in particular the reliability of the left and right tails: if early onsets in session 1 were due to chance, we would expect session 2 estimates to be overall larger (shifted to the right); similarly, if late onsets in session 1 were due to chance, we would expect session 2 estimates to be overall smaller (shifted to the left). The shift function does not provide enough evidence to suggest a uniform or non-uniform shift – but we would probably need many more observations to make a strong claim. Figure 8. ERP onsets

Because we’re dealing with a paired design, the illustration of the marginal distributions in Figure 8 is insufficient: we should illustrate the distribution of pairwise differences too, as shown in Figure 9. Figure 9. ERP onsets with KDE of pairwise differences

Figure 10 provides an alternative representation of the distribution of pairwise differences using a violin plot. Figure 10. ERP onsets with violin plot of pairwise differences

Figure 11 uses a 1D scatterplot (strip chart). Figure 11. ERP onsets with 1D scatterplot of pairwise differences

Shift function for other quantiles

Although powerful, Wilcox’s 1995 technique is not perfect, because it:

• is limited to the deciles;
• can only be used with alpha = 0.05;
• does not work well with tied values.

More recently, Wilcox’s proposed a new version of the shift function that uses a straightforward percentile bootstrap (Wilcox & Erceg-Hurn, 2012; Wilcox et al., 2014). This new approach:

• allows tied values;
• can be applied to any quantile;
• can have more power when looking at extreme quantiles (<=0.1, or >=0.9).
• is implemented in qcomhd for independent groups;
• is implemented in Dqcomhd for dependent groups.

Examples are provided in the R script for this post.

In the percentile bootstrap version of the shift function, p values are corrected, but not the confidence intervals. For dependent variables, Wilcox & Erceg-Hurn (2012) recommend at least 30 observations to compare the .1 or .9 quantiles. To compare the quartiles, 20 observations appear to be sufficient. For independent variables, Wilcox et al. (2014) make the same recommendations made for dependent groups; in addition, to compare the .95 quantiles, they suggest at least 50 observations per group.

Conclusion

The shift function is a powerful tool that can help you better understand how two distributions differ, and by how much. It provides much more information than the standard t-test approach.

Although currently the shift function only applies to two groups, it can in theory be extended to more complex designs, for instance to quantify interaction effects.

Finally, it would be valuable to make a Bayesian version of the shift function, to focus on effect sizes, model the data, and integrate them with other results.

References

Bieniek, M.M., Bennett, P.J., Sekuler, A.B. & Rousselet, G.A. (2015) A robust and representative lower bound on object processing speed in humans. The European journal of neuroscience.

Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Annals of Statistics, 2, 267-277.

Doksum, K.A. (1977) Some graphical methods in statistics. A review and some extensions. Statistica Neerlandica, 31, 53-68.

Doksum, K.A. & Sievers, G.L. (1976) Plotting with Confidence – Graphical Comparisons of 2 Populations. Biometrika, 63, 421-434.

Wagenmakers, E.J. (2007) A practical solution to the pervasive problems of p values. Psychonomic bulletin & review, 14, 779-804.

Wilcox, R.R. (1995) Comparing Two Independent Groups Via Multiple Quantiles. Journal of the Royal Statistical Society. Series D (The Statistician), 44, 91-99.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, Amsterdam; Boston.

Wilcox, R.R. & Erceg-Hurn, D.M. (2012) Comparing two dependent groups via quantiles. J Appl Stat, 39, 2655-2664.

Wilcox, R.R., Erceg-Hurn, D.M., Clark, F. & Carlson, M. (2014) Comparing two independent groups via the lower and upper quantiles. J Stat Comput Sim, 84, 1543-1551.

34 thoughts on “the shift function: a powerful tool to compare two entire distributions”

1. Bjørn Erik Juel

Thank you for sharing all this information. I just came across your site, and I just wanted to express my appreciation!

Like

2. Yuan-Fang Zhao

Thank you for this informative article. However, I’m afraid that the interpretation regarding to one of the real data examples is not that right. In your first example, i.e., the effect of ozone on weight gains, it reads that “ozone might promote larger weight gains in animals gaining the most weight”. This argument seems invalid, as the data did not suggest that some larger-weight-gain cases are from animals gaining the most weight. It is possible that animals gaining the least weight exhibited larger weight gains. The relationship between weight gains in control condition and the effect of ozone is unknown and cannot be inferred from the current data. In this regard, I think the shift function plot may be misleading in some specific conditions.

Like

1. garstats Post author

Hi Yuan-Fang,
Thanks for your comment. The example you refer to has the classic limitations of a between-group design. A repeated measure design would help reach stronger conclusions. But the shift function is certainly not misleading, it only reflects the pattern observable in the scatterplots: among animals with the most gain weight (decile 9), animals from the ozone group tended to gain more weight than animals in the control group – hence the negative difference for the last decile. However, a stronger conclusion could only be reached with a much larger sample size…

Like

3. Balázs Knakker

I’m thinking about comparing RT distributions with a shift function. However, I feel unconfortable applying this to clustered data, since RTs arise from multiple subjects. Is there a variety of this method to turn to in this case? Or should I try to delve in a mixed variety of quantile regression?

Like

1. garstats Post author

Hi,
That’s a very good question. You can see a proposal in that direction in another blog post:
https://garstats.wordpress.com/2018/04/06/bsf/

Currently, I’m running validations for group estimation, in which a shift function is computed for every participant, followed by group statistics on each quantile difference, with correction for multiple comparisons using Hochberg’s method. In terms of type I errors and power, it works well when using median and 20% trimmed means group tests. I need to write a blog post on this approach.

Like

1. Balázs Knakker

Thanks for the prompt response 🙂

So if I understand correctly, you get 10 numbers from each subject (the deciles) and use them in spirit of the “summary statistic approach” to random effects akin to conventional group-level ANOVAs? (And why did you choose Hochberg? Did you consider bootstrapping all the quantiles together and including type I error control, with something like a cluster-based test? Does that have less power with N=10. Excuse me, these are non-clarifying questions but rather just curious ones:) )

Like

2. garstats Post author

I use deciles in the sense of 9 quantiles that split the distribution in 10. So for each participant, instead of getting one mean or one median, we get 9 values providing a much more informative description of the distribution. Actually, it’s 18 values: 9 for each condition, which are then subtracted. We can then do group statistics on each of the 9 differences. Using simulated and real data, the sampling distributions of the deciles are skewed, so median and other trimmed means seem more appropriate for the group assessment. I tried Hochberg because it works well for the standard shift function (https://onlinelibrary.wiley.com/doi/full/10.1111/ejn.13610). Validation of type I errors indicate values a bit lower than 0.05 in various situations, but power remains systematically higher than studying the mean or the median alone. I’ve also validated type I error rates for cluster-sum statistics using permutation and bootstrap, as well as various combinations of TFCE parameters: can’t get near nominal levels so I’ve settled for Hochberg for now. It’s also much faster 🙂

Like

3. Balázs Knakker

Thanks for the exhaustive answer, I’m looking forward to the blog post 🙂

Like

4. Lucas Rizzo

How would you suggest to compute the difference between the two distributions in a single number? Would it be possible to just sum all the differences of each decile? In this case would we also need to define a new confidence interval for the sum of the differences?

Like

1. garstats Post author

You could do that, but you would need to add an extra sum step in the bootstrap loop to get the confidence interval. But that would defeat the purpose of trying to understand how the distributions differ. In a sense, what you would get is similar to a non-normalised measure of overlap between distributions. Look for instance at the qhat function in Rallfun-v35.txt from Rand Wilcox.

Liked by 1 person

5. c

Thanks for this excellent post and for this interesting analytical tool.
I do have a (probably, very naïve) question: why q=1 is not assessed/ included in the shift function analysis? If the independent groups have different maximum values, the length of this slot could be as informative as any of the other segments of the variable range, couldn’t it?

Like

1. garstats Post author

Hi, great question! Min and max are very special quantiles that are very difficult to estimate. For unbounded continuous distributions, these values are undefined. For bounded distributions (think reaction data, questionnaire data), the question becomes whether the populations have different bounds, which can be very difficult to answer because that requires a lot of observations and outliers can dominate the answer. Intuitively, you need the least observations to estimate the central quantiles, and you need progressively more as you go towards the tails. For reaction time data, I don’t see how the max could be compared because the distributions are typically truncated by design (participants can’t respond after X ms) or during preprocessing, and would be extremely sensitive to outliers. For distributions that have natural lower and higher bounds, like questionnaire data or percent correct data, the question could be addressed but would require simulations to determine the best approach. I suspect typical quantile estimators don’t perform well for extreme quantiles. I think a better approach would be to model the data to make inferences about differences in bounds. For minimum reaction times, specific methods involve comparing cumulated distributions for instance: https://open.lnu.se/index.php/metapsychology/article/view/1630

Like

6. c

Thanks for your fast and detailed answer!. I truly appreciate it. I will check carefully the provided reference, although I see your point about extreme quartiles (which, if I understand it correctly, the same concern should probably apply to the lowest quartiles, such as q=0.01).

The example I am currently working on is a 0,1 bounded continuous variable (actually, a probability). Would it require of any special/ additional caution when assessing the shift function with the shifthd_pbci function from the Rogme package?

Like

1. garstats Post author

Correct: same comment for q=0.01. Unless you have loads of trials, I would stick to the deciles. In addition to trials for extreme quantiles, the other main thing you have to be worried about when estimating quantiles in general is the presence of tied values. But the sf by default uses the Harrell-Davis quantile estimator, which handles ties. I would be curious to see your example with probabilities.

Like

7. G

Can this be extended for use in an ANCOVA model? I.e., comparing the posttest means of a treatment and control group using baseline values as a covariate?

Like

8. garstats Post author

Absolutely! I was planning a 2 x 2 design version, considering difference scores in 2 groups, but in many situations an ANCOVA type of model makes more sense. The same function could handle both situations in one go actually. No time for code development at the moment but you could fork the project on GitHub and give it a go. A few simulations would be needed to validate the approach too and I could probably work on that next year if you’d like to collaborate.

Liked by 1 person

1. c

Hi, I also think that a shift-function with co-variates would be really useful. Therefore, I would like to know whether it has been already developed or if there is any suitable alternative till such procedure becomes available. Thanks a lot!

Like

1. garstats Post author

Hi, I’ve just discovered that this can be done using two R functions from Rand Wilcox:
– ancova()
– anc.ES.sum()
Both functions are available in the Rallfun file, which can be downloaded here: http://dornsife.usc.edu/labs/rwilcox/software/
The two functions use a running interval smoother, and the general approach is described here: https://dornsife.usc.edu/assets/sites/239/docs/WRS2.pdf
The ancova() function has an argument for reporting a measure of effect size. Setting method=’QS’ returns the shift function.
The anc.ES.sum() function computes several measures of effect size, including the shift function. It returns confidence
intervals for each covariate value used.

Like

9. Robert

I have question about use of median to calculate ‘effect size’ for two skewed probability distributions, with Y values set as an ‘expected’ reference condition and X values based on random measurement of a variable.

For this example, let the two distributions be Y (expected) and X (observed), with five frequency categories for each, thus:
X1, X2, X3, X4, X5
Y1, Y2, Y3, Y4, Y5

1. Determine count frequencies for each distribution and calculate median. For example:
X: 29, 14, 7, 2, 1 ; Median X = 7
Y: 30, 17, 12, 6, 2; Median Y = 12

2. Calculate Pearson Standardized Residuals(SR) for each category [ABS(xi-yi)/sq rt(Y1)] and SUM, this will become a measure of pooled median residual deviation between X and Y distributions across all five categories :
SUM SR = (0.182 + 0.728 + 1.44 + 1.63 + 0.709) = 4.689

3. Calculate ‘effect size’ as [ABS (Median X – Median Y)/ SUM SR] = (5 / 4.689) = 1.07

4. Following suggestions for Cohen-d using mean as measure of central tendency, effect size > 0.8 suggests a large effect size between the X and Y distributions. Therefore, for this example, the X and Y distributions have a ‘large effect sign’.
==
Would appreciate any feedback why this approach is not a valid estimation of a Cohen type ‘effect size’ phenomenon using median as measure of central tendency.

Robert

Like

1. garstats Post author

Hi Robert, for me it is hard to know how an estimator behaves without a few simulations and comparisons. Your approach seems related to using explanatory power – see this article by Wilcox & Tian:
Measuring effect size: a robust heteroscedastic approach for two or more group

Click to access ES.pdf

Like

10. C

Hi! I was wondering whether the shift-function can be applied to count data. In principle, it seems as it should be possible but the procedure would probably require some kind of adjustment to ensure that the calculated quantiles are integers. Do you have any experience/ suggestion on how this could be done?

Like

1. garstats Post author

Hi, the shift-function is non-parametric so it doesn’t make assumptions about the data generation process. The main problem with count data is that you are likely to get tied values, but they can be handled by the Harrell-Davis quantile estimator, which is used by default in the shift function. Also, the percentile bootstrap used to derive confidence intervals will only suggest values within the bounds. In sum: why not?

Like

11. C

Thanks a lot for your fast answer 🙂 just one more detail: Your answer does not mention any need for rounding the values of the obtained estimates/ CI boundaries or differences. Does this mean that such rounding should not be applied? Again, thanks a lot for sharing your expertise in this really helpful blog!

Like

1. garstats Post author

The raw data should never be rounded, no matter the test applied. The inferential statistics, including the CI boundaries, can be rounded to the precision needed, as long as readers have access to the data and code to check the unrounded results.

Like

1. C