Tukey mean-difference plot

Distributions can differ in many ways, not just in central tendency (Rousselet, Pernet & Wilcox, 2017). While obvious, this statement is at odds with common statistical practices that focus exclusively on mean comparisons. Fortunately, there are many methods to make distributional inferences, for instance by applying generalized linear models to reaction time data (Lindeløv, 2019). Another informative approach is to make inferences about multiple quantiles, essentially an extension of q-q plots (quantile-quantile plots; Wilk & Gnanadesikan, 1968). Quantile inference methods appear to have been reinvented multiple times (Rousselet, 2018). In that earlier post, I illustrated four very similar types of plots: quantile plots, vincentile plots, delta plots and shift functions. As far as I know, the earliest method is the shift function, introduced by Doksum (1974; 1976). Recently, while reading about Bland-Altman plots (Altman & Bland, 1983), I came across a fifth type of quantile plot, the Tukey mean-difference plot (Cleveland, 1993).

Side story: it is unclear who the Tukey in question is, as Cleveland (1993) offers no reference for the plot. A very similar plot appears in an earlier book by Chambers, Cleveland, Kleiner & Tukey (1983). The Tukey in the 1983 book is Paul A. Tukey, not the famous John W. Tukey. Others have been down the same historical rabbit hole.

Surprisingly, several sources describe the Bland-Altman plot and the Tukey mean-difference plot as equivalent (Wikipedia; Wolfram). Even though the two methods lead to graphs that can look similar, they ask very different questions about the data. The Bland-Altman plot is a kind of scatterplot used to assess the agreement between two methods (and more generally two paired measurements): the differences between paired observations are plotted as a function of their averages. In contrast, the Tukey mean-difference plot is a type of q-q plot used to assess distributional shape differences: the quantile differences are plotted as a function of the quantile averages. The general difference between a scatterplot and a q-q plot is vividly described in this quote from Chambers, Cleveland, Kleiner & Tukey (1983):

“It is essential to understand the difference between a quantile-quantile plot and a scatter plot. Basically, a scatter plot is useful for studying such questions as “Is the monthly average temperature in Lincoln systematically related to the temperature in Newark?” or “If Newark has a hot month, is Lincoln likely to have hot weather in the same month?” On the other hand, the quantile-quantile plot is aimed at such questions as “Over a period of time, do the residents of Lincoln experience the same mixture of hot, mild, and cold weather as people living in Newark?” This question would be meaningful even if the two data sets spanned different years, or if we were comparing autumn temperatures in Newark with spring temperatures in Lincoln, or if Newark were in another galaxy. It is the kind of question that a home owner in Lincoln and one in Newark might be interested in if they were concerned about the cost of heating in the winter and air conditioning in the summer at the two places but had no interest in whether they experienced hot and cold spells at exactly the same time.”

In short, a scatterplot is about the relation between paired observations, whereas a q-q plot is about relative shape differences between two distributions of observations. A scatterplot can only be used with paired observations, whereas pairing is irrelevant for a q-q plot.

To better understand the two types of graphs, let’s consider a series of illustrations. The next four series of plots reflect situations where the Bland-Altman plot is a useful tool: we try to estimate some ground truth using two types of instruments, and then consider how well the two sets of measurements agree.

Dependent observations

Independent measurement errors, no bias

In this first example, we get two sets of n=200 measurements, each equal to a ground truth plus random errors that are independent between the sets. There is no bias, so the two measurement methods agree with each other and any pattern is due to sampling variability. Increase the sample size in the code to see what happens.

For the Bland-Altman plot, a superimposed smoother is useful to reveal trends in the scatterplot.
In this example and the following one, the two plots reveal similar trends in differences. We can also see a difference in scale: the Bland-Altman plot shows the raw differences between pairs of observations, whereas the Tukey mean-difference plot shows differences between matching quantiles, or order statistics. In the case of equal sample sizes, these quantiles are simply the sorted observations, such that the smallest observation in one group is compared to the smallest observation in the other group, and so on in rank order.
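
To make the contrast concrete, here is a minimal sketch of the coordinates behind each plot when the two groups have the same size. It is written in Python with the standard library only, so it is not the R code that accompanies this post. The function names and the direction of the difference (y minus x) are assumptions made for illustration.

```python
def bland_altman(x, y):
    """Pairwise means and differences: the pairing of observations is kept."""
    if len(x) != len(y):
        raise ValueError("paired data require equal sample sizes")
    means = [(a + b) / 2 for a, b in zip(x, y)]
    diffs = [b - a for a, b in zip(x, y)]  # assumed direction: y minus x
    return means, diffs


def tukey_mean_difference(x, y):
    """Same arithmetic on the sorted values: the pairing is discarded.

    Only valid as written for equal sample sizes, where the matching
    quantiles are simply the order statistics.
    """
    return bland_altman(sorted(x), sorted(y))


# Toy data: y is x shifted by 0.5, but the pairs are scrambled.
x = [3.0, 1.0, 2.0]
y = [1.5, 3.5, 2.5]

print(bland_altman(x, y))           # ([2.25, 2.25, 2.25], [-1.5, 2.5, 0.5])
print(tukey_mean_difference(x, y))  # ([1.25, 2.25, 3.25], [0.5, 0.5, 0.5])
```

In this toy case the Bland-Altman differences are scattered because the pairs disagree, whereas the quantile differences are constant because the two distributions differ only by a shift.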

Correlated measurement errors, no bias

In this situation, the error terms have a 0.5 correlation.

Additive bias

Now in addition to the correlated noise, there is a 0.1 additive bias applied to Y.

Additive and multiplicative bias

Same as above, now with the addition of multiplicative bias.
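
The four measurement scenarios above could be simulated along the following lines. This is a sketch in Python, not the post's R code. The standard normal ground truth, the error standard deviation of 0.2, the multiplicative bias of 1.2 and the way the biases enter the model are all hypothetical choices made for illustration. Only n=200, the 0.5 error correlation and the 0.1 additive bias come from the text.

```python
import math
import random


def simulate_measurements(n=200, rho=0.0, add_bias=0.0, mult_bias=1.0,
                          error_sd=0.2, seed=1):
    """Two sets of measurements of the same (simulated) ground truth.

    rho is the correlation between the two error terms. The biases are
    applied to Y only, as y = mult_bias * truth + add_bias + error
    (an assumed model, not taken from the post).
    """
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        truth = rng.gauss(0.0, 1.0)   # assumed ground truth distribution
        e1 = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)
        # e2 has unit variance and correlation rho with e1
        e2 = rho * e1 + math.sqrt(1.0 - rho ** 2) * z
        x.append(truth + error_sd * e1)
        y.append(mult_bias * truth + add_bias + error_sd * e2)
    return x, y


scenarios = {
    "independent errors, no bias": simulate_measurements(),
    "correlated errors, no bias": simulate_measurements(rho=0.5),
    "additive bias": simulate_measurements(rho=0.5, add_bias=0.1),
    "additive and multiplicative bias": simulate_measurements(
        rho=0.5, add_bias=0.1, mult_bias=1.2),
}
```

Each entry holds an (x, y) pair of lists that can be fed to any routine computing pairwise or quantile differences.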

Independent observations

Now we consider completely independent observations to illustrate that the Tukey mean-difference plot captures shape differences, like other related quantile methods. In this scenario, the Bland-Altman plot makes no sense.

Illustrate populations

Consider pairs of marginal Beta distributions. In each panel, the reference condition is X = Beta(10,10), which is compared in turn to Y = Beta(10,10), Y = Beta(3,3), Y = Beta(8,6) and Y = Beta(4,2). Why Beta distributions? Because that is how percent correct data are distributed across participants, for instance (Rousselet, 2025).

Simulate data: vary shape

Now imagine we take independent samples from these populations. In that situation the Bland-Altman plot is inappropriate, because any pairing of observations would be completely arbitrary. To make it more interesting, we can also use different sample sizes: n1=100 and n2=200.
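
With unequal sample sizes the sorted observations no longer line up one to one, so matching quantiles have to be estimated. One possible approach, sketched below in Python (not the post's R code), keeps the order statistics of the smaller sample and linearly interpolates the larger sample down to the same number of points. This is similar in spirit to what R's qqplot does, but whether the post's own code works this way is an assumption. The function names are hypothetical.

```python
import random


def interpolate_sorted(sorted_data, m):
    """Linearly interpolate sorted data to m evenly spaced order statistics (m >= 2)."""
    n = len(sorted_data)
    out = []
    for i in range(m):
        h = i * (n - 1) / (m - 1)   # fractional index into sorted_data
        lo = int(h)
        hi = min(lo + 1, n - 1)
        out.append(sorted_data[lo] + (h - lo) * (sorted_data[hi] - sorted_data[lo]))
    return out


def tukey_md_unequal(x, y):
    """Quantile means and differences (y minus x) for independent samples."""
    m = min(len(x), len(y))
    qx = interpolate_sorted(sorted(x), m)
    qy = interpolate_sorted(sorted(y), m)
    means = [(a + b) / 2 for a, b in zip(qx, qy)]
    diffs = [b - a for a, b in zip(qx, qy)]
    return means, diffs


# Independent samples of different sizes from two of the Beta populations
rng = random.Random(42)
x = [rng.betavariate(10, 10) for _ in range(100)]
y = [rng.betavariate(4, 2) for _ in range(200)]
means, diffs = tukey_md_unequal(x, y)
```

When the two samples have the same size, the interpolation returns the sorted observations unchanged, so the function reduces to the equal-size case.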

Downsample to the deciles

Since we’re plotting quantiles, we don’t need to plot so many. We can make the same graphs using the deciles only.

Downsample to the quartiles

Or even just the quartiles!
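
Both downsampled versions can be sketched with one function that takes the number of groups as a parameter. The Python example below (not the post's R code) relies on statistics.quantiles from the standard library with its "inclusive" interpolation method. The quantile estimator used for the post's figures is not stated here, so this choice and the function name are assumptions.

```python
from statistics import quantiles


def quantile_mean_difference(x, y, n_groups=10):
    """Means and differences (y minus x) of matching quantiles.

    n_groups=10 returns the 9 deciles, n_groups=4 the 3 quartiles.
    The two samples may have different sizes.
    """
    qx = quantiles(x, n=n_groups, method="inclusive")
    qy = quantiles(y, n=n_groups, method="inclusive")
    means = [(a + b) / 2 for a, b in zip(qx, qy)]
    diffs = [b - a for a, b in zip(qx, qy)]
    return means, diffs


# Toy data: y is x shifted by 5, so every quantile difference is 5.
x = [float(v) for v in range(101)]
y = [v + 5.0 for v in x]

decile_means, decile_diffs = quantile_mean_difference(x, y, n_groups=10)
quartile_means, quartile_diffs = quantile_mean_difference(x, y, n_groups=4)
```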

And to get confidence intervals in these plots, the percentile bootstrap would work very well, especially when combined with the Harrell-Davis quantile estimator (Rousselet, Pernet & Wilcox, 2017; Wilcox & Rousselet, 2024).
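
As a rough illustration of the percentile bootstrap for independent groups, the Python sketch below resamples each group with replacement, recomputes the decile differences, and takes the middle 95% of each bootstrap distribution. To stay within the standard library it uses statistics.quantiles as a stand-in: it does not implement the Harrell-Davis estimator recommended above. The function name, the number of bootstrap samples and the index convention for the interval bounds are assumptions.

```python
import random
from statistics import quantiles


def bootstrap_decile_diff_ci(x, y, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence intervals for decile differences (y minus x)."""
    rng = random.Random(seed)
    boot = []
    for _ in range(n_boot):
        # Independent groups: resample each group separately
        bx = rng.choices(x, k=len(x))
        by = rng.choices(y, k=len(y))
        qx = quantiles(bx, n=10, method="inclusive")
        qy = quantiles(by, n=10, method="inclusive")
        boot.append([b - a for a, b in zip(qx, qy)])
    lo_idx = round(n_boot * alpha / 2)
    hi_idx = n_boot - lo_idx - 1
    intervals = []
    for d in range(9):
        vals = sorted(row[d] for row in boot)
        intervals.append((vals[lo_idx], vals[hi_idx]))
    return intervals


# Hypothetical example: same Beta(10,10) shape, with y shifted by 0.5
rng = random.Random(7)
x = [rng.betavariate(10, 10) for _ in range(100)]
y = [rng.betavariate(10, 10) + 0.5 for _ in range(200)]
intervals = bootstrap_decile_diff_ci(x, y, n_boot=500)
```

Each element of the returned list is a (lower, upper) pair for one decile, which could be drawn as a vertical bar around the corresponding point in the plot.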

In conclusion, the Tukey mean-difference plot is another great example of a quantile graphical method. Although in some situations it can reveal trends similar to those in the Bland-Altman plot, the two approaches are certainly not the same.

The R code is on GitHub.
