Monthly Archives: May 2018

Test-retest reliability assessment using graphical methods

UPDATE (2018-05-17): as explained in the now updated previous post, the shift function for pairwise differences, originally described as a great tool to assess test-retest reliability, is completely flawed. The approach using scatterplots remains valid. If you know of other graphical methods, please leave a comment.

Test-retest reliability is often summarised using a correlation coefficient, often without illustrating the raw data. This is a very bad idea given that the same correlation coefficient can result from many different configurations of observations. Graphical representations are thus essential to assess test-retest reliability, as demonstrated for instance in the work of Bland & Altman.

The R code for this post is on github.

Example 1: made up data

Let’s look at a first example using made up data. Imagine that reaction times were measured from 100 participants in two sessions. The medians of the two distributions do not differ much, but the shapes do differ a lot, similarly to the example covered in the previous post.


The kernel density estimates above do not reveal the pairwise associations between observations. This is better done using a scatterplot. In this plot, strong test-retest reliability would show up as a tight cloud of points along the unity line (the black diagonal line).


Here the observations do not fall on the unity line: instead the relationship leads to a much shallower slope than expected if the test-retest reliability was high. For fast responses in session 1, responses tended to be slower in session 2. Conversely, for slow responses in condition 1, responses tended to be faster in condition 2. This pattern would be expected if there was regression to the mean [wikipedia][ Barnett et al. 2005], that is, particularly fast or particularly slow responses in session 1 were due in part to chance, such that responses from the same individuals in session 2 were closer to the group mean. Here we know this is the case because the data are made up to have that pattern.

We can also use a shift function for dependent group to investigate the relationship between sessions, as we did in the previous post.


The shift function reveals a characteristic  difference in spread between the two distributions, a pattern that is also expected if there is regression to the mean. Essentially, the shift function shows how  the distribution in session 2 needs to be modified to match the distribution in session 1: the lowest deciles need to be decreased and the highest deciles need to be increased, and these changes should be stronger as we move towards the tails of the distribution. For this example, these changes would be similar to an anti-clockwise rotation of the regression slope in the next figure, to align the cloud of observations with the black diagonal line.  


To confirm these observations, we also perform a shift function for pairwise differences. 


This second type of shift function reveals a pattern very similar to the previous one. In the [previous post], I wrote that this “is re-assuring. But there might be situations where the two versions differ.” Well, here are two such situations…

Example 2: ERP onsets

Here we look at ERP onsets from an object detection task (Bieniek et al. 2016). In that study, 74 of our 120 participants were tested twice, to assess the test-retest reliability of different measurements, including onsets. The distributions of onsets across participants is positively skewed, with a few participants with particularly early or late onsets. The distributions for the two sessions appear quite similar.   


With these data, we were particularly interested in the reliability of the left and right tails: if early onsets in session 1 were due to chance, we would expect session 2 estimates to be overall larger (shifted to the right); similarly, if late onsets in session 1 were due to chance, we would expect session 2 estimates to be overall smaller (shifted to the left). Plotting session 2 onsets as a function of session 1 onsets does not reveal a strong pattern of regression to the mean as we observed in example 1. 


Adding a loess regression line suggests there might actually be an overall clockwise rotation of the cloud of points relative to the black diagonal.


The pattern is even more apparent if we plot the difference between sessions on the y axis. This is sometimes called a Bland & Altman plot (but here without the SD lines).


However, a shift function on the marginals is relatively flat.


Although there seems to be a linear trend, the uncertainty around the differences between deciles is large. In the original paper, we wrote this conclusion (sorry for the awful frequentist statement, I won’t do it again):

“across the 74 participants tested twice, no significant differences were found between any of the onset deciles (Fig. 6C). This last result is important because it demonstrates that test–retest reliability does not depend on onset times. One could have imagined for instance that the earliest onsets might have been obtained by chance, so that a second test would be systematically biased towards longer onsets: our analysis suggests that this was not the case.”

That conclusion was probably wrong, because the shift function for dependent marginals is inappropriate to look at test-retest reliability. Inferences should be made on pairwise differences instead. So, if we use the shift function for pairwise differences, the results are very different! A much better diagnostic tool is to plot difference results as a function of session 1 results. This approach suggests, in our relatively small sample size:


  • the earlier the onsets in session 1, the more they increased in session 2, such that the difference between sessions became more negative;
  • the later the onsets in session 1, the more they decreased in session 2, such that the difference between sessions became more positive. 

This result and the discrepancy between the two types of shift functions is very interesting and can be explained by a simple principle: for dependent variables, the difference between 2 means is equal to the mean of the individual pairwise differences; however, this does not have to be the case for other estimators, such as quantiles (Wilcox & Rousselet, 2018).

Also, tThe discrepancy shows that I reached the wrong conclusion in a previous study because I used the wrong analysis. Of course, there is always the possibility that I’ve made a terrible coding mistake somewhere (that won’t be the first time – please let me know if you spot a fatal mistake). So l Let’s look at another example using published clinical data in which regression to the mean was suspected.

Example 3: Nambour skin cancer prevention trial

The data are from a cancer clinical trial described by Barnett et al. (2005). Here is Figure 3 from that paper:


“Scatter-plot of n = 96 paired and log-transformed betacarotene measurements showing change (log(follow-up) minus log(baseline)) against log(baseline) from the Nambour Skin Cancer Prevention Trial. The solid line represents perfect agreement (no change) and the dotted lines are fitted regression lines for the treatment and placebo groups”

Let’s try to make a similarly looking figure.


Unfortunately, the original figure cannot be reproduced because the group membership has been mixed up in the shared dataset… So let’s merge the two groups and plot the data following our shift function convention, in which the difference is session 1 – session 2.


Regression to the mean is suggested by the large number of negative differences and the negative slope of the loess regression: participants with low results in session 1 tended to have higher results in session 2. This pattern can also be revealed by plotting session 2 as a function of session 1.


The shift function for marginals suggests increasing differences between session quantiles for increasing quantiles in session 1.


This result seems at odd with the previous plot, but it is easier to understand if we look at the kernel density estimates of the marginals. Thus, plotting difference scores as a function of session 1 scores probably remains the best strategy to have a fine-grained look at test-retest results.


A shift function for pairwise differences shows a very different pattern, consistent with the regression to the mean suggested by Barnett et al. (2005).



To assess test-retest reliability, it is very informative to use graphical representations, which can reveal interesting patterns that would be hidden in a correlation coefficient. Unfortunately, there doesn’t seem to be a magic tool to simultaneously illustrate and make inferences about test-retest reliability.

It seems that the shift function for pairwise differences is an excellent tool to look at test-retest reliability, and to spot patterns of regression to the mean. The next steps for the shift function for pairwise differences will be to perform some statistical validations for the frequentist version, and develop a Bayesian version.

That’s it for this post. If you use the shift function for pairwise differences to look at test-retest reliability, let me know and I’ll add a link here.


Barnett, A.G., van der Pols, J.C. & Dobson, A.J. (2005) Regression to the mean: what it is and how to deal with it. Int J Epidemiol, 34, 215-220.

Bland JM, Altman DG. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, i, 307-310.

Bieniek, M.M., Bennett, P.J., Sekuler, A.B. & Rousselet, G.A. (2016) A robust and representative lower bound on object processing speed in humans. The European journal of neuroscience, 44, 1804-1814.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8 42 41-48 42 30.