This post looks at the coverage of confidence intervals for the difference between two independent correlation coefficients. Previously, we saw how the standard Fisher’s r-to-z transform can lead to inflated false positive rates when sampling from non-normal bivariate distributions and the population correlation differs from zero. In this post, we look at a complementary perspective: using simulations, we’re going to determine how often confidence intervals include the population difference. As we saw in our previous post, because we compute say 95% confidence intervals does not mean that over the long run, 95% of such confidence intervals will include the population we’re trying to estimate. In some situations, the coverage is much lower than expected, which means we might fool ourselves more often that we thought (although in practice in most discussions I’ve ever read, authors behave as if their 95% confidence intervals were very narrow 100% confidence intervals — but that’s another story).
We look at confidence interval coverage for the difference between Pearsons’ correlations using Zou’s method (2007) and a modified percentile bootstrap method (Wilcox, 2009). We do the same for the comparison of Spearmans’ correlations using the standard percentile bootstrap. We used simulations with 4,000 iterations. Sampling is from bivariate g & h distributions (see illustrations here).
We consider 4 cases:
g = h = 0, difference = 0.1, vary rho
g = 1, h = 0, difference = 0.1, vary rho
rho = 0.3, difference = 0.2, vary g, h = 0
rho = 0.3, difference = 0.2, vary g, h = 0.2
g = h = 0, difference = 0.1, vary rho
That’s the standard normal bivariate distribution. Group 1 has values of rho1 = 0 to 0.8, in steps of 0.1. Group 2 has values of rho2 = rho1 + 0.1.
For normal bivariate distributions, coverage is at the nominal level for all methods, sample sizes and population correlations. (Here I only considered sample sizes of 50+ because otherwise power is far too low, so there is no point.)
When CIs do not include the population value, are they located to the left or the right of the population? In the figure below, negative values indicate a preponderance of left shifts, positive values a preponderance of right shifts. A value of 1 = 100% right shifts, -1 = 100% left shifts. For Pearson, CIs not including the population value tend to be located evenly to the left and right of the population. For Spearman, there is a preponderance of left shifted CIs for rho1 = 0.8. This left shift implies a tendency to over-estimate the difference (the difference group 1 minus group 2 is negative).
g = 1, h = 0, vary rho
What happens when we sample from a skewed distribution?
The coverage is lower than the expected 95% for Zou’s method and the discrepancy worsens with increasing rho1 and with increasing sample size. The percentile bootstrap does a much better job. Spearman’s combined with the percentile bootstrap is spot on.
For CIs that did not include the population value, the pattern of shifts varies as a function of rho. For Pearson, CIs are more likely to be located to the right of the population (under-estimation of the population value or wrong sign) for rho = 0, whereas for rho = 0.8, CIs are more likely to be located to the left. Spearman + bootstrap produces much more balanced results.
To investigate the asymmetry, we look at CIs for g=1, a sample size of n = 200 and the extremes of the distributions, rho1 = 0 and rho2 = 0.8. The figure below shows the preponderance of right shifted CIs for the two Pearson methods. The vertical line marks the population difference of -0.1.
For rho1 = 0.8, the pattern changes to a preponderance of left shifts for all methods, which means that the CIs tended to over-estimate the population difference. CIs for differences between Spearman’s correlations were quite smaller than Pearson’s ones though, thus showing less bias and less uncertainty.
rho=0.3, diff=0.2, vary g, h = 0
For another perspective on the three methods, we now look at a case with:
group 1: rho1 = 0.3
group 2: rho2 = 0.5
we vary g from 0 to 1.
For Pearson + Zou, coverage progressively decreases with increasing g, and to a much more limited extent with increasing sample size. Pearson + bootstrap is much more resilient to changes in g. And Spearman + bootstrap just doesn’t care about asymmetry!
The better coverage of Pearson + bootstrap seems to be achieved by producing wider CIs.
Matters only get’s worse for Pearson + Zou when outliers are likely (see notebook on GitHub).
Based on this new comparison of the 3 methods, I’d argue again that Spearman + bootstrap should be preferred over the two Pearson methods. But if the goal is to assess linear relationships, then Pearson + bootstrap is preferable to Zou’s method. I’ll report on other methods in another post.
Wilcox, Rand R. Comparing Pearson Correlations: Dealing with Heteroscedasticity and Nonnormality. Communications in Statistics – Simulation and Computation 38, no. 10 (1 November 2009): 2220–34. https://doi.org/10.1080/03610910903289151.
In previous posts, we saw how skewness and outliers can affect false positives (type I errors) and true positives (power) in one-sample tests. In particular, when making inferences about the population mean, skewness tends to inflate false positives, and skewness and outliers can destroy power. Here we investigate a complementary perspective, looking at how confidence intervals are affected by skewness and outliers.
Spoiler alert: 95% confidence intervals most likely do not have a coverage of 95%. In fact, I’ll show you an example in which a 95% CI for the mean has an 80% coverage…
Back to the title of the post. Seems like a weird question? Not if we consider the definition of a confidence interval (CI). Let say we conduct an experiment to estimate quantity x from a sample, where x could be the median or the mean for instance. Then a 95% CI for the population value of x refers to a procedure whose behaviour is defined in the long-run: CIs computed in the same way should contain the population value in 95% of exact replications of the experiment. For a single experiment, the particular CI does or does not contain the population value, there is no probability associated with it. A CI can also be described as the interval compatible with the data given our model — see definitions and common misinterpretations in Greenland et al. (2016).
So 95% refers to the (long-term) coverage of the CI; the exact values of the CI bounds vary across experiments. The CI procedure is associated with a certain coverage probability, in the long-run, given the model. Here the model refers to how we collected data, data cleaning procedures (e.g. outlier removal), assumptions about data distribution, and the methods used to compute the CI. Coverage can differ from the expected one if model assumptions are violated or the model is just plain wrong.
Wrong models are extremely common, for instance when applying a standard t-test CI to percent correct data (Kruschke, 2014; Jaeger, 2008) or Likert scale data (Bürkner & Vuorre, 2019; Liddell & Kruschke, 2019).
For continuous data, CI coverage is not at the expected, nominal level, for instance when the model expects symmetric distributions and we’re actually sampling from skewed populations (which is the norm, not the exception, when we measure sizes, durations, latencies etc.). Here we explore this issue using g & h distributions that let us manipulate asymmetry.
Illustrate g & h distributions
All g & h distributions have a median of zero. The parameter g controls the asymmetry of the distribution, while the parameter h controls the thickness of the tails (Hoaglin, 1985; Yan & Genton, 2019). Let’s look at some illustrations to make things clear.
Examples in which we vary g from 0 to 1.
As g increases, the asymmetry of the distributions increases. Using negative g values would produce distributions with negative skewness.
Examples in which we vary h from 0 to 0.2.
As h increases, the tails are getting thicker, which means that outliers are more likely.
Test with normal (g=h=0) distribution
Let’s run simulations to look at coverage probability in different situations and for different estimators. First, we sample with replacement from a normal population (g=h=0) 20,000 times (that’s 20,000 simulated experiments). Each sample has size n=30. Confidence intervals are computed for the mean, the 10% trimmed mean ™, the 20% trimmed mean and the median using standard parametric methods (see details in the code on GitHub, and references for equations in Wilcox & Rousselet, 2018). The trimmed mean and the median are robust measures of central tendency. To compute a 10% trimmed mean, observations are sorted, the 10% lowest and 10% largest values are discarded (20% in total), and the remaining values are averaged. In this context, the mean is a 0% trimmed mean and the median is a 50% trimmed mean. Trimming the data attenuates the influence of the tails of the distributions and thus the effects of asymmetry and outliers on confidence intervals.
First we look at coverage for the 4 estimators: we look at the proportion of simulated experiments in which the CIs included the population value for each estimator. As expected for the special case of a normal distribution, the coverage is close to nominal (95%) for every method:
In addition to coverage, we also look at the width of the CIs (upper bound minus lower bound). Across simulations, we summarise the results using the median width. CIs tends to be larger for trimmed means and median relative to the mean, which implies lower power under normality for these methods (Wilcox & Rousselet, 2018).
For CIs that did not include the population, the distribution is fairly balanced between the left and the right of the population. To see this, I computed a shift index: if the CI was located to the left of the population value, it receives a score of -1, when it was located to the right, it receives a score of 1. The shift index was then computed by averaging the scores only for those CI excluding the population.
Illustrate CIs that did not include the population
Out of 20,000 simulated experiments, about 1,000 CI (roughly 5%) did not include the population value for each estimator. About the same number of CIs were shifted to the left and to the right of the population value, which is illustrated in the next figure. In each panel, the vertical line marks the population value (here it’s zero in all conditions because the population is symmetric). The CIs are plotted in the order of occurrence in the simulation. So the figure shows that if we miss the population value, we’re as likely to overshoot than undershoot our estimation.
Across panels, the figure also shows that the more we trim (10%, 20%, median) the larger the CIs get. So for a strictly normal population, we more precisely estimate the mean than trimmed means and the median.
Test with g=1 & h=0 distribution
What happens for a skewed population? Three things happen for the mean:
coverage goes down
CIs not including the population value tend to be shifted to the left (negative average shift values)
The same effects are observed for the trimmed means, but less so the more we trim, because trimming alleviates the effects of the tails.
Illustrate CIs that did not include the population
The figure illustrates the strong imbalance between left and right CI shifts. If we try to estimate the mean of a skewed population, our CIs are likely to miss it more than 5% of the time, and when that happens, the CIs are most likely to be shifted towards the bulky part of the distribution (here the left for a right skewed distribution). Also, the right shifted CIs vary a lot in width and can be very large.
As we trim, the imbalance is progressively resolved. With 20% trimming, when CIs do not contain the population value, the distribution of left and right shifts is more balanced, although with still far more left shifts. With the median we have roughly 50% left / 50% right shifts and CIs are narrower than for the mean.
Test with g=1 & h=0.2 distribution
What happens if we sample from a skewed distribution (g=1) in which outliers are likely (h=0.2)?
The results are similar to those observed for h=0, only exacerbated. Coverage for the mean is even lower, CIs are larger, and the shift imbalance even more severe. I have no idea how often such a situation occur, but I suspect if you study clinical populations that might be rather common. Anyway, the point is that it is a very bad idea to assume the distributions we study are normal, apply standard tools, and hope for the best. Reporting CIs as 95% or some other value, without checking, can be very misleading.
Simulations in which we vary g
We now explore CI properties as a function of g, which we vary from 0 to 1, in steps of 0.1. The parameter h is set to 0 (left column of next figure) or 0.2 (right column). Let’s look at column A first (h=0). For the median, coverage is unaffected by g. For the other estimators, there is a monotonic decrease in coverage with increasing g. The effect is much stronger for the mean than the trimmed means.
For all estimators, increasing g leads to monotonic increases in CI width. The effect is very subtle for the median and more pronounced the less we trim. Under normality, g=0, CIs are the shortest for the mean, explaining the larger power of mean based methods relative to trimmed means in this unusual situation.
In the third panel, the zero line represents an equal proportion of left and right shifts, relative to the population, for CIs that did not include the population value. The values are consistently above zero for the median, with a few more right shifts than left shifts for all values of g. For the other estimators, the preponderance of left shifts increases markedly with g.
Now we look at results in panel B (h=0.2). When outliers are likely, coverage drops faster with g for the mean. Other estimators are resistant to outliers.
When outliers are common, CIs for the population mean are larger than for all other estimators, irrespective of g.
Again, there is a constant over-representation of right shifted CIS for the median. For the other estimators, the left shifted CIs dominate more and more with increasing g. The trend is more pronounced for the mean relative to the h=0 situation, with a sharper monotonic downward trajectory.
The answer to the question in the title is: most of the time! Simply because our models are wrong most of the time. So I would take all published confidence intervals with a pinch of salt. [Some would actually go further and say that if the sampling and analysis plans for an experiment were not clearly stipulated before running the experiment, then confidence interval, like P values, are not even defined (Wagenmakers, 2007). That is, we can compute a CI, but the coverage is meaningless, because exact repeated sampling might be impossible or contingent on external factors that would need to be simulated.] The best way forward is probably not to advocate for the use of trimmed means or the median over the mean in all cases, because different estimators address different questions about the data. And there are more estimators of central tendency than means, trimmed means and medians. There are also more interesting questions to ask about the data than their central tendencies (Rousselet, Pernet & Wilcox, 2017). For these reasons, we need data sharing to be the default, so that other users can ask different questions using different tools. The idea that the one approach used in a paper is the best to address the problem at hand is just silly.
Bürkner, Paul-Christian, and Matti Vuorre. ‘Ordinal Regression Models in Psychology: A Tutorial’. Advances in Methods and Practices in Psychological Science 2, no. 1 (1 March 2019): 77–101. https://doi.org/10.1177/2515245918823199.
Greenland, Sander, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. ‘Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations’. European Journal of Epidemiology 31, no. 4 (1 April 2016): 337–50. https://doi.org/10.1007/s10654-016-0149-3.
Jaeger, T. Florian. ‘Categorical Data Analysis: Away from ANOVAs (Transformation or Not) and towards Logit Mixed Models’. Journal of Memory and Language 59, no. 4 (November 2008): 434–46. https://doi.org/10.1016/j.jml.2007.11.007.
Kruschke, John K. Doing Bayesian Data Analysis. 2nd Edition. Academic Press, 2014.
Liddell, Torrin M., and John K. Kruschke. ‘Analyzing Ordinal Data with Metric Models: What Could Possibly Go Wrong?’ Journal of Experimental Social Psychology 79 (1 November 2018): 328–48. https://doi.org/10.1016/j.jesp.2018.08.009.
Rousselet, Guillaume A., Cyril R. Pernet, and Rand R. Wilcox. ‘Beyond Differences in Means: Robust Graphical Methods to Compare Two Groups in Neuroscience’. European Journal of Neuroscience 46, no. 2 (1 July 2017): 1738–48. https://doi.org/10.1111/ejn.13610.
Rousselet, Guillaume A., and Rand R. Wilcox. ‘Reaction Times and Other Skewed Distributions: Problems with the Mean and the Median’. Preprint. PsyArXiv, 17 January 2019. https://doi.org/10.31234/osf.io/3y54r.
Wagenmakers, Eric-Jan. ‘A Practical Solution to the Pervasive Problems of p Values’. Psychonomic Bulletin & Review 14, no. 5 (1 October 2007): 779–804. https://doi.org/10.3758/BF03194105.
Wilcox, Rand R., and Guillaume A. Rousselet. ‘A Guide to Robust Statistical Methods in Neuroscience’. Current Protocols in Neuroscience 82, no. 1 (2018): 8.42.1-8.42.30. https://doi.org/10.1002/cpns.41.
(1) the Fisher’s z method to compare two independent correlations can give very inaccurate results when sampling from distributions that are skewed (asymmetric) or heavy tailed (high probability of outliers) and the population correlation rho differs from zero;
(2) even when sampling from normal distributions, Fisher’s z as well as bootstrap methods require very large sample sizes to detect differences, much larger than commonly used in neuroscience & psychology.
We start with an example and then consider the results of a few simulations. The R code is available on GitHub. It is part of a larger project and I’ll report in other blog posts on the rest of the simulations I’m currently running.
Let’s look at an example of correlation analysis reported in Davis et al. (2008). This article had 898 citations on June 11th 2019 according to Google scholar. In their figure 3, the authors reported 4 correlations: 2 correlations in 2 independent groups of participants. Across panels A and B, the frontal activity variable appears twice, so for each group of participants (old and young), the correlations in A and B are dependent and overlapping. Code on how to handle this case was covered in a previous post. Here we’re going to focus on the comparison of two independent correlations, which is like considering A or B and its own.
There are several problems with the analysis from Davis et al.:
within-participant variability was removed by averaging over trials and other variables, so equal weight is wrongly attributed to every observation;
the analyses are split into different correlations instead of attempting mixed-effect modelling with the groups of participants considered together;
association is assessed using Pearson’s correlation, which is not robust and suffers from many issues (it is sensitive to outliers, the magnitude of the slope around which points are clustered, curvature, the magnitude of the residuals, restriction of range, and heteroscedasticity — see details in Wilcox 2017);
sample sizes are very small (far too small to say anything meaningful actually, as described in this previous post);
if a correction for multiple comparisons is applied, nothing passes the magic 0.05 threshold;
the authors commit a classic interaction fallacy by reporting one P<0.05 correlation, the other not, without directly comparing them (Gelman & Stern 2006; Nieuwenhuis et al. 2011).
There is nothing special about the correlation analysis in Davis et al. (2008): in neuroscience and psychology these problems are very common. In the rest of this post we’re going to tackle only two of them: how to compare 2 independent Pearson’s correlations, and what sample sizes are required to tease them apart in the long-run.
The most common approach to compare 2 independent correlations is to use the Fisher’s r-to-z approach. Here is a snippet of R code for Fisher’s z, given r1 and r2 the correlations in group 1 and group 2, and n1 and n2 the corresponding sample sizes:
z1 <- atanh(r1)
z2 <- atanh(r2)
zobs <- (z1-z2) / sqrt( 1 / (n1-3) + 1 / (n2-3) )
pval <- 2 * pnorm(-abs(zobs))
This approach is implemented in the cocor R package for instance (Diedenhofen & Musch, 2015). The problem with this approach is its normality assumption: it works well when sampling from a bivariate normal distribution or when the two distributions have a population correlation rho of zero, but it misbehaves when sampling from asymmetric distributions, or from distributions with heavy tails, when rho differs from zero (e.g. Duncan & Layard, 1973; Berry & Mielke, 2000). And all methods based on some variant of the z transform suffer the same issues.
Simulations can be used to understand how a statistical procedure works in the long-run, over many identical experiments. Let’s imagine that we samplefrom bivariate g & h distributions, in which parameter g controls the asymmetry of the distributions and h controls the thickness of the tails(Hoaglin, 1985). You can see univariate g & h distributions illustrated here. Below you can see bivariate samples with n = 5,000, for combinations of g and h parameters, for a population correlation rho of 0:
and for a population correlation rho of 0.5:
Using g=h=0 gives a normal bivariate distribution.
When h=0.2, the tails are thicker, which means that outliers are more likely.
There is a large parameter space to cover, so here we only look at a subset of results, using simulations with 4,000 iterations. We’ll see examples in which we vary rho, the population correlation, or we vary g for a given rho.
First, we consider false positives, before turning to true positives (power).
For false positives, 2 independent random samples are taken from the same bivariate population, so on average the two correlations should not differ. Here are the results for different rhos and sample sizes for g=h=0, using Fisher’s z test and an alpha of 0.05:
So all is fine when sampling from normal distributions: the type I error rate, or proportion of false positives, is at the nominal 0.05 level (marked by a horizontal black line). The values are not exactly 0.05 because of the limited number of simulations, but they’re close enough.
What happens if we sample from slightly asymmetric populations instead (g=0.2)? Relative to the g=0 condition, the number of false positives increases a bit over 0.05 on average, and more so with increasing rho.
Things get worse if we sample from symmetric distributions in which outliers are likely (g=0, h=0.2):
If rho = 0 then the proportion of false positives is still around 0.05, bit it increases rapidly with rho, an effect exacerbated by sample size (a pattern already reported in Duncan & Layard, 1973). That’s a very bad feature, because, as we will see below, very large sample sizes are required to detect differences between correlation coefficients. Unless of course our experimental samples are from normal distributions, but that’s unlikely and usually unknown, so why make such a gamble? (A cursory look at the literature suggests that skewed distributions and outliers are frequent in correlation analyses —Rousselet & Pernet, 2012).
If we sample from asymmetric distributions in which outliers are likely (g=h=0.2), it just gets slightly worse than in the previous example:
What’s the alternative? According to Wilcox (2009), a percentile bootstrap can be used to compare Pearson’s correlations, leading to satisfactory proportions of false positives. If instead of Fisher’s z we use a percentile bootstrap to compare Pearson’s correlations when g=h=0.2, we get the following results:
Still above the expected 0.05 but much better than Fisher’s z!
To compare the two correlation coefficients using the percentile bootstrap, we proceed like this:
sample participants with replacement, independently in each group;
compute the two correlation coefficients based on the bootstrap samples;
save the difference between correlations;
execute the previous steps many times;
use the distribution of bootstrap differences to derive a confidence interval.
Usually a 95% confidence interval is defined as the 2.5th and 97.5th quantiles of the bootstrap distribution, but for Pearson’s correlation different quantiles are used depending on the sample size. This adjusted procedure is implemented in the R function twopcor() from Rand Wilcox. To compare two robust correlation coefficients, the adjustment is not necessary, a procedure implemented in twocor(). If for instance we use Spearman’s correlation when g=h=0.2, we get these results:
The proportion of false positives is at the nominal level and doesn’t change with rho! I’ll report simulations using the percentage bend correlation and the winsorised correlation, two other robust estimators, in another post.
To assess true positives, we first consider differences of 0.1: that is, if rho1 is 0, rho2 is 0.1, and so on for different rhos. Here are the results for g=h=0 using Fisher’s z:
For a difference of 0.1, power is overall very low, and it depends strongly on rho. The dependence of power on rho follows logically from the reduced sampling variability with increasing rho, which is demonstrated in a previous post. That post also illustrates the asymmetry of the sampling distributions when rho is different from zero. More generally, bounded distributions tend to be asymmetric, which implies that normality assumptions are routinely violated in psychology.
Changing g or h to 0.2 lowers power. Here is what happens with h=0.2 for instance:
But it’s difficult to make comparisons across figures. I would need to make plots of power differences, which is getting tedious for a blog post. So instead we can directly investigate the effect of g and h for fixed rho values. Here we only consider g.
First, we consider the case where rho1 = rho2 = 0.3, we vary g and keep h=0. Again, asymmetry has a strong effect on false positives from Fisher’s z test.
Using the bootstrap gives better results, but still with inflated false positives.
Spearman + bootstrap is better behaved:
What about power? Let’s consider rho1 = 0.3 and rho2 = 0.5 for Fisher’s z. Increasing asymmetry reduces power.
The percentile bootstrap gives even worse results:
Spearman + bootstrap is not affected by asymmetry at all it seems:
But again, notice the large number of trials required to detect an effect. Here we have a population difference of 0.2, and even assuming normality, 80% power requires well over 250 observations! And that’s if you’re ok to be wrong 20% of the time when there is an effect — seems quite a gamble. For 90% power you will need over 350 observations! But again that’s assuming normality… I haven’t looked at the effect of h on power in this scenario yet.
Finally, let’s consider a simulation in which the rhos are set to the values reported in Figure 3A of Davis et al. (2008): rho1 = 0, rho2 = 0.6. That’s a massive and unlikely difference, but let’s look at how many observations we need even in this improbable case.
To achieve at least 80% power given an expected population difference of 0.6 and g=0, the minimum sample size is 36 observations. For 90% power, the minimum sample size is 47 observations.
Now with g=1, to achieve at least 80% power the minimum sample size is 61 observations; to achieve at least 90% power the minimum sample size is 91 observations.
Contrast these results with the results obtained using a combination of Spearman + bootstrap:
To achieve at least 80% power given an expected population difference of 0.6 and g=0, the minimum sample size is 43 observations. For 90% power, the minimum sample size is 56 observations.
Now with g=1, to achieve at least 80% power the minimum sample size is 42 observations; to achieve at least 90% power the minimum sample size is 55 observations.
So under normality, Pearson + Fisher is more powerful than Spearman + bootstrap, but power can drop a lot for asymmetric distributions. Spearman is very resilient to asymmetry.
Pearson + bootstrap doesn’t fare well:
A good reminder that the bootstrap is not a magic recipe that can handle all situations well (Rousselet, Pernet & Wilcox, 2019).
For the normal case (g=h=0), we look at the proportion of true positives (power) for the difference between Pearsons’ correlations using Fisher’s r-to-z transform. We vary systematically the sampling size n, rho1 and the difference between rho1 and rho2. The title of each facet indicates rho1. The difference between rho1 and rho2 is colour coded. For instance, for rho1=0, a 0.1 difference means that rho2=0.1. For rho1=0.8, only a difference of 0.1 is considered, meaning that rho2=0.9. So, unless we deal with large differences, we need very large numbers of trials to detect differences between independent correlation coefficients.
Given the many problems associated with Pearson’s correlation, it’s probably wise to choose some other, robust alternative (Pernet et al. 2012). Whatever methods you choose, it seems that very large sample sizes are required to detect effects, certainly much larger than I expected before I ran the simulations. I’ll report on the behaviour of other methods in another post. Meanwhile, consider published correlation analyses from small samples with a grain of salt!
Given the present results, I would recommend to use Spearman + bootstrap to compare correlation coefficients. You could also consider a skipped Spearman, which is robust to multivariate outliers, but with long computation times relative to the other techniques mentioned above (Pernet et al. 2012).
Berry, K.J. & Mielke, P.W. (2000) A Monte Carlo investigation of the Fisher Z transformation for normal and nonnormal distributions. Psychol Rep, 87, 1101-1114.
Davis, S.W., Dennis, N.A., Daselaar, S.M., Fleck, M.S. & Cabeza, R. (2008) Que PASA? The posterior-anterior shift in aging. Cereb Cortex, 18, 1201-1209.
Diedenhofen, B. & Musch, J. (2015) cocor: a comprehensive solution for the statistical comparison of correlations. PLoS One, 10, e0121945.
Duncan, G.T. & Layard, M.W.J. (1973) Monte-Carlo Study of Asymptotically Robust Tests for Correlation-Coefficients. Biometrika, 60, 551-558.
Gelman, A. & Stern, H. (2006) The difference between “significant” and “not significant” is not itself statistically significant. Am Stat, 60, 328-331.
Nieuwenhuis, S., Forstmann, B.U. & Wagenmakers, E.J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci, 14, 1105-1107.
Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.
Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Frontiers in human neuroscience, 6, 119.
Rousselet, G.A., Pernet, C.R., & Wilcox, R.R. (2019) A practical introduction to the bootstrap: a versatile method to make inferences by using data-driven simulations (Preprint). PsyArXiv. https://doi.org/10.31234/osf.io/h8ft7
Wilcox, R.R. (2009) Comparing Pearson Correlations: Dealing with Heteroscedasticity and Nonnormality. Commun Stat-Simul C, 38, 2220-2234.
Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8 42 41-48 42 30.
UPDATE (2018-05-17): as explained in the now updated previous post, the shift function for pairwise differences, originally described as a great tool to assess test-retest reliability, is completely flawed. The approach using scatterplots remains valid. If you know of other graphical methods, please leave a comment.
Let’s look at a first example using made up data. Imagine that reaction times were measured from 100 participants in two sessions. The medians of the two distributions do not differ much, but the shapes do differ a lot, similarly to the example covered in the previous post.
The kernel density estimates above do not reveal the pairwise associations between observations. This is better done using a scatterplot. In this plot, strong test-retest reliability would show up as a tight cloud of points along the unity line (the black diagonal line).
Here the observations do not fall on the unity line: instead the relationship leads to a much shallower slope than expected if the test-retest reliability was high. For fast responses in session 1, responses tended to be slower in session 2. Conversely, for slow responses in condition 1, responses tended to be faster in condition 2. This pattern would be expected if there was regression to the mean [wikipedia][ Barnett et al. 2005], that is, particularly fast or particularly slow responses in session 1 were due in part to chance, such that responses from the same individuals in session 2 were closer to the group mean. Here we know this is the case because the data are made up to have that pattern.
The shift function reveals a characteristicdifference in spread between the two distributions, a pattern that is also expected if there is regression to the mean. Essentially, the shift function shows howthe distribution in session 2 needs to be modified to match the distribution in session 1: the lowest deciles need to be decreased and the highest deciles need to be increased, and these changes should be stronger as we move towards the tails of the distribution. For this example, these changes would be similar to an anti-clockwise rotation of the regression slope in the next figure, to align the cloud of observations with the black diagonal line.
This second type of shift function reveals a pattern very similar to the previous one. In the [previous post], I wrote that this “is re-assuring. But there might be situations where the two versions differ.” Well, here are two such situations…
Example 2: ERP onsets
Here we look at ERP onsets from an object detection task (Bieniek et al. 2016). In that study, 74 of our 120 participants were tested twice, to assess the test-retest reliability of different measurements, including onsets. The distributions of onsets across participants is positively skewed, with a few participants with particularly early or late onsets. The distributions for the two sessions appear quite similar.
With these data, we were particularly interested in the reliability of the left and right tails: if early onsets in session 1 were due to chance, we would expect session 2 estimates to be overall larger (shifted to the right); similarly, if late onsets in session 1 were due to chance, we would expect session 2 estimates to be overall smaller (shifted to the left). Plotting session 2 onsets as a function of session 1 onsets does not reveal a strong pattern of regression to the mean as we observed in example 1.
Adding a loess regression line suggests there might actually be an overall clockwise rotation of the cloud of points relative to the black diagonal.
The pattern is even more apparent if we plot the difference between sessions on the y axis. This is sometimes called a Bland & Altman plot (but here without the SD lines).
However, a shift function on the marginals is relatively flat.
Although there seems to be a linear trend, the uncertainty around the differences between deciles is large. In the original paper, we wrote this conclusion (sorry for the awful frequentist statement, I won’t do it again):
“across the 74 participants tested twice, no significant differences were found between any of the onset deciles (Fig. 6C). This last result is important because it demonstrates that test–retest reliability does not depend on onset times. One could have imagined for instance that the earliest onsets might have been obtained by chance, so that a second test would be systematically biased towards longer onsets: our analysis suggests that this was not the case.”
That conclusion was probably wrong, because the shift function for dependent marginals is inappropriate to look at test-retest reliability. Inferences should be made on pairwise differences instead. So, if we use the shift function for pairwise differences, the results are very different! A much better diagnostic tool is to plot difference results as a function of session 1 results. This approach suggests, in our relatively small sample size:
the earlier the onsets in session 1, the more they increased in session 2, such that the difference between sessions became more negative;
the later the onsets in session 1, the more they decreased in session 2, such that the difference between sessions became more positive.
This result and the discrepancy between the two types of shift functions is very interesting and can be explained by a simple principle: for dependent variables, the difference between 2 means is equal to the mean of the individual pairwise differences; however, this does not have to be the case for other estimators, such as quantiles (Wilcox & Rousselet, 2018).
Also, tThe discrepancy shows that I reached the wrong conclusion in a previous study because I used the wrong analysis. Of course, there is always the possibility that I’ve made a terrible coding mistake somewhere (that won’t be the first time – please let me know if you spot a fatal mistake). So l Let’s look at another example using published clinical data in which regression to the mean was suspected.
“Scatter-plot of n = 96 paired and log-transformed betacarotene measurements showing change (log(follow-up) minus log(baseline)) against log(baseline) from the Nambour Skin Cancer Prevention Trial. The solid line represents perfect agreement (no change) and the dotted lines are fitted regression lines for the treatment and placebo groups”
Let’s try to make a similarly looking figure.
Unfortunately, the original figure cannot be reproduced because the group membership has been mixed up in the shared dataset… So let’s merge the two groups and plot the data following our shift function convention, in which the difference is session 1 – session 2.
Regression to the mean is suggested by the large number of negative differences and the negative slope of the loess regression: participants with low results in session 1 tended to have higher results in session 2. This pattern can also be revealed by plotting session 2 as a function of session 1.
The shift function for marginals suggests increasing differences between session quantiles for increasing quantiles in session 1.
This result seems at odd with the previous plot, but it is easier to understand if we look at the kernel density estimates of the marginals. Thus, plotting difference scores as a function of session 1 scores probably remains the best strategy to have a fine-grained look at test-retest results.
A shift function for pairwise differences shows a very different pattern, consistent with the regression to the mean suggested by Barnett et al. (2005).
To assess test-retest reliability, it is very informative to use graphical representations, which can reveal interesting patterns that would be hidden in a correlation coefficient. Unfortunately, there doesn’t seem to be a magic tool to simultaneously illustrate and make inferences about test-retest reliability.
It seems that the shift function for pairwise differences is an excellent tool to look at test-retest reliability, and to spot patterns of regression to the mean. The next steps for the shift function for pairwise differences will be to perform some statistical validations for the frequentist version, and develop a Bayesian version.
That’s it for this post. If you use the shift function for pairwise differences to look at test-retest reliability, let me know and I’ll add a link here.
Barnett, A.G., van der Pols, J.C. & Dobson, A.J. (2005) Regression to the mean: what it is and how to deal with it. Int J Epidemiol, 34, 215-220.
Bland JM, Altman DG. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. Lancet, i, 307-310.
Bieniek, M.M., Bennett, P.J., Sekuler, A.B. & Rousselet, G.A. (2016) A robust and representative lower bound on object processing speed in humans. The European journal of neuroscience, 44, 1804-1814.
Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8 42 41-48 42 30.
Two distributions can differ in many ways, yet the standard approach in neuroscience & psychology is to assume differences in means. That’s why the first step in exploratory data analysis should always be detailed graphical representations (Rousselet et al. 2016, 2017). To help quantify how two distributions differ, a fantastic tool is the shift function – an example is provided below. It consists in plotting the difference between group quantiles, as a function of the quantiles in one group. The technique was first described in the 1970s by Doksum (1974; Doksum & Sievers, 1976), and later refined by Wilcox using the Harrell-Davis quantile estimator (Harrell & Davis, 1982) in conjunction with two percentile bootstrap methods (Wilcox 1995; Wilcox et al. 2014). The technique is related to delta plots and relative distribution methods (see details in Rousselet et al. 2017).
The goal of this post is to get feedback on my first attempt to make a Bayesian version of the shift function. Below I describe three potential strategies. My main motivation for making a Bayesian version is that the shift function comes with frequentist confidence intervals and p values. Although I still use confidence intervals to describe sampling variability, they are inherently linked to p values and tend to be associated with major flaws in interpretation (Morey et al. 2016). And my experience is that if p values are available, most researchers will embark on lazy and irrational decision making (Wagenmakers 2007). P values would not be a problem if they were just used as one of many pieces of evidence, without any special status (McShane et al. 2017).
Let’s consider a toy example: two independent exGaussian distributions, with n = 100 in each group.
The two groups clearly differ, as expected from the generative process (see online code). A t-test on means suggests a large uncertainty about the group difference:
difference = – 65 ms [-166, 37]
t = -1.26
p = 0.21
A shift function provides much more information about how the two groups differ.
The x-axis shows the quantiles of group 1 (here only the 9 deciles, which is a good default). The y-axis shows the difference between deciles of group 1 and group 2. Intuitively, the difference shows by how much group 2 would need to be shifted to match group 1, for each decile. The coloured labels show the quantile differences. I let you go back and forth between the density estimates and the shift function to understand what’s going on. Another detailed description is provided here if needed.
Strategy 1: Bayesian bootstrap
A simple strategy to make a Bayesian shift function is to use the Bayesian bootstrap instead of a percentile bootstrap. The percentile bootstrap can already be considered a very cheap way to create Bayesian posterior distributions, if we make the strong (and wrong) assumption that our observations are the only possible ones. Rasmus Bååth provides a detailed introduction to the Bayesian bootstrap, and it’s R implementation on his blog. There is also a video.
Using the Bayesian bootstrap, and 95% highest density intervals (HDI) of the posterior distributions, the shift function looks very similar to the original version, as expected.
Except now we’re dealing with credible intervals, and there are no p values, so users have to focus on quantification!
Strategy 2: Bayesian quantile regression
Another strategy is to use quantile regression, which comes in a Bayesian flavour using the asymmetric Laplacian likelihood family. To do that, I’m using the amazing brms R package by Paul Bürkner.
To get started with Bayesian statistics and the brms package, I recommend this excellent blog post by Matti Vuorre. Many other great posts are available here.
Using the default priors from brms, we can fit a quantile regression line for each group and for each decile. The medians (or means, or modes) and 95% HDIs of the posterior distributions can then be used to create a shift function.
Again, the results are quite similar to the original ones, although this time the quantiles do differ a bit from those in the previous versions because we use the medians of the posterior distributions to estimate them. Also, I haven’t looked in much detail at how well the model fits the data. The posterior predictive samples suggest the fits could be improved, but I have too little experience to make a call.
Strategy 3: Bayesian model with exGaussians
The third strategy is to fit a descriptive model to the distributions, generate samples from the posterior distributions, and compute quantiles from these predicted values. Here, since our toy model simulates reaction time data using exGaussian distributions, it makes sense to fit an exGaussian family to the data. More generally, exGaussian distributions are very good at capturing the shape of RT data (Matzke & Wagenmakers, 2009).
Again, the results look similar to the original ones. This strategy has the advantage to require fitting only one model, instead of a new model for each quantile in strategy 2. In addition, we get an exGaussian fit of the data, which is very useful in itself. So that would probably be my favourite strategy.
Questions to you, reader
Does this make even remotely sense?
Which strategy seems more promising?
The clear benefit of strategies 2 and 3 is that they can easily be extended to create hierarchical shift functions, by fitting simultaneously observations from multiple participants. Whereas for the original shift function using the percentile bootstrap, I don’t see any obvious way to make a hierarchical version.
You might want to ask, why not simply focus on modelling the distributions instead of looking at the quantiles? Modelling the shape of the distributions is great of course, but I don’t think it can achieve the same fine-grained quantification achieved by the shift function. Another approach of course would be to fit a more psychologically motivated model, such as a diffusion model. I think these three approaches are complementary.
Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Annals of Statistics, 2, 267-277.
Doksum, K.A. & Sievers, G.L. (1976) Plotting with Confidence – Graphical Comparisons of 2 Populations. Biometrika, 63, 421-434.
Harrell, F.E. & Davis, C.E. (1982) A new distribution-free quantile estimator. Biometrika, 69, 635-640.
Blakeley B. McShane, David Gal, Andrew Gelman, Christian Robert, Jennifer L. Tackett (2017) Abandon Statistical Significance. arXiv:1709.07588
Matzke, D. & Wagenmakers, E.J. (2009) Psychological interpretation of the ex-Gaussian and shifted Wald parameters: a diffusion model analysis. Psychon Bull Rev, 16, 798-817.
Morey, R.D., Hoekstra, R., Rouder, J.N., Lee, M.D. & Wagenmakers, E.J. (2016) The fallacy of placing confidence in confidence intervals. Psychon Bull Rev, 23, 103-123.
Rousselet, G.A., Foxe, J.J. & Bolam, J.P. (2016) A few simple steps to improve the description of group results in neuroscience. The European journal of neuroscience, 44, 2647-2651.
Rousselet, G., Pernet, C. & Wilcox, R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience, figshare.
Wagenmakers, E.J. (2007) A practical solution to the pervasive problems of p values. Psychonomic bulletin & review, 14, 779-804.
Wilcox, R.R. (1995) Comparing Two Independent Groups Via Multiple Quantiles. Journal of the Royal Statistical Society. Series D (The Statistician), 44, 91-99.
Wilcox, R.R., Erceg-Hurn, D.M., Clark, F. & Carlson, M. (2014) Comparing two independent groups via the lower and upper quantiles. J Stat Comput Sim, 84, 1543-1551.
Let’s consider a simple mixed ERP design with 2 repeated measures (2 tasks) and 2 independent groups of participants (young and older participants). The Matlab code and the data are available on github. The data are time-courses of mutual information, with one vector time-course per participant and task. These results are preliminary and have not been published yet, but you can get an idea of how we use mutual information in the lab in recent publications (Ince et al. 2016a, 2016b; Rousselet et al. 2014). The code and illustrations presented in the rest of the post are not specific to mutual information.
Our 2 x 2 experimental design could be analysed using the LIMO EEG toolbox for instance, by computing a 2 x 2 ANOVA at every time point, and correcting for multiple comparisons using cluster based bootstrap statistics (Pernet et al. 2011, 2015). LIMO EEG has been used to investigate task effects for instance (Rousselet et al. 2011). But here, instead of ANOVAs, I’d like to focus on graphical representations and non-parametric assessment of our simple group design, to focus on effect sizes and to demonstrate how a few figures can tell a rich data-driven story.
First, we illustrate the 4 cells of our design. Figure 1 shows separately each group and each task: in each cell all participants are superimposed using thin coloured lines. We can immediately see large differences among participants and between groups, with overall smaller effects (mutual information) in older participants. There also seems to be task differences, in particular in young participants, which tend to present more sustained effects past 200 ms in the expressive task than the gender task.
To complement the individual traces, we can add measures of central tendency. The mean is shown with a thick green line, the median with a thick black line. See how the mean can be biased compared to the median in the presence of extreme values. The median was calculated using the Harrell-Davis estimator of the 50th quantile. To illustrate the group median with a measure of uncertainty, we can add a 95% percentile bootstrap confidence interval for instance (Figure 2).
We can immediately see discrepancies between the median time-courses and their confidence intervals on the one hand, and the individual time-courses on the other hand. There are indeed many distributions of participants that can lead to the same average time-course. That’s why it is essential to show individual results, at least in some illustrations.
In our 2 x 2 design, we now have 3 aspects to consider: group differences, task differences and their interactions. We illustrate them in turn.
Age group differences for each task
We can look at the group differences in each task separately, as shown in Figure 3. The medians of each group is shown with 95% percentile bootstrap confidence intervals. On average, older participants tend to have weaker mutual information than young participants – less than half around 100-200 ms post-stimulus. This will need to be better quantified, for instance by reporting the median of all pairwise differences.
Under each panel showing the median + CI for each group, we plot the time-course of the group differences (young-older), with a confidence interval. For group comparisons we cannot illustrate individuals, because participants are be paired. However, we can illustrate all the bootstrap samples, shown in grey. Each sample was obtained by:
sampling with replacement Ny observations among Ny young observers
sampling with replacement No observations among No older observers
compute the median of each group
subtract the two medians
It is particularly important to illustrate the bootstrap distributions if they are skewed or contain outliers, or both, to check that the confidence intervals provide a good summary. If the bootstrap samples are very skewed, highest density intervals might be a good alternative to classic confidence intervals.
The lower panels of Figure 3 reveal relatively large group differences in a narrow window within 200 ms. The effect also appears to be stronger in the expressive task. Technically, one could also say that the effects are statistically significant, in a frequentist sense, when the 95% confidence intervals do not include zero. But not much is gained from that because some effects are large and others are small. Correction for multiple comparisons would also be required.
Task differences for each group
Figure 4 has a similar layout to Figure 3, now focusing on the task differences. The top panels suggest that the group medians don’t differ much between tasks, except maybe in young participants around 300-500 ms.
Because task effects are paired, we are not limited to the comparison of the medians between tasks; we can also illustrate the individual task differences and the medians of these differences . These are shown in the bottom panels of Figure 4. In both groups, the individual differences are large and the time-courses of the task differences are scattered around zero, except in the young group starting around 300 ms, where most participants have positive differences (expressive > gender).
 When the mean is used as a measure of central tendency, these two perspectives are identical, because the difference between two means is the same as the mean of the pairwise differences. However, this is not the case for the median: the difference between medians is not the same as the median of the differences. Because we are interested in effect sizes, it is more informative to report descriptive statistics of the pairwise differences. The advantage of the Matlab code provided with this post is that instead of looking at the median, we can also look at other quantiles, thus getting a better picture of the strength of the effects.
Interaction between tasks and groups
Finally, in Figure 5 we consider the interactions between task and group factors. To do that we first superimpose the medians of the task differences with their confidence intervals (top panel). These traces are the same shown in the bottom panels of Figure 4. I can’t say I’m very happy with the top panel of Figure 5 because the two traces are difficult to compare. Essentially the don’t seem to differ much, except maybe for the late effect in young participants being higher than what is observed in older participants.
In the lower panel of Figure 5 we illustrate the age group differences (young – older) between the medians of the pairwise task differences. Again confidence intervals are also provided, along with the original bootstrap samples. Overall, there is very little evidence for a 2 x 2 interaction, suggesting that the age group differences are fairly stable across tasks. Put another way, the weak task effects don’t appear to change much in the two age groups.
Ince, R.A., Jaworska, K., Gross, J., Panzeri, S., van Rijsbergen, N.J., Rousselet, G.A. & Schyns, P.G. (2016a) The Deceptively Simple N170 Reflects Network Information Processing Mechanisms Involving Visual Feature Coding and Transfer Across Hemispheres. Cereb Cortex.
Ince, R.A., Giordano, B.L., Kayser, C., Rousselet, G.A., Gross, J. & Schyns, P.G. (2016b) A statistical framework for neuroimaging data analysis based on mutual information estimated via a gaussian copula. Hum Brain Mapp.
Pernet, C.R., Chauveau, N., Gaspar, C. & Rousselet, G.A. (2011) LIMO EEG: a toolbox for hierarchical LInear MOdeling of ElectroEncephaloGraphic data. Comput Intell Neurosci, 2011, 831409.
Pernet, C.R., Latinus, M., Nichols, T.E. & Rousselet, G.A. (2015) Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of neuroscience methods, 250, 85-93.
Rousselet, G.A., Gaspar, C.M., Wieczorek, K.P. & Pernet, C.R. (2011) Modeling Single-Trial ERP Reveals Modulation of Bottom-Up Face Visual Processing by Top-Down Task Constraints (in Some Subjects). Front Psychol, 2, 137.
Rousselet, G.A., Ince, R.A., van Rijsbergen, N.J. & Schyns, P.G. (2014) Eye coding mechanisms in early human face event-related potentials. J Vis, 14, 7.
In Bieniek et al. 2015 we reported EEG estimates of visual object processing onsets from a sample of 120 human adult participants aged 18-81. Among these participants, 74 were tested a second time for test-retest assessment. All the details are described in the paper, which is open access. The main outcomes of that paper were onset estimation in every participant and robust descriptive group statistics with various frequentist confidence intervals. Although I still think we could improve on the onset detection technique (notably using shape matching instead of amplitude boundary crossing), we can leave that issue aside for the moment and assume that we’re dealing with a decent sample of EEG onsets. Here I want to revisit the main group analysis using a Bayesian approach. In doing so, I will cover a very Bayesian problem: how to integrate information across experiments. I will focus on a very simple one-sample case, because that’s the nature of the data, and more importantly because this is my first attempt at Bayesian statistics!
Code and data
The R & JAGS code and the data to reproduce all the analyses and the figures are available in the R script onset_priors.R located in this stand-alone folder.
Analyses were performed in R version 3.2.2, using JAGS version 4.1.0, and R functions and examples from the BEST package version 2 by John Kruschke. The R script used to generate the results and the figures assumes that you have read Kruschke’s BEST paper and installed his code and dependencies. Much more can be found in his excellent book.
For the onset results, a more comprehensive dataset + Matlab code is available here:
The EEG data used to compute the onsets are available here:
Bieniek MM, Bennett PJ, Sekuler AB, Rousselet GA (2015) Data from: A robust and representative lower bound on object processing speed in humans. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.46786
Before we get started on our experimental data, we should explore a toy example for which we know exactly what to expect. Indeed, it can be a very bad idea to apply completely new tools to your own data, because there are potentially too many unknowns, and no easy way to check that the tools are doing what they’re supposed to. So first we create fake data to check that the procedure works. We make a sample of 20 observations with a mean of 100 and standard deviation of 5.
Run Bayesian analysis using BEST’s defaults
We use BEST’s defaults with broad priors, 3 chains, 500 adaptation steps, 1,000 burn-in steps and 10,000 monitoring steps. Here are the results:
Figure 1. Parameter estimation given BEST’s defaults broad priors. Panels A-D show histograms of credible values from the posterior distribution for the mean, standard deviation, effect size and normality parameters. In panel A, the expected value of 100 is indicated in green under the mode of the marginal mean distribution. This value appears between 2 values: the percentage of the posterior distribution below and above the expected value. HDI stands for highest density interval. Panel E shows examples of representative posterior predictive t distributions. These examples appear in blue and are combinations of value parameters illustrated in panels A, B and D. In red is a histogram of the data. The distributions in blue are not data, they are posterior predictive distributions given our priors and the data.
The results are as expected given the fake data we have generated: the MCMC (Markov Chain Monte Carlo) Gibbs algorithm seems to do a good job at sampling regions of the posterior distribution that are compatible with the data. But how do we assess the quality of the analysis? This is a huge topic covered in part in Kruschke’s book, Doing Bayesian data analysis, 2nd edition. He offers a diagnostic plotting function diagMCMC to investigate the behaviour of the chains.
Run diagnostic tools
We run diagMCMC on the mu parameter and obtain the figure below. Although Kruschke suggests to use this figure to determine if the chains have converged, it is not a very appropriate description, because strictly speaking Markov chains do not converge. So what does this figure actually tells us?
Figure 2. MCMC diagnostics.A Trace plots of iterations, or steps, of the 3 chains for parameter mu (the mean). The first 1,500 steps were discarded because of the default 500 adaptation steps and 1000 burn in steps. The strong overlap among the 3 chains suggest that they are representative of the posterior distribution. B Autocorrelation functions for lags 0 to 35 for the 3 chains. This gives an indication of the number of steps necessary for the chains to provide independent samples from the posterior distribution. Here it looks as if 5 steps between samples would be sufficient to obtain a more representative estimation of the posterior distribution. ESS is the effective sample size: it estimates the length of a chain without autocorrelation that would give us the same information as the current one. Here we used 10,000 steps, so an ESS of ~6,000 suggests redundancy among samples which are taken from very similar parts of the distribution. For an accurate description of the 95% HDI, an ESS of at least 10,000 is recommended, according to Kruschke; for particular applications it would be wise to validate that number. C Shrink factor over iterations (steps) in the chains. The shrink factor is a measure of between chain variance relative to within chain variance. A value close to 1 again suggests that the chains are sampling similar representative regions of the posterior distribution. A shrink factor \<1.1 is a recommended cut-off according to Kruschke. D Density plots of the parameter value for the 3 chains, and the 95% HDI for each chain. The 3 density plots and HDIs overlap well, suggesting again that the parameter values have been sampled from similar representative regions of the posterior distribution. MCSE is the Monte Carlo standard error – the lower the better. I do not understand yet how MCSE, ESS and shrink factors are calculated.
The 3 chains appear to be well mixed and to be sampling from a high-probability region. However there is some autocorrelation, and the ESS is ~6000. Autocorrelation can be associated with poor, unrepresentative, sampling of the posterior distribution. To see if this is a concern, we could run the MCMC analysis again, and compare the 95% HDIs. If they are very similar between analyses, then we have probably sampled sufficiently from similar representative portions of the posterior distribution, and there is nothing to worry about. An alternative is to compare the results from the 3 chains we ran independently: in panel D of the figure above, the 3 marginal distributions of the mean and their associated 95% HDIs are very similar, so the sampling was probably sufficient.
To decrease the autocorrelation, we could also thin the chains, or increase the number of steps. See the very useful discussion in Kruschke’s blog suggesting that thinning is a poor strategy against auto-correlation. It seems that introducing burn-in steps is also unnecessary.
Results with 5 thinning steps
Let’s try thinning anyway. I’ll report on better strategies once I have performed some simulations. In this example, using 5 thinning steps gets rid of the auto-correlation and might be sufficient to provide stable estimation of the mean and its 95% HDI:
Figure 3. MCMC diagnostics for analyses with 5 thinning steps.
The results show improved HDI consistency across the 3 chains, from already very consistent results. In this case, 5 thinning steps were sufficient to get an ESS around the expected 10,000, although running the same analysis again produced inconsistent results with ESS ranging from ~8000 to ~10,000. If our main interest is the mode and the 95% HDI of the mean of the samples from the posterior distribution, then thinning does not make much difference compared to the previous results. The required precision of the estimation depends on the application of course.
Results are robust to outliers
In the lab, we only adopt tools that are robust against outliers. Certainly, t-tests and ANOVAs on means are not robust. What about our new Bayesian tool? To check, we add outliers to our sample of fake data, and run the analysis again. The results are very convincing:
Figure 4. Parameter estimation given data with outliers.
Although the estimation of SD has changed, here our main interest is to estimate the mean of probable underlying distributions compatible with the data. And for the mean the results with outliers are almost identical to those without outliers. Thus, MCMC with a t distribution is robust: the estimation of the mean is barely different from the original, and within random fluctuations expected from running the same MCMC several times. You can try different combinations of outliers to convince yourselves: personally, I’m sold.
Impact of priors
How are the results affected by our choice of priors? Instead of BEST’s default settings, we can for instance run the analysis using a uniform prior on mean and a gamma prior on SD. We strongly expect onsets to be contained in the broad interval 0-200 ms, so we use a uniform distribution spanning that interval. This gives results very similar to the original ones, which is expected because by default BEST uses very broad priors.
Figure 5. Parameter estimation given uniform prior on the mean.
But what happens if we run the analysis with strong priors on the mean? Let’s assume that we expect onsets to have a mean around 80 ms, with little dispersion around that value (mean prior SD=3). In that case the posterior predictive samples for the mean look very different from those obtained using broad priors (Figures 1 & 5):
Figure 6. Posterior distribution of the mean given a strong 80 ms prior. The mode is used as central tendency of the distribution – although it can be more unstable than the median, it tends to be more representative. The expected value of 100 is indicated in green under the mode and marked by a green vertical dotted line. This value appears between 2 values: the percentage of the posterior distribution below and above the expected value. Under that, in red, is a [95, 105] ROPE, marked by vertical doted lines. The 3 percentage values indicate the percentage of the MCMC distribution below, inside, and above the ROPE. Finally, the 95% HDI is indicated by a thick horizontal black line at the bottom of the histogram.
That’s reassuring, because if we have strong prior knowledge about the underlying distribution from which the data might be sampled, a new experiment should not be able to override that knowledge. Here, in particular, the 95% HDI does not include 100, the sample mean. So should we conclude that 100 is not a credible value for the underlying distribution, given our prior knowledge and the current experiment? If we were dealing with processing speed estimates from EEG recordings, our decision should be guided by other physiological considerations. For instance, if we think that our measurements originate from one particular cortical area, and after considering conduction times between areas & other factors (Salin & Bullier, 1995; Nowak & Bullier, 1997), we could use for instance a narrow margin of error of 5 ms on either side of 100 ms. We would then define a region of practical equivalence or ROPE of [95, 105]. Because the HDI does overlap with the proposed ROPE, we should probably reserve judgement about whether 100 is credible given the data.
Bayesian analysis of EEG onsets
Now that we’ve gained some confidence in the tools, let’s do a Bayesian analysis of our real dataset of EEG onsets. The variable ses1 contains onsets from 120 participants. ses2 contains onsets from 74 participants who also provided ses1 onsets. Although this is a paired design, and the results can be used to investigate test-retest reliability (see our EJN 2015 paper), here we treat the two sessions as independent.
First, let’s plot kernel density estimates for the two sessions:
Figure 7. Kernel density estimates for session 1 and session 2 observations.
Bayesian analysis on session 1 using BEST’s defaults
To start, we use BEST’s defaults. Before we consider the results, we call the diagnostic plotting function diagMCMC for mu, our main parameter of interest.
Figure 8. MCMC diagnostics for session 1 onset analysis using BEST’s defaults.
The diagnostic tools suggests the chains are well mixed and seem to be sampling from a high-probability region. However the ESS is ~6000, which calls for some thinning or more steps. So let’s try again with 20,000 steps:
Figure 9. MCMC diagnostics for session 1 onset analysis using 20,000 steps.
ESS is now ~10000, and all the diagnostic indicators look good. It seems we’ve got a good sampling of the posterior distribution. So let’s display the results, with a 100 ms onset for comparison:
Figure 10. Parameter estimation for session 1 onset data using 20,000 iterations.
Here we get a mode of 92.6 and 95% HDI [88.4, 96.7]. If we had the hypothesis of a mean onset of 100 ms, the results suggest a credible mean lower than 100 ms. Let’s plot only mean posterior samples:
Figure 11. Session 1 onset posterior distribution of the mean.
A 95-105 ms ROPE overlaps with the 95% HDI, so we conclude that 100 ms is still a credible value.
Bayesian analysis on session 1 using informed priors
What happens if we use more informed priors? For instance, the literature suggests that face onsets are around 50-150 ms. Here are the results using a uniform distribution on that interval:
Figure 12. Parameter estimation for session 1 data using a uniform prior.
And the mean posterior only, with a [95, 105] ROPE:
Figure 13. Session 1 onset posterior distribution of the mean given a uniform prior.
As expected with uniform priors, the results changed very little compared to using BEST’s default options.
Bayesian analysis of session 2 results
With our new posterior estimates, we can now study results from session 2 using more informed priors. We plugin estimates of the posterior samples from session 1 as priors for session 2. This makes a lot of sense given that session 1 has a large sample size (for this type of research) and is therefore our best description of the population we’re trying to estimate. Here are the results:
Figure 14. Parameter estimation for session 2 onset data given session 1 priors.
Figure 15. Session 2 onset posterior distribution of the mean given session 1 priors.
Compared to session 1 estimation, we’ve gained information: the ROPE is now completely outside the 95% HDI, which is [87.3, 93.4], and the mode of the mean is 90.3. So 100 ms is no longer a credible onset value.
What do we get if we pretend that session 1 does not exist?
session 2 results with session 1 priors: 95% HDI = [87.3, 93.4], mode of the mean = 90.3 (Figure 15)
session 2 results with broad priors: 95% HDI = [83.2, 92.2], mode of the mean = 88.1
Because the results are very similar across the two sessions, using broad priors instead of session 1 priors give very similar results. However, session 2 estimates using broad priors are shifted to the left, because the session 2 data distribution suggests a slightly lower mean compared to session 1.
In our EJN paper, we reported these median onsets with 95% percentile bootstrap confidence intervals:
– session 1 = 92 ms [85, 99]
– session 2 = 85 ms [79, 92]
Our Bayesian estimation across sessions suggests that the underlying distribution’s most credible mean is 90 ms. The 95% HDI is also about twice narrower than the 95% confidence interval, so we’ve reduced the uncertainty about the mean.
The same procedure employed in this section can be used to analyse a new distribution of onsets from similar experiments. We have collected a smaller number of observations in a series of new experiments, and we will be analysing the results using as priors session 2 posteriors, themselves obtained given session 1 posteriors.
Comparison to a new point estimate
Let’s assume someone finds an onset at 50 ms, probably using group statistics, such that there is only a single point estimate available. Before writing a paper declaring “face processing takes 50 ms”, is 50 ms credible given our estimation from 2 large samples? Let’s assume a ROPE of 10 ms on each side of the point estimate, so that we would consider 50 ms not credible only if our previous 95% HDI does not contain the interval 40-60 ms at all. There are other ways to estimate the ROPE, for instance by sampling participants with replacement many times and compute the group estimate for every iteration. Given that most MEEG papers I have ever read report only a single value, it’s very likely that future studies will focus on point estimates. Here are the results with a [40, 60] ROPE:
Figure 16. Comparison between a 50 ms point estimate + ROPE & session 2 onset posterior distribution of the mean.
Well, as you’ve guessed, the answer is no, 50 ms is not a statistically credible onset! It is also not physiologically credible! And as incredible as such a 50 ms processing speed claim might seem, it has been made before, and I’m received recently a paper making such a claim to review… Now I can point the authors to this blog. Go Bayes!