Tag Archives: bootstrap

How to compare dependent correlations

In this post we’re going to compare two robust dependent correlation coefficients using a frequentist approach. The approach boils down to computing a confidence interval for the difference between correlations. There are several solutions to this problem, and we’re going to focus on what is probably the simplest one, using a percentile bootstrap, as described in Wilcox 2016 & implemented in his R functions twoDcorR() and twoDNOV(). These two functions correspond to two cases:

  • Case 1: overlapping correlations
  • Case 2: non-overlapping correlations

Case 1: overlapping correlations

Case 1 corresponds to the common scenario in which we look for correlations, across participants, between one behavioural measurement and activity in several brain areas. For instance, we could look at correlations between percent correct in one task and brain activity in two regions of interest (e.g. parietal and occipital). In this scenario, often papers report a significant brain-behaviour correlation in one brain area, and a non-significant correlation in another brain area. Stopping the analyses at that stage leads to a common interaction fallacy: because correlation 1 is statistically significant and correlation 2 is not does not mean that the two correlations differ (Nieuwenhuis et al. 2011). The interaction fallacy is also covered in a post by Jan Vanhove. Thom Baguley also provides R code to compare correlations, as well as a cautionary note about using correlations at all.

To compare the two correlation coefficients, we proceed like this:

  • sample participants with replacement
  • compute the two correlation coefficients based on the bootstrap samples
  • save the difference between correlations
  • execute the previous steps at least 500 times
  • use the distribution of bootstrap differences to derive a confidence interval: a 95% confidence interval is defined as the 2.5th and 97.5th quantiles of the bootstrap distribution.

A Matlab script implementing the procedure is on github. To run the code you will need the Robust Correlation Toolbox. First we generate data and illustrate them.

fig1_comp2dcorr

Figure 1

% Then we bootstrap the data
Nb = 500;
bootcorr1 = zeros(Nb,1);
bootcorr2 = zeros(Nb,1);

for B = 1:Nb

bootsample = randi(Np,1,Np);
bootcorr1(B) = Spearman(a(bootsample),b(bootsample),0);
bootcorr2(B) = Spearman(a(bootsample),c(bootsample),0);

end

% Cyril Pernet pointed out on Twitter that the loop is unnecessary.
% We can compute all bootstrap samples in one go:
% bootsamples = randi(Np,Np,Nb);
% bc1 = Spearman(a(bootsamples),b(bootsamples),0);
% bc2 = Spearman(a(bootsamples),c(bootsamples),0);
% The bootstrap loop does make the bootstrap procedure more intuitive
% for new users, especially if they are also learning R or Matlab!

In the example above we used Spearman’s correlation, which is robust to univariate outliers (Pernet, Wilcox & Rousselet, 2012). To apply the technique to Pearson’s correlation, the boundaries of the confidence interval need to be adjusted, as described in Wilcox (2009). However, Pearson’s correlation is not robust so it should be used cautiously (Rousselet & Pernet 2012). Also, as described in Wilcox (2009), Fisher’s z test to compare correlation coefficients is inappropriate.

Confidence intervals are obtained like this:

alpha = 0.05; % probability coverage - 0.05 for 95% CI

hi = floor((1-alpha/2)*Nb+.5);
lo = floor((alpha/2)*Nb+.5);

% for each correlation
boot1sort = sort(bootcorr1);
boot2sort = sort(bootcorr2);
boot1ci = [boot1sort(lo) boot1sort(hi)]; 
boot2ci = [boot2sort(lo) boot2sort(hi)]; 

% for the difference between correlations
bootdiff = bootcorr1 - bootcorr2;
bootdiffsort = sort(bootdiff);
diffci = [bootdiffsort(lo) bootdiffsort(hi)];

We get:

corr(a,b) = 0.52 [0.34 0.66]
corr(a,c) = 0.79 [0.68 0.86]
difference = -0.27 [-0.44 -0.14]

The bootstrap distribution of the differences between correlation coefficients is illustrated below.

fig2_comp2dcorr

Figure 2

The bootstrap distribution does not overlap with zero, our null hypothesis. In that case the p value is exactly zero, which is calculated like this:

pvalue = mean(bootdiffsort < 0);
pvalue = 2*min(pvalue,1-pvalue);

The original difference between coefficients is marked by a thick vertical black line. The 95% percentile bootstrap confidence interval is illustrated by the two thin vertical black lines.

Case 2: non-overlapping correlations

Case 2 corresponds to a before-after scenario. For instance the same participants are tested before and after an intervention, such as a training procedure. On each occasion, we compute a correlation, say between brain activity and behaviour, and we want to know if that correlation changes following the intervention.

This case 2 is addressed using a straightforward modification of case 1. Here are example data:

fig3_comp2dcorr

Figure 3

The bootstrap is done like this:

Nb = 500;
bootcorr1 = zeros(Nb,1);
bootcorr2 = zeros(Nb,1);

for B = 1:Nb

bootsample = randi(Np,1,Np);
bootcorr1(B) = Spearman(a1(bootsample),b1(bootsample),0);
bootcorr2(B) = Spearman(a2(bootsample),b2(bootsample),0);

end

alpha = 0.05; % probability coverage - 0.05 for 95% CI
hi = floor((1-alpha/2)*Nb+.5);
lo = floor((alpha/2)*Nb+.5);

% for each correlation
boot1sort = sort(bootcorr1);
boot2sort = sort(bootcorr2);
boot1ci = [boot1sort(lo) boot1sort(hi)]; 
boot2ci = [boot2sort(lo) boot2sort(hi)]; 

% for the difference between correlations
bootdiff = bootcorr1 - bootcorr2;
bootdiffsort = sort(bootdiff);
diffci = [bootdiffsort(lo) bootdiffsort(hi)];

We get:

corr(a1,b1) = 0.52 [0.34 0.66]
corr(a2,b2) = 0.56 [0.39 0.68]
difference = -0.04 [-0.24 0.17]

The difference is very close to zero and its confidence interval includes zero. So the training procedure is associated with a very weak change in correlation.

Instead of a confidence interval, we could also report a highest density interval, which will be very close to the confidence interval if the bootstrap distribution is symmetric – the Matlab script on github shows how to compute a HDI. We could also simply report the difference and its bootstrap distribution. This provides a good summary of the uncertainty we have about the difference, without committing to a binary description of the results as significant or not.

fig4_comp2dcorr

Figure 4

Conclusion

The strategies described here have been validated for Spearman’s correlation and the Winzorised correlation (Wilcox, 2016). The skipped correlation led to too conservative confidence intervals, meaning that in simulations, the 95% confidence intervals contained the true value more than 95% of the times. This illustrates an important idea: the behaviour of a confidence interval is always estimated in the long run, using simulations, and it results from the conjunction of an estimator and a technique to form the confidence interval. Finally, a very similar bootstrap approach can be used to compare regression coefficients (Wilcox 2012), for instance to compare the slopes of robust linear regressions in an overlapping case (Bieniek et al. 2013).

References

Bieniek, M.M., Frei, L.S. & Rousselet, G.A. (2013) Early ERPs to faces: aging, luminance, and individual differences. Frontiers in psychology, 4, 268.

Nieuwenhuis, S., Forstmann, B.U. & Wagenmakers, E.J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci, 14, 1105-1107.

Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.

Wilcox, R.R. (2009) Comparing Pearson Correlations: Dealing with Heteroscedasticity and Nonnormality. Communications in Statistics-Simulation and Computation, 38, 2220-2234.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Wilcox, R.R. (2016) Comparing dependent robust correlations. Brit J Math Stat Psy, 69, 215-224.

How to illustrate a 2×2 mixed ERP design

Let’s consider a simple mixed ERP design with 2 repeated measures (2 tasks) and 2 independent groups of participants (young and older participants). The Matlab code and the data are available on github. The data are time-courses of mutual information, with one vector time-course per participant and task. These results are preliminary and have not been published yet, but you can get an idea of how we use mutual information in the lab in recent publications (Ince et al. 2016a, 2016b; Rousselet et al. 2014). The code and illustrations presented in the rest of the post are not specific to mutual information.

Our 2 x 2 experimental design could be analysed using the LIMO EEG toolbox for instance, by computing a 2 x 2 ANOVA at every time point, and correcting for multiple comparisons using cluster based bootstrap statistics (Pernet et al. 2011, 2015). LIMO EEG has been used to investigate task effects for instance (Rousselet et al. 2011). But here, instead of ANOVAs, I’d like to focus on graphical representations and non-parametric assessment of our simple group design, to focus on effect sizes and to demonstrate how a few figures can tell a rich data-driven story.

First, we illustrate the 4 cells of our design. Figure 1 shows separately each group and each task: in each cell all participants are superimposed using thin coloured lines. We can immediately see large differences among participants and between groups, with overall smaller effects (mutual information) in older participants. There also seems to be task differences, in particular in young participants, which tend to present more sustained effects past 200 ms in the expressive task than the gender task.

fig1_gpmi2x2

Figure 1

To complement the individual traces, we can add measures of central tendency. The mean is shown with a thick green line, the median with a thick black line. See how the mean can be biased compared to the median in the presence of extreme values. The median was calculated using the Harrell-Davis estimator of the 50th quantile. To illustrate the group median with a measure of uncertainty, we can add a 95% percentile bootstrap confidence interval for instance (Figure 2).

fig2_gpmi2x2_ci

We can immediately see discrepancies between the median time-courses and their confidence intervals on the one hand, and the individual time-courses on the other hand. There are indeed many distributions of participants that can lead to the same average time-course. That’s why it is essential to show individual results, at least in some illustrations.

In our 2 x 2 design, we now have 3 aspects to consider: group differences, task differences and their interactions. We illustrate them in turn.

Age group differences for each task

We can look at the group differences in each task separately, as shown in Figure 3. The medians of each group is shown with 95% percentile bootstrap confidence intervals. On average, older participants tend to have weaker mutual information than young participants – less than half around 100-200 ms post-stimulus. This will need to be better quantified, for instance by reporting the median of all pairwise differences.

fig3_gpmi_group_diff

Figure 3

Under each panel showing the median + CI for each group, we plot the time-course of the group differences (young-older), with a confidence interval. For group comparisons we cannot illustrate individuals, because participants are be paired. However, we can illustrate all the bootstrap samples, shown in grey. Each sample was obtained by:

  • sampling with replacement Ny observations among Ny young observers
  • sampling with replacement No observations among No older observers
  • compute the median of each group
  • subtract the two medians

It is particularly important to illustrate the bootstrap distributions if they are skewed or contain outliers, or both, to check that the confidence intervals provide a good summary. If the bootstrap samples are very skewed, highest density intervals might be a good alternative to classic confidence intervals.

The lower panels of Figure 3 reveal relatively large group differences in a narrow window within 200 ms. The effect also appears to be stronger in the expressive task. Technically, one could also say that the effects are statistically significant, in a frequentist sense, when the 95% confidence intervals do not include zero. But not much is gained from that because some effects are large and others are small. Correction for multiple comparisons would also be required.

Task differences for each group

Figure 4 has a similar layout to Figure 3, now focusing on the task differences. The top panels suggest that the group medians don’t differ much between tasks, except maybe in young participants around 300-500 ms.

fig4_gpmi_task_effects

Figure 4

Because task effects are paired, we are not limited to the comparison of the medians between tasks; we can also illustrate the individual task differences and the medians of these differences [1]. These are shown in the bottom panels of Figure 4. In both groups, the individual differences are large and the time-courses of the task differences are scattered around zero, except in the young group starting around 300 ms, where most participants have positive differences (expressive > gender).

[1] When the mean is used as a measure of central tendency, these two perspectives are identical, because the difference between two means is the same as the mean of the pairwise differences. However, this is not the case for the median: the difference between medians is not the same as the median of the differences. Because we are interested in effect sizes, it is more informative to report descriptive statistics of the pairwise differences. The advantage of the Matlab code provided with this post is that instead of looking at the median, we can also look at other quantiles, thus getting a better picture of the strength of the effects.

Interaction between tasks and groups

Finally, in Figure 5 we consider the interactions between task and group factors. To do that we first superimpose the medians of the task differences with their confidence intervals (top panel). These traces are the same shown in the bottom panels of Figure 4. I can’t say I’m very happy with the top panel of Figure 5 because the two traces are difficult to compare. Essentially the don’t seem to differ much, except maybe for the late effect in young participants being higher than what is observed in older participants.

fig5_gpmi_task_group_interaction

In the lower panel of Figure 5 we illustrate the age group differences (young – older) between the medians of the pairwise task differences. Again confidence intervals are also provided, along with the original bootstrap samples. Overall, there is very little evidence for a 2 x 2 interaction, suggesting that the age group differences are fairly stable across tasks. Put another way, the weak task effects don’t appear to change much in the two age groups.

References

Ince, R.A., Jaworska, K., Gross, J., Panzeri, S., van Rijsbergen, N.J., Rousselet, G.A. & Schyns, P.G. (2016a) The Deceptively Simple N170 Reflects Network Information Processing Mechanisms Involving Visual Feature Coding and Transfer Across Hemispheres. Cereb Cortex.

Ince, R.A., Giordano, B.L., Kayser, C., Rousselet, G.A., Gross, J. & Schyns, P.G. (2016b) A statistical framework for neuroimaging data analysis based on mutual information estimated via a gaussian copula. Hum Brain Mapp.

Pernet, C.R., Chauveau, N., Gaspar, C. & Rousselet, G.A. (2011) LIMO EEG: a toolbox for hierarchical LInear MOdeling of ElectroEncephaloGraphic data. Comput Intell Neurosci, 2011, 831409.

Pernet, C.R., Latinus, M., Nichols, T.E. & Rousselet, G.A. (2015) Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of neuroscience methods, 250, 85-93.

Rousselet, G.A., Gaspar, C.M., Wieczorek, K.P. & Pernet, C.R. (2011) Modeling Single-Trial ERP Reveals Modulation of Bottom-Up Face Visual Processing by Top-Down Task Constraints (in Some Subjects). Front Psychol, 2, 137.

Rousselet, G.A., Ince, R.A., van Rijsbergen, N.J. & Schyns, P.G. (2014) Eye coding mechanisms in early human face event-related potentials. J Vis, 14, 7.

Matlab code for the shift function: a powerful tool to compare two entire marginal distributions

Recently, I presented R code for the shift function, a powerful tool to compare two entire marginal distributions.

The Matlab code is now available on github.

shifthd has the same name as its R version, which was originally programmed by Rand Wilcox and first documented in 1995 (see details ). It computes a shift function for independent groups, using a percentile bootstrap estimation of the SE of the quantiles to compute confidence intervals.

shiftdhd is the version for dependent groups.

More recently, Wilcox introduced a new version of the shift function in which a straightforward percentile bootstrap is used to compute the confidence intervals, without estimation of the SE of the quantiles. This is implemented in Matlab as shifthd_pbci for independent groups (equivalent to qcomhd in R); as shiftdhd_pbci for dependent groups (equivalent to Dqcomhd in R).

A demo file shift_function_demo is available here, along with the function shift_fig and dependencies cmu and UnivarScatter.

For instance, if we use the ozone data covered in the previous shift function post, a call to shifthd looks like this:

[xd, yd, delta, deltaCI] = shifthd(control,ozone,200,1);

producing this figure:

figure1

The output of shifthd, or any of the other 3 sf functions, can be used as input into shift_fig:

shift_fig(xd, yd, delta, deltaCI,control,ozone,1,5);

producing this figure:

figure2

This is obviously work in progress, and shift_fig is meant as a starting point.

Have fun exploring how your distributions differ!

And if you have any question, don’t hesitate to get in touch.

How to quantify typical differences between distributions

In this post, I describe two complementary lines of enquiry for group comparisons:

(1) How do typical levels compare between groups?

(2.1) for independent groups What is the typical difference between randomly selected members of the two groups?

(2.2) for dependent groups What is the typical pairwise difference?

These two questions can be answered by exploring entire distributions, not just one measure of central tendency.


The R code for this post is available on github, and is based on Rand Wilcox’s WRS R package, with extra visualisation functions written using ggplot2. I will describe Matlab code in another post.


Independent groups

When comparing two independent groups, the typical approach consists in comparing the marginal distributions using a proxy: each distribution is summarised using one value, usually the non-robust mean. The difference between means is then normalised by some measure of variability – usually involving the non-robust variance, in which case we get the usual t-test. There is of course no reason to use only the mean as a measure of central tendency: robust alternatives such as trimmed means and M-estimators are more appropriate in many situations (Wilcox, 2012a). However, whether we compare the means or the medians or the 20% trimmed means of two groups, we focus on one question:

“How does the typical level/participant in one group compares to the typical level/participant in the other group?” Q1

There is no reason to limit our questioning of the data to the average Joe in each distribution: to go beyond differences in central tendency, we can perform systematic group comparisons using shift functions. Nevertheless, shift functions are still based on a comparison of the two marginal distributions, even if a more complete one.

An interesting alternative approach consists in asking:

“What is the typical difference between any member of group 1 and any member of group 2?” Q2

This approach involves computing all the pairwise differences between groups, as covered previously.

Let’s look at an example. Figure 1A illustrates two independent samples. The scatterplots indicate large differences in spread between the two groups, and also suggest larger differences in the right than the left tails of the distributions. The medians of the two groups appear very similar, so the two distributions do not seem to differ in central tendency. In keeping with these observations, a t-test and a Mann-Whitney-Wilcoxon test are non-significant, but a Kolmogorov-Smirnov test is.

typ_diff_fig1_ind

Figure 1. Independent groups: non-uniform shift. A Stripcharts of marginal distributions. Vertical lines mark the deciles, with a thick line for the median. B Kernel density representation of the distribution of difference scores. Vertical lines mark the deciles, with a thick line for the median. C Shift function. Group 1 – group 2 is plotted along the y-axis for each decile (white disks), as a function of group 1 deciles. For each decile difference, the vertical line indicates its 95% bootstrap confidence interval. When a confidence interval does not include zero, the difference is considered significant in a frequentist sense. The 95% confidence intervals are controlled for multiple comparisons. D Difference asymmetry plot with 95% confidence intervals. The family-wise error is controlled by adjusting the critical p values using Hochberg’s method; the confidence intervals are not adjusted.

This discrepancy between tests highlights an important point: if a t-test is not significant, one cannot conclude that the two distributions do not differ. A shift function helps us understand how the two distributions differ (Figure 1C): the overall profile corresponds to two centred distributions that differ in spread; for each decile, we can estimate by how much they differ, and with what uncertainty; finally, the differences appear asymmetric, with larger differences in the right tails.

Is this the end of the story? No, because so far we have only considered Q1, how the two marginal distributions compare. We can get a different but complementary perspective by considering Q2, the typical difference between any member of group 1 and any member of group 2. To address Q2, we compute all the pairwise differences between members of the two groups. In this case each group has n=50, so we end up with 2,500 differences. Figure 1B shows a kernel density representation of these differences. So what does the typical difference looks like? The median of the differences is very near zero, so it seems on average, if we randomly select one observation from each group, they will differ very little. However, the differences can be quite substantial, and with real data we would need to put these differences in context, to understand how large they are, and their physiological/psychological interpretation. The differences are also asymmetrically distributed, with negative skewness: negative scores extend to -10, whereas positive scores don’t even reach +5. This asymmetry relates to our earlier observation of asymmetric differences in the shift function.

Recently, Wilcox (2012) suggested a new approach to quantify asymmetries in difference distributions. To understand his approach, we first need to consider how difference scores are usually characterised. It helps to remember that for continuous distributions, the Mann—Whitney-Wilcoxon U statistics = sum(X>Y) for all pairwise comparisons, i.e. the sum of the number of times observations in group X are larger than observations in group Y. Concretely, to compute U we sum the number of times observations in group X are larger than observations on group Y. This calculation requires to compute all pairwise differences between X and Y, and then count the number of positive differences. So the MWW test assesses P(X>Y) = 0.5. Essentially, the MWW test is a non- parametric test of the hypothesis that the distributions are identical. The MWW test does not compare the medians of the marginal distributions as often stated; also, it estimates the wrong standard error (Cliff, 1996). A more powerful test is Cliff’s delta, which uses P(X>Y) – P(X<Y) as a measure of effect size. As expected, in our current example Cliff’s delta is not significant, because the difference distribution has a median very near zero.

Wilcox’s approach is an extension of the MWW test: the idea is to get a sense of the asymmetry of the difference distribution by computing a sum of quantiles = q + (1-q), for various quantiles estimated using the Harrell-Davis estimator. A percentile bootstrap technique is used to derive confidence intervals. Figure 1D shows the resulting difference asymmetry plot  (Wilcox has not given a clear name to that new function, so I made one up). In this plot, 0.05 stands for the sum of quantile 0.05 + quantile 0.95; 0.10 stands for the sum of quantile 0.10 + quantile 0.90; and so on… The approach is not limited to these quantiles, so sparser or denser functions could be tested too. Figure 1D reveals negative sums of the extreme quantiles (0.05 + 0.95), and progressively smaller, converging to zero sums as we get closer to the centre of the distribution. So the q+(1-q) plot suggests that the two groups differ, with maximum differences in the tails, and no significant differences in central tendency. Contrary to the shift function, the q+(1-q) plot let us conclude that the difference distribution is asymmetric, based on the 95% confidence intervals. Other alpha levels can be assessed too.

In the case of two random samples from a normal population, one shifted by a constant compared to the other, the shift function and the difference asymmetry function should be about flat, as illustrated in Figure 2. In this case, because of random sampling and limited sample size, the two approaches provide different perspectives on the results: the shift function suggests a uniform shift, but fails to reject for the three highest deciles; the difference asymmetry function more strongly suggests a uniform shift, with all sums at about the same value. This shows that all estimated pairs of quantiles are asymmetric about zero, because the difference function is uniformly shifted away from zero.

typ_diff_fig2_ind_linear_effect

Figure 2. Independent groups: uniform shift. Two random samples of 50 observations were generated using rnorm. A constant of 1 was added to group 2.

When two distributions do not differ, both the shift function and the difference asymmetry function should be about flat and centred around zero – however this is not necessarily the case, as shown in Figure 3.

typ_diff_fig3_ind_no_effect

Figure 3. Independent groups: no shift – example 1. Two random samples of 50 observations were generated using rnorm.

Figure 4 shows another example in which no shift is present, and with n=100 in each group, instead of n=50 in the previous example.

typ_diff_fig4_ind_no_effect2

Figure 4. Independent groups: no shift – example 2.  Two random samples of 100 observations were generated using rnorm.

In practice, the asymmetry plot will often not be flat. Actually, it took me several attempts to generate two random samples associated with such flat asymmetry plots. So, before getting too excited about your results, it really pays to run a few simulations to get an idea of what random fluctuations can look like. This can’t be stressed enough: you might be looking at noise!

Dependent groups

Wilcox & Erceg-Hurn (2012) described a difference asymmetry function for dependent group. We’re going to apply the technique to the dataset presented in Figure 5. Panel A shows the two marginal distributions. However, we’re dealing with a paired design, so it is impossible to tell how observations are linked between conditions. This association is revealed in two different ways in panels B & C, which demonstrate a striking pattern: for participants with weak scores in condition 1, differences tend to be small and centred about zero; beyond a certain level, with increasing scores in condition 1, the differences get progressively larger. Finally, panel D shows the distribution of differences, which is shifted up from zero, with only 6 out of 35 differences inferior to zero.

At this stage, we’ve learnt a lot about our dataset – certainly much more than would be possible from current standard figures. What else do we need? Statistical tests?! I don’t think they are absolutely necessary. Certainly, providing a t-test is of no interest whatsoever if Figure 5 is provided, because it cannot provide information we already have.

typ_diff_fig5_dep1

Figure 5. Dependent groups: data visualisation. A Stripcharts of the two distributions. Horizontal lines mark the deciles, with a thick line for the median. B Stripcharts of paired observations. Scatter was introduced along the x axis to reveal overlapping observations. C Scatterplot of paired observations. The diagonal black reference line of no effect has slope one and intercept zero. The dashed grey lines mark the quartiles of the two conditions. In panel C, it would also be useful to plot the pairwise differences as a function of condition 1 results. D Stripchart of difference scores. Horizontal lines mark the deciles, with a thick line for the median.

Figure 6 provides quantifications and visualisations of the effects using the same layout as Figure 5. The shift function (Figure 6C) shows a non-uniform shift between the marginal distributions: the first three deciles do not differ significantly, the remaining deciles do, and there is an overall trend of growing differences as we progress towards the right tails of the distributions. The difference asymmetry function provides a difference perspective. The function is positive and almost flat, demonstrating that the distribution of differences is uniformly shifted away from zero, a result that cannot be obtained by only looking at the marginal distributions. Of course, when using means comparing the marginals or assessing the difference scores give the same results, because the difference of the means is the same as the mean of the differences. That’s why a paired t-test is the same as a one-sample test on the pairwise differences. With robust estimators the two approaches differ: for instance the difference between the medians of the marginals is not the same as the median of the differences.

typ_diff_fig6_dep2

Figure 6. Dependent groups: uniform difference shift. A Stripcharts of marginal distributions. Vertical lines mark the deciles, with a thick line for the median. B Kernel density representation of the distribution of difference scores. Horizontal lines mark the deciles, with a thick line for the median. C Shift function. D Difference asymmetry plot with 95% confidence intervals.

As fancy as Figure 6 can be, it still misses an important point: nowhere do we see the relationship between condition 1 and condition 2 results, as shown in panels B & C of Figure 5. This is why detailed illustrations are absolutely necessary to make sense of even the simplest datasets.

If you want to make more inferences about the distribution of differences, as shown in Figure 6B, Figure 7 shows a complementary description of all the deciles with their 95% confidence intervals. These could be substituted with highest density intervals or credible intervals for instance.

typ_diff_fig7_dep3_decile_plot

Figure 7. Dependent groups: deciles of the difference distribution. Each disk marks a difference decile, and the horizontal green line makes its 95% percentile bootstrap confidence interval. The reference line of no effect appears as a continuous black line. The dashed black line marks the difference median.

Finally, in Figure 8 we look at an example of a non-uniform difference shift. Essentially, I took the data used in Figure 6, and multiplied the four largest differences by 1.5. Now we see that the 9th decile does not respect the linear progression suggested by previous deciles, (Figure 8, panels A & B), and the difference asymmetry function suggests an asymmetric shift of the difference distribution, with larger discrepancies between extreme quantiles.

typ_diff_fig8_dep4_larger_diff

Figure 8. Dependent groups: non-uniform difference shift. A Stripchart of difference scores. B Deciles of the difference distribution. C Difference asymmetry function.

Conclusion

The techniques presented here provide a very useful perspective on group differences, by combining detailed illustrations and quantifications of the effects. The different techniques address different questions, so which technique to use depends on the question you want to ask. This choice should be guided by experience: to get a good sense of the behaviour of these techniques will require a lot of practice with various datasets, both real and simulated. If you follow that path, you will soon realise that classic approaches such as t-tests on means combined with bar graphs are far too limited, and can hide rich information about a dataset.

I see three important developments for the approach outlined here:

  • to make it Bayesian, or at least p value free using highest density intervals;

  • to extend it to multiple group comparisons (the current illustrations don’t scale up very easily);

  • to extend it to ANOVA type designs with interaction terms.

References

Cliff, N. (1996) Ordinal methods for behavioral data analysis. Erlbaum, Mahwah, N.J.

Wilcox, R.R. (2012a) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Wilcox, R.R. (2012b) Comparing Two Independent Groups Via a Quantile Generalization of the Wilcoxon-Mann-Whitney Test. Journal of Modern Applied Statistical Methods, 11, 296-302.

Wilcox, R.R. & Erceg-Hurn, D.M. (2012) Comparing two dependent groups via quantiles. J Appl Stat, 39, 2655-2664.

the shift function: a powerful tool to compare two entire distributions

 


The R code for this post is available on github, and is based on Rand Wilcox’s WRS R package, with extra visualisation functions written using ggplot2. The R code for the 2013 percentile bootstrap version of the shift function was also covered here and here. Matlab code is described in another post.


In neuroscience & psychology, group comparison is usually an exercise that involves comparing two typical observations. This is most of the time achieved using a t-test on means. This standard procedure makes very strong assumptions:

  • the distributions differ only in central tendency, not in other aspects;
  • the typical observation in each distribution can be summarised by the mean;
  • the t-test is sufficient to detect changes in location.

As we saw previously, t-tests on means are not robust. In addition, there is no reason a priori to assume that two distributions differ only in the location of the bulk of the observations. Effects can occur in the tails of the distributions too: for instance a particular intervention could have an effect only in animals with a certain hormonal level at baseline; a drug could help participants with severe symptoms, but not others with milder symptoms… Because effects are not necessarily homogenous among participants, it is useful to have appropriate tools at hand, to determine how, and by how much, two distributions differ. Here we’re going to consider a powerful family of tools that are robust and let us compare entire distributions: shift functions.

A more systematic way to characterise how two independent distributions differ was originally proposed by Doksum (Doksum, 1974; Doksum & Sievers, 1976; Doksum, 1977): to plot the difference between the quantiles of two distributions as a function of the quantiles of one group. The original shift function approach is implemented in the functions sband and wband in Rand Wilcox’s WRS R package.

In 1995, Wilcox proposed an alternative technique which has better probability coverage and potentially more power than Doksum & Sievers’ approach. Wilcox’s technique:

  • uses the Harrell-Davis quantile estimator;
  • computes confidence intervals of the decile differences with a bootstrap estimation of the standard error of the deciles;
  • controls for multiple comparisons so that the type I error rate remains around 0.05 across the 9 confidence intervals. This means that the confidence intervals are a bit larger than what they would be if only one decile was compared, so that the long-run probability of a type I error across all 9 comparisons remains near 0.05;
  • is implemented in the shifthd function.

Let’s start with an extreme and probably unusual example, in which two distributions differ in spread, not in location (Figure 1). In that case, any test of central tendency will fail to reject, but it would be wrong to conclude that the two distributions do not differ. In fact, a Kolmogorov-Smirnov test reveals a significant effect, and several measures of effect sizes would suggest non-trivial effects. However, a significant KS test just tells us that the two distributions differ, not how.

shift_function_ex1_arrows

Figure 1. Two distributions that differ in spread A Kernel density estimates for the groups. B Shift function. Group 1 – group 2 is plotted along the y-axis for each decile (white disks), as a function of group 1 deciles. For each decile difference, the vertical line indicates its 95% bootstrap confidence interval. When a confidence interval does not include zero, the difference is considered significant in a frequentist sense.

The shift function can help us understand and quantify how the two distributions differ. The shift function describes how one distribution should be re-arranged to match the other one: it estimates how and by how much one distribution must be shifted. In Figure 1, I’ve added annotations to help understand the link between the KDE in panel A and the shift function in panel B. The shift function shows the decile differences between group 1 and group 2, as a function of group 1 deciles. The deciles for each group are marked by coloured vertical lines in panel A. The first decile of group 1 is slightly under 5, which can be read in the top KDE of panel A, and on the x-axis of panel B. The first decile of group 2 is lower. As a result, the first decile difference between group 1 and group 2 is positive, as indicated by a positive value around 0.75 in panel B, as marked by an upward arrow and a + symbol. The same symbol appears in panel A, linking the deciles from the two groups: it shows that to match the first deciles, group 2’s first decile needs to be shifted up. Deciles 2, 3 & 4 show the same pattern, but with progressively weaker effect sizes. Decile 5 is well centred, suggesting that the two distributions do not differ in central tendency. As we move away from the median, we observe progressively larger negative differences, indicating that to match the right tails of the two groups, group 2 needs to be shifted to the left, towards smaller values – hence the negative sign.

To get a good understanding of the shift function, let’s look at its behaviour in several other clear-cut situations. First, let’s consider a  situation in which two distributions differ in location (Figure 2). In that case, a t-test is significant, but again, it’s not the full story. The shift function looks like this:

shift_function_ex2_complete

Figure 2. Complete shift between two distributions

What’s happening? All the differences between deciles are negative and around -0.45. Wilcox (2012) defines such systematic effect has the hallmark of a completely effective method. In other words, there is a complete and seemingly uniform shift between the two distributions.

In the next example (Figure 3), only the right tails differ, which is captured by significant differences for deciles 6 to 9. This is a case described by Wilcox (2012) as involving a partially effective experimental manipulation.

shift_function_ex3_onesided1

Figure 3. Positive right tail shift

Figure 4 also shows a right tail shift, this time in the negative direction. I’ve also scaled the distributions so they look a bit like reaction time distributions. It would be much more informative to use shift functions in individual participants to study how RT distributions differ between conditions, instead of summarising each distribution by its mean (sigh)!

shift_function_ex4_onesided2

Figure 4. Negative right tail shift

Figure 5 shows two large samples drawn from a standard normal population. As expected, the shift function suggests that we do not have enough evidence to conclude that the two distributions differ. The shift function does look bumpy tough, potentially suggesting local differences – so keep that in mind when you plug-in your own data.

shift_function_ex5_nochange

Figure 5. No difference?

And be careful not to over-interpret the shift function: the lack of significant differences should not be used to conclude that we have evidence for the lack of effect; indeed, failure to reject in the frequentist sense can still be associated with non-trivial evidence against the null – it depends on prior results (Wagenmakers, 2007).

So far, we’ve looked at simulated examples involving large sample sizes. We now turn to a few real-data examples.

Doksum & Sievers (1976) describe an example in which two groups of rats were kept in an environment with or without ozone for 7 days and their weight gains measured (Figure 6). The shift function suggests two results: overall, ozone reduces weight gain; ozone might promote larger weight gains in animals gaining the most weight. However, these conclusions are only tentative given the small sample size, which explains the large confidence intervals.

shift_function_ex6_ozone

Figure 6. Weight gains A Because the sample sizes are much smaller than in the previous examples, the distributions are illustrated using 1D scatterplots. The deciles are marked by grey vertical lines, with lines for the 0.5 quantiles. B Shift function.

Let’s consider another example used in (Doksum, 1974; Doksum, 1977), concerning the survival time in days of 107 control guinea pigs and 61 guinea pigs treated with a heavy dose of tubercle bacilli (Figure 7). Relative to controls, the animals that died the earliest tended to live longer in the treatment group, suggesting that the treatment was beneficial to the weaker animals (decile 1). However, the treatment was harmful to animals with control survival times larger than about 200 days (deciles 4-9). Thus, this is a case where the treatment has very different effects on different animals. As noted by Doksum, the same experiment was actually performed 4 times, each time giving similar results.

shift_function_ex7_guineapigs

Figure 7. Survival time

Shift function for dependent groups

All the previous examples were concerned with independent groups. There is a version of the shift function for dependent groups implemented in shiftdhd. We’re going to apply it to ERP onsets from an object detection task (Bieniek et al., 2015). In that study, 74 of our 120 participants were tested twice, to assess the test-retest reliability of different measurements, including onsets. Typically, test-retest assessment is performed using a correlation. However, we care about the units (ms), which a correlation would get rid of, and we had a more specific hypothesis, which a correlation cannot test; so we used a shift function (Figure 8). If you look at the distributions of onsets across participants, you will see that it is overall positively skewed, and with a few participants with particularly early or late onsets. With the shift function, we wanted to test for the overall reliability of the results, but also in particular the reliability of the left and right tails: if early onsets in session 1 were due to chance, we would expect session 2 estimates to be overall larger (shifted to the right); similarly, if late onsets in session 1 were due to chance, we would expect session 2 estimates to be overall smaller (shifted to the left). The shift function does not provide enough evidence to suggest a uniform or non-uniform shift – but we would probably need many more observations to make a strong claim.

shift_function_ex8_onsets

Figure 8. ERP onsets

Because we’re dealing with a paired design, the illustration of the marginal distributions in Figure 8 is insufficient: we should illustrate the distribution of pairwise differences too, as shown in Figure 9.

shift_function_ex9_onsets_diff

Figure 9. ERP onsets with KDE of pairwise differences

Figure 10 provides an alternative representation of the distribution of pairwise differences using a violin plot.

shift_function_ex10_onsets_diff_violin

Figure 10. ERP onsets with violin plot of pairwise differences

Figure 11 uses a 1D scatterplot (strip chart).

shift_function_ex11_onsets_diff_scatter

Figure 11. ERP onsets with 1D scatterplot of pairwise differences

Shift function for other quantiles

Although powerful, Wilcox’s 1995 technique is not perfect, because it:

  • is limited to the deciles;
  • can only be used with alpha = 0.05;
  • does not work well with tied values.

More recently, Wilcox’s proposed a new version of the shift function that uses a straightforward percentile bootstrap (Wilcox & Erceg-Hurn, 2012; Wilcox et al., 2014). This new approach:

  • allows tied values;
  • can be applied to any quantile;
  • can have more power when looking at extreme quantiles (<=0.1, or >=0.9).
  • is implemented in qcomhd for independent groups;
  • is implemented in Dqcomhd for dependent groups.

Examples are provided in the R script for this post.

In the percentile bootstrap version of the shift function, p values are corrected, but not the confidence intervals. For dependent variables, Wilcox & Erceg-Hurn (2012) recommend at least 30 observations to compare the .1 or .9 quantiles. To compare the quartiles, 20 observations appear to be sufficient. For independent variables, Wilcox et al. (2014) make the same recommendations made for dependent groups; in addition, to compare the .95 quantiles, they suggest at least 50 observations per group.

Conclusion

The shift function is a powerful tool that can help you better understand how two distributions differ, and by how much. It provides much more information than the standard t-test approach.

Although currently the shift function only applies to two groups, it can in theory be extended to more complex designs, for instance to quantify interaction effects.

Finally, it would be valuable to make a Bayesian version of the shift function, to focus on effect sizes, model the data, and integrate them with other results.

References

Bieniek, M.M., Bennett, P.J., Sekuler, A.B. & Rousselet, G.A. (2015) A robust and representative lower bound on object processing speed in humans. The European journal of neuroscience.

Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Annals of Statistics, 2, 267-277.

Doksum, K.A. (1977) Some graphical methods in statistics. A review and some extensions. Statistica Neerlandica, 31, 53-68.

Doksum, K.A. & Sievers, G.L. (1976) Plotting with Confidence – Graphical Comparisons of 2 Populations. Biometrika, 63, 421-434.

Wagenmakers, E.J. (2007) A practical solution to the pervasive problems of p values. Psychonomic bulletin & review, 14, 779-804.

Wilcox, R.R. (1995) Comparing Two Independent Groups Via Multiple Quantiles. Journal of the Royal Statistical Society. Series D (The Statistician), 44, 91-99.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, Amsterdam; Boston.

Wilcox, R.R. & Erceg-Hurn, D.M. (2012) Comparing two dependent groups via quantiles. J Appl Stat, 39, 2655-2664.

Wilcox, R.R., Erceg-Hurn, D.M., Clark, F. & Carlson, M. (2014) Comparing two independent groups via the lower and upper quantiles. J Stat Comput Sim, 84, 1543-1551.

The Harrell-Davis quantile estimator

Quantiles are robust and useful descriptive statistics. They belong to the family of L-estimators, which is to say that they are based on the linear combination of order statistics. They are several ways to compute quantiles. For instance, in R, the function quantile has 9 options. In Matlab, the quantile & prctile functions offer only 1 option. Here I’d like to introduce briefly yet another option: the Harrell-Davis quantile estimator (Harrell & Davis, 1982). It is the weighted average of all the order statistics (Figure 2). And, in combination with the percentile bootstrap, it is a useful tool to derive confidence intervals of quantiles (Wilcox 2012), as we will see quickly in this post. It is also a useful tool to derive confidence intervals of the difference between quantiles of two groups, as we will see in another post. As discussed previously in the percentile bootstrap post, to make accurate confidence intervals, we need to combine an estimator with a particular confidence interval building procedure, and the right combo is not obvious depending on the data at hand.

Before we motor on, a quick google search suggests that there is recent work to try to improve the Harrell-Davis estimator, so this not to say that this estimator is the best in all situations. But according to Rand Wilcox it works well in many situations, and we do use it a lot in the lab…

Let’s look at data from a paper on visual processing speed estimation (Bieniek et al. 2015). We consider ERP onsets from 120 participants aged 18 to 81.

The sorted ages are:

18 18 19 19 19 19 20 20 21 21 21 21 21 21 21 22 22 22 22 22 22 22 22 23 23 23 24 24 24 25 26 28 28 29 29 30 30 31 31 32 32 32 33 34 34 35 35 36 37 38 40 40 41 41 42 42 43 43 44 45 45 45 45 48 49 49 50 51 54 54 55 56 58 59 59 60 60 61 62 62 62 63 63 63 64 64 64 64 65 65 66 66 66 66 66 66 67 67 67 67 68 68 68 68 68 69 70 70 70 71 72 72 72 75 76 77 78 79 81 81

Fig1-age distribution

Figure 1. Age distribution.

The Matlab code to reproduce all the figures in this post is available on github. There is also a list of R functions from Rand Wilcox’s toolbox.

How do we compute Harrell-Davis quantiles of the age distribution? Figure 2 shows the Harrell-Davis weights for the deciles of the age distribution.

Fig2-weights

Figure 2. Decile weights.

The deciles are obtained by multiplying the sorted ages by the weights in Figure 2, which gives us:

21.1, 23.3, 29.7, 37.0, 45.3, 56.1, 63.3, 66.6, 70.4

For comparison, the age deciles from Matlab’s prctile function are:

21, 23, 30, 36, 45, 57, 64, 66, 70

Now, we can update the scatterplot in Figure 1 with the deciles:

Fig3-age deciles

Figure 3. Scatterplot + age deciles. The thick vertical black line marks the 50th quantiles.

We can also compute a confidence interval for a Harrell-Davis quantile. There are two ways to do that:

  • using a percentile bootstrap of the quantile (pbci approach);
  • using a percentile bootstrap estimate of the standard error of the quantile, which is then plugged into a confidence interval formula (pbse approach).

Using the code available with this post, we can try the two approaches on the median:

  • pbci approach gives 45.31 [35.89, 54.73]
  • pbse approach gives 45.31 [38.49, 54.40]

The two methods return similar upper bounds, but quite different lower bounds. Because they are both based on random resampling with replacement, running the same analysis several times will each time also give slightly different results. Actually, this is one important criterion to select a good bootstrap confidence interval technique: despite random sampling, using the same technique many times should provide overall similar results. Another important criterion is the probability coverage: if we build a 95% confidence interval, we want that confidence interval to contain the population value we’re trying to estimate 95% of the time. That’s right, the probability attached to a confidence interval is a long run coverage: assuming a population with a certain median, if we perform the same experiment over and over, every time drawing a sample of n observations and computing an (1-alpha)% confidence interval using the same technique, (1-alpha)% of these confidence intervals will contain the population median. So, if everything is fine (n is large enough, the number of bootstrap samples is large enough, the combination of bootstrap technique and estimator is appropriate), alpha% of the time (usually 5%), a confidence interval WILL NOT include the population parameter of interest. This implies that given the 1,000s of neuroscience & psychology experiments performed every year, 100s of paper report the wrong confidence intervals – but this possibility is never considered in the articles’ conclusions…

In many situations, the long run probability coverage can be actually much lower or much higher than (1-alpha). So can we check that we’re building accurate confidence intervals, at least in the long run? For that, we’ve got to run simulations. Here is an example. First, we create a fake population, for instance with a skewed distribution, which could reflect our belief of the nature of the population we’re studying:

Fig4-sim population

Figure 4. Population of 1,000,000 values with a 10 degrees of freedom chi2 distribution.

Second, we compute benchmark values, e.g. median, mean…

Third, we run simulations in which we perform fake experiments with a given sample size, and then compute confidence intervals of certain quantities. Finally, we check how often the different confidence intervals actually contain the population parameters (probability coverage):

  • pbse(hd) = 0.9530
  • pbci(hd) = 0.9473
  • pbci(median) = 0.9452
  • pbci(mean) = 0.9394

They’re all very close to 95%. However, the confidence intervals of hd created using the pbse approach tended to be larger than those created using the pbci approach. The confidence intervals for the mean missed the population mean 1% of the time compared to the expected 95% – that’s because they tended to be shorter than the other 3. The bootstrap estimates of the sampling distribution of hd, the median and the mean, as well as the width of the confidence intervals can be explored using the code on github.

Of course, no one is ever going to run 10,000 times the same experiment! And these results assume a certain population, a certain number of observations per experiment, and a certain number of bootstrap samples. We would need a more systematic exploration of the different combinations of options to be sure the present results are not special cases.

To be clear: there is absolutely no guarantee that any particular confidence interval contains the population parameter you’re trying to estimate. So be humble, and don’t make such a big deal about your confidence intervals, especially if you have small sample sizes.

Personally, more and more I use confidence intervals to try to describe the variability in the sample at hand. For that purpose, and to avoid potential inferential problems associated with confidence intervals, I think it is more satisfactory to use highest density intervals HDI. I will post R & Matlab functions to compute the HDI of the bootstrap quantiles on github at some stage. By reporting HDI, there are no associated p values and we minimise the temptation to cross proton streams (i.e. dichotomise a continuous variable to make a binary decision – MacCallum et al. 2002).

Finally, we consider something a bit more interesting than the age of our participants: the distribution of ERP onsets.

Here are the onsets in milliseconds:

Fig7-onset distribution

Figure 5. Onsets.

And the deciles with their confidence intervals, which provide a very nice summary of the distribution:

Fig8-onset deciles

Figure 6. Onset deciles with confidence intervals.

If you’re interested, I’ve also attempted a Bayesian estimation of the onset data using R and JAGS. However, I don’t know yet how to perform quantile estimation – please get in touch if you can help.

Conclusion

Now you’ve got the tools to describe a distribution in detail. There is no particular reason why we should be obsessed with the mean, especially when robust and more informative statistics are available. Next, I will show you how to compare all the deciles of two distributions using a mighty tool: the shift function. This will, of course, rely on the Harrell-Davis estimator and the bootstrap.

References

Bieniek, M.M., Bennett, P.J., Sekuler, A.B. & Rousselet, G.A. (2015) A robust and representative lower bound on object processing speed in humans. The European journal of neuroscience.

Harrell, F.E. & Davis, C.E. (1982) A new distribution-free quantile estimator. Biometrika, 69, 635-640.

MacCallum RC, Zhang S, Preacher KJ, Rucker DD. 2002. On the practice of dichotomization of quantitative variables. Psychological Methods 7: 19-40

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press.

the percentile bootstrap

 

“The bootstrap is a computer-based method for assigning measures of accuracy to statistical estimates.” Efron & Tibshirani, An introduction to the bootstrap, 1993

“The central idea is that it may sometimes be better to draw conclusions about the characteristics of a population strictly from the sample at hand, rather than by making perhaps unrealistic assumptions about the population.” Mooney & Duval, Bootstrapping, 1993

Like all bootstrap methods, the percentile bootstrap relies on a simple & intuitive idea: instead of making assumptions about the underlying distributions from which our observations could have been sampled, we use the data themselves to estimate sampling distributions. In turn, we can use these estimated sampling distributions to compute confidence intervals, estimate standard errors, estimate bias, and test hypotheses (Efron & Tibshirani, 1993; Mooney & Duval, 1993; Wilcox, 2012). The core principle to estimate sampling distributions is resampling, a technique pioneered in the 1960’s by Julian Simon (particularly inspiring is how he used dice and cards to teach resampling in statistics classes). The technique was developed & popularised by Brad Efron as the bootstrap.

Let’s consider an example, starting with this small set of 10 observations:

1.2 1.1 0.1 0.8 2.6 0.7 0.2 0.3 1.9 0.4

To take a bootstrap sample, we sample n observations with replacement. That is, given the 10 original observations above, we sample with replacement 10 observations from the 10 available. For instance, one bootstrap sample from the example above could be (sorted for convenience):

0.4 0.4 0.4 0.8 0.8 1.1 1.2 2.6 2.6 2.6

a second one:

0.1 0.3 0.4 0.8 1.1 1.2 1.2 1.9 1.9 1.9

a third one:

0.1 0.4 0.7 0.7 1.1 1.1 1.1 1.1 1.9 2.6

etc.

As you can see, in some bootstrap samples, certain observations were sampled once, others more than once, and yet others not at all. The resampling process is akin to running many experiments.

fig1-bootstrap_philosophy

Figure 1. Bootstrap philosophy.

Essentially, we are doing fake experiments using only the observations from our sample. And for each of these fake experiments, or bootstrap sample, we can compute any estimate of interest, for instance the median. Because of random sampling, we get different medians from different draws, with some values more likely than other. After repeating the process above many times, we get a distribution of bootstrap estimates, let say 1,000 bootstrap estimates of the sample median. That distribution of bootstrap estimates is a data driven estimation of the sampling distribution of the sample median. Similarly, we can use resampling to estimate the sampling distribution of any statistics, without requiring any analytical formula. This is the major appeal of the bootstrap.

Let’s consider another example, using data from figure 5 of Harvey Motulsky’s 2014 article. We’re going to reproduce his very useful figure and add a 95% percentile bootstrap confidence interval. The data and Matlab code + pointers to R code are available on github. The file pb_demo.m will walk you through the different steps of bootstrap estimation, and can be used to recreate the figures from the rest of this post.

With the bootstrap, we estimate how likely we are, given the data, to obtain medians of different values. In other words, we estimate the sampling distribution of the sample median. Here is an example of a distribution of 1,000 bootstrap medians.

fig2-boot_median_est_density

Figure 2. Kernel density distribution of the percentile bootstrap distribution of the sample median.

The distribution is skewed and rather rough, because of the particular data we used and the median estimator of central tendency. The Matlab code let you estimate other quantities, so for instance using the mean as a measure of central tendency would produce a much smoother and symmetric distribution. This is an essential feature of the bootstrap: it will suggest sampling distributions given the data at hand and a particular estimator, without assumptions about the underlying distribution. Thus, bootstrap sampling distributions can take many unusual shapes.

The interval, in the middle of the bootstrap distribution, that contains 95% of medians constitutes a percentile bootstrap confidence interval of the median.

fig3-bootci_illustration

Figure 3. Percentile bootstrap confidence interval of the median. CI = confidence interval.

Because the bootstrap sample distribution above is skewed, it might be more informative to report a highest-density interval – a topic for another post.

To test hypotheses, we can reject a point hypothesis if it is not included in the 95% confidence interval (a p value can also be obtained – see online code). Instead of testing a point hypothesis, or in addition, it can be informative to report the bootstrap distribution in a paper, to illustrate likely sample estimates given the data.

Now that we’ve got a 95% percentile bootstrap confidence interval, how do we know that it is correct? In particular, how many bootstrap samples do we need? The answer to this question depends on your goal. One goal might be to achieve stable results: if you repeatedly compute a confidence interval using the same data and the same bootstrap technique, you should obtain very similar confidence intervals. Going back to our example, if we take a sub-sample of the data, and compute many confidence intervals of the median, we sometimes get very different results. The figure below illustrates 7 confidence intervals of the median using the same small dataset. The upper boundaries of the different confidence intervals vary far too much:

fig4-median_CI_rep

Figure 4. Repeated calculations of the percentile bootstrap confidence interval of the median for the same dataset.

The variability is due in part to the median estimator, which introduces strong non-linearities. This point is better illustrated by looking at 1,000 sorted bootstrap median estimates:

fig5-boot_median_est_sorted

Figure 5. Sorted bootstrap median estimates.

If we take another series of 1,000 bootstrap samples, the non-linearities will appear at slightly different locations, which will affect confidence interval boundaries. In that particular case, one way to solve the variability problem is to increase the number of bootstrap samples – for instance using 10,000 samples produces much more stable confidence intervals (see code). Using more observations also improves matters significantly.

If we get back to the question of the number of bootstrap samples needed, another goal is to achieve accurate probability coverage. That is, if you build a 95% confidence interval, you want the interval to contain the population value 95% of the time in the long run. Concretely, if you repeat the same experiment over and over, and for each experiment you build a 95% confidence interval, 95% of these intervals should contain the population value you are trying to estimate if the sample size is large enough. This can be achieved by using a conjunction of 2 techniques: a technique to form the confidence interval (for instance a percentile bootstrap), and a technique to estimate a particular quantity (for instance the median to estimate the central tendency of the distribution). The only way to find out which combo of techniques work is to run simulations covering a lot of hypothetical scenarios – this is what statisticians do for a living, and this is why every time you ask one of them what you should do with your data, the answer will inevitably be “it depends”. And it depends on the shape of the distributions we are sampling from and the number of observations available in a typical experiment in your field. Needless to say, the best approach to use in one particular case is not straightforward: there is no one-size-fits-all technique to build confidence intervals; so any sweeping recommendation should be regarded suspiciously.

The percentile bootstrap works very well, and in certain situations is the only (frequentist) technique known to perform satisfactorily to build confidence intervals of or to compare for instance medians and other quantiles, trimmed means, M estimators, regression slopes estimates, correlation coefficients (Wilcox 2012). However, the percentile bootstrap

does not perform well with all quantities, in particular with the mean (Wilcox & Keselman 1993). You can still use the percentile bootstrap to illustrate the variability in the sample at hand, without making inferences about the underlying population. We do this in the figure below to see how the percentile bootstrap confidence interval compares to other ways to summarise the data.

Figure 6. Updated version of Motulsky’s 2014 figure 5.

This is a replication of Motulsky’s 2014 figure 5, to which I’ve added a 95% percentile bootstrap confidence interval of the mean. This figure makes a critical point: there is no substitute for a scatterplot, at least for relatively small sample sizes. Also, using the mean +/- SD, +/- SEM, with a classic confidence interval (using t formula) or with a percentile bootstrap confidence interval can provide very different impressions about the spread in the data (although it is not their primary objective). The worst representation clearly is mean +/- SEM, because it provides a very misleading impression of low variability. Here, because the sample is skewed, mean +/- SEM does not even include the median, thus providing a wrong estimation of the location of the bulk of the observations. It follows that results in an article reporting only mean +/- SEM cannot be assessed unless  scatterplots are provided, or at least estimates of skewness, bi-modality and complementary measures of uncertainty for comparison. Reporting a boxplot or the quartiles does a much better job at conveying the shape of the distribution than any of the other techniques. These representations are also robust to outliers. In the next figure, we consider a subsample of the observations from Figure 6, to which we add an outlier of increasing size: the quartiles do not move.

fig7-outliers_quartiles

Figure 7. Outlier effect on the quartiles. The y-axis is truncated.

Contrary to the quartiles, the classic confidence interval of the mean is not robust, so it provides very inaccurate results. In particular, it assumes symmetry, so even though the outlier is on the right side of the distribution, both sides of the confidence interval get larger. The mean is also  pulled towards the outlier, to the point where it is completely outside the bulk of the observations. I cannot stress this enough: you cannot trust mean estimates if scatterplots are not provided.

fig8-outliers_classic_ci

Figure 8. Outlier effect on the classic confidence interval of the mean.

In comparison, the percentile bootstrap confidence interval of the mean performs better: only its right side, the side affected by the outlier, expends as the outlier gets larger.

fig9-outliers_pbci_mean

Figure 9. Outlier effect on the percentile bootstrap confidence interval of the mean.

Of course, we do not have to use the mean as a measure of central tendency. It is trivial to compute a percentile bootstrap confidence interval of the median instead, which, as expected, does not change with outlier size:

fig10-outliers_pbci_median

Figure 10. Outlier effect on the percentile bootstrap confidence interval of the median.

Conclusion

The percentile bootstrap can be used to build a confidence interval for any quantity, whether its sampling distribution can be estimated analytically or not. However, there is no guarantee that the confidence interval obtained will be accurate. In fact, in many situations alternative methods outperform the percentile bootstrap (such as percentile-t, bias corrected, bias corrected & accelerated (BCa), wild bootstraps). With this caveat in mind, I think the percentile bootstrap remains an amazingly simple yet powerful tool to summarise the accuracy of an estimate given the variability in the data. It is also

the only frequentist tool that performs well in many situations – see Wilcox 2012 for an extensive coverage of these situations.

Finally, it is important to realise that the bootstrap does make a very strong & unwarranted assumption: only the observations in the sample can ever be observed. For this reason, for small samples the bootstrap can produce rugged sampling distributions, as illustrated above. Rasmus Bååth wrote about the limitations of the percentile bootstrap and its link to Bayesian estimation in a blog post I highly recommend; he also provided R code for the bootstrap and the Bayesian bootstrap in another post.

References

Efron, B. & Tibshirani Robert, J. (1993) An introduction to the bootstrap. Chapman & Hall, London u.a.

Mooney, C.Z. & Duval, R.D. (1993) Bootstrapping : a nonparametric approach to statistical inference. Sage Publications, Newbury Park, Calif. ; London.

Motulsky, H.J. (2014) Common misconceptions about data analysis and statistics. J Pharmacol Exp Ther, 351, 200-205.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, Amsterdam ; Boston.

Wilcox, R.R. & Keselman, H.J. (2003) Modern Robust Data Analysis Methods: Measures of Central Tendency. Psychological Methods, 8, 254-274.