Tag Archives: visualisation

R functions for the hierarchical shift function

The hierarchical shift function presented in the previous post is now available in the `rogme` R package. Here is a short demo.

Get the latest version of `rogme`:

# install.packages("devtools")
devtools::install_github("GRousselet/rogme")
library(rogme)
library(tibble)

Load data and compute hierarchical shift function:

df <- flp # get reaction time data - for details `help(flp)`
# Compute shift functions for all participants
out <- hsf(df, rt ~ condition + participant)
plot_hsf(out)

[Figure: hierarchical shift function, all participants]

Because of the large number of participants, the confidence intervals are too narrow to be visible. So let’s subset a random sample of participants to see what can happen with a smaller sample size:

set.seed(22) # subset random sample of participants
id <- unique(df$participant) 
df <- subset(df, participant %in% sample(id, 50, replace = FALSE))
out <- hsf(df, rt ~ condition + participant) 
plot_hsf(out)

[Figure: hierarchical shift function, random subsample of 50 participants]

Want to estimate the quartiles only?

out <- hsf(df, rt ~ condition + participant, qseq = c(.25, .5, .75))
plot_hsf(out)

[Figure: hierarchical shift function, quartiles only]

Want to reverse the comparison?

out <- hsf(df, rt ~ condition + participant, todo = c(2,1))
plot_hsf(out)

[Figure: hierarchical shift function, reversed comparison]

P values are here:

out$pvalues

P values adjusted for multiple comparisons using Hochberg’s method:

out$adjusted_pvalues 

Percentile bootstrap version:

set.seed(8899)
out <- hsf_pb(df, rt ~ condition + participant)

Plot bootstrap highest density intervals – default:

plot_hsf_pb(out) 

[Figure: bootstrap highest density intervals]

Plot distributions of bootstrap samples of group differences. Bootstrap distributions are shown in orange. Black dot marks the mode. Vertical black lines mark the 50% and 90% highest density intervals.

plot_hsf_pb_dist(out)

 

[Figure: distributions of bootstrap samples of group differences]

For more examples, a vignette is available on GitHub.

Feedback would be much appreciated: don’t hesitate to leave a comment or to get in touch directly.

Hierarchical shift function: a powerful alternative to the t-test

In this post I introduce a simple yet powerful method to compare two dependent groups: the hierarchical shift function. The code is on GitHub. More details are in Rousselet & Wilcox (2019), with a reproducibility package on figshare.

Let’s consider different situations in a hierarchical setting: we’ve got trials from 2 conditions in several participants. Imagine we collected data from one participant and the results look like this:

[Figure: example participant, reaction times in the two conditions]

These fake reaction time data were created by sampling from ex-Gaussian distributions. Here the two populations are shifted by a constant, so we expect a uniform shift between the two samples. Later we’ll look at examples in which differences are most pronounced in early responses, in late responses, and in spread.

To better understand how the distributions differ, let’s look at a shift function, in which the differences between the deciles of the two conditions are plotted as a function of the deciles in condition 1 – see details in Rousselet et al. (2017). The decile differences are all negative, showing stochastic dominance of condition 2 over condition 1. The function is not flat because of random sampling and limited sample size.

[Figure: shift function for the example participant]

Now, let’s say we collected 100 trials per condition from 30 participants. How do we proceed? There are a variety of approaches available to quantify distribution differences. Ideally, such data would be analysed using a multi-level model, including for instance ex-Gaussian fits, random slopes and intercepts for participants, item analyses… This can be done using the lme4 or brms R packages. However, in my experience, in neuroscience and psychology articles, the most common approach is to collapse the variability across trials into a single number per participant and condition in order to perform a paired t-test: typically, the mean is computed across trials for each condition and participant, the means are subtracted, and the distribution of mean differences is entered into a one-sample t-test. Obviously, this strategy throws away a huge amount of information! And the results of such second-tier t-tests are difficult to interpret: a positive test leaves us wondering exactly how the distributions differ; a negative test is ambiguous – besides avoiding the classic ‘absence of evidence is not evidence of absence’ error, we also need to check whether the distributions differ in aspects other than the mean. So what can we do?
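For concreteness, here is a minimal sketch of that standard approach, assuming a hypothetical data frame df with columns rt, condition (two levels) and participant – this is not code or data from the paper:

library(dplyr)
library(tidyr)
# per-participant, per-condition means, collapsed across trials
cell_means <- df %>%
  group_by(participant, condition) %>%
  summarise(m = mean(rt)) %>%
  ungroup() %>%
  pivot_wider(names_from = condition, values_from = m)
# one-sample t-test on the distribution of per-participant mean differences
t.test(cell_means[[2]] - cell_means[[3]])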

Depending on how conditions differ, looking at other aspects of the data than the mean can be more informative. For instance, in Rousselet & Wilcox (2019), we consider group comparisons of individual medians. Considering that the median is the second quartile, looking at the other quartiles can be of theoretical interest to investigate effects in early or later parts of distributions. This could be done in several ways, for instance by making inferences on the first quartile (Q1) or the third quartile (Q3). If the goal is to detect differences anywhere in the distributions, a more systematic approach consists in quantifying differences at multiple quantiles. Here we consider the case of the deciles, but other quantiles could be used. First, for each participant and each condition, the sample deciles are computed over trials. Second, for each participant, condition 2 deciles are subtracted from condition 1 deciles – we’re dealing with a within-subject (repeated-measure) design. Third, for each decile, the distribution of differences is subjected to a one-sample test. Fourth, a correction for multiple comparisons is applied across the 9 one-sample tests. I call this procedure a hierarchical shift function. There are many options available to implement this procedure and the example used here is not the definitive answer: the goal is simply to demonstrate that a relatively simple procedure can be much more powerful and informative than standard approaches.

In creating a hierarchical shift function we need to make three choices: a quantile estimator, a statistical test to assess quantile differences across participants, and a technique to correct for multiple comparisons. The deciles were estimated using type 8 from the base R quantile() function (see justification in Rousselet & Wilcox, 2019). The group comparisons were performed using a one-sample t-test on 20% trimmed means, which performs well in many situations, including in the presence of outliers. The correction for multiple comparisons employed Hochberg’s strategy (Hochberg, 1988), which guarantees that the probability of at least one false positive will not exceed the nominal level, as long as the nominal level is not exceeded for each quantile.
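To make the four steps concrete, here is a minimal sketch in base R, assuming a hypothetical data frame df with columns rt, participant, and condition with levels "cond1" and "cond2". It is not the implementation from the paper or from rogme, and a plain one-sample t-test stands in for the t-test on 20% trimmed means:

probs <- seq(0.1, 0.9, 0.1)
# steps 1-2: per-participant decile differences (condition 1 minus condition 2)
decile_diffs <- sapply(unique(df$participant), function(p) {
  x <- df$rt[df$participant == p & df$condition == "cond1"]
  y <- df$rt[df$participant == p & df$condition == "cond2"]
  quantile(x, probs, type = 8) - quantile(y, probs, type = 8)
})
# step 3: group test at each decile (placeholder for the trimmed-mean t-test)
pvals <- apply(decile_diffs, 1, function(d) t.test(d)$p.value)
# step 4: Hochberg correction across the 9 deciles
p.adjust(pvals, method = "hochberg")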

In Rousselet & Wilcox (2019), we consider power curves for the hierarchical shift function (HSF) and contrast them to other approaches: by design, HSF is sensitive to more types of differences than any standard approach using the mean or a single quantile. Another advantage of HSF is that the location of the distribution difference can be interrogated, which is impossible if inferences are limited to a single value.

Here is what the hierarchical shift function looks like for our uniform shift example:

[Figure: hierarchical shift function, uniform shift example]

The decile differences between conditions are plotted for each participant (colour coded) and the group 20% trimmed means are superimposed in black. Differences are pretty constant across deciles, suggesting a uniform shift. Most participants have entirely negative shift functions – a case of stochastic dominance of one condition over the other. There is growing uncertainty as we consider higher deciles, which is expected from measurements of right-skewed distributions.

We can add confidence intervals:

[Figure: hierarchical shift function with confidence intervals]

P values are available in the GitHub code.

Instead of standard parametric confidence intervals, we can also consider percentile bootstrap confidence intervals (or highest density intervals), as done here:

[Figure: hierarchical shift function with percentile bootstrap confidence intervals]

Distributions of bootstrap estimates can be considered cheap Bayesian posterior distributions. They also contain useful information not captured by simply reporting confidence intervals.

Here we plot them using geom_halfeyeh() from tidybayes. 
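As a rough sketch of this kind of plot (not the exact code behind the figure): recent versions of tidybayes have moved these geoms to the ggdist package, so the modern equivalent stat_halfeye() is used below, with a hypothetical long-format data frame boot_tm containing one bootstrap trimmed-mean estimate per row in columns decile and tm.

library(ggplot2)
library(ggdist)
ggplot(boot_tm, aes(x = tm, y = factor(decile))) +
  stat_halfeye(point_interval = mode_hdi, .width = c(.5, .9))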

[Figure: distributions of bootstrap estimates for each decile]

The distributions of bootstrap estimates of the group 20% trimmed means are shown in orange, one for each decile. Along the base of each distribution, the black dot marks the mode and the vertical lines mark the 50% and 90% highest density intervals.

Nice hey?! Reporting a figure like that is dramatically more informative than reporting a P value and a confidence interval from a t-test!

A bootstrap approach can also be used to perform a cluster correction for multiple comparisons – see details on GitHub. Preliminary simulations suggest that the approach can provide a substantial increase in power over Hochberg’s correction – more on that in another post.

Let’s look at 3 more examples, just for fun…

Example 2: early difference

Example participant: [figure]

Shift function: [figure]

Hierarchical shift function with confidence intervals: [figure]

Percentile bootstrap estimate densities: [figure]

Example 3: difference in spread

Example participant: [figure]

Shift function: [figure]

Hierarchical shift function with confidence intervals: [figure]

Percentile bootstrap estimate densities: [figure]

Example 4: late difference

Example participant: [figure]

Shift function: [figure]

Hierarchical shift function with confidence intervals: [figure]

Percentile bootstrap estimate densities: [figure]

Conclusion

The hierarchical shift function can be used to achieve two goals: 

  • to screen data for potential distribution differences using p values, without limiting the exploration to a single statistic such as the mean;
  • to illustrate and quantify how distributions differ.

I think of the hierarchical shift function as the missing link between t-tests and multi-level models. I hope it will help a few people make sense of their data and maybe nudge them towards proper hierarchical modelling.

R functions for the parametric hierarchical shift function are available in the rogme package. I also plan to add bootstrap functions. Then I’ll tackle the case of 2 independent groups, which requires a third level quantifying differences of differences.

 

Illustration of continuous distributions using quantiles

In this post I’m going to show you a few simple steps to illustrate continuous distributions. As an example, we consider reaction time data, which are typically positively skewed and can differ in many ways. Reaction time distributions are also a rich source of information to constrain cognitive theories and models, and this information is lost unless the distributions are at least illustrated – which is typically what happens when distributions are summarised using a single value like the mean. Other approaches not covered here include explicit mathematical models of decision making and fitting functions to model the shape of the distributions (Balota & Yap, 2011).

For our current example, I made up data for 2 independent groups with four patterns of differences:

  • no clear differences;

  • uniform shift between distributions;

  • mostly late differences;

  • mostly early differences.

The R code is on GitHub.

Scatterplots

For our first visualisation, we use geom_jitter() from ggplot2. The 1D scatterplots give us a good idea of how the groups differ but they’re not the easiest to read. The main reason is probably that we need to estimate local densities of points in different regions and compare them between groups.

[Figure: jittered 1D scatterplots for the four patterns of differences]

For the purpose of this exercise, each group (g1 and g2) is composed of 1,000 observations, so the differences in shapes are quite striking. With smaller sample sizes the evaluation of these graphs could be much more challenging.
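As a minimal sketch (with made-up right-skewed data, not the data from the GitHub code), a jittered 1D scatterplot can be produced like this:

library(ggplot2)
set.seed(1)
df <- data.frame(
  group = rep(c("g1", "g2"), each = 1000),
  rt = c(rexp(1000, rate = 1/300) + 300,   # right-skewed fake reaction times
         rexp(1000, rate = 1/300) + 350)
)
ggplot(df, aes(x = group, y = rt)) +
  geom_jitter(width = 0.2, alpha = 0.3)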

Kernel density plots

Relative to scatterplots, I find that kernel density plots make the comparisons between groups much easier.
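Continuing with the hypothetical df from the sketch above:

ggplot(df, aes(x = rt, colour = group)) +
  geom_density()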

[Figure: kernel density plots]

Improved scatterplots

Scatterplots and kernel density plots can be combined by using beeswarm plots. Here we create scatterplots shaped by local density using the geom_quasirandom() function from the ggbeeswarm package. Essentially, the function creates violin plots in which the constituent points are visible. 
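A minimal sketch with the same hypothetical df as above:

library(ggbeeswarm)
ggplot(df, aes(x = group, y = rt)) +
  geom_quasirandom(alpha = 0.3)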

[Figure: beeswarm scatterplots with superimposed deciles]

To make the plots even more informative, I’ve superimposed quantiles – here deciles computed using the Harrell-Davis quantile estimator. The deciles are represented by vertical black lines, with medians shown with thicker lines. Medians are informative about the location of the bulk of the observations, and comparing the lower to the upper quantiles lets us appreciate the amount of asymmetry within distributions. Comparing quantiles between groups gives us a sense of the amount of relative compression/expansion on each side of the distributions. This information would be lost if we only compared the medians.

Quantile plots

If we remove the scatterplots and only show the quantiles, we obtain quantile plots, which provide a compact description of how distributions differ (please post a comment if you know of older references using quantile plots). Because the quantiles are superimposed, they are easier to compare than in the previous scatterplots. To help with the group comparisons, I’ve also added plots of the quantile differences, which emphasise the different patterns of group differences.
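A rough sketch of the quantile difference part, using the base quantile() estimator instead of the Harrell-Davis estimator used for the figures, and the hypothetical df from above:

probs <- seq(0.1, 0.9, 0.1)
q1 <- quantile(df$rt[df$group == "g1"], probs, type = 8)
q2 <- quantile(df$rt[df$group == "g2"], probs, type = 8)
plot(probs, q2 - q1, type = "b",
     xlab = "Decile", ylab = "Quantile difference (g2 - g1)")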

[Figure: quantile plots and quantile difference plots]

Vincentile plots

Vincentiles are an alternative to quantiles: they are computed by sorting the data and splitting them into equi-populated bins (there is the same number of observations in each bin), then computing the mean of each bin (Balota et al. 2008; Jiang et al. 2004). Below, means were computed for 9 equi-populated bins. As expected from the way they are computed, quantile plots and Vincentile plots look very similar for our large samples from continuous variables.
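A minimal sketch of the computation (an illustration only, not code from the references):

vincentiles <- function(x, nbins = 9) {
  x <- sort(x)
  # cut the sorted observations into (approximately) equi-populated bins
  bins <- cut(seq_along(x), breaks = nbins, labels = FALSE)
  tapply(x, bins, mean)
}
vincentiles(df$rt[df$group == "g1"])  # hypothetical df from the sketches above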

[Figure: Vincentile plots]

Group quantile and Vincentile plots can be created by averaging quantiles and Vincentiles across participants (Balota & Yap, 2011; Ratcliff, 1979). This will be the topic of another post.

Delta plots

Related to quantile plots and Vincentile plots, delta plots show the difference between conditions, bin by bin (for each Vincentile) along the y-axis, as a function of the mean across conditions for each bin along the x-axis (De Jong et al., 1994). Not surprisingly, these plots have very similar shapes to the quantile difference plots we considered earlier. 
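A hedged sketch using the vincentiles() helper and hypothetical df from above – group difference per bin against the mean of the two groups per bin:

v1 <- vincentiles(df$rt[df$group == "g1"])
v2 <- vincentiles(df$rt[df$group == "g2"])
plot((v1 + v2) / 2, v2 - v1, type = "b",
     xlab = "Bin mean across groups", ylab = "Group difference (g2 - g1)")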

[Figure: delta plots]

Negative delta plots (nDP, delta plots with a negative slope) have received particular attention because of their theoretical importance (Ellinghaus & Miller, 2018; Schwarz & Miller, 2012).

Shift function

Delta plots are related to the shift function, a powerful tool introduced in the 1970s: it consists in plotting the difference between the quantiles of two groups as a function of the quantiles in one group, with some measure of uncertainty around the difference (Doksum, 1974; Doksum & Sievers, 1976; Doksum, 1977). It was later refined by Rand Wilcox (Rousselet et al. 2017). This modern version is shown below, with deciles estimated using the Harrell-Davis quantile estimator, and percentile bootstrap confidence intervals of the quantile differences. The sign of the difference is colour-coded (purple for negative, orange for positive).

[Figure: shift functions]

Unlike the other graphical quantile techniques presented here, the shift function affords statistical inferences because of its use of confidence intervals (the shift function also comes in a few Bayesian flavours). It is probably one of the easiest ways to compare entire distributions, without resorting to explicit models of the distributions. But the shift function and the other graphical methods demonstrated in this post are not meant to compete with hierarchical models. Instead, they can be used to better understand data patterns within and between participants, before modelling attempts. They also provide powerful alternatives to the mindless application of t-tests and bar graphs, helping to nudge researchers away from relying solely on the mean (or the median) and towards the rich information available in continuous distributions.

References

Balota, D.A. & Yap, M.J. (2011) Moving Beyond the Mean in Studies of Mental Chronometry: The Power of Response Time Distributional Analyses. Curr Dir Psychol Sci, 20, 160-166.

Balota, D.A., Yap, M.J., Cortese, M.J. & Watson, J.M. (2008) Beyond mean response latency: Response time distributional analyses of semantic priming. J Mem Lang, 59, 495-523.

Clarke, E. & Sherrill-Mix, S. (2016) ggbeeswarm: Categorical Scatter (Violin Point) Plots.

De Jong, R., Liang, C.C. & Lauber, E. (1994) Conditional and Unconditional Automaticity – a Dual-Process Model of Effects of Spatial Stimulus – Response Correspondence. J Exp Psychol Human, 20, 731-750.

Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Ann Stat, 2, 267-277.

Doksum, K.A. (1977) Some graphical methods in statistics. A review and some extensions. Statistica Neerlandica, 31, 53-68.

Doksum, K.A. & Sievers, G.L. (1976) Plotting with Confidence – Graphical Comparisons of 2 Populations. Biometrika, 63, 421-434.

Ellinghaus, R. & Miller, J. (2018) Delta plots with negative-going slopes as a potential marker of decreasing response activation in masked semantic priming. Psychol Res, 82, 590-599.

Jiang, Y., Rouder, J.N. & Speckman, P.L. (2004) A note on the sampling properties of the Vincentizing (quantile averaging) procedure. J Math Psychol, 48, 186-195.

Ratcliff, R. (1979) Group Reaction-Time Distributions and an Analysis of Distribution Statistics. Psychol Bull, 86, 446-461.

Rousselet, G.A., Pernet, C.R. & Wilcox, R.R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. The European journal of neuroscience, 46, 1738-1748.

Schwarz, W. & Miller, J. (2012) Response time models of delta plots with negative-going slopes. Psychon B Rev, 19, 555-574.

Correlations in neuroscience: are small n, interaction fallacies, lack of illustrations and confidence intervals the norm?

As reviewer, editor and reader of research articles, I’m regularly annoyed by the low standards in correlation analyses. In my experience with such articles, typically:

  • Pearson’s correlation, a non-robust measure of association, is used;
  • R and p values are reported, but not confidence intervals;
  • sample sizes tend to be small, leading to large estimation bias and inflated effect sizes in the literature;
  • R values and confidence intervals are not considered when interpreting the results;
  • instead, most analyses are reported as significant or non-significant (p<0.05), leading to the conclusion that an association exists or not (frequentist fallacy);
  • often figures illustrating the correlations are absent;
  • the explicit or implicit comparison of two correlations is done without a formal test (interaction fallacy).

To find out if my experience was in fact representative of the typical paper, I had a look at all papers published in 2017 in the European Journal of Neuroscience, where I’m a section editor. I care about the quality of the research published in EJN, so this is not an attempt at blaming a journal in particular, rather it’s a starting point to address a general problem. I really hope the results presented below will serve as a wake-up call for all involved and will lead to improvements in correlation analyses. Also, I bet if you look systematically at articles published in other neuroscience journals you’ll find the same problems. If you’re not convinced, go ahead, prove me wrong 😉 

I proceeded like this: for all 2017 articles (volumes 45 and 46), I searched for “correl” and I scanned for figures of scatterplots. If both searches came up negative, the article was categorised as not containing a correlation analysis, so I might have missed a few. When at least one correlation was present, I looked for these details:

  • n
  • estimator
  • confidence interval
  • R
  • p value
  • consideration of effect sizes
  • figure illustrating positive result
  • figure illustrating negative result
  • interaction test.

164 articles reported no correlation.

7 articles used regression analyses, with sample sizes as low as n = 6, n = 10 and n = 12 in 3 articles.

48 articles reported correlations.

Sample size

The norm was to not report degrees of freedom or sample size along with the correlation analyses or their illustrations. In 7 articles, the sample sizes were very difficult or impossible to guess. In the others, sample sizes varied a lot, both within and between articles. To confirm sample sizes, I counted the observations in scatterplots when they were available and not too crowded – this was a tedious job and I probably got some estimations and checks wrong. Anyway, I shouldn’t have to do all these checks, so something went wrong during the reviewing process. 

To simplify the presentation of the results, I collapsed the sample size estimates across articles. Here is the distribution: 

[Figure: distribution of sample sizes across articles]

The figure omits 3 outliers with n = 836, 1397 and 1407, all from the same article.

The median sample size is 18, which is far too low to provide sufficiently precise estimation.

Estimator

The issue with low sample sizes is made worse by the predominant use of Pearson’s correlation or the lack of consideration for the type of estimator. Indeed, 21 articles did not mention the estimator used at all, but presumably they used Pearson’s correlation.

Among the 27 articles that did mention which estimator was used:

  • 11 used only Pearson’s correlation;
  • 11 used only Spearman’s correlation;
  • 4 used Pearson’s and Spearman’s correlations;
  • 1 used Spearman’s and Kendall’s correlations.

So the majority of studies used an estimator that is well-known for its lack of robustness and its inaccurate confidence intervals and p values (Pernet, Wilcox & Rousselet, 2012).

R & p values

Most articles reported R and p values. Only 2 articles did not report R values. The same 2 articles also omitted p values, simply mentioning that the correlations were not significant. Another 3 articles did not report p values along with the R values.

Confidence interval

Only 3 articles reported confidence intervals, without mentioning how they were computed. 1 article reported percentile bootstrap confidence intervals for Pearson’s correlations, which is the recommended procedure for this estimator (Pernet, Wilcox & Rousselet, 2012).

Consideration for effect sizes

Given the lack of interest for measurement uncertainty demonstrated by the absence of confidence intervals in most articles, it is not surprising that only 5 articles mentioned the size of the correlation when presenting the results. All other articles simply reported the correlations as significant or not.

Illustrations

In contrast with the absence of confidence intervals and consideration for effect sizes, 23 articles reported illustrations for positive results. 4 articles reported only negative results, which leaves us with 21 articles that failed to illustrate the correlation results. 

Among the 40 articles that reported negative results, only 13 illustrated them, which suggests a strong bias towards positive results.

Interaction test

Finally, I looked for interaction fallacies (Nieuwenhuis, Forstmann & Wagenmakers 2011). In the context of correlation analyses, you commit an interaction fallacy when you present two correlations, one significant, the other not, implying that the 2 differ, but without explicitly testing the interaction. In other versions of the interaction fallacy, two significant correlations with the same sign are presented together, implying either that the 2 are similar, or that one is stronger than the other, without providing a confidence interval for the correlation difference. You can easily guess the other flavours… 

10 articles presented only one correlation, so there was no scope for the interaction fallacy. Among the 38 articles that presented more than one correlation, only one provided an explicit test for the comparison of 2 correlations. However, the authors omitted the explicit test for their next comparison!
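For reference, here is one hedged way to make such a comparison explicit – a percentile bootstrap confidence interval for the difference between two independent Spearman correlations, with hypothetical data frames d1 and d2 (columns x and y); this is not the procedure used in any of the surveyed articles:

set.seed(1)
nboot <- 2000
boot_diff <- replicate(nboot, {
  b1 <- d1[sample(nrow(d1), replace = TRUE), ]
  b2 <- d2[sample(nrow(d2), replace = TRUE), ]
  cor(b1$x, b1$y, method = "spearman") - cor(b2$x, b2$y, method = "spearman")
})
quantile(boot_diff, c(0.025, 0.975))  # if zero falls outside, evidence that the correlations differ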

Recommendations

In conclusion, at least in 2017 EJN articles, the norm is to estimate associations using small sample sizes and a non-robust estimator, to not provide confidence intervals and to not consider effect sizes and measurement uncertainty when presenting the results. Also, positive results are more likely to be illustrated than negative ones. Finally, interaction fallacies are mainstream.

How can we do a better job?

If you want to do a correlation analysis, consider your sample size carefully to assess statistical power and even better, your long-term estimation precision. If you have a small n, I wouldn’t even look at the correlation. 

Do not use Pearson’s correlation unless you have well-behaved and large samples, and you are only interested in linear relationships; otherwise explore robust measures of associations and techniques that provide valid confidence intervals (Pernet, Wilcox & Rousselet, 2012; Wilcox & Rousselet, 2018).
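As a minimal sketch along those lines (an illustration, not the toolbox from the references): Spearman’s correlation with a percentile bootstrap confidence interval, for hypothetical vectors x and y.

spearman_ci <- function(x, y, nboot = 2000, alpha = 0.05) {
  boot <- replicate(nboot, {
    id <- sample(length(x), replace = TRUE)
    cor(x[id], y[id], method = "spearman")
  })
  c(r = cor(x, y, method = "spearman"),
    quantile(boot, c(alpha / 2, 1 - alpha / 2)))
}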

Reporting

These details are essential in articles reporting correlation analyses:

  • sample size for each correlation;
  • estimator of association;
  • R value;
  • confidence interval;
  • scatterplot illustration of every correlation, irrespective of the p value;
  • explicit comparison test of all correlations explicitly or implicitly compared;
  • consideration of effect sizes (R values) and their uncertainty (confidence intervals) in the interpretation of the results.

 Report p values if you want but they are not essential and should not be given a special status (McShane et al. 2018).

Finally, are you sure you really want to compute a correlation?

“Why then are correlation coefficients so attractive? Only bad reasons seem to come to mind. Worst of all, probably, is the absence of any need to think about units for either variable. Given two perfectly meaningless variables, one is reminded of their meaninglessness when a regression coefficient is given, since one wonders how to interpret its value. A correlation coefficient is less likely to bring up the unpleasant truth—we think we know what r = -.7 means. Do we? How often? Sweeping things under the rug is the enemy of good data analysis. Often, using the correlation coefficient is “sweeping under the rug” with a vengeance. Being so disinterested in our variables that we do not care about their units can hardly be desirable.”
Analyzing data: Sanctification or detective work?

John W. Tukey.
 American Psychologist, Vol 24(2), Feb 1969, 83-91. http://dx.doi.org/10.1037/h0027108

 

References

McShane, B.B., Gal, D., Gelman, A., Robert, C. & Tackett, J.L. (2018) Abandon Statistical Significance. arxiv.

Nieuwenhuis, S., Forstmann, B.U. & Wagenmakers, E.J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci, 14, 1105-1107.

Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.

Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Frontiers in human neuroscience, 6, 119.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8.42.1-8.42.30.


A clearer explanation of the shift function

The shift function is a powerful tool to compare two marginal distributions. It’s covered in detail in this previous post. Below is a new illustration which might help better understand the graphical representation of the shift function. The R code to generate the figure is available in the README of the rogme package.

Panel A illustrates two distributions, both n = 1000, that differ in spread. The observations in the scatterplots were jittered based on their local density, as implemented in ggforce::geom_sina.

Panel B illustrates the same data as in panel A. The dark vertical lines mark the deciles of the distributions. The thicker vertical line in each distribution is the median. Between distributions, the matching deciles are joined by coloured lines. If the decile difference between group 1 and group 2 is positive, the line is orange; if it is negative, the line is purple. The values of the differences for deciles 1 and 9 are indicated in the superimposed labels.

Panel C focuses on the portion of the x-axis marked by the grey shaded area at the bottom of panel B. It shows the deciles of group 1 on the x-axis – the same values that are shown for group 1 in panel B. The y-axis shows the differences between deciles: the difference is large and positive for decile 1; it then progressively decreases to reach almost zero for decile 5 (the median); it becomes progressively more negative for higher deciles. Thus, for each decile the shift function illustrates by how much one distribution needs to be shifted to match another one. In our example, we illustrate by how much we need to shift deciles from group 2 to match deciles from group 1.

More generally, a shift function shows quantile differences as a function of quantiles in one group. It estimates how and by how much two distributions differ. It is thus a powerful alternative to the traditional t-test on means, which focuses on only one, non-robust, quantity. Quantiles are robust, intuitive and informative.

[Figure: panels A, B and C described above]

Problems with small sample sizes

In psychology and neuroscience, the typical sample size is too small. I’ve recently seen several neuroscience papers with n = 3-6 animals. For instance, this article uses n = 3 mice per group in a one-way ANOVA. This is a real problem because small sample size is associated with:

  • low statistical power

  • inflated false discovery rate

  • inflated effect size estimation

  • low reproducibility

Here is a list of excellent publications covering these points:

Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S. & Munafo, M.R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature reviews. Neuroscience, 14, 365-376.

Colquhoun, D. (2014) An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci, 1, 140216.

Forstmeier, W., Wagenmakers, E.J. & Parker, T.H. (2016) Detecting and avoiding likely false-positive findings – a practical guide. Biol Rev Camb Philos Soc.

Lakens, D., & Albers, C. J. (2017, September 10). When power analyses based on pilot data are biased: Inaccurate effect size estimators and follow-up bias. Retrieved from psyarxiv.com/b7z4q

See also these two blog posts on small n:

When small samples are problematic

Low Power & Effect Sizes

Small sample size also prevents us from properly estimating and modelling the populations we sample from. As a consequence, small n stops us from answering a fundamental, yet often ignored empirical question: how do distributions differ?

This important aspect is illustrated in the figure below. Columns show distributions that differ in four different ways. The rows illustrate samples of different sizes. The scatterplots were jittered using ggforce::geom_sina in R. The vertical black bars indicate the mean of each sample. In row 1, examples 1, 3 and 4 have exactly the same mean. In example 2 the means of the two distributions differ by 2 arbitrary units. The remaining rows illustrate random subsamples of data from row 1. Above each plot, the t value, mean difference and its confidence interval are reported. Even with 100 observations we might struggle to approximate the shape of the parent population. Without additional information, it can be difficult to determine if an observation is an outlier, particularly for skewed distributions. In column 4, samples with n = 20 and n = 5 are very misleading.

[Figure: distributions differing in four ways, sampled at different sizes]

Small sample size could be less of a problem in a Bayesian framework, in which information from prior experiments can be incorporated in the analyses. In the blind and significance obsessed frequentist world, small n is a recipe for disaster.

Matlab code for the shift function: a powerful tool to compare two entire marginal distributions

Recently, I presented R code for the shift function, a powerful tool to compare two entire marginal distributions.

The Matlab code is now available on github.

shifthd has the same name as its R version, which was originally programmed by Rand Wilcox and first documented in 1995. It computes a shift function for independent groups, using a percentile bootstrap estimation of the SE of the quantiles to compute confidence intervals.

shiftdhd is the version for dependent groups.

More recently, Wilcox introduced a new version of the shift function in which a straightforward percentile bootstrap is used to compute the confidence intervals, without estimation of the SE of the quantiles. This is implemented in Matlab as shifthd_pbci for independent groups (equivalent to qcomhd in R), and as shiftdhd_pbci for dependent groups (equivalent to Dqcomhd in R).

A demo file shift_function_demo is available here, along with the function shift_fig and dependencies cmu and UnivarScatter.

For instance, if we use the ozone data covered in the previous shift function post, a call to shifthd looks like this:

[xd, yd, delta, deltaCI] = shifthd(control,ozone,200,1);

producing this figure:

[Figure: shifthd output]

The output of shifthd, or any of the other 3 sf functions, can be used as input into shift_fig:

shift_fig(xd, yd, delta, deltaCI,control,ozone,1,5);

producing this figure:

[Figure: shift_fig output]

This is obviously work in progress, and shift_fig is meant as a starting point.

Have fun exploring how your distributions differ!

And if you have any question, don’t hesitate to get in touch.