…as a teaching example and should be avoided in research.

These statements are common in the psychology and neuroscience literature:

“In order to assess the normal distribution of the population in terms of age, BV% and CSF%, the Lilliefors-corrected Kolmogorov–Smirnov test was performed” (Porcu et al. 2019)

“The Kolmogorov–Smirnov-Test revealed a normal distribution (p = 0.82).” (Knolle et al. 2019)

“The distribution was not normal (P < 0.01 with the Shapiro–Wilk test).” (Beaudu-Lange et al. 2001)

“Assumptions of the one-way anova for normality were also confirmed with the Shapiro–Wilk test.” (Holloway et al. 2015)

“The Shapiro-Wilk-W-test (P < 0.05) revealed that all distributions could be assumed to be Gaussian as a prerequisite for the application of a t-test.” (Dicke et al. 2008)

“Given the non-normal distribution of such data (Shapiro–Wilk’s p < .05), we applied a nonparametric one-sample t test (the one-sample Wilcoxon signed rank test).” (Zapparoli et al. 2019)

A common recipe goes like this:

- apply a normality test;
- if p>0.05, conclude that the data are normally distributed and proceed with a parametric test;
- if p<0.05, conclude that the data are not normally distributed and proceed with a non-parametric test (or transform the data to try to achieve normality).
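The recipe can be made concrete with a minimal sketch in Python using scipy (simulated data; this illustrates the problematic procedure, it is not a recommendation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 30)   # simulated group 1
y = rng.normal(0.5, 1.0, 30)   # simulated group 2

# Step 1: preliminary normality test on each group
p_norm = min(stats.shapiro(x).pvalue, stats.shapiro(y).pvalue)

# Steps 2-3: branch on the normality p value
if p_norm > 0.05:
    result = stats.ttest_ind(x, y)       # "normal" -> parametric test
else:
    result = stats.mannwhitneyu(x, y)    # "not normal" -> non-parametric test
```

Every issue listed below applies to some step of this short script.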

It is a useful exercise or class activity to consider the statements above with the goal of identifying all the underlying issues. It could take several hours of teaching to do justice to the rich topics we need to cover to properly understand these issues.

Here is a succinct and non-exhaustive list of issues, with references for follow-up readings:

**[1]** In the general context of linear regression, the normality assumption applies to the residuals, not the marginal distributions. The main solution involves graphical checks of the residuals (Ernst & Albers, 2017; Vanhove, 2018).

Resources for graphical checks:

- Visualization of Regression Models Using visreg
- Visualizing regression model predictions
- Extracting and visualizing tidy residuals from Bayesian models
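To illustrate the distinction, here is a minimal sketch (numpy only; variable names are mine) showing that the residuals, not the marginal distribution of the outcome, carry the normality assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 + 0.5 * x + rng.normal(0, 1, 100)   # simulated linear data

# fit a simple linear regression and extract the residuals
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

# the normality assumption concerns `resid`, not the marginal distribution
# of y; in practice, inspect a QQ-plot or density plot of the residuals
```

Here the marginal distribution of `y` is far from normal (it mixes the uniform predictor with the noise), yet the model's residuals are exactly the kind of quantity a graphical check should target.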

Other solutions involve model comparison, to contrast models making different assumptions, and using models robust to assumption violations (Bürkner, 2017; Kruschke, 2013; Wilcox & Rousselet, 2018).

**[2]** The p value from standard frequentist tests, such as normality tests, cannot be used to accept the null (Rouder et al., 2016; Kruschke, 2018). Because the p value is computed under the assumption that the null is true, it cannot in turn be used to support the null: that reasoning is circular. To find support for the null, we need an alternative hypothesis, either to compute a Bayes factor (Rouder et al., 2016; Wagenmakers et al., 2020) or to define a Region of Practical Equivalence (ROPE) for a test of equivalence (Freedman et al., 1984; Kruschke, 2018; Lakens, 2017; Campbell & Gustafson, 2022). Setting an alternative hypothesis is also crucial to get a consistent test (Rouder et al., 2016; Kruschke & Liddell, 2018). Tests of normality, like all Point Null Hypothesis Significance Tests (PNHST), are inconsistent: given alpha = 0.05, even if normality holds, 5% of tests will be positive no matter how large the sample size is.
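To make the equivalence idea concrete, here is a hedged sketch of a two one-sided tests (TOST) procedure for a one-sample mean; the ROPE bounds and effect size are arbitrary illustrative values, not recommendations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.02, 1.0, 200)   # simulated data with a negligible true effect

low, high = -0.2, 0.2            # ROPE bounds (illustrative, in raw units)
n, m = len(x), x.mean()
se = x.std(ddof=1) / np.sqrt(n)

# two one-sided t tests: is the mean above `low` AND below `high`?
p_lower = stats.t.sf((m - low) / se, df=n - 1)
p_upper = stats.t.cdf((m - high) / se, df=n - 1)
p_tost = max(p_lower, p_upper)   # small p -> mean practically equivalent to 0
```

Unlike a plain p>0.05, a small `p_tost` is positive evidence that the effect lies inside the region we declared practically equivalent to zero.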

**[3]** Failure to reject (p>0.05) doesn’t mean data were sampled from a normal distribution. Another function could fit the data equally well (for instance a shifted lognormal distribution). This point follows directly from [2]. Since our alternative hypothesis is extremely vague, the possibility of another distribution being a plausible data generation process is ignored: the typical test considers only a point null hypothesis versus “anything else”. So when we ask a very vague question, we can only get a very vague answer (there is no free lunch in inference – Rouder et al., 2016).

**[4]** Failure to reject (p>0.05) could be due to low power. This is well known but usually ignored. Here are the results of simulations to illustrate this point. The code is available on GitHub. We sample from *g-and-h* distributions (Yan & Genton, 2019), which let us vary asymmetry (parameter `g`) and tail thickness (parameter `h`, which also affects how peaky the distribution is). We start by varying `g`, keeping a constant `h=0`.
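Sampling from a g-and-h distribution is straightforward; a sketch of the standard Tukey transform (in Python rather than the original R):

```python
import numpy as np

def g_and_h(z, g=0.0, h=0.0):
    """Apply the Tukey g-and-h transform to standard normal draws z."""
    x = z if g == 0 else (np.exp(g * z) - 1) / g   # g: asymmetry
    return x * np.exp(h * z**2 / 2)                # h: tail thickness

rng = np.random.default_rng(3)
z = rng.standard_normal(10_000)
sample = g_and_h(z, g=0.3, h=0.0)   # right-skewed, normal-like tails
```

With `g=0` and `h=0` the transform is the identity, so the samples are standard normal; increasing `g` pushes the right tail out, as the mean exceeding the median shows.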

Here are results for the Shapiro-Wilk test, based on a simulation with 10,000 iterations.
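The simulation logic can be sketched as follows (a smaller-scale Python reimplementation, not the original R code; iteration counts are reduced for speed):

```python
import numpy as np
from scipy import stats

def sw_rejection_rate(g, n, n_iter=2000, alpha=0.05, seed=0):
    """Proportion of Shapiro-Wilk rejections for g-and-h samples with h=0."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_iter):
        z = rng.standard_normal(n)
        x = z if g == 0 else (np.exp(g * z) - 1) / g   # g-and-h, h=0
        rejections += stats.shapiro(x).pvalue < alpha
    return rejections / n_iter

# g=0: the rate estimates the false positive rate (~5%)
# g>0: the rate estimates power, which grows with g and with n
```

Running this over a grid of `g` and `n` values reproduces the qualitative pattern described below.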

The Shapiro-Wilk test has low power unless the departure from normality is pronounced, or sample sizes are large. With small departures from normality (say `g=0.1`, `g=0.2`), achieving high power won't be possible with typical sample sizes in psychology and neuroscience. For `g=0`, the proportion of false positives is at the expected 5% level (false positive rate).

The Kolmogorov-Smirnov test is dramatically less powerful than the Shapiro-Wilk test (Yap & Sim, 2011).
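A related trap worth flagging: applying the Kolmogorov-Smirnov test with a mean and standard deviation estimated from the data, but without the Lilliefors correction, is invalid and makes the already weak test even more conservative. A sketch of the naive call:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.standard_normal(50)

# Naive KS test against a normal whose parameters were estimated from x:
# without the Lilliefors correction, these p values are too large.
p_naive = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1))).pvalue
```

If a corrected version is needed, statsmodels provides a `lilliefors` function in `statsmodels.stats.diagnostic` (assumed available in recent versions).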

What happens if we sample from symmetric distributions that are more prone to outliers than the normal distribution? By varying the `h` parameter, keeping a constant `g=0`, we can consider distributions that are progressively more kurtotic than the normal distribution.
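A quick way to see the effect of `h` on tail thickness (sample excess kurtosis grows with `h`; the values below are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
z = rng.standard_normal(100_000)

kurt = {}
for h in (0.0, 0.1, 0.2):
    x = z * np.exp(h * z**2 / 2)   # g=0: symmetric, tails thicken with h
    kurt[h] = stats.kurtosis(x)    # excess kurtosis (0 for a normal)
```

At `h=0` the excess kurtosis is near zero, and it increases steeply with `h`, which is exactly the kind of departure the next simulations probe.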

Are the tests considered previously able to detect such deviations from normality? Here is how the Shapiro-Wilk test behaves.

And here are the catastrophic results for the Kolmogorov-Smirnov test.

**[5]** As the sample size increases, progressively smaller deviations from normality can be detected, eventually reaching absurd levels of precision, such that tiny differences of no practical relevance will be flagged. This point applies to all PNHST and again follows from [2]: because no alternative is considered in PNHST, tests are biased against the null (Rouder et al., 2016; Wagenmakers et al., 2020). Even when p<0.05, contrasting two hypotheses could reveal that a normal distribution and a non-normal distribution are equally plausible, given our data. Also, because PNHST is not consistent, even when the null is true, 5% of tests will be positive.

**[6]** Choosing a model conditional on the outcome of a preliminary check affects sampling distributions, and thus p values and confidence intervals. The same problem arises when performing balance tests. If a t-test is conditional on a normality test, the p value of the t-test will differ (by an unknown amount) from the one obtained if the t-test is performed without a preliminary check. That's because p values depend on sampling distributions of imaginary experiments, which in turn depend on sampling and testing intentions (Wagenmakers, 2007; Kruschke & Liddell, 2018). This dependence can make p values difficult to interpret, because unless we simulate the world of possibilities that led to our p value, the sampling distribution for our statistic (say a t statistic) is unknown.
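One way to appreciate this point is to simulate the full conditional pipeline and study its behaviour as a whole; a sketch (reduced iteration count; the estimated error rate belongs to the two-stage pipeline, not to either test alone):

```python
import numpy as np
from scipy import stats

def two_stage_fpr(n=20, n_iter=1000, alpha=0.05, seed=6):
    """False positive rate of 'normality test, then branch' when both
    groups come from the same (lognormal) population."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_iter):
        x = rng.lognormal(size=n)
        y = rng.lognormal(size=n)
        if min(stats.shapiro(x).pvalue, stats.shapiro(y).pvalue) > alpha:
            p = stats.ttest_ind(x, y).pvalue      # branch if "normal"
        else:
            p = stats.mannwhitneyu(x, y).pvalue   # branch otherwise
        rejections += p < alpha
    return rejections / n_iter
```

Only this kind of whole-pipeline simulation reveals the actual sampling distribution of the reported p values, because the branching is part of the procedure.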

**[7]** When non-normality is detected or suspected, a classic alternative to the two-sample t-test is the Wilcoxon-Mann-Whitney (WMW) test. However, in general, different tests or models address different hypotheses: they are not interchangeable. For instance, the WMW U statistic is related to the distribution of all pairwise differences between two independent groups; unlike the t-test, it does not involve a comparison of the marginal means. Similarly, if instead of the mean we use a trimmed mean, a robust measure of central tendency, our inferences are about the population trimmed mean, not the population mean.
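The link between the U statistic and pairwise comparisons can be checked directly (scipy's `mannwhitneyu` returns the U statistic of the first sample; simulated data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, 40)
y = rng.normal(0.5, 1.0, 40)

u = stats.mannwhitneyu(x, y).statistic
# With no ties, U counts the pairs where x_i > y_j,
# so U / (n_x * n_y) estimates P(X > Y), not a difference in means
prop = np.mean(x[:, None] > y[None, :])
```

So switching from a t-test to WMW silently changes the question from "do the means differ?" to "is a random draw from one group likely to exceed a random draw from the other?".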

**[8]** In most cases, researchers know the answer to the normality question before conducting the experiment. For instance, we know that reaction times, accuracy and questionnaire data are not normally distributed. Testing for normality when we already know the answer is unnecessary and falls into the category of tautological tests. Since we know the answer in most situations, it is better practice to use appropriate models and drop the checks altogether. For instance, accuracy data follow beta-binomial distributions (Jaeger, 2008; Kruschke, 2014); questionnaire data can be modelled using ordinal regression (Liddell & Kruschke, 2018; Bürkner & Vuorre, 2019; Taylor et al., 2022); reaction time data can be modelled using several families of skewed distributions (Lindeløv, 2019).
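For instance, accuracy data can be simulated (and modelled) as a beta-binomial process; a sketch with illustrative parameter values of my choosing:

```python
import numpy as np

rng = np.random.default_rng(8)

# per-subject accuracy drawn from a Beta, correct trials from a Binomial
a, b, n_trials, n_subjects = 8.0, 2.0, 40, 100   # illustrative values
p_i = rng.beta(a, b, size=n_subjects)            # subject-level accuracies
correct = rng.binomial(n_trials, p_i)            # correct trials per subject
acc = correct / n_trials                         # bounded, skewed near ceiling
```

The resulting proportions are bounded in [0, 1] and pile up near ceiling, so no normality test is needed to know a Gaussian model is the wrong starting point.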

# References

Bürkner, P.-C. (2017). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01

Bürkner, P.-C. & Vuorre, M. (2019) Ordinal Regression Models in Psychology: A Tutorial. Advances in Methods and Practices in Psychological Science, 2, 77–101. https://journals.sagepub.com/doi/full/10.1177/2515245918823199

Campbell, H. & Gustafson, P. (2021) re:Linde et al. (2021): The Bayes factor, HDI-ROPE and frequentist equivalence tests can all be reverse engineered – almost exactly – from one another. https://arxiv.org/abs/2104.07834

Ernst AF, Albers CJ. 2017. Regression assumptions in clinical psychology research practice—a systematic review of common misconceptions. PeerJ 5:e3323 https://doi.org/10.7717/peerj.3323

Freedman, L.S., Lowe, D., & Macaskill, P. (1984) Stopping rules for clinical trials incorporating clinical opinion. Biometrics, 40, 575–586.

Jaeger, T.F. (2008) Categorical Data Analysis: Away from ANOVAs (transformation or not) and towards Logit Mixed Models. J Mem Lang, 59, 434–446.

Kruschke, J.K. (2013) Bayesian estimation supersedes the t test. J Exp Psychol Gen, 142, 573–603.

Kruschke, J.K. (2014) Doing Bayesian Data Analysis, 2nd Edition. edn. Academic Press.

Kruschke, J.K. (2018) Rejecting or Accepting Parameter Values in Bayesian Estimation. Advances in Methods and Practices in Psychological Science, 1, 270–280.

Kruschke, J.K. & Liddell, T.M. (2018) The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychon Bull Rev, 25, 178–206.

Lakens, D. (2017). Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177

Liddell, T.M. & Kruschke, J.K. (2018) Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328–348.

Lindeløv, J.K. (2019) Reaction time distributions: an interactive overview. https://lindeloev.github.io/shiny-rt/

Rouder, J.N., Morey, R.D., Verhagen, J., Province, J.M. and Wagenmakers, E.-J. (2016), Is There a Free Lunch in Inference?. Top Cogn Sci, 8: 520-547. https://doi.org/10.1111/tops.12214

Taylor, J.E., Rousselet, G.A., Scheepers, C. et al. Rating norms should be calculated from cumulative link mixed effects models. Behav Res (2022). https://doi.org/10.3758/s13428-022-01814-7


Vanhove (2018) Checking model assumptions without getting paranoid. https://janhove.github.io/analysis/2018/04/25/graphical-model-checking

Wagenmakers, E.-J. (2007) A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804.

Wagenmakers, E.-J., Lee, M.D., Rouder, J.N., & Morey, R.D. (2020) The Principle of Predictive Irrelevance or Why Intervals Should Not be Used for Model Comparison Featuring a Point Null Hypothesis. In Gruber, C.W. (ed), The Theory of Statistics in Psychology: Applications, Use, and Misunderstandings. Springer International Publishing, Cham, pp. 111–129.

Wilcox RR, Rousselet GA. A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci. 2018 Jan 22;82:8.42.1-8.42.30. doi: 10.1002/cpns.41. PMID: 29357109.

Yan, Y. & Genton, M.G. (2019) The Tukey g-and-h distribution. Significance, 16, 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x

Yap, B.W. & Sim, C.H. (2011) Comparisons of various types of normality tests. Journal of Statistical Computation and Simulation, 81, 2141–2155. DOI: 10.1080/00949655.2010.520163