
All your p-values are wrong

Or they don’t mean what you think, or they are not interpretable in most situations (Wagenmakers, 2007; Kruschke, 2013). Why is that? Let’s consider how a p-value is calculated. For simplicity, we focus on a one-sample one-sided t-test. Imagine we collected this sample of 50 observations.

The mean is indicated by the vertical solid line. Imagine our hypothesis is a mean of 70% correct. The t value is 1.52, and the p-value is 0.0670. We obtain the p-value by comparing our observed t value to a hypothetical distribution of t values obtained from imaginary experiments we will never carry out. The default approach is to assume a world in which there is no effect and in which we sample from a normal distribution. In each imaginary experiment, we draw a sample of n=50 observations from this null distribution and calculate t. Over an infinite number of imaginary experiments, we get this sampling distribution:

The p-value is the tail area highlighted in red, corresponding to the probability of observing imaginary t values at least as extreme as our observed t value under our model. Essentially, a p-value is a measure of surprise, which can be expressed as an s-value in bits (Greenland, 2019). For a p-value of 0.05, the s-value = -log2(0.05) = 4.32. That’s equivalent to flipping a coin 4 times and getting 4 heads in a row. A p-value can also be described as a continuous measure of compatibility between our model and our data, ranging from 0 for complete incompatibility, to 1 for complete compatibility (Greenland et al. 2016). This is key: p-values are not absolute, they change with the data and the model, and even with the experimenter’s intentions.
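The imaginary-experiments logic, and the conversion from p-value to s-value, can be sketched in a short simulation. This is a minimal sketch: the normal standard deviation is an arbitrary assumption, since under normality only the shape of the t distribution matters, not the scale.

```python
# Simulate the null sampling distribution of t for n=50, compute the
# tail area for the observed t, and convert the p-value to bits.
# The normal parameters (mean 70, sd 10) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)
n, n_sim, null_mean = 50, 100_000, 70
t_obs = 1.52  # observed t value from the text

# each imaginary experiment: draw n observations under the null, compute t
samples = rng.normal(loc=null_mean, scale=10, size=(n_sim, n))
t_null = (samples.mean(axis=1) - null_mean) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# one-sided p-value: proportion of imaginary t values at least as extreme
p_sim = np.mean(t_null >= t_obs)

# surprise expressed in bits (Greenland, 2019)
s_val = -np.log2(p_sim)
print(round(p_sim, 3), round(s_val, 2))  # close to the theoretical 0.067
```

With 100,000 iterations, the Monte-Carlo estimate lands very close to the theoretical one-sided p-value of 0.067, or roughly 3.9 bits of surprise.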

Let’s unpack this fundamental property of p-values. Our sample of n=50 scores is associated with a p-value of 0.067 under our model. This model includes:

  • a hypothesis of 70% correct;
  • sampling from a normal distribution;
  • independent samples;
  • fixed sample size of n=50;
  • a fixed statistical model, that is, a t-test is applied every time.

(The full model includes other assumptions, for instance that our sample is unbiased, that we have precise measurements, that our measure of interest is informative in the context of a causal model linking data and theory (Meehl, 1997), but we will ignore these aspects here.)

In practice some of these assumptions are incorrect, making the interpretation of p-values difficult.

Data-generating process

The scores in our sample do not come from a normal distribution. As is often the case for proportion data, they follow a beta distribution. Here is the population our data came from:

A boxplot suggests the presence of one outlier:

A Q-Q plot suggests some deviations from normality, but a Shapiro-Wilk test fails to reject:

Should we worry, or rely on the reassuring call to the central limit theorem and some hand-wavy statement about the robustness of the t-test and ANOVA? In practice, it is a bad idea to assume that empirical t distributions will match theoretical ones, because skewness and outliers can distort them substantially, even for relatively large sample sizes (Wilcox, 2022). In the one-sample case, ignoring skewness and outliers can lead to inflated false positives and low power (Wilcox & Rousselet, 2023; Rousselet & Wilcox, 2020).

In our case, we can simulate the t distribution under normality–Normal (sim) in the figure below, and compare it to the t distribution obtained when sampling from our beta population–Beta (sim). As a reference, the figure also shows the theoretical, non-simulated t distribution–Normal (theory). The simulation involved 100,000 iterations with n=50 samples.

Let’s zoom in to better compare the right side of the distributions:

The simulated t distribution under normality is a good approximation of the theoretical distribution. In practice, the t distribution obtained by sampling from the beta population is not accessible, because we typically don’t know exactly how the data were generated. Here we have full knowledge, so we can derive the correct t distribution for our data. Remember that the p-value from the standard t-test was 0.0670. Using our simulation under normality, the p-value is 0.0675. Using the t distribution obtained by sampling from the correct beta distribution, the p-value is now 0.0804.
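The simulation logic can be sketched as follows. Beta(7, 3), which has mean 0.7, is an illustrative stand-in for the true population, whose parameters are not specified here, so the exact p-values will differ from those reported above; the direction of the discrepancy is the point.

```python
# Compare the t sampling distribution when sampling from a normal
# population versus a skewed beta population. Beta(7, 3) is an
# illustrative assumption, not the actual population from the text.
import numpy as np

rng = np.random.default_rng(1)
n, n_sim = 50, 100_000

def t_values(samples, pop_mean):
    """One-sample t for each row of an (n_sim, n) matrix."""
    se = samples.std(axis=1, ddof=1) / np.sqrt(samples.shape[1])
    return (samples.mean(axis=1) - pop_mean) / se

# normal population centred on its own mean -> classic t distribution
t_norm = t_values(rng.normal(0.7, 0.1, size=(n_sim, n)), 0.7)
# skewed beta population centred on its own mean -> distorted t distribution
t_beta = t_values(rng.beta(7, 3, size=(n_sim, n)), 7 / (7 + 3))

# p-value for an observed t of 1.52 under each simulated distribution
t_obs = 1.52
print(np.mean(t_norm >= t_obs), np.mean(t_beta >= t_obs))
```

Because the beta population is left-skewed, its t distribution has a heavier right tail, so the upper-tail p-value is larger than under normality, matching the pattern described in the text.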

In most situations, p-values are calculated using inappropriate theoretical sampling distributions of t values. This might not affect the observed p-value much, but the correct p-value is unknown.

Independence

The independence assumption is violated whenever data-dependent exclusion is applied to the sample. For instance, it is very common for outliers to be identified and removed before applying a t-test or another frequentist inferential test. This is often done using a non-robust method, such as flagging observations more than 2 SD from the mean; more robust methods, such as a boxplot rule or a MAD-median rule, could also be used (Wilcox & Rousselet, 2023). Whatever the method, if the outliers are identified using the sample we want to analyse, the remaining observations are no longer independent, which affects the standard error of the test. This is well documented in the case of inferences about trimmed means (Tukey & McLaughlin, 1963; Yuen, 1974; Wilcox, 2022).

Trimmed means are robust estimators of central tendency that can boost statistical power in the presence of skewness and outliers. To calculate a 20% trimmed mean, we sort the data, remove the lowest 20% and the highest 20% (so 40% of observations in total), and average the remaining observations. This introduces a dependency among the remaining observations, which is taken into account in the calculation of the standard error. In other words, removing observations in a data-dependent manner, and then using a t-test as if the new, lower sample size were the one intended, is inappropriate. To illustrate the problem, we can run a simulation in which we sample from a normal or a beta population, each time take a sample of n=50, trim 20% of observations from each end of the distribution, and either apply the incorrect t-test to the remaining n=30 observations, or apply the t formula from Tukey & McLaughlin (1963; Wilcox, 2022) to the full sample. Here are the results:

T values computed on observations left after trimming are far too large. The discrepancy depends on the amount of trimming. Elegantly, the equation of the t-test on trimmed means reverts to the standard equation if we trim 0%. Of course, the amount of trimming should be pre-registered and not chosen after seeing the data.
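A minimal sketch of the correct approach, following the Tukey-McLaughlin formula: the standard error of the trimmed mean uses the winsorized variance of the full sample, and the degrees of freedom shrink with the amount of trimming. The Beta(7, 3) sample in the demo is an illustrative stand-in for the real data.

```python
# One-sample t for a trimmed mean with winsorized-variance SE
# (Tukey & McLaughlin, 1963; Wilcox, 2022). One-sided p-value,
# to match the tests used in the text.
import numpy as np
from scipy import stats

def trimmed_t(x, null_value, trim=0.2):
    """Return (t, one-sided p) for H0: population trimmed mean = null_value."""
    x = np.sort(np.asarray(x))
    n = len(x)
    g = int(np.floor(trim * n))          # observations trimmed from each end
    tmean = x[g:n - g].mean()
    # winsorize: replace trimmed values by the most extreme retained values
    xw = np.clip(x, x[g], x[n - g - 1])
    se = np.sqrt(xw.var(ddof=1)) / ((1 - 2 * trim) * np.sqrt(n))
    tval = (tmean - null_value) / se
    df = n - 2 * g - 1                   # reduced degrees of freedom
    return tval, stats.t.sf(tval, df)

rng = np.random.default_rng(0)
x = rng.beta(7, 3, 50)                   # illustrative sample
print(trimmed_t(x, 0.7))                 # t and p for H0: 20% trimmed mean = 0.7
```

With trim=0, the function reduces exactly to the standard one-sample t-test, as noted above.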

If we apply a t-test on means to our beta sample after trimming 20%, the (incorrect) p-value is 0.0007. The t-test on trimmed means returns p = 0.0256. That’s a large difference! Using the t distribution from the simulation in which we sampled from the correct beta population, we now get p = 0.0329. Also note that with the standard t-test on means applied to the full sample (p = 0.0670) we fail to reject, whereas we do reject for an inference on 20% trimmed means. In general, inferences on trimmed means tend to be more powerful than inferences on means in the presence of skewness or outliers. However, keep in mind that means and trimmed means are not interchangeable: they ask different questions about the populations. Sample means are used to make inferences about population means, and both are non-robust measures of central tendency. Sample trimmed means are used to make inferences about population trimmed means.

Now the problem is more complicated, and somewhat intractable, if instead of trimming a pre-registered amount of data, we apply an outlier detection method. In that case, independence is violated, but correcting the standard error is difficult because the number of removed observations is a random variable: it will change between experiments.

In our sample, we detect one outlier. Removing the outlier and applying a t-test on means, pretending that our sample size was always n-1, is inappropriate, although very common in practice. The standard error could be corrected using an equation similar to that used in the trimmed mean t-test. However, in other experiments we might reject a different number of outliers. Remember that p-values are not about our experiment: they reflect what could happen in other similar experiments that we will never carry out. In our example, we can do a simulation to derive a sampling distribution that matches the data generation and analysis steps. For each sample of n=50 observations from the beta distribution, we apply a boxplot rule, remove any outliers, and then compute a t value. If we simply remove the outlier from our sample, the t-test returns p = 0.0213. If instead we compute the p-value by using the simulated sampling distribution, we get p = 0.0866. That p-value reflects the correct data generating process and the fact that, in other experiments, we could have rejected a different number of outliers. Indeed, in the simulation the median number of rejected outliers is zero, the 3rd quartile is 1, and the maximum is 9.
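The simulation just described can be sketched as follows. Again, Beta(7, 3) stands in for the true population, whose parameters are not given, so the exact quantiles of the number of rejected outliers will differ from those reported above.

```python
# Fold the outlier-rejection step into the sampling distribution of t.
# Beta(7, 3) is an illustrative assumption; the boxplot rule is the
# classic 1.5 * IQR version.
import numpy as np

rng = np.random.default_rng(7)
n, n_sim = 50, 20_000
pop_mean = 7 / (7 + 3)  # mean of Beta(7, 3)

def boxplot_keep(x):
    """Keep observations inside the boxplot fences (1.5 * IQR rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]

t_null = np.empty(n_sim)
n_removed = np.empty(n_sim, dtype=int)
for i in range(n_sim):
    x = rng.beta(7, 3, n)
    kept = boxplot_keep(x)   # data-dependent exclusion, as in practice
    n_removed[i] = n - kept.size
    t_null[i] = (kept.mean() - pop_mean) / (kept.std(ddof=1) / np.sqrt(kept.size))

# the number of rejected "outliers" is itself a random variable
print(np.median(n_removed), n_removed.max())
```

A p-value for an observed t value would then be computed as a tail proportion of `t_null`, exactly as in the earlier simulations.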

In practice, we don’t have access to the correct t sampling distribution. However, we can get a good approximation by using a percentile bootstrap that incorporates the outlier detection and rejection step after sampling with replacement from the full dataset, and before calculating the statistic of interest (Rousselet, Pernet & Wilcox, 2021).
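A sketch of such a bootstrap, with an illustrative stand-in sample and the 1.5 * IQR boxplot rule as the outlier step: the key point is that the full dataset is resampled, and the whole pipeline, outlier rejection included, is reapplied inside the loop.

```python
# Percentile bootstrap that keeps the outlier-rejection step inside the
# resampling loop (as in Rousselet, Pernet & Wilcox, 2021). The data and
# the boxplot rule used here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
data = rng.beta(7, 3, 50)  # stand-in sample; replace with real data
n_boot = 5000

def clean_mean(x):
    """Mean after boxplot-rule outlier removal (1.5 * IQR fences)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    keep = x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]
    return keep.mean()

# resample the FULL dataset, then reapply the whole pipeline each time
boot = np.array([clean_mean(rng.choice(data, size=len(data), replace=True))
                 for _ in range(n_boot)])
ci = np.percentile(boot, [2.5, 97.5])  # 95% percentile bootstrap CI
print(ci)
```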

If outliers are expected and common, a good default strategy is to make inferences about trimmed means. Another approach is to make inferences about M-estimators in conjunction with a percentile bootstrap (Wilcox, 2022). M-estimators adjust the amount of trimming based on the data, instead of removing a pre-specified amount. Yet another approach is to fit a distribution with a tail parameter that can account for outliers (Kruschke, 2013). Or it might well be that what looks like an outlier is a perfectly legitimate member of a skewed or heavy-tailed distribution: use more appropriate models that account for rich distributional differences (Rousselet, Pernet, Wilcox, 2017; Farrell & Lewandowsky, 2018; Lindeløv, 2019).

Fixed sample size?

The t-test assumes that the sample size is fixed. This seems obvious, but in practice it is often not the case. As we saw in the previous example, sample sizes can depend on outlier rejection, a very common procedure that makes p-values uninterpretable. In general, data-dependent analyses will mess up traditional frequentist inferences (Gelman & Loken, 2014). Sample sizes can also be affected by certain inclusion criteria. For instance, data are included in the final analyses only for participants who scored high enough on a control attention check. Deriving correct p-values would require simulations of sampling distributions that incorporate the inclusion check. In other situations, sample sizes vary for reasons outside the experimenters’ control. For instance, data are collected in an online experiment until a deadline. In that case the final sample size is a surprise revealed at the end of the experiment, and is thus a random variable. Consequently, deriving a sampling distribution for a statistic of interest requires another sampling distribution of the plausible sample sizes that could have been obtained. The sampling distribution, say for a t value, would be calculated by integrating over the sampling distribution of sample sizes, and over any other sources of variability, such as the different plausible numbers of outliers that could have been removed, even if none were removed from our sample. Failure to account for these sources of variability leads to incorrect p-values. It gets even more complicated in some situations: p-values also depend on our sampling intentions.

Imagine this scenario inspired by Kruschke (2013), in which a supervisor asked two research assistants to collect data from n=8 participants in total. They misunderstood the instructions, and instead collected n=8 each, so a total of n=16. The plan was to do a one-sample t-test. What sample size should the research team use to compute the degrees of freedom: 8 or 16? So 7 df or 15 df? Here is a plot of the p-values as a function of the critical t values in the two situations.

The answer depends on the sampling distribution matching the data acquisition process, including the probability that the instructions are misunderstood (Kruschke, 2013). If we assume that a misunderstanding leading to this specific error could occur in 10% of experiments, then the matching curve is the dashed one in the figure below, obtained by mixing the two curves for n=8 and n=16.

That’s right, even though the sample size is n=16, because it was obtained by accident and we intended to collect n=8, the critical t and the p-value are obtained from a distribution that is in-between the two for n=8 and n=16, but closer to n=8. This correct distribution reflects the long-run perspective of conducting imaginary experiments in which the majority would have led to n=8. Again, the p-value is not about the current experiment. This scenario reveals that p-values depend on intentions, which has consequences in many situations. In practice, all the points raised so far demonstrate that p-values in most situations are necessarily inaccurate and very difficult to interpret.
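Under the stated 10% misunderstanding rate, the mixture tail probability can be computed directly; the observed t value of 2.0 below is an arbitrary illustrative choice.

```python
# Mixture idea from Kruschke (2013): if the misunderstanding (and hence
# n=16, df=15) occurs in 10% of imaginary experiments, and n=8 (df=7)
# in the other 90%, the relevant tail probability mixes the two
# t distributions.
from scipy import stats

def mixed_p(t_obs, p_misunderstand=0.10):
    """One-sided tail probability under the df=7 / df=15 mixture."""
    return ((1 - p_misunderstand) * stats.t.sf(t_obs, df=7)
            + p_misunderstand * stats.t.sf(t_obs, df=15))

t_obs = 2.0
p_mix = mixed_p(t_obs)
# p_mix lies between the two pure tail probabilities, closer to the df=7 one
print(round(p_mix, 4),
      round(stats.t.sf(t_obs, df=7), 4),
      round(stats.t.sf(t_obs, df=15), 4))
```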

Conditional analyses

Another common way to mess up the interpretation of our analyses is to condition one analysis on another. For instance, it is common practice to conduct a test of normality: if the test rejects, apply a rank-based test; if it fails to reject, apply a t-test. Testing the data for normality is a bad idea for many reasons, including that it makes the subsequent statistical tests conditional on the outcome of the normality test. Again, unless we can simulate the appropriate conditional sampling distribution for our statistic, our p-value will be incorrect. Similarly, anyone tempted to use such an approach would need to justify sample sizes using a power simulation that includes the normality step, and any other step that affects the sampling distribution of the statistic. In my experience, all conditional steps are typically ignored in power analyses and pre-registrations. It’s not just p-values: all power analyses are wrong too.

No measurement error?

It gets worse. Often, t-tests and similar models are applied to data that have been averaged over repetitions, for instance mean accuracy or reaction times averaged over trials in each condition and participant. In this common situation, the t-test ignores measurement error, because all within-participant variability has been wiped out. In such situations, mixed-effects (hierarchical) models should be used (DeBruine & Barr, 2021). Using a t-test instead of a mixed-effects model is equivalent to using a mixed-effects model in which the trial-level data have been copied and pasted an infinite number of times, such that measurement precision becomes infinite. This is powerfully illustrated here.

Conclusion

In most articles, the p-values are wrong. How they would change using appropriate sampling distributions is hard to determine, and ultimately a futile exercise. Even if the p-values changed very little, this uncertainty makes the obsession with declaring “statistical significance” whenever p<0.05, no matter how close the p-value is to the threshold, all the more ridiculous. So the next time you read in an article that “there was a trend towards significance, p=0.06”, or some other nonsense, in addition to asking the authors if they pre-registered a threshold for a trend, and asking them to also write “a trend towards non-significance, p=0.045”, point out that the p-value matching their design, analyses, and data generating process is likely to be different from the one reported.

What can we do? A plan of action, from easy to hard:

[1] Take a chill pill, and consider p-values as just one of many outputs, without a special status (Vasishth & Gelman, 2021). Justify your choices in the methods section, unlike the traditional article in which tests pop up out of the blue in the results section, with irrational focus on statistical significance.

[2] Use bootstrap methods to derive more appropriate sampling distributions. Bootstrap methods, combined with robust estimators, can boost statistical power and help you answer more interesting questions. These methods also let you include preprocessing steps in the analyses, unlike standard parametric methods.

[3] Pre-register everything, along with careful justifications of models, pre-processing steps, and matching power simulations.

[4] Abandon the chase for statistical significance. Instead of focusing on finding effects, focus on a model-centric approach (Devezer & Buzbas, 2023). The goal is to contrast models that capture different hypotheses or mechanisms by assessing how they explain or predict data (Farrell & Lewandowsky, 2018; Gelman, Hill & Vehtari, 2020; James et al., 2021; McElreath, 2020; Yarkoni & Westfall, 2017). What is the explanatory power of the models? What is their predictive accuracy?

Code

https://github.com/GRousselet/blog-pwrong

References

DeBruine, L., & Barr, D. (2021). Understanding Mixed-Effects Models Through Data Simulation. Advances in Methods and Practices in Psychological Science, 4(1), 2515245920965119. https://doi.org/10.1177/2515245920965119

Devezer, B., & Buzbas, E. O. (2023). Rigorous exploration in a model-centric science via epistemic iteration. Journal of Applied Research in Memory and Cognition, 12(2), 189–194. https://doi.org/10.1037/mac0000121

Farrell, S., & Lewandowsky, S. (2018). Computational Modeling of Cognition and Behavior. Cambridge University Press. https://doi.org/10.1017/CBO9781316272503

Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and Other Stories. Cambridge University Press. https://doi.org/10.1017/9781139161879

Gelman, A., & Loken, E. (2014). The Statistical Crisis in Science. American Scientist, 102(6), 460–465. https://www.jstor.org/stable/43707868

Greenland, S. (2019). Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values. The American Statistician, 73(sup1), 106–114. https://doi.org/10.1080/00031305.2018.1529625

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R. Springer US. https://doi.org/10.1007/978-1-0716-1418-1

Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology. General, 142(2), 573–603. https://doi.org/10.1037/a0029146

Lindeløv, J. K. (2019). Reaction time distributions: An interactive overview. https://lindeloev.github.io/shiny-rt/

McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and STAN (2nd edn). Chapman and Hall/CRC. https://doi.org/10.1201/9780429029608

Meehl, P. E. (1997). The Problem is Epistemology, Not Statistics: Replace Significance Tests by Confidence Intervals and Quantify Accuracy of Risky Numerical Predictions. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What If There Were No Significance Tests? Psychology Press. https://meehl.umn.edu/sites/meehl.umn.edu/files/files/169problemisepistemology.pdf

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46(2), 1738–1748. https://doi.org/10.1111/ejn.13610

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2021). The Percentile Bootstrap: A Primer With Step-by-Step Instructions in R. Advances in Methods and Practices in Psychological Science, 4(1), 2515245920911881. https://doi.org/10.1177/2515245920911881

Rousselet, G. A., & Wilcox, R. R. (2020). Reaction Times and other Skewed Distributions: Problems with the Mean and the Median. Meta-Psychology, 4. https://doi.org/10.15626/MP.2019.1630

Tukey, J. W., & McLaughlin, D. H. (1963). Less Vulnerable Confidence and Significance Procedures for Location Based on a Single Sample: Trimming/Winsorization 1. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 25(3), 331–352. JSTOR. https://www.jstor.org/stable/25049278

Vasishth, S., & Gelman, A. (2021). How to embrace variation and accept uncertainty in linguistic and psycholinguistic data analysis. Linguistics, 59(5), 1311–1342. https://doi.org/10.1515/ling-2019-0051

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804. https://doi.org/10.3758/BF03194105

Wilcox, R. R. (2022). Introduction to Robust Estimation and Hypothesis Testing (5th edn). Academic Press.

Wilcox, R. R., & Rousselet, G. A. (2023). An Updated Guide to Robust Statistical Methods in Neuroscience. Current Protocols, 3(3), e719. https://doi.org/10.1002/cpz1.719

Yarkoni, T., & Westfall, J. (2017). Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning. Perspectives on Psychological Science, 12(6), 1100–1122. https://doi.org/10.1177/1745691617693393

Yuen, K. K. (1974). The Two-Sample Trimmed t for Unequal Population Variances. Biometrika, 61(1), 165–170. https://doi.org/10.2307/2334299