Kaiser & Herzog (2025) offer a very useful tutorial on generating distribution-free prediction intervals using cross-validation methods. However, in covering this important topic, they get the definition of a confidence interval wrong. This is all the more annoying because their article appears in AMPPS, an influential methods and stats journal in psychology.
Here are the problematic statements:
[1] “For estimating a population parameter, such as the mean, the sample estimate is often given a confidence interval (CI). Following a probabilistic interpretation of CIs, it can be expected with a certain probability that the population parameter lies within this interval”
[2] “A 95% CI, on the other hand, has a different interpretation: It would indicate the range in which the average job performance for individuals with an IQ of 120 and an integrity score of 40 is likely to fall.”
These statements reflect a common misinterpretation of confidence intervals (Greenland et al. 2016; Hoekstra et al. 2014): a coverage of, say, 95% does not apply to the single interval obtained in that one experiment. This is easier to grasp with an illustration:

The figure shows the outcomes of 20 experiments, stacked along the y-axis, each sampling from the same population (a standard normal distribution). Along the x-axis, small dots show the individual observations in each experiment. The black disk marks the sample mean, and the horizontal line marks the bounds of the confidence interval. The vertical dashed line marks the population mean (zero). Because of sampling variability, the sample means and the associated confidence intervals vary across experiments. Confidence intervals in black include the population mean; those in grey exclude it.

For a given experiment, there is no probability associated with the confidence interval: it either contains the population value or it does not, so the outcome is one or zero, nothing in between. The coverage, that is, the probability of including the population value, only applies to an infinite number of imaginary experiments carried out in the same way as our experiment. So coverage is a long-run property of a recipe: every time we collect data in a certain way and compute a confidence interval, that interval either contains the population value or it does not. Which of the two happened is obviously unknown in practice, unless we run simulations in which we control the population value. Similarly, an individual experiment doesn't have statistical power: it either detects an effect or it does not. Power is a long-run property of a programme or area of research, considering an infinite number of imaginary experiments we will never actually carry out.
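If you want to play with the idea yourself, here is a minimal Python sketch in the same spirit. This is not the code behind the figure: the sample size of 30 observations per experiment and the random seed are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments = 20  # experiments along the y-axis of the figure
n = 30              # observations per experiment (an arbitrary choice here)
alpha = 0.05

tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1)
covered = 0
for i in range(n_experiments):
    sample = rng.standard_normal(n)  # population is N(0, 1), so the true mean is 0
    m = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = m - tcrit * sem, m + tcrit * sem  # standard t-based 95% CI
    inside = lo <= 0 <= hi                     # for THIS interval, the outcome is binary
    covered += inside
    print(f"experiment {i + 1:2d}: [{lo: .3f}, {hi: .3f}]  contains 0: {inside}")

print(f"{covered} / {n_experiments} intervals contain the population mean")
```

Run it a few times with different seeds: in the long run, about 1 interval in 20 misses zero, but for any single interval the answer is simply yes or no.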
And of course, the coverage of a confidence interval matches the nominal level only under certain circumstances, and the same goes for prediction intervals. In practice, when the model's assumptions are violated, a 95% confidence interval is unlikely to have exactly 95% coverage.
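To make that last point concrete, here is a small simulation (again my own sketch, not from the article): when the population is skewed, for example lognormal, and the sample size is small, the usual t-based 95% interval for the mean covers the population mean noticeably less often than 95% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sim = 50_000          # simulated experiments, to estimate long-run coverage
n = 20                  # observations per experiment
alpha = 0.05
pop_mean = np.exp(0.5)  # mean of a lognormal(mu=0, sigma=1) population

# One row per simulated experiment, sampling from a skewed population.
samples = rng.lognormal(mean=0.0, sigma=1.0, size=(n_sim, n))
m = samples.mean(axis=1)
sem = samples.std(axis=1, ddof=1) / np.sqrt(n)
tcrit = stats.t.ppf(1 - alpha / 2, df=n - 1)

inside = (m - tcrit * sem <= pop_mean) & (pop_mean <= m + tcrit * sem)
print(f"empirical coverage: {inside.mean():.3f} (nominal: {1 - alpha:.2f})")
# With n = 20 and this skewed population, the empirical coverage typically
# lands around 0.88-0.91, well below the nominal 0.95.
```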
Following Greenland (2019), it is more intuitive to describe confidence intervals as compatibility intervals: such intervals suggest population values highly compatible with the data, given our model. For more on this, there is a very useful discussion about appropriate reporting of frequentist statistics here.
One more time, a great, accurate, and timely post! 🙂
“Power is the long run property of a programme or area of research…”
I would argue that power, like confidence interval coverage, is a property of a statistical procedure.
Hi Noah, thanks for your comment. Sure, the statistical method is a key ingredient, but power doesn't exist in a vacuum: we need to know the type of data the model is applied to in order to compute power. "Programme of research" captures that aspect very well: for a certain type of study, we collect the same type of data, which are modelled in the same way.