S4 conference

Confirmed Speakers


A large fraction (by some estimates more than half) of published research in top journals in applied sciences such as medicine and psychology is irreproducible. In light of this ‘replicability crisis’, standard p-value based hypothesis testing has come under intense scrutiny. One of its many problems is the following: if our test result is promising but inconclusive (say, p = 0.07), we cannot simply decide to gather a few more data points. While this practice is ubiquitous in science, it invalidates p-values and their error guarantees.
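To make the problem concrete, here is a minimal Python simulation (not part of the talk; the batch size, number of looks, and the normal-approximation z-test are illustrative assumptions) showing how re-testing after gathering a few more data points inflates the Type-I error of a standard p-value test well beyond its nominal level:

```python
import math
import random

def z_test_p(heads, n, p0=0.5):
    """Two-sided normal-approximation p-value for testing 'the coin is fair'."""
    se = math.sqrt(p0 * (1 - p0) / n)
    z = abs(heads / n - p0) / se
    return math.erfc(z / math.sqrt(2))  # P(|Z| >= z) for standard normal Z

def run_trial(rng, batch=50, max_batches=10, alpha=0.05):
    """Flip a fair coin in batches, 'peeking' at the p-value after each batch
    and stopping (rejecting) as soon as p < alpha."""
    heads, n = 0, 0
    for _ in range(max_batches):
        heads += sum(rng.random() < 0.5 for _ in range(batch))
        n += batch
        if z_test_p(heads, n) < alpha:
            return True  # false rejection: the coin really is fair
    return False

rng = random.Random(0)
trials = 2000
false_rejections = sum(run_trial(rng) for _ in range(trials))
# With 10 looks, the realized Type-I error is far above the nominal 0.05.
print(false_rejections / trials)
```

With a single, fixed-sample-size test the false-rejection rate would be close to the nominal 5%; merely allowing up to ten looks at the accumulating data pushes it several times higher.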

Recently, we proposed an alternative hypothesis testing methodology which we call ‘safe testing’. The central concept is the S-value, a notion of evidence which, unlike the p-value, allows evidence from several tests to be combined effortlessly, even in the common scenario where the decision to perform a new test depends on previous test outcomes. ‘Safe’ tests based on S-values generally preserve error guarantees under such ‘optional continuation’, thereby potentially alleviating one of the main causes of the reproducibility crisis. The basic idea is not completely new – similar ideas were put forward by e.g. Robbins (1960s) and Vovk and Shafer (2000s). Yet in the past it could essentially only be applied to problems with a ‘simple’ null hypothesis (no free parameters, e.g. testing whether a coin is fair) – even though nearly all tests used in practice, such as the t-test, independence tests and nonparametric tests, have nonsimple, ‘composite’ null hypotheses (i.e. free parameters). Our breakthrough is that we found out how to construct safe tests for arbitrary composite testing scenarios. It turns out that some S-values can also be interpreted as Bayes factors, and vice versa – but not all S-values are Bayes factors, and not all Bayes factors yield safe tests.
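The combination property can be sketched in a few lines of Python for the simplest possible case: a simple null (a fair coin) with a likelihood ratio against a fixed alternative, here p1 = 0.6, chosen arbitrarily for illustration. (The talk's actual contribution concerns composite nulls, which this toy sketch does not cover.) Multiplying S-values from successive batches and rejecting once the product reaches 1/α keeps the false-rejection probability below α, no matter how the decision to collect another batch is made:

```python
import random

def batch_s_value(flips, p0=0.5, p1=0.6):
    """A simple S-value for the null 'the coin is fair': the likelihood ratio
    of a fixed alternative p1 against p0 = 0.5 (p1 = 0.6 is an arbitrary
    illustrative choice)."""
    s = 1.0
    for heads in flips:
        s *= (p1 if heads else (1 - p1)) / (p0 if heads else (1 - p0))
    return s

def run_trial(rng, batch=50, max_batches=10, alpha=0.05):
    """Optional continuation: keep multiplying S-values of fresh batches,
    rejecting as soon as the running product reaches 1/alpha."""
    s = 1.0
    for _ in range(max_batches):
        flips = [rng.random() < 0.5 for _ in range(batch)]  # null is true
        s *= batch_s_value(flips)
        if s >= 1 / alpha:
            return True  # false rejection
    return False

rng = random.Random(1)
trials = 2000
false_rejections = sum(run_trial(rng) for _ in range(trials))
# The false-rejection probability is provably at most alpha = 0.05,
# however the continuation decisions are made.
print(false_rejections / trials)
```

The guarantee rests on the fact that under the null the running product is a nonnegative supermartingale, so by Ville's inequality it ever exceeds 1/α with probability at most α – in sharp contrast to the p-value under the same repeated-peeking protocol.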

We will illustrate the new method using the R package ‘SafeTest’, which provides safe versions of the t-test and of tests for 2×2 contingency tables. We will show how Bayesian and frequentist statisticians can happily work together in unprecedented ways; finally, we will indicate how this research can be extended to ‘safe confidence intervals’.

Peter Grünwald
CWI (the Netherlands Centre for Mathematics and Computer Science)
Leiden University