What if hypothesis tests formed an algebra?
In SICP terms, we want three things: primitive
expressions (the tests themselves), means of
combination (ways to build compound tests from simpler ones),
and means of abstraction (ways to transform and
generalize tests). The hypothesize package provides all
three, and in v0.11.0, the combinators form a genuine Boolean
algebra.
This vignette develops the algebra from first principles.
Statistical theory gives us three fundamental likelihood-based tests. They approach the same question from different angles and are asymptotically equivalent, but each requires different information:
| Test | What it needs | When to use |
|---|---|---|
| Wald | MLE + standard error | You already fit the full model |
| LRT | Log-likelihoods of both models | You can fit both models |
| Score | Score + Fisher info at null only | The null model is cheap, the alternative is expensive |
The score test completes the trinity:
# Setup: testing whether a Poisson rate equals 5
# Observed: 60 events in 10 time units
# MLE: lambda_hat = 6, se = sqrt(6/10)
# Wald: evaluate at the MLE
w <- wald_test(estimate = 6, se = sqrt(6/10), null_value = 5)
# Score: evaluate at the null
# Score function at lambda0=5: n*xbar/lambda0 - n = 60/5 - 10 = 2
# Fisher info at lambda0=5: n/lambda0 = 10/5 = 2
s <- score_test(score = 2, fisher_info = 2)
# LRT: evaluate at both
# loglik(lambda) = sum(x)*log(lambda) - n*lambda
ll_null <- 60 * log(5) - 10 * 5
ll_alt <- 60 * log(6) - 10 * 6
l <- lrt(null_loglik = ll_null, alt_loglik = ll_alt, dof = 1)
# All three give similar (not identical) answers:
data.frame(
test = c("Wald", "Score", "LRT"),
statistic = c(test_stat(w), test_stat(s), test_stat(l)),
p_value = round(c(pval(w), pval(s), pval(l)), 4)
)
#> test statistic p_value
#> 1 Wald 1.666667 0.1967
#> 2 Score 2.000000 0.1573
#> 3 LRT 1.878587 0.1705

The three tests agree asymptotically but can diverge in finite samples. This is expected and well understood: the choice between them depends on what information is available, not on which gives the “right” answer.
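The numbers above can be reproduced in base R without the package, which makes the arithmetic behind each statistic explicit:

```r
# Base-R reconstruction of the three statistics for the Poisson example:
# 60 events in 10 time units, H0: lambda = 5
n <- 10; sum_x <- 60
lambda_hat <- sum_x / n                                     # MLE = 6
wald  <- ((lambda_hat - 5) / sqrt(lambda_hat / n))^2        # (6 - 5)^2 / (6/10)
score <- (sum_x / 5 - n)^2 / (n / 5)                        # score^2 / Fisher info
lrt_s <- 2 * ((sum_x * log(6) - n * 6) - (sum_x * log(5) - n * 5))
# Each statistic is referred to a chi-squared(1) distribution:
round(pchisq(c(wald, score, lrt_s), df = 1, lower.tail = FALSE), 4)
#> [1] 0.1967 0.1573 0.1705
```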
The heart of v0.11.0 is a Boolean algebra over hypothesis tests. Just as propositional logic has AND, OR, and NOT, we can combine tests with the same operations.
The complement of a test rejects when the original fails to reject:
# A significant Wald test
w <- wald_test(estimate = 3.0, se = 1.0)
pval(w) # small — rejects H0
#> [1] 0.002699796
# Its complement
cw <- complement_test(w)
pval(cw) # large — fails to reject
#> [1] 0.9973002

The algebra is clean: double complement is identity.
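Numerically, the complement rule is just the map \(p \mapsto 1 - p\) (the same fact the De Morgan construction relies on later in this vignette). A minimal base-R sketch, independent of the package:

```r
# Hypothetical stand-in for complement_test at the p-value level:
# complementing maps p to 1 - p, so complementing twice recovers p.
complement_p <- function(p) 1 - p
p <- 2 * pnorm(-3)      # p-value of wald_test(estimate = 3, se = 1)
complement_p(p)         # 0.9973002, matching pval(cw) above
abs(complement_p(complement_p(p)) - p) < 1e-12   # involution, up to rounding
```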
And the class hierarchy is preserved: a complemented Wald test is simultaneously a complemented_test, a wald_test, and a hypothesis_test.
Connection to equivalence testing. If a Wald test asks “is the parameter different from the null?”, its complement asks “is the parameter close to the null?” This is exactly the logic of equivalence testing. In bioequivalence studies, you don’t want to show a drug is different — you want to show it’s the same (within tolerance).
The intersection test rejects only when all component tests reject:
# Both must be significant
intersection_test(0.01, 0.03) # both < 0.05
#> Hypothesis test (intersection_test)
#> -----------------------------
#> Test statistic: 0.03
#> P-value: 0.03
#> Degrees of freedom: 2
#> Significant at 5% level: TRUE
is_significant_at(intersection_test(0.01, 0.03), 0.05)
#> [1] TRUE
# One non-significant component blocks rejection
intersection_test(0.01, 0.80)
#> Hypothesis test (intersection_test)
#> -----------------------------
#> Test statistic: 0.8
#> P-value: 0.8
#> Degrees of freedom: 2
#> Significant at 5% level: FALSE
is_significant_at(intersection_test(0.01, 0.80), 0.05)
#> [1] FALSE

The p-value is simply \(\max(p_1, \ldots, p_k)\): the intersection rejects at level \(\alpha\) if and only if every component rejects, which happens if and only if the largest p-value is below \(\alpha\).
This is the intersection-union test (IUT; Berger, 1982). No multiplicity correction is needed: taking the max only makes rejection harder, so the combined test is inherently conservative.
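Why no correction is needed can be seen with a quick simulation: with \(k\) independent tests and uniform null p-values, \(P(\max p < \alpha) = \alpha^k \le \alpha\). A base-R check (not part of the package):

```r
# Simulate the IUT under a global null: all component p-values ~ Uniform(0, 1)
set.seed(42)
alpha <- 0.05; k <- 2
reject <- replicate(10000, max(runif(k)) < alpha)
mean(reject)   # near alpha^k = 0.0025, well below the nominal 0.05
```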
Use case: bioequivalence. Bioequivalence requires showing a drug’s effect is both “not too low” AND “not too high”:
# Drug effect estimate: 1.05, with SE = 0.08
# Must show: effect > 0.8 AND effect < 1.25
t_lower <- wald_test(estimate = 1.05, se = 0.08, null_value = 0.8)
t_upper <- wald_test(estimate = 1.05, se = 0.08, null_value = 1.25)
bioequiv <- intersection_test(t_lower, t_upper)
is_significant_at(bioequiv, 0.05)
#> [1] TRUE

The union test rejects when any component test rejects. But here’s the beautiful part: we don’t implement it directly. Instead, we define it through De Morgan’s law:
\[\text{OR}(t_1, \ldots, t_k) = \text{NOT}(\text{AND}(\text{NOT}(t_1), \ldots, \text{NOT}(t_k)))\]
The implementation IS the theorem:
# This is (essentially) the actual source code of union_test:
union_test <- function(...) {
  tests <- list(...)
  complement_test(
    do.call(intersection_test, lapply(tests, complement_test))
  )
}

Let’s verify De Morgan’s law holds:
# Three p-values
p1 <- 0.03; p2 <- 0.15; p3 <- 0.07
# Direct union
ut <- union_test(p1, p2, p3)
# Manual De Morgan construction
tests <- list(
hypothesis_test(stat = 0, p.value = p1, dof = 1),
hypothesis_test(stat = 0, p.value = p2, dof = 1),
hypothesis_test(stat = 0, p.value = p3, dof = 1)
)
dm <- complement_test(
do.call(intersection_test, lapply(tests, complement_test))
)
# Identical
c(union = pval(ut), de_morgan = pval(dm))
#> union de_morgan
#> 0.03 0.03

The resulting p-value is \(\min(p_1, \ldots, p_k)\). This falls out algebraically: complement maps \(p \to 1-p\), the intersection takes the max, and the outer complement maps back. So:
\[1 - \max(1-p_1, \ldots, 1-p_k) = \min(p_1, \ldots, p_k)\]
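The identity is easy to check directly in base R, with no package machinery at all:

```r
p <- c(0.03, 0.15, 0.07)
de_morgan <- 1 - max(1 - p)   # complement -> intersection (max) -> complement
direct    <- min(p)
c(de_morgan, direct)          # both 0.03
```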
A note on multiplicity. The uncorrected \(\min(p)\) is anti-conservative. If you need
to control the family-wise error rate, apply adjust_pval()
to the component tests before combining. The raw union test is
appropriate for screening or exploratory analysis where you want to
detect any signal.
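For comparison, base R’s p.adjust() implements the standard family-wise corrections. The example below uses Bonferroni; the package’s adjust_pval() need not use the same methods:

```r
p <- c(0.03, 0.15, 0.07)
# Bonferroni multiplies each p-value by the number of tests, capped at 1
p_adj <- p.adjust(p, method = "bonferroni")
p_adj                # 0.09 0.45 0.21
min(p_adj) < 0.05    # FALSE: the apparent signal does not survive correction
```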
The three operations satisfy the axioms of a Boolean algebra:
| Property | Statement |
|---|---|
| Involution | complement(complement(t)) = t |
| De Morgan 1 | union(a, b) = complement(intersection(complement(a), complement(b))) |
| De Morgan 2 | intersection(a, b) = complement(union(complement(a), complement(b))) |
And they compose with the rest of the package: intersection and union tests are hypothesis_test objects, so they work with fisher_combine(), adjust_pval(), pval(), and everything else.
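As a reminder of what that combination does, Fisher’s method itself fits in a few lines of base R. This is a sketch of the classic formula, not necessarily fisher_combine()’s implementation:

```r
# Under H0 each p-value is Uniform(0, 1), so -2*log(p) ~ chi-squared(2),
# and the sum over k independent tests is chi-squared(2k).
fisher_p <- function(p) {
  stat <- -2 * sum(log(p))
  pchisq(stat, df = 2 * length(p), lower.tail = FALSE)
}
fisher_p(c(0.03, 0.15, 0.07))   # combined evidence across the three tests
```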
There is a deep connection between hypothesis tests and confidence
intervals: a \((1-\alpha)\) confidence
set contains exactly those null values that would not be
rejected at level \(\alpha\). The
invert_test() function makes this duality operational.
It takes a test constructor — any function that maps a null value to
a hypothesis_test — and returns the set of non-rejected
values:
# Invert a Wald test into a confidence interval
cs <- invert_test(
test_fn = function(theta) wald_test(estimate = 2.5, se = 0.8, null_value = theta),
grid = seq(0, 5, by = 0.01)
)
cs
#> Confidence set (95% level)
#> -----------------------------
#> Lower: 0.94
#> Upper: 4.06
#> Grid points in set: 313 of 501

This agrees with the analytical confidence interval (up to grid resolution):
# Analytical CI from the Wald test
confint(wald_test(estimate = 2.5, se = 0.8))
#> lower upper
#> 0.9320288 4.0679712
# Numerical CI from test inversion
c(lower = lower(cs), upper = upper(cs))
#> lower.lower upper.upper
#> 0.94 4.06

The power of invert_test() is generality. The analytical confint() methods only work for specific test types (Wald, z-test). But invert_test() works with any test, including user-defined ones:
# A custom test: reject if |observation - theta| > 2
# (This has no analytical CI, but invert_test handles it)
my_test <- function(theta) {
obs <- 5.0
hypothesis_test(
stat = (obs - theta)^2,
p.value = if (abs(obs - theta) > 2) 0.01 else 0.5,
dof = 1
)
}
cs <- invert_test(my_test, grid = seq(0, 10, by = 0.1))
c(lower = lower(cs), upper = upper(cs))
#> lower.lower upper.upper
#> 3 7

This is the most SICP function in the package. It takes a
function as input (the test constructor) and returns a
structured result (the confidence set). It works because of the
abstraction barrier: invert_test() only needs
pval(), so any test that implements the
hypothesis_test interface gets confidence sets for
free.
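Stripped of the package’s classes, the core of grid inversion is a one-liner. The sketch below uses a hypothetical helper and a hand-written Wald p-value rather than the package’s actual source:

```r
# Keep every grid point whose test is NOT rejected at level alpha
invert_by_grid <- function(p_fn, grid, alpha = 0.05) {
  grid[vapply(grid, p_fn, numeric(1)) >= alpha]
}
# Two-sided Wald p-value for estimate 2.5, se 0.8 (as in the example above)
wald_p <- function(theta) 2 * pnorm(-abs((2.5 - theta) / 0.8))
kept <- invert_by_grid(wald_p, seq(0, 5, by = 0.01))
range(kept)   # c(0.94, 4.06), matching the confidence set above
```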
Both wald_test() and score_test() are
polymorphic — they accept scalar or vector inputs and dispatch
accordingly.
The multivariate Wald test uses the quadratic form:
\[W = (\hat\theta - \theta_0)^\top \Sigma^{-1} (\hat\theta - \theta_0) \sim \chi^2_k\]
# Test two parameters jointly
est <- c(2.0, 3.0)
V <- matrix(c(1.0, 0.3, 0.3, 1.0), 2, 2) # correlated
w_joint <- wald_test(estimate = est, vcov = V, null_value = c(0, 0))
w_joint
#> Hypothesis test (wald_test)
#> -----------------------------
#> Test statistic: 10.3296703296703
#> P-value: 0.00571400463027544
#> Degrees of freedom: 2
#> Significant at 5% level: TRUE

When parameters are independent (diagonal covariance), the multivariate statistic decomposes into a sum of univariate statistics:
# Independent parameters: diagonal vcov
se1 <- 0.8; se2 <- 1.2
V_diag <- diag(c(se1^2, se2^2))
w_multi <- wald_test(estimate = c(2.0, 3.0), vcov = V_diag)
w1 <- wald_test(estimate = 2.0, se = se1)
w2 <- wald_test(estimate = 3.0, se = se2)
# The joint statistic equals the sum of individual statistics
c(joint = test_stat(w_multi), sum = test_stat(w1) + test_stat(w2))
#> joint sum
#> 12.5 12.5

This decomposition breaks down when parameters are correlated: that is when you need the full vcov matrix. The off-diagonal entries capture the dependence structure that individual standard errors miss.
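The joint statistic from the correlated example can be reproduced directly from the quadratic form in base R:

```r
# W = (theta_hat - theta_0)' Sigma^{-1} (theta_hat - theta_0), df = length(theta_hat)
d <- c(2.0, 3.0) - c(0, 0)                 # estimate minus null
V <- matrix(c(1.0, 0.3, 0.3, 1.0), 2, 2)   # the correlated vcov from above
W <- drop(t(d) %*% solve(V) %*% d)
c(stat = W, p = pchisq(W, df = 2, lower.tail = FALSE))
# stat ~ 10.3297, p ~ 0.0057, matching w_joint
```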
Stepping back, the package implements a small algebra with clear structure:
- Primitives: z_test, wald_test, lrt, score_test
- Combinators: fisher_combine, intersection_test, union_test
- Transformers: adjust_pval, complement_test
- Duality: invert_test, confint
- Accessors: pval, test_stat, dof, is_significant_at, lower, upper
Each layer corresponds to an SICP principle:
| Layer | SICP Principle | What it does |
|---|---|---|
| Primitives | Primitive expressions | The four ways to test a hypothesis |
| Combinators | Means of combination | Build compound tests (closure property) |
| Transformers | Means of abstraction | Transform tests into new tests |
| Duality | Power of the abstraction barrier | Any test becomes a confidence set |
The package has 18 exported functions. That’s deliberately small. The goal is not to provide every test, but to provide the algebraic primitives from which users can compose arbitrary test logic. A well-chosen basis is more powerful than an exhaustive catalog.