Hypothesis Testing on Fitted Models

Why a Separate Vignette?

vignette("maskedhaz") establishes that fit(model)(df, par) returns a fisher_mle object, which realises both the mle_fit concept (from algebraic.mle) and the algebraic.dist distribution interface. That gives us coef, vcov, confint, se, bias, observed_fim, score_val, and a logLik method, all through generics the maskedhaz package itself never implemented.

This vignette takes the next step: everything hypothesize does works on these fits directly. No bridge code, no wrappers, just generics flowing through generics. We’ll work through six questions a statistician would actually ask about a real analysis, using different test machinery for each.

library(maskedhaz)
library(hypothesize)
library(algebraic.mle)

The Scenario

We generate data from a 2-component Weibull series system (true shapes are 2, not 1) and pretend we don’t know the true family. We’ll fit both an exponential-series model (simpler, 2 params) and a Weibull-series model (fuller, 4 params) to the same data, then use hypothesis tests to decide which is justified.

set.seed(42)

true_model <- dfr_series_md(components = list(
  dfr_weibull(shape = 2, scale = 5),
  dfr_weibull(shape = 2, scale = 8)
))
df <- rdata(true_model)(theta = c(2, 5, 2, 8), n = 300, tau = 10, p = 0.3)
table(df$omega)
#> 
#> exact 
#>   300
null_model <- dfr_series_md(
  components = list(dfr_exponential(), dfr_exponential()),
  n_par = c(1L, 1L)
)
alt_model <- dfr_series_md(
  components = list(dfr_weibull(), dfr_weibull()),
  n_par = c(2L, 2L)
)

fit_null <- suppressWarnings(fit(null_model)(df, par = c(0.1, 0.1)))
fit_alt  <- suppressWarnings(fit(alt_model)(df, par = c(1.5, 4, 1.5, 6)))

coef(fit_null)
#> [1] 0.17853850 0.09337168
coef(fit_alt)
#> [1] 1.798337 5.115741 2.306984 6.855044

Question 1: Is the Exponential Family Adequate?

The exponential-series model is nested inside the Weibull-series model (it’s the special case where every shape equals 1). A likelihood ratio test compares the two.

# lrt() accepts logLik objects directly and derives the dof difference
# from the `df` attribute attached by likelihood.model::logLik.fisher_mle
lrt_result <- lrt(logLik(fit_null), logLik(fit_alt))
lrt_result
#> Hypothesis test (likelihood_ratio_test)
#> -----------------------------
#> Test statistic: 157.552414688934
#> P-value: 6.13660041402268e-35
#> Degrees of freedom: 2
#> Significant at 5% level: TRUE

The test statistic is \(2(\ell_{\text{alt}} - \ell_{\text{null}})\), asymptotically \(\chi^2\) with dof equal to the parameter-count difference (here, 2 extra Weibull shape parameters). The p-value is vanishingly small — the exponential family is rejected decisively.
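As a numerical cross-check, the \(\chi^2_2\) upper tail has the closed form \(P(X > x) = e^{-x/2}\), so the printed p-value can be reproduced in base R from the test statistic alone:

```r
# Reproduce the LRT p-value from the statistic printed above.
# For 2 degrees of freedom the chi-squared survival function is exp(-x/2).
stat <- 157.552414688934
c(pchisq      = pchisq(stat, df = 2, lower.tail = FALSE),
  closed_form = exp(-stat / 2))
#> both entries are approximately 6.1366e-35, matching the printed p-value
```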

Notice that we never wrote any maskedhaz-specific glue code. logLik() is a standard R generic, and hypothesize::lrt() consumes its output. The maskedhaz package didn’t have to know anything about hypothesize, and vice versa.

Question 2: Is Each Individual Shape Significantly Different from 1?

The LRT rejects the global hypothesis “both shapes equal 1 simultaneously,” but says nothing about each shape separately. For that we use Wald tests — one per shape parameter — feeding them the estimate and standard error directly.

est <- coef(fit_alt)
se_vec <- se(fit_alt)         # from algebraic.mle

# Parameter layout: (shape1, scale1, shape2, scale2) = positions 1, 2, 3, 4
w_shape1 <- wald_test(est[1], se = se_vec[1], null_value = 1)
w_shape2 <- wald_test(est[3], se = se_vec[3], null_value = 1)

c(shape1_pval = pval(w_shape1),
  shape2_pval = pval(w_shape2))
#>  shape1_pval  shape2_pval 
#> 2.932220e-12 7.595611e-09

Both individual tests reject shape = 1 at any conventional level. The MLEs are about 1.8 and 2.3, and the standard errors are small enough to distinguish both from 1 cleanly.
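For intuition, a two-sided Wald p-value is just a normal tail evaluated at \(z = (\hat\theta - \theta_0)/\text{se}\). A minimal base-R sketch using the shape-1 estimate from above; the standard error here is hypothetical for illustration, since the actual se(fit_alt) values are not printed:

```r
est1 <- 1.798337  # shape1 MLE from coef(fit_alt) above
se1  <- 0.115     # HYPOTHETICAL standard error, for illustration only
z    <- (est1 - 1) / se1       # Wald z-statistic against shape = 1
p    <- 2 * pnorm(-abs(z))     # two-sided normal tail probability
c(z = z, p = p)
```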

Question 3: Controlling Family-Wise Error

If we plan to report these two tests as a family (making decisions only when the family-wise error rate is bounded), we should adjust. adjust_pval() accepts a list of tests and returns a list of tests with the corrected p-values stored.

adjusted <- adjust_pval(list(w_shape1, w_shape2), method = "bonferroni")
c(shape1_adj = pval(adjusted[[1]]),
  shape2_adj = pval(adjusted[[2]]))
#>   shape1_adj   shape2_adj 
#> 5.864441e-12 1.519122e-08

With only two tests, Bonferroni simply doubles each p-value; both remain far below 0.05. The point is that adjust_pval() returns hypothesis_test objects — they continue to compose with the rest of the algebra. You could hand these adjusted tests into is_significant_at(), invert_test(), or anything else that expects a test.
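The Bonferroni arithmetic is easy to verify against base R’s p.adjust(), using the unadjusted p-values printed above:

```r
# Bonferroni adjustment is pmin(1, n * p); cross-check with stats::p.adjust.
p_raw <- c(shape1 = 2.932220e-12, shape2 = 7.595611e-09)  # from pval() above
p_adj <- p.adjust(p_raw, method = "bonferroni")
all.equal(p_adj, pmin(1, length(p_raw) * p_raw))
#> [1] TRUE
```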

Question 4: Composite Hypotheses

A different but related question: is the conjunction “shape1 = 1 AND shape2 = 1” supported by the data? Two natural ways to build a test:

Intersection-union via max p-value. intersection_test() takes the maximum of the individual p-values: if the maximum is small, every individual test rejects, so the composite test rejects too.

comp_intersection <- intersection_test(w_shape1, w_shape2)
pval(comp_intersection)
#> [1] 7.595611e-09

Fisher’s method for combining independent p-values. Under the conjunction null, \(-2 \sum_i \log p_i\) follows \(\chi^2_{2k}\) for \(k\) independent p-values. fisher_combine() implements this. (Strictly, the two Wald statistics come from the same fit and are not independent, so treat the combined p-value as approximate.)

comp_fisher <- fisher_combine(w_shape1, w_shape2)
comp_fisher
#> Hypothesis test (fisher_combined_test)
#> -----------------------------
#> Test statistic: 90.5019128167015
#> P-value: 1.03010156933654e-18
#> Degrees of freedom: 4
#> Significant at 5% level: TRUE

Both approaches agree that the data contradict the composite exponential null, but with different sensitivity: Fisher’s method uses more of the information in moderate p-values, while max-p is conservative when some tests are strong and others weak.
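Fisher’s statistic is simple enough to reproduce from the two printed p-values with base R alone:

```r
# Fisher's combined statistic: -2 * sum(log p_i), referred to chi-squared
# with 2k degrees of freedom (k = number of p-values combined).
p_raw  <- c(2.932220e-12, 7.595611e-09)   # Wald p-values from above
stat   <- -2 * sum(log(p_raw))
p_comb <- pchisq(stat, df = 2 * length(p_raw), lower.tail = FALSE)
c(stat = stat, p = p_comb)
#> stat is approximately 90.50 and p is approximately 1.03e-18,
#> matching the fisher_combine() output above
```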

Question 5: A Score Test Without Refitting

The LRT and Wald tests both required fitting the full Weibull model. A score test can evaluate a null hypothesis using only the null model plus the full-model score/Hessian machinery; no fit of the alternative model is needed. This is useful when the alternative is expensive to fit or not globally identified.

For the composite null “all shapes = 1, scales from the null MLE,” we evaluate the score and observed Fisher information of the Weibull-series model at the null parameter vector.

# Map the exp-series MLE into the Weibull parameter space:
# (shape = 1, scale = 1/rate) for each component
null_in_alt <- c(1, 1/coef(fit_null)[1],
                 1, 1/coef(fit_null)[2])

# Score and observed Fisher info of the ALT model at the null point
s_at_null <- score(alt_model)(df, par = null_in_alt)
I_at_null <- -hess_loglik(alt_model)(df, par = null_in_alt)

s_at_null
#> [1] 124.464166255  -0.003897959  74.107135509   0.002022227

sc <- score_test(s_at_null, I_at_null, null_value = null_in_alt)
sc
#> Hypothesis test (score_test)
#> -----------------------------
#> Test statistic: 107.425992161895
#> P-value: 2.57532884934985e-22
#> Degrees of freedom: 4
#> Significant at 5% level: TRUE

Two things are worth pointing out. First, the test statistic \(s^\top I^{-1} s\), referred to its \(\chi^2_4\) reference distribution, comes out large — the null is rejected decisively, agreeing with the LRT and Wald results. Second, look at the components of the score vector: positions 1 and 3 (the shape entries) are far from zero, while positions 2 and 4 (the scales) are nearly zero. The likelihood gradient at the null “wants” to move the shapes up, not the scales; the null MLE already got the scales roughly right for a shape-1 fit. That’s diagnostic information the LRT doesn’t give you for free.
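The quadratic form itself is one line of linear algebra. A self-contained base-R sketch with a small hypothetical score vector and information matrix (not the actual 4-dimensional quantities above):

```r
# HYPOTHETICAL 2-parameter example of the score statistic s' I^{-1} s.
s <- c(5, 0.1)                          # score at the null point
I <- matrix(c(4, 1, 1, 3), nrow = 2)    # observed Fisher information
stat <- drop(t(s) %*% solve(I, s))      # quadratic form s' I^{-1} s
pval <- pchisq(stat, df = length(s), lower.tail = FALSE)
c(stat = stat, pval = pval)             # stat is about 6.73, pval about 0.035
```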

Question 6: Confidence Intervals by Test Inversion

A \(100(1-\alpha)\%\) confidence set for a parameter is, by definition, the set of null values the test fails to reject at level \(\alpha\). invert_test() automates this inversion over a user-supplied grid.

# Build a Wald test parameterised by shape1's null value
wald_for_shape1 <- function(theta0) {
  wald_test(est[1], se = se_vec[1], null_value = theta0)
}

ci_shape1 <- invert_test(wald_for_shape1, grid = seq(1.0, 3.0, by = 0.01), alpha = 0.05)
c(lower = lower(ci_shape1), upper = upper(ci_shape1))
#> lower.lower upper.upper 
#>        1.58        2.02

Compare with the direct Wald CI:

confint(wald_test(est[1], se = se_vec[1], null_value = est[1]))
#>    lower    upper 
#> 1.574196 2.022478

They agree to within the grid resolution. For a Wald test this redundancy is by design (the two are algebraically identical). The real power of invert_test() shows when you plug in a different test-fn — e.g., a likelihood-ratio test with a profile likelihood — to get a non-Wald confidence region whose shape is driven by the actual likelihood surface rather than the quadratic approximation. That’s expensive (requires refitting under each grid point) but fully composable: the test-fn just has to return a hypothesis_test.
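The inversion logic that invert_test() automates can be sketched by hand in base R: scan a grid of null values and keep the ones a level-\(\alpha\) Wald test fails to reject. Here using the shape-1 estimate from above and a standard error of roughly 0.114 backed out of the printed CI half-width (an approximation, since se(fit_alt) is not shown):

```r
est1 <- 1.798337   # shape1 MLE from coef(fit_alt)
se1  <- 0.1144     # approx.: half-width of the printed CI divided by 1.96
grid <- seq(1.0, 3.0, by = 0.01)
keep <- abs(est1 - grid) / se1 < qnorm(0.975)  # fail-to-reject region
range(grid[keep])                              # grid-resolution Wald CI
#> [1] 1.58 2.02
```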

Cross-Implementation Check

Because maskedhaz and maskedcauses both implement the series_md protocol and both return fisher_mle objects, the hypothesis-testing code is literally identical across them. We can verify by fitting the same data with maskedcauses’ closed-form exponential-series likelihood and running the same LRT.

library(maskedcauses)

fit_null_mc <- suppressWarnings(
  likelihood.model::fit(exp_series_md_c1_c2_c3())(df, par = c(0.1, 0.1))
)

# LRT using maskedcauses null fit + maskedhaz alt fit, across two packages
lrt_cross <- lrt(logLik(fit_null_mc), logLik(fit_alt))
c(
  lrt_same_package = pval(lrt_result),
  lrt_cross_package = pval(lrt_cross)
)
#>  lrt_same_package lrt_cross_package 
#>        6.1366e-35        6.1366e-35

The p-values agree. The test statistic, dof, and conclusion are the same because both packages compute the same likelihood (just via different numerical strategies). hypothesize doesn’t know or care which package the logLik came from — it only needs the logLik generic with the df attribute. That’s the protocol in action.

Summary

We ran six distinct hypothesis-testing analyses on the same fitted models, each using different test machinery:

| Question | Test | hypothesize function | Input from maskedhaz fit |
|---|---|---|---|
| Is the simpler model enough? | LRT | lrt() | logLik(fit_null), logLik(fit_alt) |
| Is each parameter different from the null value? | Wald | wald_test() | coef(fit), se(fit) |
| Family-wise error control | Multiple testing | adjust_pval() | list of Wald tests |
| Is the composite null rejected? | Intersection-union, Fisher | intersection_test(), fisher_combine() | individual tests |
| Alt-model-free null rejection | Score | score_test() | score(), hess_loglik() from the alt model at the null point |
| Confidence interval as rejection set | Test inversion | invert_test() | a test-fn parameterised by the null value |

None of this is maskedhaz-specific. Everything works through the logLik / coef / vcov / se / score_val / observed_fim generics that algebraic.mle and likelihood.model wired up long before hypothesize existed. The layered design lets each package do its one job well; capabilities compose transparently.