ssp.relogit: Optimal
Subsampling for Logistic Regression with Rare EventsThis vignette introduces ssp.relogit(), the rare-event
logistic regression method in subsampling. When positive
responses are rare, subsampling both classes symmetrically can waste
computation. ssp.relogit() keeps all rare events and
subsamples only the non-rare events.
Let \((y_i, x_i)\), \(i = 1, \ldots, N\), denote the full data, where \(y_i \in \{0,1\}\) and \(x_i\) includes the intercept. The logistic regression model is
\[ P(Y_i = 1 \mid x_i) = p_i(\beta) = \frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)}. \]
The full-data objective is
\[ \ell_N(\beta) = \frac{1}{N}\sum_{i=1}^N \left[ y_i x_i^\top\beta - \log\{1 + \exp(x_i^\top\beta)\} \right]. \]
In rare-event logistic regression, the number of positives
\[ N_1 = \sum_{i=1}^N y_i \]
is much smaller than the number of negatives
\[ N_0 = N - N_1. \]
In this setting, ssp.relogit() keeps all positive
observations and subsamples only the negatives, which reduces
computation while preserving the information carried by the rare class
(Wang, Zhang, and Wang 2021).
ssp.relogit() WorksFor criterion %in% c("optA", "optL", "LCC"), the
function uses the following two-stage design.
n.plt.n.ssp.So in this function, n.ssp is the expected number of
sampled negatives, not the expected total sample size. The returned
expected total subsample size is
\[ N_1 + n_{\mathrm{ssp}}. \]
For criterion = "uniform", the function still keeps all
positives, but it samples negatives uniformly with probability
n.ssp / N0. There is no pilot estimator in that branch.
The pilot estimator is designed specifically for rare-event logistic regression. Let \(S_{\mathrm{plt}}\) denote the pilot sample. The code draws about half the pilot sample from positives and half from negatives, then fits an unweighted logistic model on that balanced pilot sample.
The pilot objective is
\[ \widehat{\beta}^{\,\mathrm{bal}} = \arg\max_\beta \sum_{i \in S_{\mathrm{plt}}} \left[ y_i x_i^\top\beta - \log\{1 + \exp(x_i^\top\beta)\} \right]. \]
Because the pilot sample has an artificial class balance, the fitted intercept is then corrected by
\[ \widehat{\beta}_0 = \widehat{\beta}^{\,\mathrm{bal}}_0 - \log\left(\frac{N_0}{N_1}\right), \]
while the slope coefficients are kept unchanged. The fitted pilot probabilities are
\[ \widehat{p}_i = p_i(\widehat{\beta}),\qquad i = 1,\ldots,N. \]
These pilot probabilities are used to construct the raw importance scores for negative sampling.
criterionThe criterion argument determines the raw importance
score \(m_i\). These scores represent
relative importance. They are not probabilities yet. The function turns
them into valid negative-sampling probabilities by truncation, scaling,
and the uniform mixture controlled by alpha.
Although the score is written for all observations, it matters operationally for the negative class because all positives are kept automatically.
criterion = "optA"The A-optimal score is
\[ m_i = \widehat{p}_i \sqrt{1-\widehat{p}_i} \left\lVert \widehat{M}^{-1} x_i \right\rVert, \]
where \(\widehat{M}\) is the pilot estimate of the information matrix. This criterion aims to minimize the asymptotic variance of the subsample estimator. It targets smaller asymptotic variance but is computationally expensive because it uses \(\widehat{M}^{-1}\).
criterion = "optL"The L-optimal score is
\[ m_i = \widehat{p}_i \sqrt{1-\widehat{p}_i}\,\lVert x_i \rVert. \]
This is the default. It keeps the same leading rare-event factor as
optA, but replaces \(\lVert
\widehat{M}^{-1} x_i \rVert\) by \(\lVert x_i \rVert\). So it avoids
multiplying by the inverse pilot information matrix inside the norm and
is computationally cheaper.
criterion = "LCC"The local case-control score is
\[ m_i = |y_i - \widehat{p}_i|. \]
This is a baseline method related to local case-control sampling.
criterion = "uniform"In the uniform branch, all positives are still retained. The negative observations are sampled with the same probability, so there is no pilot-based importance score.
For optA, optL, and LCC,
ssp.relogit() uses a Poisson negative-sampling design.
First, the raw score is truncated:
\[ m_i^* = \min(m_i, H), \]
where \(H\) is a threshold
controlled by b.
Then the negative inclusion probability is
\[ \pi_i^{-} = \min\left[ n_{\mathrm{ssp}} \left\{ (1-\alpha)\frac{m_i^*}{\widehat{\Phi}} + \alpha\frac{1}{N} \right\}, 1 \right], \]
where \(\widehat{\Phi}\) is the pilot-based normalizing quantity. The actual subsampling design is
\[ \pi_i = \begin{cases} 1, & y_i = 1,\\[6pt] \pi_i^{-}, & y_i = 0. \end{cases} \]
So positives are always kept, while negatives are independently included.
For criterion = "uniform", the design becomes
\[ \pi_i = \begin{cases} 1, & y_i = 1,\\[6pt] \dfrac{n_{\mathrm{ssp}}}{N_0}, & y_i = 0. \end{cases} \]
likelihoodThe likelihood argument controls how the model fit
corrects for unequal sampling probabilities.
likelihood = "weighted"The weighted likelihood uses inverse sampling probabilities:
\[ \ell_S^{\mathrm{w}}(\beta) = \sum_{i \in S} \frac{1}{\pi_i} \left[ y_i x_i^\top\beta - \log\{1 + \exp(x_i^\top\beta)\} \right]. \]
This is the more general correction and can be useful as a baseline.
likelihood = "logOddsCorrection"This is the default and the more specialized rare-event correction (Wang, Zhang, and Wang 2021). For negative sampling, the sampled conditional model can be written as
\[ P(Y_i = 1 \mid x_i, i \in S) = \frac{ \exp(x_i^\top\beta + a_i) }{ 1 + \exp(x_i^\top\beta + a_i) }, \]
where the offset is
\[ a_i = -\log(\pi_i^{-}). \]
This is exactly the offset used in the implementation for the non-uniform negative-sampling branch.
| Component | Setting |
|---|---|
criterion |
optA |
criterion |
optL |
criterion |
LCC |
criterion |
uniform |
likelihood |
logOddsCorrection |
likelihood |
weighted |
The main design features to remember are:
n.ssp is the expected number of sampled negativesN1 + n.sspWe simulate a rare-event logistic regression problem with \(N = 2 \times 10^4\) observations.
set.seed(2)
N <- 2 * 1e4
beta0 <- c(-6, -rep(0.5, 6))
d <- length(beta0) - 1
corr <- 0.5
sigmax <- corr ^ abs(outer(1:d, 1:d, "-"))
X <- MASS::mvrnorm(n = N, mu = rep(0, d), Sigma = sigmax)
Y <- rbinom(N, 1, 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1])))
n.plt <- 200
n.ssp <- 600
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste0("V", 1:ncol(X)))
formula <- Y ~ .
print(paste("N:", N))
#> [1] "N: 20000"
print(paste("sum(Y):", sum(Y)))
#> [1] "sum(Y): 266"The default recommendation is criterion = "optL" with
likelihood = "logOddsCorrection".
set.seed(11)
fit_default <- ssp.relogit(
formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
criterion = "optL",
likelihood = "logOddsCorrection"
)
summary(fit_default)
#> Model Summary
#>
#>
#> Call:
#>
#> ssp.relogit(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> criterion = "optL", likelihood = "logOddsCorrection")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 20000
#> 2 Expected Subsample Size 866
#> 3 Actual Subsample Size 863
#> 4 Unique Subsample Size 863
#> 5 Expected Subample Rate 4.33%
#> 6 Actual Subample Rate 4.315%
#> 7 Unique Subample Rate 4.315%
#>
#> Sample Composition (Logistic Regression):
#> Sample Size Y_count Y_rate
#> Subsample 863 266 30.82%
#> Full data 20000 266 1.33%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept -5.9076 0.1424 -41.4856 <0.0001
#> V1 -0.4545 0.0849 -5.3541 <0.0001
#> V2 -0.5072 0.0925 -5.4858 <0.0001
#> V3 -0.5110 0.0886 -5.7699 <0.0001
#> V4 -0.4335 0.0931 -4.6563 <0.0001
#> V5 -0.5916 0.0905 -6.5400 <0.0001
#> V6 -0.4641 0.0817 -5.6832 <0.0001If you want the A-optimal criterion, you can switch
criterion while keeping the same rare-event correction.
set.seed(2)
fit_optA <- ssp.relogit(
formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
criterion = "optA",
likelihood = "logOddsCorrection"
)
summary(fit_optA)
#> Model Summary
#>
#>
#> Call:
#>
#> ssp.relogit(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> criterion = "optA", likelihood = "logOddsCorrection")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 20000
#> 2 Expected Subsample Size 866
#> 3 Actual Subsample Size 879
#> 4 Unique Subsample Size 879
#> 5 Expected Subample Rate 4.33%
#> 6 Actual Subample Rate 4.395%
#> 7 Unique Subample Rate 4.395%
#>
#> Sample Composition (Logistic Regression):
#> Sample Size Y_count Y_rate
#> Subsample 879 266 30.26%
#> Full data 20000 266 1.33%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept -5.8214 0.1345 -43.2958 <0.0001
#> V1 -0.3599 0.0814 -4.4214 <0.0001
#> V2 -0.6198 0.0954 -6.4991 <0.0001
#> V3 -0.4419 0.0983 -4.4971 <0.0001
#> V4 -0.4006 0.0921 -4.3498 <0.0001
#> V5 -0.5694 0.0952 -5.9838 <0.0001
#> V6 -0.4913 0.0833 -5.8987 <0.0001The uniform branch still keeps all positives and samples negatives uniformly.
set.seed(9)
fit_uniform <- ssp.relogit(
formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
criterion = "uniform",
likelihood = "logOddsCorrection"
)
fit_uniform$subsample.size.expect
#> [1] 866
length(fit_uniform$index)
#> [1] 872
summary(fit_uniform)
#> Model Summary
#>
#>
#> Call:
#>
#> ssp.relogit(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> criterion = "uniform", likelihood = "logOddsCorrection")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 20000
#> 2 Expected Subsample Size 866
#> 3 Actual Subsample Size 872
#> 4 Unique Subsample Size 872
#> 5 Expected Subample Rate 4.33%
#> 6 Actual Subample Rate 4.36%
#> 7 Unique Subample Rate 4.36%
#>
#> Sample Composition (Logistic Regression):
#> Sample Size Y_count Y_rate
#> Subsample 872 266 30.5%
#> Full data 20000 266 1.33%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept -5.7572 0.1591 -36.1775 <0.0001
#> V1 -0.4193 0.1127 -3.7198 0.0002
#> V2 -0.5637 0.1357 -4.1532 <0.0001
#> V3 -0.3973 0.1294 -3.0707 0.0021
#> V4 -0.4071 0.1190 -3.4201 0.0006
#> V5 -0.5804 0.1343 -4.3221 <0.0001
#> V6 -0.4144 0.1159 -3.5769 0.0003The returned object stores the fitted estimators, covariance estimates, and row indices of the selected observations.
names(fit_default)
#> [1] "model.call" "coef.plt" "coef.ssp"
#> [4] "coef" "cov.ssp" "cov"
#> [7] "index.plt" "index" "N"
#> [10] "N1" "N0" "Y.count.ssp"
#> [13] "subsample.size.expect" "terms"Important components include:
coef.plt: pilot estimator.coef.ssp: estimator from the retained positives plus
sampled negatives.coef: combined estimator printed by
summary().cov.ssp: covariance estimate for
coef.ssp.cov: covariance estimate for coef.index.plt: row indices of the pilot sample, when a
pilot sample is used.index: row indices of the final retained-and-sampled
subsample.subsample.size.expect: expected total subsample
size.The coefficients and standard errors printed by
summary() are based on coef and
sqrt(diag(cov)).