ssp.relogit: Optimal Subsampling for Logistic Regression with Rare Events

library(subsampling)

This vignette introduces ssp.relogit(), the rare-event logistic regression method in subsampling. When positive responses are rare, subsampling both classes symmetrically can waste computation. ssp.relogit() keeps all rare events and subsamples only the non-rare events.

Model and Rare-Event Setting

Let \((y_i, x_i)\), \(i = 1, \ldots, N\), denote the full data, where \(y_i \in \{0,1\}\) and \(x_i\) includes the intercept. The logistic regression model is

\[ P(Y_i = 1 \mid x_i) = p_i(\beta) = \frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)}. \]

The full-data objective is

\[ \ell_N(\beta) = \frac{1}{N}\sum_{i=1}^N \left[ y_i x_i^\top\beta - \log\{1 + \exp(x_i^\top\beta)\} \right]. \]

In rare-event logistic regression, the number of positives

\[ N_1 = \sum_{i=1}^N y_i \]

is much smaller than the number of negatives

\[ N_0 = N - N_1. \]

In this setting, ssp.relogit() keeps all positive observations and subsamples only the negatives, which reduces computation while preserving the information carried by the rare class (Wang, Zhang, and Wang 2021).

How ssp.relogit() Works

For criterion %in% c("optA", "optL", "LCC"), the function uses the following two-stage design.

  1. Draw a pilot sample of size n.plt.
  2. Fit a pilot estimator \(\widehat{\beta}\).
  3. Use \(\widehat{\beta}\) to construct negative-sampling probabilities.
  4. Keep all positive observations.
  5. Draw a Poisson subsample from the negative observations with expected size n.ssp.
  6. Fit the subsample estimator and form an information-weighted average of the pilot and subsample estimators.

So in this function, n.ssp is the expected number of sampled negatives, not the expected total sample size. The returned expected total subsample size is

\[ N_1 + n_{\mathrm{ssp}}. \]

For criterion = "uniform", the function still keeps all positives, but it samples negatives uniformly with probability n.ssp / N0. There is no pilot estimator in that branch.

Pilot Estimator

The pilot estimator is designed specifically for rare-event logistic regression. Let \(S_{\mathrm{plt}}\) denote the pilot sample. The code draws about half the pilot sample from positives and half from negatives, then fits an unweighted logistic model on that balanced pilot sample.

The pilot objective is

\[ \widehat{\beta}^{\,\mathrm{bal}} = \arg\max_\beta \sum_{i \in S_{\mathrm{plt}}} \left[ y_i x_i^\top\beta - \log\{1 + \exp(x_i^\top\beta)\} \right]. \]

Because the pilot sample has an artificial class balance, the fitted intercept is then corrected by

\[ \widehat{\beta}_0 = \widehat{\beta}^{\,\mathrm{bal}}_0 - \log\left(\frac{N_0}{N_1}\right), \]

while the slope coefficients are kept unchanged. The fitted pilot probabilities are

\[ \widehat{p}_i = p_i(\widehat{\beta}),\qquad i = 1,\ldots,N. \]

These pilot probabilities are used to construct the raw importance scores for negative sampling.

Choosing criterion

The criterion argument determines the raw importance score \(m_i\). These scores represent relative importance. They are not probabilities yet. The function turns them into valid negative-sampling probabilities by truncation, scaling, and the uniform mixture controlled by alpha.

Although the score is written for all observations, it matters operationally for the negative class because all positives are kept automatically.

criterion = "optA"

The A-optimal score is

\[ m_i = \widehat{p}_i \sqrt{1-\widehat{p}_i} \left\lVert \widehat{M}^{-1} x_i \right\rVert, \]

where \(\widehat{M}\) is the pilot estimate of the information matrix. This criterion aims to minimize the asymptotic variance of the subsample estimator. It targets smaller asymptotic variance but is computationally expensive because it uses \(\widehat{M}^{-1}\).

criterion = "optL"

The L-optimal score is

\[ m_i = \widehat{p}_i \sqrt{1-\widehat{p}_i}\,\lVert x_i \rVert. \]

This is the default. It keeps the same leading rare-event factor as optA, but replaces \(\lVert \widehat{M}^{-1} x_i \rVert\) by \(\lVert x_i \rVert\). So it avoids multiplying by the inverse pilot information matrix inside the norm and is computationally cheaper.

criterion = "LCC"

The local case-control score is

\[ m_i = |y_i - \widehat{p}_i|. \]

This is a baseline method related to local case-control sampling.

criterion = "uniform"

In the uniform branch, all positives are still retained. The negative observations are sampled with the same probability, so there is no pilot-based importance score.

Negative-Sampling Probabilities

For optA, optL, and LCC, ssp.relogit() uses a Poisson negative-sampling design.

First, the raw score is truncated:

\[ m_i^* = \min(m_i, H), \]

where \(H\) is a threshold controlled by b.

Then the negative inclusion probability is

\[ \pi_i^{-} = \min\left[ n_{\mathrm{ssp}} \left\{ (1-\alpha)\frac{m_i^*}{\widehat{\Phi}} + \alpha\frac{1}{N} \right\}, 1 \right], \]

where \(\widehat{\Phi}\) is the pilot-based normalizing quantity. The actual subsampling design is

\[ \pi_i = \begin{cases} 1, & y_i = 1,\\[6pt] \pi_i^{-}, & y_i = 0. \end{cases} \]

So positives are always kept, while negatives are independently included.

For criterion = "uniform", the design becomes

\[ \pi_i = \begin{cases} 1, & y_i = 1,\\[6pt] \dfrac{n_{\mathrm{ssp}}}{N_0}, & y_i = 0. \end{cases} \]

Choosing likelihood

The likelihood argument controls how the model fit corrects for unequal sampling probabilities.

likelihood = "weighted"

The weighted likelihood uses inverse sampling probabilities:

\[ \ell_S^{\mathrm{w}}(\beta) = \sum_{i \in S} \frac{1}{\pi_i} \left[ y_i x_i^\top\beta - \log\{1 + \exp(x_i^\top\beta)\} \right]. \]

This is the more general correction and can be useful as a baseline.

likelihood = "logOddsCorrection"

This is the default and the more specialized rare-event correction (Wang, Zhang, and Wang 2021). For negative sampling, the sampled conditional model can be written as

\[ P(Y_i = 1 \mid x_i, i \in S) = \frac{ \exp(x_i^\top\beta + a_i) }{ 1 + \exp(x_i^\top\beta + a_i) }, \]

where the offset is

\[ a_i = -\log(\pi_i^{-}). \]

This is exactly the offset used in the implementation for the non-uniform negative-sampling branch.

Method Summary

Component Setting
criterion optA
criterion optL
criterion LCC
criterion uniform
likelihood logOddsCorrection
likelihood weighted

The main design features to remember are:

Example Data

We simulate a rare-event logistic regression problem with \(N = 2 \times 10^4\) observations.

set.seed(2)
N <- 2 * 1e4
beta0 <- c(-6, -rep(0.5, 6))
d <- length(beta0) - 1
corr <- 0.5
sigmax <- corr ^ abs(outer(1:d, 1:d, "-"))
X <- MASS::mvrnorm(n = N, mu = rep(0, d), Sigma = sigmax)
Y <- rbinom(N, 1, 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1])))
n.plt <- 200
n.ssp <- 600
data <- as.data.frame(cbind(Y, X))
colnames(data) <- c("Y", paste0("V", 1:ncol(X)))
formula <- Y ~ .
print(paste("N:", N))
#> [1] "N: 20000"
print(paste("sum(Y):", sum(Y)))
#> [1] "sum(Y): 266"

Example 1: Default Rare-Event Fit

The default recommendation is criterion = "optL" with likelihood = "logOddsCorrection".

set.seed(11)

fit_default <- ssp.relogit(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  criterion = "optL",
  likelihood = "logOddsCorrection"
)

summary(fit_default)
#> Model Summary
#> 
#> 
#> Call:
#> 
#> ssp.relogit(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     criterion = "optL", likelihood = "logOddsCorrection")
#> 
#> Subsample Size:
#>                                 
#> 1       Total Sample Size  20000
#> 2 Expected Subsample Size    866
#> 3   Actual Subsample Size    863
#> 4   Unique Subsample Size    863
#> 5  Expected Subample Rate  4.33%
#> 6    Actual Subample Rate 4.315%
#> 7    Unique Subample Rate 4.315%
#> 
#> Sample Composition (Logistic Regression):
#>     Sample  Size Y_count Y_rate
#>  Subsample   863     266 30.82%
#>  Full data 20000     266  1.33%
#> 
#> Coefficients:
#> 
#>           Estimate Std. Error  z value Pr(>|z|)
#> Intercept  -5.9076     0.1424 -41.4856  <0.0001
#> V1         -0.4545     0.0849  -5.3541  <0.0001
#> V2         -0.5072     0.0925  -5.4858  <0.0001
#> V3         -0.5110     0.0886  -5.7699  <0.0001
#> V4         -0.4335     0.0931  -4.6563  <0.0001
#> V5         -0.5916     0.0905  -6.5400  <0.0001
#> V6         -0.4641     0.0817  -5.6832  <0.0001

Example 2: A-Optimal Criterion

If you want the A-optimal criterion, you can switch criterion while keeping the same rare-event correction.

set.seed(2)

fit_optA <- ssp.relogit(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  criterion = "optA",
  likelihood = "logOddsCorrection"
)

summary(fit_optA)
#> Model Summary
#> 
#> 
#> Call:
#> 
#> ssp.relogit(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     criterion = "optA", likelihood = "logOddsCorrection")
#> 
#> Subsample Size:
#>                                 
#> 1       Total Sample Size  20000
#> 2 Expected Subsample Size    866
#> 3   Actual Subsample Size    879
#> 4   Unique Subsample Size    879
#> 5  Expected Subample Rate  4.33%
#> 6    Actual Subample Rate 4.395%
#> 7    Unique Subample Rate 4.395%
#> 
#> Sample Composition (Logistic Regression):
#>     Sample  Size Y_count Y_rate
#>  Subsample   879     266 30.26%
#>  Full data 20000     266  1.33%
#> 
#> Coefficients:
#> 
#>           Estimate Std. Error  z value Pr(>|z|)
#> Intercept  -5.8214     0.1345 -43.2958  <0.0001
#> V1         -0.3599     0.0814  -4.4214  <0.0001
#> V2         -0.6198     0.0954  -6.4991  <0.0001
#> V3         -0.4419     0.0983  -4.4971  <0.0001
#> V4         -0.4006     0.0921  -4.3498  <0.0001
#> V5         -0.5694     0.0952  -5.9838  <0.0001
#> V6         -0.4913     0.0833  -5.8987  <0.0001

Example 3: Uniform Baseline

The uniform branch still keeps all positives and samples negatives uniformly.

set.seed(9)

fit_uniform <- ssp.relogit(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  criterion = "uniform",
  likelihood = "logOddsCorrection"
)

fit_uniform$subsample.size.expect
#> [1] 866
length(fit_uniform$index)
#> [1] 872
summary(fit_uniform)
#> Model Summary
#> 
#> 
#> Call:
#> 
#> ssp.relogit(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     criterion = "uniform", likelihood = "logOddsCorrection")
#> 
#> Subsample Size:
#>                                
#> 1       Total Sample Size 20000
#> 2 Expected Subsample Size   866
#> 3   Actual Subsample Size   872
#> 4   Unique Subsample Size   872
#> 5  Expected Subample Rate 4.33%
#> 6    Actual Subample Rate 4.36%
#> 7    Unique Subample Rate 4.36%
#> 
#> Sample Composition (Logistic Regression):
#>     Sample  Size Y_count Y_rate
#>  Subsample   872     266  30.5%
#>  Full data 20000     266  1.33%
#> 
#> Coefficients:
#> 
#>           Estimate Std. Error  z value Pr(>|z|)
#> Intercept  -5.7572     0.1591 -36.1775  <0.0001
#> V1         -0.4193     0.1127  -3.7198   0.0002
#> V2         -0.5637     0.1357  -4.1532  <0.0001
#> V3         -0.3973     0.1294  -3.0707   0.0021
#> V4         -0.4071     0.1190  -3.4201   0.0006
#> V5         -0.5804     0.1343  -4.3221  <0.0001
#> V6         -0.4144     0.1159  -3.5769   0.0003

Returned Object

The returned object stores the fitted estimators, covariance estimates, and row indices of the selected observations.

names(fit_default)
#>  [1] "model.call"            "coef.plt"              "coef.ssp"             
#>  [4] "coef"                  "cov.ssp"               "cov"                  
#>  [7] "index.plt"             "index"                 "N"                    
#> [10] "N1"                    "N0"                    "Y.count.ssp"          
#> [13] "subsample.size.expect" "terms"

Important components include:

The coefficients and standard errors printed by summary() are based on coef and sqrt(diag(cov)).

References

Wang, HaiYing, Aonan Zhang, and Chong Wang. 2021. “Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data.” Advances in Neural Information Processing Systems 34: 19847–59.