ssp.glm: Optimal Subsampling for Generalized Linear Models

This vignette introduces ssp.glm(), the main function for optimal subsampling with generalized linear models (GLMs). We use logistic regression as the main example because it is the most common use case and because all three key choices in ssp.glm()criterion, sampling.method, and likelihood–are available for logistic models.

library(subsampling)

Model and Objective

Let \((y_i, x_i)\), \(i = 1, \ldots, N\), denote the full data, where \(x_i\) includes the intercept. For a GLM with canonical parameter \(\theta_i = x_i^\top\beta\), the full-data objective can be written as

\[ \ell_N(\beta) = \frac{1}{N}\sum_{i=1}^N \left\{y_i x_i^\top\beta - \psi(x_i^\top\beta)\right\}. \]

For logistic regression,

\[ p_i(\beta) = P(Y_i = 1 \mid x_i) = \psi'(x_i^\top\beta) = \frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)}, \]

where \(\psi(t) = \log\{1 + \exp(t)\}\). The objective becomes

\[ \ell_N(\beta) = \frac{1}{N}\sum_{i=1}^N \left[ y_i x_i^\top\beta - \log\{1 + \exp(x_i^\top\beta)\} \right]. \]

Fitting this objective on all \(N\) observations may be expensive when \(N\) is large. Subsampling replaces the full-data fit with a smaller, informative subsample. The central question is how to assign a sampling probability to each observation so that the resulting subsample estimator is statistically efficient.

Two Ways ssp.glm() Works

Most choices of criterion use a two-stage procedure.

  1. Draw a pilot subsample of size n.plt.
  2. Fit a pilot estimator \(\widehat{\beta}\).
  3. Use \(\widehat{\beta}\) to compute second-stage subsampling probabilities.
  4. Draw the second-stage subsample with expected size n.ssp.
  5. Fit the GLM on the second-stage subsample and form an information-weighted average of the pilot and second-stage estimators.

This two-stage path is used when

criterion %in% c("optA", "optL", "LCC")

The exception is criterion = "uniform". In that case, no pilot estimator is needed. Uniform means that all \(N\) observations have the same sampling probability. The function draws one uniform subsample of size n.plt + n.ssp, and the returned subsample.size.expect is therefore n.plt + n.ssp rather than n.ssp.

Pilot Estimator

The pilot sample provides an initial estimate of the unknown parameter. Let \(S_{\mathrm{plt}}\) denote the pilot sample and let \(\pi_i^{\mathrm{plt}}\) be the pilot sampling probability for observation \(i\). The pilot estimator is obtained from the weighted pilot objective

\[ \widehat{\beta} = \arg\max_\beta \sum_{i \in S_{\mathrm{plt}}} \frac{1}{\pi_i^{\mathrm{plt}}} \left\{ y_i x_i^\top\beta - \psi(x_i^\top\beta) \right\}. \]

The fitted mean from this pilot estimator is

\[ \widehat{p}_i = p_i(\widehat{\beta}),\qquad i = 1,\ldots,N. \]

This fitted mean appears in the raw importance scores used below.

For binomial and quasibinomial families, ssp.glm() draws the pilot sample by case-control sampling with replacement:

\[ \pi_i^{\mathrm{plt}} = \begin{cases} \dfrac{1}{2N_1}, & y_i = 1,\\[6pt] \dfrac{1}{2N_0}, & y_i = 0, \end{cases} \]

where \(N_1 = \sum_i y_i\) and \(N_0 = N - N_1\). The pilot model is then fit by weighted likelihood. For other GLM families, the pilot sample is drawn uniformly with replacement.

Choosing criterion

The criterion argument controls the raw importance score \(m_i\) assigned to each observation. These scores are not probabilities yet. They measure the relative importance of observations, and then ssp.glm() turns them into valid sampling probabilities by normalization for with-replacement sampling, or by truncation and scaling for Poisson subsampling. For logistic regression, the common factor is the pilot residual magnitude \(|y_i - \widehat{p}_i|\).

criterion = "optA"

The A-optimal criterion uses

\[ m_i = |y_i - \widehat{p}_i| \left\lVert \widehat{M}^{-1} x_i \right\rVert, \]

where \(\widehat{M}\) is the pilot estimate of the information matrix. This criterion aims to minimize the asymptotic variance of the subsample estimator (Wang, Zhu, and Ma 2018; Wang 2019). It targets smaller asymptotic variance but is computationally heavier because it uses \(\widehat{M}^{-1}\).

criterion = "optL"

The L-optimal criterion uses

\[ m_i = |y_i - \widehat{p}_i| \, \lVert x_i \rVert. \]

This is the default. It keeps the main idea of assigning larger scores to observations with large pilot residuals and large covariate magnitude, but it is computationally cheaper than optA because it removes the matrix \(\widehat{M}^{-1}\) before \(x_i\) inside the norm.

criterion = "LCC"

Local case-control sampling uses

\[ m_i = |y_i - \widehat{p}_i|. \]

It favors observations whose responses are surprising under the pilot model. This is a simple baseline related to local case-control sampling (Fithian and Hastie 2014).

criterion = "uniform"

Uniform sampling assigns equal probability to all observations and does not use the pilot stage:

\[ m_i = 1,\qquad i = 1,\ldots,N. \]

This is useful as a baseline because it separates the effect of subsample size from the effect of optimal probability construction.

Choosing sampling.method

After \(m_i\) is computed, sampling.method controls how the actual subsample is drawn.

sampling.method = "withReplacement"

With replacement sampling draws exactly n.ssp observations from the full data, allowing repeated indices. The sampling probability is

\[ p_i = (1-\alpha)\frac{m_i}{\sum_{j=1}^N m_j} + \alpha\frac{1}{N}, \]

where alpha is a small mixing weight controlled by control = list(alpha = ...). The default is alpha = 0.01. This mixture prevents extremely small sampling probabilities.

sampling.method = "poisson"

Poisson subsampling includes each observation independently. The realized subsample size is random, but its expectation is controlled by n.ssp.

The implementation first truncates large scores,

\[ m_i^* = \min(m_i, H), \]

where \(H\) is estimated from the pilot sample and depends on the control parameter b. Then the inclusion probability is

\[ \pi_i = \min\left[ n_{\mathrm{ssp}} \left\{ (1-\alpha)\frac{m_i^*}{\widehat\Phi} + \alpha\frac{1}{N} \right\}, 1 \right], \]

where \(\widehat\Phi\) estimates the full-data total of the truncated score. The parameter b controls the truncation level; smaller values push the sampling probabilities closer to uniform.

Choosing likelihood

The likelihood argument controls how the model fit corrects for unequal subsampling probabilities.

likelihood = "weighted"

The weighted likelihood is the general-purpose option. For a subsample \(S\), let \(\pi_i\) denote the relevant sampling probability: the drawing probability for with-replacement sampling or the inclusion probability for Poisson subsampling. The weighted objective is

\[ \ell_S^{\mathrm{w}}(\beta) = \sum_{i \in S} \frac{1}{\pi_i} \left[ y_i x_i^\top\beta - \psi(x_i^\top\beta) \right]. \]

This likelihood works for all supported GLM families and both sampling methods.

likelihood = "logOddsCorrection"

The log odds correction likelihood is a specialized option for logistic regression with Poisson subsampling (Wang and Kim 2022). Instead of fitting a weighted likelihood, it fits a sampled conditional likelihood with an offset. For an observation in the Poisson subsample,

\[ P(Y_i = 1 \mid x_i, i \in S) = \frac{ \exp(x_i^\top\beta + a_i) }{ 1 + \exp(x_i^\top\beta + a_i) }, \]

where

\[ a_i = \log\left(\frac{\pi_i^{(1)}}{\pi_i^{(0)}}\right). \]

Here \(\pi_i^{(1)}\) and \(\pi_i^{(0)}\) are the inclusion probabilities that would be assigned to observation \(i\) if its response were 1 or 0, respectively, with the same covariates and pilot fit. In ssp.glm(), this option is implemented only for binomial or quasibinomial families with sampling.method = "poisson".

Method Summary

Component Setting
criterion optA
criterion optL
criterion LCC
criterion uniform
sampling.method withReplacement
sampling.method poisson
likelihood weighted
likelihood logOddsCorrection

The main restrictions are:

Example Data

We simulate a logistic regression problem with \(N = 10^4\) observations and six covariates.

set.seed(1)
N <- 1e4
beta0 <- rep(-0.5, 7)
d <- length(beta0) - 1
corr <- 0.5
sigmax <- matrix(corr, d, d) + diag(1 - corr, d)
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
colnames(X) <- paste0("V", 1:d)
P <- 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1]))
Y <- rbinom(N, 1, P)
data <- data.frame(Y = Y, X)
formula <- Y ~ .
head(data)
#>   Y         V1         V2          V3          V4          V5         V6
#> 1 1 -1.0918680 -0.4462684 -0.02250989 -0.19626329 -0.67460551 -0.4392570
#> 2 0 -0.1591053 -0.4748068  0.46515238  0.88370061 -0.05910325  0.1857218
#> 3 1 -1.6260754 -0.3394421 -0.68490712 -0.55721107  0.01024563 -0.6319413
#> 4 0  0.1251949  1.5113247  1.38931519  1.24287417  2.48829727  0.5534888
#> 5 0  0.1931921 -0.1478401 -0.14788926  0.46973556  0.05205022  1.0907459
#> 6 0 -0.2560258 -1.6065024  0.32710042 -0.04590727 -0.94748664 -1.2310368

Example 1: Default Optimal Subsampling

The default choice is criterion = "optL" with Poisson subsampling and weighted likelihood. This is a good starting point for large logistic regression problems.

set.seed(2)
n.plt <- 200
n.ssp <- 600

fit_optL <- ssp.glm(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "optL",
  sampling.method = "poisson",
  likelihood = "weighted"
)

summary(fit_optL)
#> Model Summary
#> 
#> Call:
#> 
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     family = "quasibinomial", criterion = "optL", sampling.method = "poisson", 
#>     likelihood = "weighted")
#> 
#> Subsample Size:
#>                                
#> 1       Total Sample Size 10000
#> 2 Expected Subsample Size   600
#> 3   Actual Subsample Size   600
#> 4   Unique Subsample Size   600
#> 5  Expected Subample Rate    6%
#> 6    Actual Subample Rate    6%
#> 7    Unique Subample Rate    6%
#> 
#> Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept  -0.4849     0.0991 -4.8955  <0.0001
#> V1         -0.6951     0.1325 -5.2468  <0.0001
#> V2         -0.4648     0.1168 -3.9777  <0.0001
#> V3         -0.3502     0.1121 -3.1249   0.0018
#> V4         -0.4941     0.1177 -4.1987  <0.0001
#> V5         -0.4792     0.1252 -3.8264   0.0001
#> V6         -0.5693     0.1175 -4.8468  <0.0001

Example 2: Logistic Poisson Correction

For logistic regression with Poisson subsampling, logOddsCorrection uses the sampled conditional likelihood described above.

set.seed(18)

fit_logodds <- ssp.glm(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "optA",
  sampling.method = "poisson",
  likelihood = "logOddsCorrection"
)

summary(fit_logodds)
#> Model Summary
#> 
#> Call:
#> 
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     family = "quasibinomial", criterion = "optA", sampling.method = "poisson", 
#>     likelihood = "logOddsCorrection")
#> 
#> Subsample Size:
#>                                
#> 1       Total Sample Size 10000
#> 2 Expected Subsample Size   600
#> 3   Actual Subsample Size   611
#> 4   Unique Subsample Size   611
#> 5  Expected Subample Rate    6%
#> 6    Actual Subample Rate 6.11%
#> 7    Unique Subample Rate 6.11%
#> 
#> Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept  -0.4752     0.0890 -5.3368  <0.0001
#> V1         -0.4172     0.1089 -3.8291   0.0001
#> V2         -0.4723     0.1069 -4.4188  <0.0001
#> V3         -0.4716     0.1123 -4.2010  <0.0001
#> V4         -0.6337     0.1046 -6.0565  <0.0001
#> V5         -0.4335     0.1134 -3.8219   0.0001
#> V6         -0.5808     0.1090 -5.3279  <0.0001

Example 3: Uniform Baseline

Uniform sampling is useful as a baseline. Notice that the expected subsample size is n.plt + n.ssp, because the pilot and second-stage sizes are combined into one uniform draw.

set.seed(4)

fit_uniform <- ssp.glm(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "uniform",
  sampling.method = "withReplacement",
  likelihood = "weighted"
)

fit_uniform$subsample.size.expect
#> [1] 800
length(fit_uniform$index)
#> [1] 800
summary(fit_uniform)
#> Model Summary
#> 
#> Call:
#> 
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     family = "quasibinomial", criterion = "uniform", sampling.method = "withReplacement", 
#>     likelihood = "weighted")
#> 
#> Subsample Size:
#>                                
#> 1       Total Sample Size 10000
#> 2 Expected Subsample Size   800
#> 3   Actual Subsample Size   800
#> 4   Unique Subsample Size   771
#> 5  Expected Subample Rate    8%
#> 6    Actual Subample Rate    8%
#> 7    Unique Subample Rate 7.71%
#> 
#> Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept  -0.3727     0.1080 -3.4507   0.0006
#> V1         -0.3866     0.1303 -2.9674   0.0030
#> V2         -0.6510     0.1414 -4.6031  <0.0001
#> V3         -0.6600     0.1374 -4.8034  <0.0001
#> V4         -0.4579     0.1363 -3.3587   0.0008
#> V5         -0.3745     0.1381 -2.7115   0.0067
#> V6         -0.6519     0.1352 -4.8229  <0.0001

Returned Object

The returned object stores the model call, coefficient estimates, covariance estimates, and row indices of the drawn samples.

names(fit_optL)
#>  [1] "model.call"            "coef.plt"              "coef.ssp"             
#>  [4] "coef"                  "cov.plt"               "cov.ssp"              
#>  [7] "cov"                   "index.plt"             "index"                
#> [10] "N"                     "subsample.size.expect" "comp.time"            
#> [13] "terms"

Important components include:

The coefficients and standard errors printed by summary() are based on coef and sqrt(diag(cov)).

Other Families

ssp.glm() also supports Poisson, quasipoisson, Gaussian, and Gamma GLMs with the weighted likelihood. The logOddsCorrection likelihood is specific to logistic regression and should be used only with binomial or quasibinomial families and Poisson subsampling.

For binomial and Poisson models, survey::svyglm() recommends the quasi families, such as quasibinomial() and quasipoisson(), to avoid warnings about non-integer successes after weighting. The quasi families give the same point estimates for these models.

References

Fithian, William, and Trevor Hastie. 2014. “Local Case-Control Sampling: Efficient Subsampling in Imbalanced Data Sets.” Annals of Statistics 42 (5): 1693.
Wang, HaiYing. 2019. “More Efficient Estimation for Logistic Regression with Optimal Subsamples.” Journal of Machine Learning Research 20 (132): 1–59.
Wang, HaiYing, and Jae Kwang Kim. 2022. “Maximum Sampled Conditional Likelihood for Informative Subsampling.” Journal of Machine Learning Research 23 (332): 1–50.
Wang, HaiYing, Rong Zhu, and Ping Ma. 2018. “Optimal Subsampling for Large Sample Logistic Regression.” Journal of the American Statistical Association 113 (522): 829–44.