ssp.glm: Optimal Subsampling
for Generalized Linear ModelsThis vignette introduces ssp.glm(), the main function
for optimal subsampling with generalized linear models (GLMs). We use
logistic regression as the main example because it is the most common
use case and because all three key choices in
ssp.glm()–criterion,
sampling.method, and likelihood–are available
for logistic models.
Let \((y_i, x_i)\), \(i = 1, \ldots, N\), denote the full data, where \(x_i\) includes the intercept. For a GLM with canonical parameter \(\theta_i = x_i^\top\beta\), the full-data objective can be written as
\[ \ell_N(\beta) = \frac{1}{N}\sum_{i=1}^N \left\{y_i x_i^\top\beta - \psi(x_i^\top\beta)\right\}. \]
For logistic regression,
\[ p_i(\beta) = P(Y_i = 1 \mid x_i) = \psi'(x_i^\top\beta) = \frac{\exp(x_i^\top\beta)}{1 + \exp(x_i^\top\beta)}, \]
where \(\psi(t) = \log\{1 + \exp(t)\}\). The objective becomes
\[ \ell_N(\beta) = \frac{1}{N}\sum_{i=1}^N \left[ y_i x_i^\top\beta - \log\{1 + \exp(x_i^\top\beta)\} \right]. \]
Fitting this objective on all \(N\) observations may be expensive when \(N\) is large. Subsampling replaces the full-data fit with a smaller, informative subsample. The central question is how to assign a sampling probability to each observation so that the resulting subsample estimator is statistically efficient.
ssp.glm() WorksMost choices of criterion use a two-stage procedure.
n.plt.n.ssp.This two-stage path is used when
The exception is criterion = "uniform". In that case, no
pilot estimator is needed. Uniform means that all \(N\) observations have the same sampling
probability. The function draws one uniform subsample of size
n.plt + n.ssp, and the returned
subsample.size.expect is therefore
n.plt + n.ssp rather than n.ssp.
The pilot sample provides an initial estimate of the unknown parameter. Let \(S_{\mathrm{plt}}\) denote the pilot sample and let \(\pi_i^{\mathrm{plt}}\) be the pilot sampling probability for observation \(i\). The pilot estimator is obtained from the weighted pilot objective
\[ \widehat{\beta} = \arg\max_\beta \sum_{i \in S_{\mathrm{plt}}} \frac{1}{\pi_i^{\mathrm{plt}}} \left\{ y_i x_i^\top\beta - \psi(x_i^\top\beta) \right\}. \]
The fitted mean from this pilot estimator is
\[ \widehat{p}_i = p_i(\widehat{\beta}),\qquad i = 1,\ldots,N. \]
This fitted mean appears in the raw importance scores used below.
For binomial and quasibinomial families, ssp.glm() draws
the pilot sample by case-control sampling with replacement:
\[ \pi_i^{\mathrm{plt}} = \begin{cases} \dfrac{1}{2N_1}, & y_i = 1,\\[6pt] \dfrac{1}{2N_0}, & y_i = 0, \end{cases} \]
where \(N_1 = \sum_i y_i\) and \(N_0 = N - N_1\). The pilot model is then fit by weighted likelihood. For other GLM families, the pilot sample is drawn uniformly with replacement.
criterionThe criterion argument controls the raw importance score
\(m_i\) assigned to each observation.
These scores are not probabilities yet. They measure the relative
importance of observations, and then ssp.glm() turns them
into valid sampling probabilities by normalization for with-replacement
sampling, or by truncation and scaling for Poisson subsampling. For
logistic regression, the common factor is the pilot residual magnitude
\(|y_i - \widehat{p}_i|\).
criterion = "optA"The A-optimal criterion uses
\[ m_i = |y_i - \widehat{p}_i| \left\lVert \widehat{M}^{-1} x_i \right\rVert, \]
where \(\widehat{M}\) is the pilot estimate of the information matrix. This criterion aims to minimize the asymptotic variance of the subsample estimator (Wang, Zhu, and Ma 2018; Wang 2019). It targets smaller asymptotic variance but is computationally heavier because it uses \(\widehat{M}^{-1}\).
criterion = "optL"The L-optimal criterion uses
\[ m_i = |y_i - \widehat{p}_i| \, \lVert x_i \rVert. \]
This is the default. It keeps the main idea of assigning larger
scores to observations with large pilot residuals and large covariate
magnitude, but it is computationally cheaper than optA
because it removes the matrix \(\widehat{M}^{-1}\) before \(x_i\) inside the norm.
criterion = "LCC"Local case-control sampling uses
\[ m_i = |y_i - \widehat{p}_i|. \]
It favors observations whose responses are surprising under the pilot model. This is a simple baseline related to local case-control sampling (Fithian and Hastie 2014).
criterion = "uniform"Uniform sampling assigns equal probability to all observations and does not use the pilot stage:
\[ m_i = 1,\qquad i = 1,\ldots,N. \]
This is useful as a baseline because it separates the effect of subsample size from the effect of optimal probability construction.
sampling.methodAfter \(m_i\) is computed,
sampling.method controls how the actual subsample is
drawn.
sampling.method = "withReplacement"With replacement sampling draws exactly n.ssp
observations from the full data, allowing repeated indices. The sampling
probability is
\[ p_i = (1-\alpha)\frac{m_i}{\sum_{j=1}^N m_j} + \alpha\frac{1}{N}, \]
where alpha is a small mixing weight controlled by
control = list(alpha = ...). The default is
alpha = 0.01. This mixture prevents extremely small
sampling probabilities.
sampling.method = "poisson"Poisson subsampling includes each observation independently. The
realized subsample size is random, but its expectation is controlled by
n.ssp.
The implementation first truncates large scores,
\[ m_i^* = \min(m_i, H), \]
where \(H\) is estimated from the
pilot sample and depends on the control parameter b. Then
the inclusion probability is
\[ \pi_i = \min\left[ n_{\mathrm{ssp}} \left\{ (1-\alpha)\frac{m_i^*}{\widehat\Phi} + \alpha\frac{1}{N} \right\}, 1 \right], \]
where \(\widehat\Phi\) estimates the
full-data total of the truncated score. The parameter b
controls the truncation level; smaller values push the sampling
probabilities closer to uniform.
likelihoodThe likelihood argument controls how the model fit
corrects for unequal subsampling probabilities.
likelihood = "weighted"The weighted likelihood is the general-purpose option. For a subsample \(S\), let \(\pi_i\) denote the relevant sampling probability: the drawing probability for with-replacement sampling or the inclusion probability for Poisson subsampling. The weighted objective is
\[ \ell_S^{\mathrm{w}}(\beta) = \sum_{i \in S} \frac{1}{\pi_i} \left[ y_i x_i^\top\beta - \psi(x_i^\top\beta) \right]. \]
This likelihood works for all supported GLM families and both sampling methods.
likelihood = "logOddsCorrection"The log odds correction likelihood is a specialized option for logistic regression with Poisson subsampling (Wang and Kim 2022). Instead of fitting a weighted likelihood, it fits a sampled conditional likelihood with an offset. For an observation in the Poisson subsample,
\[ P(Y_i = 1 \mid x_i, i \in S) = \frac{ \exp(x_i^\top\beta + a_i) }{ 1 + \exp(x_i^\top\beta + a_i) }, \]
where
\[ a_i = \log\left(\frac{\pi_i^{(1)}}{\pi_i^{(0)}}\right). \]
Here \(\pi_i^{(1)}\) and \(\pi_i^{(0)}\) are the inclusion
probabilities that would be assigned to observation \(i\) if its response were 1 or 0,
respectively, with the same covariates and pilot fit. In
ssp.glm(), this option is implemented only for binomial or
quasibinomial families with
sampling.method = "poisson".
| Component | Setting |
|---|---|
criterion |
optA |
criterion |
optL |
criterion |
LCC |
criterion |
uniform |
sampling.method |
withReplacement |
sampling.method |
poisson |
likelihood |
weighted |
likelihood |
logOddsCorrection |
The main restrictions are:
logOddsCorrection is implemented only for binomial and
quasibinomial families.logOddsCorrection requires
sampling.method = "poisson".criterion = "uniform" uses a one-stage uniform sample
of size n.plt + n.ssp.We simulate a logistic regression problem with \(N = 10^4\) observations and six covariates.
set.seed(1)
N <- 1e4
beta0 <- rep(-0.5, 7)
d <- length(beta0) - 1
corr <- 0.5
sigmax <- matrix(corr, d, d) + diag(1 - corr, d)
X <- MASS::mvrnorm(N, rep(0, d), sigmax)
colnames(X) <- paste0("V", 1:d)
P <- 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1]))
Y <- rbinom(N, 1, P)
data <- data.frame(Y = Y, X)
formula <- Y ~ .
head(data)
#> Y V1 V2 V3 V4 V5 V6
#> 1 1 -1.0918680 -0.4462684 -0.02250989 -0.19626329 -0.67460551 -0.4392570
#> 2 0 -0.1591053 -0.4748068 0.46515238 0.88370061 -0.05910325 0.1857218
#> 3 1 -1.6260754 -0.3394421 -0.68490712 -0.55721107 0.01024563 -0.6319413
#> 4 0 0.1251949 1.5113247 1.38931519 1.24287417 2.48829727 0.5534888
#> 5 0 0.1931921 -0.1478401 -0.14788926 0.46973556 0.05205022 1.0907459
#> 6 0 -0.2560258 -1.6065024 0.32710042 -0.04590727 -0.94748664 -1.2310368The default choice is criterion = "optL" with Poisson
subsampling and weighted likelihood. This is a good starting point for
large logistic regression problems.
set.seed(2)
n.plt <- 200
n.ssp <- 600
fit_optL <- ssp.glm(
formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = "quasibinomial",
criterion = "optL",
sampling.method = "poisson",
likelihood = "weighted"
)
summary(fit_optL)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "optL", sampling.method = "poisson",
#> likelihood = "weighted")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 600
#> 3 Actual Subsample Size 600
#> 4 Unique Subsample Size 600
#> 5 Expected Subample Rate 6%
#> 6 Actual Subample Rate 6%
#> 7 Unique Subample Rate 6%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept -0.4849 0.0991 -4.8955 <0.0001
#> V1 -0.6951 0.1325 -5.2468 <0.0001
#> V2 -0.4648 0.1168 -3.9777 <0.0001
#> V3 -0.3502 0.1121 -3.1249 0.0018
#> V4 -0.4941 0.1177 -4.1987 <0.0001
#> V5 -0.4792 0.1252 -3.8264 0.0001
#> V6 -0.5693 0.1175 -4.8468 <0.0001For logistic regression with Poisson subsampling,
logOddsCorrection uses the sampled conditional likelihood
described above.
set.seed(18)
fit_logodds <- ssp.glm(
formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = "quasibinomial",
criterion = "optA",
sampling.method = "poisson",
likelihood = "logOddsCorrection"
)
summary(fit_logodds)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "optA", sampling.method = "poisson",
#> likelihood = "logOddsCorrection")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 600
#> 3 Actual Subsample Size 611
#> 4 Unique Subsample Size 611
#> 5 Expected Subample Rate 6%
#> 6 Actual Subample Rate 6.11%
#> 7 Unique Subample Rate 6.11%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept -0.4752 0.0890 -5.3368 <0.0001
#> V1 -0.4172 0.1089 -3.8291 0.0001
#> V2 -0.4723 0.1069 -4.4188 <0.0001
#> V3 -0.4716 0.1123 -4.2010 <0.0001
#> V4 -0.6337 0.1046 -6.0565 <0.0001
#> V5 -0.4335 0.1134 -3.8219 0.0001
#> V6 -0.5808 0.1090 -5.3279 <0.0001Uniform sampling is useful as a baseline. Notice that the expected
subsample size is n.plt + n.ssp, because the pilot and
second-stage sizes are combined into one uniform draw.
set.seed(4)
fit_uniform <- ssp.glm(
formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = "quasibinomial",
criterion = "uniform",
sampling.method = "withReplacement",
likelihood = "weighted"
)
fit_uniform$subsample.size.expect
#> [1] 800
length(fit_uniform$index)
#> [1] 800
summary(fit_uniform)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "uniform", sampling.method = "withReplacement",
#> likelihood = "weighted")
#>
#> Subsample Size:
#>
#> 1 Total Sample Size 10000
#> 2 Expected Subsample Size 800
#> 3 Actual Subsample Size 800
#> 4 Unique Subsample Size 771
#> 5 Expected Subample Rate 8%
#> 6 Actual Subample Rate 8%
#> 7 Unique Subample Rate 7.71%
#>
#> Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept -0.3727 0.1080 -3.4507 0.0006
#> V1 -0.3866 0.1303 -2.9674 0.0030
#> V2 -0.6510 0.1414 -4.6031 <0.0001
#> V3 -0.6600 0.1374 -4.8034 <0.0001
#> V4 -0.4579 0.1363 -3.3587 0.0008
#> V5 -0.3745 0.1381 -2.7115 0.0067
#> V6 -0.6519 0.1352 -4.8229 <0.0001The returned object stores the model call, coefficient estimates, covariance estimates, and row indices of the drawn samples.
names(fit_optL)
#> [1] "model.call" "coef.plt" "coef.ssp"
#> [4] "coef" "cov.plt" "cov.ssp"
#> [7] "cov" "index.plt" "index"
#> [10] "N" "subsample.size.expect" "comp.time"
#> [13] "terms"Important components include:
coef.plt: the pilot estimator.coef.ssp: the estimator from the second-stage
subsample.coef: the combined estimator printed by
summary().cov.ssp: covariance estimate for
coef.ssp.cov: covariance estimate for coef.index.plt: row indices for the pilot subsample, when a
pilot sample is used.index: row indices for the final drawn subsample.subsample.size.expect: the expected subsample size
reported by the method.The coefficients and standard errors printed by
summary() are based on coef and
sqrt(diag(cov)).
ssp.glm() also supports Poisson, quasipoisson, Gaussian,
and Gamma GLMs with the weighted likelihood. The
logOddsCorrection likelihood is specific to logistic
regression and should be used only with binomial or quasibinomial
families and Poisson subsampling.
For binomial and Poisson models, survey::svyglm()
recommends the quasi families, such as quasibinomial() and
quasipoisson(), to avoid warnings about non-integer
successes after weighting. The quasi families give the same point
estimates for these models.