Algorithmic Structure of Optimal Subsampling

This vignette describes the common algorithmic structure used by the optimal subsampling methods in subsampling.

The exact model, score, and objective function differ across functions, but the optimal subsampling workflow in this package can be viewed as the following procedure.

The steps below explain the same structure used across the models in this package.

Full-Data Problem

Suppose the full data are

\[ \{z_i: i = 1,\ldots,N\}, \]

where \(z_i\) may contain a response and covariates. Let \(\theta\) denote the model parameter. The full-data estimator can often be written as an optimization problem of the form

\[ \widehat{\theta}_{\mathrm{full}} = \arg\min_\theta \frac{1}{N}\sum_{i=1}^N \ell_i(\theta). \]

When \(N\) is large, solving the full-data problem may be computationally expensive. The goal of subsampling is to draw a smaller informative subsample that balances computational efficiency and statistical efficiency.

Step 1: Pilot Estimation

Optimal subsampling probabilities usually depend on the unknown model parameter and on quantities that would be expensive to calculate on the full data. The pilot sample provides an initial value or a cheaper approximation to these quantities.

Let \(S_{\mathrm{plt}}\) be the pilot sample and let \(\pi_i^{\mathrm{plt}}\) be the pilot sampling probability. A generic weighted pilot estimator is

\[ \widehat{\theta}_{\mathrm{plt}} = \arg\min_\theta \sum_{i \in S_{\mathrm{plt}}} \frac{1}{\pi_i^{\mathrm{plt}}}\ell_i(\theta). \]

Some models use a specialized pilot design. For example, ssp.relogit() draws a case-control pilot sample from positive and negative classes, then applies an intercept correction. ssp.glm.rF() can use balanced pilot sampling to preserve rare binary features. The purpose is the same: obtain a useful first estimate or approximation without fitting the model on all \(N\) observations.

Step 2: Calculate Raw Importance Scores

The pilot estimator is used to assign each observation a raw importance score, say

\[ m_i = m(z_i; \widehat{\theta}_{\mathrm{plt}}). \]

This score measures relative importance, not a valid sampling probability. A large \(m_i\) means observation \(i\) is more useful for the target estimation goal.

For example, in ssp.glm() with logistic regression,

\[ m_i^{\mathrm{optL}} = |y_i - \widehat{p}_i|\,\lVert x_i\rVert. \]

Other models use different scores, but the role of the score is the same: it summarizes how important each observation is for constructing the second-step subsampling probabilities.

Step 3: Construct Probabilities and Draw the Second-Step Subsample

Raw scores are transformed so that they become valid sampling probabilities. Once these probabilities are available, the second-step subsample is drawn directly from the corresponding sampling design.

The package uses two main sampling procedures: sampling with replacement and Poisson subsampling. We use \(\pi_i\) for both mechanisms, but the meaning is different.

For sampling with replacement, a common drawing probability is

\[ \pi_i = (1-\alpha)\frac{m_i}{\sum_{j=1}^N m_j} + \alpha\frac{1}{N}, \]

where \(\alpha \in [0,1]\) mixes the optimal probabilities with uniform probabilities. These probabilities satisfy

\[ \sum_{i=1}^N \pi_i = 1. \]

Here \(\pi_i\) is the probability that observation \(i\) is selected on one draw. Sampling with replacement draws exactly n.ssp times:

\[ I_1,\ldots,I_{n_{\mathrm{ssp}}} \sim \mathrm{Multinomial}(\pi_1,\ldots,\pi_N). \]

Repeated row indices are allowed.

For Poisson subsampling, \(\pi_i\) is an inclusion probability for observation \(i\). The raw score may be truncated as

\[ m_i^* = \min(m_i, H), \]

and then transformed into

\[ \pi_i = \min\left[ n_{\mathrm{ssp}} \left\{ (1-\alpha)\frac{m_i^*}{\widehat{\Phi}} + \alpha\frac{1}{N} \right\}, 1 \right]. \]

Here \(H\) is a truncation threshold and \(\widehat{\Phi}\) is a pilot-based normalizing quantity. The probabilities are designed so that

\[ \sum_{i=1}^N \pi_i \approx n_{\mathrm{ssp}}, \]

up to truncation at 1 and pilot approximation error. Poisson subsampling draws each observation independently:

\[ \delta_i \sim \mathrm{Bernoulli}(\pi_i), \]

and

\[ S_{\mathrm{ssp}} = \{i: \delta_i = 1\}. \]

There is no replacement under Poisson subsampling because each observation is included at most once. The actual sample size can differ from the designed expected size:

\[ |S_{\mathrm{ssp}}| = \sum_{i=1}^N \delta_i. \]

Step 4: Fit a Corrected Subsample Objective Function

Nonuniform subsampling changes the objective. A corrected objective is required to remove the bias introduced by informative sampling.

The most general correction is inverse-probability weighting:

\[ \widehat{\theta}_{\mathrm{ssp}} = \arg\min_\theta \sum_{i \in S_{\mathrm{ssp}}} \frac{1}{\pi_i}\ell_i(\theta). \]

For with-replacement sampling, the denominator should be read as the drawing probability. For Poisson subsampling, the denominator is the inclusion probability.

Some models support other objectives instead of a weighted objective. For logistic regression with Poisson subsampling, ssp.glm() supports logOddsCorrection. For rare-event logistic regression, ssp.relogit() uses a negative-sampling log-odds correction by default. For softmax regression, ssp.softmax() supports MSCLE, a sampled conditional likelihood.

The implementation details differ by model, but the role is the same: fit the model on the subsample using a loss that accounts for the sampling design.

Step 5: Optional Combination

Some functions combine the pilot estimator and the subsample estimator. The pilot sample has already been computed, so it can still contribute information. A typical combination has the form

\[ \widehat{\theta}_{\mathrm{cmb}} = \left(\widehat{I}_{\mathrm{plt}} + \widehat{I}_{\mathrm{ssp}}\right)^{-1} \left( \widehat{I}_{\mathrm{plt}}\widehat{\theta}_{\mathrm{plt}} + \widehat{I}_{\mathrm{ssp}}\widehat{\theta}_{\mathrm{ssp}} \right), \]

where \(\widehat{I}_{\mathrm{plt}}\) and \(\widehat{I}_{\mathrm{ssp}}\) are estimated information matrices.

Not every function uses this step in the same way. For example, ssp.glm(), ssp.relogit(), and ssp.softmax() combine pilot and subsample estimators, while ssp.quantreg() uses repeated subsampling to estimate covariance for the subsample estimator.

Function Map

The table below maps package-level functions to the helper functions that implement the common algorithmic pattern.

Function	Pilot	Probability and draw	Subsample fit	Combination
`ssp.glm()`	`pilot.estimate()`	`subsampling()`	`subsample.estimate()`	`combining()`
`ssp.relogit()`	`rare.pilot.estimate()`	`rare.subsampling()`	`rare.subsample.estimate()`	`rare.combining()`
`ssp.softmax()`	`softmax.plt.estimate()`	`softmax.subsampling()`	`softmax.subsample.estimate()`	`softmax.combining()`
`ssp.quantreg()`	`plt.estimation.quantreg()`	`ssp.estimation.quantreg()`	`ssp.estimation.quantreg()`	Not used directly
`ssp.glm.rF()`	`rF.pilot.estimate()`	`rF.subsampling()`	`rF.subsample.estimate()`	`rF.combining()`

Checking Points

When adapting the package structure for a new method, the following checks are useful.

Check the scale of the probabilities. For with-replacement sampling, \(\sum_i \pi_i\) should be 1. For Poisson subsampling, \(\sum_i \pi_i\) should be close to the designed expected subsample size, after accounting for any specific constraints.
For Poisson subsampling, check whether the actual sample size is far from the expected size. Some variation is natural. If the actual size is repeatedly far too small or too large, the probability normalization, truncation, or constraint logic may be wrong.
In the combination step, sample size matters. A larger second-step subsample usually pulls the combined estimator toward the second-step estimator because the combination is an information-weighted average.
Check the corrected loss against the sampling design. If the subsample was drawn with unequal probabilities, the fitted loss should use the corresponding weights, offsets, or conditional likelihood correction.
Check the pilot sample before trusting the second-step probabilities. If the pilot estimator is unstable, the raw importance scores can be unstable even when the probability formulas are coded correctly.