ssp.glm.rF: Balanced
Subsampling for Preserving Rare Features in Generalized Linear
ModelsThis vignette illustrates how to use ssp.glm.rF() for
generalized linear models with rare binary features. A rare feature is a
binary covariate that takes the value 1 in only a small fraction of
observations. Ordinary subsampling can miss many of these expressed
rare-feature observations, making estimation of the corresponding
coefficients unstable.
ssp.glm.rF() uses rarity-aware sampling probabilities
for one-step balance-score sampling, two-step optimal subsampling,
optional response balancing for binary outcomes, automatic rare-feature
detection, and a combined estimator fitted on the union of the pilot and
second-step subsamples.
We first simulate a logistic regression dataset with two rare binary
features and two continuous covariates. In the formula
Y ~ ., the model matrix contains an intercept column.
Numeric rareFeature.index values follow the original
covariate order supplied by the user; the function internally shifts the
indices to account for the intercept column.
The default criterion is "BL-Uni". It draws one Poisson
subsample with probabilities proportional to the rare-feature balance
score. For "BL-Uni" and "Uni", the expected
sample size is n.plt + n.ssp.
For observation \(i\), the balance score is \[ b(Z_i) = \sum_{j=1}^{d_r} \frac{|Z_{ij} - \bar{Z}_j|}{\bar{Z}_j}, \] where \(d_r\) is the number of rare features and \(\bar{Z}_j\) is the prevalence of the \(j\)th rare feature in the full data. Observations containing expressed rare features receive larger scores and therefore larger sampling probabilities.
fit_bl <- ssp.glm.rF(
formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = "quasibinomial",
criterion = "BL-Uni",
rareFeature.index = rareFeature.index
)
summary(fit_bl)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm.rF(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "BL-Uni", rareFeature.index = rareFeature.index)
#>
#> Design Summary:
#> Variable Value
#> Pilot Sample Size 1001
#> Expected Subsample Size 1000
#> Actual Subsample Size 1001
#> Unique Subsample Size 1001
#> Combined Union Size 1001
#> Overlap Count 1001
#> Overlap Rate in Pilot 100%
#> Overlap Rate in Subsample 100%
#> Expected Subsample Rate 20%
#> Actual Subsample Rate 20.02%
#> Unique Subsample Rate 20.02%
#>
#> Sample Composition (Logistic Regression):
#> Sample Size Y_count Y_rate Rare_row_rate
#> Pilot 1001 656 65.53% 49.15%
#> Subsample 1001 656 65.53% 49.15%
#> Combined union 1001 656 65.53% 49.15%
#> Full data 5000 3044 60.88% 11.32%
#>
#> Subsample Rare-Feature Summary:
#> Feature Subsample_count Subsample_rate Full_count Full_rate Coverage_of_full
#> Z1 212 21.18% 212 4.24% 100%
#> Z2 298 29.77% 372 7.44% 80.11%
#>
#> Subsample Rare-Pattern Distribution:
#> Rare_features_in_row Count Rate
#> 0_ones 509 50.85%
#> 1_ones 474 47.35%
#> 2_ones 18 1.8%
#>
#> Pilot Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.4701 0.0951 4.9413 <0.0001
#> Z1 0.5197 0.1816 2.8620 0.0042
#> Z2 0.4508 0.1600 2.8169 0.0048
#> X1 0.3368 0.0891 3.7818 0.0002
#> X2 0.5454 0.0859 6.3484 <0.0001
#>
#> Second-Step Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.4701 0.0951 4.9413 <0.0001
#> Z1 0.5197 0.1816 2.8620 0.0042
#> Z2 0.4508 0.1600 2.8169 0.0048
#> X1 0.3368 0.0891 3.7818 0.0002
#> X2 0.5454 0.0859 6.3484 <0.0001
#>
#> Combined-Union Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.4701 0.0951 4.9413 <0.0001
#> Z1 0.5197 0.1816 2.8620 0.0042
#> Z2 0.4508 0.1600 2.8169 0.0048
#> X1 0.3368 0.0891 3.7818 0.0002
#> X2 0.5454 0.0859 6.3484 <0.0001The summary reports the realized sample size, response composition, rare-feature coverage, and coefficient estimates. Because this is a one-step method, the pilot, subsample, and combined estimators are the same.
Two-step criteria such as "Lopt", "Aopt",
"R-Lopt", and "BL-Lopt" first draw a pilot
sample, fit a pilot estimator, compute second-step sampling
probabilities, and then refit the model on the second-step subsample.
The final combined estimator is fitted on the union of the pilot and
second-step samples.
For two-step methods, balance.X.plt = TRUE draws the
pilot sample using the balance score.
fit_rlopt <- ssp.glm.rF(
formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = "quasibinomial",
criterion = "R-Lopt",
balance.X.plt = TRUE,
rareFeature.index = c("Z1", "Z2")
)
summary(fit_rlopt)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm.rF(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "R-Lopt", balance.X.plt = TRUE,
#> rareFeature.index = c("Z1", "Z2"))
#>
#> Design Summary:
#> Variable Value
#> Pilot Sample Size 300
#> Expected Subsample Size 700
#> Actual Subsample Size 702
#> Unique Subsample Size 702
#> Combined Union Size 892
#> Overlap Count 110
#> Overlap Rate in Pilot 36.67%
#> Overlap Rate in Subsample 15.67%
#> Expected Subsample Rate 14%
#> Actual Subsample Rate 14.04%
#> Unique Subsample Rate 14.04%
#>
#> Sample Composition (Logistic Regression):
#> Sample Size Y_count Y_rate Rare_row_rate
#> Pilot 300 211 70.33% 55%
#> Subsample 702 332 47.29% 49%
#> Combined union 892 483 54.15% 45.74%
#> Full data 5000 3044 60.88% 11.32%
#>
#> Subsample Rare-Feature Summary:
#> Feature Subsample_count Subsample_rate Full_count Full_rate Coverage_of_full
#> Z1 150 21.37% 212 4.24% 70.75%
#> Z2 209 29.77% 372 7.44% 56.18%
#>
#> Subsample Rare-Pattern Distribution:
#> Rare_features_in_row Count Rate
#> 0_ones 358 51%
#> 1_ones 329 46.87%
#> 2_ones 15 2.14%
#>
#> Pilot Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.7663 0.1986 3.8579 0.0001
#> Z1 0.5044 0.3246 1.5541 0.1202
#> Z2 0.3788 0.3488 1.0860 0.2775
#> X1 0.5885 0.1784 3.2984 0.0010
#> X2 0.8452 0.1801 4.6936 <0.0001
#>
#> Second-Step Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.3045 0.1166 2.6118 0.0090
#> Z1 0.7609 0.2004 3.7969 0.0001
#> Z2 0.6093 0.1799 3.3874 0.0007
#> X1 0.5275 0.0927 5.6909 <0.0001
#> X2 0.5109 0.0974 5.2464 <0.0001
#>
#> Combined-Union Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.4040 0.0943 4.2854 <0.0001
#> Z1 0.6489 0.1860 3.4887 0.0005
#> Z2 0.6148 0.1625 3.7833 0.0002
#> X1 0.5406 0.0803 6.7342 <0.0001
#> X2 0.5595 0.0767 7.2953 <0.0001rareFeature.index = NULLIf rareFeature.index = NULL, the function searches for
binary columns whose prevalence is below rareThreshold.
For binary outcomes, balance.Y.ssp = TRUE applies a
case-control style allocation for one-step "Uni" and
"BL-Uni" methods. The option
balance.Y.all = TRUE includes all observations with
Y = 1 and subsamples from observations with
Y = 0.
fit_y_balanced <- ssp.glm.rF(
formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = "quasibinomial",
criterion = "BL-Uni",
balance.Y.ssp = TRUE,
rareFeature.index = c("Z1", "Z2")
)
c(
full_Y_rate = mean(data$Y),
subsample_Y_rate = fit_y_balanced$Y.proportion.ssp
)
#> full_Y_rate subsample_Y_rate
#> 0.6088000 0.5185557By default, sampled observations are fitted with inverse-probability
weights. For one-step methods whose sampling probabilities do not depend
on the response, objective.weight = "unweighted" can be
used. Two-step methods currently use a weighted second-step
objective.
The control argument accepts alpha,
b, and poi.method. The default
poi.method = "exact" computes Poisson probabilities using
full-data normalization. The alternative
poi.method = "estimated" uses the pilot sample to estimate
the normalizing quantity.
fit_estimated <- ssp.glm.rF(
formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = "quasibinomial",
criterion = "R-Lopt",
balance.X.plt = TRUE,
rareFeature.index = c("Z1", "Z2"),
control = list(poi.method = "estimated", b = 2),
record.stage.time = TRUE
)
fit_estimated$stage.time
#> balance_score pilot.estimate subsampling subsample.estimate
#> 0.001 0.001 0.001 0.001
#> combining.union result.assembly
#> 0.001 0.000The rare-feature machinery can also be used with non-binomial GLMs. Response balancing options are ignored for non-binary outcomes.
set.seed(3)
N_g <- 3000
Z <- rbinom(N_g, 1, 0.05)
X <- rnorm(N_g)
y <- 1 + 0.6 * Z + 0.2 * X + rnorm(N_g)
gaussian_data <- data.frame(y, Z, X)
fit_gaussian <- ssp.glm.rF(
y ~ .,
data = gaussian_data,
n.plt = 200,
n.ssp = 500,
family = "gaussian",
criterion = "BL-Uni",
rareFeature.index = "Z"
)
summary(fit_gaussian)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm.rF(formula = y ~ ., data = gaussian_data, n.plt = 200,
#> n.ssp = 500, family = "gaussian", criterion = "BL-Uni", rareFeature.index = "Z")
#>
#> Design Summary:
#> Variable Value
#> Pilot Sample Size 680
#> Expected Subsample Size 700
#> Actual Subsample Size 680
#> Unique Subsample Size 680
#> Combined Union Size 680
#> Overlap Count 680
#> Overlap Rate in Pilot 100%
#> Overlap Rate in Subsample 100%
#> Expected Subsample Rate 23.33%
#> Actual Subsample Rate 22.67%
#> Unique Subsample Rate 22.67%
#>
#> Sample Composition:
#> Sample Size Rare_row_rate
#> Pilot 680 22.21%
#> Subsample 680 22.21%
#> Combined union 680 22.21%
#> Full data 3000 5.03%
#>
#> Subsample Rare-Feature Summary:
#> Feature Subsample_count Subsample_rate Full_count Full_rate Coverage_of_full
#> Z 151 22.21% 151 5.0333% 100%
#>
#> Subsample Rare-Pattern Distribution:
#> Rare_features_in_row Count Rate
#> 0_ones 529 77.79%
#> 1_ones 151 22.21%
#>
#> Pilot Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.9475 0.0450 21.0475 <0.0001
#> Z 0.6655 0.1009 6.5962 <0.0001
#> X 0.2663 0.0417 6.3806 <0.0001
#>
#> Second-Step Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.9475 0.0450 21.0475 <0.0001
#> Z 0.6655 0.1009 6.5962 <0.0001
#> X 0.2663 0.0417 6.3806 <0.0001
#>
#> Combined-Union Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.9475 0.0450 21.0475 <0.0001
#> Z 0.6655 0.1009 6.5962 <0.0001
#> X 0.2663 0.0417 6.3806 <0.0001