PUMF data are a synthetic sample, altered from the original survey
responses for privacy reasons. As with any survey sample, point
estimates carry sampling uncertainty, and it is good practice to
quantify it. Some PUMF ship with bootstrap weights for exactly this
purpose; when they do not, canpumf can generate them
with add_bootstrap_weights().
This vignette explains how add_bootstrap_weights()
works: the method it uses, where the weights live, how stratification
works, and — importantly — what happens when you call it again on a
dataset that already has weights.
For each of n_replicates replicates, a sample of
n rows is drawn with replacement from the
n rows of the (stratum of the) survey. The bootstrap weight
for row \(i\) in replicate \(b\) is
\[ w^{*}_{i,b} = w_i \cdot c_{i,b}, \]
where \(w_i\) is the original survey weight and \(c_{i,b}\) is the number of times row \(i\) was selected in replicate \(b\). Because each row is selected once on average, \(\mathbb{E}[c_{i,b}] = 1\) and the replicate weights are unbiased for the original weights. The spread of an estimate across the replicates is what gives you its sampling variance (see Estimating uncertainty).
The default is n_replicates = 500. Pass
seed for reproducibility.
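The resampling step can be illustrated in a few lines of base R. This is a minimal sketch of the method described above, not canpumf's internal code; the weights, the seed, and the variable names are made up for the illustration.

```r
set.seed(1)                              # assumed seed, only for reproducibility
n <- 8                                   # toy sample size
B <- 2000                                # number of replicates
w <- c(10, 12, 9, 15, 11, 14, 8, 13)     # made-up survey weights

# counts[i, b]: how many times row i was drawn in replicate b
counts <- replicate(B, tabulate(sample.int(n, n, replace = TRUE), nbins = n))

# Replicate weights w*_{i,b} = w_i * c_{i,b} (w is recycled down each column)
bsw <- w * counts

colSums(counts)[1:5]   # every replicate draws exactly n rows
rowMeans(counts)       # close to 1 for every row
rowMeans(bsw) / w      # so the average replicate weight is close to w_i
```

Each column of counts sums to n, which is why the selection count averages to one per row.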
add_bootstrap_weights() works on two kinds of input and
behaves differently for each:

- A DuckDB-backed tbl returned by get_pumf() (the typical case). The
  replicate weights are written into the DuckDB file as a persistent,
  separate table and exposed through a VIEW that joins the survey table
  to the weights. The returned tbl points at that view. Because the
  weights are persisted, subsequent calls reuse them instead of
  recomputing.
- An in-memory data.frame / tibble (e.g. after collect()). The weights
  are generated in memory and the augmented data frame is returned.
  Nothing is persisted.

library(canpumf)

cis <- get_pumf("CIS", "2019")
cis_bsw <- cis |>
  add_bootstrap_weights(weight_col = "FWEIGHT", n_replicates = 200, seed = 42)
# 200 replicate columns CPBSW1 … CPBSW200 are now available
grep("^CPBSW", colnames(cis_bsw), value = TRUE) |> head()
#> [1] "CPBSW1" "CPBSW2" "CPBSW3" "CPBSW4" "CPBSW5" "CPBSW6"

The replicate weights are not copied into the main survey table. Instead:
- They are stored in a separate table named
  paste0("pumf_bsw_", tolower(weight_col)) (e.g. pumf_bsw_pweight). The
  auto-naming means generating weights for two different weight columns
  (e.g. a household weight and a person weight) produces two independent
  BSW tables that never collide.
- A view (e.g. eng_bsw_pweight) joins the survey table to the BSW table
  on a row identifier, so the returned tbl exposes every survey column
  plus the replicate columns.

bsw_info() summarises what is stored:
bsw_info(cis_bsw)
#> # A tibble: 1 × 6
#>   weight_col bsw_table        view_name       view_exists n_replicates size_mb
#>   <chr>      <chr>            <chr>           <lgl>              <int>   <dbl>
#> 1 FWEIGHT    pumf_bsw_fweight eng_bsw_fweight TRUE                 200    0.07

This separation keeps the main table untouched and lets you add, inspect, or remove replicate weights independently of the survey data.
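The table naming rule can be tried on its own. The helper below is hypothetical (it is not a canpumf function) and only wraps the paste0() expression quoted above; PWEIGHT is an assumed second weight column.

```r
# Hypothetical helper: wraps the naming expression quoted in the text
bsw_table_name <- function(weight_col) {
  paste0("pumf_bsw_", tolower(weight_col))
}

# Two weight columns map to two distinct table names
tables <- bsw_table_name(c("FWEIGHT", "PWEIGHT"))
tables
anyDuplicated(tables) == 0   # TRUE: the names do not collide
```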
To join replicate weights back to the survey rows, a stable row
identifier is needed. With id_col = NULL (the default):

- the registry bsw_join_key is used when one is defined (e.g. PEFAMID
  for the SFS), requiring no change to the table; otherwise
- a pumf_row_id column (DuckDB rowid) is added to the survey table once.

You can always pass id_col explicitly to use a natural key.
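Conceptually, the view is a join of the survey table and the BSW table on the row identifier. The sketch below mimics that with base R's merge() on two small in-memory data frames; the values are made up, and canpumf itself builds a DuckDB view rather than calling merge().

```r
# Toy survey table with a row identifier (values are illustrative)
survey <- data.frame(
  pumf_row_id = 1:4,
  PROV        = c("ON", "ON", "QC", "BC"),
  FWEIGHT     = c(100, 120, 90, 110)
)

# Toy BSW table: the same identifier plus two replicate weight columns
bsw <- data.frame(
  pumf_row_id = 1:4,
  CPBSW1      = c(200, 0, 90, 110),
  CPBSW2      = c(0, 240, 90, 110)
)

# Joining on the identifier exposes survey and replicate columns together
joined <- merge(survey, bsw, by = "pumf_row_id")
names(joined)
```

The survey data frame is left unchanged; only the joined result carries the replicate columns.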
Many survey designs are stratified, and the resampling should respect
that: each replicate should resample within each
stratum, preserving the stratum sample sizes. Pass
strata_cols:
cis_bsw <- cis |>
  add_bootstrap_weights(weight_col = "FWEIGHT", strata_cols = "PROV",
                        n_replicates = 200, seed = 42)

Resolution order for the strata:

1. strata_cols, if you pass it;
2. the registry bsw_strata for the survey;
3. c("SURVYEAR", "SURVMNTH") for the LFS, so each month is resampled
   separately.

Pass strata_cols = character(0) to force unstratified
weights even when a default exists (e.g. for the LFS).
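To see what resampling within strata means, the following base R sketch draws one replicate from a toy vector of stratum labels. It is an illustration under assumed data, not canpumf's implementation; resample_counts() is a made-up helper.

```r
set.seed(42)   # assumed seed, only for reproducibility
strata <- c("ON", "ON", "ON", "QC", "QC", "BC", "BC", "BC", "BC")  # toy labels

# Resample row indices with replacement inside each stratum, then count
# how often each row was drawn
resample_counts <- function(strata) {
  idx <- seq_along(strata)
  drawn <- unlist(lapply(split(idx, strata), function(rows) {
    rows[sample.int(length(rows), length(rows), replace = TRUE)]
  }), use.names = FALSE)
  tabulate(drawn, nbins = length(strata))
}

counts <- resample_counts(strata)
tapply(counts, strata, sum)   # matches table(strata): stratum sizes are preserved
```

Because every stratum contributes exactly as many draws as it has rows, no replicate can over- or under-represent a stratum.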
To get a bootstrap standard error for an estimate, compute the
estimate once with the original weights and once with each replicate
weight; the spread of the replicate estimates is the sampling
variability. The example below estimates the total population
represented by the survey (the sum of the weights), which depends only
on the weight columns, but the same pattern applies to any weighted
statistic — replace sum(.x) with your estimator (a weighted
mean, share, total, …) evaluated with each weight column.
library(dplyr)

est <- cis_bsw |>
  summarise(across(c(FWEIGHT, matches("^CPBSW[0-9]+$")), ~ sum(.x))) |>
  collect()
point_estimate <- est$FWEIGHT
replicate_estimates <- est |> select(matches("^CPBSW[0-9]+$")) |> unlist()
# Bootstrap variance: mean squared deviation of the replicate estimates from
# the full-sample estimate; the standard error is its square root.
std_error <- sqrt(mean((replicate_estimates - point_estimate)^2))
# A 95% percentile confidence interval uses the 2.5% and 97.5% quantiles
# of the replicate estimates.
confidence_interval <- quantile(replicate_estimates, c(0.025, 0.975))
c(estimate = point_estimate, se = std_error, conf = confidence_interval)
#>   estimate         se  conf.2.5% conf.97.5%
#>   36831173     170660   36493061   37126328

The same pattern works after group_by() /
summarise(): carry the CPBSW* columns through
the summary, then collapse them into a standard error per group.
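The per-group version can be sketched on a small in-memory data frame. For each group \(g\) with full-sample estimate \(\hat\theta_g\) and replicate estimates \(\hat\theta_{g,b}\), the standard error is \(\sqrt{\tfrac{1}{B}\sum_b (\hat\theta_{g,b} - \hat\theta_g)^2}\), the same formula as above applied row by row. The data are made up, and base R's aggregate() stands in for group_by() / summarise() to keep the example self-contained.

```r
# Toy data: a grouping column, the full-sample weight, three replicate weights
toy <- data.frame(
  PROV    = c("ON", "ON", "QC", "QC", "QC"),
  FWEIGHT = c(100, 120,  90, 110,  95),
  CPBSW1  = c(200,   0,  90, 220,   0),
  CPBSW2  = c(100, 120,   0, 110, 190),
  CPBSW3  = c(  0, 240, 180,   0,  95)
)
weight_cols <- c("FWEIGHT", grep("^CPBSW[0-9]+$", names(toy), value = TRUE))

# One row per group, one column per weight: the total under each weight
totals <- aggregate(toy[weight_cols], by = list(PROV = toy$PROV), FUN = sum)

# Collapse the replicate columns into one standard error per group
reps <- as.matrix(totals[grep("^CPBSW", names(totals))])
totals$se <- sqrt(rowMeans((reps - totals$FWEIGHT)^2))
totals[c("PROV", "FWEIGHT", "se")]
```

With these toy values the standard errors come out at about 16.3 for ON and 14.7 for QC; three replicates are far too few for real use and serve only to keep the arithmetic easy to follow.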
Calling add_bootstrap_weights() again on a survey that
already has weights does only the work needed to satisfy the request.
The decision is driven by two questions: are there enough replicate
columns, and do all rows have weights?
If the stored weights already cover every row and there are at least
n_replicates of them, they are reused without recomputation
(requesting fewer than are stored simply returns a subset). This is why
repeatedly opening a weighted survey is cheap.
If you ask for more replicates than are stored (and no rows were added), the additional replicate columns are generated and appended; the existing columns are kept unchanged. When stratified, the new columns are resampled within strata, consistent with the existing ones.
This is the subtle case. A bootstrap replicate resamples the
full population (or full stratum), so adding rows
invalidates the replicate weights of the affected resampling universe —
you cannot simply generate weights for the new rows in isolation.
add_bootstrap_weights() detects that not all rows have
weights, then deletes and regenerates the affected weights.
Extending the number of replicates and adding rows can happen in the same call; each is handled on the rows it applies to.
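The decision logic described in this section can be summarised as a small function. This is a hypothetical sketch of the rules as stated in the text, not canpumf's code: bsw_plan(), its arguments, and the step labels are invented for the illustration.

```r
# Hypothetical helper: which steps a repeated call would need, given how many
# replicates are stored, how many are requested, and whether every row
# already has weights
bsw_plan <- function(n_stored, n_requested, all_rows_covered) {
  steps <- c(
    if (!all_rows_covered) "regenerate affected weights",
    if (n_requested > n_stored) "append new replicate columns"
  )
  if (is.null(steps)) "reuse stored weights" else steps
}

bsw_plan(200, 200, TRUE)    # nothing to do: stored weights are reused
bsw_plan(200, 100, TRUE)    # fewer requested than stored: still a reuse
bsw_plan(200, 500, TRUE)    # more requested: extra columns are appended
bsw_plan(200, 500, FALSE)   # rows added and more requested: both steps
```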
Extending a weighted table after appending rows relies on the new rows having a valid identifier. Use a natural
id_col (or the registry bsw_join_key) when you expect to grow the table, so the new rows can be matched.
Because each BSW table is named after its weight column, hierarchical surveys with several weights are handled by calling the function once per weight.
By default all replicate columns share the CPBSW prefix,
so when storing weights for more than one weight column give each a
distinct prefix to tell the two sets of replicates apart
(the column names are survey-specific).
Bootstrap weights always cover the complete physical survey table, so
variance estimates use the full sample. If the input tbl
carries filter() operations, they are captured and
re-applied to the returned view, so the visible rows still match your
subset. Other operations (select(), mutate(),
…) are not replayed — apply them to the returned tbl instead.
Generating weights on a DuckDB-backed table requires exclusive write
access, so add_bootstrap_weights() shuts down the
connection held by the input tbl. Use the returned
tbl afterwards; the input tbl (and any other lazy
tables backed by the same DuckDB file) are invalid after the call.