This document describes how get_pumf() turns a
Statistics Canada PUMF zip file into a lazy DuckDB-backed
dplyr::tbl(), covering every choice and fallback along the
way. The LFS has its own accumulating pipeline, described separately at
the end.
pumf_locate_or_download() ensures the version directory
exists with extracted content before any parsing begins.
Cache layout:

    <cache_path>/
      <series>/
        <version>/
          <original>.zip             # retained after extraction
          <extracted dirs>/
          metadata/                  # written by Stage 2
          <series>_<version>.duckdb

Decision sequence:
refresh = TRUE — delete the
.duckdb file(s) and metadata/ subdirectory,
leaving the zip and extracted content untouched. Stages 2 and 3 then
re-run without re-downloading.
redownload = TRUE — wipe the
entire version directory first, then proceed as a first-time
run. Implies refresh.
Already extracted? —
version_is_extracted() returns TRUE if any
subdirectory (other than metadata/) or non-zip non-duckdb
file is present. If TRUE, the zip step is skipped even when
the zip is still on disk.
Download — the URL is looked up in
list_canpumf_collection(). Surveys distributed only via
Statistics Canada’s EFT portal have the marker "(EFT)"
instead of a URL; the function stops with instructions to deposit the
zip manually.
Extract — robust_unzip() handles
two edge cases:
An extracted directory whose name ends in .zip (e.g.
2025-CSV.zip/). The colliding directory is renamed to
strip .zip before being moved into the version
directory.
grep/sub calls on zip entry names use
useBytes = TRUE to avoid “invalid in this locale”
warnings.

pumf_resolve_version() canonicalises Census version
strings before any registry lookup. Any string starting with a
four-digit year is parsed flexibly: the file type is detected by
grepping for "hierarchical", "household", or
"famil" (defaulting to "individuals"), and CMA
vs provincial by grepping for "cma". The registry is then
probed to determine the correct canonical format for that year.
Examples:
| User input | Resolved |
|---|---|
| "2021" | "2021 (individuals)" |
| "1971" | "1971/individuals_prov" |
| "1971 CMA" | "1971/individuals_cma" |
| "1971 households CMA" | "1971/households_cma" |
| "1986 families" | "1986/families" |
| "2001 households" | "2001 (households)" |
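The flexible parsing step can be sketched in a few lines of base R. This is only an illustration of the grep-based detection described above: `sketch_parse_census_version()` is a hypothetical helper, the order in which the keywords are tested is an assumption, and the registry probe that picks the final canonical string is not modelled.

```r
# Hypothetical helper (not part of the package): split a year-led version
# string into year, file type and CMA flag.
sketch_parse_census_version <- function(version) {
  # Only strings that start with a four-digit year are parsed flexibly.
  if (!grepl("^\\d{4}", version)) return(NULL)
  lower <- tolower(version)
  # Assumed test order: hierarchical, then household, then famil.
  file_type <- if (grepl("hierarchical", lower)) {
    "hierarchical"
  } else if (grepl("household", lower)) {
    "households"
  } else if (grepl("famil", lower)) {
    "families"
  } else {
    "individuals"  # default when no keyword matches
  }
  list(
    year = substr(version, 1, 4),
    file_type = file_type,
    cma = grepl("cma", lower)
  )
}

str(sketch_parse_census_version("1971 households CMA"))
```

The components returned here would still have to be mapped to a year-specific canonical form such as "1971/households_cma" or "2001 (households)".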
pumf_parse_metadata() converts raw SPSS/SAS command
files into three canonical CSVs. The function is idempotent: it does
nothing if metadata/variables.csv already exists and
refresh = FALSE.
detect_formats() scans the entire version directory
recursively and identifies which parser(s) apply. Multiple
parsers can fire for the same survey (e.g. SPSS split for
layout/codes and SAS cards for BSW weights).
| Priority | Format | Detection rule |
|---|---|---|
| 1 | LFS codebook CSV | filename matches codebook\.csv (case-insensitive) |
| 2 | CPSS variables CSV | filename is exactly variables.csv |
| 3 | SAS reading cards | directory contains both a .lay and a .lbe file |
| 4 | SPSS split-file | any .sps file whose name ends in vare, vale, or _i |
| 5 | SPSS monolithic | .sps file, *SPSS.txt file, or .xmf file whose content contains VALUE LABELS or DATA LIST (checked with useBytes = TRUE to tolerate CP850/Latin-1 data); VARIABLE LABELS is optional |
| 6 | SPSS .sav | a .sav binary file readable by haven |
| 7 | PDF Data Dictionary | *Dictionary.pdf present and pdftools installed; supplements label-only surveys where the SPSS file has DATA LIST but no VARIABLE LABELS or VALUE LABELS |
| 8 | PDF frequency codebook | a bilingual StatCan frequency codebook PDF (per-variable Variable Name: / Answer Categories blocks) under a Codebook/LivreDesCodes path, content-verified; pdftools installed. A last-resort fallback consulted only when no command file or codebook CSV was found; it recovers labels for surveys whose only machine-readable companion is the data file (e.g. CPSS cycle 1) |
Detection for case 5 also searches for a parallel French file — any
candidate in the same set whose path includes /fran or
/french (case-insensitive).
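As a rough illustration of the case 5 content check and the French-file search, the sketch below uses base R only. Both function names are hypothetical, and the real detection code may read files or build candidate lists differently.

```r
# Hypothetical helper: TRUE when a text file contains VALUE LABELS or
# DATA LIST, the signature used for monolithic SPSS command files.
sketch_is_spss_mono <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # useBytes = TRUE matches on raw bytes, so CP850/Latin-1 content is
  # tolerated without re-encoding.
  any(grepl("VALUE LABELS|DATA LIST", lines, useBytes = TRUE))
}

# Hypothetical helper: keep candidates whose path includes /fran or /french.
sketch_french_candidates <- function(paths) {
  paths[grepl("/(fran|french)", paths, ignore.case = TRUE)]
}

sketch_french_candidates(c("v1/English/cmd.sps", "v1/Francais/cmd.sps"))
```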
parse_spss_mono() handles the single-file SPSS format used by Census (2001–2021), SFS
1999, SHS, and others. The file typically contains
DATA LIST, VARIABLE LABELS,
VALUE LABELS, and sometimes MISSING VALUES and
FORMATS sections. VARIABLE LABELS is optional
(e.g. Census 2011 individuals omits it). Older releases like SFS 1999
have only DATA LIST with no label sections at all — these
produce a fully importable table with raw codes but no human-readable
factor levels.
Key parsing details:
Column ranges — DATA LIST ranges
may have spaces on either side of the dash (129-135,
129 - 135, or 129- 135). All three are
normalised by the regex (\\d+)\\s*-\\s*(\\d+) before
tokenisation.
Record-group marker — a leading /
on the first variable line (e.g. /PROVP 1-2) is stripped,
not discarded, so the variable is retained.
Section terminator — the DATA LIST
section ends at the first blank line, . line, or occurrence
of VARIABLE LABELS, VALUE LABELS,
MISSING VALUES, FORMATS, or
EXECUTE at the start of a line. The keyword check is the
reliable terminator for older files (e.g. 1991 XMF) that have no blank
line between DATA LIST and
VARIABLE LABELS.
DATA LIST type annotations — the
(A) suffix after a column range marks a character-type
variable. The parser records an is_char flag per column and
uses it to populate variables.csv types when no
VARIABLE LABELS section is present.
Sentinel detection — variables whose only VALUE
LABELS are sentinel phrases (“Not applicable”, “Valid skip”, “Don’t
know”, “Data not available”, etc.) are classified as
numeric with a missing_low/missing_high range,
not as character. This prevents spurious NA warnings when
numeric values fall outside the label set.
Zero-padded codes — unquoted SPSS numeric codes
like 01, 02 are normalised via
as.numeric() → as.character() so they match
bare integer values in CSV data.
Multi-variable VALUE LABELS blocks —
/VAR1 VAR2 VAR3 headers (possibly spanning continuation
lines) are fully parsed so all listed variables receive the code/label
pairs.
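Three of the normalisation steps above are small enough to show directly. The helper names below are hypothetical; only the range regex is taken from the text, and the other two patterns are one plausible way to get the described behaviour.

```r
# Collapse optional whitespace around the dash in DATA LIST column ranges,
# using the regex quoted above.
normalise_ranges <- function(line) {
  gsub("(\\d+)\\s*-\\s*(\\d+)", "\\1-\\2", line)
}

# Strip a leading record-group slash but keep the variable definition.
# (Assumed pattern; the text only states that the slash is stripped.)
strip_record_marker <- function(line) {
  sub("^\\s*/\\s*", "", line)
}

# Normalise unquoted, zero-padded numeric codes such as "01" to "1".
normalise_code <- function(code) {
  as.character(as.numeric(code))
}

normalise_ranges("AGE 129 - 135")
strip_record_marker("/PROVP 1-2")
normalise_code(c("01", "02", "10"))
```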
parse_spss_split() is used by SFS, CPSS, and similar surveys that ship separate files for
variable labels (*vare.sps), value labels
(*vale.sps), missing values (*miss.sps), and
layout (*_i.sps). The layout_mask from the
registry disambiguates when a single directory holds multiple sets
(e.g. individual vs. household files).
In parse_sas_cards(), .lay files supply the fixed-width column positions;
.lbe files supply the value labels in
PROC FORMAT syntax. Variable labels come from a companion
.sas file if present. This parser reuses
parse_spss_split’s layout parser since the
.lay format is identical.
parse_lfs_codebook() reads the single *codebook.csv that the LFS ships, with one row per
code value. Columns are always read as CP1252 regardless of the
metadata_encoding registry field.
parse_cpss_csv() handles the Canadian Perspectives Survey Series, which ships a
variables.csv with variable metadata only; no layout or
codes. The encoding defaults to Latin-1 (CP1252 if the registry
overrides).
parse_spss_sav() uses haven for binary .sav files when no text-format
command file is available. This is a fallback for surveys that do not
ship SPSS syntax.
parse_pdf_dictionary() handles StatCan PDF Data Dictionaries, which follow a standard bilingual format.
Variable blocks start with
<name> Position: N Character/Numeric(w). The parser
extracts variable long-names (Long name: /
Long nom:) and code-value labels (Codes: /
Domaine:). Reserved codes (Reserved Codes: /
Codes Réservés:) set missing_low/missing_high
ranges.
This parser produces only variables and
codes (no layout), and fires only when
pdftools is installed and a matching
*Dictionary.pdf is found. It is used as a label-only
supplement for surveys like SFS 1999 where the SPSS file is
DATA LIST-only.
parse_pdf_codebook() handles a second, distinct StatCan PDF layout, used when a survey ships
no machine-readable command file or codebook CSV — only a
bilingual frequency codebook PDF. Variable blocks start with
Variable Name: / Nom de la variable : and
carry the label on the Concept: line; an
Answer Categories / Catégories de réponse
frequency table supplies the value labels (parsed from a right-anchored
code-row regex that tolerates comma- and space-grouped counts and
rejoins wrapped answer text). Produces only variables and
codes. Like the dictionary parser it requires
pdftools, but detection is a fallback of last
resort — only consulted when no command file or codebook CSV
was found, and only for PDFs under a
Codebook/LivreDesCodes path that
content-verify for the Variable Name: +
Answer Categories signature. This is what gives CPSS cycle
1 (the only cycle without a variables.csv) full bilingual
labels.
The registry metadata_encoding field sets the encoding
for all text-format parsers. Default is "CP1252" (which agrees with
Latin-1 on all printable characters and additionally decodes the Windows-era
en-dashes and curly quotes stored in the 0x80–0x9F range). Exceptions:
| Surveys | Encoding | Reason |
|---|---|---|
| Census 2021, 2021 hierarchical | "UTF-8" | Command files shipped as UTF-8 |
| Census 1991 (individuals) | "CP850" | DOS-era IBM Code Page 850 |
merge_metadata() takes the list of parser outputs and
produces a single list(variables, codes, layout). Conflicts
are resolved: later parsers win on duplicate variable names. If a layout
is present in only some parsers, the function checks that every variable
with a layout entry also appears in variables, stopping
with a diagnostic otherwise.
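The two rules in this paragraph (later parsers win, and every layout entry must have a variable row) can be sketched as follows. `sketch_merge_metadata()` is a hypothetical stand-in, it ignores the codes table, and it assumes every parser returns `variables` data frames with identical columns.

```r
# Hypothetical sketch of the merge rules; not the package implementation.
sketch_merge_metadata <- function(parsed) {
  # Stack parser outputs in order, then keep the LAST row per variable
  # name so that later parsers win on duplicates.
  variables <- do.call(rbind, lapply(parsed, function(p) p$variables))
  variables <- variables[!duplicated(variables$name, fromLast = TRUE), ]
  rownames(variables) <- NULL

  # Only some parsers produce a layout; drop the NULLs before stacking.
  layouts <- Filter(Negate(is.null), lapply(parsed, function(p) p$layout))
  layout <- if (length(layouts) > 0) do.call(rbind, layouts) else NULL

  # Stop with a diagnostic when a layout entry has no variable row.
  if (!is.null(layout)) {
    orphans <- setdiff(layout$name, variables$name)
    if (length(orphans) > 0) {
      stop("Layout entries without a variable row: ",
           paste(orphans, collapse = ", "))
    }
  }
  list(variables = variables, layout = layout)
}
```

Keeping the last duplicate rather than the first is what lets a label-only supplement such as a PDF parser override an earlier, emptier row.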
The final result is written to:
metadata/variables.csv — one row per variable (name,
label_en, label_fr, type, decimals, missing_low, missing_high)
metadata/codes.csv — one row per code value (name, val,
label_en, label_fr)
metadata/layout.csv — one row per fixed-width column
(name, start, end); absent for CSV-format surveys

pumf_build_duckdb() reads the canonical CSVs from
metadata/, reads the raw data file, applies
transformations, and writes a .duckdb file. The function
skips the build if the target table already exists and
refresh = FALSE.
find_pumf_data_file() searches the version directory
recursively.
Extension pre-filter — derived from the registry
file_mask:
| file_mask ends in | Pre-filter |
|---|---|
| .csv | only files matching \.csv$ |
| .txt or .dat | only files matching \.(txt\|dat)$ |
| other / unusual (e.g. .INDIV) | all files (relies on file_mask alone) |
| absent + layout exists | \.(txt\|dat)$ (FWF inferred from layout) |
| absent + no layout | \.csv$ |
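The table can be read as a small decision function. The sketch below is an assumption about how the rule might be coded: `sketch_extension_filter()` is hypothetical, and detecting the extension by inspecting the end of the `file_mask` regex is only one plausible approach.

```r
# Hypothetical sketch: derive the extension pre-filter from file_mask.
# Returns a regex, or NULL when no pre-filter should be applied.
sketch_extension_filter <- function(file_mask = NULL, has_layout = FALSE) {
  if (is.null(file_mask)) {
    # No mask: infer fixed-width from the presence of a layout.
    return(if (has_layout) "\\.(txt|dat)$" else "\\.csv$")
  }
  # Ignore a trailing "$" anchor when looking at how the mask ends.
  mask_end <- tolower(sub("\\$$", "", file_mask))
  if (grepl("csv$", mask_end)) {
    "\\.csv$"
  } else if (grepl("(txt|dat)$", mask_end)) {
    "\\.(txt|dat)$"
  } else {
    NULL  # unusual extension: file_mask alone decides
  }
}

files <- c("data/pumf_2019.csv", "docs/readme.txt", "data/pumf_2019.dat")
pre <- sketch_extension_filter(has_layout = TRUE)
files[grepl(pre, files, ignore.case = TRUE)]
```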
Several subdirectories are always excluded from the search:
metadata/, SPSS/, Command/,
Syntax/, Layout/, SpssCard/,
Reading_cards/, Documents/. Bootstrap weight
(_BSW.) files are also excluded; they are handled
separately.
When multiple candidates survive, the file_mask regex
narrows the list. If more than one still remains, the function stops
with a message that lists the ambiguous files and asks the user to set
file_mask in the registry.
After the data file is identified:
The file is read as fixed-width when metadata/layout.csv exists
and the data file does not end in .csv; otherwise it is read as CSV. The extension
check handles the edge case (e.g. CHS) where the SPSS DATA LIST produces a
layout but the actual data ships as CSV.

Both paths read all columns as character
(col_types = cols(.default = "c")) to preserve leading
zeros and avoid premature type coercion. Numeric conversion happens
explicitly in the next step.
After reading a fixed-width file, any row where fewer than two
columns are non-NA is dropped. FWF files from older StatCan archives
often end with \r\n\x1a (a DOS EOF marker), which the FWF
reader interprets as a one-character row with a single non-NA field;
this step removes it silently. CSV files are not affected — CSV parsers
handle trailing newlines correctly.
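The row filter itself is a one-liner. The following is a minimal base R sketch with a hypothetical function name, assuming the data has already been read into an all-character data frame.

```r
# Hypothetical sketch: drop rows where fewer than two columns are non-NA,
# which removes the stray row produced by a trailing DOS EOF marker.
drop_sparse_rows <- function(df) {
  non_na <- rowSums(!is.na(df))
  df[non_na >= 2, , drop = FALSE]
}

fwf <- data.frame(
  PROV = c("10", "24", "\x1a"),  # last row is the EOF byte read as a field
  AGE  = c("34", "51", NA),
  SEX  = c("1", "2", NA),
  stringsAsFactors = FALSE
)
nrow(drop_sparse_rows(fwf))
```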
Registry data_fixups entries are applied to the raw
character data before label mapping:
str_pad — left- or right-pad specified
columns to a target width. Used to zero-pad codes that arrive without
leading zeros in some CSV formats (e.g. SFS).
rename — rename a column; applied only
when the old name is present (safe for surveys that ship in multiple
release variants, e.g. Census 2021 RELIG/RELIGION_DER).
cols_swap — named character vector
c(A = "B", C = "D") swapping pairs of column names. Used
for surveys where the DATA LIST variable names are transposed relative
to the PDF documentation (e.g. WKACTMA/WKACTFA and FAOCC81/MAOCC81 in
Census 1981 individuals).
force_numeric — character vector of
column names to treat as numeric regardless of how many VALUE LABELS are
declared. Used when a variable carries boundary or top-code labels
(e.g. "85 years and over") alongside otherwise-continuous
values, or is an integer index the SPSS file mis-classifies as
categorical (e.g. SUBSAMPL in Census 1971). The codes are dropped, but
any true-missing sentinel codes (Not stated, Don’t
know, Valid skip, … — not zero-value labels like “None”) are
first converted into a per-variable
missing_low/missing_high range so those sentinels still
become NA. An existing missing range (from
MISSING VALUES or a split-SPSS miss file) takes
precedence.
force_character / force_integer /
force_bigint — character vectors of variable names
whose DuckDB storage type is overridden. Unlike the
conversions above, the raw string values are kept verbatim (no numeric
conversion, no code labeling), so geographic codes retain leading zeros
and out-of-int-range IDs survive.
force_character keeps the column VARCHAR;
force_integer / force_bigint cast it to
INTEGER / BIGINT via ALTER COLUMN after the table
is written (an INTEGER cast that overflows 2^31 errors — use
force_bigint). A variable may appear in at most one
force_* set (including force_numeric); this is
validated at build time. LFS sources its SURVYEAR /
SURVMNTH / REC_NUM integer-forcing through
this mechanism from the shared LFS registry entry.
codes_supplement — named list of
data.frames injecting code-label rows absent from the SPSS
command files (values present in the data but not declared in the
command files, e.g. the CHS PPROV territories code). Each
data frame has columns val, label_en,
label_fr. Setting label_en = NA marks a value
as intentionally missing (produces a silent NA factor entry
without a warning, and without introducing a spurious factor level). All
entries are verified in the override ledger
(tests/testthat/override_verification.csv).
na_values — character vector of raw
string sentinels that become NA. In numeric columns they
are exact-matched and NA’d during numeric conversion; in labeled
(factor) columns they are silently blanked. Used for undeclared Census
income sentinels and SAS-style "." missing markers.
labels_supplement — named list
c(VAR = c(label_en =, label_fr =)) supplying
variable labels the source metadata leaves blank (e.g. CPSS 1
ships only a PDF codebook whose weight variable COVID_WT
has an empty Concept: line in both languages). Applied in
both Stage 3 and label_pumf_columns() /
pumf_var_labels(), and fills only NA labels,
so genuine source labels always win.

When the registry has bsw_mask +
bsw_join_key + bsw_file_mask, the BSW file is
found, read (CSV or FWF), and left-joined onto the main data by the join
key before numeric conversion.
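Two of the simpler data_fixups described above, str_pad and rename, can be sketched in base R. The function names and argument shapes are hypothetical; the registry's actual fixup format is not shown in this document, and only left-padding is illustrated.

```r
# Hypothetical sketch of the str_pad fixup: left-pad raw character codes
# to a target width. Columns absent from the data are skipped.
apply_str_pad <- function(df, cols, width, pad = "0") {
  for (col in intersect(cols, names(df))) {
    x <- df[[col]]
    short <- !is.na(x) & nchar(x) < width
    x[short] <- paste0(strrep(pad, width - nchar(x[short])), x[short])
    df[[col]] <- x
  }
  df
}

# Hypothetical sketch of the rename fixup: applied only when the old name
# is present, so it is safe across release variants.
apply_rename <- function(df, old, new) {
  if (old %in% names(df)) names(df)[names(df) == old] <- new
  df
}

raw <- data.frame(PROV = c("1", "10", NA), RELIG = c("a", "b", "c"),
                  stringsAsFactors = FALSE)
raw <- apply_str_pad(raw, "PROV", width = 2)
# Rename direction chosen for illustration only.
raw <- apply_rename(raw, "RELIG", "RELIGION_DER")
raw
```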
apply_numeric_conversion() converts character columns
typed "numeric" in variables.csv:
as.numeric() is applied to the raw character values.
Values within [missing_low, missing_high] become NA. This
handles SPSS-declared MISSING VALUES blocks.
na_values fixup — additional raw
string sentinels from the registry (data_fixups$na_values)
are set to NA via trimws(raw) %in% na_values.
Used for undeclared income sentinels in older Census files.

The two mechanisms complement each other: the SPSS
MISSING VALUES block catches sentinels declared in the
command file; na_values catches those that StatCan omitted
from the command file but documents in the user guide.
Census income sentinel widths (confirmed from SPSS DATA LIST sections):
| Census years | Income field width | Sentinels (na_values) |
|---|---|---|
| 2016, 2021 | 8 chars | "99999999", "88888888" |
| 1991–2011 | 7 chars | "9999999", "8888888" |
| 1986 and earlier | unverified | none applied |
The two sets are kept separate: applying the 7-digit sentinel to an
8-char field would incorrectly NA out a valid $9,999,999 income value
(stored as " 9999999" which trims to
"9999999").
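The interaction between the missing range, na_values, and field width is easier to see in code. This is a minimal sketch with a hypothetical function name; it follows the three steps described above but is not the package's apply_numeric_conversion().

```r
# Hypothetical sketch of numeric conversion with both missing mechanisms.
sketch_numeric_conversion <- function(raw, missing_low = NA, missing_high = NA,
                                      na_values = character(0)) {
  trimmed <- trimws(raw)
  # Exact string match on the trimmed raw value.
  is_sentinel <- trimmed %in% na_values
  out <- suppressWarnings(as.numeric(trimmed))
  # Declared missing range, if any.
  if (!is.na(missing_low) && !is.na(missing_high)) {
    in_range <- !is.na(out) & out >= missing_low & out <= missing_high
    out[in_range] <- NA
  }
  out[is_sentinel] <- NA
  out
}

# 7-char income field: the 7-digit sentinels apply.
sketch_numeric_conversion(c("  45000", "9999999", "8888888"),
                          na_values = c("9999999", "8888888"))

# 8-char income field: only the 8-digit sentinels apply, so a genuine
# 9,999,999 value survives.
sketch_numeric_conversion(c(" 9999999", "99999999"),
                          na_values = c("99999999", "88888888"))
```

Because matching happens on the trimmed string, a 7-digit sentinel listed for an 8-char field would match " 9999999" after trimming, which is why the two sets must stay separate.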
apply_code_labels() maps raw character values to R
factors using codes.csv. The factor levels are the complete
ordered set from the codes table, not just the values present in the
data.
Unmatched raw values become NA with a warning showing
the first five offending values. An exception is made for values that
appear in codes.csv with label_en = NA
(injected via codes_supplement): these are treated as
intentionally NA and silently produce NA factor
entries without a warning.
When lang = "fra", any missing French label falls back
per-row to the English label.
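A compact way to see these labelling rules together is the sketch below. `sketch_code_labels()` is hypothetical and works on a single variable's rows of the codes table; the warning text is illustrative only.

```r
# Hypothetical sketch: map raw character codes to a factor using one
# variable's rows from codes.csv (columns val, label_en, label_fr).
sketch_code_labels <- function(raw, codes, lang = "eng") {
  labels <- codes$label_en
  if (lang == "fra") {
    # Per-row fallback to English where the French label is missing.
    labels <- ifelse(is.na(codes$label_fr), codes$label_en, codes$label_fr)
  }
  # Values absent from the codes table trigger a warning; values present
  # with label_en = NA are intentionally missing and stay silent.
  unmatched <- setdiff(unique(raw[!is.na(raw)]), codes$val)
  if (length(unmatched) > 0) {
    warning("Unmatched values set to NA: ",
            paste(utils::head(unmatched, 5), collapse = ", "))
  }
  labelled <- !is.na(labels)
  # Levels are the full ordered code set, not just values seen in the data.
  factor(raw, levels = codes$val[labelled], labels = labels[labelled])
}

codes <- data.frame(
  val      = c("1", "2", "9"),
  label_en = c("Yes", "No", NA),   # "9" is intentionally missing
  label_fr = c("Oui", NA, NA),
  stringsAsFactors = FALSE
)
sketch_code_labels(c("1", "2", "9", "1"), codes, lang = "fra")
```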
The labelled data frame is written to DuckDB with
dbWriteTable(). Factor columns are stored as DuckDB
ENUM types. DuckDB >= 1.5.2 does this automatically; for
older versions ensure_enum_columns() runs
ALTER TABLE ... ALTER COLUMN ... TYPE ENUM(...) for each
factor column.
A separate DuckDB table is created per language (table names
eng and fra, or
eng_<layout_mask> for surveys with multiple file
types). The write connection is shut down before
pumf_open_duckdb() re-opens the file in read-only mode,
preventing in-process lock conflicts when building both language tables
in the same session.
Some surveys ship several linked files that share a respondent key
and are meant to be joined for analysis (GSS cycle 16 / Aging and Social
Support 2002, the GSS Time Use cycles, the Survey of Household Spending
2017, the Giving/Volunteering/Participating cycles).
canpumf models these as several tables inside one
DuckDB file — not separate databases, which could not be joined
on a single connection.
A registry entry declares
modules = list(MAIN = ..., CG4 = ...); each module carries
its own layout_mask, file_mask,
data_fixups, and bootstrap-weight config. One module is the
primary (the respondent-level file that carries the
survey weight); its config is automatically promoted to the entry’s top level so
all the single-table code paths above keep working unchanged. The entry
also records module_key — the shared key the modules join
on (it varies: RECID, PUMFID,
MICRO_ID, CASEID, IDNUM).
pumf_run_pipeline() loops over the modules, running Stage 2
and Stage 3 once per module so every table lands in the
one DuckDB file. Each module parses its metadata into
metadata/<module>/ (the primary uses
metadata/) and joins its own bootstrap
weights, so e.g. the SHS Interview replicate weights are not mis-joined
onto the Diary table. The primary module’s tbl is returned.
User-facing, get_pumf() returns the primary module and
emits a one-time message listing the sibling modules;
pumf_module(tbl, "<module>") opens a sibling on the
same connection so the two are joinable. The dedicated
“Working with multi-module PUMF surveys” vignette covers
the user-facing workflow in full.
The LFS is handled by lfs_get_pumf() (delegated directly
from get_pumf() without going through
get_pumf_connection()). Instead of one DuckDB per version,
all LFS versions share a single
<cache_path>/LFS/LFS.duckdb with accumulating tables
lfs_eng and lfs_fra.
Key differences from the standard pipeline:
Schema evolution — when a new LFS version adds a
variable absent from earlier versions, the column is added via
ALTER TABLE ADD COLUMN. When a variable changes type
(e.g. VARCHAR → ENUM), ALTER COLUMN SET DATA TYPE is
used.
Annual supersedes monthly — if annual and monthly versions for the same year are both loaded, the annual version supersedes the monthly rows for that year.
Version tracking — an lfs_versions
table in the shared DuckDB records which versions have been downloaded
and parsed, so refresh = "auto" downloads only new
versions.
Read-only fast path — when the requested version
is already in the database, lfs_get_pumf() opens only a
read-only connection and returns immediately. No write lock is acquired
unless new data actually needs to be written.
get_pumf() return — when a specific
version is requested, the function applies a
dplyr::filter() on SURVYEAR (and
SURVMNTH for monthly requests) over the full shared table.
Calling get_pumf("LFS") without a version returns the
unfiltered table.
label_pumf_columns() for LFS —
because the shared schema is the union of all loaded versions, variables
introduced in later years (e.g. GENDER added ~2020) are
absent from older versions’ variables.csv.
label_pumf_columns() therefore reads and merges metadata
from every loaded version directory in chronological order,
with the most-recent label winning on conflicts.
get_pumf() registers
(series, version, cache_path, lang) in a package-level
environment keyed by the DuckDB connection’s C++ external-pointer
address:

    .pumf_con_registry <- new.env(hash = TRUE, parent = emptyenv())
    key <- format(con@conn_ref)  # stable across R-level S4 copies

This key survives dplyr tbl transformations and
select()/filter() calls because those
operations do not create new connections.
label_pumf_columns() uses .pumf_lookup_con()
to retrieve the provenance; close_pumf() removes the entry
and disconnects.
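The register / look up / remove cycle can be mimicked without a database connection. In this sketch the registry environment and all three functions are hypothetical, and a plain string stands in for the `format(con@conn_ref)` key.

```r
# Hypothetical stand-in for the package-level provenance registry.
.sketch_registry <- new.env(hash = TRUE, parent = emptyenv())

sketch_register <- function(key, series, version, cache_path, lang) {
  assign(key,
         list(series = series, version = version,
              cache_path = cache_path, lang = lang),
         envir = .sketch_registry)
  invisible(key)
}

sketch_lookup <- function(key) {
  if (exists(key, envir = .sketch_registry, inherits = FALSE)) {
    get(key, envir = .sketch_registry, inherits = FALSE)
  } else {
    NULL  # unknown connection: no provenance recorded
  }
}

sketch_unregister <- function(key) {
  if (exists(key, envir = .sketch_registry, inherits = FALSE)) {
    rm(list = key, envir = .sketch_registry)
  }
  invisible(NULL)
}

key <- "<pointer: 0x1234>"  # placeholder for format(con@conn_ref)
sketch_register(key, "SFS", "2019", tempdir(), "eng")
sketch_lookup(key)$series
sketch_unregister(key)
```

Keying on a string derived from the connection rather than on the tbl object is what allows any lazy table built from that connection to find the same provenance record.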
This internal provenance registry is distinct from the
RStudio Connections pane. Whether the DuckDB connection
is advertised to that pane is controlled separately by the
register_connection argument to get_pumf()
(default getOption("canpumf.register_connection", TRUE));
set it to FALSE to keep the pane from being spammed when
opening and closing many connections programmatically.
pumf_registry_lookup(series, version) returns a named
list that controls every per-survey choice in the pipeline. Surveys
without an entry use auto-detection with defaults (see
Newest-sibling inheritance below for the one exception).
| Field | Purpose | Default |
|---|---|---|
| file_mask | regex to select the data file | NULL (auto) |
| layout_mask | SPSS file disambiguator for split-file surveys | NULL |
| data_encoding | encoding of the raw data file | "CP1252" |
| metadata_encoding | encoding of SPSS/SAS command files | "CP1252" |
| bsw_mask | layout_mask for BSW-specific SPSS files | NULL |
| bsw_file_mask | filename pattern for the BSW data file | NULL |
| bsw_join_key | column(s) to join BSW onto the main data | NULL |
| bsw_drop_cols | BSW columns to drop before joining | character(0) |
| data_fixups | list of str_pad, rename, cols_swap, force_numeric, force_character, force_integer, force_bigint, codes_supplement, na_values, labels_supplement transforms | list() |
| missing_supplement | named list of c(lo, hi) pairs: explicit missing-range overrides for sentinels no generic pattern can classify (e.g. non-integer sentinels like 999.5) | NULL |
| doc_mask | regex applied to PDF filenames to filter a shared documentation directory to the relevant file type (e.g. "Family\|Familles" for 1986 Census families) | NULL |
| modules / module_key | for multi-module surveys: per-module config (layout_mask, file_mask, data_fixups, BSW) and the shared respondent key the modules join on (see Multi-module surveys above) | NULL |
Surveys without a registry entry normally fall back to pure
auto-detection, with one exception. When the requested version is a bare
four-digit year and the same series already has at least one other
year-keyed entry, pumf_registry_lookup() inherits the
configuration of the newest registered sibling whose year is <= the
requested year (or the oldest sibling if the requested year predates
them all). This lets a freshly released year deposited in the cache
reuse the prior year’s config — which works cleanly now that recent
file_masks use a generic \d{4} year
placeholder rather than a hard-coded year.
A message() fires once per session so the implicit reuse
is discoverable; a genuinely changed release (new file layout, codes, or
BSW join) still needs its own explicit entry. Inheritance is
skipped for multi-part versions (e.g. Census
2021 (individuals)) and for LFS, which has its own shared
registry entry.
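The sibling-selection rule can be written as a short pure function. `sketch_pick_sibling()` is a hypothetical illustration of the rule as stated above; it takes the registered version keys for one series and does not model the LFS exclusion or the once-per-session message.

```r
# Hypothetical sketch: choose which registered year a bare-year request
# would inherit from. Returns NULL when inheritance does not apply.
sketch_pick_sibling <- function(requested, registered) {
  # Multi-part versions such as "2021 (individuals)" are skipped.
  if (!grepl("^\\d{4}$", requested)) return(NULL)
  # An explicit entry for the requested year needs no inheritance.
  if (requested %in% registered) return(NULL)
  years <- as.integer(registered[grepl("^\\d{4}$", registered)])
  if (length(years) == 0) return(NULL)
  target <- as.integer(requested)
  not_after <- years[years <= target]
  # Newest sibling at or before the request, else the oldest sibling.
  chosen <- if (length(not_after) > 0) max(not_after) else min(years)
  as.character(chosen)
}

sketch_pick_sibling("2024", c("2019", "2021", "2023"))
sketch_pick_sibling("2015", c("2019", "2021", "2023"))
```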