The heaping package provides tools for detecting and correcting heaping (digit preference) in survey data at the individual record level. Heaping occurs when respondents disproportionately report values at round numbers—for example, ages ending in 0 or 5, incomes rounded to the nearest thousand, or weights rounded to the nearest 5 pounds.
The package provides:
Heaping is a common data quality issue where certain values appear more frequently than expected due to rounding or digit preference. This is especially common with self-reported data.
Let’s create some example data to illustrate, using the same distribution parameters as in Templ (2024):
# Generate realistic age data using a log-normal distribution
# These parameters produce a realistic age distribution
set.seed(123)
age_true <- rlnorm(10000, meanlog = 2.466869, sdlog = 1.652772)
age_true <- round(age_true[age_true < 93])
# Create heaped version where 27% of respondents round their age to multiples of 5
# This heap_ratio is typical for survey data with moderate heaping
heap_ratio <- 0.27
n <- length(age_true)
indices_to_heap <- sample(n, round(heap_ratio * n))
age_heaped <- age_true
age_heaped[indices_to_heap] <- round(age_heaped[indices_to_heap] / 5) * 5Let’s visualize the difference:
oldpar <- par(mfrow = c(1, 2))
# Define breaks to cover the full range
age_breaks <- seq(0, max(age_true, age_heaped) + 1, by = 1)
# True ages
hist(age_true, breaks = age_breaks, col = "steelblue",
main = "True Ages (No Heaping)", xlab = "Age",
border = "white")
# Heaped ages
hist(age_heaped, breaks = age_breaks, col = "coral",
main = "Heaped Ages", xlab = "Age",
border = "white")Notice the spikes at ages ending in 0 and 5 in the heaped data.
Before correcting heaping, we need to detect and quantify it. The package provides several established indices.
The most commonly used index. It compares the frequency of ages ending in 0 or 5 to the expected frequency.
# Standard Whipple index (100 = no heaping, 500 = maximum heaping)
whipple(age_true, method = "standard")
#> [1] 97.83184
whipple(age_heaped, method = "standard")
#> [1] 205.1824
# Modified Whipple index (0 = no heaping, 1 = maximum heaping)
whipple(age_true, method = "modified")
#> [1] 0.1114825
whipple(age_heaped, method = "modified")
#> [1] 0.2397257Myers’ blended index measures preferences for each terminal digit (0-9):
Similar to Myers’ but uses a different blending technique:
Evaluates heaping for a specific terminal digit:
# Calculate all indices
indices_true <- heaping_indices(age_true)
indices_heaped <- heaping_indices(age_heaped)
# Compare
data.frame(
Index = names(indices_true),
True = round(unlist(indices_true), 3),
Heaped = round(unlist(indices_heaped), 3)
)
#> Index True Heaped
#> whipple_standard whipple_standard 97.832 205.182
#> whipple_modified whipple_modified 0.111 0.240
#> myers myers 6.907 19.428
#> bachi bachi 2.952 20.942
#> spoorenberg spoorenberg 0.324 4.210
#> noumbissi_0 noumbissi_0 1.036 1.998
#> noumbissi_5 noumbissi_5 0.984 2.089The key innovation of this package is correcting heaping at the individual level. Traditional methods only smooth aggregated frequency distributions, which destroys individual records needed for regression analysis.
The correctHeaps() function corrects heaping by
replacing a proportion of heaped values with draws from truncated
distributions:
# Correct heaping using log-normal distribution
age_corrected <- correctHeaps(age_heaped,
heaps = "5year", # heaps at 0, 5, 10, 15, ...
method = "lnorm",
seed = 42) # for reproducibility
# Compare Whipple indices
c(Original = whipple(age_true),
Heaped = whipple(age_heaped),
Corrected = whipple(age_corrected))
#> Original Heaped Corrected
#> 97.83184 205.18244 105.23533Let’s visualize the correction:
oldpar <- par(mfrow = c(1, 2))
hist(age_heaped, breaks = age_breaks, col = "coral",
main = "Before Correction", xlab = "Age", border = "white")
hist(age_corrected, breaks = age_breaks, col = "forestgreen",
main = "After Correction", xlab = "Age", border = "white")Four distribution methods are available:
# Log-normal (default) - good for right-skewed data like age
age_lnorm <- correctHeaps(age_heaped, method = "lnorm", seed = 42)
# Normal - good for symmetric data
age_norm <- correctHeaps(age_heaped, method = "norm", seed = 42)
# Uniform - simplest approach, baseline comparison
age_unif <- correctHeaps(age_heaped, method = "unif", seed = 42)
# Kernel - nonparametric, adapts to local data shape
age_kernel <- correctHeaps(age_heaped, method = "kernel", seed = 42)
# Compare results
data.frame(
Method = c("Original", "Heaped", "Log-normal", "Normal", "Uniform", "Kernel"),
Whipple = c(whipple(age_true), whipple(age_heaped),
whipple(age_lnorm), whipple(age_norm),
whipple(age_unif), whipple(age_kernel))
)
#> Method Whipple
#> 1 Original 97.83184
#> 2 Heaped 205.18244
#> 3 Log-normal 156.83716
#> 4 Normal 156.59197
#> 5 Uniform 164.40855
#> 6 Kernel 157.09812Use verbose = TRUE to get detailed information about the
correction:
result <- correctHeaps(age_heaped, heaps = "5year",
method = "lnorm", seed = 42, verbose = TRUE)
# Number of values changed
result$n_changed
#> [1] 1883
# Changes by heap age
head(result$changes_by_heap, 10)
#> 5 10 15 20 25 30 35 40 45 50
#> 531 332 240 142 120 82 92 54 41 42
# Heaping ratios (ratio > 1 indicates heaping)
head(result$ratios, 10)
#> 0 5 10 15 20 25 30 35
#> NA 2.654206 2.720207 3.181818 2.670588 2.764706 2.831461 3.577465
#> 40 45
#> 2.426667 2.322581For data with heaping only at ages ending in 0:
# Create data with 10-year heaping
age_heap10 <- age_true
heapers10 <- sample(c(TRUE, FALSE), length(age_true),
replace = TRUE, prob = c(0.3, 0.7))
age_heap10[heapers10] <- round(age_heap10[heapers10] / 10) * 10
# Correct 10-year heaping
age_corrected10 <- correctHeaps(age_heap10, heaps = "10year",
method = "lnorm", seed = 42)
c(Heaped = whipple(age_heap10),
Corrected = whipple(age_corrected10))
#> Heaped Corrected
#> 205.57395 91.54176For non-standard heaping patterns, specify exact positions:
To correct heaping at just one specific age:
# Add artificial heap at age 40
age_with_40heap <- c(age_true, rep(40, 500))
# Correct only the heap at 40
age_fixed_40 <- correctSingleHeap(age_with_40heap,
heap = 40,
before = 3, # range: 37-43
after = 3,
method = "lnorm",
seed = 42)
# Check the counts at age 40
c(Before = sum(age_with_40heap == 40),
After = sum(age_fixed_40 == 40))
#> Before After
#> 544 123When age is correlated with other variables (income, education, etc.), simple correction can distort these relationships. Model-based correction uses covariates to make smarter corrections.
# Create a dataset with correlated variables following the paper's approach
set.seed(123)
# Generate age from log-normal distribution
age <- rlnorm(10000, meanlog = 2.466869, sdlog = 1.652772)
age <- round(age[age < 93 & age >= 18])
n <- length(age)
# Simulate covariates correlated with age
age_scaled <- scale(age)
# Income as a function of age with noise
income <- exp(3 + 0.5 * age_scaled + rnorm(n, 0, 0.5))
# Marital status influenced by age
marital_status <- ifelse(plogis(-3 + 0.05 * age_scaled + rnorm(n, 0, 0.5)) > 0.5,
"married", "single")
# Education level with age-dependent probabilities
get_education_prob <- function(a) {
if (a < 25) return(c(0.5, 0.35, 0.15))
else if (a < 40) return(c(0.3, 0.45, 0.25))
else return(c(0.2, 0.4, 0.4))
}
education <- sapply(age, function(a) {
sample(c("High School", "Bachelor", "Master"), 1, prob = get_education_prob(a))
})
data_example <- data.frame(
age = age,
income = income,
marital_status = marital_status,
education = education
)
# Introduce heaping with 27% heap ratio (as in the paper)
heap_ratio <- 0.27
indices_to_heap <- sample(n, round(heap_ratio * n))
data_example$age_heaped <- data_example$age
data_example$age_heaped[indices_to_heap] <- round(data_example$age_heaped[indices_to_heap] / 5) * 5Apply model-based correction:
# Model-based correction using income, education and marital status
if (requireNamespace("ranger", quietly = TRUE) &&
requireNamespace("VIM", quietly = TRUE)) {
# Also apply simple correction for comparison
data_example$age_simple <- correctHeaps(
data_example$age_heaped,
heaps = "5year",
method = "lnorm",
seed = 42
)
# Model-based correction using covariates
data_example$age_corrected <- correctHeaps(
data_example$age_heaped,
heaps = "5year",
method = "lnorm",
model = age_heaped ~ income + marital_status + education,
dataModel = data_example,
seed = 42
)
# Compare correlations with log(income)
log_income <- log(data_example$income)
cor_original <- cor(data_example$age, log_income)
cor_heaped <- cor(data_example$age_heaped, log_income)
cor_simple <- cor(data_example$age_simple, log_income)
cor_corrected <- cor(data_example$age_corrected, log_income)
cat("Correlation of age with log(income):\n")
cat(" Original (true):", round(cor_original, 4), "\n")
cat(" Heaped:", round(cor_heaped, 4), "\n")
cat(" Simple correction:", round(cor_simple, 4), "\n")
cat(" Model-based correction:", round(cor_corrected, 4), "\n")
}
#> Correlation of age with log(income):
#> Original (true): 0.7087
#> Heaped: 0.7086
#> Simple correction: 0.7084
#> Model-based correction: 0.71Let’s visualize the model-based correction:
if (requireNamespace("ranger", quietly = TRUE) &&
requireNamespace("VIM", quietly = TRUE)) {
oldpar <- par(mfrow = c(2, 2))
# Age distributions
age_breaks <- seq(min(data_example$age) - 1, max(data_example$age) + 1, by = 1)
hist(data_example$age_heaped, breaks = age_breaks, col = "coral",
main = "Heaped Ages", xlab = "Age", border = "white")
hist(data_example$age_corrected, breaks = age_breaks, col = "forestgreen",
main = "Model-Based Corrected Ages", xlab = "Age", border = "white")
# Age vs log(Income) relationships
log_income <- log(data_example$income)
plot(data_example$age_heaped, log_income,
pch = 16, col = adjustcolor("coral", 0.3), cex = 0.5,
main = "Heaped: Age vs log(Income)",
xlab = "Age", ylab = "log(Income)")
abline(lm(log_income ~ data_example$age_heaped), col = "darkred", lwd = 2)
plot(data_example$age_corrected, log_income,
pch = 16, col = adjustcolor("forestgreen", 0.3), cex = 0.5,
main = "Corrected: Age vs log(Income)",
xlab = "Age", ylab = "log(Income)")
abline(lm(log_income ~ data_example$age_corrected), col = "darkgreen", lwd = 2)
par(oldpar)
}The model-based correction preserves the relationship between age and income better than simple correction because it uses the covariate information to determine the direction of adjustments.
Because correction involves random sampling, you can create multiple corrected datasets for proper uncertainty quantification:
# Create 5 corrected datasets with different seeds
m <- 5
corrected_datasets <- lapply(1:m, function(i) {
correctHeaps(age_heaped, heaps = "5year",
method = "lnorm", seed = i * 100)
})
# Calculate Whipple index for each
whipple_values <- sapply(corrected_datasets, whipple)
cat("Whipple indices across imputations:\n")
#> Whipple indices across imputations:
cat(" Mean:", round(mean(whipple_values), 2), "\n")
#> Mean: 105.98
cat(" SD:", round(sd(whipple_values), 2), "\n")
#> SD: 2.59
cat(" Range:", round(min(whipple_values), 2), "-",
round(max(whipple_values), 2), "\n")
#> Range: 102.86 - 109.73If some ages are known to be accurate (e.g., verified from administrative records), exclude them from correction:
# Assume first 100 observations are verified
verified_indices <- 1:100
age_protected <- correctHeaps(age_heaped,
heaps = "5year",
method = "lnorm",
fixed = verified_indices, # don't change these
seed = 42)
# Verify protected observations unchanged
all(age_heaped[verified_indices] == age_protected[verified_indices])
#> [1] TRUEIf you have 5-year age group data and need single-year ages, use the Sprague multipliers:
# Example: population counts in 5-year groups
pop_5year <- c(
1971990, 2095820, 2157190, 2094110, 2116580, # 0-4, 5-9, ..., 20-24
2003840, 1785690, 1502990, 1214170, 796934, # 25-29, ..., 45-49
627551, 530305, 488014, 364498, 259029, # 50-54, ..., 70-74
158047, 125941 # 75-79, 80+
)
# Disaggregate to single years
pop_single <- sprague(pop_5year)
# First 15 ages
head(pop_single, 15)
#> 0 1 2 3 4 5 6 7
#> 383576.4 388941.4 394401.8 399858.4 405212.1 410363.7 415213.9 419663.7
#> 8 9 10 11 12 13 14
#> 423613.8 426965.0 430053.3 433214.8 434174.3 431962.3 427785.2
# Total is preserved
c(Sum_5year = sum(pop_5year), Sum_single = sum(pop_single))
#> Sum_5year Sum_single
#> 20292699 20292699For data quality assessment at older ages, specialized indices are available:
# Create old-age data with heaping
set.seed(42)
old_ages <- c(sample(85:105, 3000, replace = TRUE),
rep(c(90, 95, 100), each = 200)) # heaping at round ages
# Coale-Li index (designed for ages 60+)
coale_li(old_ages, digit = 0, ageMin = 85)
#> [1] 1.815958
# Jdanov index (for very old ages like 95, 100, 105)
jdanov(old_ages, Agei = c(95, 100, 105))
#> [1] 176.6885
# Kannisto index (for a single old age)
kannisto(old_ages, Agei = 90)
#> [1] 2.003626
kannisto(old_ages, Agei = 95)
#> [1] 1.963752The heaping package provides a complete toolkit for handling heaping in survey data:
| Function | Purpose |
|---|---|
whipple() |
Standard and modified Whipple index |
myers() |
Myers’ blended index |
bachi() |
Bachi’s index |
noumbissi() |
Single-digit Noumbissi index |
spoorenberg() |
Total Modified Whipple index |
coale_li() |
Coale-Li index for old ages |
jdanov() |
Jdanov index for very old ages |
kannisto() |
Kannisto index for single old age |
heaping_indices() |
All common indices at once |
correctHeaps() |
Correct regular heaping patterns |
correctSingleHeap() |
Correct a single heap |
sprague() |
Disaggregate 5-year to single-year ages |
Key advantages over traditional methods:
Myers, R. J. (1940). Errors and bias in the reporting of ages in census data. Transactions of the Actuarial Society of America, 41, 395-415.
Spoorenberg, T. and Dutreuilh, C. (2007). Quality of age reporting: extension and application of the modified Whipple’s index. Population, 62(4), 729-741.
Sprague, T. B. (1880). Explanation of a new formula for interpolation. Journal of the Institute of Actuaries, 22, 270-285.