Introduction to the heaping Package

The Problem: What is Heaping?

Heaping is a common data quality issue where certain values appear more frequently than expected due to rounding or digit preference. This is especially common with self-reported data.

Let’s create some example data to illustrate, using the same distribution parameters as in Templ (2024):

# Generate realistic age data using a log-normal distribution
# These parameters produce a realistic age distribution
set.seed(123)
age_true <- rlnorm(10000, meanlog = 2.466869, sdlog = 1.652772)
age_true <- round(age_true[age_true < 93])

# Create heaped version where 27% of respondents round their age to multiples of 5
# This heap_ratio is typical for survey data with moderate heaping
heap_ratio <- 0.27
n <- length(age_true)
indices_to_heap <- sample(n, round(heap_ratio * n))
age_heaped <- age_true
age_heaped[indices_to_heap] <- round(age_heaped[indices_to_heap] / 5) * 5

Let’s visualize the difference:

oldpar <- par(mfrow = c(1, 2))

# Define breaks to cover the full range
age_breaks <- seq(0, max(age_true, age_heaped) + 1, by = 1)

# True ages
hist(age_true, breaks = age_breaks, col = "steelblue",
     main = "True Ages (No Heaping)", xlab = "Age",
     border = "white")

# Heaped ages
hist(age_heaped, breaks = age_breaks, col = "coral",
     main = "Heaped Ages", xlab = "Age",
     border = "white")

Comparison of true ages vs heaped ages

par(oldpar)

Notice the spikes at ages ending in 0 and 5 in the heaped data.

Detecting Heaping: Heaping Indices

Before correcting heaping, we need to detect and quantify it. The package provides several established indices.

Whipple’s Index

The most commonly used index. It compares the frequency of ages ending in 0 or 5 to the expected frequency.

# Standard Whipple index (100 = no heaping, 500 = maximum heaping)
whipple(age_true, method = "standard")
#> [1] 97.83184
whipple(age_heaped, method = "standard")
#> [1] 205.1824

# Modified Whipple index (0 = no heaping, 1 = maximum heaping)
whipple(age_true, method = "modified")
#> [1] 0.1114825
whipple(age_heaped, method = "modified")
#> [1] 0.2397257

Myers’ Index

Myers’ blended index measures preferences for each terminal digit (0-9):

# Myers' index (0 = no heaping, 90 = maximum)
myers(age_true)
#> [1] 6.906823
myers(age_heaped)
#> [1] 19.42773

Bachi’s Index

Similar to Myers’ but uses a different blending technique:

# Bachi's index (0 = no heaping, 90 = maximum)
bachi(age_true)
#> [1] 2.951748
bachi(age_heaped)
#> [1] 20.94171

Noumbissi’s Index

Evaluates heaping for a specific terminal digit:

# Noumbissi's index for digit 0 (1.0 = no heaping)
noumbissi(age_true, digit = 0)
#> [1] 1.035966
noumbissi(age_heaped, digit = 0)
#> [1] 1.997654

# Noumbissi's index for digit 5
noumbissi(age_true, digit = 5)
#> [1] 0.9842154
noumbissi(age_heaped, digit = 5)
#> [1] 2.089136

Convenience Function: All Indices at Once

# Calculate all indices
indices_true <- heaping_indices(age_true)
indices_heaped <- heaping_indices(age_heaped)

# Compare
data.frame(
  Index = names(indices_true),
  True = round(unlist(indices_true), 3),
  Heaped = round(unlist(indices_heaped), 3)
)
#>                             Index   True  Heaped
#> whipple_standard whipple_standard 97.832 205.182
#> whipple_modified whipple_modified  0.111   0.240
#> myers                       myers  6.907  19.428
#> bachi                       bachi  2.952  20.942
#> spoorenberg           spoorenberg  0.324   4.210
#> noumbissi_0           noumbissi_0  1.036   1.998
#> noumbissi_5           noumbissi_5  0.984   2.089

Using Sampling Weights

All indices support sampling weights for survey data:

# Create example weights
weights <- runif(length(age_heaped), 0.5, 2)

# Weighted Whipple index
whipple(age_heaped, weight = weights)
#> [1] 205.1042

Correcting Heaping

The key innovation of this package is correcting heaping at the individual level. Traditional methods only smooth aggregated frequency distributions, which destroys individual records needed for regression analysis.

Basic Correction

The correctHeaps() function corrects heaping by replacing a proportion of heaped values with draws from truncated distributions:

# Correct heaping using log-normal distribution
age_corrected <- correctHeaps(age_heaped,
                              heaps = "5year",  # heaps at 0, 5, 10, 15, ...
                              method = "lnorm",
                              seed = 42)        # for reproducibility

# Compare Whipple indices
c(Original = whipple(age_true),
  Heaped = whipple(age_heaped),
  Corrected = whipple(age_corrected))
#>  Original    Heaped Corrected 
#>  97.83184 205.18244 105.23533

Let’s visualize the correction:

oldpar <- par(mfrow = c(1, 2))

hist(age_heaped, breaks = age_breaks, col = "coral",
     main = "Before Correction", xlab = "Age", border = "white")

hist(age_corrected, breaks = age_breaks, col = "forestgreen",
     main = "After Correction", xlab = "Age", border = "white")

Before and after heaping correction


par(oldpar)

Correction Methods

Four distribution methods are available:

# Log-normal (default) - good for right-skewed data like age
age_lnorm <- correctHeaps(age_heaped, method = "lnorm", seed = 42)

# Normal - good for symmetric data
age_norm <- correctHeaps(age_heaped, method = "norm", seed = 42)

# Uniform - simplest approach, baseline comparison
age_unif <- correctHeaps(age_heaped, method = "unif", seed = 42)

# Kernel - nonparametric, adapts to local data shape
age_kernel <- correctHeaps(age_heaped, method = "kernel", seed = 42)

# Compare results
data.frame(
  Method = c("Original", "Heaped", "Log-normal", "Normal", "Uniform", "Kernel"),
  Whipple = c(whipple(age_true), whipple(age_heaped),
              whipple(age_lnorm), whipple(age_norm),
              whipple(age_unif), whipple(age_kernel))
)
#>       Method   Whipple
#> 1   Original  97.83184
#> 2     Heaped 205.18244
#> 3 Log-normal 156.83716
#> 4     Normal 156.59197
#> 5    Uniform 164.40855
#> 6     Kernel 157.09812

Verbose Output and Diagnostics

Use verbose = TRUE to get detailed information about the correction:

result <- correctHeaps(age_heaped, heaps = "5year",
                       method = "lnorm", seed = 42, verbose = TRUE)

# Number of values changed
result$n_changed
#> [1] 1883

# Changes by heap age
head(result$changes_by_heap, 10)
#>   5  10  15  20  25  30  35  40  45  50 
#> 531 332 240 142 120  82  92  54  41  42

# Heaping ratios (ratio > 1 indicates heaping)
head(result$ratios, 10)
#>        0        5       10       15       20       25       30       35 
#>       NA 2.654206 2.720207 3.181818 2.670588 2.764706 2.831461 3.577465 
#>       40       45 
#> 2.426667 2.322581

10-Year Heaping

For data with heaping only at ages ending in 0:

# Create data with 10-year heaping
age_heap10 <- age_true
heapers10 <- sample(c(TRUE, FALSE), length(age_true),
                    replace = TRUE, prob = c(0.3, 0.7))
age_heap10[heapers10] <- round(age_heap10[heapers10] / 10) * 10

# Correct 10-year heaping
age_corrected10 <- correctHeaps(age_heap10, heaps = "10year",
                                method = "lnorm", seed = 42)

c(Heaped = whipple(age_heap10),
  Corrected = whipple(age_corrected10))
#>    Heaped Corrected 
#> 205.57395  91.54176

Custom Heap Positions

For non-standard heaping patterns, specify exact positions:

# Heaping at specific ages (e.g., 18, 21, 30, 40, 50, 65)
custom_positions <- c(18, 21, 30, 40, 50, 65)
age_custom <- correctHeaps(age_heaped,
                           heaps = custom_positions,
                           method = "lnorm", seed = 42)

Single Heap Correction

To correct heaping at just one specific age:

# Add artificial heap at age 40
age_with_40heap <- c(age_true, rep(40, 500))

# Correct only the heap at 40
age_fixed_40 <- correctSingleHeap(age_with_40heap,
                                   heap = 40,
                                   before = 3,  # range: 37-43
                                   after = 3,
                                   method = "lnorm",
                                   seed = 42)

# Check the counts at age 40
c(Before = sum(age_with_40heap == 40),
  After = sum(age_fixed_40 == 40))
#> Before  After 
#>    544    123

Model-Based Correction

When age is correlated with other variables (income, education, etc.), simple correction can distort these relationships. Model-based correction uses covariates to make smarter corrections.

# Create a dataset with correlated variables following the paper's approach
set.seed(123)

# Generate age from log-normal distribution
age <- rlnorm(10000, meanlog = 2.466869, sdlog = 1.652772)
age <- round(age[age < 93 & age >= 18])
n <- length(age)

# Simulate covariates correlated with age
age_scaled <- scale(age)

# Income as a function of age with noise
income <- exp(3 + 0.5 * age_scaled + rnorm(n, 0, 0.5))

# Marital status influenced by age
marital_status <- ifelse(plogis(-3 + 0.05 * age_scaled + rnorm(n, 0, 0.5)) > 0.5,
                         "married", "single")

# Education level with age-dependent probabilities
get_education_prob <- function(a) {
  if (a < 25) return(c(0.5, 0.35, 0.15))
  else if (a < 40) return(c(0.3, 0.45, 0.25))
  else return(c(0.2, 0.4, 0.4))
}
education <- sapply(age, function(a) {
  sample(c("High School", "Bachelor", "Master"), 1, prob = get_education_prob(a))
})

data_example <- data.frame(
  age = age,
  income = income,
  marital_status = marital_status,
  education = education
)

# Introduce heaping with 27% heap ratio (as in the paper)
heap_ratio <- 0.27
indices_to_heap <- sample(n, round(heap_ratio * n))
data_example$age_heaped <- data_example$age
data_example$age_heaped[indices_to_heap] <- round(data_example$age_heaped[indices_to_heap] / 5) * 5

Apply model-based correction:

# Model-based correction using income, education and marital status
if (requireNamespace("ranger", quietly = TRUE) &&
    requireNamespace("VIM", quietly = TRUE)) {

  # Also apply simple correction for comparison
  data_example$age_simple <- correctHeaps(
    data_example$age_heaped,
    heaps = "5year",
    method = "lnorm",
    seed = 42
  )

  # Model-based correction using covariates
  data_example$age_corrected <- correctHeaps(
    data_example$age_heaped,
    heaps = "5year",
    method = "lnorm",
    model = age_heaped ~ income + marital_status + education,
    dataModel = data_example,
    seed = 42
  )

  # Compare correlations with log(income)
  log_income <- log(data_example$income)
  cor_original <- cor(data_example$age, log_income)
  cor_heaped <- cor(data_example$age_heaped, log_income)
  cor_simple <- cor(data_example$age_simple, log_income)
  cor_corrected <- cor(data_example$age_corrected, log_income)

  cat("Correlation of age with log(income):\n")
  cat("  Original (true):", round(cor_original, 4), "\n")
  cat("  Heaped:", round(cor_heaped, 4), "\n")
  cat("  Simple correction:", round(cor_simple, 4), "\n")
  cat("  Model-based correction:", round(cor_corrected, 4), "\n")
}
#> Correlation of age with log(income):
#>   Original (true): 0.7087 
#>   Heaped: 0.7086 
#>   Simple correction: 0.7084 
#>   Model-based correction: 0.71

Let’s visualize the model-based correction:

if (requireNamespace("ranger", quietly = TRUE) &&
    requireNamespace("VIM", quietly = TRUE)) {

  oldpar <- par(mfrow = c(2, 2))

  # Age distributions
  age_breaks <- seq(min(data_example$age) - 1, max(data_example$age) + 1, by = 1)
  hist(data_example$age_heaped, breaks = age_breaks, col = "coral",
       main = "Heaped Ages", xlab = "Age", border = "white")

  hist(data_example$age_corrected, breaks = age_breaks, col = "forestgreen",
       main = "Model-Based Corrected Ages", xlab = "Age", border = "white")

  # Age vs log(Income) relationships
  log_income <- log(data_example$income)
  plot(data_example$age_heaped, log_income,
       pch = 16, col = adjustcolor("coral", 0.3), cex = 0.5,
       main = "Heaped: Age vs log(Income)",
       xlab = "Age", ylab = "log(Income)")
  abline(lm(log_income ~ data_example$age_heaped), col = "darkred", lwd = 2)

  plot(data_example$age_corrected, log_income,
       pch = 16, col = adjustcolor("forestgreen", 0.3), cex = 0.5,
       main = "Corrected: Age vs log(Income)",
       xlab = "Age", ylab = "log(Income)")
  abline(lm(log_income ~ data_example$age_corrected), col = "darkgreen", lwd = 2)

  par(oldpar)
}

Comparison of heaped ages and model-based corrected ages

The model-based correction preserves the relationship between age and income better than simple correction because it uses the covariate information to determine the direction of adjustments.

Multiple Imputation Framework

Because correction involves random sampling, you can create multiple corrected datasets for proper uncertainty quantification:

# Create 5 corrected datasets with different seeds
m <- 5
corrected_datasets <- lapply(1:m, function(i) {
  correctHeaps(age_heaped, heaps = "5year",
               method = "lnorm", seed = i * 100)
})

# Calculate Whipple index for each
whipple_values <- sapply(corrected_datasets, whipple)

cat("Whipple indices across imputations:\n")
#> Whipple indices across imputations:
cat("  Mean:", round(mean(whipple_values), 2), "\n")
#>   Mean: 105.98
cat("  SD:", round(sd(whipple_values), 2), "\n")
#>   SD: 2.59
cat("  Range:", round(min(whipple_values), 2), "-",
    round(max(whipple_values), 2), "\n")
#>   Range: 102.86 - 109.73

Protecting Specific Observations

If some ages are known to be accurate (e.g., verified from administrative records), exclude them from correction:

# Assume first 100 observations are verified
verified_indices <- 1:100

age_protected <- correctHeaps(age_heaped,
                              heaps = "5year",
                              method = "lnorm",
                              fixed = verified_indices,  # don't change these
                              seed = 42)

# Verify protected observations unchanged
all(age_heaped[verified_indices] == age_protected[verified_indices])
#> [1] TRUE

Working with Aggregated Data: Sprague Multipliers

If you have 5-year age group data and need single-year ages, use the Sprague multipliers:

# Example: population counts in 5-year groups
pop_5year <- c(
  1971990, 2095820, 2157190, 2094110, 2116580,  # 0-4, 5-9, ..., 20-24
  2003840, 1785690, 1502990, 1214170, 796934,   # 25-29, ..., 45-49
  627551, 530305, 488014, 364498, 259029,       # 50-54, ..., 70-74
  158047, 125941                                 # 75-79, 80+
)

# Disaggregate to single years
pop_single <- sprague(pop_5year)

# First 15 ages
head(pop_single, 15)
#>        0        1        2        3        4        5        6        7 
#> 383576.4 388941.4 394401.8 399858.4 405212.1 410363.7 415213.9 419663.7 
#>        8        9       10       11       12       13       14 
#> 423613.8 426965.0 430053.3 433214.8 434174.3 431962.3 427785.2

# Total is preserved
c(Sum_5year = sum(pop_5year), Sum_single = sum(pop_single))
#>  Sum_5year Sum_single 
#>   20292699   20292699

Indices for Old Ages

For data quality assessment at older ages, specialized indices are available:

# Create old-age data with heaping
set.seed(42)
old_ages <- c(sample(85:105, 3000, replace = TRUE),
              rep(c(90, 95, 100), each = 200))  # heaping at round ages

# Coale-Li index (designed for ages 60+)
coale_li(old_ages, digit = 0, ageMin = 85)
#> [1] 1.815958

# Jdanov index (for very old ages like 95, 100, 105)
jdanov(old_ages, Agei = c(95, 100, 105))
#> [1] 176.6885

# Kannisto index (for a single old age)
kannisto(old_ages, Agei = 90)
#> [1] 2.003626
kannisto(old_ages, Agei = 95)
#> [1] 1.963752

Summary

The heaping package provides a complete toolkit for handling heaping in survey data:

Function	Purpose
`whipple()`	Standard and modified Whipple index
`myers()`	Myers’ blended index
`bachi()`	Bachi’s index
`noumbissi()`	Single-digit Noumbissi index
`spoorenberg()`	Total Modified Whipple index
`coale_li()`	Coale-Li index for old ages
`jdanov()`	Jdanov index for very old ages
`kannisto()`	Kannisto index for single old age
`heaping_indices()`	All common indices at once
`correctHeaps()`	Correct regular heaping patterns
`correctSingleHeap()`	Correct a single heap
`sprague()`	Disaggregate 5-year to single-year ages

Key advantages over traditional methods:

Preserves individual records for regression and machine learning
Multiple correction methods (log-normal, normal, uniform, kernel)
Model-based correction preserves covariate relationships
Supports sampling weights for survey data
Multiple imputation compatible for uncertainty quantification