Understanding Model Types

Overview

The bayesiansurpriser package provides five model types that represent different hypotheses about how data might be distributed. Bayesian Surprise measures how much our beliefs about which model is correct change after observing data.

The Five Model Types

1. Uniform Model (`bs_model_uniform`)

The uniform model assumes all observations are equally likely regardless of any contextual factors.

model_uniform <- bs_model_uniform()
print(model_uniform)
#> <bs_model: uniform >
#>   Name: Uniform 
#>   Parameters:
#>     n_regions :

Use when: You want to detect any deviation from uniform distribution.

Likelihood: Each observation has probability 1/n where n is the number of observations.

2. Base Rate Model (`bs_model_baserate`)

The base rate model assumes observations follow a known baseline distribution (e.g., population proportions).

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Create base rate model from birth counts
model_baserate <- bs_model_baserate(nc$BIR74)
print(model_baserate)
#> <bs_model: baserate >
#>   Name: Base Rate 
#>   Parameters:
#>     expected : [ 100 values]
#>     expected_prop : [ 100 values]
#>     normalize : TRUE

Use when: You have expected values (population, baseline rates) and want to detect deviations from them.

Likelihood: Probability proportional to expected values (normalized to sum to 1).

3. Gaussian Model (`bs_model_gaussian`)

The Gaussian model assumes observations follow a normal distribution, either with fixed parameters or fitted from the data.

# Fit from data
model_gaussian_fit <- bs_model_gaussian()

# Fixed parameters
model_gaussian_fixed <- bs_model_gaussian(mu = 100, sigma = 20, fit_from_data = FALSE)
print(model_gaussian_fixed)
#> <bs_model: gaussian >
#>   Name: Gaussian 
#>   Parameters:
#>     mu : 100 
#>     sigma : 20 
#>     fit_from_data : FALSE

Use when: You expect data to follow a normal distribution and want to detect outliers.

Likelihood: Based on normal PDF with given (or estimated) mean and standard deviation.

4. Sampled/KDE Model (`bs_model_sampled`)

The sampled model uses kernel density estimation (KDE) to create an empirical distribution from the data itself.

# Sample 20% of data for KDE
model_sampled <- bs_model_sampled(sample_frac = 0.2, kernel = "gaussian")
print(model_sampled)
#> <bs_model: sampled >
#>   Name: Sampled (KDE) 
#>   Parameters:
#>     sample_frac : 0.2 
#>     kernel : gaussian 
#>     bandwidth : nrd0 
#>     sample_indices :

Use when: You want a data-driven baseline without parametric assumptions.

Likelihood: Evaluated from kernel density estimate of sampled data.

5. de Moivre Funnel Model (`bs_model_funnel`)

The funnel model accounts for sampling variance, recognizing that smaller samples have higher natural variability.

# Create funnel model with sample sizes
model_funnel <- bs_model_funnel(nc$BIR74, type = "count")
print(model_funnel)
#> <bs_model: funnel >
#>   Name: de Moivre Funnel (paper) 
#>   Parameters:
#>     sample_size : [ 100 values]
#>     target_rate :  
#>     type : count 
#>     formula : paper 
#>     control_limits : 2, 3

Use when: Observations come from samples of different sizes and you want to account for sampling error.

Likelihood: Based on z-scores computed from expected standard error given sample size.

Combining Models: Model Space

Models are combined into a “model space” with prior probabilities:

# Default model space (uniform + baserate + funnel)
space_default <- default_model_space(nc$BIR74)
print(space_default)
#> <bs_model_space>
#>   Models: 3 
#>    1. Uniform (prior: 0.3333)
#>    2. Base Rate (prior: 0.3333)
#>    3. de Moivre Funnel (paper) (prior: 0.3333)

# Custom model space with explicit prior
space_custom <- model_space(
  bs_model_uniform(),
  bs_model_baserate(nc$BIR74),
  bs_model_gaussian(),
  bs_model_funnel(nc$BIR74),
  prior = c(0.1, 0.4, 0.2, 0.3),
  names = c("Uniform", "Population", "Normal", "Funnel")
)
print(space_custom)
#> <bs_model_space>
#>   Models: 4 
#>    1. Uniform (prior: 0.1)
#>    2. Population (prior: 0.4)
#>    3. Normal (prior: 0.2)
#>    4. Funnel (prior: 0.3)

How Models Affect Surprise

Different model combinations highlight different aspects of the data:

# Only uniform model - highlights any non-uniform distribution
result_uniform <- surprise(nc, observed = SID74,
                          models = model_space(bs_model_uniform()))

# Uniform + baserate - highlights deviation from population proportion
result_pop <- surprise(nc, observed = SID74, expected = BIR74,
                      models = model_space(
                        bs_model_uniform(),
                        bs_model_baserate(nc$BIR74)
                      ))

# Full default - accounts for sampling variance too
result_full <- surprise(nc, observed = SID74, expected = BIR74)

# Compare max surprise values
cat("Uniform only, max surprise:", max(result_uniform$surprise), "\n")
#> Uniform only, max surprise: 0
cat("With base rate, max surprise:", max(result_pop$surprise), "\n")
#> With base rate, max surprise: 0.9999773
cat("Full model space, max surprise:", max(result_full$surprise), "\n")
#> Full model space, max surprise: 0.6008295

Global Model Updates

Per-region surprise scores use the model-space priors for each observation. If you want a single global posterior over models after observing the full dataset, call bayesian_update() explicitly:

space <- default_model_space(nc$BIR74)
updated_space <- bayesian_update(space, nc$SID74)

cat("Prior:", updated_space$prior, "\n")
#> Prior: 0.3333333 0.3333333 0.3333333
cat("Posterior:", updated_space$posterior, "\n")
#> Posterior: 0.4275831 0.5650043 0.007412583

Guidelines for Model Selection

Scenario	Recommended Models
Raw counts, no baseline	Uniform, Gaussian
Counts with population data	Uniform, Base Rate, Funnel
Rates/proportions	Uniform, Base Rate
Exploratory count/exposure analysis	Uniform, Base Rate, Funnel
Small samples prevalent	Include Funnel
Unknown distribution	Include Sampled/KDE