library(BinaryReplicates)
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.6
#> ✔ forcats 1.0.1 ✔ stringr 1.6.0
#> ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
#> ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
#> ✔ purrr 1.2.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(42)
The BinaryReplicates package provides tools for the analysis of binary replicates. The aim of this vignette is to provide a quick overview of the main functions of the package.
For each individual \(i\) in the dataset, we observe \(n_i\) replicates, each taking the value \(0\) or \(1\). These replicates are noisy measurements of the true latent state \(T_i\), which is also binary. We denote by \(S_i\) the number of replicates equal to \(1\) for individual \(i\). The probability of observing a \(1\) is \(1-q\) if \(T_i = 1\) and \(p\) if \(T_i = 0\). Thus, the probability that a single replicate of individual \(i\) equals \(1\) is \(T_i(1-q) + (1-T_i)p\).
We assume that the individuals are independent and identically distributed. The prevalence of state \(1\) in the population is \(\theta_T = P(T_i = 1)\).
The statistical model is thus \[ [T_i|\theta_T]\sim \text{Bernoulli}(\theta_T),\quad [S_i|T_i, p, q]\sim \text{Binomial}\big(n_i, T_i(1-q)+(1-T_i)p\big). \]
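As a quick numerical illustration of this likelihood, take the error rates \(p = q = 0.22\) used in the simulation below and an individual with \(n_i = 4\) replicates, \(S_i = 3\) of which are equal to \(1\). Then \[ P(S_i = 3 \mid T_i = 1) = \binom{4}{3}\, 0.78^3 \times 0.22 \approx 0.418, \qquad P(S_i = 3 \mid T_i = 0) = \binom{4}{3}\, 0.22^3 \times 0.78 \approx 0.033, \] so these replicates are far more compatible with \(T_i = 1\) than with \(T_i = 0\).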
In the Bayesian paradigm, we need to set a prior on the fixed parameters: \(\theta_T\in(0,1)\), \(p\in(0,1/2)\) and \(q\in(0,1/2)\). Under the prior distribution, the three parameters are independent.
The hyperparameters \(a_T, b_T, a_{FP}, b_{FP}, a_{FN}, b_{FN}\) are set by the user to incorporate external information into the analysis. In the absence of prior information, we recommend \[ a_T = b_T = 1/2;\quad a_{FP} = b_{FP} = a_{FN} = b_{FN} = 2. \]
We rely on Stan to sample from the posterior distribution of the fixed parameters and the latent variables.
We start by generating a synthetic dataset according to the model:
theta <- .4
p <- q <- .22
n <- 50
ni <- sample(2:6, n, replace = TRUE)
ti <- rbinom(n, 1, theta)
si <- rbinom(n, ni, ti*(1-q) + (1-ti)*p)
synth_data <- data.frame(ni = ni, si = si, ti = ti)
We will use this dataset to illustrate the main functions of the package.
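As a quick sanity check of the simulation (optional, and not part of the package workflow), we can compare the empirical proportion of positive replicates within each true state to the theoretical values \(1 - q = 0.78\) and \(p = 0.22\); the exact numbers depend on the seed:
# Empirical probability that a replicate equals 1, by true latent state
synth_data %>%
  group_by(ti) %>%
  summarise(prop_ones = sum(si) / sum(ni))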
Before scoring individuals, we can estimate the model parameters
\(\theta_T\), \(p\), and \(q\) using the EM algorithm. The
EMFit function computes the Maximum-A-Posteriori (MAP)
estimates:
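A minimal call is sketched below; we assume here that EMFit takes the numbers of replicates and of positive replicates in the same order as the other functions of the package (its exact signature and optional hyperparameter arguments may differ):
# MAP estimates of theta, p and q obtained with the EM algorithm
# (signature assumed; see ?EMFit for the exact arguments)
fit_em <- EMFit(ni, si)
The resulting object fit_em is reused below when computing the MAP-based scores.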
These estimates can then be used with MAP_scoring to
compute likelihood-based scores without knowing the true parameter
values.
We provide various ways to score individuals in the dataset: each score is a number between \(0\) and \(1\), and higher values indicate a higher probability of being in state \(1\). The scoring methods are the average scoring (average_scoring), the median scoring (median_scoring), the likelihood scoring (likelihood_scoring), which requires values for \(\theta_T\), \(p\) and \(q\), the MAP scoring (MAP_scoring), which plugs in the estimates returned by EMFit, and the Bayesian scoring (bayesian_scoring), which relies on the posterior sample.
Before computing the scores, we need to sample from the posterior distribution. With the default prior, this is done as follows:
fit <- BayesianFit(ni, si, chains = 4, iter = 5000, refresh = 0)
print(fit, pars = c("theta", "p", "q"))
#> Inference for Stan model: BayesianModel.
#> 4 chains, each with iter=5000; warmup=2500; thin=1;
#> post-warmup draws per chain=2500, total post-warmup draws=10000.
#>
#> mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
#> theta 0.37 0 0.11 0.16 0.29 0.37 0.45 0.59 4434 1
#> p 0.23 0 0.06 0.13 0.19 0.23 0.27 0.34 4631 1
#> q 0.21 0 0.07 0.08 0.16 0.21 0.26 0.36 4774 1
#>
#> Samples were drawn using NUTS(diag_e) at Tue Jan 20 16:58:29 2026.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split chains (at
#> convergence, Rhat=1).
We can now add the scores to the dataset:
synth_data <- synth_data %>%
mutate(Y_A = average_scoring(ni, si),
Y_M = median_scoring(ni, si),
Y_L = likelihood_scoring(ni, si, list(theta = theta, p = p, q = q)),
Y_MAP = MAP_scoring(ni, si, fit_em),
Y_B = bayesian_scoring(ni, si, fit))
Note that we have plugged in the true values of the parameters in the likelihood scoring function (Y_L). In practice, the true parameters are unknown and we use the MAP estimates from EMFit instead, as shown with Y_MAP.
The Bayesian fit provides credible intervals for the model
parameters. By default, credint returns 90% credible
intervals:
credint(fit)
#> lower upper
#> p 0.14129328 0.3248397
#> q 0.09752001 0.3356835
#> theta 0.18738012 0.5578003
You can also request different credibility levels:
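For instance, assuming credint exposes a level argument controlling the credibility level (the argument name is an assumption; check ?credint for the actual interface):
# 95% credible intervals (argument name assumed)
credint(fit, level = 0.95)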
The prevalence \(\theta_T\) can be
estimated from the scores using prevalence_estimate, or
directly from the Bayesian fit using
bayesian_prevalence_estimate:
theta_A <- prevalence_estimate(synth_data$Y_A)
theta_M <- prevalence_estimate(synth_data$Y_M)
theta_MAP <- prevalence_estimate(synth_data$Y_MAP)
theta_B <- bayesian_prevalence_estimate(fit)
cat("True prevalence: ", theta, "\n")
#> True prevalence: 0.4
cat("Average-based: ", round(theta_A, 3), "\n")
#> Average-based: 0.429
cat("Median-based: ", round(theta_M, 3), "\n")
#> Median-based: 0.42
cat("MAP-based: ", round(theta_MAP, 3), "\n")
#> MAP-based: 0.377
cat("Bayesian: ", round(theta_B, 3), "\n")
#> Bayesian: 0.371
The latent state of an individual is either \(0\) or \(1\). But since binary replicates are noisy measurements, we allow three different values for the prediction: \(0\), \(1\), and \(1/2\) when the evidence is inconclusive.
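The decision relies on two thresholds \(v_L \le v_U\) applied to the score \(Y_i\). A rule consistent with the calls to classify_with_scores below (the handling of ties at the thresholds is an assumption) is \[ \hat{T}_i = \begin{cases} 0 & \text{if } Y_i < v_L,\\ 1/2 & \text{if } v_L \le Y_i \le v_U,\\ 1 & \text{if } Y_i > v_U. \end{cases} \]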
We can compute the predictions for the synthetic dataset as follows:
v_L <- .35
v_U <- 1 - v_L
synth_data <- synth_data %>%
mutate(T_A = classify_with_scores(Y_A, v_L, v_U),
T_M = classify_with_scores(Y_M, v_L, v_U),
T_L = classify_with_scores(Y_L, v_L, v_U),
T_MAP = classify_with_scores(Y_MAP, v_L, v_U),
T_B = classify_with_scores(Y_B, v_L, v_U))
We can now count, for each true state and each scoring method, how many times we decide in favor of \(0\), \(1\) or \(1/2\) (inconclusive):
confusion <- synth_data %>%
mutate(
Status = ifelse(ti == 1, "T=1", "T=0"),
Average = T_A, Median = T_M, Likelihood = T_L, MAP = T_MAP, Bayesian = T_B) %>%
pivot_longer(cols = c(Average, Median, Likelihood, MAP, Bayesian),
names_to = "Method", values_to = "Decision") %>%
group_by(Status, Method, Decision) %>%
summarise(count = n(), .groups = "keep") %>%
ungroup() %>%
mutate(count = as.integer(count)) %>%
pivot_wider(names_from = Decision, values_from = count, values_fill = 0)
confusion
#> # A tibble: 10 × 5
#> Status Method `0` `0.5` `1`
#> <chr> <chr> <int> <int> <int>
#> 1 T=0 Average 24 4 3
#> 2 T=0 Bayesian 26 5 0
#> 3 T=0 Likelihood 26 2 3
#> 4 T=0 MAP 26 2 3
#> 5 T=0 Median 26 2 3
#> 6 T=1 Average 2 2 15
#> 7 T=1 Bayesian 2 5 12
#> 8 T=1 Likelihood 2 0 17
#> 9 T=1 MAP 2 0 17
#> 10 T=1 Median 2 0 17
Based on the Bayesian fit, we can also compute the Bayesian scores and the predictions on new data. Let us say we are interested in predicting the scores and the latent state of individuals with \(n_i = 8\) replicates and \(s_i \in \{0, 1, 2, 3, 4, 5, 6, 7, 8\}\). The new data are thus:
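The printed output below fully determines this data frame, so one way to construct it (a sketch; the original vignette may build it differently) is:
# One row per possible number of positive replicates out of n_i = 8
new_data <- data.frame(ni = 8, si = 0:8)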
The scores and the predictions can be computed as follows:
new_data <- new_data %>%
mutate(Y_B = predict_scores(ni, si, fit),
T_B = classify_with_scores(Y_B, v_L, v_U))
new_data
#> ni si Y_B T_B
#> 1 8 0 0.0002131493 0.0
#> 2 8 1 0.0016552392 0.0
#> 3 8 2 0.0138130209 0.0
#> 4 8 3 0.0944144061 0.0
#> 5 8 4 0.3741387192 0.5
#> 6 8 5 0.7533352924 1.0
#> 7 8 6 0.9515953209 1.0
#> 8 8 7 0.9950104409 1.0
#> 9 8 8 0.9995601165 1.0
In the above situation, the predictions are conclusive for all the individuals except the one where we observed as many \(0\)’s as \(1\)’s in the replicates.