Note: All code chunks have
eval = FALSEand are shown for illustration only. To run them interactively:
This vignette demonstrates complete dicepro deconvolution workflows on fully synthetic data, enabling end-to-end reproducibility without requiring access to proprietary data-sets.
Two simulation strategies are available and covered here:
| Function | Reference matrix | Cell-type names | Best used for |
|---|---|---|---|
simulation() |
Synthetic (generated internally) | Generic (CellType_k) |
Flexible benchmarking with custom parameters |
simulation_bluecode() |
BlueCode (bundled real reference) | Real biological names | Realistic benchmarking on human tissue cell types |
Both functions return the same standardized triplet –
$W, $p, $B – and are fully
interchangeable as input to dicepro().
The pipeline proceeds in four stages:
dicepro()
function with automated hyper-parameter optimization.simulation() generates a self-consistent triplet from
scratch: a synthetic reference matrix $W, ground-truth
proportion matrix $p, and noisy bulk matrix
$B, guaranteeing column-name alignment across all three
outputs.
library(dicepro)
set.seed(2101L)
sim_data <- simulation(
loi = "gauss",
scenario = "hierarchical",
nSample = 30,
nGenes = 200,
nCellsType = 10,
sigma_bio = 0.07,
sigma_tech = 0.07,
seed = 2101L
)
cat("Reference :", dim(sim_data$W), "\n")
cat("Proportions:", dim(sim_data$p), "\n")
cat("Bulk :", dim(sim_data$B), "\n")
cat("Row sums :", round(range(rowSums(sim_data$p)), 4), "\n")bulk_clean <- as.matrix(sim_data$W) %*% t(as.matrix(sim_data$p))
plot(
bulk_clean[seq_len(500)],
as.matrix(sim_data$B)[seq_len(500)],
xlab = "Clean bulk (first 500 entries)",
ylab = "Noisy bulk",
pch = 19, cex = 0.4, col = "#2c7bb6",
main = "Noise model: clean vs noisy bulk"
)
abline(0, 1, col = "firebrick", lwd = 1.5)out <- dicepro(
reference = as.matrix(sim_data$W)[, -c(1,5,10)],
bulk = as.matrix(sim_data$B),
methodDeconv = "FARDEEP",
W_prime = 0,
bulkName = "SimBulk",
refName = "SimRef",
hp_max_evals = 500,
algo_select = "random",
output_path = tempdir(),
hspaceTechniqueChoose = "gamma_dominant",
normalize = FALSE
)true_prop <- as.matrix(sim_data$p)
pred_prop <- as.matrix(out$H)
common_ct <- intersect(colnames(pred_prop), colnames(true_prop))
true_common <- true_prop[, common_ct, drop = FALSE]
pred_common <- pred_prop[, common_ct, drop = FALSE]
r_overall <- cor(as.vector(true_common), as.vector(pred_common))
cat(sprintf("Overall Pearson r: %.3f\n", r_overall))
plot(
as.vector(true_common),
as.vector(pred_common),
xlab = "True proportions",
ylab = "Predicted proportions",
pch = 19, cex = 0.5, col = "#2c7bb6aa",
main = sprintf("True vs Predicted (r = %.3f)", r_overall)
)
abline(0, 1, col = "firebrick", lwd = 1.5)ct_cors <- vapply(common_ct, function(ct) {
cor(true_common[, ct], pred_common[, ct])
}, numeric(1L))
par(mar = c(5, 10, 3, 1))
barplot(
sort(ct_cors),
horiz = TRUE, las = 1,
col = ifelse(sort(ct_cors) > 0.7, "#2c7bb6", "#d7191c"),
xlab = "Pearson r",
main = "Per-cell-type correlation (synthetic)",
xlim = c(-0.2, 1)
)
abline(v = 0.7, lty = 2, col = "gray40")simulation_bluecode() uses the BlueCode
reference matrix bundled with the package – a curated signature matrix
covering 34 human cell types across five tissue compartments – as the
fixed reference $W. Proportions are simulated via a
two-level Dirichlet hierarchy that reflects the biological organisation
of human tissues.
This strategy is recommended when benchmarking should be grounded in a biologically realistic reference, while keeping full control over the ground-truth proportions.
The hierarchy encodes five major tissue compartments with their constituent cell types:
| Compartment | Cell types (n) | Default α |
|---|---|---|
| Immune | B naive/memory, T CD4/CD8, NK, Monocyte, Macrophage, Neutrophil | 4.0 |
| Stromal | Fibroblasts (×5), MSC, Chondrocyte, Osteoblast | 2.5 |
| Endothelial | Large vessel, microvascular (×2) | 1.8 |
| Epithelial | Mammary, renal, respiratory, keratinocyte, melanocyte | 1.8 |
| Muscle | Smooth muscle (×7), cardiac myocyte, myometrial | 1.5 |
library(dicepro)
sim_bc <- simulation_bluecode(
nSample = 30,
sigma_bio = 0.15,
sigma_tech = 0.02,
seed = 2101L
)
cat("Reference :", dim(sim_bc$W), "\n") # nGenes x 34
cat("Proportions:", dim(sim_bc$p), "\n") # 30 x 34
cat("Bulk :", dim(sim_bc$B), "\n") # nGenes x 30
cat("Row sums :", round(range(rowSums(sim_bc$p)), 4), "\n")
# Real cell-type names from BlueCode
head(colnames(sim_bc$p))The hierarchical model induces realistic between-compartment variation. Visualising the compartment-level totals confirms the expected dominance of the Immune compartment.
# Cell-type to compartment mapping (mirrors .bluecode_cell_groups)
compartment_map <- c(
rep("Immune", 9),
rep("Stromal", 8),
rep("Endothelial", 3),
rep("Epithelial", 5),
rep("Muscle", 9)
)
names(compartment_map) <- colnames(sim_bc$p)
# Aggregate proportions by compartment for each sample
comp_props <- t(apply(sim_bc$p, 1, function(row) {
tapply(row, compartment_map[names(row)], sum)
}))
boxplot(
comp_props,
col = c("#4393c3", "#74c476", "#fd8d3c", "#9ecae1", "#fb6a4a"),
ylab = "Compartment proportion",
main = "Simulated compartment proportions (BlueCode)",
las = 2
)bulk_clean_bc <- as.matrix(sim_bc$W) %*% t(as.matrix(sim_bc$p))
plot(
bulk_clean_bc[seq_len(500)],
as.matrix(sim_bc$B)[seq_len(500)],
xlab = "Clean bulk (first 500 entries)",
ylab = "Noisy bulk",
pch = 19, cex = 0.4, col = "#74c476",
main = "Noise model: clean vs noisy bulk (BlueCode)"
)
abline(0, 1, col = "firebrick", lwd = 1.5)true_prop_bc <- as.matrix(sim_bc$p)
pred_prop_bc <- as.matrix(out_bc$H)
common_ct_bc <- intersect(colnames(pred_prop_bc), colnames(true_prop_bc))
true_common_bc <- true_prop_bc[, common_ct_bc, drop = FALSE]
pred_common_bc <- pred_prop_bc[, common_ct_bc, drop = FALSE]
r_overall_bc <- cor(as.vector(true_common_bc), as.vector(pred_common_bc))
cat(sprintf("Overall Pearson r: %.3f\n", r_overall_bc))
plot(
as.vector(true_common_bc),
as.vector(pred_common_bc),
xlab = "True proportions",
ylab = "Predicted proportions",
pch = 19, cex = 0.5, col = "#74c47699",
main = sprintf("True vs Predicted -- BlueCode (r = %.3f)", r_overall_bc)
)
abline(0, 1, col = "firebrick", lwd = 1.5)ct_cors_bc <- vapply(common_ct_bc, function(ct) {
cor(true_common_bc[, ct], pred_common_bc[, ct])
}, numeric(1L))
par(mar = c(5, 14, 3, 1))
barplot(
sort(ct_cors_bc),
horiz = TRUE, las = 1,
col = ifelse(sort(ct_cors_bc) > 0.7, "#2c7bb6", "#d7191c"),
xlab = "Pearson r",
main = "Per-cell-type correlation (BlueCode)",
xlim = c(-0.2, 1)
)
abline(v = 0.7, lty = 2, col = "gray40")When both strategies are run under the same conditions, their overall correlation scores can be compared directly.
These two strategies are complementary: simulation()
offers maximum flexibility for testing deconvolution under stress
conditions, with varying numbers of genes, cell types, or noise levels;
simulation_bluecode() anchors the benchmark to a
biologically grounded reference, making the results easier to interpret
in the context of real-world data from human tissues.