---
title: "dicepro - Simulated Data Workflow"
author: "dicepro Team"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    number_sections: true
vignette: >
  %\VignetteIndexEntry{dicepro - Simulated Data Workflow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse   = TRUE,
  comment    = "#>",
  fig.width  = 7,
  fig.height = 5,
  warning    = FALSE,
  message    = FALSE,
  eval       = FALSE
)
```

> **Note:** All code chunks have `eval = FALSE` and are shown for
> illustration only. To run them interactively:
> ```r
> library(dicepro)
> # copy-paste the chunks below into your R session
> ```

# Overview

This vignette demonstrates complete dicepro deconvolution workflows on
**fully synthetic data**, enabling end-to-end reproducibility without
requiring access to proprietary data-sets.

Two simulation strategies are available and covered here:

| Function | Reference matrix | Cell-type names | Best used for |
|---|---|---|---|
| `simulation()` | Synthetic (generated internally) | Generic (`CellType_k`) | Flexible benchmarking with custom parameters |
| `simulation_bluecode()` | BlueCode (bundled real reference) | Real biological names | Realistic benchmarking on human tissue cell types |

Both functions return the same standardized triplet -- `$W`, `$p`, `$B` --
and are fully interchangeable as input to `dicepro()`.

The pipeline proceeds in four stages:

1. **Simulate** a reference signature matrix and ground-truth cell-type
   proportions.
2. **Generate** a noisy bulk RNA-seq matrix from the simulated mixture.
3. **Deconvolve** using the main `dicepro()` function with automated
   hyper-parameter optimization.
4. **Evaluate** the recovered proportions against the ground truth.

---

# Strategy 1 -- Fully Synthetic Simulation

`simulation()` generates a self-consistent triplet from scratch: a synthetic
reference matrix `$W`, ground-truth proportion matrix `$p`, and noisy bulk
matrix `$B`, guaranteeing column-name alignment across all three outputs.

## Data Generation

```{r simulate-synthetic}
library(dicepro)
set.seed(2101L)

sim_data <- simulation(
  loi        = "gauss",
  scenario   = "hierarchical",
  nSample    = 30,
  nGenes     = 200,
  nCellsType = 10,
  sigma_bio  = 0.07,
  sigma_tech = 0.07,
  seed       = 2101L
)

cat("Reference  :", dim(sim_data$W), "\n")
cat("Proportions:", dim(sim_data$p), "\n")
cat("Bulk       :", dim(sim_data$B), "\n")
cat("Row sums   :", round(range(rowSums(sim_data$p)), 4), "\n")
```

## Noise Model Sanity Check

```{r sanity-check-synthetic}
bulk_clean <- as.matrix(sim_data$W) %*% t(as.matrix(sim_data$p))

plot(
  bulk_clean[seq_len(500)],
  as.matrix(sim_data$B)[seq_len(500)],
  xlab = "Clean bulk (first 500 entries)",
  ylab = "Noisy bulk",
  pch  = 19, cex = 0.4, col = "#2c7bb6",
  main = "Noise model: clean vs noisy bulk"
)
abline(0, 1, col = "firebrick", lwd = 1.5)
```

## Deconvolution

```{r run-dicepro-synthetic}
out <- dicepro(
  reference             = as.matrix(sim_data$W)[, -c(1,5,10)],
  bulk                  = as.matrix(sim_data$B),
  methodDeconv          = "FARDEEP",
  W_prime               = 0,
  bulkName              = "SimBulk",
  refName               = "SimRef",
  hp_max_evals          = 500,
  algo_select           = "random",
  output_path           = tempdir(),
  hspaceTechniqueChoose = "gamma_dominant",
  normalize             = FALSE
)
```

## Results

### Optimal Hyperparameters

```{r best-hp-synthetic}
cat("Best lambda :", out$hyperparameters$lambda, "\n")
cat("Best gamma  :", out$hyperparameters$gamma,  "\n")
cat("Loss        :", out$metrics$loss,           "\n")
cat("Constraint  :", out$metrics$constraint,     "\n")
```

### Hyperparameter Optimisation Report

```{r plot-hyperopt-synthetic, fig.height=9}
out$plot_hyperopt
```

### Pareto Frontier

```{r plot-pareto-synthetic}
out$plot
```

### Recovered vs True Proportions

```{r compare-props-synthetic}
true_prop <- as.matrix(sim_data$p)
pred_prop <- as.matrix(out$H)

common_ct   <- intersect(colnames(pred_prop), colnames(true_prop))
true_common <- true_prop[, common_ct, drop = FALSE]
pred_common <- pred_prop[, common_ct, drop = FALSE]

r_overall <- cor(as.vector(true_common), as.vector(pred_common))
cat(sprintf("Overall Pearson r: %.3f\n", r_overall))

plot(
  as.vector(true_common),
  as.vector(pred_common),
  xlab = "True proportions",
  ylab = "Predicted proportions",
  pch  = 19, cex = 0.5, col = "#2c7bb6aa",
  main = sprintf("True vs Predicted  (r = %.3f)", r_overall)
)
abline(0, 1, col = "firebrick", lwd = 1.5)
```

### Per-Cell-Type Correlation

```{r per-ct-cor-synthetic, fig.height=4}
ct_cors <- vapply(common_ct, function(ct) {
  cor(true_common[, ct], pred_common[, ct])
}, numeric(1L))

par(mar = c(5, 10, 3, 1))
barplot(
  sort(ct_cors),
  horiz = TRUE, las = 1,
  col   = ifelse(sort(ct_cors) > 0.7, "#2c7bb6", "#d7191c"),
  xlab  = "Pearson r",
  main  = "Per-cell-type correlation (synthetic)",
  xlim  = c(-0.2, 1)
)
abline(v = 0.7, lty = 2, col = "gray40")
```

### Quantitative Performance Metrics

```{r perf-metrics-synthetic}
perf <- makeTable1Tool(pred_mat = pred_common, obs_mat = true_common)
knitr::kable(perf$Perf, digits = 3,
             caption = "Performance metrics -- fully synthetic data")
```

---

# Strategy 2 -- BlueCode-Based Simulation

`simulation_bluecode()` uses the **BlueCode** reference matrix bundled with
the package -- a curated signature matrix covering 34 human cell types across
five tissue compartments -- as the fixed reference `$W`. Proportions are
simulated via a two-level Dirichlet hierarchy that reflects the biological
organisation of human tissues.

This strategy is recommended when benchmarking should be grounded in a
biologically realistic reference, while keeping full control over the
ground-truth proportions.

## Compartment Structure

The hierarchy encodes five major tissue compartments with their constituent
cell types:

| Compartment | Cell types (n) | Default α |
|---|---|---|
| Immune | B naive/memory, T CD4/CD8, NK, Monocyte, Macrophage, Neutrophil | 4.0 |
| Stromal | Fibroblasts (×5), MSC, Chondrocyte, Osteoblast | 2.5 |
| Endothelial | Large vessel, microvascular (×2) | 1.8 |
| Epithelial | Mammary, renal, respiratory, keratinocyte, melanocyte | 1.8 |
| Muscle | Smooth muscle (×7), cardiac myocyte, myometrial | 1.5 |

## Data Generation

```{r simulate-bluecode}
library(dicepro)

sim_bc <- simulation_bluecode(
  nSample    = 30,
  sigma_bio  = 0.15,
  sigma_tech = 0.02,
  seed       = 2101L
)

cat("Reference  :", dim(sim_bc$W), "\n")  # nGenes x 34
cat("Proportions:", dim(sim_bc$p), "\n")  # 30 x 34
cat("Bulk       :", dim(sim_bc$B), "\n")  # nGenes x 30
cat("Row sums   :", round(range(rowSums(sim_bc$p)), 4), "\n")

# Real cell-type names from BlueCode
head(colnames(sim_bc$p))
```

## Proportion Distribution by Compartment

The hierarchical model induces realistic between-compartment variation.
Visualising the compartment-level totals confirms the expected dominance of
the Immune compartment.

```{r compartment-overview}
# Cell-type to compartment mapping (mirrors .bluecode_cell_groups)
compartment_map <- c(
  rep("Immune",      9),
  rep("Stromal",     8),
  rep("Endothelial", 3),
  rep("Epithelial",  5),
  rep("Muscle",      9)
)
names(compartment_map) <- colnames(sim_bc$p)

# Aggregate proportions by compartment for each sample
comp_props <- t(apply(sim_bc$p, 1, function(row) {
  tapply(row, compartment_map[names(row)], sum)
}))

boxplot(
  comp_props,
  col  = c("#4393c3", "#74c476", "#fd8d3c", "#9ecae1", "#fb6a4a"),
  ylab = "Compartment proportion",
  main = "Simulated compartment proportions (BlueCode)",
  las  = 2
)
```

## Noise Model Sanity Check

```{r sanity-check-bluecode}
bulk_clean_bc <- as.matrix(sim_bc$W) %*% t(as.matrix(sim_bc$p))

plot(
  bulk_clean_bc[seq_len(500)],
  as.matrix(sim_bc$B)[seq_len(500)],
  xlab = "Clean bulk (first 500 entries)",
  ylab = "Noisy bulk",
  pch  = 19, cex = 0.4, col = "#74c476",
  main = "Noise model: clean vs noisy bulk (BlueCode)"
)
abline(0, 1, col = "firebrick", lwd = 1.5)
```

## Deconvolution

```{r run-dicepro-bluecode}
out_bc <- dicepro(
  reference             = as.matrix(sim_bc$W),
  bulk                  = as.matrix(sim_bc$B),
  methodDeconv          = "FARDEEP",
  W_prime               = 0,
  bulkName              = "BlueCodeBulk",
  refName               = "BlueCode",
  hp_max_evals          = 100,
  algo_select           = "random",
  output_path           = tempdir(),
  hspaceTechniqueChoose = "gamma_dominant",
  normalize             = FALSE
)
```

## Results

### Optimal Hyperparameters

```{r best-hp-bluecode}
cat("Best lambda :", out_bc$hyperparameters$lambda, "\n")
cat("Best gamma  :", out_bc$hyperparameters$gamma,  "\n")
cat("Loss        :", out_bc$metrics$loss,            "\n")
cat("Constraint  :", out_bc$metrics$constraint,      "\n")
```

### Hyperparameter Optimisation Report

```{r plot-hyperopt-bluecode, fig.height=9}
out_bc$plot_hyperopt
```

### Pareto Frontier

```{r plot-pareto-bluecode}
out_bc$plot
```

### Recovered vs True Proportions

```{r compare-props-bluecode}
true_prop_bc <- as.matrix(sim_bc$p)
pred_prop_bc <- as.matrix(out_bc$H)

common_ct_bc   <- intersect(colnames(pred_prop_bc), colnames(true_prop_bc))
true_common_bc <- true_prop_bc[, common_ct_bc, drop = FALSE]
pred_common_bc <- pred_prop_bc[, common_ct_bc, drop = FALSE]

r_overall_bc <- cor(as.vector(true_common_bc), as.vector(pred_common_bc))
cat(sprintf("Overall Pearson r: %.3f\n", r_overall_bc))

plot(
  as.vector(true_common_bc),
  as.vector(pred_common_bc),
  xlab = "True proportions",
  ylab = "Predicted proportions",
  pch  = 19, cex = 0.5, col = "#74c47699",
  main = sprintf("True vs Predicted -- BlueCode  (r = %.3f)", r_overall_bc)
)
abline(0, 1, col = "firebrick", lwd = 1.5)
```

### Per-Cell-Type Correlation

```{r per-ct-cor-bluecode, fig.height=6}
ct_cors_bc <- vapply(common_ct_bc, function(ct) {
  cor(true_common_bc[, ct], pred_common_bc[, ct])
}, numeric(1L))

par(mar = c(5, 14, 3, 1))
barplot(
  sort(ct_cors_bc),
  horiz = TRUE, las = 1,
  col   = ifelse(sort(ct_cors_bc) > 0.7, "#2c7bb6", "#d7191c"),
  xlab  = "Pearson r",
  main  = "Per-cell-type correlation (BlueCode)",
  xlim  = c(-0.2, 1)
)
abline(v = 0.7, lty = 2, col = "gray40")
```

### Quantitative Performance Metrics

```{r perf-metrics-bluecode}
perf_bc <- makeTable1Tool(pred_mat = pred_common_bc, obs_mat = true_common_bc)
knitr::kable(perf_bc$Perf, digits = 3,
             caption = "Performance metrics -- BlueCode simulation")
```

---

# Comparing Both Strategies

When both strategies are run under the same conditions, their overall
correlation scores can be compared directly.

```{r compare-strategies}
cat(sprintf(
  "Overall Pearson r\n  Synthetic  : %.3f\n  BlueCode   : %.3f\n",
  r_overall, r_overall_bc
))
```

These two strategies are complementary: `simulation()` offers maximum 
flexibility for testing deconvolution under stress conditions, with varying 
numbers of genes, cell types, or noise levels; `simulation_bluecode()` anchors 
the benchmark to a biologically grounded reference, making the results easier 
to interpret in the context of real-world data from human tissues.

---

# Session Info

```{r session-info}
sessionInfo()
```