---
title: "Building augmented data for multi-state models"
subtitle: "The **msmtools** workflow"
author: |
  | Francesco Grossetti
  | francesco.grossetti@unibocconi.it
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    number_sections: yes
    toc: yes
    toc_depth: 3
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Building augmented data for multi-state models}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown_notangle}
---

```{r setup, message = FALSE}
library(msmtools)
library(data.table)
library(msm)
```

# Overview

**msmtools** prepares longitudinal data for multi-state models fitted with
**msm** [@jackson2011multi; @msm_cran]. The package exposes four public
functions:

* `augment()` builds transition-level data from repeated observations;
* `polish()` removes subjects with incompatible transitions at the same time;
* `survplot()` compares fitted and empirical survival curves;
* `prevplot()` compares observed and expected state prevalences.

The examples below use the bundled `hosp` dataset. It contains synthetic
hospital admissions for 10 subjects.

```{r data-preview}
data(hosp)
hosp[1:6, .(subj, adm_number, gender, age, label_3, dateIN, dateOUT, dateCENS)]
```

# Data Augmentation

`augment()` adds one row per transition endpoint and creates status variables
that can be used directly in an **msm** model.

```{r augment}
hosp_augmented <- augment(
  data = copy(hosp),
  data_key = subj,
  n_events = adm_number,
  pattern = label_3,
  t_start = dateIN,
  t_end = dateOUT,
  t_cens = dateCENS
)

hosp_augmented[
  1:8,
  .(subj, adm_number, label_3, augmented, augmented_int, status, status_num)
]
```
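
The reshaping idea can be illustrated on a toy table. The sketch below is not the internal implementation of `augment()`. It only assumes that each admission contributes one row at its start time and one row at its end time; all `toy_*` objects and their column names are hypothetical.

```{r sketch-reshape}
library(data.table)

# Hypothetical toy input: one row per admission, times in days
toy_adm <- data.table(
  subj  = c(1L, 1L, 2L),
  t_in  = c(0L, 40L, 4L),
  t_out = c(3L, 42L, 19L)
)

# Stack start and end times so that each admission yields two rows
toy_long <- melt(
  toy_adm,
  id.vars = "subj",
  measure.vars = c("t_in", "t_out"),
  variable.name = "endpoint",
  value.name = "time"
)

# Label each row with the state entered at that time, then sort within subject
toy_long[, status := fifelse(endpoint == "t_in", "IN", "OUT")]
setorder(toy_long, subj, time)
toy_long
```

Three admissions become six rows, alternating between the two transient states within each subject.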

When the input time columns are `Date` values, `augment()` keeps the date-valued
transition time and adds an integer version. This is useful because **msm** works
with numeric time scales.

```{r augmented-columns}
names(hosp_augmented)
```
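
Base R stores a `Date` as the number of days since 1970-01-01, so an integer time scale can be derived with `as.integer()`. The sketch below assumes this is the kind of conversion that relates a date-valued time column to its integer counterpart; it does not call **msmtools**, and the `toy_*` names are hypothetical.

```{r sketch-date-int}
# Hypothetical dates converted to whole days since the R epoch
toy_dates <- as.Date(c("1970-01-01", "1970-01-10", "1971-01-01"))
toy_days <- as.integer(toy_dates)
toy_days

# Differences between integer times are elapsed days
toy_gaps <- diff(toy_days)
toy_gaps
```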

# Outcome Schema And Generated States

`pattern` and `state` describe different parts of the augmentation. `pattern`
is the terminal outcome schema observed in the input data. It can have two
values (alive, dead) or three values (alive, dead during a transition, dead
after a transition).

`state` is the generated transition-state vocabulary. It must always contain
three labels: the state at `t_start`, the state at `t_end`, and the absorbing
state. This is why a two-value `pattern` still needs three `state` labels:
`augment()` uses the event times to infer whether death maps to the absorbing
state inside or outside the transition window.
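
As a rough illustration of that inference, the sketch below starts from a two-value outcome and compares a hypothetical death time with the end of the last transition. The column names and the comparison rule are assumptions made for this example, not the logic used inside `augment()`.

```{r sketch-pattern}
library(data.table)

# Hypothetical last transition per subject, with a two-value outcome (dead)
toy_last <- data.table(
  subj    = 1:3,
  t_start = c(10L, 20L, 30L),
  t_end   = c(15L, 28L, 35L),
  t_cens  = c(40L, 25L, 50L),
  dead    = c(FALSE, TRUE, TRUE)
)

# Assumed rule: a death no later than t_end falls inside the transition window
toy_last[, outcome := fcase(
  !dead, "alive",
  dead & t_cens <= t_end, "dead during transition",
  dead & t_cens > t_end, "dead after transition"
)]
toy_last[, .(subj, dead, outcome)]
```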

By default, `augment()` uses `copy = FALSE` and follows **data.table**
by-reference semantics. This avoids unnecessary memory use on large longitudinal
datasets, but it means the input object itself can be modified: its key can
change, and an `n_events` column is created when that argument is omitted. Use
`copy = TRUE` when the original input must remain unchanged.
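
The effect of by-reference semantics can be seen with plain **data.table** calls. In the sketch below, `key_by_subject()` is a hypothetical helper that stands in for any function that keys its argument; it is not part of **msmtools**.

```{r sketch-by-reference}
library(data.table)

toy_input <- data.table(subj = c(2L, 1L), visit = c("b", "a"))

# Hypothetical helper that sets a key on whatever object it receives
key_by_subject <- function(d) {
  setkeyv(d, "subj")
  invisible(d)
}

# Passing a copy leaves the original unkeyed and in its original order
key_by_subject(copy(toy_input))
toy_key_before <- key(toy_input)
toy_order_before <- toy_input$subj

# Passing the object itself keys and reorders it in place
key_by_subject(toy_input)
toy_key_after <- key(toy_input)
toy_order_after <- toy_input$subj

list(before = toy_key_before, after = toy_key_after)
```

Wrapping the input in `copy()` at the call site has the same protective effect as asking the function to copy internally, at the cost of one extra allocation.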

# Duplicate Transition Cleanup

`polish()` removes entire subjects when different transitions occur at the same
time. The bundled data do not contain such conflicts, so the call below returns
every row. `polish()` also uses `copy = FALSE` by default; set `copy = TRUE`
when the original augmented data must not be keyed or otherwise modified by
reference.

```{r polish}
hosp_clean <- polish(
  data = copy(hosp_augmented),
  data_key = subj,
  pattern = label_3
)

nrow(hosp_augmented)
nrow(hosp_clean)
```
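
To show what a conflict looks like, the sketch below builds a toy augmented table in which one subject has two different states recorded at the same time, then drops that subject entirely. It is a simplified stand-in written with plain **data.table** calls, under the assumption that a conflict means more than one distinct state per subject and time; the `toy_*` names are hypothetical.

```{r sketch-conflicts}
library(data.table)

# Hypothetical augmented data: subject 2 is both OUT and DEAD at time 7
toy_aug <- data.table(
  subj   = c(1L, 1L, 2L, 2L, 2L),
  time   = c(0L, 5L, 0L, 7L, 7L),
  status = c("IN", "OUT", "IN", "OUT", "DEAD")
)

# Count distinct states per subject and time; more than one signals a conflict
toy_conflicts <- toy_aug[
  , .(n_states = uniqueN(status)), by = .(subj, time)
][n_states > 1L]

# Drop every row belonging to an affected subject
toy_clean <- toy_aug[!subj %in% toy_conflicts$subj]
toy_clean
```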

# Survival Plot

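The plotting helpers work on fitted **msm** objects. This example uses a compact
three-state transition intensity matrix whose row and column names match the
default `augment()` state labels. Because the model is fitted with
`gen.inits = TRUE`, the non-zero entries only mark which transitions are
allowed.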

```{r fit-model}
Qmat <- matrix(0, nrow = 3, ncol = 3, byrow = TRUE)
Qmat[1, 1:3] <- 1
Qmat[2, 1:3] <- 1
colnames(Qmat) <- c("IN", "OUT", "DEAD")
rownames(Qmat) <- c("IN", "OUT", "DEAD")

msm_model <- msm(
  status_num ~ augmented_int,
  subject = subj,
  data = hosp_augmented,
  exacttimes = TRUE,
  gen.inits = TRUE,
  qmatrix = Qmat,
  method = "BFGS",
  control = list(fnscale = 6e+05, trace = 0, REPORT = 1, maxit = 10000)
)
```
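
It can help to list the transitions that such a matrix permits. The sketch below rebuilds the same pattern under the hypothetical name `toy_Q` and reads off the non-zero off-diagonal entries. Diagonal entries are skipped because **msm** ignores them in a `qmatrix`.

```{r sketch-allowed-transitions}
toy_states <- c("IN", "OUT", "DEAD")
toy_Q <- matrix(0, nrow = 3, ncol = 3, dimnames = list(toy_states, toy_states))
toy_Q[1, ] <- 1
toy_Q[2, ] <- 1

# Positions of the non-zero off-diagonal entries
toy_allowed <- which(toy_Q != 0 & row(toy_Q) != col(toy_Q), arr.ind = TRUE)

# Translate positions into "from -> to" labels
toy_transitions <- paste(
  rownames(toy_Q)[toy_allowed[, "row"]],
  "->",
  colnames(toy_Q)[toy_allowed[, "col"]]
)
sort(toy_transitions)
```

The third row contains only zeros, which is what makes `DEAD` an absorbing state.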

```{r survival-plot, fig.width = 7, fig.height = 4}
surv_p <- survplot(msm_model, km = TRUE, grid = 10)
surv_p
```

The fitted and Kaplan-Meier data tables are attached to the plot as
named fields, accessible with the standard `$` operator:

```{r survival-data}
surv_p$fitted[1:6]
surv_p$km[1:6]
```
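
For readers who want a reminder of what the empirical curve represents, the sketch below computes a Kaplan-Meier product-limit estimate by hand on hypothetical data. It assumes the standard estimator and is independent of how `survplot()` obtains its own values.

```{r sketch-km}
# Hypothetical follow-up times and event indicators (1 = event, 0 = censored)
toy_time  <- c(2, 3, 3, 5, 8)
toy_event <- c(1, 1, 0, 1, 0)

# Distinct times at which at least one event occurs
toy_event_times <- sort(unique(toy_time[toy_event == 1]))

# Number still at risk and number of events at each of those times
toy_at_risk <- vapply(toy_event_times, function(u) sum(toy_time >= u), numeric(1))
toy_deaths <- vapply(
  toy_event_times,
  function(u) sum(toy_time == u & toy_event == 1),
  numeric(1)
)

# Product-limit estimate of the survival probability
toy_km <- cumprod(1 - toy_deaths / toy_at_risk)
data.frame(time = toy_event_times, at_risk = toy_at_risk, surv = toy_km)
```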

# Prevalence Plot

`prevplot()` uses the output of `msm::prevalence.msm()` and returns a `ggplot`
object.

```{r prevalence-plot, fig.width = 7, fig.height = 4}
prev <- prevalence.msm(
  msm_model,
  covariates = "mean",
  ci = "normal",
  times = seq(
    min(hosp_augmented$augmented_int),
    max(hosp_augmented$augmented_int),
    length.out = 6
  )
)

prev_p <- prevplot(msm_model, prev, ci = TRUE, M = FALSE)
prev_p
```

The long-format prevalence data used to build the plot is attached as
`$prevalence`:

```{r prevalence-data}
prev_p$prevalence[1:6]
```
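
An observed prevalence is the share of subjects occupying each state at a given time. The sketch below computes it on hypothetical data under one common convention, in which a subject stays in the last observed state until the next observation. The `toy_*` names and the helper `toy_prevalence_at()` are illustrative and not part of **msm** or **msmtools**.

```{r sketch-observed-prevalence}
library(data.table)

toy_levels <- c("IN", "OUT", "DEAD")
toy_obs <- data.table(
  subj  = c(1L, 1L, 2L, 2L, 3L),
  time  = c(0, 5, 0, 3, 0),
  state = c("IN", "OUT", "IN", "DEAD", "IN")
)

# Hypothetical helper: share of subjects in each state at time `at`
toy_prevalence_at <- function(at) {
  # Keep, for every subject, the most recent observation not later than `at`
  latest <- toy_obs[time <= at, .SD[which.max(time)], by = subj]
  counts <- table(factor(latest$state, levels = toy_levels))
  counts / sum(counts)
}

toy_prev_4 <- toy_prevalence_at(4)
toy_prev_6 <- toy_prevalence_at(6)
toy_prev_4
```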

# Notes

The current 2.x series keeps the public API stable while modernizing
dependencies, documentation, tests, and CI. Larger internal changes to
`augment()` are intentionally deferred until after the maintenance releases.