---
title: "Solving an infinite-horizon semi-MDP"
author: "Lars Relund <lars@relund.dk>"
date: "`r Sys.Date()`"
bibliography: litt.bib
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{infinite-mdp}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

<style> 
p {text-align: justify;} 
//.sourceCode {background-color: white;}
pre {
  // border-style: solid;
  // border-width: 1px;
  // border-color: grey;
  //background-color: grey !important;
}
img {
   //width: 100%;
   border: 0;
}
</style>

<!-- scale math down -->
<script type="text/x-mathjax-config"> 
    MathJax.Hub.Config({ 
        "HTML-CSS": { scale: 80 }
        });
</script>

```{r setup, include=FALSE}
library(knitr)
options(rmarkdown.html_vignette.check_title = FALSE,
        # formatR.arrow = TRUE, 
        # scipen=999, 
        # digits=5,
        width=90) 
#thm <- knit_theme$get("edit-kwrite")   # whitengrey, bright, print, edit-flashdevelop, edit-kwrite
#knit_theme$set(thm)
knit_hooks$set(
   par = function(before, options, envir) {
      if (before && options$fig.show != 'none')
         par(mar = c(0, 0, 0, 0), # bottom, left, top, and right
             oma = c(0, 0, 0, 0))}
)
knitr::opts_chunk$set(
   # collapse = TRUE,
   comment = "#>",
   fig.align = 'center',
   fig.width = 9,
   fig.height = 5,
   fig.show = 'hold',
   out.extra = 'style="max-width:100%;"',
   # tidy = TRUE,
   # prompt=T,
   # comment=NA,
   cache = F
   # background = "red"
)
library(magrittr)
library(dplyr)
```


The `MDP2` package in R is a package for solving Markov decision processes (MDPs) with discrete 
time-steps, states and actions. Both traditional MDPs [@Puterman94], semi-Markov decision processes
(semi-MDPs) [@Tijms03] and hierarchical-MDPs (HMDPs) [@Kristensen00] can be solved under a finite
and infinite time-horizon.

The package implement well-known algorithms such as policy iteration and value iteration
under different criteria e.g. average reward per time unit and expected total discounted reward. The
model is stored using an underlying data structure based on the *state-expanded directed hypergraph*
of the MDP (@Relund06) implemented in `C++` for fast running times. <!-- Under development is also
support for MLHMP which is a Java implementation of algorithms for solving MDPs (@Kristensen03). 
-->

Building and solving an MDP is done in two steps. First, the MDP is built and saved in a set 
of binary files. Next, you load the MDP into memory from the binary files and apply various
algorithms to the model.

For building the MDP models see `vignette("building")`. In this vignette we focus on the second step, i.e. finding the optimal policy. Here we consider an infinite semi-MDP. 

```{r}
library(MDP2)
```



## An infinite-horizon semi-MDP 

An *infinite-horizon semi-MDP* considers a sequential decision problem over an infinite number of 
*stages*. Let $I$ denote the finite set of system states at stage $n$. Note we assume that the 
semi-MDP is *homogeneous*, i.e the state space is independent of stage number. When *state* $i \in 
I$ is observed, an *action* $a$ from the finite set of allowable actions $A(i)$ must be chosen which
generates *reward* $r(i,a)$. Moreover, let $\tau(i,a)$ denote the *stage length* of action $a$, i.e.
the expected time until the next decision epoch (stage $n+1$) given action $a$ and state $i$.
Finally, let $p_{ij}(a)$ denote the *transition probability* of obtaining state $j\in I$ at stage
$n+1$ given that action $a$ is chosen in state $i$ at stage $n$. A policy is a decision 
rule/function that assigns to each state in the process an action.

## Example

Let us consider example 6.1.1 in @Tijms03. At the beginning of each day a piece of equipment is
inspected to reveal its actual working condition. The equipment will be found in one of the working
conditions $i = 1,\ldots, N$ where the working condition $i$ is better than the working condition 
$i+1$. The equipment deteriorates in time. If the present working condition is $i$ and no repair is
done, then at the beginning of the next day the equipment has working condition $j$ with probability
$q_{ij}$. It is assumed that $q_{ij}=0$ for $j<i$ and $\sum_{j\geq i}q_{ij}=1$. The working
condition $i=N$ represents a malfunction that requires an enforced repair taking two days. For the
intermediate states $i$ with $1<i<N$ there is a choice between preventively repairing the equipment
and letting the equipment operate for the present day. A preventive repair takes only one day. A
repaired system has the working condition $i=1$. The cost of an enforced repair upon failure is
$C_{f}$ and the cost of a preemptive repair in working condition $i$ is $C_{p}(i)$. We wish to
determine a maintenance rule which minimizes the long-run average repair cost per day.

To formulate this problem as an infinite horizon semi-MDP the set of possible states of the system
is chosen as
$$
I=\{1,2,\ldots,N\}.
$$
State $i$ corresponds to the situation in which an inspection reveals working condition $i$. Define actions
$$
a=\left\{\begin{array}{ll}
nr & \text{if no repair.}\\
pr & \text{if preventive repair.}\\
fr & \text{if forced repair.}\\
\end{array}\right.
$$
The set of possible actions in state $i$ is chosen as $A(1)=\{nr\},\ A(i)=\{nr,pr\}$ for $1<i<N, 
A(N)=\{fr\}$. The one-step transition probabilities $p_{ij}(a)$ are given by $p_{ij}(0) = q_{ij}$ for
$1\leq i<N$, $p_{i1}(1) = 1$ for $1<i<N$, $p_{N1}(2)=1$  and zero otherwise. The one-step costs
$c_{i}(a)$ are given by $c_{i}(0)=0,\ c_{i}(1)=C_{p}(i)$ and $c_{N}(2)=C_{f}$. The stage length
until next decision epoch are $\tau(i,a) = 1, 0\leq i < N$ and $\tau(N,a) = 2$.


Assume that the number of possible working conditions equals $N=5$. The repair costs are given by $C_{f}=10,\ C_{p}(2)=7,\ C_{p}(3)=7$ and $C_{p}(4)=5$. The deterioration probabilities $q_{ij}$ are given by

```{r parameters,  include=FALSE}
N<-5; Cf<- -10; Cp<-c(0,-7,-7,-5) # use negative numbers since the MDP optimize based on rewards
Q <- matrix(c(
   0.90, 0.10, 0, 0, 0,
   0, 0.80, 0.10, 0.05, 0.05,
   0, 0, 0.70, 0.10, 0.20,
   0, 0, 0, 0.50, 0.50), nrow=4, byrow=T) 
```

```{r Qtable, results='asis', echo=FALSE}
rownames(Q)<-1:4
colnames(Q)<-1:5
knitr::kable(Q, row.names = T)
```

For building and saving the model see the `vignette("building")`. We load the model using 

```{r}
prefix <- paste0(system.file("models", package = "MDP2"), "/hct611-1_")
mdp <- loadMDP(prefix)
```

The variable `mdp` is a list with a pointer to the MDP object stored in memory. 

```{r}
mdp
```

For instance the total number of actions is `r mdp$actions` and the model use two weights applied 
to each action "Duration" and "Net reward". Information about the MDP can be retrieved using 
`getInfo()`:

```{r}
getInfo(mdp, withList = F, dfLevel = "action", asStringsActions = TRUE)  
```

Here the tibble has a row for each state and action. For instance the weight "Duration" equals 1 day 
except in state $i=5$ where a forced repair takes 2 days (row 13). States with no actions are also 
given.  

The state-expanded hypergraph representing the semi-MDP with infinite time-horizon can be plotted 
using

```{r plotHgf, par=TRUE}
plot(mdp, actionColor = "label", stateLabel = "sId|label")
```

Each node corresponds to a specific state in the MDP and is a *unique id* (`sId`) such that you can
identify all the states (**id always start from zero**). These ids are not equal to the ids used
when you built the model, since the order of the nodes in the hypergraph data structure is
optimized! A directed hyperarc is defined for each possible action. For
instance, the state/node with `sId = 6` corresponds to working condition $i=2$ and the two hyperarcs
with head in this node corresponds to the two actions preventive and no repair. Note the tails of a
hyperarc represent a possible transition ($p_{ij}(a)>0$).

Given the model in memory, we now can find the optimal policy under various policies. Let us first
try to optimize the average reward per time unit.

```{r solve1_ave, par=TRUE}
runPolicyIteAve(mdp,"Net reward","Duration")
getPolicy(mdp)
plot(mdp, actionsVisible = "policy")
```

Note it is optimal to do a preventive repair in state $i=4$. Let us try to optimize the expected 
total discounted reward with a discount factor of 0.5 using policy iteration:

```{r, par=TRUE}
runPolicyIteDiscount(mdp,"Net reward","Duration", discountFactor = 0.5)
getPolicy(mdp)
plot(mdp, actionsVisible = "policy")
```

Note given a discount factor of 0.5, it is optimal to not do a preventive repair in state $i=4$. 
The same results can be found using value iteration:

```{r}
runValueIte(mdp,"Net reward","Duration", discountFactor = 0.5, eps = 1e-10, maxIte = 1000)
getPolicy(mdp)
```



## References
