--- title: "Solving an infinite-horizon semi-MDP" author: "Lars Relund " date: "`r Sys.Date()`" bibliography: litt.bib output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{infinite-mdp} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r setup, include=FALSE} library(knitr) options(rmarkdown.html_vignette.check_title = FALSE, # formatR.arrow = TRUE, # scipen=999, # digits=5, width=90) #thm <- knit_theme$get("edit-kwrite") # whitengrey, bright, print, edit-flashdevelop, edit-kwrite #knit_theme$set(thm) knit_hooks$set( par = function(before, options, envir) { if (before && options$fig.show != 'none') par(mar = c(0, 0, 0, 0), # bottom, left, top, and right oma = c(0, 0, 0, 0))} ) knitr::opts_chunk$set( # collapse = TRUE, comment = "#>", fig.align = 'center', fig.width = 9, fig.height = 5, fig.show = 'hold', out.extra = 'style="max-width:100%;"', # tidy = TRUE, # prompt=T, # comment=NA, cache = F # background = "red" ) library(magrittr) library(dplyr) ``` The `MDP2` package in R is a package for solving Markov decision processes (MDPs) with discrete time-steps, states and actions. Both traditional MDPs [@Puterman94], semi-Markov decision processes (semi-MDPs) [@Tijms03] and hierarchical-MDPs (HMDPs) [@Kristensen00] can be solved under a finite and infinite time-horizon. The package implement well-known algorithms such as policy iteration and value iteration under different criteria e.g. average reward per time unit and expected total discounted reward. The model is stored using an underlying data structure based on the *state-expanded directed hypergraph* of the MDP (@Relund06) implemented in `C++` for fast running times. Building and solving an MDP is done in two steps. First, the MDP is built and saved in a set of binary files. Next, you load the MDP into memory from the binary files and apply various algorithms to the model. For building the MDP models see `vignette("building")`. In this vignette we focus on the second step, i.e. finding the optimal policy. Here we consider an infinite semi-MDP. ```{r} library(MDP2) ``` ## An infinite-horizon semi-MDP An *infinite-horizon semi-MDP* considers a sequential decision problem over an infinite number of *stages*. Let $I$ denote the finite set of system states at stage $n$. Note we assume that the semi-MDP is *homogeneous*, i.e the state space is independent of stage number. When *state* $i \in I$ is observed, an *action* $a$ from the finite set of allowable actions $A(i)$ must be chosen which generates *reward* $r(i,a)$. Moreover, let $\tau(i,a)$ denote the *stage length* of action $a$, i.e. the expected time until the next decision epoch (stage $n+1$) given action $a$ and state $i$. Finally, let $p_{ij}(a)$ denote the *transition probability* of obtaining state $j\in I$ at stage $n+1$ given that action $a$ is chosen in state $i$ at stage $n$. A policy is a decision rule/function that assigns to each state in the process an action. ## Example Let us consider example 6.1.1 in @Tijms03. At the beginning of each day a piece of equipment is inspected to reveal its actual working condition. The equipment will be found in one of the working conditions $i = 1,\ldots, N$ where the working condition $i$ is better than the working condition $i+1$. The equipment deteriorates in time. If the present working condition is $i$ and no repair is done, then at the beginning of the next day the equipment has working condition $j$ with probability $q_{ij}$. It is assumed that $q_{ij}=0$ for $j0$). Given the model in memory, we now can find the optimal policy under various policies. Let us first try to optimize the average reward per time unit. ```{r solve1_ave, par=TRUE} runPolicyIteAve(mdp,"Net reward","Duration") getPolicy(mdp) plot(mdp, actionsVisible = "policy") ``` Note it is optimal to do a preventive repair in state $i=4$. Let us try to optimize the expected total discounted reward with a discount factor of 0.5 using policy iteration: ```{r, par=TRUE} runPolicyIteDiscount(mdp,"Net reward","Duration", discountFactor = 0.5) getPolicy(mdp) plot(mdp, actionsVisible = "policy") ``` Note given a discount factor of 0.5, it is optimal to not do a preventive repair in state $i=4$. The same results can be found using value iteration: ```{r} runValueIte(mdp,"Net reward","Duration", discountFactor = 0.5, eps = 1e-10, maxIte = 1000) getPolicy(mdp) ``` ## References