Decision Making in Robots and Autonomous Agents

The Markov Decision Process (MDP) Model

- Subramanian Ramamoorthy
- School of Informatics
- 25 January, 2013

In the MAB Model

- We were in a single casino and the only decision was which of a set of n arms to pull - except perhaps in the very last slides, exactly one state!
- We asked the following:
  - What if there is more than one state?
  - So, in this state space, what is the effect of the payout distribution changing based on how you pull arms?
  - What happens if you only obtain a net reward corresponding to a long sequence of arm pulls (at the end)?

Decision Making: The Agent-Environment Interface

Markov Decision Processes

- A model of the agent-environment system
- Markov property: history doesn't matter, only the current state
- If the state and action sets are finite, it is a finite MDP.
- To define a finite MDP, you need to give:
  - state and action sets
  - one-step dynamics defined by transition probabilities
  - reward probabilities
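
Concretely, the one-step dynamics amount to two quantities. A rendering in the notation of Sutton and Barto (which this lecture appears to follow; the exact symbols are an assumption, since the slide's own equations did not survive extraction):

```latex
% Transition probabilities: chance of landing in s' after taking a in s
P^{a}_{ss'} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\; a_t = a \,\}
% Expected reward on that transition
R^{a}_{ss'} = E\{\, r_{t+1} \mid s_t = s,\; a_t = a,\; s_{t+1} = s' \,\}
```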

An Example Finite MDP

Recycling Robot

- At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
- Searching is better, but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
- Decisions are made on the basis of the current energy level: high, low.
- Reward = number of cans collected

Recycling Robot MDP

Enumerated in Tabular Form

If you were given this much, what can you say about the behaviour (over time) of the system?
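
The table itself did not survive extraction. As a minimal sketch of what such an enumeration looks like, here is the recycling robot written out in Python; the parameter names and numeric values (alpha, beta, r_search, r_wait, the -3 rescue penalty) follow the common Sutton and Barto presentation and are assumptions, not values from these slides:

```python
# Hypothetical enumeration of the recycling robot MDP.
# alpha: P(battery stays high after searching from 'high')
# beta:  P(battery stays low after searching from 'low')
alpha, beta = 0.9, 0.6           # assumed values, for illustration
r_search, r_wait = 2.0, 1.0      # assumed rewards, r_search > r_wait

# (state, action) -> list of (next_state, probability, reward)
mdp = {
    ("high", "search"):  [("high", alpha, r_search),
                          ("low", 1 - alpha, r_search)],
    ("high", "wait"):    [("high", 1.0, r_wait)],
    ("low", "search"):   [("low", beta, r_search),
                          ("high", 1 - beta, -3.0)],  # ran flat: rescued (bad)
    ("low", "wait"):     [("low", 1.0, r_wait)],
    ("low", "recharge"): [("high", 1.0, 0.0)],
}

# Sanity check: outgoing probabilities sum to one for every (state, action)
for sa, transitions in mdp.items():
    assert abs(sum(p for _, p, _ in transitions) - 1.0) < 1e-9, sa
```

Given this much, the system's behaviour over time becomes a Markov chain once a policy fixes the action taken in each state.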

A Very Brief Primer on Markov Chains and Decisions

- A model, as originally developed in Operations Research / stochastic control theory

Stochastic Processes

- A stochastic process is an indexed collection of random variables {X_t}
- e.g., the collection of weekly demands for a product
- One type: at a particular time t, labelled by integers, the system is found in exactly one of a finite number of mutually exclusive and exhaustive categories or states, also labelled by integers
- The process could be embedded, in that the time points correspond to the occurrence of specific events (or time may be equi-spaced)
- The random variables may depend on one another, e.g., X_{t+1} may depend on X_t

Markov Chains

- The stochastic process is said to have the Markovian property if the conditional probability of a future event, given any past events and the current state, is independent of the past states and depends only on the present:
  P{X_{t+1} = j | X_0 = k_0, ..., X_{t-1} = k_{t-1}, X_t = i} = P{X_{t+1} = j | X_t = i}
- These conditional probabilities are the transition probabilities
- They are stationary if time-invariant, written p_ij = P{X_{t+1} = j | X_t = i}, the same for all t

Markov Chains

- Looking forward in time, n-step transition probabilities, p_ij^(n)
- One can write a transition matrix, P = [p_ij]
- A stochastic process is a finite-state Markov chain if it has:
  - a finite number of states
  - the Markovian property
  - stationary transition probabilities
  - a set of initial probabilities P{X_0 = i} for all i

Markov Chains

- n-step transition probabilities can be obtained from 1-step transition probabilities recursively (Chapman-Kolmogorov): p_ij^(n) = sum_k p_ik^(m) p_kj^(n-m)
- We can get this via the matrix too: P^(n) = P^n
- First passage time: the number of transitions to go from i to j for the first time
- If i = j, this is the recurrence time
- In general, this is itself a random variable

Markov Chains

- n-step recursive relationship for first passage times: f_ij^(1) = p_ij, and f_ij^(n) = sum_{k != j} p_ik f_kj^(n-1)
- For fixed i and j, these f_ij^(n) are nonnegative numbers such that sum_{n>=1} f_ij^(n) <= 1
- If sum_{n>=1} f_jj^(n) = 1, that state is a recurrent state; it is absorbing if f_jj^(1) = p_jj = 1

Markov Chains: Long-Run Properties

- Consider the 8-step transition matrix of the inventory example
- Interesting property: the probability of being in state j after 8 weeks appears independent of the initial level of inventory
- For an irreducible ergodic Markov chain, one has the limiting probability pi_j = lim_{n->inf} p_ij^(n), independent of the starting state i
- The reciprocal gives you the recurrence time, m_jj = 1 / pi_j
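
The limiting probabilities solve pi P = pi together with sum_j pi_j = 1. A small sketch in Python (the matrix is the same invented example as above, not the inventory example from the slides):

```python
import numpy as np

def steady_state(P):
    """Solve pi @ P = pi, sum(pi) = 1, for an irreducible ergodic chain."""
    n = P.shape[0]
    # Stack (P^T - I) with the normalisation row and least-squares solve
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
pi = steady_state(P)
print(pi)          # limiting probabilities pi_j
print(1.0 / pi)    # recurrence times m_jj = 1 / pi_j
```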

Markov Decision Model

- Consider the following application: machine maintenance
- A factory has a machine that deteriorates rapidly in quality and output and is inspected periodically, e.g., daily
- Inspection declares the machine to be in one of four possible states:
  - 0: Good as new
  - 1: Operable, minor deterioration
  - 2: Operable, major deterioration
  - 3: Inoperable
- Let X_t denote this observed state
  - It evolves according to some law of motion, so it is a stochastic process
  - Furthermore, assume it is a finite-state Markov chain

Markov Decision Model

- The transition matrix is based on the following:
  - Once the machine goes inoperable, it stays there until repaired
  - If there are no repairs, it eventually reaches this state, which is absorbing!
- Repair is an action - a very simple maintenance policy
  - e.g., take the machine from state 3 back to state 0

Markov Decision Model

- There are costs as the system evolves:
  - State 0: cost 0
  - State 1: cost 1000
  - State 2: cost 3000
  - Replacement cost, taking state 3 to 0, is 4000 (plus lost production of 2000), so cost 6000
- The modified transition probabilities account for the repair action: the row for state 3 now returns the system to state 0

Markov Decision Model

- Simple question: what is the average cost of this maintenance policy?
- Compute the steady-state probabilities pi_j
- (Long-run) expected average cost per day: E[C] = sum_j pi_j C_j
- How?
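
Putting the pieces together: compute pi for the policy's transition matrix and take its dot product with the per-state costs. The 4-state matrix below is an assumed stand-in (the slides' actual matrix did not survive extraction); the costs are the ones stated earlier:

```python
import numpy as np

# Assumed transition matrix under the "replace when inoperable" policy;
# illustrative only - the lecture's own matrix is not shown in this text.
P = np.array([[0.0, 7/8, 1/16, 1/16],
              [0.0, 3/4, 1/8,  1/8 ],
              [0.0, 0.0, 1/2,  1/2 ],
              [1.0, 0.0, 0.0,  0.0 ]])
C = np.array([0.0, 1000.0, 3000.0, 6000.0])  # per-state costs from the slides

# Steady state: left eigenvector of P for eigenvalue 1, normalised to sum 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()

print(pi @ C)   # long-run expected average cost per day, sum_j pi_j C_j
```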

Markov Decision Model

- Consider a slightly more elaborate policy:
  - When inoperable or needing major repairs, replace
- The transition matrix now changes a little bit
- Permit one more thing: overhaul
  - Go back to the minor-deterioration state (1) for the next time step
  - Not possible if truly inoperable, but one can go from major to minor
- Key point about the system behaviour: it evolves according to
  - the laws of motion
  - the sequence of decisions made (actions: 1 = none, 2 = overhaul, 3 = replace)
- The stochastic process is now defined in terms of X_t and D_t
- A policy, R, is a rule for making decisions
  - It could use all of history, although a popular choice is (current) state-based

Markov Decision Model

- There is a space of potential policies
- Each policy defines a transition matrix, e.g., for R_b

Which policy is best? We need costs.


Markov Decision Model

- C_ik: the expected cost incurred during the next transition if the system is in state i and decision k is made
- The long-run average expected cost for each policy may be computed as E[C] = sum_i pi_i C_{i,d(i)}, where d(i) is the decision the policy prescribes in state i

Costs C_ik (in units of 1000):

State | k = 1 (none) | k = 2 (overhaul) | k = 3 (replace)
0     | 0            | 4                | 6
1     | 1            | 4                | 6
2     | 3            | 4                | 6
3     | 8            | 8                | 6

R_b is best.

Markov Decision Processes

- Solution using Dynamic Programming
- (some notation changes upcoming)

The RL Problem

- Main elements:
  - States, s
  - Actions, a
  - State transition dynamics - often stochastic and unknown
  - Reward (r) process - possibly stochastic
- Objective: policy pi_t(s, a)
  - a probability distribution over actions given the current state

Assumption: the environment defines a finite-state MDP

Back to Our Recycling Robot MDP

- Given an enumeration of transitions and corresponding costs/rewards, what is the best sequence of actions?
- We want to maximize the (discounted) return criterion R_t = sum_{k=0..inf} gamma^k r_{t+k+1}
- So, what must one do?

The Shortest Path Problem

Finite-State Systems and Shortest Paths

- the state space s_k is a finite set for each k
- action a_k takes you from s_k to f_k(s_k, a_k) at a cost g_k(s_k, a_k)
- Length = cost = the sum of the lengths of the arcs along the path
- Solve the tail subproblem first, then work backwards:

V_k(i) = min_j [ a_{k,ij} + V_{k+1}(j) ]
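
A compact backward-recursion sketch of this computation; the stage-graph data here is invented for illustration, with a[k][i][j] playing the role of a_{k,ij} in the recursion above:

```python
import math

# a[k][i][j]: cost of the arc from node i at stage k to node j at stage k+1
# (math.inf marks a missing arc). Invented example data.
a = [
    [[1.0, 4.0], [math.inf, 1.0]],   # stage 0 -> stage 1
    [[3.0, 2.0], [1.0, 5.0]],        # stage 1 -> stage 2
]
terminal = [0.0, 2.0]                # terminal costs V_N(j)

def shortest_paths(a, terminal):
    """Backward DP: V_k(i) = min_j (a[k][i][j] + V_{k+1}(j))."""
    V = list(terminal)
    for stage in reversed(a):
        V = [min(cost + V[j] for j, cost in enumerate(row)) for row in stage]
    return V

print(shortest_paths(a, terminal))   # optimal cost-to-go from each start node
```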

Value Functions

- The value of a state is the expected return starting from that state; it depends on the agent's policy
- The value of taking an action in a state under policy pi is the expected return starting from that state, taking that action, and thereafter following pi
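
Written out (the slide's equations did not survive extraction; these are the standard Sutton and Barto definitions):

```latex
V^{\pi}(s)   = E_{\pi}\Big[\, \textstyle\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s \,\Big]
Q^{\pi}(s,a) = E_{\pi}\Big[\, \textstyle\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s,\, a_t = a \,\Big]
```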

Recursive Equation for Value

The basic idea: the return telescopes one step at a time,
R_t = r_{t+1} + gamma r_{t+2} + gamma^2 r_{t+3} + ... = r_{t+1} + gamma R_{t+1}

So:
V^pi(s) = E_pi[ R_t | s_t = s ] = E_pi[ r_{t+1} + gamma V^pi(s_{t+1}) | s_t = s ]

Optimality in MDPs: the Bellman Equation
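
The equation itself is missing from the extracted text; in the notation used above, its standard form is:

```latex
V^{*}(s) = \max_{a} \sum_{s'} P^{a}_{ss'} \big[ R^{a}_{ss'} + \gamma V^{*}(s') \big]
```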

Policy Evaluation

- How do we compute V(s) for an arbitrary policy pi? (The prediction problem)
- For a given MDP, this yields a system of simultaneous equations - as many unknowns as there are states (a BIG, |S|-dimensional linear system!)
- Solve iteratively, with a sequence of value functions:
  V_{k+1}(s) = sum_a pi(s,a) sum_{s'} P^a_{ss'} [ R^a_{ss'} + gamma V_k(s') ]
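
A minimal sketch of this iteration in Python, assuming the tabular (state, action) -> transitions representation used for the recycling robot above; the stopping threshold theta is an assumption:

```python
def policy_evaluation(mdp, states, policy, gamma=0.9, theta=1e-8):
    """Sweep V(s) <- sum_a pi(a|s) sum_s' p [r + gamma V(s')] to a fixpoint."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = sum(pa * sum(p * (r + gamma * V[s2])
                             for s2, p, r in mdp[(s, a)])
                    for a, pa in policy[s].items())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

# e.g., with the recycling robot dict from earlier:
# V = policy_evaluation(mdp, ["high", "low"],
#                       {"high": {"search": 1.0}, "low": {"recharge": 1.0}})
```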

Policy Improvement

- Does it make sense to deviate from pi(s) at any state (following the policy everywhere else)? Let us for now assume a deterministic pi(s).
- Policy Improvement Theorem (Howard/Blackwell): if Q^pi(s, pi'(s)) >= V^pi(s) for all s, then V^{pi'}(s) >= V^pi(s) for all s

Computing Better Policies

- Starting with an arbitrary policy, we'd like to approach truly optimal policies. So, we compute new policies greedily with respect to the current value function: pi'(s) = argmax_a Q^pi(s, a)
- Are we restricted to deterministic policies? No.
- With stochastic policies, the same improvement result holds: any policy that places its probability mass only on the greedy (maximising) actions does at least as well
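
The corresponding greedy improvement step, sketched under the same assumed tabular representation; alternating this with policy_evaluation until the policy stops changing is exactly policy iteration:

```python
def policy_improvement(mdp, states, actions, V, gamma=0.9):
    """pi'(s) = argmax_a sum_s' p [r + gamma V(s')], returned as a
    deterministic policy in the same dict format used for evaluation."""
    new_policy = {}
    for s in states:
        best_a = max(actions[s],
                     key=lambda a: sum(p * (r + gamma * V[s2])
                                       for s2, p, r in mdp[(s, a)]))
        new_policy[s] = {best_a: 1.0}
    return new_policy
```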

Grid-World Example

Iterative Policy Evaluation in Grid World

Note: the value function can be searched greedily to find long-term optimal actions.