# Decision Making in Robots and Autonomous Agents: The Markov Decision Process (MDP) model - PowerPoint PPT Presentation

Title:

## Decision Making in Robots and Autonomous Agents: The Markov Decision Process (MDP) model

Description:

### Decision Making in Robots and Autonomous Agents: The Markov Decision Process (MDP) model. Subramanian Ramamoorthy, School of Informatics, 25 January 2013 – PowerPoint PPT presentation

Transcript and Presenter's Notes

Title: Decision Making in Robots and Autonomous Agents: The Markov Decision Process (MDP) model

1
Decision Making in Robots and Autonomous Agents: The Markov Decision Process (MDP) model
• Subramanian Ramamoorthy
• School of Informatics
• 25 January, 2013

2
In the MAB Model
• We were in a single casino and the only decision was which of a set of n arms to pull
• except perhaps in the very last slides, there was exactly one state!
• What if there is more than one state?
• So, in this state space, what is the effect of the payout distribution changing based on how you pull the arms?
• What happens if you only obtain a net reward corresponding to a long sequence of arm pulls (at the end)?

3
Decision Making: The Agent-Environment Interface
4
Markov Decision Processes
• A model of the agent-environment system
• Markov property: history doesn't matter, only the current state
• If the state and action sets are finite, it is a finite MDP.
• To define a finite MDP, you need to give:
• the state and action sets
• one-step dynamics defined by transition probabilities
• reward probabilities (a minimal array representation is sketched below)
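A minimal sketch of how one might store a finite MDP's one-step dynamics. The state/action names, numbers, and array layout below are illustrative assumptions, not taken from the lecture:

```python
import numpy as np

# Illustrative finite MDP: 2 states x 3 actions, loosely in the spirit of the
# recycling robot ({high, low} energy; {search, wait, recharge} actions).
n_states, n_actions = 2, 3

P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s'] = Pr(s' | s, a)
R = np.zeros((n_states, n_actions, n_states))  # R[s, a, s'] = expected reward

P[0, 0] = [0.7, 0.3]; R[0, 0] = [2.0, 2.0]   # search from 'high'
P[0, 1] = [1.0, 0.0]; R[0, 1] = [1.0, 0.0]   # wait from 'high'
P[0, 2] = [1.0, 0.0]                         # recharge (no-op when 'high')
P[1, 0] = [0.4, 0.6]; R[1, 0] = [-3.0, 2.0]  # search from 'low': risk of rescue
P[1, 1] = [0.0, 1.0]; R[1, 1] = [0.0, 1.0]   # wait from 'low'
P[1, 2] = [1.0, 0.0]                         # recharge from 'low'

# each (state, action) pair must give a probability distribution over s'
assert np.allclose(P.sum(axis=2), 1.0)
```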

5
An Example Finite MDP
Recycling Robot
• At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
• Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued.
• Decisions are made on the basis of the current energy level: high, low.
• Reward = number of cans collected

6
Recycling Robot MDP
7
Enumerated In Tabular Form
If you were given this much, what can you say
about the behaviour (over time) of the system?
8
A Very Brief Primer on Markov Chains and
Decisions
• A model, as originally developed in Operations
Research/Stochastic Control theory

9
Stochastic Processes
• A stochastic process is an indexed collection of random variables {Xt}
• e.g., the collection of weekly demands for a product
• One type: at a particular time t, labelled by integers, the system is found in exactly one of a finite number of mutually exclusive and exhaustive categories or states, also labelled by integers
• The process could be embedded, in that time points correspond to the occurrence of specific events (or time may be equi-spaced)
• Random variables may depend on one another, e.g., next week's demand may depend on the demands in previous weeks

10
Markov Chains
• The stochastic process is said to have the Markovian property if
P{X(t+1) = j | X0 = k0, X1 = k1, ..., X(t-1) = k(t-1), Xt = i} = P{X(t+1) = j | Xt = i}
• The Markovian property means that the conditional probability of a future event, given any past events and the current state, is independent of the past states and depends only on the present
• These conditional probabilities are the transition probabilities
• They are stationary if time-invariant, denoted p_ij = P{X(t+1) = j | Xt = i}

11
Markov Chains
• Looking forward in time: the n-step transition probabilities, p_ij^(n) = P{X(t+n) = j | Xt = i}
• One can collect these into a transition matrix, P = [p_ij]
• A stochastic process is a finite-state Markov chain if it has:
• a finite number of states
• the Markovian property
• stationary transition probabilities
• a set of initial probabilities P{X0 = i} for all i

12
Markov Chains
• n-step transition probabilities can be obtained from 1-step transition probabilities recursively (Chapman-Kolmogorov): p_ij^(n) = Σ_k p_ik^(m) p_kj^(n-m) for any 0 < m < n
• We can get this via the matrix too: the n-step transition matrix is the matrix power P^(n) = P^n (see the sketch below)
• First Passage Time: the number of transitions to go from i to j for the first time
• If i = j, this is the recurrence time
• In general, this is itself a random variable

13
Markov Chains
• n-step recursive relationship for the first passage time: f_ij^(1) = p_ij and f_ij^(n) = Σ_{k≠j} p_ik f_kj^(n-1)
• For fixed i and j, these f_ij^(n) are nonnegative numbers such that Σ_{n=1..∞} f_ij^(n) ≤ 1
• If Σ_{n=1..∞} f_ii^(n) = 1, that state is a recurrent state; it is absorbing if it returns with certainty in n = 1 step (p_ii = 1)

14
Markov Chains Long-Run Properties
• Consider the 8-step transition matrix of the inventory example
• Interesting property: the probability of being in state j after 8 weeks appears independent of the initial level of inventory.
• For an irreducible ergodic Markov chain, one has the limiting probabilities π_j = lim_{n→∞} p_ij^(n), which satisfy π_j = Σ_i π_i p_ij with Σ_j π_j = 1

The reciprocal gives you the recurrence time, m_jj = 1/π_j
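A sketch of computing the limiting probabilities numerically, by solving πP = π together with the normalization constraint; the matrix is the same illustrative one as above, not the slide's inventory example:

```python
import numpy as np

P = np.array([[0.5, 0.4, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
n = P.shape[0]

# Stack (P^T - I) pi = 0 with the constraint sum(pi) = 1 and solve by
# least squares (the system is overdetermined but consistent).
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi)        # limiting distribution pi_j
print(1.0 / pi)  # expected recurrence times m_jj = 1 / pi_j
```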
15
Markov Decision Model
• Consider the following application: machine maintenance
• A factory has a machine that deteriorates rapidly in quality and output and is inspected periodically, e.g., daily
• Inspection declares the machine to be in one of four possible states:
• 0: Good as new
• 1: Operable, minor deterioration
• 2: Operable, major deterioration
• 3: Inoperable
• Let Xt denote this observed state
• It evolves according to some law of motion, so it is a stochastic process
• Furthermore, assume it is a finite-state Markov chain

16
Markov Decision Model
• The transition matrix is based on the following:
• Once the machine becomes inoperable, it stays there until it is repaired
• If there are no repairs, it eventually reaches this state, which is absorbing!
• Repair is an action, giving a very simple maintenance policy,
• e.g., repair takes the machine from state 3 to state 0

17
Markov Decision Model
• There are costs as the system evolves:
• State 0: cost 0
• State 1: cost 1000
• State 2: cost 3000
• The replacement cost, taking state 3 to 0, is 4000 (plus lost production of 2000), so cost 6000
• The modified transition probabilities are:

18
Markov Decision Model
• Simple question: what is the average cost of this maintenance policy?
• Compute the steady-state probabilities
• The (long-run) expected average cost per day is E[C] = Σ_j π_j C_j
How?
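One way, sketched with the same steady-state computation as earlier; the transition matrix and costs below are illustrative stand-ins for the maintenance example, not numbers read off the slides:

```python
import numpy as np

# Illustrative 'repair only when inoperable' policy: states 0..3, where
# state 3 (inoperable) transitions back to 0 (good as new) after replacement.
P = np.array([[0.0, 0.875, 0.0625, 0.0625],
              [0.0, 0.75,  0.125,  0.125 ],
              [0.0, 0.0,   0.5,    0.5   ],
              [1.0, 0.0,   0.0,    0.0   ]])
C = np.array([0.0, 1000.0, 3000.0, 6000.0])  # cost incurred in each state

n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi @ C)  # long-run expected average cost per day, E[C] = sum_j pi_j C_j
```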
19
Markov Decision Model
• Consider a slightly more elaborate policy:
• when the machine is inoperable or needs major repairs, replace it
• The transition matrix now changes a little bit
• Permit one more thing: overhaul
• Go back to the minor-deterioration state (1) for the next time step
• Not possible if truly inoperable, but one can go from major to minor
• Key point about the system behaviour: it evolves according to
• the laws of motion
• the sequence of decisions made (actions: 1 = none, 2 = overhaul, 3 = replace)
• The stochastic process is now defined in terms of Xt and Dt
• A policy, R, is a rule for making decisions
• It could use all of history, although a popular choice is (current) state-based

20
Markov Decision Model
• There is a space of potential policies, e.g.,
• Each policy defines a transition matrix, e.g.,
for Rb

Which policy is best? Need costs.
21
Markov Decision Model
• C_ik: the expected cost incurred during the next transition if the system is in state i and decision k is made
• The long-run average expected cost for each policy may be computed as E[C] = Σ_i π_i C_{i,R(i)}, using the cost table below

State i   k = 1 (none)   k = 2 (overhaul)   k = 3 (replace)
0         0              4                  6
1         1              4                  6
2         3              4                  6
3         8              8                  6
(costs C_ik in thousands)
Rb is best
22
Markov Decision Processes
• Solution using Dynamic Programming
• (some notation changes upcoming)

23
The RL Problem
• Main elements:
• States, s
• Actions, a
• State transition dynamics: often stochastic and unknown
• Reward (r) process: possibly stochastic
• Objective: a policy πt(s,a)
• a probability distribution over actions given the current state

Assumption: the environment defines a finite-state MDP
24
Back to Our Recycling Robot MDP
25
• Given an enumeration of transitions and corresponding costs/rewards, what is the best sequence of actions?
• We want to maximize the return criterion (the standard discounted form is given below)
• So, what must one do?
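The criterion itself is an image on the slide; presumably it is the standard discounted return (in Sutton & Barto notation):

```latex
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots
    = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma < 1
```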

26
The Shortest Path Problem
27
Finite-State Systems and Shortest Paths
• The state space s_k is a finite set for each k
• An action a_k takes you from s_k to f_k(s_k, a_k) at a cost g_k(s_k, a_k)

Length = Cost = sum of the lengths of the arcs
Solve this first (backwards from the end):
V_k(i) = min_j [ a_ij^k + V_{k+1}(j) ], where a_ij^k is the arc length from node i at stage k to node j at stage k+1
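A sketch of this backward recursion on a small, made-up stage graph (all names and numbers are illustrative):

```python
# Backward dynamic programming for a finite-horizon shortest path:
# V_k(i) = min_j ( cost[k][i][j] + V_{k+1}(j) ). INF marks missing arcs.
INF = float("inf")

# cost[k][i][j]: arc length from node i at stage k to node j at stage k+1
cost = [
    [[2.0, 5.0], [4.0, 1.0]],   # stage 0: 2 nodes -> 2 nodes
    [[3.0, INF], [1.0, 2.0]],   # stage 1: 2 nodes -> 2 nodes
    [[4.0], [2.0]],             # stage 2: 2 nodes -> 1 terminal node
]

V = [0.0]                       # value at the terminal stage
for stage in reversed(cost):    # sweep backwards through the stages
    V = [min(c + v for c, v in zip(row, V)) for row in stage]

print(V)  # shortest path length from each start node at stage 0
```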
28
Value Functions
• The value of a state is the expected return starting from that state; it depends on the agent's policy
• The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π
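The formulas on this slide are images; in the standard Sutton & Barto notation the two definitions read:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s \right]
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\; a_t = a \right]
```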

29
Recursive Equation for Value
The basic idea: the return decomposes recursively, so the value of a state can be written in terms of the values of its successor states.
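Both formulas here are images; the standard reconstruction is the one-step decomposition of the return and the resulting Bellman equation for V^π:

```latex
R_t = r_{t+1} + \gamma R_{t+1}
\quad\Longrightarrow\quad
V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]
```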
30
Optimality in MDPs: the Bellman Equation
31
Policy Evaluation
• How do we compute V(s) for an arbitrary policy π? (The prediction problem)
• For a given MDP, this yields a system of simultaneous linear equations
• with as many unknowns as there are states (a big, |S|-dimensional linear system!)
• Solve iteratively, with a sequence of value functions (a sketch follows)
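A sketch of iterative policy evaluation, reusing the P[s, a, s'], R[s, a, s'] array layout assumed in the earlier MDP snippet, with a stochastic policy pi[s, a]; all names are illustrative:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate V <- sum_a pi(s,a) sum_s' P[s,a,s'] (R[s,a,s'] + gamma V[s'])."""
    V = np.zeros(P.shape[0])
    while True:
        Q = (P * (R + gamma * V)).sum(axis=2)  # Q[s, a]: one-step lookahead
        V_new = (pi * Q).sum(axis=1)           # average over the policy's actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

For instance, `policy_evaluation(P, R, np.full((2, 3), 1/3))` would evaluate the uniform-random policy for the earlier 2-state, 3-action sketch.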

32
Policy Improvement
• Does it make sense to deviate from π(s) at some state (while following the policy everywhere else)? Let us for now assume a deterministic π(s)
• Policy Improvement Theorem (Howard/Blackwell)

33
Computing Better Policies
• Starting with an arbitrary policy, we'd like to approach truly optimal policies. So, we compute new policies using the greedy step sketched below
• Are we restricted to deterministic policies? No.
• With stochastic policies, any policy that puts probability only on the greedy (maximizing) actions improves just as well
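A sketch of the greedy improvement step, π'(s) = argmax_a Q^π(s,a), matching the array conventions of the previous snippets; alternating it with policy evaluation gives policy iteration:

```python
import numpy as np

def policy_improvement(P, R, V, gamma=0.9):
    """Return the deterministic greedy policy with respect to the current V."""
    Q = (P * (R + gamma * V)).sum(axis=2)        # Q[s, a] under current V
    greedy = np.argmax(Q, axis=1)                # best action in each state
    pi_new = np.zeros_like(Q)
    pi_new[np.arange(Q.shape[0]), greedy] = 1.0  # one-hot over actions
    return pi_new
```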

34
Grid-World Example
35
Iterative Policy Evaluation in Grid World
Note: the value function can be searched greedily to find long-term optimal actions