Title: Decision Making in Robots and Autonomous Agents: The Markov Decision Process (MDP) model
1. Decision Making in Robots and Autonomous Agents: The Markov Decision Process (MDP) model
- Subramanian Ramamoorthy
- School of Informatics
- 25 January, 2013
2. In the MAB Model
- We were in a single casino and the only decision is to pull from a set of n arms - except perhaps in the very last slides, exactly one state!
- We asked the following:
  - What if there is more than one state?
  - So, in this state space, what is the effect of the distribution of payout changing based on how you pull arms?
  - What happens if you only obtain a net reward corresponding to a long sequence of arm pulls (at the end)?
3. Decision Making: Agent-Environment Interface
4. Markov Decision Processes
- A model of the agent-environment system
- Markov property: history doesn't matter, only the current state
- If the state and action sets are finite, it is a finite MDP.
- To define a finite MDP, you need to give (see the sketch after this list):
  - state and action sets
  - one-step dynamics defined by transition probabilities
  - reward probabilities
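As a concrete illustration of these ingredients, here is a minimal sketch (not from the slides) of how a finite MDP could be written down in Python; the state names, transition probabilities and rewards below are made up purely for illustration:

```python
# A finite MDP is just: states, actions, transition probabilities, reward model.
# All names and numbers below are illustrative assumptions, not from the lecture.

states = ["s0", "s1"]
actions = ["a0", "a1"]

# P[(s, a)] -> list of (next_state, probability): the one-step dynamics.
P = {
    ("s0", "a0"): [("s0", 0.9), ("s1", 0.1)],
    ("s0", "a1"): [("s1", 1.0)],
    ("s1", "a0"): [("s0", 0.5), ("s1", 0.5)],
    ("s1", "a1"): [("s1", 1.0)],
}

# R[(s, a)] -> expected immediate reward for taking action a in state s.
R = {
    ("s0", "a0"): 1.0,
    ("s0", "a1"): 0.0,
    ("s1", "a0"): 0.0,
    ("s1", "a1"): 2.0,
}

# Sanity check: each (s, a) pair's transition probabilities sum to 1.
for sa, outcomes in P.items():
    assert abs(sum(p for _, p in outcomes) - 1.0) < 1e-9, sa
```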
5. An Example Finite MDP: Recycling Robot
- At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
- Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
- Decisions are made on the basis of the current energy level: high, low.
- Reward: number of cans collected
6. Recycling Robot MDP
7. Enumerated in Tabular Form
If you were given this much, what can you say about the behaviour (over time) of the system?
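The table itself is not reproduced in the extracted text. As a rough sketch, the recycling-robot dynamics are usually enumerated along the following lines (this follows the standard Sutton and Barto parameterisation; the symbols alpha, beta, r_search, r_wait and the rescue penalty are assumptions, not values given here):

```python
# Recycling robot, enumerated as (state, action) -> [(next_state, prob, reward), ...].
# alpha, beta, r_search, r_wait are placeholders (assumed, not taken from the slide).
alpha, beta = 0.8, 0.6        # illustrative probabilities of keeping enough charge
r_search, r_wait = 2.0, 1.0   # illustrative expected numbers of cans collected

transitions = {
    ("high", "search"):   [("high", alpha,     r_search),
                           ("low",  1 - alpha, r_search)],
    ("high", "wait"):     [("high", 1.0,       r_wait)],
    ("low",  "search"):   [("low",  beta,      r_search),
                           ("high", 1 - beta,  -3.0)],   # ran flat: rescued (bad), penalty assumed
    ("low",  "wait"):     [("low",  1.0,       r_wait)],
    ("low",  "recharge"): [("high", 1.0,       0.0)],
}
```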
8. A Very Brief Primer on Markov Chains and Decisions
- A model, as originally developed in Operations Research / stochastic control theory
9. Stochastic Processes
- A stochastic process is an indexed collection of random variables $\{X_t\}$, e.g., the collection of weekly demands for a product.
- One type: at a particular time t, labelled by integers, the system is found in exactly one of a finite number of mutually exclusive and exhaustive categories or states, also labelled by integers.
- The process could be embedded, in the sense that time points correspond to the occurrence of specific events (or time may be equi-spaced).
- The random variables may depend on one another, e.g., $X_{t+1}$ may depend on $X_t$.
10. Markov Chains
- The stochastic process is said to have the Markovian property if
  $P\{X_{t+1}=j \mid X_0=k_0, X_1=k_1, \dots, X_{t-1}=k_{t-1}, X_t=i\} = P\{X_{t+1}=j \mid X_t=i\}$
- The Markovian property means that the conditional probability of a future event, given any past events and the current state, is independent of the past states and depends only on the present.
- The conditional probabilities $P\{X_{t+1}=j \mid X_t=i\}$ are the transition probabilities.
- These are stationary if they are time-invariant; they are then written $p_{ij}$.
11. Markov Chains
- Looking forward in time, we have the n-step transition probabilities $p_{ij}^{(n)}$.
- One can collect these into a transition matrix $P = [p_{ij}]$.
- A stochastic process is a finite-state Markov chain if it has:
  - a finite number of states
  - the Markovian property
  - stationary transition probabilities
  - a set of initial probabilities $P\{X_0 = i\}$ for all $i$
12. Markov Chains
- The n-step transition probabilities can be obtained from the 1-step transition probabilities recursively (Chapman-Kolmogorov):
  $p_{ij}^{(n)} = \sum_{k} p_{ik}^{(m)} \, p_{kj}^{(n-m)}$ for any $0 < m < n$
- We can get this via the matrix too: the n-step transition matrix is $P^{(n)} = P^n$.
- First passage time: the number of transitions to go from i to j for the first time.
  - If i = j, this is the recurrence time.
  - In general, this is itself a random variable.
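A minimal numerical sketch of the matrix form (the 2-state chain below is an invented example, not one from the slides):

```python
import numpy as np

# One-step transition matrix of an illustrative 2-state chain (rows sum to 1).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Chapman-Kolmogorov in matrix form: the n-step transition matrix is P**n.
P8 = np.linalg.matrix_power(P, 8)
print(P8)          # entry (i, j) is p_ij^(8)
print(P8.sum(1))   # each row still sums to 1
```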
13. Markov Chains
- There is an n-step recursive relationship for the first passage time probabilities:
  $f_{ij}^{(n)} = p_{ij}^{(n)} - f_{ij}^{(1)} p_{jj}^{(n-1)} - f_{ij}^{(2)} p_{jj}^{(n-2)} - \dots - f_{ij}^{(n-1)} p_{jj}^{(1)}$
- For fixed i and j, these $f_{ij}^{(n)}$ are nonnegative numbers such that $\sum_{n=1}^{\infty} f_{ij}^{(n)} \le 1$.
- If $\sum_{n=1}^{\infty} f_{ii}^{(n)} = 1$, that state is a recurrent state; it is absorbing if the return happens in one step with certainty (n = 1, i.e. $p_{ii} = 1$).
14. Markov Chains: Long-Run Properties
- Consider the 8-step transition matrix of the inventory example.
- Interesting property: the probability of being in state j after 8 weeks appears independent of the initial level of inventory.
- For an irreducible ergodic Markov chain, the limiting probabilities exist:
  $\lim_{n \to \infty} p_{ij}^{(n)} = \pi_j > 0$, where the $\pi_j$ satisfy $\pi_j = \sum_i \pi_i p_{ij}$ and $\sum_j \pi_j = 1$.
- The reciprocal gives you the recurrence time: $\mu_{jj} = 1 / \pi_j$.
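A small sketch of how the limiting probabilities and recurrence times could be computed numerically (the chain P is again the invented 2-state example, not the inventory chain from the slide):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Solve pi = pi P together with sum(pi) = 1 as a linear system.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.zeros(n + 1); b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi)        # limiting probabilities pi_j
print(1.0 / pi)  # expected recurrence times mu_jj = 1 / pi_j
```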
15. Markov Decision Model
- Consider the following application: machine maintenance.
- A factory has a machine that deteriorates rapidly in quality and output and is inspected periodically, e.g., daily.
- Inspection declares the machine to be in one of four possible states:
  - 0: Good as new
  - 1: Operable, minor deterioration
  - 2: Operable, major deterioration
  - 3: Inoperable
- Let $X_t$ denote this observed state.
  - It evolves according to some law of motion, so it is a stochastic process.
  - Furthermore, assume it is a finite-state Markov chain.
16. Markov Decision Model
- The transition matrix is based on the following:
  - Once the machine becomes inoperable, it stays there until it is repaired.
  - If there are no repairs, it eventually reaches this state, which is absorbing!
- Repair is an action: a very simple maintenance policy,
  - e.g., take the machine from state 3 to state 0.
17. Markov Decision Model
- There are costs as the system evolves:
  - State 0: cost 0
  - State 1: cost 1000
  - State 2: cost 3000
  - Replacement (taking state 3 to 0) costs 4000, plus lost production of 2000, so cost 6000
- The modified transition probabilities are:
18. Markov Decision Model
- Simple question:
  - What is the average cost of this maintenance policy?
- Compute the steady-state probabilities $\pi_i$.
- The (long-run) expected average cost per day is then $E[C] = \sum_i \pi_i \, C_i$.
How?
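As a sketch of the computation (the transition matrix P below is a placeholder, since the slide's actual matrix is not reproduced in the extracted text; the costs are the ones quoted above, in units of 1000):

```python
import numpy as np

# Placeholder transition matrix for the "replace only when inoperable" policy.
# These numbers are illustrative assumptions, not taken from the slide.
P = np.array([[0.00, 0.875, 0.0625, 0.0625],
              [0.00, 0.750, 0.1250, 0.1250],
              [0.00, 0.000, 0.5000, 0.5000],
              [1.00, 0.000, 0.0000, 0.0000]])
costs = np.array([0.0, 1.0, 3.0, 6.0])   # per-day costs, in thousands

# Steady-state probabilities (same linear system as before), then pi . C.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.zeros(n + 1); b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi)            # steady-state probabilities
print(pi @ costs)    # long-run expected average cost per day (in thousands)
```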
19. Markov Decision Model
- Consider a slightly more elaborate policy:
  - When the machine is inoperable or needs major repairs, replace it.
  - The transition matrix now changes a little bit.
- Permit one more thing: overhaul.
  - Go back to the minor-deterioration state (1) for the next time step.
  - Not possible if truly inoperable, but one can go from major to minor.
- Key point about the system behaviour: it evolves according to
  - the laws of motion,
  - the sequence of decisions made (actions: 1 = none, 2 = overhaul, 3 = replace).
- The stochastic process is now defined in terms of $X_t$ and $D_t$.
- A policy, R, is a rule for making decisions.
  - It could use all history, although the popular choice is (current-)state-based.
20. Markov Decision Model
- There is a space of potential policies, e.g.:
- Each policy defines a transition matrix, e.g., for Rb:
Which policy is best? We need costs.
21. Markov Decision Model
- $C_{ik}$ = expected cost incurred during the next transition if the system is in state i and decision k is made.
- The long-run average expected cost for each policy may be computed as $E[C] = \sum_i C_{i k(i)} \, \pi_i$, where $k(i)$ is the decision the policy prescribes in state i.

State | Decision 1 (none) | Decision 2 (overhaul) | Decision 3 (replace)
  0   |        0          |          4            |          6
  1   |        1          |          4            |          6
  2   |        3          |          4            |          6
  3   |        8          |          8            |          6
(costs in thousands)
Rb is best
22. Markov Decision Processes
- Solution using Dynamic Programming
- (some notation changes upcoming)
23. The RL Problem
- Main elements:
  - States, s
  - Actions, a
  - State transition dynamics - often stochastic and unknown
  - Reward (r) process - possibly stochastic
- Objective: a policy $\pi_t(s,a)$
  - a probability distribution over actions given the current state
Assumption: the environment defines a finite-state MDP.
24. Back to Our Recycling Robot MDP
25.
- Given an enumeration of transitions and corresponding costs/rewards, what is the best sequence of actions?
- We want to maximize the criterion shown below.
- So, what must one do?
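The criterion itself is not reproduced in the extracted text; presumably it is the usual expected discounted return over the sequence of rewards, along the lines of:

$R_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad 0 \le \gamma \le 1$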
26. The Shortest Path Problem
27. Finite-State Systems and Shortest Paths
- The state space $S_k$ is a finite set for each stage k.
- An action $a_k$ takes you from state $s_k$ to $s_{k+1} = f_k(s_k, a_k)$ at a cost $g_k(s_k, a_k)$.
- Length = cost = sum of the lengths of the arcs.
Solve this first:
$V_k(i) = \min_j \left[ a^k_{ij} + V_{k+1}(j) \right]$
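A small sketch of this backward recursion on an invented layered graph (the node names and arc lengths are made up for illustration):

```python
# Backward dynamic programming for a shortest path on a staged graph.
# cost[k][(i, j)] is the arc length a^k_ij from node i at stage k to node j at stage k+1.
INF = float("inf")

stages = [["A"], ["B", "C"], ["D", "E"], ["G"]]          # nodes per stage
cost = [
    {("A", "B"): 2, ("A", "C"): 5},
    {("B", "D"): 4, ("B", "E"): 1, ("C", "D"): 1, ("C", "E"): 3},
    {("D", "G"): 2, ("E", "G"): 4},
]

# Terminal condition: value of the goal node is 0.
V = {"G": 0}
# Sweep backwards: V_k(i) = min_j [ a^k_ij + V_{k+1}(j) ].
for k in range(len(cost) - 1, -1, -1):
    V_new = {}
    for i in stages[k]:
        V_new[i] = min(cost[k].get((i, j), INF) + V[j] for j in stages[k + 1])
    V.update(V_new)

print(V["A"])   # length of the shortest path from A to G
```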
28. Value Functions
- The value of a state is the expected return starting from that state; it depends on the agent's policy.
- The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$.
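The definitions themselves are not in the extracted text; in the standard notation they read:

$V^{\pi}(s) = E_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s \right]$

$Q^{\pi}(s,a) = E_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s, a_t = a \right]$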
29. Recursive Equation for Value
The basic idea: the return decomposes into the immediate reward plus the discounted return from the next state.
So the value function satisfies a recursive (Bellman) equation, sketched below.
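In the usual MDP notation (transition probabilities $\mathcal{P}^a_{ss'}$ and expected rewards $\mathcal{R}^a_{ss'}$), this recursion takes the standard form:

$R_t = r_{t+1} + \gamma R_{t+1} \;\Longrightarrow\; V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \gamma V^{\pi}(s') \right]$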
30. Optimality in MDPs: The Bellman Equation
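The equation on this slide is not reproduced in the extracted text; the Bellman optimality equation in the same notation is:

$V^{*}(s) = \max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \gamma V^{*}(s') \right]$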
31. Policy Evaluation
- How do we compute $V^{\pi}(s)$ for an arbitrary policy $\pi$? (The prediction problem.)
- For a given MDP, the Bellman equation yields a system of simultaneous linear equations - as many unknowns as states (a big linear system in $|S|$ unknowns!).
- Alternatively, solve iteratively with a sequence of value functions $V_0, V_1, V_2, \dots$, applying the Bellman equation as an update; see the sketch below.
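A minimal sketch of iterative policy evaluation, reusing the toy MDP dictionaries (states, actions, P, R) sketched after slide 4; the uniform-random policy and the discount factor are illustrative assumptions:

```python
# Iterative policy evaluation: repeatedly apply the Bellman expectation backup
#   V(s) <- sum_a pi(s,a) * sum_s' P(s'|s,a) * [ R(s,a) + gamma * V(s') ]
# until the value function stops changing.

gamma = 0.9
policy = {s: {a: 1.0 / len(actions) for a in actions} for s in states}  # uniform random

V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        v_new = sum(policy[s][a] *
                    sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in P[(s, a)])
                    for a in actions)
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < 1e-8:
        break

print(V)   # approximate V_pi for the uniform-random policy
```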
32. Policy Improvement
- Does it make sense to deviate from $\pi(s)$ at some state (while following the policy everywhere else)? Let us for now assume a deterministic $\pi(s)$.
- Policy Improvement Theorem (Howard / Blackwell):
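The statement of the theorem is not in the extracted text; in its standard form it says that if a (deterministic) policy $\pi'$ satisfies

$Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s) \quad \text{for all } s, \qquad\text{then}\qquad V^{\pi'}(s) \ge V^{\pi}(s) \quad \text{for all } s.$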
33. Computing Better Policies
- Starting with an arbitrary policy, we'd like to approach truly optimal policies. So, we compute new policies by acting greedily with respect to the current value function (sketched below).
- Are we restricted to deterministic policies? No.
- With stochastic policies, the same argument goes through with $Q^{\pi}(s, \pi'(s))$ replaced by $\sum_a \pi'(s,a)\, Q^{\pi}(s,a)$.
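A sketch of the greedy improvement step, again on the toy MDP and the V computed by the policy-evaluation sketch above (all names are the illustrative ones introduced earlier):

```python
# Greedy policy improvement: in each state, pick the action that maximises
#   Q(s, a) = sum_s' P(s'|s,a) * [ R(s,a) + gamma * V(s') ]
def q_value(s, a):
    return sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in P[(s, a)])

new_policy = {}
for s in states:
    best_a = max(actions, key=lambda a: q_value(s, a))
    new_policy[s] = best_a           # deterministic greedy policy

print(new_policy)
# Alternating evaluation and improvement in this way is policy iteration.
```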
34. Grid-World Example
35. Iterative Policy Evaluation in the Grid World
Note: The value function can be searched greedily to find long-term optimal actions.