Decision Making in Robots and Autonomous Agents: The Markov Decision Process (MDP) model (PowerPoint presentation, 36 slides)
Transcript and Presenter's Notes


Decision Making in Robots and Autonomous
Agents: The Markov Decision Process (MDP) model
  • Subramanian Ramamoorthy
  • School of Informatics
  • 25 January, 2013

In the MAB Model
  • We were in a single casino and the only decision
    was which of a set of n arms to pull
  • except perhaps in the very last slides, there was
    exactly one state!
  • We then asked:
  • What if there is more than one state?
  • So, in this state space, what is the effect of
    the distribution of payout changing based on how
    you pull arms?
  • What happens if you only obtain a net reward
    corresponding to a long sequence of arm pulls (at
    the end)?

Decision Making: the Agent-Environment Interface
Markov Decision Processes
  • A model of the agent-environment system
  • Markov property: history doesn't matter, only the
    current state
  • If the state and action sets are finite, it is a
    finite MDP.
  • To define a finite MDP, you need to give:
  • the state and action sets
  • the one-step dynamics, defined by transition and
    reward probabilities

An Example Finite MDP
Recycling Robot
  • At each step, the robot has to decide whether it
    should (1) actively search for a can, (2) wait
    for someone to bring it a can, or (3) go to home
    base and recharge.
  • Searching is better, but runs down the battery; if
    the robot runs out of power while searching, it has
    to be rescued (which is bad).
  • Decisions are made on the basis of the current
    energy level: high or low.
  • Reward: the number of cans collected
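The enumerated transitions can be written down as a transition/reward table. A minimal Python sketch follows; the probabilities ALPHA, BETA and the reward values are illustrative assumptions, since the slide gives no numbers:

```python
# Sketch of the recycling-robot MDP as a transition/reward table.
# ALPHA/BETA and the reward values are illustrative assumptions.
ALPHA, BETA = 0.9, 0.6            # P(battery level unchanged while searching)
R_SEARCH, R_WAIT, R_RESCUE = 2.0, 1.0, -3.0

# mdp[state][action] -> list of (probability, next_state, reward)
mdp = {
    "high": {
        "search": [(ALPHA, "high", R_SEARCH), (1 - ALPHA, "low", R_SEARCH)],
        "wait":   [(1.0, "high", R_WAIT)],
    },
    "low": {
        "search": [(BETA, "low", R_SEARCH), (1 - BETA, "high", R_RESCUE)],
        "wait":   [(1.0, "low", R_WAIT)],
        "recharge": [(1.0, "high", 0.0)],
    },
}

# Sanity check: outgoing probabilities sum to one for every (state, action).
for state, actions in mdp.items():
    for action, outcomes in actions.items():
        assert abs(sum(p for p, _, _ in outcomes) - 1.0) < 1e-9
```

Tables of this shape are exactly the "one-step dynamics" a finite MDP requires: state and action sets plus transition and reward probabilities.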

Recycling Robot MDP
Enumerated In Tabular Form
If you were given this much, what can you say
about the behaviour (over time) of the system?
A Very Brief Primer on Markov Chains and Markov Decision Models
  • A model, as originally developed in Operations
    Research/Stochastic Control theory

Stochastic Processes
  • A stochastic process is an indexed collection of
    random variables {X_t : t in T}
  • e.g., the collection of weekly demands for a product
  • One type: at a particular time t, labelled by
    integers, the system is found in exactly one of a
    finite number of mutually exclusive and
    exhaustive categories or states, also labelled by
    integers
  • The process could be embedded in that time points
    correspond to occurrences of specific events (or
    time may be equi-spaced)
  • Random variables may depend on others, e.g., next
    week's demand may depend on the demands in
    previous weeks

Markov Chains
  • The stochastic process is said to have the
    Markovian property if
    P{X_{t+1} = j | X_0 = k_0, ..., X_{t-1} = k_{t-1}, X_t = i}
    = P{X_{t+1} = j | X_t = i}
  • The Markovian property means that the conditional
    probability of a future event, given any past
    events and the current state, is independent of
    the past states and depends only on the present
  • The conditional probabilities
    p_ij = P{X_{t+1} = j | X_t = i} are called
    transition probabilities
  • These are stationary if they are time-invariant,
    and are then called stationary transition
    probabilities

Markov Chains
  • Looking forward in time: the n-step transition
    probabilities, p_ij(n) = P{X_{t+n} = j | X_t = i}
  • One can collect these in a transition matrix,
    P(n) = [p_ij(n)]
  • A stochastic process is a finite-state Markov
    chain if it has:
  • a finite number of states
  • the Markovian property
  • stationary transition probabilities
  • a set of initial probabilities P{X_0 = i} for all i

Markov Chains
  • n-step transition probabilities can be obtained
    from 1-step transition probabilities recursively
    (the Chapman-Kolmogorov equations):
    p_ij(n) = Sum_k p_ik(m) p_kj(n - m)
  • We can get this via the matrix too: P(n) = P^n
  • First Passage Time: the number of transitions to go
    from i to j for the first time
  • If i = j, this is the recurrence time
  • In general, this is itself a random variable
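Both points can be sketched in a few lines: n-step probabilities as matrix powers, and the first passage time as a sampled random variable. The two-state transition matrix below is an illustrative assumption:

```python
import random

# Illustrative two-state transition matrix (each row sums to one).
P = [[0.7, 0.3],
     [0.4, 0.6]]

def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def n_step(P, n):
    """Chapman-Kolmogorov: the n-step matrix is the n-th matrix power."""
    out = P
    for _ in range(n - 1):
        out = mat_mul(out, P)
    return out

def first_passage_time(P, i, j, rng):
    """Sample the number of transitions to reach j from i for the first time."""
    state, steps = i, 0
    while True:
        state = rng.choices(range(len(P[state])), weights=P[state])[0]
        steps += 1
        if state == j:
            return steps

P2 = n_step(P, 2)   # p_00(2) = 0.7*0.7 + 0.3*0.4 = 0.61
t = first_passage_time(P, 0, 1, random.Random(0))   # one sample of T_01
```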

Markov Chains
  • n-step recursive relationship for first-passage
    probabilities: f_ij(1) = p_ij, and
    f_ij(n) = Sum_{k != j} p_ik f_kj(n - 1)
  • For fixed i and j, these f_ij(n) are nonnegative
    numbers such that Sum_{n=1..inf} f_ij(n) <= 1
  • If Sum_{n=1..inf} f_ii(n) = 1, that state is a
    recurrent state; it is absorbing if p_ii = 1

Markov Chains Long-Run Properties
  • Consider the 8-step transition matrix of the
    inventory example
  • Interesting property: the probability of being in
    state j after 8 weeks appears independent of the
    initial level of inventory.
  • For an irreducible ergodic Markov chain, one has
    the limiting probabilities
    pi_j = lim_{n->inf} p_ij(n), which are independent
    of i and satisfy pi_j = Sum_i pi_i p_ij with
    Sum_j pi_j = 1

The reciprocal gives you the recurrence time: m_jj = 1 / pi_j
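The limiting probabilities and the recurrence-time relation m_jj = 1/pi_j can be checked numerically by power-iterating the transition matrix; the two-state matrix below is an illustrative assumption:

```python
# Approximate the limiting probabilities of an irreducible ergodic chain
# by repeatedly applying the (illustrative) transition matrix.
P = [[0.7, 0.3],
     [0.4, 0.6]]

pi = [1.0, 0.0]                     # any initial distribution works
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]

# The steady-state equations pi = pi P give pi = (4/7, 3/7) for this P.
recurrence_times = [1.0 / p for p in pi]    # m_jj = 1 / pi_j
```

The result no longer depends on the initial distribution, which is exactly the "independent of initial level" property observed in the inventory example.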
Markov Decision Model
  • Consider the following application: machine
    maintenance
  • A factory has a machine that deteriorates rapidly
    in quality and output and is inspected
    periodically, e.g., daily
  • Inspection declares the machine to be in one of
    four possible states:
  • 0: Good as new
  • 1: Operable, minor deterioration
  • 2: Operable, major deterioration
  • 3: Inoperable
  • Let X_t denote this observed state; it evolves
    according to some law of motion, so it is a
    stochastic process
  • Furthermore, assume it is a finite-state Markov
    chain

Markov Decision Model
  • The transition matrix is based on the following:
  • Once the machine goes inoperable, it stays there
    until it is repaired
  • If there are no repairs, it eventually reaches this
    state, which is absorbing!
  • Repair is an action: a very simple maintenance
    policy
  • e.g., repair takes the machine from state 3 to
    state 0

Markov Decision Model
  • There are costs as the system evolves:
  • State 0: cost 0
  • State 1: cost 1000
  • State 2: cost 3000
  • Replacement cost, taking state 3 to 0, is 4000
    (plus lost production of 2000), so cost 6000
  • The modified transition probabilities are:

Markov Decision Model
  • Simple question:
  • What is the average cost of this maintenance
    policy?
  • Compute the steady-state probabilities pi_j
  • (Long-run) expected average cost per day:
    Sum_j pi_j C_j
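A sketch of this computation follows. The costs are the ones given on the slides; the transition matrix is an assumption (the values used in the classic formulation of this machine-maintenance example), so treat the numbers as illustrative:

```python
# Long-run average cost of the replace-only-when-inoperable policy.
# Costs come from the slides; the transition matrix is an assumed
# (classic) formulation of this maintenance example.
P = [[0, 7/8, 1/16, 1/16],
     [0, 3/4, 1/8,  1/8],
     [0, 0,   1/2,  1/2],
     [1, 0,   0,    0]]          # state 3: replace, back to good-as-new
cost = [0, 1000, 3000, 6000]

pi = [0.25] * 4                   # power-iterate toward the steady state
for _ in range(500):
    pi = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]

avg_cost = sum(p * c for p, c in zip(pi, cost))   # Sum_j pi_j C_j
# With this P, pi = (2/13, 7/13, 2/13, 2/13), so avg_cost = 25000/13 per day
```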

Markov Decision Model
  • Consider a slightly more elaborate policy:
  • Repair when inoperable or needing major repairs
  • The transition matrix now changes a little bit
  • Permit one more thing: overhaul
  • This takes the machine back to the minor-deterioration
    state (1) for the next time step
  • Not possible if truly inoperable, but the machine
    can go from major to minor
  • Key point about the system behaviour: it evolves
    according to
  • the laws of motion
  • the sequence of decisions made (actions: 1 = none,
    2 = overhaul, 3 = replace)
  • The stochastic process is now defined in terms of
    X_t and D_t
  • A policy, R, is a rule for making decisions
  • It could use all of history, although a popular
    choice is (current) state-based

Markov Decision Model
  • There is a space of potential policies, e.g.,
  • Each policy defines a transition matrix, e.g.,
    for Rb

Which policy is best? Need costs.
Markov Decision Model
  • C_ik: the expected cost incurred during the next
    transition if the system is in state i and
    decision k is made
  • The long-run average expected cost for each
    policy may be computed using E[C] = Sum_i pi_i C_ik

State   Dec. 1 (none)   Dec. 2 (overhaul)   Dec. 3 (replace)
0       0               4                   6
1       1               4                   6
2       3               4                   6
3       8               8                   6
(costs in thousands)
Rb is best
Markov Decision Processes
  • Solution using Dynamic Programming
  • (some notation changes upcoming)

The RL Problem
  • Main elements:
  • States, s
  • Actions, a
  • State transition dynamics, often stochastic
  • Reward (r) process, possibly stochastic
  • Objective: a policy pi_t(s, a)
  • a probability distribution over actions given the
    current state

Assumption: the environment defines a finite-state MDP.
Back to Our Recycling Robot MDP
  • Given an enumeration of transitions and
    corresponding costs/rewards, what is the best
    sequence of actions?
  • We want to maximize the return criterion, e.g. the
    discounted sum of rewards
    R_t = Sum_{k=0..inf} gamma^k r_{t+k+1}
  • So, what must one do?
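The discounted-return criterion can be computed with a short backward recursion; a minimal sketch:

```python
def discounted_return(rewards, gamma):
    """Discounted return: R = r_1 + gamma*r_2 + gamma^2*r_3 + ...

    Computed backwards via g <- r + gamma*g, which avoids explicit powers.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

discounted_return([1.0, 1.0, 1.0], 0.5)   # 1 + 0.5 + 0.25 = 1.75
```

With gamma < 1, rewards far in the future count for less, which is what makes the infinite-horizon sum finite.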

The Shortest Path Problem
Finite-State Systems and Shortest Paths
  • The state space s_k is a finite set for each k
  • Action a_k takes you from s_k to
    s_{k+1} = f_k(s_k, a_k) at a cost g_k(s_k, a_k)

Length (cost): the sum of the lengths of the arcs
along the path
Solve this first
V_k(i) = min_j [ a_k(i, j) + V_{k+1}(j) ]
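This recursion can be sketched as backward dynamic programming over stages, starting from the terminal node and working toward the start; the arc costs below are illustrative:

```python
import math

# Backward DP for a layered shortest-path problem:
#   V_k(i) = min_j [ cost_k[i][j] + V_{k+1}(j) ]
# The stage costs are illustrative; INF marks a missing arc.
INF = math.inf
stage_costs = [
    [[1, 4], [2, 3]],      # stage 0 -> stage 1 (cost[i][j])
    [[2, INF], [1, 5]],    # stage 1 -> stage 2
    [[3], [1]],            # stage 2 -> single terminal node
]

V = [0.0]                  # value of the terminal node
for costs in reversed(stage_costs):
    # One backward sweep: best arc out of each node i at this stage.
    V = [min(c + v for c, v in zip(row, V)) for row in costs]

# V[i] is now the shortest-path length from start node i to the terminal
```

Solving the last stage first is exactly the "solve this first" step: each sweep reuses the already-solved tail of the problem.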
Value Functions
  • The value of a state is the expected return
    starting from that state; it depends on the
    agent's policy: V^pi(s) = E_pi[ R_t | s_t = s ]
  • The value of taking an action in a state under
    policy pi is the expected return starting from
    that state, taking that action, and thereafter
    following pi: Q^pi(s, a) = E_pi[ R_t | s_t = s, a_t = a ]

Recursive Equation for Value
The basic idea
Optimality in MDPs: the Bellman Equation
Policy Evaluation
  • How do we compute V^pi(s) for an arbitrary policy
    pi? (The prediction problem)
  • For a given MDP, the Bellman equation yields a
    system of simultaneous linear equations
  • as many unknowns as there are states (a BIG,
    |S|-sized linear system)
  • Solve iteratively, with a sequence of value
    functions converging to V^pi
Policy Improvement
  • Does it make sense to deviate from pi(s) at any
    state (following the policy everywhere else)? Let
    us for now assume a deterministic pi(s)
  • Policy Improvement Theorem (Howard/Blackwell):
    if Q^pi(s, pi'(s)) >= V^pi(s) for all s, then
    V^pi'(s) >= V^pi(s) for all s

Computing Better Policies
  • Starting with an arbitrary policy, we'd like to
    approach truly optimal policies. So, we compute
    new policies greedily: pi'(s) = argmax_a Q^pi(s, a)
  • Are we restricted to deterministic policies? No.
  • With stochastic policies, the improved policy may
    split its probability among the maximizing actions
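The greedy improvement step can be sketched as follows; the two-state dynamics and the starting value function are illustrative assumptions:

```python
# Greedy policy improvement:
#   pi'(s) = argmax_a sum_{s'} p(s'|s,a) [ r + gamma * V(s') ]
# Dynamics and the starting V are illustrative.
GAMMA = 0.9
dynamics = {
    0: {"stay": [(1.0, 0, 1.0)], "go": [(1.0, 1, 0.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}

def greedy(V):
    """Return the deterministic policy that is greedy w.r.t. V."""
    return {s: max(acts, key=lambda a: sum(p * (r + GAMMA * V[s2])
                                           for p, s2, r in acts[a]))
            for s, acts in dynamics.items()}

V = {0: 10.0, 1: 20.0}      # e.g., obtained from a policy-evaluation step
improved = greedy(V)         # state 0 now prefers "go", toward state 1
```

Alternating evaluation and this improvement step is policy iteration; by the policy improvement theorem, each new policy is at least as good as the last.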

Grid-World Example
Iterative Policy Evaluation in Grid World
Note: the value function can be searched greedily
to find long-term optimal actions