
Pythonic Markov Decision Process (MDP)

13 January 2008

I’ve been studying a subject called probabilistic methods for decisions.
The course covers a lot of interesting topics, like Bayesian Networks (BNs), inference and querying in BNs, probabilistic reasoning over time, etc., and the book (Artificial Intelligence: A Modern Approach – Russell, Norvig) is full of interesting pseudocode ;) to implement.

The simplest algorithm to implement is, IMHO, the Value Iteration algorithm, and my goal is to reproduce the graph of the evolution of the utilities shown in Figure 17.5.

[Figure: MDP example]

What is an MDP

Wikipedia has an article about MDP.
Simplifying, we need three things:

  1. an initial state: S0
  2. a transition model: T(s, a, s’)
  3. a reward function: R(s)

In this example I assume the environment is fully observable and I consider the horizon to be infinite.

The goal is to find an optimal policy, that is, a solution that specifies what the agent should do in any state other than the terminal ones.
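To make this concrete: for a small grid world like this one, a policy can be represented as a plain dictionary mapping each state to an action. The snippet below is only an illustrative sketch with made-up entries (the coordinates and action names are the ones used in the ToyMDP class further down):

# a (partial) policy: for every non-terminal state, the action to take
policy = {
    (2, 0): 'u',
    (1, 0): 'u',
    (0, 0): 'r',
    (0, 1): 'r',
    (0, 2): 'r',   # heads towards the +1 terminal state in (0, 3)
}
print(policy[(2, 0)])   # the action to take in state (2, 0)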

tools & libraries

I used Python and NumPy, plus a plotting library for the graph.

code for an MDP interface

In my toy problem I use (though it is not mandatory) an interface that defines the MDP:

class MarkovDecisionProcess:
    def transition(self, from_state, action, to_state):
        """the probability of reaching to_state from from_state taking action"""
        raise NotImplementedError

    def initial_state(self):
        """the initial state S0"""
        raise NotImplementedError

    def reward(self, state):
        """the reward R(s) of a state"""
        raise NotImplementedError

    def discount(self):
        """the discount factor"""
        raise NotImplementedError

code for the modified dict class

I also need a slightly modified dict class; you will see why further down:

class SumDict(dict):
    """a dict that sums values assigned to an existing key instead of overwriting them"""
    def __setitem__(self, key, value):
        if key in self:
            value += self.get(key)
        dict.__setitem__(self, key, value)
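A quick, made-up example of what SumDict does:

d = SumDict()
d[(1, 0)] = 0.8
d[(1, 0)] = 0.1    # same key: the values are summed, not overwritten
print(d)           # {(1, 0): 0.9} (up to floating point)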

code for the Toy MDP

After this I can define the world for the agent; it is identical to the one in the book (Section 17.1):

import numpy as np

class ToyMDP(MarkovDecisionProcess):

    def __init__(self):
        # this is the same world as in the book
        self.world = np.array([
            [-0.04, -0.04, -0.04,     1],
            [-0.04,  None, -0.04,    -1],
            [-0.04, -0.04, -0.04, -0.04],
        ])
        self.start = (0, 0)
        self.finals = [(0,3), (1,3)]
        self.actions = ('l', 'r', 'u', 'd')

    def __iter__(self):
        class Iterator:
            def __init__(self, iterator, finals):
                self.iterator = iterator
                self.finals = finals
            def next(self):
                while True:
                    coords = self.iterator.coords
                    val = self.iterator.next()
                    if val and coords not in self.finals: break
                return coords, val
        return Iterator(self.world.flat, self.finals)


    def _move(self, state, action):
        """calculates the next state given an action"""
        shape = self.world.shape
        next_state = list(state)
        if state in self.finals:
            pass
        elif action == 'l' and \
            (state[1] > 0 and self.world[state[0]][state[1]-1] is not None):
            next_state[1] -= 1
        #... other three elif for up, down and right
        return tuple(next_state)

    def successors(self, state, action):
        """this function returns a dict with all the successors of
        a state and the relative probability given an action,
        for example if you are in (1,0) and you want to go right
        the function returns {(1,0):0.8, (0,0):0.1, (2,0):0.1}"""

        #I'm using SumDict because if two or more "_move" calls
        #return the same state (and the state is the key of the dict)
        #we need to sum the values, not overwrite the old one
        d = SumDict()
        if action == 'l':
            d[self._move(state, 'l')] = 0.8
            d[self._move(state, 'u')] = 0.1
            d[self._move(state, 'd')] = 0.1
        #... other three elif for up, down and right
        return d

    # from_state, to_state: a tuple
    # action: l, r, u, d (left, right, up, down)
    def transition(self, from_state, action, to_state):
        return self.successors(from_state, action)[to_state]

    def initial_state(self):
        return self.start

    def reward(self, state):
        return self.world[state[0]][state[1]]

    def discount(self):
        return 1
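A quick sanity check of the toy MDP, matching the example in the successors docstring (it assumes the elided branches of _move and successors are filled in, as in the downloadable code):

mdp = ToyMDP()
print(mdp.reward((0, 0)))                   # -0.04
print(mdp.successors((1, 0), 'r'))          # {(1, 0): 0.8, (0, 0): 0.1, (2, 0): 0.1}
print(mdp.transition((1, 0), 'r', (0, 0)))  # 0.1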

code for the Value-Iteration algorithm

Now we are ready to compute the utility of every state. The function is a little bit more complex than another one I just found (really, just 5 minutes ago) :/

There is a little difference between my implementation and the one in the book.
The book uses the update $U'[s] = R[s] + \gamma \max_{a} \sum_{s'} T(s, a, s')\, U[s']$; the difference is in the summation: the book's formula iterates over all the states, including the unreachable ones, while, using successors, I iterate only over the reachable states.
I think this is correct: from a state s I can only go to the states in successors(s), so I can iterate over them instead of over every state in the world; the terms I skip all have T(s, a, s') = 0, so they add nothing to the sum.
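In other words, writing successors(s, a) for the set of states returned by mdp.successors, the update I compute is

$U'[s] = R[s] + \gamma \max_{a} \sum_{s' \in \mathrm{successors}(s, a)} T(s, a, s')\, U[s']$

which yields exactly the same value.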

from math import fabs

def value_iteration(mdp, e, hook=lambda u, mdp: 1):
    # start from zero utilities, except for the terminal states
    _u = np.zeros(mdp.world.shape)
    for final in mdp.finals:
        _u[final[0]][final[1]] = mdp.world.item(final)
    while True:
        u = _u.copy()
        hook(u, mdp)
        d = 0
        for state, reward in mdp:
            # Bellman update: best expected utility over the possible actions
            summ = max([sum([prob * u.item(next_state)
                             for next_state, prob in
                             mdp.successors(state, action).items()])
                        for action in mdp.actions])
            _u[state[0]][state[1]] = reward + mdp.discount() * summ
            diff = fabs(_u.item(state) - u.item(state))
            if diff > d:
                d = diff
        if d <= e * (1 - mdp.discount()) / mdp.discount():
            break
    return u
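The hook argument makes it easy to record the utilities at every iteration and reproduce the graph of Figure 17.5. The snippet below is only a hypothetical driver, not the code from the archive: it assumes matplotlib is available and tracks an arbitrary handful of states:

import matplotlib.pyplot as plt

history = []
tracked = [(0, 0), (0, 2), (0, 3), (2, 0)]   # arbitrary states to follow

def record(u, mdp):
    # called by value_iteration at every iteration with the current utilities
    history.append([u.item(s) for s in tracked])

mdp = ToyMDP()
u = value_iteration(mdp, 0.001, hook=record)
print(u)

for i, s in enumerate(tracked):
    plt.plot([row[i] for row in history], label=str(s))
plt.xlabel('number of iterations')
plt.ylabel('utility estimate')
plt.legend()
plt.show()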

results

The resulting matrix of state utilities is equal to the one in the book:

vrde@pandora:~/other/blog/mdp$ python mdp.py 
[[ 0.81155822  0.86780822  0.91780822  1.        ]
 [ 0.76155822  0.          0.66027397 -1.        ]
 [ 0.70530822  0.65530822  0.61141553  0.38792491]]

and so is the graph:

[Figure: evolution of the utilities of the states]

That's all folks! If you want to try it at home, you can download it here: mdp.tar.bz2 (or, if you want to use wget, here).


