lossolo 6 years ago

> Xavier works on enabling everyone to construct products that make a life changing impact on world scale.

I would remove or change that sentence if I were you. I suppose it was meant to make what you do look impactful and serious, but I got the exact opposite impression after reading it.

  • thebillkidy 6 years ago

    Hi Lossolo, thanks a lot for this feedback! I took it into account and refactored this sentence a bit :) (deploy will be done in ±2 min)

platz 6 years ago

There is a lot of info out there about Markov chains, but very little about Markov decision processes (MDPs).

How popular are MDPs? What are their strengths and weaknesses?

Is there really a connection between MDPs and reinforcement learning, as the author claims? I've not heard of this connection before.

This is a very welcome article.

  • clickok 6 years ago

    Consider reading "Reinforcement Learning: An Introduction" by Rich Sutton[0]. It is very accessible (and probably the most used textbook in the field), and goes into detail about the connection between RL and MDPs.

    If you want a high-level answer right now about how the two are related: reinforcement learning focuses on predicting or maximizing the discounted sum of rewards, G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ..., which is called the return. Crucially, the process generating the rewards (the environment) is assumed to be memoryless[1]. So we can reformulate the return as a recursive equation, G_t = R_{t+1} + γ G_{t+1}, since the return from "t+1" onwards doesn't depend on the reward you got at the previous time step ("R_{t+1}").
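
    A quick numerical sanity check of that recursion (just a toy sketch with made-up rewards, not code from the book):

        # Toy check that the discounted return satisfies G_t = R_{t+1} + γ G_{t+1}.
        gamma = 0.9
        rewards = [1.0, 0.0, 2.0, 1.0]  # R_1, R_2, R_3, R_4 (made up)

        def ret(t):
            # G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ...
            if t >= len(rewards):
                return 0.0
            return rewards[t] + gamma * ret(t + 1)

        # The direct sum and the recursive definition agree.
        g0_direct = sum(gamma**k * r for k, r in enumerate(rewards))
        print(g0_direct, ret(0))  # both ≈ 3.349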

    This in turn allows you to define Bellman equations (which characterize optimal solutions to the problem of maximizing the return, i.e. what you want to do). More generally, assuming the problem is Markovian makes things substantially easier to reason about, because you don't have to worry about complex histories leading up to the agent's arrival in a state.

    While it's honestly a pretty strong assumption, it's one that we make across many fields and particularly in RL. It's actually not too inaccurate; for example most games are Markovian (or close), and in physical problems (e.g. robot locomotion) if you've got position+velocity then controlling the robot can be viewed as a (continuous-time) MDP.

    The weaknesses are obvious: not every environment is Markovian, and sometimes there is long-term dependence on the past. For example, translating prose or poetry in a way that preserves the meaning is very tough[2], and not really something you can formulate as an MDP. Other times, the weaknesses are easier to overcome (think of a poker game, where you can combine the present state information with the betting patterns from the past; this augmented state should be able to tell you everything you need to know).

    ---

    0. http://incompleteideas.net/book/bookdraft2017nov5.pdf

    1. Another way of phrasing this is that each state provides all the useful information, and knowing about previous states does not tell you anything new about the environment. Think of perfect information games, like Chess. It doesn't really matter how you got into a particular position, just the moves you make from there.

    2. Compare the poetry translations of Jerry Lettvin with what you'd get using existing translation software (https://sites.google.com/site/lettvingroup/Home/Projects-His...).

    • chris_st 6 years ago

      Wow, what a great answer -- thanks!

      I have a question -- in a card game, it appears that knowing which cards have been played is important (or can help, anyway). This seems to violate the idea of being memoryless.

      Can you "cheat" and reason about the cards left instead? Thanks for any clues.

      • 60654 6 years ago

        For something whose state is not always fully known (for example, some cards have been played so we know what they were, but we don't know what's left in the deck or in other players' hands), what we're describing is a _partially observable_ Markov decision process, or POMDP.

        In short, a regular MDP models moving from one state to future states with some probability. In a POMDP, we don't know which state we're in, but we have some guesses, i.e. we have a probability distribution over all states. So we model a sequence of taking actions and making observations, and use those observations to update the probability distribution so that we can reduce the uncertainty as much as possible.

        The typical example is robot navigation, where POMDPs are used extensively. For example, a blind robot starts out somewhere (it doesn't know where), so the probability distribution (PD) over all possible locations is uniform. Then it moves north and discovers that it was successful; this observation shifts the PD towards states that are reachable by moving north from previously likely states (conceptually: this will exclude all sorts of corners and dead ends from consideration). Then let's say it tries to drive north once again, but discovers that it bumped into a wall and was unsuccessful. Now the PD gets updated again, based on the previous PD and the result of this action (conceptually: we bump up the probability of any state from which driving north runs into a wall, such as a dead end or a T-intersection), and so on. As this sequence of actions and observations continues, we hope the probability distribution will collapse to the specific location we're in, or at least a small set of guesses.

        (This is a narrative description of course; for the nitty gritty details there are many more precise materials on POMDPs! :) )

        By the way, this is still a _Markov_ process: the trajectory of _how_ we got to some probability distribution doesn't matter. If two sequences of actions produce the same distribution, they are treated as equivalent per the Markov assumption.
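
        To make that a bit more concrete, here's a toy version of the blind-robot belief update (a sketch using a made-up 1-D corridor, not a full POMDP solver):

            # Belief update for a blind robot in a 1-D corridor of 5 cells,
            # where the only action is "move right" and the only observation
            # is whether the move succeeded or we bumped into the end wall.
            belief = [0.2] * 5  # start uniform: no idea where we are

            def update(belief, moved):
                new_belief = [0.0] * len(belief)
                for i, p in enumerate(belief):
                    if i + 1 < len(belief):
                        if moved:
                            new_belief[i + 1] += p  # we were at i, now at i+1
                    elif not moved:
                        new_belief[i] += p          # we were at the wall and stayed put
                total = sum(new_belief)
                return [p / total for p in new_belief]  # renormalise

            belief = update(belief, moved=True)
            print(belief)  # the leftmost cell is ruled out, mass shifts right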

      • nazgul17 6 years ago

        The state of the game does not have to be what you would physically see as a player. Generally, you do not model the state space as a graph; you model the state as a data structure. To make the Markov property apply, you just need to add the right kind of information to that data structure. In fact, you can "cheat" by incorporating history in it. The algorithms are happy regardless of the cheating: when landing a second time on the same data structure (aka state), they know that the reward distribution and the next-state distribution are the same in both occurrences, so they have some degree of knowledge the second time based on what happened the first time (and so on, with further experience improving that knowledge and increasing the confidence in it).
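
        A tiny sketch of what I mean (the state fields and numbers are made up, just to illustrate keying estimates on an augmented state):

            from collections import defaultdict

            # Hypothetical augmented state for a card game: our hand plus the
            # cards seen so far. Including the "seen" history is the cheat that
            # restores the Markov property.
            def make_state(hand, seen):
                return (frozenset(hand), frozenset(seen))

            values = defaultdict(float)  # tabular value estimates keyed on the state
            alpha = 0.1                  # step size

            def update(state, target):
                # Nudge the estimate for this state towards the observed target.
                values[state] += alpha * (target - values[state])

            # Two different trajectories that land on the same augmented state share
            # the same entry, so the second visit benefits from the first.
            s = make_state({"QS", "2H"}, {"AH", "KH"})
            update(s, 1.0)
            update(s, 1.0)
            print(values[s])  # ≈ 0.19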

      • platz 6 years ago

        Card games, e.g. poker, aren't perfect-information games, so history does matter, and so they're not Markovian.

        • chris_st 6 years ago

          Sorry, I should have given a more concrete example, closer to what I'm interested in.

          I was thinking Hearts or Spades, which aren't perfect information, but since the whole deck is dealt out, you can (with a lot of memory :-) know what cards are left.

          • clickok 6 years ago

            Yeah, so here we can "solve" the problem without using a POMDP by really expanding what constitutes the state space.

            Initially, your state is your hand plus the initial rules (who goes first, which suit is trump, etc., depending on what kind of game you're playing). After each player's turn you transition into a new state, which is <<starting hand>> + <<history of cards played>>, so at each point in time the state has all the available information. Obviously this leads to a combinatorially huge state space, because there are O(52!) different possible sequences for a standard deck of cards.
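
            For concreteness, a rough sketch of what such a state could look like (my own toy representation, not from any particular library):

                from dataclasses import dataclass
                from typing import Tuple

                # Toy state for a trick-taking game like Hearts: everything observed
                # so far, so the Markov property holds by construction.
                @dataclass(frozen=True)
                class HeartsState:
                    hand: Tuple[str, ...]          # cards still in our hand
                    history: Tuple[str, ...] = ()  # every card played so far, in order
                    # ...plus whatever turn/rule information the variant needs

                def play(state: HeartsState, card: str) -> HeartsState:
                    # Transition: the card leaves our hand and is appended to the history.
                    new_hand = tuple(c for c in state.hand if c != card)
                    return HeartsState(hand=new_hand, history=state.history + (card,))

                s0 = HeartsState(hand=("QS", "2H", "KD"))
                s1 = play(s0, "2H")
                print(s1.history)  # ('2H',)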

            Typically you have to use some sort of approximation architecture, because evaluating all the different possible responses to a given sequence of cards would take too long, and even if you could do it, the memory requirements would be enormous. See for example DeepStack[0], which uses a neural net to approximate the quality of positions + possible responses (along with other techniques), rather than having a lookup table for what to do in each possible game state.

            So you can formulate such games as MDPs, but the question then becomes: is this a useful way to think about it? Sometimes it is, because your approach to the problem scales and so you can throw enough resources (training data, compute power) at it to learn a useful strategy; other times there is a better way of formulating the problem[1].

            I have some implementations of reinforcement learning algorithms and code for working with MDPs if you're interested in that sort of thing[2], and of course I highly recommend Rich's book.

            ---

            0. https://www.deepstack.ai/ -- note that it's not really an RL system, but it does illustrate using neural nets to learn complicated functions in a large state space.

            1. For example, I am not sure that such an MDP for hearts is well-posed: the convergence results that I'm familiar with rely on the MDP being ergodic and aperiodic, but the suggested construction is obviously periodic.

            2. The algorithms: https://github.com/rldotai/rl-algorithms An MDP library: https://github.com/rldotai/mdpy (check out notebooks/example_notebook.ipynb for a basic demo).

    • thebillkidy 6 years ago

      Thanks for this! I think you actually summarized the Bellman equations very well! This gave me some ideas for future blog posts on how to make it clearer so everyone can understand it better :) Most likely the next one will be about the Bellman equations ;)

      • clickok 6 years ago

        I'm glad to hear that. The main point I try to get across to people regarding Bellman equations is that they are very special-- these sorts of recursive equations allow us to express the value of an observation without knowing the past, and to improve our estimates of a state's value without having to wait for the future to unfold.

        In most other situations you're forced to "wait and see" when you want to learn how a given strategy will turn out. This is not the case if you're dealing with an MDP. If the current state is `s`, the next state is `s'`, and the reward you got for transitioning between the two is `r`, then for a given value function v(·) you can express the temporal-difference error (which is sort of a gradient for the value function) as: δ = r + γ v(s') - v(s) ≈ ∂v(s).
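
        A minimal sketch of that update in code (tabular values and a made-up transition, purely illustrative):

            # Tabular TD(0): nudge v(s) towards r + γ v(s') using the TD error δ.
            gamma = 0.9  # discount factor
            alpha = 0.1  # step size

            v = {"s": 0.0, "s_next": 0.5}  # made-up value estimates

            def td_update(s, r, s_next):
                delta = r + gamma * v[s_next] - v[s]  # δ = r + γ v(s') - v(s)
                v[s] += alpha * delta                 # update now, no waiting for the episode to end
                return delta

            print(td_update("s", 1.0, "s_next"))  # δ = 1.45, v("s") becomes 0.145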

        Other formulations of rewards/objectives don't tend to permit such elegant constructions, which is why MDPs are so special (and reinforcement learning so successful).

        However I feel like it's a struggle getting that point across, so I'm interested in reading your next post to see how you convey things.

  • thebillkidy 6 years ago

    Hi Platz, MDPs are actually the foundation you need to understand before getting to the Bellman equations.

  • pilooch 6 years ago

    MDPs are the core holistic framework for decision-making under uncertainty, especially POMDPs and dec-POMDPs. Those and the various forms of the Bellman equation are worth your attention. RL is model-free MDP solving.

glitchc 6 years ago

Hi Xavier, in the discount factor equation, shouldn't the last term of the expansion be γ^n·R_n instead of γ^2·R_n? Or am I missing something?

  • thebillkidy 6 years ago

    You are completely right! Thanks for spotting this! :)

    • thebillkidy 6 years ago

      Update: Fixed it, the caching might need to expire on your side though (ctrl + shift + r should do the trick)

TheCabin 6 years ago

For my taste, the article is a bit too high-level to build a meaningful intuition. I think a more complex example would be beneficial.

  • thebillkidy 6 years ago

    Thanks a lot for your feedback! Currently I am working on the high-level definitions of reinforcement learning, so that I can build on them to go deeper into the more advanced topics. I'd like to have a series that explains the OpenAI examples :)

antman 6 years ago

Nice work! Any good pages for POMDPs?

bhaavan 6 years ago

Doesn't the typo "3th" annoy anyone when reading that article?

  • gautamdivgi 6 years ago

    No it doesn’t.

    • thebillkidy 6 years ago

      also fixed it to "third" :) keep the feedback coming