Skip to content

Markov Decision Processes (MDPs)

Prime #
471
Origin domain
Operations Research
Also from
Mathematics, Computer Science & Software Engineering
Aliases
MDP, MDPS, Stochastic Dynamic Programming, Sequential Decision Under Uncertainty
Related primes
Dynamic Programming, reinforcement learning, Linear Programming (LP), stochastic process, partially observable mdp, Bayesian Updating

Core Idea

A Markov Decision Process formalizes sequential decision-making under uncertainty where each action leads to probabilistic state transitions and rewards, enabling an optimal policy that maximizes cumulative expected return.

How would you explain it like I'm…

Step-by-step game math

Imagine a game where you're in a room and can pick a door. Each door might take you somewhere fun or scary, and sometimes you get a sticker. You want to pick doors that get you the most stickers over many rooms. A Markov Decision Process is just the math for that kind of step-by-step game.

Math for stepwise choice games

Picture a video game where you stand on a square, choose a move, and randomly land on a new square that gives you points. The rules say: where you go next depends only on where you stand now and what you chose, not your whole history. A Markov Decision Process writes down all the squares, all the moves, the chances of landing in each next square, the points, and how much you care about points later versus now. Then math finds the best plan.

Sequential Decisions Under Uncertainty

A Markov Decision Process (MDP) is a precise recipe for problems where you make a sequence of choices and the world reacts a bit randomly. You write down: every possible situation (states), every choice you can make (actions), the probability each choice leads to each next state, the reward you get for each transition, and a number between 0 and 1 saying how much you discount future rewards. The key trick is the Markov property: the next step depends only on the current state, not the whole path you took. That cleanness lets a method called dynamic programming compute the best long-run strategy.

 

A Markov Decision Process formalizes sequential decision-making under uncertainty as a tuple (S, A, P, R, gamma): a state space S, an action set A, a transition kernel P(s' | s, a) giving the probability of each next state, a reward function R, and a discount factor gamma in [0, 1] that down-weights future rewards. Its defining assumption is the Markov property: next-state and reward distributions depend only on the current state-action pair, not on prior history. A policy pi(a | s) maps states to action choices; the goal is to find one that maximizes expected discounted cumulative reward. The Markov property enables a recursive decomposition called the Bellman equation, which drives algorithms like value iteration and policy iteration (sweep-and-update methods that converge to optimal value functions and policies). When the transition and reward functions are unknown, reinforcement learning (e.g., Q-learning, policy gradient) learns a policy from experience. Variants include partially-observable MDPs (POMDPs) for hidden state and continuous-state MDPs for physical content control problems.

Broad Use

  • Robotics & AI Control: Agents select actions in uncertain environments, continuously updating states.

  • Inventory/Production Systems: Decide reorder or production quantities each period, with uncertain demand, to maximize profit or minimize cost.

  • Healthcare Treatment Protocols: Each choice of therapy shifts a patient's health state stochastically; MDP methods find the best policy over time.

  • Finance: Optimal trading or hedging decisions based on uncertain price transitions.

Clarity

Differentiates from static optimization by focusing on dynamic feedback—the system evolves in states, actions, and transitions, each step shaping future possibilities.

Manages Complexity

Offers a systematic approach to multi-step planning with stochastic state changes, letting algorithms like dynamic programming or reinforcement learning find policies that adapt over time.

Abstract Reasoning

Demonstrates the principle of Markov property (next state depends only on current state and action, not full history) and the concept of value functions for each state, bridging discrete math, probability, and optimization.

Knowledge Transfer

  • Adaptive Traffic Lights: MDP logic updates signals based on real-time traffic flow to reduce congestion.

  • Multi-Round Negotiations: At each stage, actions shift the negotiation state, with uncertain responses from the other party.

Example

A warehouse replenishment MDP chooses reorder sizes each week based on current inventory (state), facing demand uncertainty. The policy is a mapping from inventory levels to reorder decisions maximizing long-run expected profit.

Relationships to Other Primes

One-hop neighborhood: parents above, mutual partners to the right, children below.Markov DecisionProcesses (MDPs)composition: DecisionDecisioncomposition: ProbabilityProbabilitycomposition: State and State TransitionState and StateTransition

Parents (3) — more general patterns this builds on

  • Markov Decision Processes (MDPs) presupposes Decision — Markov Decision Processes presuppose Decision: an MDP is machinery for selecting policies, which are decision rules over states.
  • Markov Decision Processes (MDPs) presupposes Probability — Markov Decision Processes presuppose Probability: the transition kernel and expected-reward objective are defined as probabilistic objects.
  • Markov Decision Processes (MDPs) presupposes State and State Transition — Markov Decision Processes presupposes state and state transition because the MDP tuple is built on a state space with Markov-property transitions.

Path to root: Markov Decision Processes (MDPs)Probability

Not to Be Confused With

  • Markov Decision Processes is not Probability because Markov Decision Processes use probability as a tool to model state transitions and outcomes, while Probability is the mathematical measure of likelihood (the foundation theory).
  • Markov Decision Processes is not Bayesian Updating because Markov Decision Processes model sequential decision-making under uncertainty with a specific transition structure, while Bayesian Updating is the method of revising belief probabilities given new evidence.
  • Markov Decision Processes is not Prioritization because Markov Decision Processes compute optimal action sequences, while Prioritization is the process of ranking items by importance or urgency.