Reinforcement Learning Introduction
What is Reinforcement Learning?
Reinforcement learning (RL) is a class of methods for solving various kinds of sequential decision making problems.
RL is like teaching a dog new tricks - but instead of treats, we use numerical rewards, and instead of a dog, we have an algorithm.
Goal of RL: To design an agent that interacts with an external environment
The agent maintains an internal state s_t
The state s_t is passed to a policy \pi
Based on \pi, we choose an action a_t = \pi(s_t)
Environment gives us back an observation o_{t+1}
Finally, the agent updates its state s_{t+1} using a state-update function s_{t+1} = U(s_t, a_t, o_{t+1})
How to think of these key concepts

Agent - decision maker or learner. E.g. a robot trying to navigate a maze, the classic pacman game, or a self-driving car.
Environment - Everything outside the agent. E.g. The maze, the game board, the road.
State, s_t - The agent’s internal representation of a situation at a particular time, t. It’s a “mental map” of where it is and what the agent knows. E.g. The state might include the coordinates in the maze, what direction it is facing and what it has learnt so far.
Policy, \pi - The strategy (or rule) that guides decision making (that is maps states to actions). It is the decision making function. E.g. if the robot is a wall in front, turn right or left.
Action, a_t = \pi(s_t) - Choice made by the agent. E.g. Robot turning right or moving forward.
Observation - The information received from environment by an agent. E.g. Robot observes “I bumped into a wall” or “I found a clear path.”
State-update function - The function that the agent uses to update what it knows. It is how the agent learns and adapts. E.g. Our robot updates its mental map of the maze after each move, incorporating new information like “there’s a wall here” or “there’s a reward there.”
How do we determine what a good policy is?
Goal of the agent: To choose a policy that maximizes the sum of expected rewards. We introduce a reward function to guide the choice of a policy.
Reward function, R(s_t, a_t), is the value for performing an action in a given state.
Value function, V_{\pi}(s_0) is for policy \pi evaluated at the agent’s initial state, s_0.
V_{\pi}(s_0) = {E}_p(a_0, s_1, ..., a_T, s_T|s_0, \pi) \left[ \sum_{t=0}^T R(s_t, a_t) | s_0 \right]
- Initial state that the agent is in \pi could correspond to a random choice.
\text{Sample 1: } a^{(1)}_0, s^{(1)}_1, ... , a^{(1)}_T, s^{(1)}_T \implies \sum R(s_t^{(1)},a_t^{(1)})
\text{Sample 2: } a^{(2)}_0, s^{(2)}_1, ... , a^{(2)}_T, s^{(2)}_T \implies \sum R(s_t^{(2)},a_t^{(2)})
:
\text{Sample N: }a^{(N)}_0, s^{(N)}_1, ... , a^{(N)}_T, s^{(N)}_T \implies \sum R(s_t^{(N)},a_t^{(N)})
- Average reward over different (random) choices of trajectories of interacting with an environment.
V_{\pi}(s_0) = \frac{1}{N} \sum_{i=1}^N{R^{(i)}}
We can define the Optimal policy:
\pi^* = \arg\max_{\pi}{E_{p(s_0)}[V_{\pi}(s_0)]}
Note that, this is one way of designing an optimal policy - Maximum expected utility principle. Policy will vary depending on the assumptions we make about the environment and the form of an agent.
Multi-Armed Bandit Problem
When there are a finite number of possible actions, this is called a multi-armed bandit.
Imagine you’re in a Vegas casino with a row of slot machines (the “bandits” - they’re called one-armed bandits because of the lever). Each machine has different odds of paying out, but you don’t know which ones are better. You have limited money and time.
Your challenge: figure out which machines are the best while still making money. This is the multi-armed bandit problem - how do you balance exploring new options to find the best one, versus exploiting what you already know works?
Examine Scenario
Let’s say you have 3 coffee shops near your office:
- Joe’s Java: You’ve been there 5 times, it’s decent (7/10 rating)
- Bean Scene: You tried it once, it was amazing (9/10)
- Café Mystery: You’ve never tried it
Each morning, you face the bandit problem:
- Keep going to Bean Scene (exploit your best known option)?
- Try Café Mystery (explore to potentially find something better)?
- Give Joe’s more chances (gather more data)?
If you only exploit, you might miss an even better coffee shop. If you only explore, you waste money on potentially bad coffee.
Real-World Applications
Here are some examples of reinforcement learning applications. Neptune.AI has a great list of examples with visualizations and links to papers that I found very interesting. You should check it out.
- Online Advertising: Which ad version gets the most clicks? Companies test different versions while still showing profitable ones.
- Netflix Recommendations: Which movies to suggest? Netflix balances showing you movies similar to what you liked (exploit) vs. new genres you might enjoy (explore).
- Clinical Trials: Which treatment works best? Doctors must balance giving patients proven treatments vs. testing new ones.
- Restaurant Apps: DoorDash decides which restaurants to show first - popular ones you’ll likely order from, or new ones you might love.
The key insight is that information has value - sometimes it’s worth taking a suboptimal choice now to learn something that helps you make better choices later.
Now, let me challenge an implicit assumption in how this problem is often framed: Is pure optimization always the goal?
In real life, variety itself has value. Humans get bored. The “best” coffee shop might not be best every day. Perhaps the real problem isn’t just finding the optimal choice, but maintaining a satisfying portfolio of choices over time.