Certainly, RL gives helpful options to quite a lot of sequential decision-making issues. Temporal-Distinction Studying (TD studying) strategies are a well-liked subset of RL algorithms. TD studying strategies mix key elements of Monte Carlo and Dynamic Programming strategies to speed up studying with out requiring an ideal mannequin of the atmosphere dynamics.
On this article, we’ll evaluate totally different sorts of TD algorithms in a customized Grid World. The design of the experiment will define the significance of steady exploration in addition to the particular person traits of the examined algorithms: Q-learning, Dyna-Q, and Dyna-Q+.
The define of this put up incorporates:
- Description of the atmosphere
- Temporal-Distinction (TD) Studying
- Mannequin-free TD strategies (Q-learning) and model-based TD strategies (Dyna-Q and Dyna-Q+)
- Parameters
- Efficiency comparisons
- Conclusion
The total code permitting the copy of the outcomes and the plots is out there right here: https://github.com/RPegoud/Temporal-Difference-learning
The Surroundings
The atmosphere we’ll use on this experiment is a grid world with the next options:
- The grid is 12 by 8 cells.
- The agent begins within the backside left nook of the grid, the goal is to achieve the treasure positioned within the high proper nook (a terminal state with reward 1).
- The blue portals are related, going by means of the portal positioned on the cell (10, 6) results in the cell (11, 0). The agent can not take the portal once more after its first transition.
- The purple portal solely seems after 100 episodes however permits the agent to achieve the treasure quicker. This encourages regularly exploring the atmosphere.
- The pink portals are traps (terminal states with reward 0) and finish the episode.
- Bumping right into a wall causes the agent to stay in the identical state.
This experiment goals to check the behaviour of Q-learning, Dyna-Q, and Dyna-Q+ brokers in a altering atmosphere. Certainly, after 100 episodes, the optimum coverage is sure to vary and the optimum variety of steps throughout a profitable episode will lower from 17 to 12.

Introduction to Temporal-Distinction Studying:
Temporal-Distinction Studying is a mixture of Monte Carlo (MC) and Dynamic Programming (DP) strategies:
- Like MC strategies, TD strategies can study from expertise with out requiring a mannequin of the atmosphere’s dynamics.
- Like DP strategies, TD strategies replace estimates after each step primarily based on different discovered estimates with out ready for the final result (that is referred to as bootstrapping).
One particularity of TD strategies is that they replace their worth estimate each time step, versus MC strategies that wait till the tip of an episode.
Certainly, each strategies have totally different replace targets. MC strategies purpose to replace the return Gt, which is simply obtainable on the finish of an episode. As an alternative, TD strategies goal:

The place V is an estimate of the true worth operate Vπ.
Due to this fact, TD strategies mix the sampling of MC (by utilizing an estimate of the true worth) and the bootstrapping of DP (by updating V primarily based on estimates counting on additional estimates).
The best model of temporal-difference studying is named TD(0) or one-step TD, a sensible implementation of TD(0) would appear to be this:

When transitioning from a state S to a brand new state S’, the TD(0) algorithm will compute a backed-up worth and replace V(S) accordingly. This backed-up worth is named TD error, the distinction between the noticed reward R plus the discounted worth of the brand new state γV(St+1) and the present worth estimate V(S) :

In conclusion, TD strategies current a number of benefits:
- They don’t require an ideal mannequin of the atmosphere’s dynamics p
- They’re applied in a web based style, updating the goal after every time step
- TD(0) is assured to converge for any mounted coverage π if α (the studying fee or step measurement) follows stochastic approximation situations (for extra element see web page 55 ”Monitoring a Nonstationary Downside” of [4])
Implementation particulars:
The next sections discover a number of TD algorithms’ primary traits and efficiency son the grid world.
The identical parameters have been used for all fashions, for the sake of simplicity:
- Epsilon (ε) = 0.1: likelihood of choosing a random motion in ε-greedy insurance policies
- Gamma (γ)= 0.9: low cost issue utilized to future rewards or worth estimates
- Aplha (α) = 0.25: studying fee limiting the Q worth updates
- Planning steps = 100: for Dyna-Q and Dyna-Q+, the variety of planning steps executed for every direct interplay
- Kappa (κ)= 0.001: for Dyna-Q+, the burden of bonus rewards utilized throughout planning steps
The performances of every algorithm are first introduced for a single run of 400 episodes (sections: Q-learning, Dyna-Q, and Dyna-Q+) after which averaged over 100 runs of 250 episodes within the “abstract and algorithms comparability” part.
Q-learning
The primary algorithm we implement right here is the well-known Q-learning (Watkins, 1989):

Q-learning is named an off-policy algorithm as its aim is to approximate the optimum worth operate straight, as an alternative of the worth operate of π, the coverage adopted by the agent.
In apply, Q-learning nonetheless depends on a coverage, sometimes called the ‘conduct coverage’ to pick out which state-action pairs are visited and up to date. Nevertheless, Q-learning is off-policy becauseit updates its Q-values primarily based on the finest estimate of future rewards, no matter whether or not the chosen actions comply with the present coverage π.
In comparison with the earlier TD studying pseudo-code, there are three primary variations:
- We have to initialize the Q operate for all states and actions and Q(terminal) needs to be 0
- The actions are chosen from a coverage primarily based on the Q values (as an illustration the ϵ-greedy coverage with respect to the Q values)
- The replace targets the motion worth operate Q fairly than the state worth operate V

Now that we have now our first algorithm studying for testing, we will begin the coaching section. Our agent will navigate the Grid World utilizing its ε-greedy coverage, with respect to the Q values. This coverage selects the motion with the highest Q-value with a likelihood of (1 – ε) and chooses a random motion with a likelihood of ε. After every motion, the agent will replace its Q-value estimates.
We are able to visualize the evolution of the estimated most action-value Q(S, a) of every cell of the Grid World utilizing a heatmap. Right here the agent performs 400 episodes. As there is just one replace per episode, the evolution of the Q values is sort of gradual and a big a part of the states stay unmapped:

Upon completion of the 400 episodes, an evaluation of the overall visits to every cell gives us with an honest estimate of the agent’s common route. As depicted on the right-hand plot under, the agent appears to have converged to a sub-optimal route, avoiding cell (4,4) and constantly following the decrease wall.

On account of this sub-optimal technique, the agent reaches a minimal of 21 steps per episode, following the trail outlined within the “variety of whole visits” plot. Variations in step counts might be attributed to the ε-greedy coverage, which introduces a ten% likelihood of random actions. Given this coverage, following the decrease wall is an honest technique to restrict potential disruptions brought on by random actions.

In conclusion, the Q-learning agent converged to a sub-optimal technique as talked about beforehand. Furthermore, a portion of the atmosphere stays unexplored by the Q-function, which prevents the agent from discovering the brand new optimum path when the purple portal seems after the one hundredth episode.
These efficiency limitations might be attributed to the comparatively low variety of coaching steps (400), limiting the chances of interplay with the atmosphere and the exploration induced by the ε-greedy coverage.
Planning, an integral part of model-based reinforcement studying strategies is especially helpful to enhance pattern effectivity and estimation of motion values. Dyna-Q and Dyna-Q+ are good examples of TD algorithms incorporating planning steps.
Dyna-Q
The Dyna-Q algorithm (Dynamic Q-learning) is a mixture of model-based RL and TD studying.
Mannequin-based RL algorithms depend on a mannequin of the atmosphere to include planning as their major means of updating worth estimates. In distinction, model-free algorithms depend on direct studying.
”A mannequin of the atmosphere is something that an agent can use to foretell how the atmosphere will reply to its actions” — Reinforcement Studying: an introduction.
Within the scope of this text, the mannequin might be seen as an approximation of the transition dynamics p(s’, r|s, a). Right here, p returns a single next-state and reward pair given the present state-action pair.
In environments the place p is stochastic, we distinguish distribution fashions and pattern fashions, the previous returns a distribution of the subsequent states and actions whereas the latter returns a single pair, sampled from the estimated distribution.
Fashions are particularly helpful to simulate episodes, and subsequently practice the agent by changing real-world interactions with planning steps, i.e. interactions with the simulated atmosphere.
Brokers implementing the Dyna-Q algorithm are a part of the category of planning brokers, brokers that mix direct reinforcement studying and mannequin studying. They use direct interactions with the atmosphere to replace their worth operate (as in Q-learning) and likewise to study a mannequin of the atmosphere. After every direct interplay, they will additionally carry out planning steps to replace their worth operate utilizing simulated interactions.
A fast Chess instance
Think about enjoying recreation of chess. After enjoying every transfer, the response of your opponent means that you can assess the high quality of your transfer. That is just like receiving a optimistic or damaging reward, which lets you “replace” your technique. In case your transfer results in a blunder, you in all probability wouldn’t do it once more, supplied with the identical configuration of the board. To this point, that is corresponding to direct reinforcement studying.
Now let’s add planning to the combination. Think about that after every of your strikes, whereas the opponent is pondering, you mentally return over every of your earlier strikes to reassess their high quality. You may discover weaknesses that you simply uncared for at first sight or discover out that particular strikes have been higher than you thought. These ideas can also can help you replace your technique. That is precisely what planning is about, updating the worth operate with out interacting with the actual atmosphere however fairly a mannequin of mentioned atmosphere.

Dyna-Q subsequently incorporates some further steps in comparison with Q-learning:
After every direct replace of the Q values, the mannequin shops the state-action pair and the reward and next-state that have been noticed. This step is named mannequin coaching.
- After mannequin coaching, Dyna-Q performs n planning steps:
- A random state-action pair is chosen from the mannequin buffer (i.e. this state-action pair was noticed throughout direct interactions)
- The mannequin generates the simulated reward and next-state
- The worth operate is up to date utilizing the simulated observations (s, a, r, s’)

We now replicate the training course of with the Dyna-Q algorithm utilizing n=100. Because of this after every direct interplay with the atmosphere, we use the mannequin to carry out 100 planning steps (i.e. updates).
The next heatmap reveals the quick convergence of the Dyna-Q mannequin. In actual fact, it solely takes the algorithm round 10 episodes to seek out an optimum path. This is because of the truth that each step results in 101 updates of the Q values (as an alternative of 1 for Q-learning).

One other good thing about planning steps is a greater estimation of motion values throughout the grid. Because the oblique updates goal random transitions saved contained in the mannequin, states which are far-off from the aim additionally get up to date.
In distinction, the motion values slowly propagate from the aim in Q-learning, resulting in an incomplete mapping of the grid.

Utilizing Dyna-Q, we discover an optimum path permitting the decision of the grid world in 17 steps, as depicted on the plot under by pink bars. Optimum performances are attained recurrently, regardless of the occasional interference of ε-greedy actions for the sake of exploration.
Lastly, whereas Dyna-Q might seem extra convincing than Q-learning resulting from its incorporation of planning, it’s important to keep in mind that planning introduces a tradeoff between computational prices and real-world exploration.

Dyna-Q+
Thus far, neither of the examined algorithms managed to seek out the optimum path showing after step 100 (the purple portal). Certainly, each algorithms quickly converged to an optimum answer that remained mounted till the tip of the coaching section. This highlights the necessity for steady exploration all through coaching.
Dyna-Q+ is basically just like Dyna-Q however provides a small twist to the algorithm. Certainly, Dyna-Q+ continuously tracks the variety of time steps elapsed since every state-action pair was tried in actual interplay with the atmosphere.
Particularly, take into account a transition yielding a reward r that has not been tried in τ time steps. Dyna-Q+ would carry out planning as if the reward for this transition was r + κ √τ, with κ small enough (0.001 within the experiment).
This modification in reward design encourages the agent to repeatedly discover the atmosphere. It assumes that the longer a state-action pair hasn’t been tried, the better the probabilities that the dynamics of this pair have modified or that the mannequin is inaccurate.

As depicted by the next heatmap, Dyna-Q+ is way more lively with its updates in comparison with the earlier algorithms. Earlier than episode 100, the agent explores the entire grid and finds the blue portal and the primary optimum route.
The motion values for the remainder of the grid lower earlier than slowly rising once more, as states-action pairs within the high left nook aren’t explored for a while.
As quickly because the purple portal seems in episode 100, the agent finds the brand new shortcut and the worth for the entire space rises. Till the completion of the 400 episodes, the agent will constantly replace the motion worth of every state-action pair whereas sustaining occasional exploration of the grid.

Due to the bonus added to mannequin rewards, we lastly get hold of a full mapping of the Q operate (every state or cell has an motion worth).
Mixed with steady exploration, the agent manages to seek out the brand new finest route (i.e. optimum coverage) because it seems, whereas retaining the earlier answer.

Nevertheless, the exploration-exploitation trade-off in Dyna-Q+ certainly comes with a price. When state-action pairs haven’t been visited for a enough length, the exploration bonus encourages the agent to revisit these states, which may briefly lower its fast efficiency. This exploration conduct prioritizes updating the mannequin to enhance long-term decision-making.
This explains why some episodes performed by Dyna-Q+ might be as much as 70 steps lengthy, in comparison with at most 35 and 25 steps for Q-learning and Dyna-Q, respectively. The longer episodes in Dyna-Q+ mirror the agent’s willingness to speculate additional steps in exploration to assemble extra details about the atmosphere and refine its mannequin, even when it ends in short-term efficiency reductions.
In distinction, Dyna-Q+ recurrently achieves optimum efficiency (depicted by inexperienced bars on the plot under) that earlier algorithms didn’t attain.

Abstract and Algorithms Comparability
As a way to evaluate the important thing variations between the algorithms, we use two metrics (needless to say the outcomes rely on the enter parameters, which have been equivalent amongst all fashions for simplicity):
- Variety of steps per episode: this metric characterizes the speed of convergence of the algorithms in direction of an optimum answer. It additionally describes the conduct of the algorithm after convergence, significantly when it comes to exploration.
- Common cumulative reward: the proportion of episodes resulting in a optimistic reward
Analyzing the variety of steps per episode (see plot under), reveals a number of elements of model-based and model-free strategies:
- Mannequin-Based mostly Effectivity: Mannequin-based algorithms (Dyna-Q and Dyna-Q+) are usually extra sample-efficient on this specific Grid World (this property can also be noticed extra typically in RL). It is because they will plan forward utilizing the discovered mannequin of the atmosphere, which may result in faster convergence to close optimum or optimum options.
- Q-Studying Convergence: Q-learning, whereas finally converging to a close to optimum answer, requires extra episodes (125) to take action. It’s necessary to focus on that Q-learning performs only one replace per step, which contrasts with the a number of updates carried out by Dyna-Q and Dyna-Q+.
- A number of Updates: Dyna-Q and Dyna-Q+ execute 101 updates per step, which contributes to their quicker convergence. Nevertheless the tradeoff for this sample-efficiency is computational price (see the runtime part within the desk under).
- Advanced Environments: In additional complicated or stochastic environments, the benefit of model-based strategies may diminish. Fashions can introduce errors or inaccuracies, which may result in suboptimal insurance policies. Due to this fact, this comparability needs to be seen as an overview of the strengths and weaknesses of various approaches fairly than a direct efficiency comparability.

We now introduce the common cumulative reward (ACR), which represents the proportion of episodes the place the agent reaches the aim (because the reward is 1 for reaching the aim and 0 for triggering a entice), the ACR is then just by:

With N the variety of episodes (250) and Okay the variety of impartial runs (100) and Rn,okay the cumulative reward for episode n in run okay.
Right here’s a breakdown of the efficiency of all algorithms:
- Dyna-Q converges quickly and achieves the very best general return, with an ACR of 87%. Because of this it effectively learns and reaches the aim in a good portion of episodes.
- Q-learning additionally reaches an identical degree of efficiency however requires extra episodes to converge, explaining its barely decrease ACR, at 70%.
- Dyna-Q+ promptly finds coverage, reaching a cumulative reward of 0.8 after solely 15 episodes. Nevertheless, the variability and exploration induced by the bonus reward reduces efficiency till step 100. After 100 steps, it begins to enhance because it discovers the brand new optimum path. Nevertheless, the short-term exploration compromises its efficiency, leading to an ACR of 79%, which is decrease than Dyna-Q however larger than Q-learning.

Conclusion
On this article, we introduced the elemental ideas of Temporal Distinction studying and utilized Q-learning, Dyna-Q, and Dyna-Q+ to a customized grid world. The design of this grid world helps emphasize the significance of continuous exploration as a option to uncover and exploit new optimum insurance policies in altering environments. The distinction in performances (evaluated utilizing the variety of steps per episode and the cumulative reward) illustrate the strengths and weaknesses of those algorithms.
In abstract, model-based strategies (Dyna-Q, Dyna-Q+) profit from elevated pattern effectivity in comparison with model-based strategies (Q-learning), at the price of computation effectivity. Nevertheless, in stochastic or extra complicated environments, inaccuracies within the mannequin may hinder performances and result in sub-optimal insurance policies.

References:
[1] Demis Hassabis, AlphaFold reveals the structure of the protein universe (2022), DeepMind
[2] Elia Kaufmann, Leonard Bauersfeld, Antonio Loquercio, Matthias Müller, Vladlen Koltun &Davide Scaramuzza, Champion-level drone racing using deep reinforcement learning (2023), Nature
[3] Nathan Lambert, LouisCastricato, Leandro von Werra, Alex Havrilla, Illustrating Reinforcement Learning from Human Feedback (RLHF), HuggingFace
[4] Sutton, R. S., & Barto, A. G. . Reinforcement Learning: An Introduction (2018), Cambridge (Mass.): The MIT Press.
[5] Christopher J. C. H. Watkins & Peter Dayan, Q-learning (1992), Machine Studying, Springer Hyperlink