publishing my previous post on benchmarking tabular reinforcement studying (RL) strategies, I couldn’t shake the sensation that one thing wasn’t fairly proper. The outcomes regarded off, and I wasn’t totally glad with how they turned out.
Nonetheless, I continued with the put up sequence, shifting focus to multi-player video games and approximate resolution strategies. To help this, I’ve been steadily refactoring the unique framework we constructed. The brand new model is cleaner, extra basic, and simpler to make use of. Within the course of, it additionally helped uncover just a few bugs and edge-case points in among the earlier algorithms (extra on that later).
On this put up, I’ll introduce the up to date framework, spotlight the errors I made, share corrected outcomes, and replicate on key classes realized, setting the stage for extra advanced experiments to come back.
The up to date code will be discovered on GitHub.
Framework
The largest change from the earlier model of the code is that RL resolution strategies are actually carried out as courses. These courses expose frequent strategies like act()
(for choosing actions) and replace()
(for adjusting mannequin parameters).
Complementing this, a unified coaching script manages the interplay with the setting: it generates episodes and feeds them into the suitable technique for studying—utilizing the shared interface supplied by these class strategies.
This refactoring considerably simplifies and standardizes the coaching course of. Beforehand, every technique had its personal standalone coaching logic. Now, coaching is centralized, and every technique’s function is clearly outlined and modular.
Earlier than diving into the strategy courses intimately, let’s first have a look at the coaching loop for single-player environments:
def train_single_player(
env: ParametrizedEnv,
technique: RLMethod,
max_steps: int = 100,
callback: Callable | None = None,
) -> tuple[bool, int]:
"""Trains a way on single-player environments.
Args:
env: env to make use of
technique: technique to make use of
max_steps: maximal variety of replace steps
callback: callback to find out if technique already solves the given downside
Returns:
tuple of success, discovered coverage, variety of replace steps
"""
for step in vary(max_steps):
commentary, _ = env.env.reset()
terminated = truncated = False
episode = []
cur_episode_len = 0
whereas not terminated and never truncated:
motion = technique.act(commentary, step)
observation_new, reward, terminated, truncated, _ = env.step(
motion, commentary
)
episode.append(ReplayItem(commentary, motion, reward))
technique.replace(episode, step)
commentary = observation_new
# NOTE: that is extremely depending on setting dimension
cur_episode_len += 1
if cur_episode_len > env.get_max_num_steps():
break
episode.append(ReplayItem(observation_new, -1, reward, []))
technique.finalize(episode, step)
if callback and callback(technique, step):
return True, step
env.env.shut()
return False, step
Let’s visualize what a accomplished episode seems like—and when the replace()
and finalize()
strategies are referred to as throughout the course of:
After every replay merchandise is processed—consisting of a state, the motion taken, and the reward acquired—the strategy’s replace()
perform is named to regulate the mannequin’s inner parameters. The precise conduct of this perform relies on the algorithm getting used.
To offer you a concrete instance, let’s take a fast have a look at how this works for Q-learning.
Recall the Q-learning replace rule:

When the second name to replace()
happens, we have now St = s1, At = a1 and Rt+1 = r2.
Utilizing this data, the Q-learning agent updates its worth estimates accordingly.
Unsupported Strategies
Dynamic programming (DP) strategies don’t match above launched construction – since they’re based mostly upon iterating over all states of the setting. For that motive, we go away their code untouched and deal with coaching them otherwise.
Additional, we fully take away the help for Prioritized Sweeping. Additionally, right here we have to iterate over states ultimately to search out predecessor states, which, once more – doesn’t match into our replace coaching construction – and, extra importantly, isn’t possible for extra advanced multi-player video games, the place the variety of states is far bigger and tougher to iterate.
Since this technique in any case didn’t produce good outcomes, we concentrate on the remaining ones. Word: an analogous reasoning was finished for DP strategies: these can’t be prolonged so simply to multi-player video games, and thus might be of lesser curiosity sooner or later.
Bugs
Bugs occur — in all places, and this mission isn’t any exception. On this part, I’ll spotlight a very impactful bug that made its approach into the outcomes of the earlier put up, together with some minor adjustments and enhancements. I’ll additionally clarify how these affected earlier outcomes.
Incorrect Motion Chance Calculation
Some strategies require the chance of a selected motion throughout the replace step. Within the earlier code model, we had:
def _get_action_prob(Q: np.ndarray) -> float:
return (
Q[observation_new, a] / sum(Q[observation_new, :])
if sum(Q[observation_new, :])
else 1
)
This labored just for strictly optimistic Q-values, however broke down when Q-values had been adverse — making the normalization invalid.
The corrected model handles each optimistic and adverse Q-values correctly utilizing a softmax strategy:
def _get_action_prob(self, commentary: int, motion: int) -> float:
probs = [self.Q[observation, a] for a in vary(self.env.get_action_space_len())]
probs = np.exp(probs - np.max(probs))
return probs[action] / sum(probs)
This bug considerably impacted Anticipated SARSA and n-step Tree Backup, as their updates relied closely on motion possibilities.
Tie-Breaking in Grasping Motion Choice
Beforehand, when producing episodes, we both chosen the grasping motion or sampled randomly with ε-greedy logic:
def get_eps_greedy_action(q_values: np.ndarray, eps: float = 0.05) -> int:
if random.uniform(0, 1) < eps or np.all(q_values == q_values[0]):
return int(np.random.alternative([a for a in range(len(q_values))]))
else:
return int(np.argmax(q_values))
Nevertheless, this didn’t correctly deal with ties, i.e., when a number of actions shared the identical most Q-value. The up to date act()
technique now contains truthful tie-breaking:
def act(
self, state: int, step: int | None = None, masks: np.ndarray | None = None
) -> int:
allowed_actions = self.get_allowed_actions(masks)
if self._train and step and random.uniform(0, 1) < self.env.eps(step):
return random.alternative(allowed_actions)
else:
q_values = [self.Q[state, a] for a in allowed_actions]
max_q = max(q_values)
max_actions = [a for a, q in zip(allowed_actions, q_values) if q == max_q]
return random.alternative(max_actions)
A small change, however probably fairly related – since this e.g. stimulates a extra explorative motion choice in the beginning of every coaching, the place all Q-values are equal.
This small change might have a noticeable impression—particularly early in coaching, when all Q-values are initialized equally. It encourages a extra various exploration technique throughout the vital early section.
As beforehand mentioned—and as we’ll see once more beneath—RL strategies exhibit excessive variance, making the impression of such adjustments tough to measure exactly. Nevertheless, this adjustment appeared to barely enhance the efficiency of a number of strategies: Sarsa, Q-learning, Double Q-learning, and Sarsa-n.
Up to date Outcomes
Let’s now study the up to date outcomes — for completeness, we embody all strategies, not simply the improved ones.
However first, a fast reminder of the duty we’re fixing: we’re working with Gymnasium’s GridWorld setting [2] — basically a maze-solving activity:

The agent should navigate from the top-left to the bottom-right of the grid whereas avoiding icy lakes.
To judge every technique’s efficiency, we scale the gridworld dimension and measure the variety of replace steps till convergence.
Monte Carlo Strategies
These strategies weren’t affected by the current implementation adjustments, so we observe outcomes according to our earlier findings:
- Each are able to fixing environments as much as 25×25 in dimension.
- On-policy MC performs barely higher than off-policy.

Temporal Distinction Strategies
For these, we measure the next outcomes:

For these, we instantly discover that Anticipated Sarsa now fares significantly better, attributable to fixing above talked about bug about computing the motion possibilities.
But additionally the opposite strategies carry out higher: as talked about above, this might simply be likelihood / variance – or be a consequence of the opposite minor enhancements we did, particularly the higher dealing with of ties throughout motion choice.
TD-n
For TD-n strategies, our outcomes look a lot completely different:

Sarsa-n additionally has improved, in all probability for comparable causes as mentioned within the final part – however particularly n-step tree backup now performs very well – proving that with appropriate motion choice this certainly is a really highly effective resolution technique.
Planning
For planning, we solely have Dyna-Q left – which additionally appears to have improved barely:

Evaluating the Finest Answer Strategies on Bigger Environments
With that, let’s visualize the best-performing strategies from all classes in a single diagram. As a result of elimination of some strategies like DP, I now chosen on-policy MC, Sarsa, Q-learning, Sarsa-n, n-step tree backup and Dyna-Q.
We start by exhibiting outcomes for grid worlds as much as dimension 50 x 50:

We observe on-policy MC to carry out surprisingly properly — according to earlier findings. Its energy doubtless stems from its simplicity and unbiased estimates, which work properly for short- to medium-length episodes.
Nevertheless, not like the earlier put up, n-step Tree Backup clearly emerges because the top-performing technique. This aligns with idea: its use of anticipated multi-step backups allows clean and secure worth propagation, combining the strengths of off-policy updates with the steadiness of on-policy studying.
Subsequent, we observe a center cluster: Sarsa, Q-learning, and Dyna-Q — with Sarsa barely outperforming the others.
It’s considerably shocking that the model-based updates in Dyna-Q don’t result in higher efficiency. This would possibly level to limitations within the mannequin accuracy or the variety of planning steps used. Q-learning tends to underperform because of the elevated variance launched by its off-policy nature.
The worst-performing technique on this experiment is Sarsa-n, according to earlier observations. We suspect the degradation in efficiency comes from the elevated variance and bias attributable to n-step sampling with out expectation over actions.
It’s nonetheless considerably surprising that MC strategies outperform TD on this setting — historically, TD strategies are anticipated to do higher in giant environments. Nevertheless, that is mitigated in our setup by the reward shaping technique: we offer a small optimistic reward at every step because the agent strikes nearer to the purpose. This alleviates one among MC’s main weaknesses — poor efficiency in sparse reward settings.
TODO: doubtlessly embody new outcomes as much as dimension 100 x 100 (might be finished earlier than time of publication)
Conclusion and Learnings
On this put up, we shared updates to the RL framework developed over this sequence. Alongside numerous enhancements, we mounted some bugs — which considerably enhanced algorithm efficiency.
We then utilized the up to date strategies to more and more bigger GridWorld environments, with the next findings:
- n-step Tree Backup emerged as the perfect technique general, because of its anticipated multi-step updates that mix the advantages of each on- and off-policy studying.
- Monte Carlo strategies adopted, exhibiting surprisingly sturdy efficiency attributable to their unbiased estimates and the intermediate rewards guiding studying.
- A cluster of TD strategies — Q-learning, Sarsa, and Dyna-Q — adopted. Regardless of Dyna-Q’s model-based updates, it didn’t considerably outperform its model-free counterparts.
- Sarsa-n carried out worst, doubtless because of the compounded bias and variance launched by sampling n-step returns.
Thanks for studying this replace! Keep tuned for additional content material — subsequent up, we cowl multi-player video games and environments.