Benchmarking Tabular Reinforcement Learning Algorithms

In previous posts, we explored Part I of the seminal book Reinforcement Learning by Sutton and Barto [1] (*). In that part, we delved into the three fundamental techniques underlying nearly every modern Reinforcement Learning (RL) algorithm: Dynamic Programming (DP), Monte Carlo methods (MC), and Temporal Difference Learning (TD). We not only discussed algorithms from each field in depth but also implemented each one in Python.

Part I of the book focuses on tabular solution methods: approaches suited for problems small enough to be represented in table form. For instance, with Q-learning, we can compute and store a complete Q-table containing every possible state-action pair. In contrast, Part II of Sutton’s book tackles approximate solution methods. When state and action spaces become too large, or even infinite, we must generalize. Consider the challenge of playing Atari games: the state space is simply too vast to model exhaustively. Instead, deep neural networks are used to compress the state into a latent vector, which then serves as the basis for an approximated value function [2].

While we’ll venture into Part II in upcoming posts, I’m excited to announce a new series: we will benchmark all the algorithms introduced in Part I against one another. This post serves both as a summary and an introduction to our benchmarking framework. We’ll evaluate each algorithm based on how quickly and effectively it can solve increasingly larger Gridworld environments. In future posts, we plan to extend our experiments to more challenging scenarios, such as two-player games, where the differences between these methods will be even more apparent.

In summary, in this post we will:

Introduce the benchmark task and discuss our comparison criteria.
Provide a chapter-by-chapter summary of the methods introduced in Sutton’s book along with initial benchmarking results.
Identify the best-performing methods from each group and deploy them in a larger-scale benchmarking experiment.

Introducing the Benchmark Task and Experiment Planning

In this post we’ll work with the Gymnasium [3] environment “Gridworld”. It’s essentially a maze-finding task, in which the agent has to get from the top-left corner of the maze to the bottom-right tile (the present) without falling into any of the icy lakes:

Image by author

Each state is a number between 0 and N - 1, where N is the total number of tiles (16 in our case). There are four actions the agent can execute in every step: going left, right, up or down. Reaching the goal yields reward 1; falling into a lake ends the episode without any reward.

The nice thing about this environment is that one can generate random worlds of arbitrary size. Thus, what we’ll do with all methods is plot the number of steps / updates needed to solve the environment versus the environment size. In fact, Sutton does this in some parts of the book, too, so we can refer to that.

Preliminaries

I’d like to start with some general notes, hoping these will guide you through my thought process.

It’s not easy to compare algorithms in a “fair” way. When implementing the algorithms I mainly looked for correctness, but also simplicity: I wanted readers to easily be able to connect the Python code with the pseudocode from Sutton’s book. For “real” use cases, one would surely optimize the code more, and also use a large array of tricks common in RL, such as decaying exploration, optimistic initialization, learning rate tuning, and more. Further, one would take great care to tune the hyperparameters of the deployed algorithm.

Applied RL Tricks

Due to the large number of algorithms under investigation, I cannot do this here. Instead, I identified two important mechanisms, and measured their effectiveness on an algorithm known to work quite well: Q-learning. These are:

Intermediate rewards: instead of only rewarding the agent for reaching the goal, we reward it for progress along the maze. We quantify this by the (normalized) difference in x and y coordinates between current and previous state. This makes use of the fact that the goal in each Gridworld environment is always at the bottom right, and thus higher x / y coordinates usually are better (although one could still get “stuck”, in case an icy lake is between the agent and the goal). Since this difference is normalized by the number of states, its contribution is small, so that it does not overshadow the reward of reaching the goal.
Decaying exploration: throughout this post series, the exploration-exploitation dilemma came up frequently. It describes the trade-off between exploiting states / actions already known to be good, and exploring less explored states, potentially finding even better solutions at the risk of wasting time in less optimal areas. One common technique addressing this is to decay exploration: starting with a high amount of exploration early, and slowly decaying this down to lower levels. We do this by linearly decaying ε from 1 (100% random exploration) to 0.05 over 1000 steps.
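Both tricks fit in a few lines. The following is only a sketch of how they could be implemented: the function names, the row-major state indexing, and the exact form of the progress bonus are my assumptions for illustration, and whether a “step” of the decay counts episodes or updates is left open.

```python
def shaped_reward(prev_state, state, n, env_reward):
    """Add a small progress bonus to the environment reward.

    Progress is the change in row plus the change in column, which is
    positive when moving towards the bottom-right goal. It is divided by
    the number of states (n * n) so it stays small next to the goal reward.
    """
    prev_row, prev_col = divmod(prev_state, n)
    row, col = divmod(state, n)
    progress = (row - prev_row) + (col - prev_col)
    return env_reward + progress / (n * n)


def linear_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linearly decay epsilon from eps_start to eps_end over decay_steps."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


print(shaped_reward(0, 1, n=4, env_reward=0.0))   # one step to the right
print(linear_epsilon(0), linear_epsilon(500), linear_epsilon(2000))
```

On a 4 x 4 grid a single step towards the goal earns a bonus of 1/16, so even a full walk to the goal collects far less shaping reward than the goal reward of 1.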

Let’s take a look at how Q-learning performs with these techniques:

As we can see, in the baseline setup the number of steps needed quickly grows, and reaches the maximal number of allowed steps (100,000), meaning the algorithm did not solve the environment in the allotted number of steps. Decaying ε alone did not contribute much either. However, adding intermediate rewards proved to be extremely effective, and the combination of this and decaying ε performed best.

Thus, for most methods to come we start with the “naïve” setting, the baseline implementation. Later we show results for the “improved” setting consisting of intermediate rewards and decaying exploration.

Comparison Criteria

As was seen in the previous section, I chose the number of steps needed until the learned policy solves the Gridworld environment as the default way of comparing methods. This seems a bit more fair than just measuring elapsed time, since time depends on the concrete implementation (similar to the concept of Big O notation), and, as mentioned above, I did not optimize for speed. However, it is important to note that steps can also be misleading, as e.g. one step in DP methods contains a loop over all states, whereas one step in MC and TD methods is the generation of a single episode (for TD methods we usually count one step as one value update, i.e. an episode generation consists of multiple steps; however, I made this more comparable to MC methods on purpose). Due to this we also show elapsed time sometimes.

Experiment Structure

To reduce the variance, for each Gridworld size we run each method three times, and then save the lowest number of steps needed.
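As a rough sketch of this protocol (not the actual benchmarking code from the repository), the best-of-three logic could look like this. `train_fn` is a hypothetical callable that trains one method on an n x n grid and returns the number of steps it needed, and `dummy_train` only produces placeholder numbers so that the example runs.

```python
import random


def best_of_runs(train_fn, sizes, runs=3, max_steps=100_000):
    """For each grid size, run train_fn several times and keep the minimum."""
    results = {}
    for n in sizes:
        # Cap every run at the step limit, then keep the best run.
        steps = [min(train_fn(n, seed), max_steps) for seed in range(runs)]
        results[n] = min(steps)
    return results


def dummy_train(n, seed):
    # Placeholder for a real training routine; the step count is made up.
    rng = random.Random(seed)
    return n * n * rng.randint(5, 10)


print(best_of_runs(dummy_train, sizes=[4, 5, 6]))
```

Keeping the minimum rather than the mean is a deliberate choice here: it reports how fast a method can be when a run goes well, at the cost of hiding how often runs go badly.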

The code required to run all following benchmarking can be found on GitHub.

Recap and Benchmarking of All Algorithms

With the fundamentals out of the way, let’s properly get started. In this section we will recap all introduced algorithms from Part I of Sutton’s book. Further, we will benchmark them against each other on the previously introduced Gridworld task.

Dynamic Programming

We start with Chapter 4 of Sutton’s book, describing methods from DP. These can be applied to a wide variety of problems, always building on the principle of iteratively building larger solutions from smaller subproblems. In the context of RL, DP methods maintain a Q-table which is filled out incrementally. For this, we require a model of the environment, and then, using this model, update the expected value of states or state-action pairs depending on the possible successor states. Sutton introduces two methods we picked up in our corresponding post: policy and value iteration.

Let’s start with policy iteration. This consists of two interleaved steps, namely policy evaluation and policy improvement. Policy evaluation uses DP to, as the name says, evaluate the current policy: we incrementally update the state estimates by using the model and policy. Next comes policy improvement, which employs a fundamental concept of RL: according to the policy improvement theorem, any policy gets better when changing the predicted action in one state to a better action. Following this, we construct the new policy from the Q-table in greedy fashion. This is repeated until the policy has converged.

The corresponding pseudocode looks as follows:
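In Python, a minimal sketch of policy iteration could look like the code below. It assumes a deterministic grid model built by a hypothetical helper `make_model`, keeps state values V with a one-step lookahead instead of a full Q-table, and uses a hole layout I picked for illustration; it is not the implementation from the repository.

```python
def make_model(n, holes=()):
    """Deterministic n x n grid: model[s][a] = (next_state, reward, done)."""
    moves = [(0, -1), (1, 0), (0, 1), (-1, 0)]   # left, down, right, up
    goal, model = n * n - 1, {}
    for s in range(n * n):
        row, col = divmod(s, n)
        model[s] = []
        for d_row, d_col in moves:
            r = min(max(row + d_row, 0), n - 1)
            c = min(max(col + d_col, 0), n - 1)
            s2 = r * n + c
            model[s].append((s2, 1.0 if s2 == goal else 0.0,
                             s2 == goal or s2 in holes))
    return model


def q_value(model, V, s, a, gamma):
    """One-step lookahead: reward plus discounted value of the successor."""
    s2, reward, done = model[s][a]
    return reward + (0.0 if done else gamma * V[s2])


def policy_iteration(model, terminal, gamma=0.9, theta=1e-8):
    states = [s for s in model if s not in terminal]
    V = {s: 0.0 for s in model}
    policy = {s: 0 for s in states}
    while True:
        # Policy evaluation: sweep until the value estimates stop changing.
        while True:
            delta = 0.0
            for s in states:
                v_new = q_value(model, V, s, policy[s], gamma)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: switch to a strictly better action if one exists.
        stable = True
        for s in states:
            old_q = q_value(model, V, s, policy[s], gamma)
            best = max(range(len(model[s])),
                       key=lambda a: q_value(model, V, s, a, gamma))
            if q_value(model, V, s, best, gamma) > old_q + 1e-12:
                policy[s], stable = best, False
        if stable:
            return policy, V


holes = {5, 7, 11, 12}
model = make_model(4, holes)
policy, V = policy_iteration(model, terminal=holes | {15})
print(policy[0], round(V[0], 5))
```

The strict-improvement check (with a small tolerance) avoids flipping back and forth between equally good actions, which would otherwise keep the loop from terminating.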

Let’s come to value iteration. This is very similar to policy iteration, but with a simple, yet crucial modification: in every loop, only one step of policy evaluation is run. It can be shown that this still converges to the optimal policy, and overall does so faster than policy iteration:
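A compact sketch of value iteration, under assumptions of my own: the model is a plain dictionary `model[s][a] = (next_state, reward, done)`, and to keep the example tiny I use a hypothetical one-dimensional corridor instead of a full Gridworld. Each sweep merges evaluation and improvement by taking the maximum over actions directly.

```python
def value_iteration(model, gamma=0.9, theta=1e-8):
    """model[s][a] = (next_state, reward, done); terminal states have no entry."""
    V = {s: 0.0 for s in model}

    def backup(s, a):
        s2, reward, done = model[s][a]
        return reward + (0.0 if done else gamma * V.get(s2, 0.0))

    sweeps = 0
    while True:
        delta = 0.0
        for s in model:
            # A single evaluation step, already greedy over actions.
            v_new = max(backup(s, a) for a in range(len(model[s])))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        sweeps += 1
        if delta < theta:
            break
    policy = {s: max(range(len(model[s])), key=lambda a: backup(s, a))
              for s in model}
    return policy, V, sweeps


# Toy corridor with states 0..3 and a terminal goal state 4.
# Action 0 moves left, action 1 moves right; reaching state 4 pays reward 1.
corridor = {
    s: [(max(s - 1, 0), 0.0, False),
        (s + 1, 1.0 if s + 1 == 4 else 0.0, s + 1 == 4)]
    for s in range(4)
}
policy, V, sweeps = value_iteration(corridor)
print(policy, V, sweeps)
```

In this toy case the value information travels one state further from the goal per sweep, which illustrates why the number of sweeps alone says little about the actual compute: every sweep touches all states.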

For more details, here’s my corresponding post about DP methods.

Results

Now it’s time to see what these first two algorithms are really made of. We run the experiment sketched above, and get the following plot:

Both methods are able to solve all created Gridworld sizes in the minimal number of steps, 100. Surprising? Well, this actually shows both a strength and a weakness of DP methods, and at the same time of our methodology: DP methods are “thorough”, they require a complete model of the world, and then iterate over all states, yielding a good solution with just a few passes over all states. However, this means that all states have to be estimated until convergence, even though some of them might be much less interesting, and this scales quite badly with the environment size. In fact, one measured step here contains a full sweep over all states, indicating that for these methods time is a better measure.

For this, we get the following graph:

Now we can see the increase in compute needed w.r.t. the number of states. And we can also see that, as claimed, value iteration converges much faster, and scales much better. Note that the x-axis labels denote n, with the Gridworld having a size of n x n.

Monte Carlo Methods

Next in our post series on RL we covered MC methods. These can learn from experience alone, i.e. one can run them in any kind of environment without having a model of it, which is a surprising realization, and very useful: often, we don’t have this model; other times, it would be too complex and impractical to use. Consider the game of Blackjack: while we can certainly model all possible outcomes and corresponding probabilities, it is a very tedious task, and learning to play by just doing that is a very tempting thought. Because they learn from complete sampled returns instead of a model, MC methods are unbiased, but on the downside their estimates have a high variance.

One issue when implementing these methods is making sure that all state-action pairs are continuously visited, and thus updated. Due to not having a model, we cannot simply iterate over all possible combinations (compare e.g. DP methods), but (in a way) randomly explore the environment. If due to this we missed some states entirely, we would lose theoretical convergence guarantees, which would translate into practice.

One way of satisfying this is the exploring starts assumption (ES): we start each episode in a random state and choose the first action randomly, too. Apart from that, MC methods can be implemented rather simply: we play out full episodes, and set the expected value of state-action pairs to the average obtained return.

MC with ES looks as follows:

To remove the assumption of ES, we can resort to two classes of algorithms: on- and off-policy methods. Let’s start with the on-policy one.

This is actually not too different from the ES algorithm; we simply use an ε-greedy policy for generating episodes. That is, we remove the assumption of ES and use a “soft” instead of a “hard” policy for generating episodes: the policy used in every iteration is not fully greedy, but ε-greedy, which guarantees that in the limit we see all possible state-action pairs:
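The two building blocks of on-policy MC control can be sketched as follows. This is a simplified first-visit variant under my own assumptions (a dictionary-based Q-table keyed by `(state, action)`, hypothetical function names); with exploring starts, only the way an episode begins would differ.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, n_actions, epsilon, rng=random):
    """With probability epsilon act randomly, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])


def mc_update(Q, counts, episode, gamma=1.0):
    """First-visit MC update from one episode of (state, action, reward)."""
    first_visit = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)
    G = 0.0
    # Walk backwards so G accumulates the discounted return from step t on.
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = gamma * G + r
        if first_visit[(s, a)] == t:
            counts[(s, a)] += 1
            # Incremental average of all returns observed for (s, a).
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]


Q, counts = defaultdict(float), defaultdict(int)
mc_update(Q, counts, [(0, 1, 0.0), (1, 1, 0.0), (2, 1, 1.0)], gamma=0.9)
print(dict(Q))
```

A full training loop would alternate between generating an episode with `epsilon_greedy` and calling `mc_update` on it.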

Off-policy methods follow the idea of splitting exploration and learning into two policies. We maintain a policy π, which we want to optimize, and a behavior policy, b.

However, we can’t simply use b everywhere in our algorithm. When generating an episode and computing returns, we obtain:

I.e., the resulting value is the expected value of b, not π.

This is where importance sampling comes in. We can fix this expectation with the right ratio:

This ratio is defined by:

In our case, we obtain the following formula:

(Note that this uses weighted importance sampling, instead of “naïve”, i.e. ordinary, importance sampling.)
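For reference, these quantities take the following standard form in Sutton and Barto’s notation [1]. The first line is the expectation obtained when following b, the second the corrected expectation, the third the importance-sampling ratio. The last line is the weighted importance-sampling estimate for action values, where I assume that \mathcal{T}(s, a) denotes the time steps at which the pair (s, a) was visited and T(t) the end of the episode containing t:

```latex
\begin{align}
\mathbb{E}_b\left[ G_t \mid S_t = s \right] &= v_b(s) \\
\mathbb{E}_b\left[ \rho_{t:T-1} \, G_t \mid S_t = s \right] &= v_\pi(s) \\
\rho_{t:T-1} &= \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)} \\
Q(s, a) &= \frac{\sum_{t \in \mathcal{T}(s, a)} \rho_{t+1:T(t)-1} \, G_t}
                {\sum_{t \in \mathcal{T}(s, a)} \rho_{t+1:T(t)-1}}
\end{align}
```

For action values the ratio starts at t + 1, because the action taken at time t is already fixed by the pair being evaluated.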

We could of course compute these ratios naively in every step. However, Sutton introduces a clever scheme updating these values (denoted by W) incrementally, which is much more efficient. In fact, in my original post I showed the naive version, too, as I believe this helps with understanding. However, since here we mainly care about benchmarking, and the “naïve” and the “incremental” version are identical up to performance, we only list the slightly more complex incremental version here.

In pseudocode the corresponding algorithm looks as follows:

Note that, as opposed to our initial post introducing these methods, where the behavior policy was simply random, here we pick a better one, namely an ε-greedy one w.r.t. the current Q-table.
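The incremental scheme can be sketched in Python like this. The function and argument names are hypothetical: `target_action(state)` returns the greedy action of π, and `behavior_prob(state, action)` returns b(action | state). For brevity the usage example uses a uniformly random behavior policy; for an ε-greedy one, `behavior_prob` would return 1 - ε + ε / |A| for the greedy action and ε / |A| otherwise.

```python
from collections import defaultdict


def off_policy_mc_update(Q, C, episode, target_action, behavior_prob, gamma=1.0):
    """Incremental weighted importance-sampling update for one episode.

    episode: list of (state, action, reward) generated by the behavior policy.
    target_action(state): greedy action of the target policy pi.
    behavior_prob(state, action): probability b(action | state).
    """
    G, W = 0.0, 1.0
    for s, a, r in reversed(episode):
        G = gamma * G + r
        C[(s, a)] += W                              # cumulative weight
        Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
        if a != target_action(s):
            break    # pi would never take a, so all earlier weights are zero
        W /= behavior_prob(s, a)                    # pi(a | s) = 1 for greedy pi


Q, C = defaultdict(float), defaultdict(float)


def greedy(state):
    return max(range(2), key=lambda a: Q[(state, a)])


def uniform(state, action):
    return 0.5


off_policy_mc_update(Q, C, [(0, 1, 0.0), (1, 1, 1.0)], greedy, uniform)
print(dict(Q), dict(C))
```

The early `break` is also a hint at why off-policy MC can be slow: as soon as the behavior policy deviates from the greedy action, the rest of the episode contributes nothing.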

For more details, here’s my corresponding post on MC methods.

Results

With that, let’s compare these three algorithms on small Gridworld environments. Note that one step here denotes one full episode generated:

We observe off-policy MC already timing out at a Gridworld size of 5×5, and, even though MC with ES and on-policy MC perform better, they also start to struggle with larger sizes.

This might be somewhat surprising, and disappointing for MC fans. Don’t worry, we will manage to boost this. However, it shows a weakness of these algorithms: in “large” environments with sparse rewards, MC methods basically have to hope to stumble upon the goal by chance, and the probability of doing so decreases exponentially with the size of the environment.

Thus, let’s try to make the task easier for the agent, and use the previously introduced tricks empirically found to help the performance of TD learning: adding intermediate rewards and ε-decay, our “improved” setup.

In fact, with this all methods fare much better:

However, now MC ES is causing problems. Thus, let’s put this aside and proceed without it: ES was in any case a theoretical concept on the way to developing MC methods, and clunky to use / implement (some might remember how I implemented having the environment start in random states ...):

Here, at least we get close to the results of DP. Note that I capped the maximal number of steps at 100,000, so whenever this number shows up in the graph it means that the algorithm could not solve this environment within the given step limit. On-policy MC actually seems to perform very well, as the number of steps needed barely increases, but off-policy MC seems to perform worse.

Discussion

To me, MC methods perform surprisingly well, since they essentially stumble around the environment randomly in the beginning, hoping to find the goal by exploration alone. However, of course this is not entirely true: their performance (speaking of on-policy MC) becomes really good only after enabling intermediate rewards, which guide the agent towards the goal. In this setup it seems MC methods perform very well, one potential reason being that they are unbiased and less sensitive to hyperparameter tuning and the like.

Temporal-Difference Learning

Let’s come to TD methods. These can be seen as combining the strengths of both approaches previously introduced: similar to MC, they don’t need a model of the environment, but they still build upon previous estimates, i.e. they bootstrap, as in DP.

Let’s recap DP and MC methods:

DP methods turn the Bellman equation into an update rule, and compute the value of a state based on the estimated values of its successor states:

MC methods, on the other hand, play out full episodes and then update their value estimates based on the observed return:

TD methods combine these two ideas. They play out full episodes, but after every step update value estimates with the observed reward and the previous estimate:
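Written out in Sutton and Barto’s notation [1], the three update rules read as follows. I state them here for state values under a fixed policy π with step size α and discount γ, which is my choice of presentation; the DP line is the expected update used in policy evaluation:

```latex
\begin{align}
\text{DP:}\quad V(s) &\leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V(s') \right] \\
\text{MC:}\quad V(S_t) &\leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right] \\
\text{TD:}\quad V(S_t) &\leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]
\end{align}
```

The TD target R_{t+1} + γ V(S_{t+1}) replaces the full return G_t of MC by one sampled reward plus a bootstrapped estimate, which is what allows updates after every single step.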

Some of the most fundamental RL algorithms stem from this field, and we will discuss them in the following.

Let’s begin with Sarsa. First, we modify the update rule introduced above to work with state-action pairs:

With this, Sarsa is actually introduced rather quickly: we play episodes, and update values following our current policy. The name comes from the tuples used in the updates, (S, A, R, S′, A′):

In pseudocode this looks as follows:
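The core of Sarsa is a single update on the tuple (S, A, R, S′, A′). A minimal sketch, assuming a dictionary-based Q-table keyed by `(state, action)` and hypothetical default values for α and γ:

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.9):
    """One Sarsa update from the tuple (S, A, R, S', A')."""
    # Bootstrap from the action the policy actually chose in s_next.
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


Q = defaultdict(float)
Q[(1, 0)] = 0.5
sarsa_update(Q, s=0, a=1, r=0.0, s_next=1, a_next=0, done=False)
print(Q[(0, 1)])
```

In a full training loop, `a_next` is sampled from the current (e.g. ε-greedy) policy before the update and then actually executed in the next step, which is what makes Sarsa on-policy.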

Next up we have Q-learning. This is very similar to Sarsa, with one key difference: it is an off-policy algorithm. Instead of simply following the executed transition during the update, we take the maximum Q-value over all actions in the successor state:

You can picture this as creating a behavior policy b, which is equal to π, except being greedy in the transitions under question.

The pseudocode looks like this:
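As a sketch, the Q-learning update differs from the Sarsa update in a single line: the bootstrap term. The dictionary-based Q-table and the default hyperparameters are again assumptions for illustration.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, done, n_actions, alpha=0.1, gamma=0.9):
    """One Q-learning update: bootstrap from the best action in s_next."""
    best_next = max(Q[(s_next, b)] for b in range(n_actions))
    target = r if done else r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])


Q = defaultdict(float)
Q[(1, 0)], Q[(1, 1)] = 0.5, 2.0
q_learning_update(Q, s=0, a=1, r=0.0, s_next=1, done=False, n_actions=2)
print(Q[(0, 1)])
```

Note that the update never asks which action will actually be taken in `s_next`; that independence from the behavior is what makes the algorithm off-policy.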

Another algorithm is Expected Sarsa, which (you guessed it) is an extension of Sarsa. Instead of following the one transition executed by the policy, we account for all possible actions in the successor state, and weigh them by how likely they are given the current policy:
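A minimal sketch of the Expected Sarsa update, assuming the current policy is ε-greedy with respect to a dictionary-based Q-table (the function names and defaults are hypothetical):

```python
from collections import defaultdict


def epsilon_greedy_probs(Q, state, n_actions, epsilon):
    """Action probabilities of an epsilon-greedy policy in `state`."""
    greedy = max(range(n_actions), key=lambda a: Q[(state, a)])
    probs = [epsilon / n_actions] * n_actions
    probs[greedy] += 1.0 - epsilon
    return probs


def expected_sarsa_update(Q, s, a, r, s_next, done, n_actions,
                          epsilon=0.1, alpha=0.1, gamma=0.9):
    """Bootstrap from the policy-weighted average of Q in s_next."""
    probs = epsilon_greedy_probs(Q, s_next, n_actions, epsilon)
    expected = sum(p * Q[(s_next, b)] for b, p in enumerate(probs))
    target = r if done else r + gamma * expected
    Q[(s, a)] += alpha * (target - Q[(s, a)])


Q = defaultdict(float)
Q[(1, 0)], Q[(1, 1)] = 0.0, 2.0
expected_sarsa_update(Q, s=0, a=1, r=0.0, s_next=1, done=False,
                      n_actions=2, epsilon=0.5)
print(Q[(0, 1)])
```

With ε = 0 the expectation collapses to the maximum, and the update becomes identical to Q-learning.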

The last algorithm in this chapter is an extension of Q-learning. Q-learning suffers from a problem known as maximization bias: since it uses a maximum over estimated values, the resulting estimate can have a positive bias. We can address this by using two Q-tables: for each update we use one for selecting a value-maximizing action, and the other for computing the update target. Which is used where is determined by a coin flip. The algorithm is known as Double Q-learning:
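The update can be sketched as follows, again with dictionary-based Q-tables and hypothetical names. The optional `rng` argument is only there so the coin flip can be controlled in examples.

```python
import random
from collections import defaultdict


def double_q_update(Q1, Q2, s, a, r, s_next, done, n_actions,
                    alpha=0.1, gamma=0.9, rng=random):
    """One Double Q-learning update; a coin flip picks the table to update."""
    if rng.random() < 0.5:
        select, evaluate = Q1, Q2
    else:
        select, evaluate = Q2, Q1
    # `select` chooses the maximizing action and is the table being updated;
    # `evaluate` supplies the value of that action for the target.
    best = max(range(n_actions), key=lambda b: select[(s_next, b)])
    target = r if done else r + gamma * evaluate[(s_next, best)]
    select[(s, a)] += alpha * (target - select[(s, a)])


Q1, Q2 = defaultdict(float), defaultdict(float)
Q1[(1, 0)], Q2[(1, 0)] = 1.0, 0.2
double_q_update(Q1, Q2, s=0, a=1, r=0.0, s_next=1, done=False,
                n_actions=2, rng=random.Random(0))
print(Q1[(0, 1)], Q2[(0, 1)])
```

Because the action is selected with one table and valued with the other, an overestimated entry in one table is not automatically reinforced. For acting, one would typically be ε-greedy with respect to the sum of both tables.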

Results

Let’s take a look at the results, starting with the naïve setting:

We can see that both Q-learning methods start to run into problems at Gridworld sizes of 11 x 11.

Thus let’s apply our known tricks, yielding the “improved” setup:

All methods can now find solutions significantly quicker; only Expected Sarsa drops out. This might well be expected: it is considerably less used than Q-learning or Sarsa, and maybe more of a theoretical concept.

Thus, let’s continue without this method and see how large a world we can solve:

Q-learning can now also solve grid sizes of 25 x 25 without problems, but Sarsa and Double Q-learning start to degrade.

More details can be found in my introductory post about TD methods.

Discussion

In the improved setup, TD methods generally perform well. We only eliminated Expected Sarsa early, which in any case is not such a common algorithm.

“Simple” Sarsa and Double Q-learning struggle for larger environment sizes, while Q-learning performs well overall. The latter is somewhat surprising, since Double Q-learning should address some of the shortcomings of standard Q-learning, namely the overestimation caused by maximization bias. Potentially, we already dampen this effect by running each experiment several times and keeping the best run. Another hypothesis could be that Double Q-learning takes longer to converge, since the number of parameters has also doubled, which would indicate that the power of Double Q-learning shows better for more complex problems with more time.

As mentioned, Q-learning performs better than Sarsa. This mirrors what we can see in research / literature, namely that Q-learning is significantly more popular. This can probably be explained by it being off-policy, which usually yields more powerful solution methods. Sarsa, on the other hand, performs better for stochastic or “dangerous” tasks: since in Sarsa the actually chosen action is taken into account in the value update, it better understands the effects of its actions, which comes in handy for stochastic environments and / or environments where one can, e.g., fall off a cliff. Despite the latter being the case here, the environment is probably not complex or large enough for this effect to come into play.

TD-n

TD-n methods in a way marry classical TD learning and MC methods. As Sutton so nicely puts it, they “free us from the tyranny of the timestep” [1]. In MC methods, we are forced to wait a full episode before making any updates. In TD methods, we update estimates in every step, but are also forced to only look one step into the future.

Thus, it makes sense to introduce n-step returns:

With that, we can simply introduce Sarsa-n:

We play episodes following the current policy, and then update the value estimates with the n-step return.
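To make the n-step return concrete, here is a small sketch. It assumes the n rewards following the pair being updated have already been collected; the function names and the dictionary-based Q-table are hypothetical. The return is G = R_1 + γ R_2 + ... + γ^(n-1) R_n + γ^n Q(S_n, A_n), with indices counted relative to the updated pair.

```python
from collections import defaultdict


def n_step_return(rewards, Q, bootstrap_sa, gamma=0.9):
    """Discounted sum of n rewards plus a bootstrapped tail.

    rewards: the n rewards observed after the state-action pair being updated.
    bootstrap_sa: the (state, action) reached after n steps, or None if the
    episode ended before that, in which case nothing is bootstrapped.
    """
    G = sum(gamma ** i * r for i, r in enumerate(rewards))
    if bootstrap_sa is not None:
        G += gamma ** len(rewards) * Q[bootstrap_sa]
    return G


def n_step_sarsa_update(Q, sa, rewards, bootstrap_sa, alpha=0.1, gamma=0.9):
    """Move Q(sa) towards the n-step return."""
    G = n_step_return(rewards, Q, bootstrap_sa, gamma)
    Q[sa] += alpha * (G - Q[sa])


Q = defaultdict(float)
Q[(3, 1)] = 1.0
n_step_sarsa_update(Q, sa=(0, 1), rewards=[0.0, 0.0, 1.0],
                    bootstrap_sa=(3, 1), gamma=0.5)
print(Q[(0, 1)])
```

With a single reward this reduces to the ordinary Sarsa target, and with `bootstrap_sa=None` and all rewards until the end of the episode it becomes the MC return.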

In my corresponding post, we also introduce an off-policy version of this. However, to keep this post from growing too long, and given our negative experience with off-policy MC methods, we focus on the “classics”, such as Sarsa-n, and n-step tree backup, which we introduce next.

n-step tree backup is an extension of the previously seen Expected Sarsa. When computing the n-step return, the corresponding transition tree looks as follows:

I.e., there is a single path down the tree corresponding to the actual action taken. Just as in Expected Sarsa, we now want to weigh actions according to their probability determined by the policy. But since we now have a tree of depth > 1, the cumulative value of later levels is weighted by the probability of the action taken to reach these levels:

The pseudocode looks as follows:
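The tree-backup return can be computed by walking the n transitions backwards. The sketch below follows the recursive definition in Sutton and Barto [1]; the data layout of `steps` and the callable `pi` are my assumptions for illustration.

```python
from collections import defaultdict


def tree_backup_return(steps, Q, pi, n_actions, gamma=0.9):
    """n-step tree-backup return.

    steps: list of (reward, next_state, next_action, done) for the n
    transitions after the pair being updated; next_action is the action
    actually taken in next_state (ignored for the last step).
    pi(state): list of target-policy action probabilities in `state`.
    """
    G = 0.0
    last = len(steps) - 1
    for i in reversed(range(len(steps))):
        reward, s_next, a_next, done = steps[i]
        if done:
            G = reward
            continue
        probs = pi(s_next)
        if i == last:
            # Leaf of the tree: full expectation over all actions.
            G = reward + gamma * sum(probs[a] * Q[(s_next, a)]
                                     for a in range(n_actions))
        else:
            # Inner node: untaken actions bootstrap from Q, the taken action
            # continues down the tree with weight pi(a_next | s_next).
            others = sum(probs[a] * Q[(s_next, a)]
                         for a in range(n_actions) if a != a_next)
            G = reward + gamma * (others + probs[a_next] * G)
    return G


def uniform_pi(state):
    return [0.5, 0.5]


Q = defaultdict(float)
Q[(1, 0)], Q[(1, 1)], Q[(2, 0)], Q[(2, 1)] = 2.0, 4.0, 1.0, 3.0
steps = [(0.0, 1, 1, False), (1.0, 2, 0, False)]
G = tree_backup_return(steps, Q, uniform_pi, n_actions=2, gamma=1.0)
print(G)
```

With a single step, this is exactly the Expected Sarsa target; no importance-sampling ratios are needed because untaken actions are accounted for through their Q-values.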

Here’s my corresponding post on n-step TD methods.

Results

As usual, we start with the “naïve” setting, and obtain the following results:

Sarsa-n starts to struggle already with smaller Gridworld sizes. Let’s see if the improved setup changes this:

Now indeed Sarsa-n performs much better, but n-step tree backup does not.

Discussion

I found this discovery unexpected and somewhat hard to explain. I’d love to hear your thoughts on this. In the meantime, I was chatting with my chat agent of choice, and came to this hypothesis: intermediate rewards potentially confuse the tree algorithm, since it needs to learn a similar return distribution over all possible actions. Further, the more ε decays, the more the expected distribution might differ from the behavior policy.

Model-Based Reinforcement Learning / Planning

In the previous chapter we discussed the topic “planning”; in the RL context, by this we mainly refer to model-based methods. That is: we have (or build) a model of the environment, and use this model to explore further “virtually”, and in particular use these explorations for more and better updates of the value function. The following image displays the integration of planning into learning very well:

In the top-right corner we see the “classical” RL training loop (also dubbed “direct” RL): starting with some value function / policy we act in the (real) environment, and use this experience to update our value function (or policy in the case of policy-gradient methods). When incorporating planning, we additionally learn a model of the world from this experience, and then use this model to generate further (virtual) experience, and update our value or policy function from this.

This actually is exactly the Dyna-Q algorithm, which looks as follows in pseudocode:

Steps (a) to (d) are our classical Q-learning, whereas the rest of the algorithm adds the novel planning functionality, in particular the world model learning.
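One iteration of Dyna-Q can be sketched like this, assuming a deterministic environment so that the learned model is just a dictionary from `(state, action)` to the last observed outcome. The function name and defaults are hypothetical; action selection and the environment interaction (steps (a) to (c)) are left to the caller.

```python
import random
from collections import defaultdict


def dyna_q_step(Q, model, s, a, r, s_next, done, n_actions,
                planning_steps=5, alpha=0.1, gamma=0.9, rng=random):
    """One Dyna-Q iteration after observing a real transition."""

    def q_update(s, a, r, s_next, done):
        best_next = max(Q[(s_next, b)] for b in range(n_actions))
        target = r if done else r + gamma * best_next
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    q_update(s, a, r, s_next, done)          # direct RL from real experience
    model[(s, a)] = (r, s_next, done)        # deterministic model learning
    for _ in range(planning_steps):          # planning from simulated experience
        ps, pa = rng.choice(list(model))     # previously seen state-action pair
        pr, ps_next, pdone = model[(ps, pa)]
        q_update(ps, pa, pr, ps_next, pdone)


Q, model = defaultdict(float), {}
dyna_q_step(Q, model, s=0, a=1, r=1.0, s_next=1, done=True, n_actions=2,
            planning_steps=2, alpha=0.5)
print(Q[(0, 1)], model)
```

Every real transition is thus followed by `planning_steps` additional Q-learning updates on replayed transitions, which is where the speed-up over plain Q-learning comes from when real interaction is the bottleneck.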

Another related algorithm is Prioritized Sweeping, which changes how we sample states for the “planning loop”: we explore and play in the real environment, while learning the model, and save state-action pairs with large expected value changes to a queue. Only with this queue do we start the “planning loop”, i.e. something similar to steps (e) and (f) above:

More details can be found in my previous post on model-based RL methods.

Results

Let’s start with the naïve setting:

Dyna-Q performs reasonably well, while Prioritized Sweeping struggles early on.

In the improved setting we see a similar thing:

Discussion

Prioritized Sweeping already performed poorly in the corresponding introductory post. I suspect there is either some issue or, more likely, this simply is a “tuning” thing, i.e. using a wrong sampling distribution.

Dyna-Q yields solid results.

Benchmarking the Best Algorithms

We have now seen the performance of all algorithms from Part I of Sutton’s book by benchmarking them per chapter and on Gridworlds of up to size 25 x 25. Already here we saw better and worse performing algorithms, and in particular already discarded a few candidates not suited for larger environments.

Now we want to benchmark the remaining ones, the best ones from each chapter, against one another, on Gridworlds up to size 50 x 50.

These algorithms are:

value iteration
on-policy MC
Q-learning
Sarsa-n
Dyna-Q

Results

Here’s how they perform on Gridworld, this time with a maximal step limit of 200,000:

Let’s also plot the corresponding time needed (note that I plot unsuccessful runs, i.e. runs reaching the maximal number of steps without producing a feasible policy, at 500s):

We can observe several interesting facts from these figures:

The number of steps and the time needed are highly correlated.
Value iteration performs exceptionally well, solving even Gridworlds of size 50 x 50 with ease, and doing so orders of magnitude faster than the next-best algorithm.
The ranking for the remaining algorithms is (better to worse): On-policy MC, Dyna-Q, Q-learning, Sarsa-n.

In the next section we discuss these in more detail.

Discussion

1. Steps vs. Time

We started this post with a discussion on which metrics / measurements to use, and in particular whether to use the number of steps or the time needed to solve the problem. Looking back, we can say that this discussion was not so relevant after all, and, somewhat surprisingly, these two numbers are highly correlated. That is despite the fact that, as initially described, one “step” can differ depending on the algorithm.

2. Value Iteration Dominates

Value Iteration performed remarkably well, solving even large Gridworlds (up to 50×50) with ease and outpacing all other algorithms by a wide margin. This might be surprising, considering that DP methods are often considered theoretical tools, rarely used in practice. Real-world applications tend to favor methods like Q-learning [2], PPO [4], or MCTS [5].

So why does such a “textbook” method dominate here? Because this environment is tailor-made for it:

The model is fully known.
The dynamics are simple and deterministic.
The state space is relatively small.

These are exactly the conditions under which DP thrives. In contrast, model-free methods like Q-learning are designed for settings where such information is not available. Their strength lies in generality and scalability, not in exploiting small, well-defined problems. Q-learning incurs high variance and requires many episodes to converge, disadvantages that are magnified in small-scale environments. In short, there’s a clear trade-off between efficiency and generality. We’ll revisit this point in a future post when we introduce function approximation, where Q-learning has more room to shine.

3. A Ranking Emerges

Beyond Value Iteration, we observed the following performance ranking: On-policy MC > Dyna-Q > Q-learning > Sarsa-n

On-policy Monte Carlo emerged as the best-performing model-free algorithm. This fits with our earlier reasoning: MC methods are simple, unbiased, and well-suited to problems with deterministic goals, especially when episodes are relatively short. While not scalable to large or continuous problems, MC methods seem to be quite effective in small to medium-sized tasks like Gridworld.

Dyna-Q comes next. This result reinforces our expectations: Dyna-Q blends model-based planning with model-free learning. Although the model is learned (not given, as in Value Iteration), it is still simple and deterministic here, making the learned model useful. This boosts performance significantly over plain Q-learning, its model-free counterpart.

Q-learning, while still powerful, underperforms in this context for the reasons discussed above: it’s a general-purpose algorithm that isn’t able to fully leverage the structure of simple environments.

Sarsa-n landed in last place. A likely explanation is the added bias introduced by bootstrapping in its multi-step updates. Unlike Monte Carlo methods, which estimate returns from full trajectories (unbiased), Sarsa-n uses bootstrapped estimates of future rewards. In small environments, this bias can outweigh the benefits of reduced variance.

Finally, let’s compare our results with the ones from Sutton:

Note that Sutton lists the total number of states on the x-axis, whereas we list n, with the total number of states being n x n. For 376 states, Sutton reports ~100k steps before the optimal solution is found, whereas we report 75k for 400 states (20 x 20), considering Dyna-Q. The numbers are highly comparable and provide a reassuring validation of our setup and implementation.

Conclusion

This post served both as a recap of our series on Part I of Sutton and Barto’s Reinforcement Learning [1] and as an extension beyond the book’s scope, by benchmarking all introduced algorithms on increasingly larger Gridworld environments.

We began by outlining our benchmarking setup, then revisited the core chapters of Part I: Dynamic Programming, Monte Carlo methods, Temporal-Difference learning, and Model-Based RL / Planning. In each section, we introduced key algorithms such as Q-learning, provided full Python implementations, and evaluated their performance on Gridworlds up to size 25×25. The goal of this initial round was to identify top performers from each algorithmic family. Based on our experiments, the standouts were:
Value Iteration, On-policy MC, Q-learning, Sarsa-n, and Dyna-Q. Python code to reproduce these results, and in particular implementations of all discussed methods, is available on GitHub.

Next, we stress-tested these top performers on larger environments (up to 50×50) and observed the following ranking:
Value Iteration > On-policy MC > Dyna-Q > Q-learning > Sarsa-n

While this result may be surprising, given the widespread use of Q-learning and the relatively rare application of Value Iteration and MC methods, it makes sense in context. Simple, fully-known, deterministic environments are ideal for Value Iteration and MC methods. In contrast, Q-learning is designed for more complex, unknown, and high-variance environments where function approximation becomes necessary. As we discussed, there’s a trade-off between efficiency in structured tasks and generality in complex ones.

That brings us to what’s next. In upcoming posts, we’ll push the boundaries further:

First, by benchmarking these methods in more challenging environments such as two-player games, where direct competition will expose their differences more starkly.
Then, we’ll dive into Part II of Sutton’s book, where function approximation is introduced. This unlocks the ability to scale reinforcement learning to environments far beyond what tabular methods can handle.

If you’ve made it this far, thanks for reading! I hope you enjoyed this deep dive, and I’d love to have you back for the next installment in the series.

References

[1] http://incompleteideas.net/book/RLbook2020.pdf

[2] https://arxiv.org/abs/1312.5602

[3] https://gymnasium.farama.org/index.html

[4] https://arxiv.org/abs/1707.06347

[5] https://arxiv.org/abs/1911.08265

(*) Images from [1] used with permission from the authors.
