
    Demystifying Policy Optimization in RL: An Introduction to PPO and GRPO

    By ProfitlyAI · May 26, 2025 · 17 min read


    Introduction

    Reinforcement learning (RL) has achieved remarkable success in teaching agents to solve complex tasks, from mastering Atari games and Go to training helpful language models. Two important techniques behind many of these advances are policy optimization algorithms called Proximal Policy Optimization (PPO) and the newer Generalized Reinforcement Policy Optimization (GRPO). In this article, we’ll explain what these algorithms are, why they matter, and how they work – in beginner-friendly terms. We’ll start with a quick overview of reinforcement learning and policy gradient methods, then introduce GRPO (including its motivation and core ideas), and dive deeper into PPO’s design, math, and advantages. Along the way, we’ll compare PPO (and GRPO) with other popular RL algorithms like DQN, A3C, TRPO, and DDPG. Finally, we’ll look at some code to see how PPO is used in practice. Let’s get started!

    Background: Reinforcement Learning and Policy Gradients

    Reinforcement learning is a framework in which an agent learns by interacting with an environment through trial and error. The agent observes the state of the environment, takes an action, and then receives a reward signal and possibly a new state in return. Over time, by trying actions and observing rewards, the agent adapts its behaviour to maximize the cumulative reward it receives. This loop of state → action → reward → next state is the essence of RL, and the agent’s goal is to find a good policy (a strategy for choosing actions based on states) that yields high rewards.
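
    To make that loop concrete, here is a minimal sketch of the interaction cycle using the gymnasium API (the same library used in the code example later). A random policy stands in for the agent – the learning part comes later:

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    obs, info = env.reset()

    for t in range(100):
        action = env.action_space.sample()  # random policy, just to illustrate the loop
        obs, reward, terminated, truncated, info = env.step(action)  # state -> action -> reward -> next state
        if terminated or truncated:
            obs, info = env.reset()
    env.close()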

    In policy-based RL methods (also known as policy gradient methods), we directly optimize the agent’s policy. Instead of learning “value” estimates for each state or state-action pair (as in value-based methods like Q-learning), policy gradient algorithms adjust the parameters of a policy (often a neural network) in the direction that improves performance. A classic example is the REINFORCE algorithm, which updates the policy parameters in proportion to the reward-weighted gradient of the log-policy. In practice, to reduce variance, we use an advantage function (the extra reward of taking action a in state s compared to average) or a baseline (like a value function) when computing the gradient. This leads to actor-critic methods, where the “actor” is the policy being learned, and the “critic” is a value function that estimates how good states (or state-action pairs) are, providing a baseline for the actor’s updates. Many advanced algorithms, including PPO, fall into this actor-critic family: they maintain a policy (actor) and use a learned value function (critic) to assist the policy update.
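
    As a rough illustration of the idea (not a full algorithm), here is a minimal PyTorch-flavoured sketch of the REINFORCE-with-baseline update. The tensors are placeholders standing in for a collected batch of log-probabilities, returns, and critic value estimates:

    import torch

    # Placeholder batch collected from the environment (shapes are assumptions, for illustration only)
    log_probs = torch.randn(32, requires_grad=True)  # log pi_theta(a_t | s_t) for each sampled action
    returns   = torch.randn(32)                      # observed returns G_t
    values    = torch.randn(32)                      # critic's estimates V(s_t), used as a baseline

    advantages = returns - values                    # "extra reward compared to average"
    policy_loss = -(log_probs * advantages.detach()).mean()  # reward-weighted gradient of the log-policy
    policy_loss.backward()                           # gradients flow into the policy parameters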

    Generalized Reinforcement Policy Optimization (GRPO)

    One of the newer developments in policy optimization is Generalized Reinforcement Policy Optimization (GRPO) – more commonly expanded in the literature as Group Relative Policy Optimization. GRPO was introduced in recent research (notably by the DeepSeek team) to address some limitations of PPO when training large models (such as language models for reasoning). At its core, GRPO is a variant of policy gradient RL that eliminates the need for a separate critic/value network and instead optimizes the policy by comparing a group of action outcomes against each other.

    Motivation: Why remove the critic? In complex environments (e.g. long text generation tasks), training a value function can be hard and resource-intensive. By “foregoing the critic,” GRPO avoids the challenges of learning an accurate value model and saves roughly half the memory/computation, since we don’t maintain extra model parameters for the critic. This makes RL training simpler and more feasible in memory-constrained settings. In fact, GRPO was shown to cut the compute requirements of reinforcement learning from human feedback nearly in half compared to PPO.

    Core idea: Instead of relying on a critic to tell us how good each action was, GRPO evaluates the policy by comparing the outcomes of several actions relative to each other. Imagine the agent (policy) generates a set of possible outputs for the same state (or prompt) – a group of responses. These are all evaluated by the environment or a reward function, yielding rewards. GRPO then computes an advantage for each action based on how its reward compares to the others. One simple way is to take each action’s reward minus the average reward of the group (optionally dividing by the group’s reward standard deviation for normalization). This tells us which actions did better than average and which did worse. The policy is then updated to assign higher probability to the better-than-average actions and lower probability to the worse ones. In essence, “the model learns to become more like the answers marked as correct and less like the others.”
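
    A minimal sketch of this group computation in PyTorch (the reward values are made up purely for illustration):

    import torch

    # Rewards for a group of sampled responses to the same prompt (illustrative values)
    rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])

    # Group-relative advantage: each reward compared to the group mean,
    # optionally normalized by the group's standard deviation
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Better-than-average responses get positive advantages, worse ones negative –
    # exactly the signal the policy update needs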

    How does this look in practice? It turns out the loss/objective in GRPO looks very similar to PPO’s. GRPO still uses the idea of a “surrogate” objective with probability ratios (we’ll explain this under PPO) and even uses the same clipping mechanism to limit how far the policy moves in a single update. The key difference is that the advantage is computed from these group-based relative rewards rather than a separate value estimator. Also, implementations of GRPO often include a KL-divergence term in the loss to keep the new policy close to a reference (or old) policy, similar to PPO’s optional KL penalty.
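
    Putting those pieces together, here is a simplified sketch of what a GRPO-style loss can look like. The exact KL estimator and coefficients vary between implementations (DeepSeek’s paper uses a lower-variance KL estimator, for example), so treat this as an outline rather than the reference formulation:

    import torch

    def grpo_style_loss(new_logprobs, old_logprobs, ref_logprobs, advantages,
                        clip_eps=0.2, kl_coef=0.04):
        # Probability ratio between the new and old policy, as in PPO
        ratio = torch.exp(new_logprobs - old_logprobs)
        # Clipped surrogate objective, but with group-relative advantages
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        surrogate = torch.min(unclipped, clipped).mean()
        # Naive KL-style penalty keeping the policy close to a reference policy
        kl_penalty = (new_logprobs - ref_logprobs).mean()
        # Return a loss to minimize (negative of the objective to maximize)
        return -(surrogate - kl_coef * kl_penalty)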

    PPO vs. GRPO — Top: In PPO, the agent’s Policy Model is trained with the help of a separate Value Model (critic) to estimate the advantage, alongside a Reward Model and a fixed Reference Model (for the KL penalty). Bottom: GRPO removes the value network and instead computes advantages by comparing the reward scores of a group of sampled outputs for the same input via a simple “group computation.” The policy update then uses these relative scores as the advantage signals. By dropping the value model, GRPO significantly simplifies the training pipeline and reduces memory usage, at the cost of using more samples per update (to form the groups).

    Image source: https://arxiv.org/pdf/2402.03300

    In summary, GRPO can be seen as a PPO-like approach without a learned critic. It trades off some sample efficiency (since it needs multiple samples from the same state to compare rewards) in exchange for greater simplicity and stability when value-function learning is difficult. Originally designed for large language model training with human feedback (where getting reliable value estimates is hard), GRPO’s ideas are more generally applicable to other RL scenarios where relative comparisons across a batch of actions can be made. By understanding GRPO at a high level, we also set the stage for understanding PPO, since GRPO is essentially built on PPO’s foundation.

    Proximal Policy Optimization (PPO)

    Now let’s turn to Proximal Policy Optimization (PPO) – one of the most popular and successful policy gradient algorithms in modern RL. PPO was introduced by OpenAI in 2017 as an answer to a practical question: how can we update an RL agent as much as possible with the data we have, while ensuring we don’t destabilize training by making too large a change? In other words, we want big improvement steps without “falling off a cliff” in performance. Its predecessors, like Trust Region Policy Optimization (TRPO), tackled this by enforcing a hard constraint on the size of the policy update (using complex second-order optimization). PPO achieves a similar effect in a much simpler way – using first-order gradient updates with a clever clipped objective – which is easier to implement and empirically just as good.
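
    For reference, the clipped surrogate objective from the original PPO paper (Schulman et al., 2017) can be written as

    $$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

    where $\hat{A}_t$ is the advantage estimate and $\epsilon$ (typically around 0.2) limits how far the probability ratio can push a single update. This is the objective referred to in the training loop below.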

    In practice, PPO is implemented as an on-policy actor-critic algorithm. A typical PPO training iteration looks like this:

    1. Run the current policy in the environment to collect a batch of trajectories (state, action, reward sequences). For example, play 2048 steps of the game or have the agent simulate several episodes.
    2. Use the collected data to compute the advantage for each state-action pair (often using Generalized Advantage Estimation (GAE) or a similar method to combine the critic’s value predictions with actual rewards).
    3. Update the policy by maximizing the PPO objective above (usually by gradient ascent, which in practice means doing several epochs of stochastic gradient descent on the collected batch).
    4. Optionally, update the value function (critic) by minimizing a value loss, since PPO typically trains the critic concurrently to improve advantage estimates (a minimal sketch of steps 2–4 follows this list).
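
    To connect the steps above to code, here is a minimal PyTorch-style sketch of steps 2–4. Tensor shapes, the rollout data, and the coefficients are assumptions; real implementations such as Stable Baselines3 add minibatching, advantage normalization, and other details:

    import torch

    def gae_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
        # Step 2: Generalized Advantage Estimation over one collected rollout
        advantages = torch.zeros_like(rewards)
        gae = 0.0
        next_value = last_value
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + gamma * next_value * (1 - dones[t]) - values[t]
            gae = delta + gamma * lam * (1 - dones[t]) * gae
            advantages[t] = gae
            next_value = values[t]
        return advantages

    def ppo_loss(new_logprobs, old_logprobs, advantages, returns, values, entropy,
                 clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
        # Step 3: clipped surrogate policy loss
        ratio = torch.exp(new_logprobs - old_logprobs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
        policy_loss = -torch.min(unclipped, clipped).mean()
        # Step 4: value-function (critic) loss, plus an optional entropy bonus
        value_loss = (returns - values).pow(2).mean()
        return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()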

    Because PPO is on-policy (it uses fresh data from the current policy for each update), it forgoes the sample efficiency of off-policy algorithms like DQN. However, PPO often makes up for this by being stable and scalable: it is easy to parallelize (collect data from multiple environment instances) and doesn’t require complex experience replay or target networks. It has been shown to work robustly across many domains (robotics, games, etc.) with relatively minimal hyperparameter tuning. In fact, PPO became something of a default choice for many RL problems because of its reliability.

    PPO variants: There are two main variants of PPO discussed in the original papers:

    • PPO-penalty: adds a penalty to the objective proportional to the KL divergence between the new and old policy (and adapts this penalty coefficient during training). This is closer in spirit to TRPO’s approach (keep the KL small via an explicit penalty).
    • PPO-clip: the variant described above, using the clipped objective and no explicit KL term. This is by far the more popular version and what people usually mean by “PPO”.

    Both variants aim to restrict policy change; PPO-clip became standard because of its simplicity and strong performance. PPO also typically includes an entropy bonus regularization term (to encourage exploration by not making the policy too deterministic too quickly) and other practical tweaks, but those are details beyond our scope here.
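
    For completeness, the penalized objective from the original PPO paper can be stated as

    $$L^{\text{KLPEN}}(\theta) = \hat{\mathbb{E}}_t\Big[r_t(\theta)\,\hat{A}_t \;-\; \beta\,\mathrm{KL}\big[\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big]\Big]$$

    where the coefficient $\beta$ is increased when the measured KL divergence overshoots a target value and decreased when it undershoots.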

    Why PPO is popular – advantages: To sum up, PPO offers a compelling mix of stability and simplicity. It doesn’t collapse or diverge easily during training thanks to the clipped updates, and yet it is much easier to implement than older trust-region methods. Researchers and practitioners have used PPO for everything from controlling robots to training game-playing agents. Notably, PPO (with slight modifications) was used in OpenAI’s InstructGPT and other large-scale RL-from-human-feedback projects to fine-tune language models, because of its stability in handling high-dimensional action spaces like text. It may not always be the absolute most sample-efficient or fastest-learning algorithm on every task, but when in doubt, PPO is often a reliable choice.

    PPO and GRPO vs. Other RL Algorithms

    To put things in perspective, let’s briefly compare PPO (and by extension GRPO) with some other popular RL algorithms, highlighting the key differences:

    • DQN (Deep Q-Network, 2015): DQN is a value-based method, not a policy gradient one. It learns a Q-value function (via a deep neural network) for discrete actions, and the policy is implicitly “take the action with the highest Q-value”. DQN uses techniques like an experience replay buffer (to reuse past experiences and break correlations) and a target network (to stabilize Q-value updates). Unlike PPO, which is on-policy and updates a parametric policy directly, DQN is off-policy and doesn’t parameterize a policy at all (the policy is greedy with respect to Q). PPO typically handles large or continuous action spaces better than DQN, while DQN excels in discrete problems (like Atari) and can be more sample-efficient thanks to replay (a tiny sketch of the bootstrapped Q-learning target appears after this list).
    • A3C (Asynchronous Advantage Actor-Critic, 2016): A3C is an earlier policy gradient/actor-critic algorithm that uses multiple worker agents in parallel to collect experience and update a global model asynchronously. Each worker runs on its own environment instance, and their updates are aggregated into a central set of parameters. This parallelism decorrelates the data and speeds up learning, helping to stabilize training compared to a single agent running sequentially. A3C uses an advantage actor-critic update (often with n-step returns) but doesn’t have PPO’s explicit “clipping” mechanism. In fact, PPO can be seen as an evolution of ideas from A3C/A2C – it keeps the on-policy advantage actor-critic approach but adds the surrogate clipping to improve stability. Empirically, PPO tends to outperform A3C, as it did on many Atari games with far less wall-clock training time, thanks to more efficient use of batch updates (A2C, a synchronous version of A3C, plus PPO’s clipping yields strong performance). A3C’s asynchronous approach is less common now, since you can achieve similar benefits with batched environments and stable algorithms like PPO.
    • TRPO (Trust Region Policy Optimization, 2015): TRPO is the direct predecessor of PPO. It introduced the idea of a “trust region” constraint on policy updates – essentially ensuring the new policy is not too far from the old policy by enforcing a constraint on the KL divergence between them. TRPO solves a constrained optimization problem and requires computing approximate second-order gradients (via conjugate gradient). It was a breakthrough in enabling larger policy updates without chaos, and it improved stability and reliability over vanilla policy gradients. However, TRPO is complicated to implement and can be slower because of the second-order math. PPO was born as a simpler, more efficient alternative that achieves similar results with first-order methods. Instead of a hard KL constraint, PPO either softens it into a penalty or replaces it with the clipping mechanism. As a result, PPO is easier to use and has largely supplanted TRPO in practice. In terms of performance, PPO and TRPO often achieve comparable returns, but PPO’s simplicity gives it an edge for development speed. (In the context of GRPO: GRPO’s update rule is essentially a PPO-like update, so it also benefits from these insights without needing TRPO’s machinery.)
    • DDPG (Deep Deterministic Policy Gradient, 2015): DDPG is an off-policy actor-critic algorithm for continuous action spaces. It combines ideas from DQN and policy gradients. DDPG maintains two networks: a critic (like DQN’s Q-function) and an actor that deterministically outputs an action. During training, DDPG uses a replay buffer and target networks (like DQN) for stability, and it updates the actor using the gradient of the Q-function (hence “deterministic policy gradient”). In simple terms, DDPG extends Q-learning to continuous actions by using a differentiable policy (actor) to select actions, and it learns that policy via gradients through the Q critic. The downside is that off-policy actor-critic methods like DDPG can be somewhat finicky – they may get stuck in local optima or diverge without careful tuning (improvements like TD3 and SAC were later developed to address some of DDPG’s weaknesses). Compared to PPO, DDPG can be more sample-efficient (by replaying experiences) and can converge to deterministic policies, which may be optimal in noise-free settings, but PPO’s on-policy nature and stochastic policy can make it more robust in environments that require exploration. In practice, for continuous control tasks, one might choose PPO for ease and robustness, or DDPG/TD3/SAC for efficiency and performance if tuned well.
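
    To make the contrast concrete, here is a tiny sketch of the kind of bootstrapped target these off-policy methods use; all tensors are placeholders rather than real replay-buffer contents:

    import torch

    # Placeholder batch of transitions, as if sampled from a replay buffer (shapes are assumptions)
    rewards = torch.rand(64)
    dones = torch.zeros(64)
    q_next_target = torch.rand(64, 4)   # target network's Q-values for next states (4 discrete actions)
    gamma = 0.99

    # DQN: bootstrapped target using the target network and a greedy max over actions
    dqn_target = rewards + gamma * (1 - dones) * q_next_target.max(dim=1).values

    # DDPG (continuous actions), in pseudocode: the actor is updated by ascending the critic's Q-value
    # actor_loss = -critic(states, actor(states)).mean()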

    In summary – PPO (and GRPO) vs. the others: PPO is an on-policy policy gradient method focused on stable updates, while DQN and DDPG are off-policy value-based or actor-critic methods focused on sample efficiency. A3C/A2C are earlier on-policy actor-critic methods that introduced useful techniques like multi-environment training, but PPO improved on their stability. TRPO laid the theoretical groundwork for stable policy updates, and PPO made it practical. GRPO, being a derivative of PPO, shares PPO’s advantages but simplifies the pipeline further by removing the value function, making it an intriguing option for scenarios like large-scale language model training where using a value network is problematic. Each algorithm has its own niche, but PPO’s general reliability is why it is often a baseline choice in many comparisons.

    PPO in Practice: Code Example

    To solidify our understanding, let’s see a quick example of how one would use PPO in practice. We’ll use a popular RL library (Stable Baselines3) and train a simple agent on a classic control task (CartPole). This example is in Python with PyTorch under the hood, but you won’t have to implement the PPO update equations yourself – the library handles it.

    In the code below, we first create the CartPole environment (a classic pole-balancing toy problem). We then create a PPO model with an MLP (multi-layer perceptron) policy network. Under the hood, this sets up both the policy (actor) and value function (critic) networks. Calling model.learn(...) launches the training loop: the agent interacts with the environment, collects observations, computes advantages, and updates its policy using the PPO algorithm. The verbose=1 flag simply prints training progress. After training, we run a quick test: the agent uses its learned policy (model.predict(obs)) to select actions, and we step through the environment to see how it performs. If all went well, the CartPole should stay balanced for a good number of steps.

    import gymnasium as gym
    from stable_baselines3 import PPO

    # Create the environment
    env = gym.make("CartPole-v1")

    # Set up PPO with an MLP policy/value network
    model = PPO(policy="MlpPolicy", env=env, verbose=1)

    # Train the agent
    model.learn(total_timesteps=50000)

    # Test the trained agent
    obs, _ = env.reset()
    for step in range(1000):
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            obs, _ = env.reset()

    This example is deliberately simple and domain-generic. In more complex environments, you might need to adjust hyperparameters (like the clipping range, the learning rate, or reward normalization) for PPO to work well. But the high-level usage stays the same: define your environment, pick the PPO algorithm, and train. PPO’s relative simplicity means you don’t have to fiddle with replay buffers or other machinery, making it a convenient starting point for many problems.
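
    As an illustration of where those knobs live (not a tuned configuration), here is how the common hyperparameters are exposed in Stable Baselines3, with VecNormalize handling observation and reward normalization; the values shown are simply the library’s defaults:

    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

    # Wrap the environment so observations and rewards are normalized on the fly
    env = DummyVecEnv([lambda: gym.make("CartPole-v1")])
    env = VecNormalize(env, norm_obs=True, norm_reward=True)

    model = PPO(
        policy="MlpPolicy",
        env=env,
        learning_rate=3e-4,   # step size for both actor and critic
        n_steps=2048,         # rollout length collected before each update
        batch_size=64,        # minibatch size for the SGD epochs
        n_epochs=10,          # passes over each collected batch
        gamma=0.99,           # discount factor
        gae_lambda=0.95,      # GAE parameter for advantage estimation
        clip_range=0.2,       # the PPO clipping parameter (epsilon)
        ent_coef=0.0,         # entropy bonus coefficient
        verbose=1,
    )
    model.learn(total_timesteps=50000)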

    Conclusion

    In this article, we explored the landscape of policy optimization in reinforcement learning through the lens of PPO and GRPO. We began with a refresher on how RL works and why policy gradient methods are useful for directly optimizing decision policies. We then introduced GRPO, learning how it forgoes a critic and instead learns from relative comparisons within a group of actions – a technique that brings efficiency and simplicity in certain settings. We took a deep dive into PPO, understanding its clipped surrogate objective and why that helps maintain training stability. We also compared these algorithms to other well-known approaches (DQN, A3C, TRPO, DDPG) to highlight when and why one might choose policy gradient methods like PPO/GRPO over the others.

    Both PPO and GRPO exemplify a core theme in modern RL: find ways to get big learning improvements while avoiding instability. PPO does this with gentle nudges (clipped updates), and GRPO does it by simplifying what we learn (no value network, just relative rewards). As you continue your RL journey, keep these principles in mind. Whether you are training a game agent or a conversational AI, methods like PPO have become go-to workhorses, and newer variants like GRPO show that there is still room to innovate on stability and efficiency.

    Sources:

    1. Sutton, R. & Barto, A. Reinforcement Learning: An Introduction. (Background on RL fundamentals.)
    2. Schulman et al. Proximal Policy Optimization Algorithms. arXiv:1707.06347 (original PPO paper).
    3. OpenAI Spinning Up – PPO (PPO explanation and equations).
    4. RLHF Handbook – Policy Gradient Algorithms (details on GRPO formulation and intuition).
    5. Stable Baselines3 Documentation (DQN description; PPO vs. others).


