    Deep Reinforcement Learning: 0 to 100

    By ProfitlyAI | October 28, 2025 | 25 min read


    Ever wondered how you'd teach a robot to land a drone without programming every single move? That's exactly what I set out to explore. I spent weeks building a game where a digital drone has to figure out how to land on a platform, not by following pre-programmed instructions, but by learning from trial and error, just like how you learned to ride a bike.

    This is Reinforcement Learning (RL), and it's fundamentally different from other machine learning approaches. Instead of showing the AI thousands of examples of "correct" landings, you give it feedback: "Hey, that was pretty good, but maybe try being more gentle next time?" or "Yikes, you crashed, probably don't do that again." Through countless attempts, the AI figures out what works and what doesn't.

    In this post, I'm documenting my journey from RL fundamentals to building a working system that (mostly!) teaches a drone to land. You'll see the successes, the failures, and all the weird behaviors I had to debug along the way.

    1. Reinforcement Learning: Overview

    A lot of the idea can be related to Pavlov's dog and Skinner's rat experiments. The idea is that you give the subject a 'reward' when it does something you want it to do (positive reinforcement) and a 'penalty' when it does something bad (negative reinforcement). Through many repeated attempts, your subject learns from this feedback, gradually discovering which actions lead to success, similar to how Skinner's rat learned which lever presses produced food rewards.

    Fig 1. Pavlov's classical conditioning experiment (AI-generated image by Google's Gemini)

    In the same fashion, we want a system that learns to do things (or tasks) such that it maximizes the reward and minimizes the penalty. Note this point about maximizing reward; it will come up again later.

    1.1 Core Concepts

    When talking about systems that can be implemented programmatically on computers, it is best practice to write clear definitions for the ideas that can be abstracted. In the study of AI (and more specifically, reinforcement learning), the core ideas boil down to the following:

    1. Agent (or Actor): This is our subject from the previous section. It could be the dog, a robot trying to navigate a huge factory, a video game NPC, and so on.
    2. Environment (or the world): This can be a place, a simulation with restrictions, a video game's virtual world, and so on. I think of it as: "A box, real or virtual, where the agent's entire life is confined; it only knows about what happens inside the box. We, as the overlords, can alter this box, while the agent will think that god is exacting his will on its world."
    3. Policy: Just like in governments, companies, and many similar entities, 'policies' dictate "what actions should be taken in a given situation".
    4. State: This is what the agent "sees" or "knows" about its current situation. Think of it as the agent's snapshot of reality at any given moment, like how you see the traffic light color, your speed, and the distance to the intersection when driving.
    5. Action: Now that our agent can 'see' things in its environment, it might want to do something about its state. Maybe it just woke up from a long night's slumber and now wants a cup of coffee. In this case, the first thing it will do is get out of bed. That is an action the agent takes to achieve its goal, i.e., GET SOME COFFEE!
    6. Reward: Every time the actor executes an action (of its own volition), something may change in the world. For example, our agent got out of bed and started walking towards the kitchen, but because it is so bad at walking, it tripped and fell. In this scenario, the god (us) rewards it with a punishment for being bad at walking (negative reward). But then the agent makes it to the kitchen and gets the coffee, so the god (us) rewards it with a cookie (positive reward).
    Fig. 2 Illustration of a theoretical RL system

    As you can imagine, most of these key components need to be tailored to the specific task/problem that we want the agent to solve.
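    To make these concepts concrete, here is a minimal sketch of the agent-environment loop in Python. The environment and policy here are trivial stand-ins (a random "policy" in a dummy ten-step world), not anything from this post's repository:

    import random

    class DummyEnv:
        """A throwaway environment: the episode ends after 10 steps."""
        def reset(self):
            self.t = 0
            return 0.0                                # initial state
        def step(self, action):
            self.t += 1
            reward = 1.0 if action == 1 else 0.0      # reward the agent for picking action 1
            done = self.t >= 10
            return float(self.t), reward, done        # next state, reward, done flag

    env = DummyEnv()
    state = env.reset()
    done = False
    while not done:
        action = random.choice([0, 1])                # a "policy": here just random actions
        next_state, reward, done = env.step(action)
        # a learning agent would update its policy from (state, action, reward, next_state)
        state = next_state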

    2. The Gymnasium

    Now that we understand the basics, you might be wondering: how do we actually build one of these systems? Let me show you the game I built.

    For this post, I've written a bespoke video game that anyone can access and use to train their own machine learning agent to play the game.

    The full code repository can be found on GitHub (please star it). I intend to use this repository for more games and simulation code, including more advanced techniques that I'll implement in my next installments of posts on RL.

    Delivery Drone

    The delivery drone is a game where the objective is to fly a drone (presumably carrying deliveries) onto a platform. To win the game, we have to land. To land, we have to satisfy the following criteria:

    1. Be within landing proximity of the platform
    2. Be slow enough
    3. Be upright (landing upside down is more like crashing than landing)

    All information on how to run the game can be found in the GitHub repository.

    Here's what the game looks like:

    Fig. 3 A screenshot of the game that I made for this project

    If the drone flies off the screen or touches the ground, it is considered a 'crash' and thus leads to a failure.

    State description

    The drone observes 15 continuous values that completely describe its situation; they are listed in the state-vector section below.

    Landing Success Criteria: The drone must simultaneously achieve:

    1. Horizontal alignment: within platform bounds (|dx| < 0.0625)
    2. Safe approach speed: less than 0.3
    3. Level orientation: tilt less than 20° (|angle| < 0.111)
    4. Correct altitude: bottom of the drone touching the platform top

    It's like parallel parking: you need the right position, the right angle, and to be moving slowly enough not to crash!
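    Putting those thresholds into code, a success check could look roughly like the sketch below. The touchdown test (CONTACT_EPS) is my own stand-in for "bottom of the drone touching the platform top"; the game's internal logic may differ:

    # Sketch of the landing check using the thresholds above (CONTACT_EPS is an assumption).
    CONTACT_EPS = 0.01

    def is_successful_landing(state):
        return (
            abs(state.dx_to_platform) < 0.0625           # 1. horizontal alignment
            and state.speed < 0.3                        # 2. safe approach speed
            and abs(state.drone_angle) < 0.111           # 3. level orientation (~20°, normalized)
            and abs(state.dy_to_platform) < CONTACT_EPS  # 4. correct altitude
        )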

    How can someone design a policy?

    There are many ways to design a policy. It can be Bayesian (maintaining probability distributions over beliefs), it can be a simple lookup table for discrete states, a hand-coded rule system ("if distance < 10, then brake"), a decision tree, or, as we'll explore, a neural network that learns the mapping from states to actions through gradient descent.

    Effectively, we want something that takes in the aforementioned state, performs some computation using this state, and returns what action should be performed.
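    For instance, before bringing in any learning at all, a hand-coded rule policy for the drone could look like the sketch below. The thresholds and sign conventions are my own guesses, purely for illustration; it is not part of the repository:

    # Illustrative hand-coded policy: maps an observed DroneState to thruster commands
    # using fixed rules. Thresholds and sign conventions are assumptions.
    def rule_based_policy(state):
        action = {"main_thrust": 0, "left_thrust": 0, "right_thrust": 0}
        # fire the main thruster if we are falling too fast
        if state.drone_vy < -0.2:
            action["main_thrust"] = 1
        # fire a side thruster to counteract tilt (sign convention assumed)
        if state.drone_angle > 0.05:
            action["right_thrust"] = 1
        elif state.drone_angle < -0.05:
            action["left_thrust"] = 1
        return action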

    Deep Learning to build a policy?

    So how do we design a policy that can handle continuous states (like actual drone positions) and learn complex behaviors? This is where neural networks come in.

    In the case of neural networks (or deep learning in general), it is often best to work with action probabilities, i.e., "Which action is likely the best given the current state?". So, we can define a neural network that takes in the state as a 'vector' or 'collection of vectors'. This vector or collection of vectors needs to be constructed from the observed state. For our delivery drone game, the state vector is described next.

    State vector (from our 2D drone game)

    The drone observes its absolute position, velocities, orientation, fuel, platform position, and a few derived metrics. Our continuous state is the 15-dimensional vector

    s = [x, y, vx, vy, angle, angular_vel, fuel, platform_x, platform_y, distance_to_platform, dx_to_platform, dy_to_platform, speed, landed, crashed]

    where the first thirteen components are physical quantities and the last two are binary flags. All components are normalized to roughly [0, 1] or [-1, 1] ranges for stable neural network training.

    Action space (three independent binary thrusters)

    Instead of discrete action combinations, we treat each thruster independently:

    • Main thruster (upward thrust)
    • Left thruster (clockwise rotation)
    • Right thruster (counter-clockwise rotation)

    Each action is sampled from a Bernoulli distribution, giving us 3 independent binary decisions per timestep.

    Neural-network policy (probabilistic with Bernoulli sampling)

    Let fθ(s) be the network outputs after the sigmoid activation. The policy uses independent Bernoulli distributions:
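    With p = f_θ(s) ∈ (0, 1)³, this is the standard factorized Bernoulli form (three independent "coin flips", one per thruster):

    \pi_\theta(a \mid s) = \prod_{i=1}^{3} p_i^{a_i} (1 - p_i)^{1 - a_i}, \qquad p = f_\theta(s), \quad a_i \in \{0, 1\}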

    Minimal Python sketch (from our implementation)

    # construct state vector from DroneState
    s = np.array([
        state.drone_x, state.drone_y,
        state.drone_vx, state.drone_vy,
        state.drone_angle, state.drone_angular_vel,
        state.drone_fuel,
        state.platform_x, state.platform_y,
        state.distance_to_platform,
        state.dx_to_platform, state.dy_to_platform,
        state.speed,
        float(state.landed), float(state.crashed)
    ])
    
    # network outputs probabilities for each thruster (after sigmoid)
    action_probs = policy(torch.tensor(s, dtype=torch.float32))  # shape: (3,)
    
    # sample each thruster independently from Bernoulli
    dist = Bernoulli(probs=action_probs)
    action = dist.sample()  # shape: (3,), e.g., [1, 0, 1] means main+right thrusters

    This shows how we map the game's physical observations into a 15-dimensional normalized state vector and produce independent binary decisions for each thruster.

    Code setup (part 1): Imports and game socket setup

    We first need our game's socket listener to start. For this, you can navigate to the delivery_drone directory in my repository and run the following commands:

    pip install -r requirements.txt  # run this once to set up the required modules
    python socket_server.py --render human --port 5555 --num-games 1  # run this every time you need to run the game in socket mode

    NOTE: You will need PyTorch to run the code. Please make sure you have set it up beforehand.

    import os
    import torch
    import torch.nn as nn
    import math
    import numpy as np
    
    from torch.distributions import Bernoulli
    
    # Import the game's socket client
    from delivery_drone.game.socket_client import DroneGameClient, DroneState
    
    # set up the client and connect to the server
    client = DroneGameClient()
    client.connect()

    How can someone design a reward function?

    So what makes a good reward function? This is arguably the hardest part of RL (and where I spent a LOT of my debugging time 🫠).

    The reward function is the soul of any RL implementation (and trust me, get this wrong and your agent will do the weirdest things). In theory, it should define what 'good' behaviour should be learnt and what 'bad' behaviour should not be learnt. Each action taken by our agent is characterized by the total collected reward for each behaviour trait exhibited by the action. For example, if you want the drone to land gently, you might give positive rewards for being close to the platform and moving slowly, while penalizing crashes or running out of fuel; the agent then learns to maximize the sum of all these rewards over time.

    Advantage: A better way to measure effective reward

    When training our policy, we don't just want to know whether an action rewarded us; we want to know whether it was better than usual. This is the intuition behind the advantage.

    The advantage tells us: "Was this action better or worse than what we normally expect?"

    In our implementation, we:

    1. Collect multiple episodes and calculate their returns (total discounted rewards)
    2. Compute the baseline as the mean return across all episodes
    3. Calculate advantage = return - baseline for each timestep
    4. Normalize advantages to have mean=0 and std=1 (for stable training)

    Why this helps:

    • Actions with positive advantage → better than average → increase their probability
    • Actions with negative advantage → worse than average → decrease their probability
    • Reduces variance in gradient updates (more stable learning)

    This simple baseline already gives us much more stable training than raw returns! It weighs the full sequence of actions against the outcomes (crashed or landed) so that the policy learns to take actions that lead to higher advantage.
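    As a toy illustration with made-up numbers, suppose three episodes end with returns 100, 50, and 30:

    import numpy as np

    # Made-up returns, purely to illustrate baseline subtraction and normalization.
    returns = np.array([100.0, 50.0, 30.0])
    baseline = returns.mean()                    # 60.0
    advantages = returns - baseline              # [ 40., -10., -30.]
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    print(advantages)                            # roughly [ 1.36, -0.34, -1.02]

    The first episode's actions get pushed up in probability, the other two get pushed down, regardless of the absolute reward scale.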

    After a lot of trial and error, I designed the following reward function. The key insight was to condition rewards on both proximity AND vertical position: the drone must be above the platform to receive positive rewards, preventing exploitation strategies like hovering beneath the platform.

    Short note on inversely (and non-linearly) scaling rewards

    Often, we want to reward behaviors inversely proportional to certain state values. For example, the distance to the platform ranges from 0 to ~1.41 (normalized by window width). We want a high reward when the distance ≈ 0 and a low reward when far away. I use various scaling functions for this:

    Fig. 4 Gaussian scalar function
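    A Gaussian-style scaler like the one plotted in Fig. 4 can be sketched as follows; the exact constants used for the plot are my guess:

    def gaussian_scaler(x, decay=20, scaler=10, shifter=0):
        """Reward decays smoothly (Gaussian-style) as x moves away from `shifter`."""
        return scaler * np.exp(-decay * (x - shifter) ** 2)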

    Examples of other useful scaling functions

    Helper functions:

    def inverse_quadratic(x, decay=20, scaler=10, shifter=0):
        """Reward decreases quadratically with distance"""
        return scaler / (1 + decay * (x - shifter)**2)
    
    def scaled_shifted_negative_sigmoid(x, scaler=10, shift=0, steepness=10):
        """Sigmoid operate scaled and shifted"""
        return scaler / (1 + np.exp(steepness * (x - shift)))
    
    def calc_velocity_alignment(state: DroneState):
        """
        Calculate how well the drone's velocity is aligned with the optimal direction to the platform.
        Returns cosine similarity: 1.0 = perfect alignment, -1.0 = opposite direction
        """
        # Optimal direction: from drone to platform
        optimal_dx = -state.dx_to_platform
        optimal_dy = -state.dy_to_platform
        optimal_norm = math.sqrt(optimal_dx**2 + optimal_dy**2)
    
        if optimal_norm < 1e-6:  # Already at platform
            return 1.0
    
        optimal_dx /= optimal_norm
        optimal_dy /= optimal_norm
    
        # Current velocity direction
        velocity_norm = state.speed
        if velocity_norm < 1e-6:  # Not moving
            return 0.0
    
        velocity_dx = state.drone_vx / velocity_norm
        velocity_dy = state.drone_vy / velocity_norm
    
        # Cosine similarity
        return velocity_dx * optimal_dx + velocity_dy * optimal_dy

    Code for the current reward function:

    def calc_reward(state: DroneState):
        rewards = {}
        total_reward = 0
    
        # 1. Time penalty - distance-based (penalize more when far)
        minimum_time_penalty = 0.3
        maximum_time_penalty = 1.0
        rewards['time_penalty'] = -inverse_quadratic(
            state.distance_to_platform,
            decay=50,
            scaler=maximum_time_penalty - minimum_time_penalty
        ) - minimum_time_penalty
        total_reward += rewards['time_penalty']
    
        # 2. Distance & velocity alignment - ONLY when above platform
        velocity_alignment = calc_velocity_alignment(state)
        dist = state.distance_to_platform
    
        rewards['distance'] = 0
        rewards['velocity_alignment'] = 0
    
        # Key condition: drone must be above the platform (dy > 0) to get positive rewards
        if dist > 0.065 and state.dy_to_platform > 0:
            # Reward movement toward the platform when velocity is aligned
            if velocity_alignment > 0:
                rewards['distance'] = state.speed * scaled_shifted_negative_sigmoid(dist, scaler=4.5)
                rewards['velocity_alignment'] = 0.5
    
        total_reward += rewards['distance']
        total_reward += rewards['velocity_alignment']
    
        # 3. Angle penalty - distance-based threshold
        abs_angle = abs(state.drone_angle)
        max_angle = 0.20
        max_permissible_angle = ((max_angle - 0.111) * dist) + 0.111
        extra = abs_angle - max_permissible_angle
        rewards['angle'] = -max(extra, 0)
        total_reward += rewards['angle']
    
        # 4. Velocity penalty - penalize excessive speed
        rewards['speed'] = 0
        speed = state.speed
        max_speed = 0.4
        if dist < 1:
            rewards['speed'] = -2 * max(speed - 0.1, 0)
        else:
            rewards['speed'] = -1 * max(speed - max_speed, 0)
        total_reward += rewards['speed']
    
        # 5. Vertical position penalty - penalize being below the platform
        rewards['vertical_position'] = 0
        if state.dy_to_platform > 0:  # Drone is above platform (GOOD)
            rewards['vertical_position'] = 0
        else:  # Drone is below platform (BAD!)
            rewards['vertical_position'] = state.dy_to_platform * 4.0  # Negative penalty
        total_reward += rewards['vertical_position']
    
        # 6. Terminal rewards
        rewards['terminal'] = 0
        if state.landed:
            rewards['terminal'] = 500.0 + state.drone_fuel * 100.0
        elif state.crashed:
            rewards['terminal'] = -200.0
            # Extra penalty for crashing far from the target
            if state.distance_to_platform > 0.3:
                rewards['terminal'] -= 100.0
        total_reward += rewards['terminal']
    
        rewards['total'] = total_reward
        return rewards

    And yes, those magic numbers like 4.5, 0.065, and 4.0? They came from a lot of trial and error. Welcome to RL, where hyperparameter tuning is half art, half science, and half luck (yes, I know that's three halves).

    def compute_returns(rewards, gamma=0.99):
        """
        Compute discounted returns (G_t) for each timestep based on the Bellman equation
        
        G_t = r_t + γ*r_{t+1} + γ²*r_{t+2} + ...
        """
        returns = []
        G = 0
        
        # Compute backwards (more efficient)
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        
        return returns
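    A quick sanity check with made-up rewards (gamma lowered to 0.9 so the numbers are easy to follow):

    # Made-up rewards, purely to illustrate the discounting.
    print(compute_returns([1, 0, 10], gamma=0.9))
    # approximately [9.1, 9.0, 10]:
    # G_2 = 10, G_1 = 0 + 0.9*10 = 9.0, G_0 = 1 + 0.9*9.0 = 9.1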

    The important thing to note is that reward functions are subject to careful trial and error. One mistake or over-reward here, and the agent goes off optimizing behaviour that exploits the mistakes. This leads us to reward hacking.

    Reward hacking

    Reward hacking occurs when an agent finds an unintended way to maximize reward without actually solving the task you wanted it to solve. The agent isn't "cheating" on purpose; it's doing exactly what you told it to do, just not what you meant for it to do.

    Classic example: If you reward a cleaning robot for "no visible dirt," it might learn to turn off its camera instead of cleaning!

    My painful learning experience: I found this out the hard way. In an early version of my drone landing reward function, I gave the drone points for being "stable and slow" anywhere near the platform. Sounds reasonable, right? Wrong! Within 50 training episodes, my drone learned to just hover in place forever, racking up free points. It was technically optimal for my badly designed reward function, but actually landing? Nope! I watched it hover for 5 minutes straight before I realized what was happening.

    Here's the problematic code I wrote:

    # DO NOT COPY THIS!
    # If drone is above the platform (|dx| < 0.0625) and close (distance < 0.25):
    corridor_reward = inverse_quadratic(distance, decay=20, scaler=15)  # Up to 15 points
    if stable and slow:
        corridor_reward += 10  # Extra 10 points!
    # Total possible: 25 points per step!

    An example of reward hacking in action:

    Fig. 5 The drone learnt to hover around the platform and farm rewards
    Fig. 6 Plot showing that the drone is clearly reward hacking

    Creating a policy network

    As discussed above, we're going to use a neural network as the policy that powers the brain of our agent. Here's a simple implementation that takes in the state vector and computes a probability for each of 3 independent actions:

    1. Activate the main thruster
    2. Activate the left thruster
    3. Activate the right thruster

    def state_to_array(state):
        """Helper operate to transform DroneState dataclass to numpy array"""
        information = np.array([
            state.drone_x,
            state.drone_y,
            state.drone_vx,
            state.drone_vy,
            state.drone_angle,
            state.drone_angular_vel,
            state.drone_fuel,
            state.platform_x,
            state.platform_y,
            state.distance_to_platform,
            state.dx_to_platform,
            state.dy_to_platform,
            state.speed,
            float(state.landed),
            float(state.crashed)
        ])
        
        return torch.tensor(data, dtype=torch.float32)
    
    class DroneGamerBoi(nn.Module):
        def __init__(self, state_dim=15):
            super().__init__()
            
            self.network = nn.Sequential(
                nn.Linear(state_dim, 128),
                nn.LayerNorm(128),
                nn.ReLU(),
                nn.Linear(128, 128),
                nn.LayerNorm(128),
                nn.ReLU(),
                nn.Linear(128, 64),
                nn.LayerNorm(64),
                nn.ReLU(),
                nn.Linear(64, 3),
                nn.Sigmoid()
            )
            
        def forward(self, state):
            if isinstance(state, DroneState):
                state = state_to_array(state)
            
            return self.network(state)

    Effectively, instead of the action space being a 2³ = 8 discrete space, I reduced it to decisions over the 3 independent thrusters using Bernoulli sampling. This reduction makes optimization easier by treating each thruster independently rather than as one big categorical choice (at least that's what I think; I may be wrong, but it worked for me!).


    Training a policy with policy gradients

    Learning Strategies: When Should We Update?

    Here's a question that tripped me up early on: should we update the policy after every single action, or wait and see how the whole episode plays out? It turns out this choice matters a lot.

    When you try to optimize based purely on the reward received for a single action, it leads to a high-variance problem (basically, the training signal is super noisy and the gradients point in random directions!). What I mean by "high variance" is that the optimization algorithm receives extremely mixed signals in the gradient used to update the parameters of our policy network. For the same action, the system may emit a particular gradient direction, but a slightly different state (with the same action) might yield something completely opposite. This leads to slow, and potentially no, training.

    There are three ways we could update our policy:

    Learning after every action (Per-Step Updates)

    The drone fires its thruster once, gets a small reward, and immediately updates its entire strategy. This is like adjusting your basketball form after every single shot: way too reactive! One lucky action that increases the reward doesn't necessarily mean the agent did well, and one unlucky action doesn't mean the agent did badly. The learning signal is just too noisy.

    My first attempt: I tried this approach early on. The drone would wiggle around randomly, make one lucky move that got a tiny bit more reward, immediately overfit to that exact move, and then crash repeatedly trying to reproduce it. It was painful to watch, like watching someone learn the wrong lesson from pure chance.

    Learning after one full attempt (Per-Episode Updates)

    Better! Now we let the drone try to land (or crash), see how the whole attempt went, and then update. This is like finishing an episode and then thinking about what to improve. At least now we see the full consequences of our actions. But here's the problem: what if that one landing was just lucky? Or unlucky? We're still basing our learning on a single data point.

    Learning from multiple attempts (Multi-Episode Batch Updates)

    This is the sweet spot. We run multiple (6 in my case) drone landing attempts concurrently, see how they all went, and then update our policy based on the average performance. Some attempts might get lucky, some unlucky, but averaged together we get a much clearer picture of what actually works. Although this is quite heavy on the computer, if you can run it, it works far better than either of the previous approaches. Of course, this method is by no means the best, but it is quite simple to understand and implement; there are other (and better) methods.

    Here's the code to collect multiple episodes in the drone game:

    def collect_episodes(client: DroneGameClient, policy: nn.Module, max_steps=300):
        """
        Collect episodes with early stopping
        
        Args:
            client: The game's socket client
            policy: PyTorch policy module
            max_steps: Maximum steps per episode (default: 300)
        """
        num_games = client.num_games
        
        # Initialize storage
        all_episodes = [{'states': [], 'actions': [], 'log_probs': [], 'rewards': [], 'done': False} 
                        for _ in range(num_games)]
        
        # Reset all games
        game_states = [client.reset(game_id) for game_id in range(num_games)]
        step_counts = [0] * num_games  # Track steps per game
        
        while not all(ep['done'] for ep in all_episodes):
            # Batch active games
            batch_states = []
            active_game_ids = []
            
            for game_id in range(num_games):
                if not all_episodes[game_id]['done']:
                    batch_states.append(state_to_array(game_states[game_id]))
                    active_game_ids.append(game_id)
            
            if len(batch_states) == 0:
                break
            
            # Batched inference
            batch_states_tensor = torch.stack(batch_states)
            batch_action_probs = policy(batch_states_tensor)
            batch_dist = Bernoulli(probs=batch_action_probs)
            batch_actions = batch_dist.sample()
            batch_log_probs = batch_dist.log_prob(batch_actions).sum(dim=1)
            
            # Execute actions
            for i, game_id in enumerate(active_game_ids):
                action = batch_actions[i]
                log_prob = batch_log_probs[i]
                
                next_state, _, done, _ = client.step({
                    "main_thrust": int(action[0]),
                    "left_thrust": int(action[1]),
                    "right_thrust": int(action[2])
                }, game_id)
                
                reward = calc_reward(next_state)
                
                # Store data
                all_episodes[game_id]['states'].append(batch_states[i])
                all_episodes[game_id]['actions'].append(action)
                all_episodes[game_id]['log_probs'].append(log_prob)
                all_episodes[game_id]['rewards'].append(reward['total'])
                
                # Update state and step count
                game_states[game_id] = next_state
                step_counts[game_id] += 1
                
                # Check done conditions
                if done or step_counts[game_id] >= max_steps:
                    # Apply timeout penalty if we hit max steps without landing
                    if step_counts[game_id] >= max_steps and not next_state.landed:
                        all_episodes[game_id]['rewards'][-1] -= 500  # Timeout penalty
                    
                    all_episodes[game_id]['done'] = True
        
        # Return episodes
        return [(ep['states'], ep['actions'], ep['log_probs'], ep['rewards']) 
                for ep in all_episodes]

    The Maximization-Minimization Puzzle

    In typical deep learning (supervised learning), we minimize a loss function L(θ), for example a cross-entropy or mean-squared error over labeled examples.

    We want to go "downhill" toward lower loss (better predictions).

    But in reinforcement learning, we want to maximize total reward! Our goal is to maximize the expected discounted return J(θ) = E[Σ_t γ^t r_t].

    The problem: Deep learning frameworks are built for minimization, not maximization. How do we turn "maximize reward" into "minimize loss"?

    The simple trick: maximizing J(θ) is the same as minimizing -J(θ).

    So our loss function becomes loss(θ) = -J(θ).

    Now gradient descent will effectively climb up the reward landscape (more like gradient ascent), because we are going downhill on the negative reward!

    The REINFORCE Algorithm (Policy Gradient)

    The policy gradient theorem (Williams, 1992) tells us how to compute the gradient of the expected reward:
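    In its standard form (without a baseline), the gradient estimate is:

    \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t \right]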

    (I know, I know, this looks intimidating. But stick with me, it's actually quite elegant once you see what's going on!)

    Where π_θ(a_t | s_t) is the probability our policy (with parameters θ) assigns to action a_t in state s_t, and G_t is the discounted return from timestep t onward.

    In plain English (because that formula is dense):

    • If action a_t led to a high return G_t, increase its probability
    • If action a_t led to a low return G_t, decrease its probability
    • The gradient tells us which direction to adjust the neural network weights

    Adding a Baseline (Variance Reduction)

    Using raw returns G_t leads to high variance (noisy gradients). We improve this by subtracting a baseline b(s_t):
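    With the baseline subtracted, the gradient estimate becomes (the standard REINFORCE-with-baseline form):

    \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \big(G_t - b(s_t)\big) \right]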

    The simplest baseline is the mean return over the collected batch.

    This gives us the advantage: A_t = G_t - b

    • Positive advantage → action was better than average → increase probability
    • Negative advantage → action was worse than average → decrease probability

    Why this helps: Instead of "this action gave reward 100" (is that good?), we have "this action gave 100 when the average is 50" (that's great!). Relative performance is clearer than absolute.

    Our Implementation

    In our drone landing code, we use REINFORCE with a baseline:

    # 1. Collect episodes and compute returns
    returns = compute_returns(rewards, gamma=0.99)  # G_t with discounting
    
    # 2. Compute baseline (mean of all returns)
    baseline = returns_tensor.mean()
    
    # 3. Compute advantages
    advantages = returns_tensor - baseline
    
    # 4. Normalize advantages (further variance reduction)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    
    # 5. Compute loss (note the negative sign!)
    loss = -(log_probs_tensor * advantages).mean()
    
    # 6. Gradient descent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    We repeat the above loop as many times as we want, or until the drone learns to land properly. Check out this notebook for more code!
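    Putting the pieces together, a minimal outer training loop could look like the sketch below. The optimizer choice, learning rate, and iteration count are my own assumptions, not values from the repository:

    policy = DroneGamerBoi(state_dim=15)
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)  # assumed optimizer and learning rate

    for iteration in range(500):                 # iteration count is arbitrary here
        episodes = collect_episodes(client, policy, max_steps=300)

        # flatten log-probs and discounted returns across all episodes in the batch
        all_log_probs, all_returns = [], []
        for states, actions, log_probs, rewards in episodes:
            all_log_probs.extend(log_probs)
            all_returns.extend(compute_returns(rewards, gamma=0.99))

        log_probs_tensor = torch.stack(all_log_probs)
        returns_tensor = torch.tensor(all_returns, dtype=torch.float32)

        # REINFORCE with baseline, exactly as in the snippet above
        advantages = returns_tensor - returns_tensor.mean()
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        loss = -(log_probs_tensor * advantages).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()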

    Current Results (the reward function is still quite flawed)

    After countless hours of tweaking rewards, adjusting hyperparameters, and watching my drone crash in creative new ways, I finally got it working (mostly!). Though my designed reward function isn't perfect, I do think it is able to teach a policy network. Here's a successful landing:

    Fig. 7 The drone learnt something!

    Pretty cool, right? But here's where things get interesting (and frustrating)…


    The persistent hovering problem: A fundamental limitation

    Even with the improved reward function that conditions rewards on vertical position (dy_to_platform > 0), the trained policy still exhibits a frustrating behavior: when the drone misses the platform, it learns to descend toward it but then hovers beneath the platform rather than attempting to land.

    I spent over a week staring at reward plots (and changing reward functions), wondering why my "fixed" reward function was still producing this hovering behavior. When I finally plotted the accumulated rewards, the pattern became crystal clear, and honestly, I couldn't even be mad at the agent for finding this strategy.

    What's happening?

    By analyzing the accumulated rewards over an episode where the drone hovers beneath the platform, I discovered something interesting:

    Fig. 8 Gif showing the "hovering beneath the platform" problem
    Fig. 9 Plot showing that the drone is clearly reward hacking

    The plots reveal that:

    • Distance reward (orange): Accumulates to ~+70 early, then plateaus (no more rewards)
    • Velocity alignment (green): Accumulates to ~+30 early, then plateaus
    • Time penalty (blue): Steadily accumulates to ~-250 (keeps getting worse)
    • Vertical position (brown): Steadily accumulates to ~-200 (penalty for being beneath the platform)
    • Total reward: Ends around -400 to -600 (after timeout)

    The key insight: The drone descends from above the platform (collecting distance and velocity rewards on the way down), passes through the platform height, and then settles into hovering beneath it instead of completing the landing. Once beneath, it stops getting positive rewards (notice how the distance and velocity lines plateau around step 50-60) but continues accumulating time penalties and vertical position penalties. Still, this strategy remains viable because attempting to land risks an immediate -200 crash penalty, while hovering beneath "only" costs ~-400 to -600 over the full episode.

    Why does this happen?

    The fundamental issue is that our reward function r(s', a) can only see the current state, not the trajectory. Think about it: at any single timestep, the reward function can't tell the difference between:

    • A drone making progress toward landing (approaching from above with a controlled descent)
    • A drone exploiting the reward structure (oscillating to farm rewards)

    Both might have dy_to_platform > 0 at a given moment, so they receive identical rewards! The agent isn't dumb; it's just optimizing exactly what you told it to optimize.

    So what would actually fix this?

    To truly solve this problem, I personally think that rewards should depend on state transitions: r(s, a, s') instead of just r(s, a). This would let you reward based on both s (the current state) and s' (the next state):

    • Progress: Only reward if distance(s') < distance(s) (actually getting closer!)
    • Vertical improvement: Only reward if the drone is consistently moving upward relative to the platform
    • Trajectory consistency: Penalize rapid direction changes that indicate oscillation

    This is a more principled solution than trying to patch the current reward function with increasingly harsh penalties (which is basically what I tried for a while, and it didn't really work). The oscillation exploit exists because we're fundamentally missing information about the trajectory.
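    As a rough sketch, a transition-based progress term could look like this (it is not in the repository, and the weight is arbitrary):

    def progress_reward(prev_state, next_state, weight=10.0):
        """Reward only actual progress: positive when the step ends closer to the
        platform than it started. The weight is an arbitrary illustrative choice."""
        progress = prev_state.distance_to_platform - next_state.distance_to_platform
        return weight * max(progress, 0.0)   # no reward for drifting away or oscillating back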

    In the next post, I'll explore Actor-Critic methods and techniques that can incorporate temporal information to prevent these exploitation strategies. Stay tuned!

    If you find a way to fix this, please reach out to me!

    This brings us to the end of this post on "the simplest way to do Deep Reinforcement Learning."


    Next on the list

    • Actor-Critic methods
    • DQL
    • PPO & GRPO
    • Applying this to systems that require vision 👀

    References

    Foundational Stuff

    1. Turing, A. M. (1950). "Computing Machinery and Intelligence."
      • Original Turing Test paper
    2. Williams, R. J. (1992). "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning." Machine Learning.
    3. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.

    Classical Conditioning & Behavioral Psychology

    1. Pavlov, I. P. (1927). Conditioned Reflexes: An Investigation of the Physiological Activity of the Cerebral Cortex. Oxford University Press.
      • Classical conditioning experiments
    2. Skinner, B. F. (1938). The Behavior of Organisms: An Experimental Analysis. Appleton-Century-Crofts.
      • Operant conditioning and the Skinner Box

    Policy Gradient Methods

    1. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). "Policy Gradient Methods for Reinforcement Learning with Function Approximation." Advances in Neural Information Processing Systems.
      • Theoretical foundations of policy gradients
    2. Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). "High-Dimensional Continuous Control Using Generalized Advantage Estimation." arXiv preprint arXiv:1506.02438.

    Neural Networks & Deep Learning

    1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

    Online Resources

    1. Karpathy, A. "Deep Reinforcement Learning: Pong from Pixels."
    2. Spinning Up in Deep RL by OpenAI

    Code Repository

    1. Jumle, V. (2025). "Reinforcement Learning 101: Delivery Drone Landing."

    Buddy

    1. Singh, Navroop Kaur (2025): For providing "Positive Vibes & Attention". Thank you!

    All images in this article are either AI-generated (using Gemini), personally made by me, or screenshots & plots that I made.


