
    The Machine Learning “Advent Calendar” Bonus 2: Gradient Descent Variants in Excel

By ProfitlyAI · December 31, 2025 · 8 min read


Machine learning models use gradient descent to find the optimal values of their weights. Linear regression, logistic regression, neural networks, and large language models all rely on this principle. In the previous articles, we used plain gradient descent because it is easier to show and easier to understand.

The same principle also appears at scale in modern large language models, where training requires adjusting millions or billions of parameters.

However, real training rarely uses the basic version. It is often too slow or too unstable. Modern systems use variants of gradient descent that improve speed, stability, or convergence.

In this bonus article, we focus on these variants. We look at why they exist, what problem they solve, and how they change the update rule. We do not use a dataset here. We use one variable and one function, only to make the behavior visible. The goal is to show the movement, not to train a model.

1. Gradient Descent and the Update Mechanism

1.1 Problem setup

To make these ideas visible, we will not use a dataset here, because datasets introduce noise and make it harder to observe the behavior directly. Instead, we will use a single function:
f(x) = (x – 2)²

We start at x = 4, and the gradient is:
gradient = 2*(x – 2)

This simple setup removes distractions. The objective is not to train a model, but to understand how the different optimisation rules change the movement toward the minimum.
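As a cross-check outside the spreadsheet, the toy objective and its gradient fit in a few lines of Python (the names `f` and `grad` are ours, not from the spreadsheet):

```python
# Toy objective f(x) = (x - 2)^2 and its analytic gradient 2*(x - 2).
def f(x):
    return (x - 2) ** 2

def grad(x):
    return 2 * (x - 2)

x0 = 4.0           # starting point used throughout the article
print(f(x0))       # 4.0
print(grad(x0))    # 4.0  (positive, so the update will move x to the left)
```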

1.2 The structure behind every update

Every optimisation method that follows in this article is built on the same loop, even when the internal logic becomes more sophisticated.

• First, we read the current value of x.
• Then, we compute the gradient with the expression 2*(x – 2).
• Finally, we update x according to the specific rule defined by the chosen variant.

The destination stays the same and the gradient always points in the correct direction, but the way we move along this direction changes from one method to another. This change in movement is the essence of each variant.

1.3 Basic gradient descent as the baseline

Basic gradient descent applies a direct update based on the current gradient and a fixed learning rate:

x = x – lr * 2*(x – 2)

This is the most intuitive form of learning because the update rule is easy to understand and easy to implement. The method moves steadily toward the minimum, but it often does so slowly, and it can struggle when the learning rate is not chosen carefully. It represents the foundation on which all other variants are built.
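The same loop can be sketched in Python (the learning rate and step count are illustrative choices, not values from the spreadsheet):

```python
def gradient_descent(x=4.0, lr=0.1, steps=50):
    """Basic gradient descent on f(x) = (x - 2)^2 with a fixed learning rate."""
    for _ in range(steps):
        x = x - lr * 2 * (x - 2)   # move against the gradient
    return x

print(gradient_descent())  # approaches the minimum at x = 2
```

On this quadratic, each step multiplies the distance to the minimum by (1 − 2·lr), so with lr = 0.1 the error shrinks by 20% per iteration.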

Gradient descent in Excel – all images by author

2. Learning Rate Decay

Learning Rate Decay does not change the update rule itself. It changes the size of the learning rate across iterations so that the optimisation becomes more stable near the minimum. Large steps help when x is far from the target, but smaller steps are safer when x gets close to the minimum. Decay reduces the risk of overshooting and produces a smoother landing.

There is not a single decay formula. Several schedules exist in practice:

• exponential decay
• inverse decay (the one shown in the spreadsheet)
• step-based decay
• linear decay
• cosine or cyclical schedules

All of these follow the same idea: the learning rate becomes smaller over time, but the pattern depends on the chosen schedule.

In the spreadsheet example, the decay formula is the inverse form:
lr_t = lr / (1 + decay * iteration)

With the update rule:
x = x – lr_t * 2*(x – 2)

This schedule starts with the full learning rate at the first iteration, then progressively reduces it. At the beginning of the optimisation, the step size is large enough to move quickly. As x approaches the minimum, the learning rate shrinks, stabilising the update and avoiding oscillation.

On the chart, both curves start at x = 4. The fixed learning rate version moves faster at first but approaches the minimum with less stability. The decay version moves more slowly but stays controlled. This confirms that decay does not change the direction of the update. It only changes the step size, and that change affects the behavior.
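The inverse schedule can be sketched as follows (iterations counted from 0 so the first step uses the full learning rate; the `decay` value is illustrative):

```python
def gd_with_decay(x=4.0, lr=0.1, decay=0.05, steps=50):
    """Gradient descent with the inverse schedule lr_t = lr / (1 + decay * t)."""
    for t in range(steps):
        lr_t = lr / (1 + decay * t)   # full lr at t = 0, smaller afterwards
        x = x - lr_t * 2 * (x - 2)
    return x
```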

3. Momentum Methods

Gradient Descent moves in the correct direction but can be slow on flat regions. Momentum methods address this by adding inertia to the update.

They accumulate direction over time, which creates faster progress when the gradient stays consistent. This family includes standard Momentum, which builds speed, and Nesterov Momentum, which anticipates the next position to reduce overshooting.

3.1 Standard momentum

Standard momentum introduces the idea of inertia into the learning process. Instead of reacting only to the current gradient, the update keeps a memory of previous gradients in the form of a velocity variable:

velocity = 0.9*velocity + 2*(x – 2)
x = x – lr * velocity

This approach accelerates learning when the gradient stays consistent for several iterations, which is especially helpful in flat or shallow regions.

However, the same inertia that generates speed can also lead to overshooting the minimum, which creates oscillations around the target.
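A minimal Python sketch of the momentum loop (β = 0.9 as in the spreadsheet; the step count is an illustrative choice):

```python
def gd_momentum(x=4.0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with standard momentum (velocity accumulates gradients)."""
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + 2 * (x - 2)  # inertia plus current gradient
        x = x - lr * velocity
    return x
```

Compared with the basic loop, the iterates here visibly overshoot x = 2 and spiral back, which is exactly the oscillation described above.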

3.2 Nesterov Momentum

Nesterov Momentum is a refinement of the previous method. Instead of updating the velocity at the current position alone, the method first estimates where the next position will be, and then evaluates the gradient at that anticipated location:

velocity = 0.9*velocity + 2*((x – 0.9*velocity) – 2)
x = x – lr * velocity

This look-ahead behaviour reduces the overshooting effect that can appear in regular Momentum, which results in a smoother approach to the minimum and fewer oscillations. It keeps the benefit of speed while introducing a more cautious sense of direction.
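In code, Nesterov momentum is often written with the look-ahead step scaled by the learning rate, which keeps it consistent with the position update x − lr·velocity; note that this is a common code parameterization, while the spreadsheet applies 0.9·velocity to the look-ahead directly:

```python
def gd_nesterov(x=4.0, lr=0.1, beta=0.9, steps=200):
    """Nesterov momentum: gradient evaluated at the anticipated next position."""
    velocity = 0.0
    for _ in range(steps):
        lookahead = x - lr * beta * velocity           # where momentum would carry x
        velocity = beta * velocity + 2 * (lookahead - 2)
        x = x - lr * velocity
    return x
```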

4. Adaptive Gradient Methods

Adaptive Gradient Methods adjust the update based on information gathered during training. Instead of using a fixed learning rate or relying only on the current gradient, these methods adapt to the size and behaviour of recent gradients.

The goal is to reduce the step size when gradients become unstable and to allow steady progress when the surface is more predictable. This approach is useful in deep networks or irregular loss surfaces, where the gradient can change in magnitude from one step to the next.

4.1 RMSProp (Root Mean Square Propagation)

RMSProp stands for Root Mean Square Propagation. It keeps a running average of squared gradients in a cache, and this value influences how aggressively the update is applied:

cache = 0.9*cache + (2*(x – 2))²
x = x – lr / sqrt(cache) * 2*(x – 2)

The cache becomes larger when gradients are unstable, which reduces the update size. When gradients are small, the cache grows more slowly, and the update stays close to the normal step. This makes RMSProp effective in situations where the gradient scale is not consistent, which is common in deep learning models.
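The same cache logic in Python; a small `eps` is added in the denominator as a standard implementation safeguard against division by zero (it is not part of the spreadsheet formula):

```python
import math

def gd_rmsprop(x=4.0, lr=0.1, beta=0.9, eps=1e-8, steps=200):
    """RMSProp as in the spreadsheet: cache of squared gradients scales the step."""
    cache = 0.0
    for _ in range(steps):
        g = 2 * (x - 2)
        cache = beta * cache + g ** 2              # running memory of g^2
        x = x - lr * g / (math.sqrt(cache) + eps)  # large cache -> smaller step
    return x
```

Because the cache saturates at roughly ten times the squared gradient, the effective step settles near lr/√10, so x marches toward 2 at a nearly constant rate.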

4.2 Adam (Adaptive Moment Estimation)

Adam stands for Adaptive Moment Estimation. It combines the idea of Momentum with the adaptive behaviour of RMSProp. It keeps a moving average of gradients to capture direction, and a moving average of squared gradients to capture scale:

m = 0.9*m + 0.1*(2*(x – 2))
v = 0.999*v + 0.001*(2*(x – 2))²
x = x – lr * m / sqrt(v)

The variable m behaves like the velocity in momentum, and the variable v behaves like the cache in RMSProp. Adam updates both values at every iteration, which allows it to accelerate when progress is clear and shrink the step when the gradient becomes unstable. This balance between speed and control is what makes Adam a standard choice in neural network training.
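The two moving averages in Python, following the spreadsheet formulas (so without the bias-correction terms that the published Adam algorithm adds; `eps` is again an implementation safeguard):

```python
import math

def gd_adam(x=4.0, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=200):
    """Adam without bias correction: m tracks direction, v tracks scale."""
    m, v = 0.0, 0.0
    for _ in range(steps):
        g = 2 * (x - 2)
        m = b1 * m + (1 - b1) * g          # first moment, like momentum's velocity
        v = b2 * v + (1 - b2) * g ** 2     # second moment, like RMSProp's cache
        x = x - lr * m / (math.sqrt(v) + eps)
    return x
```

With lr = 0.1 this one-variable version oscillates noticeably around x = 2 before settling; a smaller learning rate gives a calmer path, which is one reason bias correction and careful learning rate choice matter in the full algorithm.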

4.3 Other Adaptive Methods

Adam and RMSProp are the most common adaptive methods, but they are not the only ones. Several related techniques exist, each with a specific purpose:

• AdaGrad adjusts the learning rate based on the full history of squared gradients, but the rate can shrink too quickly.
• AdaDelta modifies AdaGrad by limiting how much the historical gradient affects the update.
• Adamax uses the infinity norm and can be more stable for very large gradients.
• Nadam adds Nesterov-style look-ahead behaviour to Adam.
• RAdam attempts to stabilise Adam in the early phase of training.
• AdamW separates weight decay from the gradient update and is recommended in many modern frameworks.

These methods follow the same idea as RMSProp and Adam: adapting the update to the behaviour of the gradients. They represent refinements or extensions of the concepts introduced above, and they are part of the same broader family of adaptive optimisation algorithms.
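As one concrete example from this list, a minimal AdamW sketch: the decay term `wd * x` acts on the parameter directly, outside the adaptive scaling (all values here are illustrative, and on this toy problem the decay pulls x slightly toward 0, so the result lands just below 2):

```python
import math

def gd_adamw(x=4.0, lr=0.01, b1=0.9, b2=0.999, wd=0.01, eps=1e-8, steps=2000):
    """AdamW sketch: Adam step plus decoupled weight decay on the parameter."""
    m, v = 0.0, 0.0
    for _ in range(steps):
        g = 2 * (x - 2)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        x = x - lr * (m / (math.sqrt(v) + eps) + wd * x)  # decay decoupled from g
    return x
```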

Conclusion

All methods in this article aim for the same goal: moving x toward the minimum. The difference is the path. Gradient Descent provides the basic rule. Momentum adds speed, and Nesterov improves control. RMSProp adapts the step to gradient scale. Adam combines these ideas, and Learning Rate Decay adjusts the step size over time.

Each method solves a specific limitation of the previous one. None of them replace the baseline. They extend it. In practice, optimisation is not one rule, but a set of mechanisms that work together.

The goal stays the same. The movement becomes easier.


