Close Menu
    Trending
    • Three OpenClaw Mistakes to Avoid and How to Fix Them
    • I Stole a Wall Street Trick to Solve a Google Trends Data Problem
    • How AI is turning the Iran conflict into theater
    • Why Your AI Search Evaluation Is Probably Wrong (And How to Fix It)
    • Machine Learning at Scale: Managing More Than One Model in Production
    • Improving AI models’ ability to explain their predictions | MIT News
    • Write C Code Without Learning C: The Magic of PythoC
    • LatentVLA: Latent Reasoning Models for Autonomous Driving
    ProfitlyAI
    • Home
    • Latest News
    • AI Technology
    • Latest AI Innovations
    • AI Tools & Technologies
    • Artificial Intelligence
    ProfitlyAI
    Home » The Machine Learning “Advent Calendar” Day 1: k-NN Regressor in Excel
    Artificial Intelligence

    The Machine Learning “Advent Calendar” Day 1: k-NN Regressor in Excel

    ProfitlyAIBy ProfitlyAIDecember 1, 2025No Comments16 Mins Read
    Share Facebook Twitter Pinterest LinkedIn Tumblr Reddit Telegram Email
    Share
    Facebook Twitter LinkedIn Pinterest Email


    to this “Advent Calendar” of Machine learning and deep learning in Excel.

    For Day 1, we start with the k-NN (k-Nearest Neighbors) regressor algorithm. And as you will note, that is actually the only mannequin and it’s a good strategy to begin.

    For many who already know this mannequin, listed here are some questions for you. Hopefully, they are going to make you wish to proceed studying. And there are some delicate classes that aren’t taught in conventional programs.

    • Is scaling of steady options essential for this mannequin?
    • How can categorical options be dealt with?
    • What must be carried out to the continual options to enhance mannequin efficiency?
    • What forms of distance measures may be extra appropriate in sure conditions? For instance, when predicting home costs the place geographical location issues?

    Spoiler: with a naive k-NN, you can’t get the perfect scaling robotically.

    It’s also a possibility, in case you are not conversant in Excel formulation, to make use of formulation akin to RANK, IF, SUMPRODUCT, and different helpful Excel capabilities.

    You need to use this link to get the Excel/Google Sheet file, and my recommendation is that you simply observe the article, and also you do some testing with the file to higher perceive.

    k-NN regressor in Excel – picture by writer

    The precept of k-NN

    If you wish to promote or purchase an condominium, how would you estimate the worth?

    Please take into consideration a really life like method, not some complicated mannequin that it’s a must to spend hours to construct.

    One thing that you are able to do for actual.

    Properly, you’ll most likely ask your neighbors who’ve an condominium of the identical or related measurement. And also you calculate the typical worth of those flats.

    Sure, that’s precisely the concept of k-NN, for k-Nearest Neighbors: seek for probably the most related examples, and use their values to estimate the brand new one.

    For instance this activity with a concrete instance of home pricing estimation, we’ll use this well-known dataset known as California Housing Dataset. That is the Census knowledge from California block teams, used to foretell median home worth.

    California Housing Dataset – License: MIT

    Every remark will not be a person home, however it’s nonetheless fascinating to make use of this instance.

    Here’s a fast description of the variables.

    Goal variable is MedHouseVal, which is the median home worth, in items of 100,000 USD (instance: 3.2 means 320,000 {dollars}).

    The function variables are as follows:

    1. MedInc: median earnings (in items of 10,000 USD)
    2. HouseAge: median age of homes
    3. AveRooms: common variety of rooms per family
    4. AveBedrms: common variety of bedrooms per family
    5. Inhabitants: folks residing within the block group
    6. AveOccup: common variety of occupants per family
    7. Latitude: geographic latitude
    8. Longitude: geographic longitude

    k-NN with One Steady Characteristic

    Earlier than we use a number of options to search out the neighbors, let’s first solely use one function and some observations.

    Though the method for one steady function can be quite simple, we’ll nonetheless observe each step. We first discover our dataset, then we practice the mannequin with a hyperparameter, and ultimately, we will use the mannequin to foretell.

    Coaching dataset

    Right here is the plot of this straightforward dataset of 10 observations. The x-axis is the continual function, and the y-axis is the goal variable.

    One easy function for Okay-NN – picture by writer

    Now, think about that we have now to foretell the worth for a brand new remark x=10. How can we do this?

    Mannequin coaching?

    Step one for nearly all machine studying fashions is coaching.

    However for k-NN, your mannequin is your whole dataset. In different phrases, you don’t have to coach the mannequin, you employ the unique dataset instantly.

    So in scikit-learn, while you do mannequin.match, for a k-NN estimator, nothing actually occurs.

    Some could ask: what about ok?

    Properly, ok is the hyperparameter. So it’s a must to select a worth for ok, and it may be tuned.

    Prediction for one new remark

    For the hyperparameter ok, we’ll use ok=3, for the reason that dataset could be very small.

    For one function variable, the gap can trivially be the absolute worth of the distinction of the worth between the brand new remark and the others.

    Within the sheet “algo1D”, you possibly can change the worth of the brand new remark, and use the filter on distance column C to order the dataset within the rising order, the 3-nearest neighbors can be plotted.

    To make the calculation extra automated, we will use RANK operate to see the smallest observations by way of distance.

    And we can also create a column of indicators (column G), with indicator = 1, in the event that they belong to the k-nearest neighbors.

    Lastly, for the prediction, we will use SUMPRODUCT to calculate the typical worth of all y values with indicator =1.

    Within the plot,

    • the sunshine blue dots characterize the dataset
    • the crimson dot represents the brand new remark with the anticipated y worth
    • the yellow dots characterize the 3-nearest neighbors of the brand new remark (in crimson)
    k-NN regressor in Excel with one function – picture by writer

    Let’s recap — the prediction section consists of the next steps:

    • For one given new remark, calculate the gap between this new remark and all of the observations within the coaching dataset.
    • Determine the ok observations which have the shortest distance. In Excel, we’ll use the filter to order manually the coaching dataset. Or we will use RANK (and a indicator column) to get the highest ok observations.
    • Calculate the anticipated worth, by calculating the typical worth of the goal variable, through the use of SUMPRODUCT.

    Prediction for an interval of recent observations

    Within the sheet “algo1D f” (f for ultimate), I plotted the prediction for an inventory of recent observations, starting from 1 to 17.

    With a programming language, we may do it simply in a loop, and for a bigger variety of new observations, so the illustration may very well be denser.

    With Excel, I manually repeated the next steps:

    • enter a worth for x
    • order the gap column
    • copy-paste the prediction
    k-NN regressor in Excel with one function – picture by writer

    Impact of the hyperparameter ok

    The hyperparameter that’s utilized in k-NN is the variety of neighbors that we take note of for the calculation of common worth.

    We often use this following graph to elucidate how a mannequin may be underfitted, or overfitted.

    k-NN regressor overfitting and overfitting – picture by writer

    In our case, if ok is small, there is usually a danger of overfitting.

    If ok is giant, there may be danger of underfitting.

    The intense case of very giant ok is that ok may be the entire variety of the coaching dataset. And the worth of the prediction would be the identical for each new remark: it’s the international common.

    k-NN regressor in Excel with impact of ok – picture by writer

    So, we will say that k-NN improves the concept of predicting with a calculation of the typical worth with a number of observations which can be near the brand new remark.

    k-NN with Two Steady Options

    Now, we’ll research the case of two steady function variables x1 and x2. And we’ll solely discuss concerning the variations with the earlier scenario of 1 function variable.

    Two steady function variables dataset

    When we have now two function variables, I can not plot in 3D with Excel, so the plot incorporates solely x1 as x-axis, and x2 as y-axis.

    So don’t be confused with the earlier dataset, for which y-axis represents the goal worth y.

    Okay-NN with two options in Excel – picture by writer

    Prediction with the Euclidean distance

    Now that we have now two options, we have now to take each of them into consideration.

    One common distance we will use is the Euclidean Distance.

    Then we will use the identical course of to the highest ok observations which have the minimal distance with the brand new remark.

    k-NN regressor in Excel with two options – picture by writer

    To get a visible plot, we will use the identical colours

    • Blue for coaching dataset
    • Pink for the brand new remark
    • Yellow for the discovered ok nearest neighbors
    k-NN regressor in Excel plot with two options – picture by writer

    Affect of the size of the variables

    When you might have two options, one query that we will ask is the influence of the size of the function for the results of prediction.

    First, let’s see this straightforward instance, I multiplied the function x2 by 10.

    k-NN regressor in Excel with completely different scales – picture by writer

    Will this scaling influence the predictions? The reply is in fact sure.

    And we will simply examine them, as within the following picture.

    k-NN regressor in Excel with completely different scales – picture by writer

    It’s simple sufficient to know that the Euclidean Distance sums the squared distinction of the options, no matter their scales.

    Consequently, the function that has a big scale will dominate the gap.

    On the subject of function scaling, one frequent operation is standardization (additionally known as centering and discount) or min–max scaling. The concept is to put all options on a comparable scale.

    BUT, let’s take into consideration this case: what if one function is expressed in {dollars}, and the opposite in yen.

    In the true world, the right relation between the 2 scales is about 1 greenback = 156 yen (as of November 2025). We all know this as a result of we perceive the that means of the items.

    How would the mannequin know this? It DOES NOT.

    The one hyperparameter is ok, and the mannequin doesn’t alter something to right for variations in items or scales. k-NN has no inside mechanism for understanding that two options have completely different items.

    And that is solely the begining of the issues…

    k-NN with the California Housing dataset

    Now, let’s lastly use the real-world dataset of California Housing dataset.

    With the one-feature dataset, we received the essential thought of how k-NN works. With two-feature dataset, we noticed that the size of options is essential.

    Now, with this real-world dataset, we’ll see that the heterogeneous nature of the options make the Euclidean distance meaningless.

    We are going to see another extra essential concepts after we use k-NN in observe.

    Naive utility of k-NN regressor

    Since all options on this dataset are steady, we will simply calculate the Euclidean Distance. And we outline a quantity ok, to calculate the typical worth of the goal variable, right here MedHouseVal.

    In Excel, you possibly can simply do this your self. Or you possibly can help me here and get all of the information.

    k-NN regressor in Excel with California Housing dataset – picture by writer

    Notion of distance primarily based on completely different options

    I mentioned that the earlier utility is naive, as a result of if you happen to look nearer, you will note these issues:

    MedInc (median earnings) is expressed in items of 10,000 USD. If we resolve to specific it in 100,000 USD or in 1,000 USD as an alternative, the prediction will change, as a result of k-NN is delicate to the size of the options. We noticed this drawback earlier than.

    Now, furthermore, every function has a unique nature.

    • MedInc is an sum of money (in {dollars}).
    • HouseAge is an age in years.
    • AveRooms is a depend of rooms.
    • Inhabitants is a lot of individuals.
    • Latitude and longitude are geographic coordinates.

    Due to this fact, the Euclidean distance is doomed.

    Various kinds of distances

    The commonest selection is the Euclidean distance, however it isn’t the one one.

    We are able to additionally use Manhattan distance when options characterize grid-like actions, and Cosine distance when solely the route issues (as with textual content embeddings).

    Every distance modifications how “nearest” is outlined, and due to this fact can change which neighbors KNN selects.

    Relying on the info, different distances may be extra applicable.
    For instance, with latitude and longitude, we will use the true geographical distance (in meters) as an alternative of a easy Euclidean distance on levels.

    Within the California Housing dataset, that is particularly helpful as a result of we have now the precise latitude and longitude of every district.

    Nonetheless, as soon as we attempt to mix these geographical distances with different variables (akin to median earnings, variety of rooms, or inhabitants), the issue turns into extra complicated, as a result of the variables have very completely different natures and scales.

    k-NN regressor in Excel with geographical distance – picture by writer

    Within the cartography renderings beneath, I used k-NN as a smoothing operate to refine the values related to completely different areas of Paris.

    On the left, every space has solely a single worth, so from one quarter to its neighboring quarters, there is usually a discontinuity of the variable.

    On the precise, k-NN permits me to estimate a worth for each particular tackle by smoothing the data primarily based on close by areas.

    Furthermore, for indicators such because the proportion of sure skilled classes, I additionally utilized population-based weighting in order that bigger areas have a stronger affect within the smoothing course of.

    k-NN regressor in Excel with smoothing – picture by writer

    As a conclusion, when the scenario permits it, selecting a extra particular distance might help us higher seize the underlying actuality.

    By linking the gap to the character of the info, we will make k-NN rather more significant: geographical distance for coordinates, cosine distance for embeddings, and so forth. The selection of distance isn’t just a technical element, it modifications how the mannequin “sees” the world and which neighbors it considers related.

    How Categorical Options may be Modeled

    You might hear that categorical options can’t be dealt with in k-NN fashions.

    However this isn’t utterly true.

    k-NN can work with categorical variables so long as we will outline a distance between two observations.

    Many individuals will say: “simply use one-hot encoding.”

    Others point out label encoding, or ordinal encoding.

    However these strategies behave very otherwise in a distance-based mannequin.

    To make this clear, we’ll use one other dataset: the diamond price dataset (CC BY 4.0 license), which incorporates a number of options akin to carat, minimize, shade, and readability.

    For simplicity, we’ll use solely carat (numerical) and readability (categorical) to display a number of outcomes.

    Predicting Costs with Carat

    First, we’ll begin with carat, because you most likely know that the worth of a diamond relies upon primarily on the dimensions (carat) of the stone.

    The graphic beneath exhibits how k-NN can discover diamonds with related sizes to estimate the worth.

    k-NN regressor in Excel with diamond worth dataset – picture by writer

    One-Scorching Encoding for Readability Characteristic

    Now allow us to have a look at readability.

    Beneath is the desk of classes with their meanings, and we apply one-hot encoding to remodel every class right into a binary vector.

    Readability Which means
    IF Internally Flawless
    VVS1 Very Very Barely Included 1
    VVS2 Very Very Barely Included 2
    VS1 Very Barely Included 1
    VS2 Very Barely Included 2
    SI1 Barely Included 1
    SI2 Barely Included 2
    I1 Included 1

    On this desk, we see that for the brand new diamond with readability VVS2, the closest neighbors are all diamonds from the identical readability class.

    The numerical function carat has little or no affect on the gap, whereas it’s a extra essential function, as you possibly can see within the worth column.

    k-NN regressor in Excel with one-hot encoding for readability in diamond worth dataset – picture by writer

    Key situation 1: all classes are equally distant

    When utilizing Euclidean distance on one-hot vectors:

    • IF vs VVS1 → distance = √2
    • IF vs SI2 → distance = √2
    • IF vs I1 → distance = √2

    Each completely different class is at precisely the identical distance.
    This doesn’t mirror the true diamond grading scale.

    Key situation 2: scaling drawback with steady variables

    As a result of we mix one-hot readability with carat (a steady function), we face one other drawback:

    • carat values in our instance are beneath 1
    • readability vectors have variations of √2 → readability dominates the gap calculation

    So even small modifications in readability overpower the impact of carat.
    That is precisely the identical scaling situation we face with multi-continuous options, however even stronger.

    Ordinal Encoding for Readability

    Now we will strive encoding the Readability function with numerical labels. However as an alternative of utilizing the traditional labels 1, 2, 3… we use expert-based labels that mirror the true grading scale.

    The concept is to translate the readability ranges into values that behave extra like a steady function, much like carat, even when readability will not be strictly steady.

    k-NN regressor in Excel with readability encoding – picture by writer

    With this expert-based encoding, the distances grow to be extra significant.

    Carat and readability are actually on comparable scales, so neither function utterly dominates the gap calculation.

    So, we acquire a higher stability between measurement and readability when choosing neighbors, which provides extra life like predictions.

    k-NN regressor in Excel with ordinal encoding for readability function in diamond worth dataset – picture by writer

    Conclusions

    In conclusion, the k-NN regressor is a extremely non-linear, native estimator. It’s so native that solely the Okay closest observations are literally used.

    After implementing k-NN regressor in Excel, I feel that we will actually ask this query: Is the k-NN regressor actually a Machine Studying mannequin?

    • There isn’t a mannequin coaching
    • When predicting, the number of the neighbor observations doesn’t rely upon the worth of goal variable

    However, it’s so simple to know, that with Excel, we will simply implement the entire algorithm. Furthermore, we will alter the gap as we want.

    Businesspeople can see the concept instantly: to foretell a worth, we have a look at related observations.

    The precise issues with k-NN, and all fashions which can be primarily based on distance:

    • the size of the options
    • the heterogeneous nature of options, which makes the sum meaningless
    • the particular distance that must be outlined in concrete conditions
    • for categorical options, label/ordinal encoding may very well be optimized if we may discover the optimum scaling.

    So briefly, the issue is the scaling of the options. We might imagine that they are often tuned as hyperparameters, however then the tuning would require a lot time.

    We are going to see later that it’s precisely the motivation behind one other household of fashions.

    Right here, the notion of scale can be equal to the notion of function significance, as a result of in k-NN the significance of every function is outlined earlier than utilizing the mannequin.

    So that is solely the start of our journey. We are going to uncover collectively different fashions that may do higher, from this straightforward mannequin, by bettering in numerous instructions: function scaling, from distance to likelihood, splitting to higher mannequin every class…



    Source link

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Previous ArticleExploring how AI will shape the future of work | MIT News
    Next Article Driving American battery innovation forward | MIT News
    ProfitlyAI
    • Website

    Related Posts

    Artificial Intelligence

    Three OpenClaw Mistakes to Avoid and How to Fix Them

    March 9, 2026
    Artificial Intelligence

    I Stole a Wall Street Trick to Solve a Google Trends Data Problem

    March 9, 2026
    Artificial Intelligence

    Why Your AI Search Evaluation Is Probably Wrong (And How to Fix It)

    March 9, 2026
    Add A Comment
    Leave A Reply Cancel Reply

    Top Posts

    Machine Learning at Scale: Managing More Than One Model in Production

    March 9, 2026

    AI Might Take Your Job. But These Roles Could Be Your Future

    June 24, 2025

    WayinVideo AI: Features, Benefits, Pricing and Alternatives

    December 25, 2025

    The Architecture Behind Web Search in AI Chatbots

    December 4, 2025

    Meta lanserar fristående AI-app som utmanar ChatGPT

    May 1, 2025
    Categories
    • AI Technology
    • AI Tools & Technologies
    • Artificial Intelligence
    • Latest AI Innovations
    • Latest News
    Most Popular

    The Rise of the “AI-First” Company Is About to Reshape the Future of Work

    May 6, 2025

    Get Started with Rust: Installation and Your First CLI Tool – A Beginner’s Guide

    May 13, 2025

    How to Use LLMs for Powerful Automatic Evaluations

    August 13, 2025
    Our Picks

    Three OpenClaw Mistakes to Avoid and How to Fix Them

    March 9, 2026

    I Stole a Wall Street Trick to Solve a Google Trends Data Problem

    March 9, 2026

    How AI is turning the Iran conflict into theater

    March 9, 2026
    Categories
    • AI Technology
    • AI Tools & Technologies
    • Artificial Intelligence
    • Latest AI Innovations
    • Latest News
    • Privacy Policy
    • Disclaimer
    • Terms and Conditions
    • About us
    • Contact us
    Copyright © 2025 ProfitlyAI All Rights Reserved.

    Type above and press Enter to search. Press Esc to cancel.