In this article, we are going to implement AUC in Excel.
AUC is commonly used as a performance metric for classification tasks.
But we will start with a confusion matrix, because that is where everyone starts in practice. Then we will see why a single confusion matrix is not enough.
And we will also answer these questions:
- AUC means Area Under the Curve, but under which curve?
- Where does that curve come from?
- Why is the area meaningful?
- Is AUC a probability? (Yes, it has a probabilistic interpretation)
1. Why a confusion matrix is not enough
1.1 Scores from models
A classifier usually gives us scores, not final decisions. The decision comes later, when we choose a threshold.
If you read the previous "Advent Calendar" articles, you have already seen that "score" can mean different things depending on the model family:
- Distance-based models (such as k-NN) often compute the proportion of neighbors belonging to a given class (or a distance-based confidence), which becomes a score.
- Density-based models compute a likelihood under each class, then normalize to get a final (posterior) probability.
- Tree-based classification models often output the proportion of a given class among the training samples inside the leaf (which is why many points share the same score).
- Weight-based models (linear models, kernels, neural networks) compute a weighted sum or a non-linear score, and sometimes apply a calibration step (sigmoid, softmax, Platt scaling, etc.) to map it to a probability.
So no matter the approach, we end up with the same situation: one score per observation.
Then, in practice, we pick a threshold, often 0.5, and we convert scores into predicted classes.
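In Excel, this conversion is a single formula. As a minimal sketch, assuming the first score sits in cell B2 and the chosen threshold in cell E1 (both cell positions are assumptions about your layout):
Predicted class: =IF(B2>=$E$1, 1, 0)
Copied down the column, this turns every score into a predicted class of 1 or 0.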
And this is exactly where the confusion matrix enters the story.
1.2 The confusion matrix at one threshold
Once a threshold is chosen, every observation becomes a binary decision:
- predicted positive (1) or predicted negative (0)
From that, we can count four numbers:
- TP (True Positives): predicted 1 and actually 1
- TN (True Negatives): predicted 0 and actually 0
- FP (False Positives): predicted 1 but actually 0
- FN (False Negatives): predicted 0 but actually 1
This 2×2 counting table is the confusion matrix.
Then we typically compute ratios such as:
- Precision = TP / (TP + FP)
- Recall (TPR) = TP / (TP + FN)
- Specificity = TN / (TN + FP)
- FPR = FP / (FP + TN)
- Accuracy = (TP + TN) / Total
So far, everything is clear and intuitive.
But there is a hidden limitation: all these values depend on the threshold. So the confusion matrix evaluates the model at one operating point, not the model itself.
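To make that dependence on the threshold concrete, here is a minimal sketch of the single-threshold confusion matrix in Excel, assuming scores in B2:B21, true labels (1/0) in C2:C21, and the threshold in cell E1 (this layout is an assumption; adapt the ranges to your sheet):
TP: =COUNTIFS($B$2:$B$21, ">="&$E$1, $C$2:$C$21, 1)
FP: =COUNTIFS($B$2:$B$21, ">="&$E$1, $C$2:$C$21, 0)
FN: =COUNTIFS($B$2:$B$21, "<"&$E$1, $C$2:$C$21, 1)
TN: =COUNTIFS($B$2:$B$21, "<"&$E$1, $C$2:$C$21, 0)
The ratios above are then simple divisions of these four cells, and changing E1 changes every number at once.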
1.3 When one threshold breaks everything
This is an unusual example, but it still makes the point very clearly.
Imagine that your threshold is set to 0.50, and all scores are below 0.50.
Then the classifier predicts:
- Predicted Positive: none
- Predicted Negative: everyone
So you get:
- TP = 0, FP = 0
- FN = 10, TN = 10

This is a perfectly valid confusion matrix. It also creates a very strange feeling:
- Precision becomes #DIV/0! because there are no predicted positives.
- Recall is 0% because you did not capture any positive.
- Accuracy is 50%, which sounds "not too bad", even though the model learned nothing.
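If you want the sheet to stay readable in this degenerate case, one option is to guard the division. A small sketch, assuming TP sits in F2 and FP in F3 (again an assumed layout):
Precision: =IFERROR(F2/(F2+F3), "no predicted positives")
The #DIV/0! error is replaced by an explicit message.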
Nothing is wrong with the confusion matrix. The issue is the question we asked it to answer.
A confusion matrix answers: "How good is the model at this specific threshold?"
If the threshold is poorly chosen, the confusion matrix can make a model look useless, even when the scores contain real separation.
And in your table, the separation is visible: positives often have scores around 0.49, while negatives sit more around 0.20 or 0.10. The model is not random. Your threshold is just too strict.
That is why a single threshold is not enough.
What we need instead is a way to evaluate the model across thresholds, not at a single one.
2. ROC
First we have to build the curve: AUC stands for Area Under a Curve, so we have to understand this curve.
2.1 What ROC means (and what it is)
Because the first question everyone should ask is: AUC under which curve?
The answer is:
AUC is the area under the ROC curve.
But this raises another question.
What is the ROC curve, and where does it come from?
ROC stands for Receiver Operating Characteristic. The name is historical (early signal detection), but the idea is modern and simple: it describes what happens when you vary the decision threshold.
The ROC curve is a plot with:
- x-axis: FPR (False Positive Rate), where FPR = FP / (FP + TN)
- y-axis: TPR (True Positive Rate), also called Recall or Sensitivity, where TPR = TP / (TP + FN)
Each threshold gives one point (FPR, TPR). When you connect all the points, you get the ROC curve.
At this stage, one detail matters: the ROC curve is not directly observed; it is constructed by sweeping the threshold over the score ordering.
2.2 Building the ROC curve from scores
We can use each score as a threshold (and of course, we could also define custom thresholds).
For each threshold:
- we compute TP, FP, FN, TN from the confusion matrix
- then we calculate FPR and TPR
So the ROC curve is simply the collection of all these (FPR, TPR) pairs, ordered from strict thresholds to permissive thresholds.
This is exactly what we are going to implement in Excel.

At this point, it is important to notice something that feels almost too simple. When we build the ROC curve, the exact numeric values of the scores do not matter. What matters is the order.
If one model outputs scores between 0 and 1, another outputs scores between -12 and +5, and a third outputs only two distinct values, ROC works the same way. As long as higher scores tend to correspond to the positive class, the threshold sweep will create the same sequence of decisions.
That is why the first step in Excel is always the same: sort by score from highest to lowest. Once the rows are in the right order, the rest is just counting.
2.3 Reading the ROC curve
In the Excel sheet, the construction becomes very concrete.
You sort observations by score, from highest to lowest. Then you walk down the list. At each row, you act as if the threshold is set to that score, meaning: that row and everything above it are predicted positive.
That lets Excel compute cumulative counts:
- how many positives you have accepted so far
- how many negatives you have accepted so far
From these cumulative counts and the dataset totals, we compute TPR and FPR.
Now every row is one ROC point.
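As a sketch of those cumulative columns, still assuming the labels (1/0) are in C2:C21 and the rows are sorted by descending score (column letters are assumptions; enter the formulas in row 2 and copy down):
Cumulative TP (in D2): =COUNTIF($C$2:C2, 1)
Cumulative FP (in E2): =COUNTIF($C$2:C2, 0)
TPR (in F2): =D2/COUNTIF($C$2:$C$21, 1)
FPR (in G2): =E2/COUNTIF($C$2:$C$21, 0)
The anchored start of the range ($C$2) and the relative end (C2) are what make the counts cumulative as the formulas are copied down.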
Why the ROC curve looks like a staircase
- When the next accepted row is a positive, TP increases, so TPR increases while FPR stays flat.
- When the next accepted row is a negative, FP increases, so FPR increases while TPR stays flat.
That is why, with real finite data, the ROC curve is a staircase. Excel makes this visible.
2.4 Reference cases you should recognize
A few reference cases help you read the curve immediately:
- Perfect classification: the curve goes straight up (TPR reaches 1 while FPR stays at 0), then goes right along the top.

- Random classifier: the curve stays near the diagonal line from (0,0) to (1,1).

- Inverted ranking: the curve falls "below" the diagonal, and the AUC becomes smaller than 0.5. In this case we just have to replace the scores with 1 - score. In theory, we can treat this as a fictive case; in practice, it usually happens when scores are interpreted in the wrong direction or class labels are swapped.

These aren’t simply principle. They’re visible anchors. Upon getting them, you’ll be able to interpret any actual ROC curve shortly.
3. ROC AUC
Now, with the curve, what can we do?
3.1 Computing the area
Once the ROC curve exists as a list of points (FPR, TPR), the AUC is pure geometry.
Between two consecutive points, the area added is the area of a trapezoid:
- width = change in FPR
- height = average TPR of the two points
In Excel, this becomes a "delta column" approach:
- compute dFPR between consecutive rows
- multiply by the average TPR
- sum everything
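A compact version of that delta-column sum, assuming TPR in column F and FPR in column G, with an extra row inserted at the top to hold the (0, 0) point so that the ROC points run from row 2 to row 22 (the ranges are assumptions about your layout):
AUC: =SUMPRODUCT((G3:G22-G2:G21), (F3:F22+F2:F21)/2)
Each term is the width of a step (the change in FPR) multiplied by the average TPR of its two endpoints, which is exactly the trapezoid rule.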

Different cases:
- perfect classification: AUC = 1
- random ranking: AUC ≈ 0.5
- inverted ranking: AUC < 0.5
So the AUC is really the summary of the whole ROC staircase.
3.2 AUC as a probability
AUC is not about choosing a threshold.
It answers a much simpler question:
If I randomly pick one positive example and one negative example, what is the probability that the model assigns a higher score to the positive one?
That’s all.
- AUC = 1.0 means perfect ranking (the positive always gets a higher score)

- AUC = 0.5 means random ranking (it is basically a coin flip)

- AUC < 0.5 means the ranking is inverted (negatives tend to get higher scores)
This interpretation is extremely useful, because it explains this essential point once more:
AUC only depends on the score ordering, not on the absolute values.
This is why ROC AUC works even when the "scores" are not perfectly calibrated probabilities. They can be raw scores, margins, leaf proportions, or any monotonic confidence value. As long as higher means "more likely positive", AUC can evaluate the ranking quality.
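You can check this probabilistic reading directly in the sheet. A sketch, assuming the positive scores sit in a range named PosScores and the negative scores in a range named NegScores (both names are assumptions), with ties counted as half:
=SUMPRODUCT((PosScores>TRANSPOSE(NegScores)) + 0.5*(PosScores=TRANSPOSE(NegScores))) / (COUNT(PosScores)*COUNT(NegScores))
In older Excel versions this may need to be confirmed as an array formula (Ctrl+Shift+Enter). With no tied scores, the result matches the trapezoid AUC from section 3.1 exactly.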
Conclusion
A confusion matrix evaluates a model at one threshold, but classifiers produce scores, not decisions.
ROC and AUC evaluate the model across all thresholds by focusing on ranking, not calibration.
In the end, AUC answers a simple question: how often does a positive example receive a higher score than a negative one?
Seen this way, ROC AUC is an intuitive metric, and a spreadsheet is enough to make every step explicit.
