Explainable Anomaly Detection with RuleFit: An Intuitive Guide

When you present your anomaly detection results to your stakeholders, the immediate next question is always “why?”.

In practice, simply flagging an anomaly is rarely enough. Understanding what went wrong is essential to determining the best next action.

Yet, most machine learning-based anomaly detection methods stop at producing an anomaly score. They’re black-box in nature, which makes it painful to make sense of their outputs: why does this sample have a higher anomaly score than its neighbors?

To tackle this explainability challenge, you have likely already resorted to popular eXplainable AI (XAI) techniques. Perhaps you’re calculating feature importance to identify which variables are driving the abnormality, or you’re running counterfactual analysis to see how close a case was to normal.

These are useful, but what if you could do more? What if you could derive a set of interpretable IF-THEN rules that characterize the identified anomalies?

This is exactly what the RuleFit algorithm [1] promises.

In this post, we’ll explore how the RuleFit algorithm works intuitively, see how it can be applied to explain detected anomalies, and walk through a concrete case study.

1. How Does It Work?

Before diving into the technical details, let’s first clarify what we aim to have after applying the algorithm: a set of IF-THEN rules that quantitatively characterize the abnormal samples, together with the importance of those rules.

To get there, we need to answer two questions:

(1) How do we generate meaningful IF-THEN conditions from the data?

(2) How do we calculate the rule importance score to determine which ones actually matter?

The RuleFit algorithm addresses these questions by splitting the work into two complementary parts, the “Rule” and the “Fit”.

1.1 The “Rule” in RuleFit

In RuleFit, a rule looks like this:

IF x1 < 10 AND x2 > 5 THEN 1 ELSE 0

Would this structure look a bit more familiar if we visualize it like this?

Figure 1. A rule is just one specific path through a decision tree. (Image by author)

Yes, it’s a decision tree! The rule here is just traversing one specific path through the tree, from the root node to the leaf node.

In RuleFit, the rule generation process relies heavily on building decision trees, which predict the target outcome given the input features. Once the tree is built, any path from the root to a node in the tree can be converted to a decision rule, as we have just seen in the example above.

To ensure the rules are diverse, RuleFit doesn’t just fit one decision tree. Instead, it leverages tree ensemble algorithms (e.g., random forest, gradient boosting trees, etc.) to generate many different decision trees.

Also, the depths of those trees are, in general, different. This brings the benefit of generating rules with variable lengths, further enhancing the diversity.

Here, we should note that although the ensemble trees are built with predicting the target outcome in mind, the RuleFit algorithm does not really care about the end prediction results. It simply uses this tree-building exercise as the vehicle to extract meaningful, quantitative rules.

Effectively, this means that we’ll discard the predicted value in each node and only keep the conditions that lead us to a node. These conditions produce the rules we care about.
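To make the path-to-rule idea concrete, here is a minimal sketch in plain Python. The nested-dictionary tree, the NEGATE table, and the extract_rules helper are all hypothetical names made up for illustration; a real RuleFit implementation walks the trees of a fitted ensemble instead. The point is only that every non-root node yields one rule, and that the predicted values stored in the leaves are never used.

```python
# Hypothetical toy tree: internal nodes hold a test, leaves hold a prediction.
NEGATE = {"<": ">=", ">": "<=", "<=": ">", ">=": "<"}

toy_tree = {
    "test": ("x1", "<", 10),
    "yes": {
        "test": ("x2", ">", 5),
        "yes": {"value": 1},
        "no": {"value": 0},
    },
    "no": {"value": 0},
}

def extract_rules(node, conditions=()):
    """Collect the conditions leading to every non-root node of the tree."""
    rules = [conditions] if conditions else []
    if "test" in node:  # internal node: follow both branches
        feature, op, threshold = node["test"]
        yes_cond = f"{feature} {op} {threshold}"
        no_cond = f"{feature} {NEGATE[op]} {threshold}"
        rules += extract_rules(node["yes"], conditions + (yes_cond,))
        rules += extract_rules(node["no"], conditions + (no_cond,))
    # Leaf predictions ("value") are deliberately ignored.
    return rules

rules = [" AND ".join(conds) for conds in extract_rules(toy_tree)]
for rule in rules:
    print(f"IF {rule} THEN 1 ELSE 0")
```

Running this on the toy tree produces four candidate rules, one of which is the “x1 < 10 AND x2 > 5” rule shown earlier. Rules of length one and length two coexist, which is the variable-length behavior described above.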

Okay, we can now wrap up the first processing step in the RuleFit algorithm: the rule building. The outcome of this step is a pool of candidate rules that could potentially explain the specific data behavior.

But out of all these rules, which ones actually deserve our attention?

Well, this is where the second step of RuleFit comes in. We “fit” to rank.

1.2 The “Fit” in RuleFit

Essentially, RuleFit uncovers the most important rules via feature selection.

First, RuleFit treats each rule as a new binary feature: if the rule is satisfied for a specific sample, the sample gets a value of 1 for this binary feature; otherwise, its value is 0.
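As a quick illustration of this encoding, the sketch below evaluates the example rule from Section 1.1 on a handful of made-up samples. The sample values and variable names are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical samples: the columns are x1 and x2.
X = np.array([
    [3.0, 7.0],    # x1 < 10 and x2 > 5   -> rule is satisfied
    [12.0, 8.0],   # x1 >= 10             -> rule is not satisfied
    [4.0, 2.0],    # x2 <= 5              -> rule is not satisfied
    [9.5, 5.1],    # both conditions hold -> rule is satisfied
])
x1, x2 = X[:, 0], X[:, 1]

# The rule "IF x1 < 10 AND x2 > 5 THEN 1 ELSE 0" as a binary feature.
rule_feature = ((x1 < 10) & (x2 > 5)).astype(int)

# The rule feature is appended next to the raw features.
X_augmented = np.column_stack([X, rule_feature])
print(rule_feature)
print(X_augmented.shape)
```

Each candidate rule contributes one such column, so a pool of many rules simply widens the feature matrix.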

Then, RuleFit performs sparse linear regression with Lasso, using all the “raw” features from the original dataset as well as the newly engineered binary features derived from the rules, to predict the target outcome. This way, each feature (raw features + binary rule features) gets a coefficient.

One key characteristic of Lasso is that its loss function forces the coefficients of unimportant features to be exactly zero. This effectively means these unimportant features are removed from the model.

As a result, by simply examining which binary rule features survived the Lasso analysis, we immediately know which rules are important for getting accurate predictions of the target outcome. In addition, by looking at the coefficient magnitudes associated with the rule features, we can rank the importance of the rules.
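To see the “exactly zero” behavior in action, here is a small sketch that solves a Lasso problem with plain NumPy coordinate descent on synthetic binary rule features. Everything here is an assumption for illustration: the data is randomly generated so that only the first rule feature drives the target, the penalty value is arbitrary, and the solver is a textbook one, not what any particular RuleFit library uses internally.

```python
import numpy as np

def soft_threshold(value, penalty):
    """Shrink value toward zero; anything within +/- penalty becomes exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def lasso_coordinate_descent(X, y, penalty, n_sweeps=50):
    """Minimise (1/2n) * ||y - Xb||^2 + penalty * ||b||_1 on centred data."""
    n_samples, n_features = X.shape
    Xc = X - X.mean(axis=0)   # centring stands in for an intercept
    yc = y - y.mean()
    col_scale = (Xc ** 2).sum(axis=0) / n_samples
    coef = np.zeros(n_features)
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Residual with feature j's current contribution added back in.
            partial_residual = yc - Xc @ coef + Xc[:, j] * coef[j]
            rho = Xc[:, j] @ partial_residual / n_samples
            coef[j] = soft_threshold(rho, penalty) / col_scale[j]
    return coef

# Synthetic data: six binary rule features, only the first one matters.
rng = np.random.default_rng(0)
n_samples = 400
rule_features = rng.integers(0, 2, size=(n_samples, 6)).astype(float)
target = 3.0 * rule_features[:, 0] + rng.normal(0.0, 0.1, size=n_samples)

coef = lasso_coordinate_descent(rule_features, target, penalty=0.1)
surviving = np.flatnonzero(coef)  # indices of rule features that were kept
print("coefficients:", np.round(coef, 3))
print("surviving rule features:", surviving)
```

The soft-thresholding step is what sets small coefficients to exactly zero, so the irrelevant rule features drop out while the informative one keeps a (slightly shrunken) coefficient.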

1.3 Recap

We’ve just covered the essential theory behind the RuleFit algorithm. To summarize, we can view this approach as a two-step solution for providing explainability:

(1) It first extracts the rules by training an ensemble of decision trees. That’s the “Rule” part.

(2) It then cleverly converts these rules into binary features and performs standard feature selection by using sparse linear regression (Lasso). That’s the “Fit” part.

Finally, the surviving rules with non-zero coefficients are the important ones that are worth our attention.

At this point, you may have noticed that “predicting the target outcome” pops up at both the “Rule” and “Fit” steps. If we’re dealing with a regression or classification problem, it’s easy to understand that the “target outcome” is the numerical value or the label we want to predict, and the rules can be interpreted as patterns that drive the prediction.

But what about anomaly detection, which is largely an unsupervised task? How can we apply RuleFit there?

2. Anomaly Explanation with RuleFit

2.1 Application Pattern

First of all, we need to transform the unsupervised explainability problem into a supervised one. Here’s how.

Once we have our anomaly detection results (it doesn’t matter which algorithm we used), we can create binary labels, i.e., 1 for an identified anomaly and 0 for a normal data point, as our “target outcome.” This way, we have exactly what RuleFit needs: the raw features, and the target outcome to predict.

Then, RuleFit can work its magic to generate a pool of candidate rules and fit a sparse linear model to retain only the important rules. The coefficients of the resulting model then indicate how much each rule contributes to the log-odds of an instance being classified as an anomaly. To put it another way, they tell us which rule combinations most strongly push a sample toward being labeled as anomalous.

Note that you can, in theory, also use the anomaly score (produced by the primary anomaly detection model) as the “target outcome”. This would change the application of RuleFit from a classification setting to a regression setting.

Both approaches are valid, but they answer slightly different questions: with the binary label classification setting, RuleFit uncovers “What makes something an anomaly?”; with the anomaly score regression setting, RuleFit uncovers “What drives the severity of an anomaly?”.

In practice, the rules generated by both approaches will probably be very similar. However, using a binary anomaly label as the target is the more common choice for explaining detected anomalies. It’s straightforward to interpret and directly applicable to creating business rules for flagging future anomalies.
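The difference between the two settings comes down to which array we hand over as the target. The sketch below uses made-up anomaly scores and an arbitrary threshold, both of which are assumptions for illustration, to show the two options side by side.

```python
import numpy as np

# Hypothetical anomaly scores from any detector (higher = more anomalous).
scores = np.array([0.9, 1.1, 1.0, 3.2, 0.8, 2.7])
threshold = 2.0  # hypothetical cut-off separating "normal" from "anomaly"

# Classification setting: "What makes something an anomaly?"
y_class = (scores > threshold).astype(int)

# Regression setting: "What drives the severity of an anomaly?"
y_reg = scores.copy()

print("binary labels:", y_class)
print("score target: ", y_reg)
```

The binary array would go to a rule-based classifier, while the score array would go to a regressor; the raw feature matrix stays the same in both cases.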

2.2 Case Study

Let’s walk through a concrete example to see how RuleFit works in action. Here, we’ll create an anomaly detection scenario using the Iris dataset [2] (licensed CC BY 4.0), where each sample consists of four features (sepal_length, sepal_width, petal_length, petal_width) and is labeled as one of the following three categories: Setosa, Versicolor, and Virginica.

Step 1: Data Setup

First, we’ll use all Setosa samples (50) and all Versicolor samples (50) as the “normal” samples. For the “abnormal” samples, we’ll use a subset of Virginica samples (10).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
np.random.seed(42)

# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y_true = iris.target

# Get normal samples (Setosa + Versicolor)
normal_mask = (y_true == 0) | (y_true == 1)
X_normal_all = X[normal_mask].copy()

# Get Virginica samples
virginica_mask = (y_true == 2)
X_virginica = X[virginica_mask].copy()

# Randomly select 10 Virginica samples as the anomalies
anomaly_indices = np.random.choice(len(X_virginica), size=10, replace=False)
X_anomalies = X_virginica.iloc[anomaly_indices].copy()

To make the scenario more realistic, we create a separate training set and test set. The training set contains only “normal” samples, while the test set consists of 20 randomly sampled “normal” samples and 10 “abnormal” samples.

# Use 80 normal samples for training; the remaining 20 go to the test set
train_indices = np.random.choice(len(X_normal_all), size=80, replace=False)
test_indices = np.setdiff1d(np.arange(len(X_normal_all)), train_indices)

X_train = X_normal_all.iloc[train_indices].copy()
X_normal_test = X_normal_all.iloc[test_indices].copy()

# Create test set (20 normal + 10 anomalous)
X_test = pd.concat([X_normal_test, X_anomalies], ignore_index=True)
y_test_true = np.concatenate([
    np.zeros(len(X_normal_test)),   # 0 = normal
    np.ones(len(X_anomalies))       # 1 = anomaly
])

Step 2: Anomaly Detection

Next, we perform anomaly detection. Here, we pretend we don’t know the actual labels. In this case study, we apply Local Outlier Factor (LOF) as the anomaly detection algorithm, which locates anomalies by measuring how isolated a data point is compared to the density of its local neighbors. Of course, you can also try other anomaly detection algorithms, such as Gaussian Mixture Models (GMM), K-Nearest Neighbors (KNN), and Autoencoders, among others. However, keep in mind that the intention here is only to get the detection results; our main focus is the anomaly explanation in Step 3.

Specifically, we’ll use the PyOD library to train the model and make inferences:

# Install the PyOD library
#!pip install pyod

from pyod.models.lof import LOF

# Standardize features (fit the scaler on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Local Outlier Factor
lof = LOF(n_neighbors=3)
lof.fit(X_train_scaled)

# Anomaly scores (higher = more anomalous)
train_scores = lof.decision_function(X_train_scaled)
test_scores = lof.decision_function(X_test_scaled)

# Flag test samples whose score exceeds the 99th percentile of training scores
threshold = np.percentile(train_scores, 99)
y_pred = (test_scores > threshold).astype(int)

Notice that we have used the 99% quantile of the anomaly scores obtained on the training set as the threshold. If a test sample’s anomaly score is higher than the threshold, the sample is labeled as an “anomaly”. Otherwise, the sample is considered “normal”.
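If the effect of a percentile-based threshold feels abstract, the following sketch may help. It uses synthetic “training” scores drawn from a normal distribution, which is purely an assumption for illustration and has nothing to do with the actual LOF scores, to show that a 99th-percentile cut-off leaves roughly 1% of the training samples above it.

```python
import numpy as np

# Hypothetical training scores; real detector scores will look different.
rng = np.random.default_rng(0)
train_scores = rng.normal(loc=1.0, scale=0.1, size=1000)

# 99% of the training scores lie at or below this value.
threshold = np.percentile(train_scores, 99)
flag_rate = np.mean(train_scores > threshold)

# New samples are flagged only if they score above the threshold.
new_scores = np.array([1.0, 2.0])
is_anomaly = (new_scores > threshold).astype(int)

print("threshold:", round(float(threshold), 3))
print("share of training samples flagged:", flag_rate)
print("new sample labels:", is_anomaly)
```

Choosing a lower percentile would flag more samples, trading fewer missed anomalies for more false alarms.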

At this stage, we can quickly check the detection performance with:

print(classification_report(y_test_true, y_pred, target_names=['Normal', 'Anomaly']))

Not super great results. Out of 10 true anomalies, only 5 are caught. However, the good news is that LOF didn’t produce any false positives. You can further improve the performance by tuning the LOF model hyperparameters, adjusting the threshold, or even considering ensemble learning techniques. But keep in mind: our goal here is not to get the best detection accuracy. Instead, we aim to see if RuleFit can properly generate rules to explain the anomalies detected by the LOF model.
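For reference, the counts quoted above (5 of 10 anomalies caught, no false positives, 20 normal test samples) translate into the following metrics for the “Anomaly” class. This is simple arithmetic on those counts, not output copied from the actual report.

```python
# Confusion-matrix counts implied by the text above.
tp, fn = 5, 5    # 10 true anomalies, 5 of them caught
fp, tn = 0, 20   # 20 normal test samples, none flagged

precision = tp / (tp + fp)                            # flagged samples that are real anomalies
recall = tp / (tp + fn)                               # real anomalies that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Perfect precision with a recall of one half is typical of a conservative threshold: what gets flagged is trustworthy, but half of the anomalies slip through.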

Step 3: Anomaly Explanation

Now we’re getting to the core topic. To apply RuleFit, let’s first install the imodels library, a sklearn-compatible, interpretable ML package for concise, transparent, and accurate predictive modeling:

pip install imodels

In this case, we’ll consider a binary label classification setting, where the abnormal samples (in the test set) flagged by the LOF model are labeled as 1, and the other un-flagged normal samples (also in the test set) are labeled as 0. Note that we’re labeling based on LOF’s detection results, not the actual ground truth, which we pretend we don’t know.

To initiate the RuleFit model:

from imodels import RuleFitClassifier

rf = RuleFitClassifier(                 
        max_rules = 30,           
        lin_standardise=True,           
        include_linear=True,           
        random_state = 42
)

We can then proceed with fitting the RuleFit model:

rf.fit(
    X_test, 
    y_pred, 
    feature_names=X_test.columns
)

It’s usually good practice to do a quick sanity check to evaluate how well the RuleFit model’s predictions align with the anomaly labels determined by the LOF algorithm:

from sklearn.metrics import accuracy_score, roc_auc_score

y_label = rf.predict(X_test)               
y_prob  = rf.predict_proba(X_test)[:, 1]   

print("accuracy:", accuracy_score(y_pred, y_label))
print("roc-auc:", roc_auc_score (y_pred, y_prob))

For our case, we see that both printouts are 1. This confirms that the RuleFit model has successfully learned the patterns that LOF used to identify anomalies. For your own problems, if you observe values much lower than 1, you would need to fine-tune your RuleFit hyperparameters.

Now let’s examine the rules:

rules = rf._get_rules()
rules = rules[rules.coef != 0]                            # keep rules that survived the Lasso
rules = rules[~rules.type.str.contains('linear')].copy()  # drop the raw (linear) features
rules['abs_coef'] = rules['coef'].abs()
rules = rules.sort_values('importance', ascending=False)

The RuleFit algorithm returns a total of 24 rules. A snapshot is shown below:

Let’s first clarify the meaning of the result columns:

The “rule” column and the “abs_coef” column are self-explanatory.
The “type” column has two unique values: “linear” and “rule”. “linear” denotes the original input features, while “rule” denotes the “IF-THEN” conditions generated from decision trees.
The “coef” column represents the coefficients produced by the Lasso regression analysis. A positive value indicates that if the rule applies, the log-odds of being classified as the abnormal class increase. A larger magnitude indicates a stronger influence of that rule on the prediction.
The “support” column records the fraction of data samples where the rule applies.
The “importance” column is calculated as the absolute value of the coefficient multiplied by the standard deviation of the binary (0 or 1) values that the rule takes on. So why this calculation? As we have just discussed, a larger absolute coefficient means a stronger direct influence on the log-odds. That’s clear. The standard deviation term effectively measures the “discriminative power” of the rule. For example, if a rule is almost always TRUE (very small standard deviation), it doesn’t split your data effectively. The same holds if the rule is almost always FALSE. In other words, the rule can’t explain much of the variation in the target variable. Therefore, the importance score combines both the strength of the rule’s influence (coefficient magnitude) and how well it discriminates between different samples (standard deviation).
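The importance calculation described above can be sketched in a few lines. For a binary feature with support s, the standard deviation is sqrt(s * (1 - s)). The rule_importance helper and the coefficient value of 2.0 are hypothetical and only serve to show how support changes the score; the library’s exact implementation may differ in detail.

```python
import math

def rule_importance(coef, support):
    """|coefficient| times the standard deviation of a 0/1 rule feature."""
    return abs(coef) * math.sqrt(support * (1.0 - support))

coef = 2.0  # hypothetical coefficient, identical for all three rules
for support in (0.01, 0.5, 0.99):
    print(f"support={support:.2f} -> importance={rule_importance(coef, support):.3f}")
```

With the same coefficient, a rule that fires for half of the samples scores far higher than one that is almost always FALSE or almost always TRUE, which is exactly the “discriminative power” argument.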

For our specific case, we see only one high-impact rule (Rule #24):

If a flower’s petal is longer than 5.45 cm and wider than 2 cm, the odds that LOF classifies it as “anomalous” increase 85-fold. (Note that exp(4.448999) ≈ 85.)

Rules #26 and #27 are nested within Rule #24. This is common in practice, as RuleFit often produces “families” of similar rules because they come from neighbouring tree splits. Therefore, the only rule that really matters for characterizing the LOF-identified anomalies is Rule #24.

Also, we see that the support for Rule #24 is 0.1667 (5/30). This effectively means that all 5 LOF-identified anomalies can be explained by this rule. We can see that more clearly in the figure below:

There you have it: the rule to describe the identified anomalies!

3. Conclusion

In this blog post, we explored the RuleFit algorithm as a powerful solution for explainable anomaly detection. We discussed:

How it works: A two-step approach where decision trees are first fitted to derive meaningful rules, followed by a sparse linear regression to rank the rule importance.
How to apply it to anomaly explanation: Take the detection results as pseudo labels and use them as the “target outcome” for the RuleFit model.

With RuleFit in your modeling toolkit, the next time stakeholders ask “Why is this an anomaly?”, you’ll have concrete IF-THEN rules that they can understand and act upon.

Reference

[1] Jerome H. Friedman, Bogdan E. Popescu, Predictive learning via rule ensembles, arXiv, 2008.

[2] Fisher, R. A., Iris [Data set]. UCI Machine Learning Repository, 1936.
