- SHAP KernelExplainer takes ~30 ms per prediction (even with a small background)
- A neuro-symbolic model generates explanations inside the forward pass in 0.9 ms
- That’s a 33× speedup with deterministic outputs
- Fraud recall is identical (0.8469), with only a small AUC drop
- No separate explainer, no randomness, no extra latency cost
- All code runs on the Kaggle Credit Card Fraud Detection dataset [1]
Full code: https://github.com/Emmimal/neuro-symbolic-xai-fraud/
The Moment the Problem Became Real
I was debugging a fraud detection system late one night and needed to understand why the model had flagged a particular transaction. I called KernelExplainer, passed in my background dataset, and waited. Three seconds later I had a bar chart of feature attributions. I ran it again to double-check a value and got slightly different numbers.
That’s when I realised there was a structural limitation in how explanations were being generated. The model was deterministic. The explanation was not. I was explaining a consistent decision with an inconsistent method, and neither the latency nor the randomness was acceptable if this ever had to run in real time.
This article is about what I built instead, what it cost in performance, and what it got right, along with one result that surprised me.
If explanations can’t be produced instantly and consistently, they can’t be used in real-time fraud systems.
Key Insight: Explainability is not a post-processing step. It should be part of the model architecture.
Limitations of SHAP in Real-Time Settings
To be precise about what SHAP actually does: Lundberg and Lee’s SHAP framework [2] computes Shapley values (a concept from cooperative game theory [3]) that attribute a model’s output to its input features. KernelExplainer, the model-agnostic variant, approximates these values using a weighted linear regression over a sampled coalition of features. The background dataset acts as a baseline, and nsamples controls how many coalitions are evaluated per prediction.
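To make the attribution idea concrete, here is a toy sketch (not SHAP’s implementation) that computes exact Shapley values by enumerating every feature coalition, with absent features replaced by a baseline value. For a linear model the attributions reduce to w_i * (x_i - baseline_i), which makes the result easy to check by hand:

```python
from itertools import combinations
from math import factorial

def exact_shapley(model, x, baseline, n_features):
    """Exact Shapley values by enumerating all feature coalitions.
    Features absent from a coalition are set to their baseline value."""
    phi = [0.0] * n_features
    for i in range(n_features):
        for size in range(n_features):
            for S in combinations([j for j in range(n_features) if j != i], size):
                # Shapley kernel weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
                with_i = [x[j] if (j in S or j == i) else baseline[j] for j in range(n_features)]
                without_i = [x[j] if j in S else baseline[j] for j in range(n_features)]
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi

# Toy linear model: for f(x) = sum(w * x), the exact Shapley value of
# feature i is w_i * (x_i - baseline_i).
w = [2.0, -1.0, 0.5]
model = lambda v: sum(wi * vi for wi, vi in zip(w, v))
x, baseline = [1.0, 3.0, -2.0], [0.0, 0.0, 0.0]
phi = exact_shapley(model, x, baseline, 3)
# phi is approximately [2.0, -3.0, -1.0]; contributions sum to f(x) - f(baseline)
```

KernelExplainer approximates exactly this quantity, but by sampling coalitions rather than enumerating all 2^n of them, which is where both the latency and the run-to-run variance come from.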
This approximation is extremely useful for model debugging, feature selection, and post-hoc analysis.
The limitation examined here is narrower but significant: when explanations have to be generated at inference time, attached to individual predictions, under real-time latency constraints.
When you attach SHAP to a real-time fraud pipeline, you’re running an approximation algorithm that:
- Depends on a background dataset you must maintain and pass at inference time
- Produces results that shift depending on nsamples and the random state
- Takes 30 ms per sample at a reduced configuration
The chart below shows what that post-hoc output looks like: a global feature ranking computed after the prediction was already made.
In the benchmark I ran on the Kaggle creditcard dataset [1], SHAP itself printed a warning:
Using 200 background data samples could cause slower run times.
Consider using shap.sample(data, K) or shap.kmeans(data, K)
to summarize the background as K samples.
This highlights the trade-off between background size and computational cost in SHAP. 30 ms at 200 background samples is the lower bound. Larger backgrounds, which improve attribution stability, push the cost higher.
The neuro-symbolic model I built takes 0.898 ms for the prediction and explanation together. There is no floor to worry about because there is no separate explainer.
The Dataset
All experiments use the Kaggle Credit Card Fraud Detection dataset [1], covering 284,807 real credit card transactions from European cardholders in September 2013, of which 492 are confirmed fraud.
Shape         : (284807, 31)
Fraud rate    : 0.1727%
Fraud samples : 492
Legit samples : 284,315
The features V1 through V28 are PCA-transformed principal components. The original features are anonymised and not disclosed in the dataset. Amount is the transaction value. Time was dropped.
Amount was scaled with StandardScaler. I applied SMOTE [4] exclusively to the training set to address the class imbalance. The test set was kept at the real-world 0.17% fraud distribution throughout.
Train size after SMOTE : 454,902
Fraud rate after SMOTE : 50.00%
Test set               : 56,962 samples | 98 confirmed fraud
The test set structure is important: 98 fraud cases out of 56,962 samples is the actual operating condition of this problem. Any model that scores well here is doing so on a genuinely hard task.
Two Models, One Comparison
The Baseline: Standard Neural Network
The baseline is a four-layer MLP with batch normalisation [5] and dropout [6], a typical architecture for tabular fraud detection.
import torch.nn as nn

class FraudNN(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128), nn.BatchNorm1d(128),
            nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128, 64), nn.BatchNorm1d(64),
            nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
It makes a prediction and nothing else. Explaining that prediction requires a separate SHAP call.
The Neuro-Symbolic Model: Explanation as Architecture
The neuro-symbolic model has three components working together: a neural backbone, a symbolic rule layer, and a fusion layer that combines both signals.
The neural backbone learns latent representations from all 29 features. The symbolic rule layer runs six differentiable rules in parallel, each computing a soft activation between zero and one using a sigmoid function. The fusion layer takes both outputs and produces the final probability.
class NeuroSymbolicFraudDetector(nn.Module):
    """
    Input
      |--- Neural Backbone (latent fraud representations)
      |--- Symbolic Rule Layer (6 differentiable rules)
      |
    Fusion Layer --> P(fraud) + rule_activations
    """
    def __init__(self, input_dim, feature_names):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(input_dim, 64), nn.BatchNorm1d(64),
            nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64, 32), nn.BatchNorm1d(32), nn.ReLU(),
        )
        self.symbolic = SymbolicRuleLayer(feature_names)
        self.fusion = nn.Sequential(
            nn.Linear(32 + 1, 16), nn.ReLU(),  # 32 from backbone + 1 from symbolic layer (weighted rule activation summary)
            nn.Linear(16, 1), nn.Sigmoid(),
        )

The six symbolic rules are anchored to the creditcard features with the strongest published fraud signal [7, 8]: V14, V17, V12, V10, V4, and Amount.
RULE_NAMES = [
"HIGH_AMOUNT", # Amount exceeds threshold
"LOW_V17", # V17 below threshold
"LOW_V14", # V14 below threshold (strongest signal)
"LOW_V12", # V12 below threshold
"HIGH_V10_NEG", # V10 heavily negative
"LOW_V4", # V4 below threshold
]
Each threshold is a learnable parameter initialised with a domain prior and updated during training via gradient descent. This means the model doesn’t just use rules. It learns where to draw the lines.
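To illustrate the mechanism, here is a standalone sketch (toy data and a hand-derived gradient, not the repository’s SymbolicRuleLayer) of a single soft rule whose threshold is learned by gradient descent. The rule activates smoothly when the feature falls below the threshold, and the threshold moves from its prior toward a boundary that separates the two classes:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Soft rule "LOW_V14"-style: activation approaches 1 when the feature is
# below the threshold t. k controls how sharp the soft step is.
k = 10.0
t = 0.0  # initialised with a (hypothetical) domain prior

# Toy data: (feature value, fraud label). Fraud rows cluster below ~-0.45.
data = [(-0.6, 1), (-0.5, 1), (-0.3, 0), (0.2, 0), (-0.55, 1), (0.4, 0)]

lr = 0.2
for _ in range(200):
    grad = 0.0
    for x, y in data:
        a = sigmoid(k * (t - x))  # rule activation in (0, 1)
        # squared-error loss; chain rule: dL/dt = 2(a - y) * a(1 - a) * k
        grad += 2 * (a - y) * a * (1 - a) * k
    t -= lr * grad / len(data)
# t has moved from 0.0 down toward the region separating the two groups
```

The full model does the same thing through autograd, with six rules in parallel; this sketch just makes visible that a threshold is an ordinary trainable parameter.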
The explanation is a by-product of the forward pass. When the symbolic layer evaluates the six rules, it already has everything it needs to produce a human-readable breakdown. Calling predict_with_explanation() returns the prediction, confidence, which rules fired, the observed values, and the learned thresholds, all in a single forward pass at no extra cost.
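The shape of that output can be sketched as follows. This is an illustrative stand-in, not the repository’s actual predict_with_explanation(); the thresholds and weights are example values, and the helper name `explain` is hypothetical:

```python
def explain(features, rules):
    """Assemble a human-readable breakdown from precomputed rule definitions.
    `rules` maps rule name -> (feature, op, threshold, weight); all values
    here are illustrative, not the trained model's actual parameters."""
    fired = []
    for name, (feat, op, thr, weight) in rules.items():
        value = features[feat]
        hit = value > thr if op == ">" else value < thr
        if hit:
            fired.append({"rule": name, "value": value,
                          "threshold": thr, "weight": weight})
    return fired

rules = {
    "LOW_V17":     ("V17", "<", -0.135, 0.081),
    "LOW_V14":     ("V14", "<", -0.440, 0.071),
    "HIGH_AMOUNT": ("Amount", ">", -0.011, 0.121),
}
fired = explain({"V17": -0.553, "V14": -0.582, "Amount": -0.5}, rules)
# LOW_V17 and LOW_V14 fire; HIGH_AMOUNT does not (-0.5 is below -0.011)
```

The point is structural: once the rule activations exist inside the forward pass, turning them into a readable record is a dictionary lookup, not a second algorithm.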
Training
Both models were trained for 40 epochs using Adam [9] with weight decay and a step learning rate scheduler.
[Baseline NN]     Epoch 40/40  train=0.0067  val=0.0263
[Neuro-Symbolic]  Epoch 40/40  train=0.0030  val=0.0099
The neuro-symbolic model converges to a lower validation loss. Both curves are clean with no sign of instability from the symbolic components.

Performance on the Real-World Test Set
[Baseline NN]
              precision    recall  f1-score   support
       Legit     0.9997    0.9989    0.9993     56864
       Fraud     0.5685    0.8469    0.6803        98
ROC-AUC : 0.9737
[Neuro-Symbolic]
              precision    recall  f1-score   support
       Legit     0.9997    0.9988    0.9993     56864
       Fraud     0.5425    0.8469    0.6614        98
ROC-AUC : 0.9688
Recall on fraud is identical: 0.8469 for both models. The neuro-symbolic model catches exactly the same proportion of fraud cases as the unconstrained black-box baseline.
The precision difference (0.5425 vs 0.5685) means the neuro-symbolic model generates a few more false positives. Whether that’s acceptable depends on the cost ratio between false positives and missed fraud in your specific deployment. The ROC-AUC gap (0.9688 vs 0.9737) is small.
The point is not that the neuro-symbolic model is more accurate. It’s that it’s comparably accurate while producing explanations that the baseline cannot produce at all.
What the Model Actually Learned
After 40 epochs, the symbolic rule thresholds are no longer initialised priors. The model learned them.
Rule           Learned Threshold            Weight
--------------------------------------------------------------
HIGH_AMOUNT    Amount > -0.011 (scaled)     0.121
LOW_V17        V17 < -0.135                 0.081
LOW_V14        V14 < -0.440                 0.071
LOW_V12        V12 < -0.300                 0.078
HIGH_V10_NEG   V10 < -0.320                 0.078
LOW_V4         V4 < -0.251                  0.571
The thresholds for V14, V17, V12, and V10 are consistent with what published EDA on this dataset has identified as the strongest fraud signals [7, 8]. The model learned them through gradient descent, not manual specification.
But there is something unusual in the weight column: LOW_V4 carries 0.571 of the total symbolic weight, while the other five rules share the remaining 0.429. One rule dominates the symbolic layer by a wide margin.
This is the result I didn’t expect, and it’s worth being direct about what it means. The rule_weights are passed through a softmax during training, which in principle prevents any single weight from collapsing to 1. But softmax doesn’t enforce uniformity. It just normalises. With sufficient gradient signal, one rule can still accumulate most of the weight if the feature it covers is strongly predictive across the training distribution.
V4 is a known fraud signal on this dataset [7], but this level of dominance suggests the symbolic layer is behaving more like a single-feature gate than a multi-rule reasoning system during inference. For the model’s predictions this isn’t a problem, since the neural backbone is still doing the heavy lifting on latent representations. But for the explanations, it means that on many transactions the symbolic layer’s contribution is largely determined by a single rule.
I’ll come back to what should be done about this.
The Benchmark
The central question: how long does it take to produce an explanation, and does the output have the properties you need in production?
I ran both explanation methods on 100 test samples.
All latency measurements were taken on CPU (Intel i7-class machine, PyTorch, no GPU acceleration).
SHAP (KernelExplainer, 200 background samples, nsamples=100)
Total : 3.00s    Per sample : 30.0 ms
Neuro-Symbolic (predict_with_explanation, single forward pass)
Total : 0.0898s  Per sample : 0.898 ms
Speedup : 33x
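For reference, per-sample numbers like these can be produced with a small timing harness. This is a minimal sketch with dummy stand-in functions, not the actual models or the benchmark script from the repository:

```python
import time

def per_sample_latency_ms(fn, samples, repeats=3):
    """Time fn over all samples, repeat, and report the best run in ms/sample.
    Taking the best of several runs reduces noise from the OS scheduler."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for s in samples:
            fn(s)
        best = min(best, time.perf_counter() - start)
    return best * 1000.0 / len(samples)

# Dummy stand-ins: a cheap "forward pass" vs. an artificially slow "explainer".
fast = lambda s: sum(s)
slow = lambda s: [sum(s) for _ in range(2000)]

samples = [[0.1] * 29 for _ in range(100)]
fast_ms = per_sample_latency_ms(fast, samples)
slow_ms = per_sample_latency_ms(slow, samples)
# slow_ms comes out substantially larger than fast_ms
```

Using time.perf_counter rather than time.time matters here: at sub-millisecond granularity, wall-clock timers can round away the quantity being measured.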

The latency difference is the headline, but the consistency difference matters just as much in practice.
SHAP’s KernelExplainer uses Monte Carlo sampling to approximate Shapley values [2]. Run it twice on the same input and you get different numbers. The explanation shifts with the random state. In a regulated setting where decisions must be auditable, a stochastic explanation is a liability.
The neuro-symbolic model produces the same explanation every time for the same input. The rule activations are a deterministic function of the input features and the learned weights. There is nothing to vary.
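The contrast can be shown with a toy illustration. Both functions below are stand-ins (neither is SHAP nor the model): one mimics a sampled estimate that shifts with the random seed, the other a fixed rule that is a pure function of its input:

```python
import random

def monte_carlo_attr(seed, n=200):
    """Stochastic attribution stand-in: a sampled estimate whose value
    depends on the random seed, like a Monte Carlo Shapley approximation."""
    rng = random.Random(seed)
    return sum(rng.gauss(0.5, 1.0) for _ in range(n)) / n

def rule_attr(x, threshold=-0.44):
    """Deterministic attribution stand-in: a fixed function of the input,
    like a rule activation with a learned threshold."""
    return 1.0 if x < threshold else 0.0

# Two runs of the sampled method disagree; the rule-based one never varies.
a, b = monte_carlo_attr(seed=1), monte_carlo_attr(seed=2)
c, d = rule_attr(-0.58), rule_attr(-0.58)
```

In an audit setting, the second property is the one that matters: replaying the same transaction through the same model must reproduce the same explanation byte for byte.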

Reading a Real Explanation
Here is the output from predict_with_explanation() on test set transaction 840, a confirmed fraud case.
Prediction : FRAUD
Confidence : 100.0%
Rules fired (4) -- produced INSIDE the forward pass:
Rule            Value    Op   Threshold   Weight
-------------------------------------------------
LOW_V17        -0.553    <    -0.135      0.081
LOW_V14        -0.582    <    -0.440      0.071
LOW_V12        -0.350    <    -0.300      0.078
HIGH_V10_NEG   -0.446    <    -0.320      0.078
Four rules fired simultaneously. Each line tells you which feature was involved, the observed value, the learned threshold it crossed, and the weight that rule carries in the symbolic layer. This output was not reconstructed from the prediction after the fact. It was produced at the same moment as the prediction, as part of the same computation.
Notice that LOW_V4 (the rule with 57% of the symbolic weight) didn’t fire on this transaction. The four rules that did fire (V17, V14, V12, V10) all carry relatively modest weights individually. The model still predicted FRAUD at 100% confidence, which suggests the neural backbone carried this decision. The symbolic layer’s role here was to identify the specific pattern of four anomalous V-feature values firing together, and surface it as a readable explanation.
This is actually a useful demonstration of how the two components interact. The neural backbone produces the prediction. The symbolic layer produces the justification. They aren’t always in perfect alignment, and that tension is informative.

The same benchmark run records how frequently each rule fired across fraud-predicted transactions, produced during inference with no separate computation. Because the 100-sample window reflects the real-world 0.17% fraud rate, it contains very few fraud predictions, so the bars are thin. The pattern becomes clearer across the full test set, but even here it confirms the mechanism is working.

The Full Comparison

What Should Be Done Differently
The V4 weight collapse. The softmax over rule_weights failed to prevent one rule from accumulating 57% of the symbolic weight. The right fix is a regularisation term during training that penalises weight concentration. For example, an entropy penalty on the softmax output that actively rewards more uniform distributions across rules. Without this, the symbolic layer can degrade towards a single-feature gate, which weakens the interpretability argument.
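A small sketch makes both halves of the argument concrete: softmax normalises but does not enforce uniformity, and an entropy penalty gives the optimiser a reason to spread the weight. The values and the penalty coefficient placement are illustrative, not taken from the trained model:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# One large logit still concentrates nearly all the weight after softmax.
concentrated = softmax([4.0, 0.0, 0.0, 0.0, 0.0, 0.0])
uniform = softmax([0.0] * 6)

# Entropy penalty for the loss, e.g. loss += lam * (max_h - entropy(p)):
# zero for a uniform distribution, positive and growing as one rule dominates.
max_h = math.log(6)
penalty_concentrated = max_h - entropy(concentrated)
penalty_uniform = max_h - entropy(uniform)
```

Adding `lam * (max_h - entropy(weights))` to the training loss would push gradients against the kind of 57% collapse observed here, with lam controlling how strongly uniformity is rewarded.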
The HIGH_AMOUNT threshold. The learned threshold for Amount converged to -0.011 (scaled), which is effectively zero, so the rule fires for almost any non-trivially small transaction and contributes little to no discrimination. The problem is likely a combination of the feature being genuinely less predictive on this dataset than domain intuition suggests (V features dominate in the published literature [7, 8]) and the initialisation pulling the threshold into a low-information region. A bounded threshold initialisation or a learned gate that can suppress low-utility rules would handle this more cleanly.
Decision threshold tuning. Both models were evaluated at a 0.5 threshold. In practice, the right threshold depends on the cost ratio between false positives and missed fraud in the deployment context. This is especially important for the neuro-symbolic model, where precision is slightly lower. A threshold shift towards 0.6 or 0.65 would recover precision at the cost of some recall. This trade-off should be made deliberately, not left at the default.
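Making that trade-off deliberate amounts to minimising an expected cost over candidate thresholds. Here is a minimal sketch with toy labels, scores, and an assumed 50:1 cost ratio between a missed fraud and a false alert; none of these numbers come from the actual models:

```python
def expected_cost(labels, scores, threshold, fp_cost=1.0, fn_cost=50.0):
    """Total cost of operating at a given decision threshold:
    false alerts cost fp_cost each, missed frauds cost fn_cost each."""
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return fp * fp_cost + fn * fn_cost

# Toy scores: fraud cases cluster high, legit cases low, with some overlap.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.55, 0.60, 0.40, 0.30, 0.20, 0.10]

candidates = [0.4, 0.5, 0.6, 0.65, 0.7]
best = min(candidates, key=lambda t: expected_cost(labels, scores, t))
# With missed fraud 50x costlier than a false alert, the chosen threshold
# stays low enough to keep the 0.55-score fraud case flagged
```

With a high fn_cost, the search refuses to raise the threshold past a fraud case’s score; flip the cost ratio and the same search would move towards 0.65. The same procedure, run on a validation set, is how the 0.6-versus-0.5 decision above should be made.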
Where This Fits
This is the fifth article in a series on neuro-symbolic approaches to fraud detection. The earlier work covers the foundations:
This article adds a fifth dimension: the explainability architecture itself. Not just whether the model can be explained, but whether the explanation can be produced at the speed and consistency that production systems actually require.
SHAP remains the right tool for model debugging, feature selection, and exploratory analysis. What this experiment shows is that when explanation needs to be part of the decision (logged in real time, auditable per transaction, available to downstream systems), the architecture has to change. Post-hoc methods are too slow and too inconsistent for that role.
The neuro-symbolic approach trades a small amount of precision for an explanation that is deterministic, immediate, and structurally inseparable from the prediction itself. Whether that trade-off is worthwhile depends on your system. The numbers are here to help you decide.
Disclosure
This article is based on independent experiments using publicly available data (the Kaggle Credit Card Fraud dataset) and open-source tools. No proprietary datasets, company resources, or confidential information were used. The results and code are fully reproducible as described, and the GitHub repository contains the complete implementation. The views and conclusions expressed here are my own and do not represent any employer or organisation.
References
[1] ULB Machine Learning Group. Credit Card Fraud Detection. Kaggle, 2018. Available at: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud (Dataset released under the Open Database License. Original research: Dal Pozzolo, A., Caelen, O., Johnson, R. A., & Bontempi, G., 2015.)
[2] Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 30. Available at: https://arxiv.org/abs/1705.07874
[3] Shapley, L. S. (1953). A value for n-person games. In H. W. Kuhn & A. W. Tucker (Eds.), Contributions to the Theory of Games (Vol. 2, pp. 307–317). Princeton University Press. https://doi.org/10.1515/9781400881970-018
[4] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. Available at: https://arxiv.org/abs/1106.1813
[5] Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning (ICML). Available at: https://arxiv.org/abs/1502.03167
[6] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958. Available at: https://jmlr.org/papers/v15/srivastava14a.html
[7] Dal Pozzolo, A., Caelen, O., Le Borgne, Y.-A., Waterschoot, S., & Bontempi, G. (2014). Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), 4915–4928. https://doi.org/10.1016/j.eswa.2014.02.026
[8] Carcillo, F., Dal Pozzolo, A., Le Borgne, Y.-A., Caelen, O., Mazzer, Y., & Bontempi, G. (2018). SCARFF: A scalable framework for streaming credit card fraud detection with Spark. Information Fusion, 41, 182–194. https://doi.org/10.1016/j.inffus.2017.09.005
[9] Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR). Available at: https://arxiv.org/abs/1412.6980
