In my last article [1], I introduced a number of concepts around building structured graphs, mainly focused on descriptive or unsupervised exploration of data through graph structures. However, when we use graph features to enhance our models, the temporal nature of the data must be taken into account. If we want to avoid undesired effects, we need to be careful not to leak future information into our training process. This means our graph (and the features derived from it) must be built in a time-aware, incremental way.
Data leakage is such a paradoxical problem that a 2023 study by Sayash Kapoor and Arvind Narayanan [2] found that, up to that point, it had affected 294 research papers across 17 scientific fields. They classify the types of data leakage, ranging from textbook errors to open research problems.
The issue is that during prototyping, results often look very promising when they really are not. Most of the time, people don't realize this until models are deployed in production, wasting the time and resources of an entire team. Then, performance usually falls short of expectations without anyone understanding why. This problem can become the Achilles' heel that undermines all enterprise AI initiatives.
…
ML-based leakage
Data leakage occurs when the training data contains information about the output that won't be available during inference. This causes overly optimistic evaluation metrics during development, creating misleading expectations. Then, when the model is deployed in a real-time system with the correct data flow, its predictions become untrustworthy because it learned from information that is not actually available.
Ethically, we should strive to produce results that truly reflect the capabilities of our models, rather than sensational or misleading findings. When a model moves from prototyping to production, it must generalize properly; if it does not, its practical value is undermined, and it can exhibit significant problems during inference or deployment.
This is especially dangerous in sensitive contexts like fraud detection, which often involve imbalanced data (with far fewer fraud cases than non-fraud). In these situations, the harm caused by data leakage is more pronounced because the model may overfit to leaked data related to the minority class, producing seemingly good results for the minority label, which is the hardest to predict. This can lead to missed fraud detections, with serious practical consequences.
Data leakage examples can be categorized into textbook errors and open research problems [2] as follows:
Textbook errors:
- Imputing missing values using the entire dataset instead of only the training set, causing information about the test data to leak into training.
- Duplicated or very similar instances appearing in both the training and test sets, such as images of the same object taken from slightly different angles.
- Lack of a clean separation between training and test datasets, or no test set at all, giving models access to test information before evaluation.
- Using proxies of outcome variables that indirectly reveal the target variable.
- Random data splitting in scenarios where multiple related records belong to a single entity, such as multiple claim status events from the same customer.
- Synthetic data augmentation performed over the whole dataset, instead of only on the training set.
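The first textbook error is easy to commit in practice. A minimal sketch (with hypothetical toy numbers) of the correct order of operations: fit the imputation statistic on the training split only, then apply it to both splits. The leaky variant is shown for contrast.

```python
# Impute missing values (None) with the TRAINING mean only,
# so no test-set statistics leak into training.
from statistics import mean

train = [10.0, None, 14.0, 12.0]
test = [None, 100.0]  # contains a value far from the train distribution

train_mean = mean(v for v in train if v is not None)  # computed on train only

train_imputed = [v if v is not None else train_mean for v in train]
test_imputed = [v if v is not None else train_mean for v in test]

# Leaky version, for contrast: the mean over the FULL dataset pulls the
# imputed value toward information the model should never have seen.
full_mean = mean(v for v in train + test if v is not None)

print(train_mean)  # 12.0
print(full_mean)   # 34.0 — distorted by the test outlier
```

The same fit-on-train-only discipline applies to scalers, encoders, and synthetic augmentation.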
Open problems for research:
- Temporal leakage occurs when future data unintentionally influences training. In such cases, strict separation is difficult because timestamps can be noisy or incomplete.
- Updating database records without lineage or an audit trail (for example, changing a fraud status without storing its history) can cause models to train on future or altered data unintentionally.
- Complex real-world data integration and pipeline issues that introduce leakage through misconfiguration or lack of controls.
These cases are part of a broader taxonomy reported in machine learning research, which highlights data leakage as a critical and often underinvestigated risk for reliable modeling [3]. Such issues arise even with simple tabular data, and they can remain hidden when working with many features if each one isn't individually checked.
Now, let's consider what happens when we include nodes and edges in the equation…
…
Graph-based leakage
In the case of graph-based models, leakage can be sneakier than in traditional tabular settings. When features are derived from connected components or topological structures, using future nodes or edges can silently alter the graph's structure. For example:
- Methodologies such as graph neural networks (GNNs) learn context not only from individual nodes but also from their neighbours, which can inadvertently introduce leakage if sensitive or future information is propagated across the graph structure during training.
- Overwriting or updating the graph structure without preserving past events means the model loses valuable context needed for proper temporal analysis; it may again access information at the wrong time, or lose traceability about possible leakage or problems in the data from which the graph originates.
- Computing graph aggregations like degree, triangle counts, or PageRank on the entire graph without accounting for the temporal dimension (time-agnostic aggregation) uses all edges: past, present, and future. This causes data leakage because features then include information from future edges that wouldn't be available at prediction time.
Graph temporal leakage occurs when features, edges, or node relationships from future time points are included during training in a way that violates the chronological order of events. This results in edges or training features that incorporate data from time steps that should be unknown.
…
How can this be fixed?
We can build a single graph that captures the entire history by assigning timestamps or time intervals to edges. To analyze the graph up to a specific point in time (t), we "look back in time" by filtering the graph to include only the events that occurred before or at that cutoff. This approach is ideal for preventing data leakage because it ensures that only past and present information is used for modeling. Additionally, it provides flexibility in defining different time windows for safe and accurate temporal analysis.
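The "look back in time" filter can be sketched in a few lines. This toy version works on a plain list of timestamped edges with hypothetical dates, rather than the article's full NetworkX graph, but the idea is identical: anything stamped after the cutoff simply does not exist yet.

```python
# Snapshot a timestamped edge list at a cutoff: keep only timestamp <= t.
from datetime import date

edges = [  # (source, target, timestamp)
    ("claim_1", "claim_2", date(2024, 1, 10)),
    ("claim_2", "claim_3", date(2024, 3, 5)),
    ("claim_3", "claim_4", date(2024, 7, 21)),  # future w.r.t. our cutoff
]

def snapshot(edges, cutoff):
    """Return the edge list as it looked at `cutoff` — no future edges."""
    return [(u, v, t) for (u, v, t) in edges if t <= cutoff]

past = snapshot(edges, date(2024, 4, 1))
print(len(past))  # 2 — the July edge is excluded
```

Running the same filter with different cutoffs yields the train/validation/test snapshots used later in the pipeline.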
In this article, we build a temporal graph of insurance claims where the nodes represent individual claims, and temporal links are created when two claims share an entity (e.g., phone number, license plate, repair shop, etc.), ensuring the correct event order. Graph-based features are then computed to feed fraud prediction models, carefully avoiding the use of future information (no peeking).
The idea is simple: if two claims share a common entity and one occurs before the other, we connect them at the moment this connection becomes visible (Figure 1). As explained in the previous section, the way we model the data is crucial, not only to capture what we are really looking for, but also to enable the use of advanced methods such as Graph Neural Networks (GNNs).
In our graph model, we save the timestamp when an entity is first seen, capturing the moment it appears in the data. However, in many real-world scenarios it is also useful to consider a time interval spanning the entity's first and last appearances (for example, generated with another variable like plate or email). This interval can provide richer temporal context, reflecting the lifespan or active period of nodes and edges, which is valuable for dynamic temporal graph analyses and advanced model training.
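The connect-when-visible rule can be sketched as follows, using hypothetical toy claims and a single entity column (the real pipeline repeats this per entity column): process claims in date order and, whenever a claim shares an entity with an older one, emit an older→newer edge stamped with the newer claim's date, since that is when the connection becomes visible.

```python
# Older claim -> newer claim edges when two claims share an entity value,
# timestamped with the NEWER claim's date (the moment the link is visible).
from collections import defaultdict
from datetime import date

# (claim_id, claim_date, entity value) — e.g. a shared phone number
claims = [
    ("C1", date(2024, 1, 5), "PHONE_42"),
    ("C2", date(2024, 2, 9), "PHONE_42"),
    ("C3", date(2024, 3, 1), "PHONE_99"),
    ("C4", date(2024, 4, 2), "PHONE_42"),
]

def temporal_edges(claims):
    seen = defaultdict(list)  # entity value -> older claim ids
    for cid, cdate, ent in sorted(claims, key=lambda c: c[1]):
        for prev_id in seen[ent]:
            yield (prev_id, cid, cdate)  # edge appears at the newer date
        seen[ent].append(cid)

edges = list(temporal_edges(claims))
print(len(edges))  # 3: C1->C2, C1->C4, C2->C4 (all via PHONE_42)
```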
Code
The code is available in this repository: Link to the repository
To run the experiments, set up a Python ≥3.11 environment with the required libraries (e.g., torch, torch-geometric, networkx, etc.). It is recommended to use a virtual environment (via venv or conda) to keep dependencies isolated.
Code Pipeline
The diagram in Figure 2 shows the end-to-end workflow for fraud detection with GraphSAGE. Step 1 loads the (simulated) raw claims data. Step 2 builds a time-stamped directed graph (entity→claim and older-claim→newer-claim). Step 3 performs temporal slicing to create train, validation, and test sets, then indexes nodes, builds features, and finally trains and validates the model.

Data (…) for training and inference. Image by Author.

Step 1: Simulated Fraud Dataset
We first simulate a dataset of insurance claims. Each row in the dataset represents a claim and includes variables such as:
- Entities: insurer_license_plate, insurer_phone_number, insurer_email, insurer_address, repair_shop, bank_account, claim_location, third_party_license_plate
- Core information: claim_id, claim_date, type_of_claim, insurer_id, insurer_name
- Target: fraud (a binary variable indicating whether the claim is fraudulent or not)
These entity attributes act as potential links between claims, allowing us to infer connections through shared values (e.g., two claims using the same repair shop or phone number). By modeling these implicit relationships as edges in a graph, we can build powerful topological representations that capture suspicious behavioral patterns and enable downstream tasks such as feature engineering or graph-based learning.
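A hedged sketch of what such a simulation can look like. The column names come from the article, but the generation logic here (uniform dates, small shared entity pools, a fixed fraud rate) is a simplified assumption, not the repository's actual script:

```python
import random
from datetime import date, timedelta

random.seed(7)  # reproducible toy data

def simulate_claims(n=100, fraud_rate=0.1):
    """Generate toy claim rows; small entity pools create implicit links."""
    shops = [f"SHOP_{i}" for i in range(10)]
    phones = [f"PHONE_{i}" for i in range(30)]
    rows = []
    for i in range(n):
        rows.append({
            "claim_id": 20000000 + i,
            "claim_date": date(2024, 1, 1) + timedelta(days=random.randrange(365)),
            "repair_shop": random.choice(shops),
            "insurer_phone_number": random.choice(phones),
            "fraud": int(random.random() < fraud_rate),
        })
    return rows

claims = simulate_claims()
print(len(claims))  # 100
```

Because the entity pools are much smaller than the number of claims, shared values (and hence graph edges) appear naturally.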


Step 2: Graph Modeling
We use the NetworkX library to build our graph model. For small-scale examples, NetworkX is sufficient and effective; for more advanced graph processing, tools like Memgraph or Neo4j could be used. To model with NetworkX, we create nodes and edges representing entities and their relationships, enabling network analysis and visualization within Python.
So, we have:
- one node per claim, with the node key equal to the claim_id and attributes node_type and claim_date
- one node per entity value (phone, plate, bank account, shop, etc.), with node key "{column_name}:{value}" and attributes node_type = <column_name> (e.g., "insurer_phone_number", "bank_account", "repair_shop") and label = <value> (just the raw value without the prefix)
The graph includes these two types of edges:
- claim_id(t-1) → claim_id(t): when two claims share an entity (with edge_type='claim-claim')
- entity_value → claim_id: a direct link to the shared entity (with edge_type='entity-claim')
These edges are annotated with:
- edge_type: to distinguish the relation (claim → claim vs entity → claim)
- entity_type: the column the value comes from (like bank_account)
- shared_value: the actual value (like a phone number or license plate)
- timestamp: when the edge was added (based on the current claim's date)
To interpret our simulation, we implemented a script that generates explanations for why a claim is flagged as fraud. In Figure 4, claim 20000695 is considered risky mainly because it is associated with repair shop SHOP_856, which acts as an active hub with multiple claims linked around similar dates, a pattern often seen in fraud "bursts." Additionally, this claim shares a license plate and an address with several other claims, creating dense connections to other suspicious cases.

This code saves the graph as a pickle file: temporal_graph_with_edge_attrs.gpickle.
Step 3: Graph Preparation & Training
Representation learning transforms complex, high-dimensional data (like text, images, or sensor readings) into simplified, structured formats (often called embeddings) that capture meaningful patterns and relationships. These learned representations improve model performance, interpretability, and the ability to transfer learning across different tasks.
We train a neural network to map each input to a vector in ℝᵈ that encodes what matters. In our pipeline, GraphSAGE performs representation learning on the claim graph: it aggregates information from a node's neighbours (shared phones, shops, plates, etc.) and mixes that with the node's own attributes to produce a node embedding. These embeddings are then fed to a small classifier head to predict fraud.
3.1. Temporal slicing
From the single full graph we created in Step 2, we extract three time-sliced subgraphs for train, validation, and test. For each split we choose a cutoff date and keep only (1) claim nodes with claim_date ≤ cutoff, and (2) edges whose timestamp ≤ cutoff. This produces a time-consistent subgraph for that split: no information from the future leaks into the past, matching how the model would run in production with only historical data available.
3.2 Node indexing
Give every node in the sliced graph an integer index 0…N-1. This is just an ID mapping (like tokenization). We will use these indices to align features, labels, and edges in tensors.
3.3 Build node features
Create one feature row per node:
- Type one-hot (claim, phone, email, …).
- Degree stats computed within the sliced graph: normalized in-degree, out-degree, and undirected degree.
- Prior fraud from older neighbors (claims only): the fraction of older connected claims (direct claim→claim predecessors) that are labeled fraud, considering only neighbors that existed before the current claim's time.
We also set the label y (1/0) for claims and 0 for entities, and mark claims in claim_mask so that loss and metrics are computed only on claims.
3.4 Build PyG Data
Translate the edges (u→v) into a 2×E integer tensor edge_index using the node indices, and add self-loops so each node also keeps its own features at every layer. Pack everything into a PyG Data(x, edge_index, y, claim_mask) object. Edges are directed, so message passing respects time (earlier→later).
3.5 GraphSAGE
We implement a GraphSAGE architecture in PyTorch Geometric with the SAGEConv layer: two GraphSAGE convolution layers (mean aggregation), ReLU, and dropout, followed by a linear head to predict fraud vs non-fraud. We train full-batch (no neighbor sampling). The loss is weighted to handle class imbalance and is computed only on claim nodes via claim_mask. After each epoch we evaluate on the validation split and choose the decision threshold that maximizes F1; we keep the best model by validation F1 (early stopping).

3.6 Inference results
Evaluate the best model on the test split using the validation-chosen threshold. Report accuracy, precision, recall, F1, and the confusion matrix. Produce a lift table/plot (showing how concentrated fraud is by score decile), and export a t-SNE plot of claim embeddings to visualize structure.

The lift chart evaluates how well the model ranks fraud: bars show lift by score decile and the line shows cumulative fraud capture. In the top 10–20% of claims (deciles 1–2), the fraud rate is about 2–3× the average, suggesting that reviewing the top 20–30% of claims would capture a large share of the fraud. The t-SNE plot shows several clusters where fraud concentrates, indicating the model learns meaningful relational patterns, while overlap with non-fraud points highlights remaining ambiguity and opportunities for feature or model tuning.
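Lift by decile needs only a few lines to compute (toy scores and labels here; in the pipeline the inputs are the model's test scores):

```python
def lift_by_decile(scores, labels, n_bins=10):
    """Lift per score decile: fraud rate in the bin / overall fraud rate."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    overall = sum(labels) / len(labels)
    size = len(ranked) // n_bins
    lifts = []
    for i in range(n_bins):
        bin_labels = [y for _, y in ranked[i * size:(i + 1) * size]]
        lifts.append((sum(bin_labels) / size) / overall)
    return lifts

# Toy: 20 claims; high scores concentrate the 4 frauds at the top.
scores = [round(1 - i / 20, 2) for i in range(20)]
labels = [1, 1, 1, 1] + [0] * 16
lifts = lift_by_decile(scores, labels)
print(lifts[0])  # 5.0 — the top decile is 100% fraud vs a 20% base rate
```

A lift of 2–3 in the top deciles, as reported above, means a review queue limited to those deciles catches fraud at 2–3× the base rate.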
…
Conclusion
Using a graph that only connects older claims to newer claims (past to future), without "leaking" future fraud information, the model successfully concentrates fraud cases in the top scoring groups, achieving about 2–3 times better detection in the top 10–20%. This setup is reliable enough to deploy.
As a test, it is possible to try a version where the graph is two-way or undirected (connections in both directions) and compare the spurious improvement against the one-way version. If the two-way version gets significantly better results, it is likely due to temporal leakage, meaning future information is improperly influencing the model. This is a way to demonstrate why two-way connections should not be used in real use cases.
To avoid making the article too long, we will cover the experiments with and without leakage in a separate article. Here, we focus on building a model that meets production readiness.
There is still room for improvement with richer features, calibration, and small model tweaks, but our focus here is to explain a leak-safe temporal graph methodology that addresses data leakage.
References
[1] Gomes-Gonçalves, E. (2025, January 23). Applications and Opportunities of Graphs in Insurance. Medium. Retrieved September 11, 2025, from https://medium.com/@erikapatg/applications-and-opportunities-of-graphs-in-insurance-0078564271ab
[2] Kapoor, S., & Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9), 100804.
[3] Guignard, F., Ginsbourger, D., Levy Häner, L., & Herrera, J. M. (2024). Some combinatorics of data leakage induced by clusters. Stochastic Environmental Research and Risk Assessment, 38(7), 2815–2828.
[4] Huang, S., et al. (2024). UTG: Towards a Unified View of Snapshot and Event-Based Models for Temporal Graphs. arXiv preprint arXiv:2407.12269. https://arxiv.org/abs/2407.12269
[5] Labonne, M. (2022). GraphSAGE: Scaling up Graph Neural Networks. Towards Data Science. Retrieved from https://towardsdatascience.com/introduction-to-graphsage-in-python-a9e7f9ecf9d7/
[6] An Introduction to GraphSAGE. (2025). Weights & Biases. Retrieved from https://wandb.ai/graph-neural-networks/GraphSAGE/reports/An-Introduction-to-GraphSAGE–Vmlldzo1MTEwNzQ1