I worked on real-time fraud detection systems and recommendation models for product companies, and those systems looked great during development. Offline metrics were strong. AUC curves were stable across validation windows. Feature importance plots told a clean, intuitive story. We shipped with confidence.
A few weeks later, our metrics started to drift.
Click-through rates on recommendations began to slip. Fraud models behaved inconsistently during peak hours. Some decisions felt overconfident, others oddly blind. The models themselves had not degraded. There were no sudden data outages or broken pipelines. What failed was our understanding of how the system behaved once it met time, latency, and delayed ground truth in the real world.
This article is about those failures. The quiet, unglamorous problems that show up only when machine learning systems collide with reality. Not optimizer choices or the latest architecture. The problems that don't appear in notebooks, but surface on 3 a.m. dashboards.
My message is simple: most production ML failures are data and time problems, not modeling problems. If you don't design explicitly for how information arrives, matures, and changes, the system will quietly make those assumptions for you.
Time Travel: An Assumption Leak
Time travel is the most common production ML failure I've seen, and also the least discussed in concrete terms. Everyone nods when you mention leakage. Very few teams can point to the exact row where it happened.
Let me make it explicit.
Imagine a fraud dataset with two tables:
- transactions: when the payment occurred
- chargebacks: when the fraud outcome was reported

The feature we want is user_chargeback_count_last_30_days.
The batch job runs at the end of the day, just before midnight, and computes chargeback counts for the last 30 days. For user U123, the count is 1. As of midnight, that is factually correct.

Now look at the final joined training dataset.
Morning transactions at 9:10 AM and 11:45 AM already carry a chargeback count of 1. At the time those payments were made, the chargeback had not yet been reported. But the training data doesn't know that. Time has been flattened.
This is where the model cheats.
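To make the flattening concrete, here is a minimal pandas sketch of how an end-of-day batch join hands the morning transactions a chargeback count that did not yet exist. The timestamps and column names are hypothetical, chosen to match the U123 story above:

```python
import pandas as pd

# Hypothetical data for user U123; times and column names are illustrative.
transactions = pd.DataFrame({
    "user_id": ["U123", "U123"],
    "txn_time": pd.to_datetime(["2024-03-01 09:10", "2024-03-01 11:45"]),
})
chargebacks = pd.DataFrame({
    "user_id": ["U123"],
    "reported_at": pd.to_datetime(["2024-03-01 15:30"]),  # reported mid-afternoon
})

# End-of-day batch: count chargebacks reported in the last 30 days, keyed by user.
daily_count = (
    chargebacks.groupby("user_id")
    .size()
    .rename("user_chargeback_count_last_30_days")
    .reset_index()
)

# The join flattens time: both morning transactions receive a count of 1,
# even though the chargeback was not reported until 15:30.
train = transactions.merge(daily_count, on="user_id", how="left")
print(train["user_chargeback_count_last_30_days"].tolist())  # [1, 1]
```

The join itself is perfectly correct SQL-style logic, which is exactly why nobody catches it in review.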

From the model's perspective, risky-looking transactions already come with confirmed fraud signals. Offline recall improves dramatically. Nothing looks wrong at this point.
But in production, the model never sees the future.
When deployed, those early transactions do not have a chargeback count yet. The signal disappears and performance collapses.
This is not a modeling mistake. It is an assumption leak.
The hidden assumption is that a daily batch feature is valid for all events on that day. It's not. A feature is only valid if it could have existed at the exact moment the prediction was made.
Every feature must answer one question:
"Could this value have existed at the exact moment the prediction was made?"
If the answer is not a guaranteed yes, the feature is invalid.
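One way to enforce that question mechanically is an as-of join on the time a label became known, rather than a calendar-day join. This is a sketch using pandas `merge_asof`, under the assumption that each chargeback row carries a `reported_at` timestamp; the data is hypothetical and simplified to a single user:

```python
import pandas as pd

transactions = pd.DataFrame({
    "user_id": ["U123", "U123", "U123"],
    "txn_time": pd.to_datetime(
        ["2024-03-01 09:10", "2024-03-01 11:45", "2024-03-02 08:00"]
    ),
})
# Key idea: record when each chargeback became KNOWN, not when it happened.
chargebacks = pd.DataFrame({
    "user_id": ["U123"],
    "reported_at": pd.to_datetime(["2024-03-01 15:30"]),
}).sort_values("reported_at")

# Running count of chargebacks known so far (single-user simplification;
# multiple users would need a per-user cumulative count).
chargebacks["known_count"] = range(1, len(chargebacks) + 1)

# As-of join: each transaction only sees chargebacks reported at or before it.
features = pd.merge_asof(
    transactions.sort_values("txn_time"),
    chargebacks,
    left_on="txn_time",
    right_on="reported_at",
    by="user_id",
    direction="backward",
)
features["known_count"] = features["known_count"].fillna(0).astype(int)
print(features["known_count"].tolist())  # [0, 0, 1]
```

The morning transactions now get 0, and only the next-day transaction sees the reported chargeback, which is exactly what the production model will see.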
Feature Defaults That Become Signals
After time travel, this is the most common failure cause I've seen in production systems. Unlike leakage, this one doesn't rely on the future. It relies on silence.
Most engineers treat missing values as a hygiene problem. Fill them with the mean, the median, or some other imputation technique, then move on.
These defaults feel harmless. Something safe enough so the model can keep running.
That assumption turns out to be expensive.
In real systems, missing rarely means random. Missing often means new, unknown, not yet observed, or not yet trusted. When we collapse all of that into a single default value, the model doesn't see a gap. It sees a pattern.
Let me make this concrete.
I first ran into this in a real-time fraud system where we used a feature called avg_transaction_amount_last_7_days. For active users, this value was well behaved. For new or inactive users, the feature pipeline returned a default value of zero.

To illustrate how the default value became a strong proxy for user status, I computed the observed fraud rate grouped by the feature's value:
data.groupby("avg_txn_amount_last_7_days")["is_fraud"].mean()
Users with a value of zero show a markedly lower fraud rate. Not because zero spending is inherently safe, but because those users are new or inactive. The model doesn't learn "low spending is safe". It learns "missing history means safe".
The default has become a signal.
During training, this looks great as precision improves. Then production traffic changes.
A downstream service starts timing out during peak hours. Suddenly, active users temporarily lose their history features. Their avg_transaction_amount_last_7_days flips to zero. The model confidently marks them as low risk.
Experienced teams handle this differently. They separate absence from value and track feature availability explicitly. Most importantly, they never let silence masquerade as information.
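Separating absence from value is cheap to sketch. Assuming missing history arrives as NaN (an assumption; your pipeline may encode it differently), the idea is to record availability explicitly before imputing, so "user spent nothing" and "we have no history for this user" stay distinguishable:

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame: NaN means "no history yet", which is NOT zero spend.
df = pd.DataFrame({"avg_txn_amount_last_7_days": [120.0, np.nan, 85.5, np.nan]})

# Record availability BEFORE imputing, so the model can tell absence from value.
df["avg_txn_amount_available"] = (
    df["avg_txn_amount_last_7_days"].notna().astype(int)
)
df["avg_txn_amount_last_7_days"] = df["avg_txn_amount_last_7_days"].fillna(0.0)

print(df["avg_txn_amount_available"].tolist())  # [1, 0, 1, 0]
```

The imputed zero is now accompanied by a flag the model can weigh on its own, and a downstream timeout that wipes history flips the flag instead of silently impersonating a safe user.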
Population Shift Without Distribution Shift
This failure mode took me much longer to recognize, largely because all the usual alarms stayed silent.
When people talk about data drift, they usually mean distribution shift. Feature histograms move. Percentiles change. KS tests light up dashboards. Everyone understands what to do next: inspect upstream data, retrain, recalibrate.
Population shift without distribution shift is different. Here, the feature distributions remain stable. Summary statistics barely move. Monitoring dashboards look reassuring. And yet model behavior degrades steadily.
I first encountered this in a large-scale payments risk system that operated across multiple user segments. The model consumed transaction-level features like amount, time of day, device signals, velocity counters, and merchant category codes. All of these features were heavily monitored. Their distributions barely changed month over month.
Still, fraud rates started creeping up in a very specific slice of traffic. What changed was not the data. It was who the data represented.
Over time, the product expanded into new user cohorts. New geographies with different payment habits. New merchant categories with unfamiliar transaction patterns. Promotional campaigns that brought in users who behaved differently but still fell within the same numeric ranges. From a distribution perspective, nothing looked unusual. But the underlying population had shifted.
The model had been trained mostly on mature users with long behavioral histories. As the user base grew, a larger fraction of traffic came from newer users whose behavior looked statistically similar but was semantically different. A transaction amount of 2,000 meant something very different for a long-tenured user than for someone on their first day. The model didn't know that, because we had not taught it to care.

The figure above shows why this failure mode is hard to detect in practice. The first two plots show transaction amount and short-term velocity distributions for mature and new users. From a monitoring perspective, these features appear stable, with heavy overlap between the two groups. If this were the only signal available, most teams would conclude that the data pipeline and model inputs remain healthy.
The third plot reveals the real problem. Although the feature distributions are nearly identical, the fraud rate differs significantly across populations. The model applies the same decision boundaries to both groups because the inputs look familiar, but the underlying risk is not the same. What has changed is not the data itself, but who the data represents.
As traffic composition changes through growth or expansion, these assumptions stop holding, even though the data continues to look statistically normal. Without explicitly modeling population context or evaluating performance across cohorts, these failures remain invisible until business metrics begin to degrade.
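Cohort-sliced evaluation is straightforward to sketch. The synthetic data below (all numbers invented for illustration) gives both cohorts the same feature distribution but different base fraud rates, which is exactly the situation aggregate monitoring misses:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Two cohorts with IDENTICAL feature distributions...
cohort = rng.choice(["mature", "new"], size=n, p=[0.7, 0.3])
amount = rng.lognormal(mean=5.0, sigma=1.0, size=n)

# ...but very different underlying risk.
fraud_prob = np.where(cohort == "mature", 0.01, 0.05)
is_fraud = rng.random(n) < fraud_prob
df = pd.DataFrame({"cohort": cohort, "amount": amount, "is_fraud": is_fraud})

# Aggregate monitoring: the feature looks the same in both cohorts.
print(df.groupby("cohort")["amount"].mean())    # nearly identical means
# Cohort-sliced evaluation: the risk divergence is immediately visible.
print(df.groupby("cohort")["is_fraud"].mean())  # clearly different rates
```

Slicing the label rate (or any quality metric) by cohort is the alarm that the feature histograms will never ring.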
Before You Go
None of the failures in this article were caused by bad models.
The architectures were reasonable. The features were thoughtfully designed. What failed was the system around the model, specifically the assumptions we made about time, absence, and who the data represented.
Time is not a static index. Labels arrive late. Features mature inconsistently. Batch boundaries rarely align with decision moments. When we ignore that, models learn from information they will never see again.
If there is one takeaway, it is this: strong offline metrics are not proof of correctness. They are proof that the model fits the assumptions you gave it. The real work of machine learning begins when those assumptions meet reality.
Design for that moment.
References & Further Reading
[1] ROC Curves and AUC (Google Machine Learning Crash Course)
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
[2] Kolmogorov–Smirnov Test (Wikipedia)
https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
[3] Data Distribution Shifts and Monitoring (Chip Huyen)
https://huyenchip.com/2022/02/07/data-distribution-shifts-and-monitoring.html
