You’re three weeks into a churn prediction model, hunched over a laptop, watching a Bayesian optimization sweep crawl through its two hundredth trial. The validation AUC ticks from 0.847 to 0.849. You screenshot it. You post it in Slack. Your manager reacts with a thumbs-up.
You feel productive. You aren’t.
If you’ve ever spent days squeezing fractions of a percent out of a Machine Learning (ML) metric while a quiet voice in the back of your head whispered does any of this really matter?, you already sense the problem. That voice is right. And silencing it with another grid search is one of the most expensive habits in the profession.
Here’s the uncomfortable math: more than 80% of Artificial Intelligence (AI) projects fail, according to RAND Corporation research published in 2024. The primary root cause isn’t bad models. It isn’t insufficient data. It’s misunderstanding (or miscommunicating) what problem needs to be solved. Not a modeling failure. A framing failure.
This article gives you a concrete protocol to catch that failure before you write a single line of training code. Five steps. Each one takes a conversation, not a GPU cluster.
“All that progress in algorithms means it’s actually time to spend more time on the data.” Andrew Ng didn’t say spend more time on the model. He said the opposite.
The Productive Procrastination Trap
Hyperparameter tuning feels like engineering. You have a search space. You have an objective function. You iterate, measure, improve. The feedback loop is tight (minutes to hours), the progress is visible (metrics go up), and the work is legible to your team (“I improved AUC by 2 points”).
Problem framing feels like stalling. You sit in a room with business stakeholders who use imprecise language. You ask questions that don’t have clear answers. There’s no metric ticking upward. No Slack screenshot to post. Your manager asks what you did today and you say, “I spent four hours figuring out whether we should predict churn or predict reactivation likelihood.” That answer doesn’t sound like progress.
But it’s the only progress that matters.
Image by the author.
The reason is structural. Tuning operates within the problem as defined. If the problem is defined wrong, tuning optimizes a function that doesn’t map to business value. You get a beautiful model that solves the wrong thing. And no number of Optuna sweeps can fix a target variable that shouldn’t exist.
Zillow Bet $500 Million on the Wrong Problem
In 2021, Zillow shut down its home-buying division, Zillow Offers, after losing over $500 million. The company had acquired roughly 7,000 homes across 25 metro areas, consistently overpaying because its pricing algorithm (the Zestimate) didn’t adjust to a cooling market.
The post-mortems focused on concept drift. The model, trained on hot-market data, couldn’t keep up as demand slowed. Contractor shortages during COVID delayed renovations. The feedback loop between purchase and resale was too slow to catch the error.
But the deeper failure happened before any model was trained.
Zillow framed the problem as: Given a home’s features, predict its market value. That framing assumed a stable relationship between features and price. It assumed Zillow could renovate and resell fast enough that the prediction window stayed short. It assumed the model’s error distribution was symmetric (overpaying and underpaying equally likely). None of those assumptions held.
Competitors Opendoor and Offerpad survived the same market shift. Their models detected the cooling and adjusted pricing. The difference wasn’t algorithmic sophistication. It was how each company framed what their model needed to do and how quickly they updated that frame.
Zillow didn’t lose $500 million because of a bad model. They lost it because they never questioned whether “predict home value” was the right problem to solve at their operational speed.
When the AI Learned to Detect Rulers Instead of Cancer
A research team built a neural network to classify skin lesions as benign or malignant. The model reached accuracy comparable to board-certified dermatologists. Impressive numbers. Clean validation curves.
Then someone looked at what the model had actually learned.
It was detecting rulers. When dermatologists suspect a lesion might be malignant, they place a ruler next to it to measure its size. So in the training data, images containing rulers correlated with malignancy. The model learned a shortcut: ruler present = probably cancer. Ruler absent = probably benign.
The accuracy was real. The learning was garbage. And no hyperparameter tuning could have caught this, because the model was performing exactly as instructed on the data exactly as provided. The failure was upstream: nobody asked, “What should the model be looking at to make this decision?” before measuring how well it made the decision.
This is a pattern called shortcut learning, and it shows up everywhere. Models learn to exploit correlations in your data that won’t hold in production. The only defense is a clear specification of what the model should and should not use as signal, and that specification comes from problem framing, not from tuning.
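One cheap smoke test for this failure mode, before any model is trained: check how well each suspect feature predicts the label on its own. Here is a minimal sketch, with a toy dataset and an illustrative `has_ruler` metadata flag (nothing here comes from the actual dermatology study):

```python
# Hypothetical leakage check: how well does a single suspect feature
# predict the label by itself? All data below is invented for illustration.

def single_feature_accuracy(feature_values, labels):
    """Accuracy of the trivial rule 'predict positive iff the feature is set'."""
    correct = sum(1 for f, y in zip(feature_values, labels) if int(f) == y)
    return correct / len(labels)

# Toy training set: label 1 = malignant, 0 = benign; has_ruler is image metadata.
has_ruler = [1, 1, 1, 0, 0, 0, 1, 0, 0, 1]
label     = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

acc = single_feature_accuracy(has_ruler, label)
# If a feature no domain expert would use scores near-perfectly on its own,
# the model will almost certainly learn the shortcut too.
if acc > 0.85:
    print(f"Leakage red flag: 'has_ruler' alone scores {acc:.0%}")
```

If a metadata field alone reproduces 90% of your labels, you have found your shortcut before spending a GPU-hour on it.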
Why Framing Errors Survive So Long
If bad problem framing is this dangerous, why do smart teams keep skipping it?
Three reinforcing dynamics make it persistent.
First, feedback asymmetry. When you tune a hyperparameter, you see the result in minutes. When you reframe a problem, the payoff is invisible for weeks. Human brains discount delayed rewards. So teams gravitate toward the fast feedback loop of tuning, even when the slow work of framing has 10x the return.
Second, legibility bias. “I improved accuracy from 84.7% to 84.9%” is a clean, defensible statement in a standup meeting. “I spent yesterday convincing the product team that we’re optimizing the wrong metric” sounds like you accomplished nothing. Organizations reward visible output. Framing produces no visible output until it prevents a disaster nobody knew was coming.
Third, identity. Data scientists are trained as model builders. The tools, the courses, the Kaggle leaderboards, the interview questions: they all center on modeling. Problem framing feels like someone else’s job (product, business, strategy). Claiming it means stepping outside your technical identity, and that’s uncomfortable.

Andrew Ng named this pattern when he introduced the concept of data-centric Artificial Intelligence (AI) in 2021. He defined it as “the discipline of systematically engineering the data needed to build a successful AI system.” His argument: the ML community had spent a decade obsessing over model architecture while treating data (and by extension, problem definition) as someone else’s job. The returns from better architectures had plateaued. The returns from better problem definition had barely been tapped.
The Steel Man for Tuning
Before going further: hyperparameter tuning is not useless. There are situations where it’s exactly the right thing to do.
If you’ve already validated that your target variable maps directly to a business decision. If your data distribution in production matches training. If you’ve confirmed that your features capture the signal the business cares about (and only that signal). If all of this is true, then tuning the model’s capacity, regularization, and learning rate is legitimate optimization.
The claim isn’t “never tune.” The claim is: most teams start tuning before they’ve earned the right to tune. They skip the framing work that determines whether tuning will matter at all. And when tuning produces marginal gains on a misframed problem, those gains are illusory.
Data analytics research shows the pattern clearly: once you’ve reached 95% of possible performance with basic configuration, spending days to extract another 0.5% rarely justifies the computational cost. That calculation gets worse when the 95% is measured against the wrong target.
The 5-Step Problem Framing Protocol
This protocol runs before any modeling. It takes 2 to 5 days depending on stakeholder availability. Each step produces a written artifact that your team can reference and challenge. Skip a step, and you’re gambling that your assumptions are correct. Most aren’t.
Step 1: Name the Decision (Not the Prediction)
Who: Data science lead + the business stakeholder who will act on the model’s output.
When: First meeting. Before any data exploration.
How: Ask this question and write down the answer verbatim:
“When this model produces an output, what specific decision changes? Who makes that decision, and what do they do differently?”
Example (good): “The retention team calls the top 200 at-risk customers each week instead of emailing all 5,000. The model ranks customers by reactivation likelihood so the team knows who to call first.”
Example (bad): “We want to predict churn.” (No decision named. No actor identified. No action specified.)
Red flag: If the stakeholder can’t name a specific decision, the project doesn’t have a use case yet. Pause. Don’t proceed to data exploration. A model with no decision is a report nobody reads.
Step 2: Define the Error Cost Asymmetry
Who: Data science lead + business stakeholder + finance (if available).
When: Same meeting or next day.
How: Ask:
“What’s worse: a false positive or a false negative? By how much?”
Example: For a fraud detection model, a false negative (missed fraud) costs the company an average of $4,200 per incident. A false positive (blocking a legitimate transaction) costs $12 in customer service time plus a 3% chance of losing the customer ($180 expected value). The ratio is roughly 23:1. This means the model should be tuned for recall, not precision, and the decision threshold should be set much lower than 0.5.
Why this matters: Default ML metrics (accuracy, F1) assume symmetric error costs. Real business problems almost never have symmetric error costs. If you optimize F1 when your actual cost ratio is 23:1, you’ll build a model that performs well on paper and poorly in production. Zillow’s Zestimate treated overestimates and underestimates as equally bad. They weren’t. Overpaying for a house you can’t resell for months is catastrophically worse than underbidding and losing a deal.
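The threshold implied by an asymmetric cost ratio falls out of a one-line expected-cost comparison. A minimal sketch using the hypothetical fraud figures above (every dollar amount is illustrative; the service-time cost is ignored as negligible next to the churn cost):

```python
# Cost-sensitive decision threshold, using the hypothetical fraud figures
# from the example above (all dollar amounts are illustrative).

COST_FN = 4200.0  # average cost of a missed fraud incident
COST_FP = 180.0   # expected cost of wrongly blocking a legitimate customer

def optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Flag when p * cost_fn > (1 - p) * cost_fp,
    which rearranges to p > cost_fp / (cost_fp + cost_fn)."""
    return cost_fp / (cost_fp + cost_fn)

t = optimal_threshold(COST_FP, COST_FN)
print(f"Cost ratio ~{COST_FN / COST_FP:.0f}:1 -> flag above p = {t:.3f}")
```

With symmetric costs the same formula gives back 0.5; the asymmetry is what drags the operating point down to around 0.04.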
Step 3: Audit the Target Variable
Who: Data science lead + domain expert.
When: After Steps 1-2 are documented. Before any feature engineering.
How: Answer these four questions in writing:
- Does this target variable actually measure what the business cares about? “Churn” might mean “cancelled subscription” in your data but “stopped using the product” in the stakeholder’s mind. Those are different populations. Clarify which one maps to the decision in Step 1.
- When is the target observed relative to when the model needs to act? If you’re predicting 30-day churn but the retention team needs 14 days to intervene, your prediction window is wrong. The model needs to predict churn at least 14 days before it happens.
- Is the target contaminated by the intervention you’re trying to optimize? If past retention efforts already reduced churn for some customers, your training data underestimates their true churn risk. The model learns “these customers don’t churn” when the truth is “these customers don’t churn because we intervened.” This is the causal inference trap, and it’s invisible in standard train/test splits.
- Can the model learn the right signal, or will it find shortcuts? The ruler-in-dermatology problem. List the features. For each one, ask: “Would a domain expert use this feature to make this decision?” If not, it may be a proxy that won’t generalize.
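The second question above, the prediction window versus the intervention lead time, is cheap to check mechanically. A minimal sketch, assuming a 14-day lead time and illustrative dates:

```python
# Lead-time sanity check for the prediction-window question above.
# The 14-day figure and the dates are illustrative assumptions.

from datetime import date

INTERVENTION_LEAD_DAYS = 14  # time the retention team needs to act

def has_enough_lead_time(prediction_date, churn_date,
                         lead_days=INTERVENTION_LEAD_DAYS):
    """True if the prediction lands early enough for the team to intervene."""
    return (churn_date - prediction_date).days >= lead_days

# A prediction made on June 1 for a churn event on June 10 arrives too late;
# one for a June 20 event leaves room to act.
print(has_enough_lead_time(date(2024, 6, 1), date(2024, 6, 10)))  # False
print(has_enough_lead_time(date(2024, 6, 1), date(2024, 6, 20)))  # True
```

Running this over historical labels tells you what fraction of your “correct” predictions would have been operationally useless.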
Step 4: Simulate the Deployment Decision
Who: Full project team (DS, engineering, product, business stakeholder).
When: After Steps 1-3 are documented. Before modeling begins.
How: Run a tabletop exercise. Present the team with 10 synthetic model outputs (a mix of correct predictions, false positives, and false negatives) and ask:
- “Given this output, what action does the business take?”
- “Is that action correct given the ground truth?”
- “How much does each error type cost?”
- “At what confidence threshold does the business stop trusting the model?”
This exercise surfaces misalignments that no metric can catch. You might discover that the business actually needs a ranking (not a binary classification). Or that the stakeholder won’t act on predictions below 90% confidence, which means half your model’s output is ignored. Or that the “action” requires information the model doesn’t provide (like why a customer is at risk).
Artifact: A one-page deployment spec listing: who uses the output, in what format, at what frequency, with what confidence threshold, and what happens when the model is wrong.
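The tabletop exercise itself can be scripted, so the numbers are on the table during the meeting. A minimal sketch with invented scenarios, costs, and a 0.9 trust threshold (every figure here is an assumption for illustration):

```python
# Tabletop sketch: tally business cost over a handful of synthetic model
# outputs. Scenarios, costs, and the trust threshold are all invented.

# (predicted_prob, model_flags_risk, customer_actually_churned)
synthetic_outputs = [
    (0.95, True, True),    # true positive: team calls, customer saved
    (0.92, True, False),   # false positive: wasted call
    (0.60, True, True),    # flagged below trust threshold: team ignores it
    (0.30, False, True),   # false negative: customer lost
    (0.10, False, False),  # true negative: nothing to do
]

TRUST_THRESHOLD = 0.9       # stakeholder only acts on high-confidence flags
COST_WASTED_CALL = 15.0     # illustrative cost of calling a safe customer
COST_LOST_CUSTOMER = 600.0  # illustrative cost of an unaddressed churn

total_cost = 0.0
ignored = 0
for prob, flagged, churned in synthetic_outputs:
    acted = flagged and prob >= TRUST_THRESHOLD
    if flagged and not acted:
        ignored += 1  # the model spoke, nobody listened
    if acted and not churned:
        total_cost += COST_WASTED_CALL
    if not acted and churned:
        total_cost += COST_LOST_CUSTOMER

print(f"Ignored flags: {ignored}, total error cost: ${total_cost:.0f}")
```

Even with five invented rows, this surfaces the Step 4 questions concretely: the trust threshold silently converts a correct flag into a lost customer, and that cost never shows up in AUC.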
Step 5: Write the Anti-Target
Who: Data science lead.
When: After Steps 1-4. The last check before modeling begins.
How: Write one paragraph answering:
“If this project succeeds on every metric we’ve defined but still fails in production, what went wrong?”
Example 1: “The churn model hits 0.91 AUC on the test set, but the retention team ignores it because the predictions arrive 48 hours after their weekly planning meeting. The model is accurate but operationally useless because we didn’t align the prediction cadence with the decision cadence.”
Example 2: “The fraud model flags 15% of transactions, overwhelming the review team. They start rubber-stamping approvals to clear the queue. Technically the model catches fraud; practically the humans in the loop have learned to ignore it.”
The anti-target is an inversion: instead of defining success, define the most plausible failure. If you can write a vivid anti-target, you can usually prevent it. If you can’t write one, you haven’t thought hard enough about deployment.

Is This a Tuning Problem or a Framing Problem?
Not every stalled project needs reframing. Sometimes the problem is well-framed and you genuinely need better model performance. Use this diagnostic to tell the difference.

What Changes When Teams Frame First
The shift from model-centric to problem-centric work isn’t just about avoiding failure. It changes what “senior” means in data science.
Junior data scientists are valued for modeling skill: can you train, tune, and deploy? Senior data scientists should be valued for framing skill: can you translate an ambiguous business situation into a well-posed prediction problem with the right target, the right features, and the right success criteria?
The industry is slowly catching up. Andrew Ng’s push toward data-centric AI is one signal. The RAND Corporation’s 2024 report on AI anti-patterns is another: their top recommendation is that leaders should ensure technical staff understand the purpose and context of a project before starting. QCon’s 2024 analysis of ML failures names “misaligned objectives” as the most common pitfall.
The pattern is clear. The bottleneck in ML isn’t algorithms. It’s alignment between the model’s objective and the business’s actual need. And that alignment is a human conversation, not a computational one.
The bottleneck in ML is not compute or algorithms. It’s the conversation between the person who builds the model and the person who uses the output.
For organizations, this means problem framing should be a first-class activity with its own time allocation, its own deliverables, and its own review process. Not a preamble to “the real work.” The real work.
For individual data scientists, it means the fastest way to increase your impact isn’t learning a new framework or mastering distributed training. It’s learning to ask better questions before you open a notebook.
It’s 11:14 PM on a Wednesday. You’re three weeks into a project. Your validation metric is climbing. You’re about to launch another sweep.
Stop.
Open a blank document. Write one sentence: “The decision that changes based on this model’s output is ___.” If you can’t fill in the blank without calling a stakeholder, you’ve just found the highest-ROI activity for tomorrow morning. It won’t feel like progress. It won’t produce a Slack-worthy screenshot. But it’s the only work that determines whether the next three weeks matter at all.
References
- RAND Corporation, “The Root Causes of Failure for Artificial Intelligence Projects and How They Can Succeed”, James Ryseff, Brandon De Bruhl, Sydne J. Newberry, 2024.
- MIT Sloan, “Why It’s Time for ‘Data-Centric Artificial Intelligence’”, Sara Brown, June 2022.
- insideAI News, “The $500mm+ Debacle at Zillow Offers: What Went Wrong with the AI Models?”, December 2021.
- Stanford Graduate School of Business, “Flip Flop: Why Zillow’s Algorithmic Home Buying Venture Imploded”.
- Diagnostics (MDPI), “Uncovering and Correcting Shortcut Learning in Machine Learning Models for Skin Cancer Diagnosis”, 2022.
- VentureBeat, “When AI Flags the Ruler, Not the Tumor”.
- InfoQ, “QCon SF 2024: Why ML Projects Fail to Reach Production”, November 2024.
- Number Analytics, “8 Hyperparameter Tuning Insights Backed by Data Analytics”.
