Limitations of Machine Learning
As a data scientist in today’s digital age, you must be equipped to answer a variety of questions that go far beyond simple pattern recognition. Traditional machine learning is built on association; it seeks to find patterns in existing data to predict future observations under the assumption that the underlying system stays constant. If you train a model to predict house prices, you are asking the algorithm to find the most likely price given a set of features.
Causal analysis, however, introduces a “what if” component. It goes beyond observation to ask how the system would react if we actively changed a variable. This is the difference between noticing that people who buy expensive lattes are also likely to buy sports cars, and understanding whether lowering the price of that coffee would actually cause an increase in car sales. In the world of causal inference, we are essentially trying to learn the underlying laws of a business or social system, allowing us to predict the outcomes of actions we haven’t taken yet.
Causal analysis is essential in a variety of fields where we need to move beyond observing patterns to making decisions, particularly in areas like healthcare, marketing, and public policy. Consider a medical researcher evaluating a new blood pressure medication and its effect on heart attack severity. With historical data, you might see that patients taking the medication actually have more severe heart attacks. A standard ML (Machine Learning) model would suggest the drug is harmful. However, this is likely due to confounding: doctors only prescribe the medication to patients who already have poorer health. To find the truth, we must isolate the drug’s actual impact from the noise of the patients’ existing conditions.
In this article, I will introduce some of the important concepts and tools in causal ML in an accessible way. I will only use libraries that manage data, calculate probabilities, and estimate regression parameters. This article is not a tutorial, but a starting point for those interested in, yet intimidated by, causal inference methods. I was inspired by the online book Causal Inference for the Brave and True by Matheus Facure Alves. Note: for those unfamiliar with probability, E[X] refers to the average value that a random variable/quantity X takes.
The Potential Outcomes Framework
When we begin a causal study, the questions we ask are far more specific than loss minimization or prediction accuracy. We typically start with the Average Treatment Effect (ATE), which tells us the mean impact of an intervention or action across an entire population.
In our medical example, we want to know the difference in heart attack severity if the entire population took the drug versus if the entire population did not. To define this mathematically, we use the Potential Outcomes Framework. First, let’s define a few variables:
- Y: The Outcome (e.g., a heart attack severity score from 0 to 100).
- T: The Treatment indicator. This is a binary “switch”:
  - T = 1 means the patient took the drug.
  - T = 0 means the patient did not take the drug (the Control).
- Y(1): The outcome we would see if the patient was treated.
- Y(0): The outcome we would see if the patient was not treated.
The theoretical ATE is the expected difference between these two potential outcomes across the entire population:

ATE = E[Y(1) - Y(0)]
To address the dilemma of unobserved outcomes, researchers use the Potential Outcomes Framework as a conceptual guide. In this framework, we assume that for every individual there exist two “potential” outcomes: Y(1) and Y(0). For any given individual, we only ever observe one of these two values, which is known as the Fundamental Problem of Causal Inference.
If a patient takes the medication (T=1), we see their factual outcome, Y(1), while their outcome without the medication, Y(0), remains a counterfactual: a state of the world that could have existed but did not.
ATE in a Perfect World
Since the individual treatment effect is the difference between these two values, it remains hidden from us. This shifts the entire goal of causal estimation away from the individual and toward the group. Because we cannot subtract a counterfactual from a factual for one person, we must find clever ways to compare groups of people.
If the group receiving the treatment is statistically identical to the group that is not, we can use the average observed outcome of one group to stand in for the missing counterfactual of the other. This allows us to estimate the Average Treatment Effect by calculating the difference between the mean outcome of the treated group and the mean outcome of the control group:
ATE = E[Y|T=1] - E[Y|T=0]
Suppose that for those who took the drug, we observed a mean heart attack severity of 56/100, compared to 40/100 for those who did not. If we attempt to estimate the causal effect by taking a simple difference in means, the data suggests that taking the drug led to a 16-point increase in severity.
E[Y|T=1] = 56, E[Y|T=0] = 40 -> Biased ATE = 56 - 40 = 16
Unless this drug is among the most dangerous ever created, there is likely another mechanism at play. This discrepancy arises because we can only interpret a simple difference in means as the Average Treatment Effect if the treatment was assigned via a Randomized Controlled Trial (RCT), which guarantees fully random assignment of treatment groups. Without randomization, the treated and control groups are not exchangeable and differ in ways that make a direct comparison misleading.
Randomization
The reason an RCT is the default method for calculating the ATE is that it helps eliminate Selection Bias. In our medical example, the 16-point harm we observed likely occurred because doctors gave the drug to the highest-risk patients. In this scenario, the treated group was already predisposed to higher severity scores before they ever took the pill. When we use an RCT, we remove the human element of choice. With this randomized selection, we ensure that high-risk and low-risk patients are distributed equally between both groups.
Mathematically, randomization ensures that the treatment assignment is independent of the potential outcomes: (Y(1), Y(0)) ⊥ T.
Now, we can assume that the average outcome of the treated group is a perfect proxy for what would have happened if the entire population had been treated. Because the “Treated” and “Control” groups start as statistical clones of one another, any difference we see at the end of the study must be attributable to the drug itself.
Observational Data and Confounders
In the real world, we are often forced to work with observational data. In these situations, the simple difference in means fails us because of the presence of confounders. A confounder is a variable that influences both the treatment and the outcome, creating a “backdoor path” that allows a non-causal correlation to flow between them.
In order to visualize these hidden relationships, causal researchers use Directed Acyclic Graphs (DAGs). A DAG is a specialized graph in which variables are represented as nodes and causal relationships are represented as arrows. Directed means that the arrows have a specific direction, indicating a one-way causal flow from a cause to an effect. Acyclic means the graph contains no cycles; you cannot follow a chain of arrows and end up back at the first variable, essentially because transitioning from one node to the next should represent a lapse in time. A confounder reveals itself in a DAG by its directed connection to both the treatment and the outcome, as seen below.
Once we have identified the confounders through our DAG, the next step is to mathematically account for them. If we want to isolate the true effect of the drug, we need to compare patients who are similar in every way except for whether they took the medication. In causal analysis, the most important tool for this is Linear Regression. By including the confounder as an independent variable, the model calculates the relationship while holding the initial health of the patient constant. For our example, I generated a mock dataset where treatment assignment was dependent on initial health (I.H.). This can be seen in the code below, where both the probability of receiving the drug and the severity depend on the initial health score.
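What follows is a minimal sketch of such a data-generating process. The exact coefficients, seed, and functional forms here are my own illustrative assumptions; what matters is the structure: initial health drives both treatment assignment and severity, and the true drug effect is fixed at -10 points to match the discussion that follows.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Initial health score (higher is healthier).
initial_health = rng.uniform(20, 80, n)

# Confounding: sicker patients are more likely to be prescribed the drug.
p_drug = 1 / (1 + np.exp(0.1 * (initial_health - 50)))
drug = rng.binomial(1, p_drug)

# Severity worsens with poor initial health; the true causal effect
# of the drug is a 10-point reduction (an assumption for illustration).
severity = 80 - 0.6 * initial_health - 10 * drug + rng.normal(0, 5, n)

df = pd.DataFrame({
    "initial_health": initial_health,
    "drug": drug,
    "severity": severity,
})

# Naive difference in means, biased upward by confounding.
naive_ate = (df.loc[df.drug == 1, "severity"].mean()
             - df.loc[df.drug == 0, "severity"].mean())
print(f"Naive ATE: {naive_ate:.2f}")
```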

In this view, individuals who received the drug had an average severity increase of 3.47 points. To find the truth, we fit an OLS (Ordinary Least Squares) multiple linear regression model to control for the participants’ initial health score.
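A sketch of this adjustment using the statsmodels formula API, assuming the `df` DataFrame built above (your exact numbers will vary with the simulation seed):

```python
import statsmodels.formula.api as smf

# Regress severity on the treatment while holding initial health constant.
ols_model = smf.ols("severity ~ drug + initial_health", data=df).fit()
print(ols_model.summary())

# The coefficient on `drug` is the de-biased treatment effect.
print(f"Adjusted effect: {ols_model.params['drug']:.2f}")
```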

The most important finding here is the coefficient of the treatment variable (drug). While the raw data suggested the drug was harmful, our coefficient is roughly -9.89. This means that once we control for the confounder of initial health, taking the drug actually decreases heart attack severity by nearly 10 points. This is very close to our true effect, which was a decrease of exactly 10 points!
This result is more consistent with our expectations, and that is because we eliminated a large source of selection bias by controlling for confounders. The beauty of linear regression in this context is that the setup is just like that of a typical regression problem. Transformations can be applied, diagnostic plots can be produced, and slopes can be interpreted as usual. However, because we are including confounders in our model, their effect on the outcome will not be absorbed into the treatment coefficient, something known as de-biasing or adjusting, as previously mentioned.
Matching and Propensity Scoring
While multiple linear regression is a powerful tool for de-biasing, it relies heavily on the assumption that the relationship between your confounders and the outcome is linear. In many real-world situations, your treated and control groups may be so fundamentally different that a regression model is forced to extrapolate in regions where it has no actual data.
To solve this, researchers often turn to Matching, a technique that shifts the focus from mathematical adjustment to data restructuring. Instead of using a formula to hold health constant, matching searches the control group for a “twin” for every treated individual. When we pair a patient who took the drug (T = 1) with a patient of nearly identical initial health who did not (T = 0), we effectively prune our dataset into a Synthetic RCT.
In this balanced subset, the groups are finally exchangeable, allowing us to compare their outcomes directly to reveal the true Average Treatment Effect (ATE). It is almost as if each pair allows us to observe both the factual and the counterfactual states for a single type of observation. When we consider how to match two entries in a dataset, note that each entry is represented by a vector in an n-dimensional space, where each dimension corresponds to a feature or confounder.
At first glance, it seems we could simply calculate the distance between these vectors using Euclidean distance. However, the issue with this approach is that all covariates are weighted equally, regardless of their actual causal influence. In high dimensions (a problem known as the curse of dimensionality), even an entry’s closest match may still be fundamentally different in the ways that actually matter for the treatment.
In our mock dataset, looking at the participants with the lowest health scores below, we see that treated participant 74 and untreated participant 668 have nearly identical initial health scores. Because we are only dealing with one confounder here, these two are ideal candidates to be matched together. However, as dimensionality increases, it becomes impossible to find these matches by just looking at the numbers, and simple Euclidean distance fails to prioritize the variables that truly drive the selection bias.
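With the mock data above, a quick sort on initial health makes such near-twins easy to spot (the specific participant IDs, like 74 and 668, depend on the random seed used):

```python
# Show the participants with the lowest initial health scores,
# treated and untreated side by side.
print(df.sort_values("initial_health").head(10))
```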

In practice, this process is most commonly executed as one-to-one matching, where each treated unit is paired with its single closest neighbor in the control group. To ensure these matches are high-quality, we use the Propensity Score: a single number representing the probability that a participant would receive the treatment given their characteristics, P(T = 1|X). This score collapses our high-dimensional space into a single dimension that specifically reflects the likelihood of treatment given a set of covariates. We then use a k-Nearest Neighbors (k-NN) algorithm to perform a “fuzzy” search on this score.
To prevent poor matches, we can choose a threshold (a caliper) to serve as the maximum allowable distance for a match. We can calculate propensity in a variety of ways, the most common being logistic regression, but other ML methods capable of outputting probabilities, such as XGBoost or Random Forest, work as well. In the code below, I calculated propensities by building a logistic regression model that predicts drug participation from just initial health. In practice, you will have more confounders in your model.
As mentioned, the first step of propensity score matching is the calculation of the propensity score. In our example, we only have initial health as a confounder, so that will be the sole covariate in our simple logistic regression.
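One way to compute the scores, assuming scikit-learn and the `df` DataFrame from earlier:

```python
from sklearn.linear_model import LogisticRegression

# Model P(T = 1 | X) with initial health as the only covariate.
ps_model = LogisticRegression()
ps_model.fit(df[["initial_health"]], df["drug"])

# Store each participant's propensity score.
df["propensity"] = ps_model.predict_proba(df[["initial_health"]])[:, 1]
print(df[["initial_health", "drug", "propensity"]].head())
```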

As expected, participants 74 and 668 were assigned very similar propensities and would likely be matched. It is also often helpful to generate what is known as a Common Support plot, which displays the density of the calculated propensity scores separated by treated and control. Ideally, we want to see as much overlap and symmetry as possible, as that means matching units will be simpler. As seen below, selection bias is present in our dataset. It is a good exercise to analyze the data generation code above and determine why.
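A simple way to draw the plot with matplotlib, as overlaid density histograms of the scores split by treatment group:

```python
import matplotlib.pyplot as plt

# Common Support plot: propensity score densities by treatment status.
for label, group in df.groupby("drug"):
    plt.hist(group["propensity"], bins=30, alpha=0.5, density=True,
             label=f"drug = {label}")
plt.xlabel("Propensity score")
plt.ylabel("Density")
plt.title("Common Support")
plt.legend()
plt.show()
```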

Although not strictly necessary in the one-dimensional case, we can then use k-NN to match treated with untreated participants based on their propensity score.
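A minimal one-to-one matching sketch using scikit-learn’s NearestNeighbors (strictly speaking this averages over treated units only, which coincides with the ATE here because the simulated effect is constant):

```python
from sklearn.neighbors import NearestNeighbors

treated = df[df.drug == 1]
control = df[df.drug == 0]

# For each treated unit, find the closest control unit in propensity space.
nn = NearestNeighbors(n_neighbors=1)
nn.fit(control[["propensity"]])
_, idx = nn.kneighbors(treated[["propensity"]])

# Average difference between treated outcomes and their matched controls.
matched = control.iloc[idx.ravel()]
ate_matched = (treated["severity"].values - matched["severity"].values).mean()
print(f"Matched ATE: {ate_matched:.2f}")
```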
If you recall from before, our linear regression yielded an ATE of -9.89, compared to our newly calculated value of -10.16. As we increase the complexity and number of covariates in our model, our propensity score matching ATE will likely get closer and closer to the underlying causal effect of -10.
Time-Invariant Effects Using Difference-in-Differences
While matching is excellent for de-biasing based on the variables we can see, it falls short when there are hidden factors, like a patient’s genetic predisposition or a hospital’s particular management style, that we haven’t recorded in our data. If these unobserved confounders are time-invariant (meaning they stay constant over the study period), we can use Difference-in-Differences (DiD) to cancel them out.
Instead of just comparing the treated group to the control group at a single point in time, DiD looks at two groups over two periods: before and after the treatment. The logic is simple yet elegant: we calculate the change in the control group and assume the treated group would have changed by that same amount had they not received the treatment. Any extra change observed in the treated group is attributed to the treatment itself. The equation for the DiD estimator is as follows:

DiD = (E[Y|T=1, Post] - E[Y|T=1, Pre]) - (E[Y|T=0, Post] - E[Y|T=0, Pre])
While this formula may appear intimidating at first glance, it is best read as the difference between the changes occurring before and after treatment. For example, imagine two ice cream shops in different cities. Before the weekend, Store A (our treatment group) sells 200 cones, and Store B (our control group) sells 300. On Saturday, a heat wave hits Store A’s city, but not Store B’s. By the end of the day, Store A’s sales jump to 500, while Store B’s sales rise to 400. A simple analysis of Store A would suggest the heat wave caused a +300 increase. However, the control shop (Store B) grew by +100 in the same period without any heat wave, perhaps due to a holiday or general summer weather.
The Difference-in-Differences approach subtracts this natural time trend of +100 from Store A’s total growth: DiD = (500 - 200) - (400 - 300) = +200. It effectively cancels out any time-invariant confounders, factors like the store’s location or its base popularity that might have otherwise skewed our results. This reveals that the true causal impact of the heat wave was +200 units.
A significant limitation of basic Difference-in-Differences (DiD) is that it does not account for factors that change over time. While the “change-in-change” logic successfully cancels out static, time-invariant confounders (like someone’s genetic history or a hospital’s geographic location), it remains vulnerable to time-varying confounders. These are factors that shift during the study period and affect the treatment and control groups differently.
In our heart attack study, for instance, even a DiD analysis could be biased if the hospitals administering the drug also underwent significant staffing changes or received upgraded equipment during the “Post” period. If we fail to account for these changing variables, the DiD estimator will incorrectly attribute their influence to the drug itself, leading to a “polluted” causal estimate.
It is important to note that the simple cross-sectional data structure we used for Regression and Matching is insufficient for this method. To calculate a “change in the change,” we need a temporal dimension in our dataset. Specifically, we need a variable indicating whether an observation occurred in the Pre-treatment or Post-treatment period, for both the treated and control groups.
To solve this, we move beyond simple subtraction and implement DiD within a Multiple Linear Regression framework. This allows us to explicitly “control” for time-varying factors, effectively isolating the treatment effect while holding external shifts constant.
The regression model is defined as:

severity = β0 + β1·drug + β2·post + β3·(drug × post) + β4·quality_of_care + ε

where β3, the coefficient on the drug × post interaction, is the difference-in-differences estimator.
Below, a new synthetic dataset is constructed to reflect the required structure. I also added a Quality of Care variable for demonstration purposes. I did not include the full simulation code due to its length, but it essentially modifies the previous logic by duplicating our observations across two distinct time periods.
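A condensed sketch of that modification, reusing `df` and `rng` from earlier (this is my own abbreviation of the logic described above; the true effect is again -10, and quality of care is deliberately improved only for the treated group in the post period, which matters for the discussion below):

```python
# Duplicate each patient across a pre- and post-treatment period.
pre = df[["initial_health", "drug"]].copy()
pre["post"] = 0
post_df = pre.copy()
post_df["post"] = 1
panel = pd.concat([pre, post_df], ignore_index=True)

# Quality of care improves for the treated group in the post period,
# entangling it with the treatment itself.
panel["quality_of_care"] = (3 + 2 * panel["drug"] * panel["post"]
                            + rng.normal(0, 0.5, len(panel)))

# The drug's -10-point effect only applies once treatment has occurred.
panel["severity"] = (80 - 0.6 * panel["initial_health"]
                     - 10 * panel["drug"] * panel["post"]
                     - 2 * panel["quality_of_care"]
                     + rng.normal(0, 5, len(panel)))
print(panel.head())
```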

Now that we have our data in the correct format, we can fit a linear regression model using the specification just mentioned.
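Using the same statsmodels formula API as before, with `drug:post` as the interaction term:

```python
# Fit the DiD regression; the drug:post coefficient is the DiD estimator.
did_model = smf.ols(
    "severity ~ drug + post + drug:post + quality_of_care", data=panel
).fit()
print(did_model.summary())
```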

The R-squared value of 0.324 indicates that the model explains roughly 32.4 percent of the variance in heart attack severity. In causal analysis, this is common, as many unmeasured factors like genetics are treated as noise. The intercept of 48.71 represents the baseline severity for the control group during the pre-treatment period. The drug coefficient of 12.75 confirms selection bias, showing that the treated group initially had higher severity scores. Additionally, the quality of care coefficient suggests that each unit increase in that index corresponds to a 2.10-point reduction in severity.
The interaction term, drug:post, provides the difference-in-differences estimator, which reveals an estimated drug effect of -6.58. This tells us the medication lowered severity after adjusting for group differences and time trends, though the estimate is notably smaller in magnitude than the true effect of -10. This discrepancy occurs because the quality of care improved specifically for the treated group during the post-treatment period, due to the data generation process. Since these two changes occurred simultaneously in the same group, they are perfectly correlated, or collinear.
The model essentially faces a mathematical stalemate where it cannot determine whether the improvement came from the drug or the better care, so it splits the credit between them. As with any linear regression, if two variables are perfectly correlated, a model may drop one entirely or produce highly unstable estimates. Nevertheless, all variables maintain p-values of 0.000, confirming that despite the split credit, the results remain statistically significant. In real data and analysis, we will encounter these kinds of situations, and it is important to know all the tools in your data science shed before you tackle a problem.
Conclusion and Final Thoughts
In this article, we explored the transition from standard ML to the logic of causal inference. We saw through synthetic examples that while simple differences in means can be misleading due to selection bias, methods like linear regression, propensity score matching, and difference-in-differences allow us to strip away confounders and isolate true impact.
Having these tools in our arsenal is not enough, however. As seen with our final model, even sophisticated techniques can yield issues when interventions overlap. While these methods are powerful in adjusting for confounding, they require a deep understanding of their underlying mechanics. Relying on model outputs without acknowledging the reality of collinearity or time-varying factors can lead to misleading conclusions.
At the same time, knowing when and how to apply these tools is a valuable skill for any data scientist. In my opinion, one of the best parts of doing statistical programming for causal inference is that most of the methods stem from a few fundamental statistical models, making implementation easier than one might expect.
The real world is undeniably messy and full of data issues, and it is rare that we will observe a perfectly clean causal signal. Causal machine learning is ultimately about exploiting the right data while having the confidence that our variables allow for true adjustment. This article is my first step in documenting my causal inference journey, and I plan to release a part two that dives deeper into more topics, including Instrumental Variables (IV), Panel Regression, Double Machine Learning (DML), and Meta-Learners.
Suggested Reading
Facure, Matheus. Causal Inference for the Brave and True. Available at: https://matheusfacure.github.io/python-causality-handbook/
