    Zero-Inflated Data: A Comparison of Regression Models

    By ProfitlyAI | September 5, 2025 | 13 min read

    I was working on a regression problem. I knew that the target I wanted to build a predictive model for was a count (i.e. 0, 1, 2, …). Consequently, I immediately thought of choosing a Generalized Linear Model (GLM) with a suitable discrete distribution, such as the Poisson distribution or the Negative Binomial distribution. But everything did not go as well as expected. I mistook haste for speed.

    Zero-inflated data

    First of all, let us look at a dataset suitable for this post. I have chosen the results of the NextGen National Household Travel Survey [1]. The variable of interest, named “BIKETRANSIT”, is the number of “days in last 30 days biking used”, so it is an integer between 0 (no use) and 30 (daily users). Here is a histogram of the variable in question.

    Histogram of the number of biking days

    We can clearly see that the count data is zero-inflated. Most of the respondents have not used a bike a single day over the last 30 days. I also noticed some interesting patterns: there are usually more people reporting bike use on exactly 5, 10, 15, 20, 25, or 30 days compared to the adjacent numbers. This is probably because respondents prefer to choose round numbers when they are unsure of the precise count. Whatever the reason, in this post we will focus primarily on the issue of zero inflation by comparing models designed for zero-inflated count data.

    Several survey fields were selected as independent variables to explain the number of bike days (e.g., age, gender, worker category, education level, household size, and district characteristics). I intentionally excluded features that count the number of days spent on other activities (such as using taxis or shared bikes), since some of them are highly correlated with the outcome of interest. I want the model to remain realistic: predicting bike usage over 30 days based on taxi, car, or public transport usage over the same period would not provide meaningful insights.

    Poisson regression limits

    Before introducing the zero-inflated model, I would like to illustrate the limits of Poisson regression, which I first considered for this dataset. I have not looked at the Negative Binomial distribution in this section. Poisson regression assumes that the dependent random variable Y follows a Poisson distribution, conditional on the independent variables X and the parameters β.

    Poisson regression distribution model
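
For reference, the assumption can be written out as follows. This is a sketch that assumes the usual log link, which matches the exponential form used for λ later in the post; the notation is mine and may differ from the original figure.

```latex
Y \mid X, \beta \sim \mathrm{Poisson}(\lambda),
\qquad \lambda = \exp(X \cdot \beta),
\qquad P(Y = k \mid X, \beta) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots
```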

    So, let's take a look at some empirical distributions of Y∣X,β. Since I included many features, it is difficult to find a large number of observations with exactly the same values of X. To address this, I used a clustering algorithm (AgglomerativeClustering from scikit-learn [2]) to group observations with similar feature profiles.
    First, I preprocessed the data so that it can feed both the regression models and the clustering algorithm. I do not want to spend too much time explaining all the preprocessing steps, as this post does not focus on them. The full preprocessing code is available in a repo [8]. In short, I encoded the categorical features using one-hot encoding. I also applied several preprocessing steps to the other features: imputing missing values, clipping outliers, and applying transformation functions where appropriate. Finally, I performed clustering on the transformed dataset.

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    
    pipe = Pipeline(
        [
            # Standardize: numerical features such as age span a wider range than the
            # one-hot encoded features, and clustering relies on distances
            ("scaler", StandardScaler()),
            # 100 clusters, chosen so that the biggest groups contain many observations
            ("cluster", AgglomerativeClustering(n_clusters=100)),
        ]
    )
    # X_train_preprocessed is a numerical dataframe, after encoding the categorical features
    cluster_id = pipe.fit_predict(X_train_preprocessed)

    Then I estimated the Poisson parameter λ for each cluster, using the unbiased estimator given by the mean of the observations in that cluster.

    λ estimator
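
A minimal sketch of this per-cluster estimate, assuming `y` holds the observed counts and `cluster_id` the cluster labels. The function name and the toy arrays are hypothetical, not taken from the author's repo.

```python
import numpy as np


def poisson_rate_per_cluster(y, cluster_id):
    """Return {cluster label: sample mean of y in that cluster}.

    The sample mean is the estimate of the Poisson parameter for the cluster.
    """
    y = np.asarray(y, dtype=float)
    cluster_id = np.asarray(cluster_id)
    return {int(c): float(y[cluster_id == c].mean()) for c in np.unique(cluster_id)}


# Toy data (hypothetical): bike-day counts split into two clusters
y_toy = np.array([0, 0, 3, 1, 0, 10, 20, 0])
cluster_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
rates = poisson_rate_per_cluster(y_toy, cluster_toy)
print(rates)
```

A cluster whose mean is exactly zero gives no valid Poisson parameter, which is the situation described for the group with no bike usage.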

    I then plotted the empirical histograms along with the probability mass functions of the fitted Poisson distributions for a few groups of observations. To assess the quality of the fit, I computed both the cross-entropy and the entropy, noting that entropy serves as a lower bound for cross-entropy according to Gibbs' inequality [3]. A good model should produce a cross-entropy value close to the entropy (though slightly larger).
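
The comparison can be sketched as follows, assuming counts between 0 and `max_count` and a Poisson parameter estimated by the sample mean. The function name and the toy sample are hypothetical.

```python
import numpy as np
from scipy.stats import poisson


def entropy_and_cross_entropy(y, max_count=30):
    """Entropy of the empirical distribution and its cross-entropy
    against a Poisson distribution fitted by the sample mean (in nats)."""
    y = np.asarray(y, dtype=int)
    support = np.arange(max_count + 1)
    # Empirical probability of each count value 0..max_count
    emp = np.bincount(y, minlength=max_count + 1) / y.size
    # Log-probabilities of the fitted Poisson distribution
    log_model = poisson.logpmf(support, y.mean())
    seen = emp > 0  # terms with zero empirical mass contribute nothing
    entropy = -np.sum(emp[seen] * np.log(emp[seen]))
    cross_entropy = -np.sum(emp[seen] * log_model[seen])
    return entropy, cross_entropy


# Toy zero-inflated sample (hypothetical): 80% zeros, the rest at 5 or 10 days
y_toy = np.array([0] * 80 + [5] * 10 + [10] * 10)
h, ce = entropy_and_cross_entropy(y_toy)
print(f"entropy={h:.3f}, cross-entropy={ce:.3f}")
```

The further the cross-entropy sits above the entropy, the worse the fitted distribution describes the sample.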

    For this analysis, I focused on three of the largest groups, since parameter estimation is more reliable with larger sample sizes. This is particularly important here because the data is skewed due to zero inflation, making it necessary to collect many observations. Among these groups, two contain bike users, while one group (228 respondents) reported no bike usage at all. For this last group, no Poisson distribution was fitted, since the Poisson parameter must be strictly greater than zero. Finally, I used a logarithmic vertical scale in the plots to account for the zero inflation.

    I find it difficult to evaluate the quality of the fitted distribution by looking at the entropy and the cross-entropy alone. However, I can see that the histogram and the probability mass function differ a lot. This is why I then considered the Zero-Inflated Poisson (ZIP) distribution.

    Models adapted to zero-inflated data

    Models designed for zero-inflated data aim to capture both the high probability of zeros and the comparatively low probabilities of other events. I explored two main families of such models:

    • “Zero-inflated models […] model the zeros using a two-component mixture model. […] The probability of the variable being zero is determined by both the main distribution and the mixture weight”. “A zero-inflated model can only increase the probability of P(x = 0)” [5]. For notation, I use the following setup (slightly different from Wikipedia and other sources). Let X1 be a hidden variable following a Bernoulli distribution. In my notation, the probability of success is p (whereas Wikipedia uses 1-π). Let X2 be another hidden variable, independent of X1, following a distribution that allows zeros with nonzero probability. For my use case, I assume X2 is discrete. The observed variable is then defined as X=X1*X2, which leads to the following probability mass function:
      P(X=0) = (1-p) + p*P(X2=0), and P(X=k) = p*P(X2=k) for k>0.
      Note that X1 and X2 are partially hidden. When X=0, we cannot know the values of X1 and X2, but as soon as X>0, both variables are known (X1=1 and X2=X).
    • Hurdle models model the observable “random variable […] using two parts, the first of which is the probability of attaining the value 0, and the second part models the probability of the non-zero values” [5]. Unlike zero-inflated models, the second component must follow a distribution in which the probability of zero is strictly zero. Using the same notation as before, X1 models whether the observation is zero or non-zero (typically via a Bernoulli distribution), and X2 follows a distribution that assigns no probability mass to zero. Consequently, the probability mass function is:
      P(X=0) = 1-p, and P(X=k) = p*P(X2=k) for k>0.
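
To make the difference between the two families concrete, here is a small sketch of both probability mass functions. It assumes a Poisson distribution for X2 in the zero-inflated case and a zero-truncated Poisson in the hurdle case; the function names and parameter values are illustrative only.

```python
import numpy as np
from scipy.stats import poisson


def zip_pmf(k, p, lam):
    """Zero-inflated Poisson: X = X1 * X2, X1 ~ Bernoulli(p), X2 ~ Poisson(lam)."""
    k = np.asarray(k)
    base = p * poisson.pmf(k, lam)
    # Zeros come from X1 = 0 or from X2 = 0
    return np.where(k == 0, (1 - p) + base, base)


def hurdle_pmf(k, p, lam):
    """Hurdle: P(X = 0) = 1 - p, positive values follow a zero-truncated Poisson(lam)."""
    k = np.asarray(k)
    truncated = poisson.pmf(k, lam) / (1 - np.exp(-lam))
    # Zeros come only from the Bernoulli part
    return np.where(k == 0, 1 - p, p * truncated)


support = np.arange(0, 200)
zip_total = zip_pmf(support, 0.3, 4.0).sum()
hurdle_total = hurdle_pmf(support, 0.3, 4.0).sum()
print(zip_total, hurdle_total)  # both should be numerically close to 1
```

In the zero-inflated case the probability of zero can only go up relative to the base Poisson distribution, whereas in the hurdle case it is set freely by p.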

    Zero-Inflated Poisson model

    Let us take a look at the Zero-Inflated Poisson model [4]. The ZIP probability mass function is:

    ZIP probability mass function: P(X=0) = (1-p) + p*exp(-λ), and P(X=k) = p*λ^k*exp(-λ)/k! for k>0.

    It is now possible to extend the previous histograms and Poisson-fitted probability mass functions by adding the ZIP-fitted probability mass functions. To do this, estimators of the two parameters, p and λ, are required. I used the method of moments to derive these estimators: the first two moments provide a system of two equations with two unknowns, which can then be solved.

    Method of moments to get the ZIP parameter estimators

    So the parameter estimators are:

    ZIP parameter estimators
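
As a sketch of that derivation: with X=X1*X2, X1 Bernoulli(p) and X2 Poisson(λ) independent, the first two moments are E[X] = pλ and E[X²] = p(λ + λ²). Dividing the second by the first gives λ = E[X²]/E[X] - 1, and then p = E[X]/λ. The code below plugs sample moments into these expressions; the function name and the simulated data are assumptions for illustration, not the author's code.

```python
import numpy as np


def zip_method_of_moments(y):
    """Method-of-moments estimates (p_hat, lam_hat) for a ZIP sample.

    Uses E[X] = p * lam and E[X^2] = p * (lam + lam^2).
    """
    y = np.asarray(y, dtype=float)
    m1 = y.mean()          # first sample moment
    m2 = np.mean(y ** 2)   # second sample moment
    lam_hat = m2 / m1 - 1.0
    p_hat = m1 / lam_hat
    return p_hat, lam_hat


# Simulated ZIP sample with known parameters p = 0.3 and lam = 4.0
rng = np.random.default_rng(0)
n = 200_000
sample = rng.binomial(1, 0.3, size=n) * rng.poisson(4.0, size=n)
p_hat, lam_hat = zip_method_of_moments(sample)
print(p_hat, lam_hat)
```

These estimators are undefined when the sample mean is zero or when the two sample moments coincide, and on small or under-dispersed groups p_hat can fall outside [0, 1], so a real implementation would need to guard against those cases.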

    Finally, I plotted the same two figures again, adding the fitted ZIP probability mass functions as well as the cross-entropy measures.

    Both visual inspection and cross-entropy values show that the ZIP model fits the observed data better than the Poisson model. This provides an objective and quantifiable reason to prefer ZIP regression over Poisson regression.

    Model comparison

    Let us now compare several models. I split the data into training and test sets, but it was not immediately clear which evaluation metrics would be most appropriate. For instance, should I rely on Poisson deviance, even though the data is zero-inflated? Or mean squared error, which heavily penalizes outliers? In the end, I chose to use several metrics to better capture model performance: mean absolute error, Poisson deviance, and correlation. The models I evaluated are:

    • A naïve model predicting the mean value of the training set,
    • Linear regression (lr),
    • Poisson regression (pr),
    • Zero-inflated Poisson regression (zip),
    • A chained Logistic–Poisson regression (hurdle model, lr_pr),
    • A chained Logistic–Zero-Truncated Poisson regression (hurdle model, lr_tpr).
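
To illustrate one of the chained models together with the three metrics, here is a rough sketch. It assumes that lr_pr means a logistic regression for P(y > 0) followed by a plain Poisson regression fitted on the positive observations only, and that "correlation" means Pearson correlation; the class name, the `evaluate` helper, and the simulated data are all hypothetical and not the author's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, PoissonRegressor
from sklearn.metrics import mean_absolute_error, mean_poisson_deviance


class ChainedLogisticPoisson:
    """Hypothetical lr_pr sketch: P(y > 0) from a logistic model,
    E[y | y > 0] from a Poisson model fitted on positive observations."""

    def fit(self, X, y):
        y = np.asarray(y)
        positive = y > 0
        self.clf_ = LogisticRegression(max_iter=1000).fit(X, positive)
        self.reg_ = PoissonRegressor(alpha=1e-3, max_iter=1000).fit(X[positive], y[positive])
        return self

    def predict(self, X):
        # E[y] = P(y > 0) * E[y | y > 0]
        return self.clf_.predict_proba(X)[:, 1] * self.reg_.predict(X)


def evaluate(y_true, y_pred, eps=1e-6):
    """Mean absolute error, mean Poisson deviance, and Pearson correlation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        # Poisson deviance needs strictly positive predictions, hence the clipping
        "poisson_deviance": mean_poisson_deviance(y_true, np.clip(y_pred, eps, None)),
        "correlation": np.corrcoef(y_true, y_pred)[0, 1],
    }


# Simulated zero-inflated counts (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
p = 1 / (1 + np.exp(-(X[:, 0] - 0.5)))
lam = np.exp(0.5 * X[:, 1] + 1.0)
y = rng.binomial(1, p) * rng.poisson(lam)
X_train, X_test, y_train, y_test = X[:1500], X[1500:], y[:1500], y[1500:]

model = ChainedLogisticPoisson().fit(X_train, y_train)
scores = evaluate(y_test, model.predict(X_test))
print(scores)
```

Note that the correlation is undefined for a constant prediction such as the naïve mean model, so that metric only makes sense for the other models.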

    ZIP model

    Let us look at the ZIP regression implementation. First, the negative log likelihood of the observed data, denoted y, is:

    Negative log likelihood
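
Spelled out with the ZIP probability mass function, and writing p_i and λ_i for the parameters of observation i (a notation assumed here), the negative log likelihood can be written as follows. The log(y_i!) term does not depend on β, which is presumably why the loss code further down leaves it out.

```latex
-\log L(\beta)
  = -\sum_{i:\, y_i = 0} \log\!\big( (1 - p_i) + p_i e^{-\lambda_i} \big)
    \;-\; \sum_{i:\, y_i > 0} \big( \log p_i + y_i \log \lambda_i - \lambda_i - \log y_i! \big)
```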

    The marginal likelihood of the observed data, P(Y=y), can be expressed analytically, without having to integrate over the hidden variable in the joint distribution P(Y=y, X1=x1). It can therefore be optimized directly, without needing the expectation-maximization algorithm [6]. The two distribution parameters p and λ are functions of the features X and of the model parameters β to be learnt. I chose to define p as the sigmoid of the dot product between X and β, and λ as the exponential of the dot product between X and β. To make the model more flexible, I use separate sets of parameters β: one for p and another for λ.

    ZIP parameter expressions
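
The loss and Jacobian code in this post calls `sigmoid.val`, `sigmoid.jac`, `exp.val` and `exp.jac`, which are not shown. Below is a minimal sketch of what such helpers could look like, assuming p = sigmoid(X·β_p), λ = exp(X·β_λ), and Jacobians of shape (n_samples, n_features). The class layout is hypothetical and may differ from the author's repo.

```python
import numpy as np
from scipy.special import expit


class sigmoid:
    """Hypothetical helper: p = sigmoid(X @ beta)."""

    @staticmethod
    def val(beta, X):
        return expit(X @ beta)

    @staticmethod
    def jac(beta, X):
        # d sigmoid(z) / d beta_j = sigmoid(z) * (1 - sigmoid(z)) * x_j
        s = expit(X @ beta)
        return (s * (1 - s))[:, None] * X


class exp:
    """Hypothetical helper: lam = exp(X @ beta)."""

    @staticmethod
    def val(beta, X):
        return np.exp(X @ beta)

    @staticmethod
    def jac(beta, X):
        # d exp(z) / d beta_j = exp(z) * x_j
        return np.exp(X @ beta)[:, None] * X


# Small shape check on random inputs
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))
beta_demo = rng.normal(size=3)
print(sigmoid.val(beta_demo, X_demo).shape, exp.jac(beta_demo, X_demo).shape)
```

With this shape convention, multiplying a per-observation derivative of shape (n_samples, 1) by the helper's Jacobian and summing over rows yields the gradient with respect to β.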

    Moreover, I added a prior on the parameters β to regularize the model, which is especially useful for the Poisson part, for which there are few observations because of the zero inflation. I assumed a Normal prior, hence the L2 regularization terms added to the loss function. I assumed two different priors, one on the β of the Bernoulli model and one on the β of the Poisson model, hence the two α hyperparameters, exposed as the alpha_b and alpha_p attributes of the model. I tuned these values through hyperparameter optimization.

    I created a class that inherits from scikit-learn's BaseEstimator. The Python implementation of the loss function is shown below (implemented inside the class, hence the self argument):

    import numpy as np
    from scipy.special import xlogy

    # `sigmoid` and `exp` are helper objects defined elsewhere in the author's code (not shown here)
    def _loss(self, beta: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
        n_feat = X.shape[1]
    
        # split beta into two parts: one for bernoulli p and one for poisson lambda
        beta_p = beta[:n_feat]
        beta_lam = beta[n_feat:]
        
        # get bernoulli p and poisson lambda
        p = sigmoid.val(beta_p, X)
        lam = exp.val(beta_lam, X)
        
        # initialize negative log likelihood
        out = 0
        
        # y == 0: -log((1 - p) + p * exp(-lam))
        y_e0_mask = np.where(y == 0)[0]
        out += np.sum(-np.log((1 - p) + p * np.exp(-lam))[y_e0_mask])
        
        # y > 0: -log(p) - y * log(lam) + lam (the constant log(y!) is dropped)
        y_gt0_mask = np.where(y > 0)[0]
        out += np.sum(-np.log(p)[y_gt0_mask])
        out += np.sum(-xlogy(y, lam)[y_gt0_mask])
        out += np.sum(lam[y_gt0_mask])
        
        # prior: L2 penalties on each half of beta, intercepts (last column) excluded
        mask_b = np.ones_like(beta)
        mask_b[n_feat:] = 0
        mask_p = np.ones_like(beta)
        mask_p[:n_feat] = 0
        if self.fit_intercept:
            mask_b[n_feat - 1] = 0
            mask_p[2 * n_feat - 1] = 0
        out += 0.5 * self.alpha_b * np.sum((beta * mask_b) ** 2)
        out += 0.5 * self.alpha_p * np.sum((beta * mask_p) ** 2)
        
        return out

    In order to optimize the loss function, I also computed the Jacobian of the loss.

    Jacobian of the negative log likelihood
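
For reference, the per-observation derivatives that the implementation relies on can be written as follows, with ℓ_i denoting the negative log likelihood of observation i (a notation assumed here). The gradient with respect to β then follows from the chain rule, multiplying by the derivatives of p_i and λ_i with respect to their own β.

```latex
\frac{\partial \ell_i}{\partial p_i} =
\begin{cases}
  \dfrac{1 - e^{-\lambda_i}}{(1 - p_i) + p_i e^{-\lambda_i}} & \text{if } y_i = 0 \\[2ex]
  -\dfrac{1}{p_i} & \text{if } y_i > 0
\end{cases}
\qquad
\frac{\partial \ell_i}{\partial \lambda_i} =
\begin{cases}
  \dfrac{p_i e^{-\lambda_i}}{(1 - p_i) + p_i e^{-\lambda_i}} & \text{if } y_i = 0 \\[2ex]
  1 - \dfrac{y_i}{\lambda_i} & \text{if } y_i > 0
\end{cases}
```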

    The Python implementation is:

    def _jac(self, beta: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
        n_feat = X.shape[1]
    
        # split beta into two parts: one for bernoulli p and one for poisson lambda
        beta_p = beta[:n_feat]
        beta_lam = beta[n_feat:]
    
        # get bernoulli p and poisson lambda
        p = sigmoid.val(beta_p, X)
        lam = exp.val(beta_lam, X)
    
        # y == 0 & beta_p
        jac_e0_p = np.expand_dims(
            np.where(
                y == 0,
                (1 - np.exp(-lam)) / ((1 - p) + p * np.exp(-lam)),
                np.zeros_like(y),
            ),
            axis=1,
        ) * sigmoid.jac(beta_p, X)
        # y == 0 & beta_lam
        jac_e0_lam = np.expand_dims(
            np.where(
                y == 0,
                p * np.exp(-lam) / ((1 - p) + p * np.exp(-lam)),
                np.zeros_like(y),
            ),
            axis=1,
        ) * exp.jac(beta_lam, X)
    
        # y > 0 & beta_p
        jac_gt0_p = np.expand_dims(
            np.where(y > 0, -1 / p, np.zeros_like(y)), axis=1
        ) * sigmoid.jac(beta_p, X)
        # y > 0 & beta_lam
        jac_gt0_lam = np.expand_dims(
            np.where(y > 0, 1 - y / lam, np.zeros_like(y)), axis=1
        ) * exp.jac(beta_lam, X)
    
        # initialize jac
        out = np.concatenate((jac_e0_p + jac_gt0_p, jac_e0_lam + jac_gt0_lam), axis=1)
    
        # jac for prior
        mask_b = np.ones_like(beta)
        mask_b[n_feat:] = 0
        mask_p = np.ones_like(beta)
        mask_p[:n_feat] = 0
        if self.fit_intercept:
            mask_b[n_feat - 1] = 0
            mask_p[2 * n_feat - 1] = 0
    
        return (
            np.sum(out, axis=0)
            + self.alpha_b * beta * mask_b
            + self.alpha_p * beta * mask_p
        )

    Unfortunately, the loss function is not convex, so a local minimum is not guaranteed to be a global minimum. I chose the limited-memory implementation of Broyden–Fletcher–Goldfarb–Shanno (L-BFGS-B) from SciPy because it is faster than the gradient descent methods that I tested.

    from scipy.optimize import minimize

    res = minimize(
        self._loss,
        np.zeros(2 * n_feat),
        args=(X, y),
        jac=self._jac,
        method="L-BFGS-B",
    )
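
As a self-contained illustration of the same optimization, here is a simplified sketch on simulated data. It is not the author's class: it uses the mean (unregularized) negative log likelihood, lets SciPy approximate the gradient by finite differences instead of passing an analytic Jacobian, and all names and parameter values are assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, xlogy


def zip_mean_nll(beta, X, y):
    """Mean ZIP negative log likelihood (constant log(y!) dropped, no prior)."""
    n_feat = X.shape[1]
    p = expit(X @ beta[:n_feat])      # Bernoulli part
    lam = np.exp(X @ beta[n_feat:])   # Poisson part
    nll_zero = -np.log((1 - p) + p * np.exp(-lam))
    nll_pos = -np.log(p) - xlogy(y, lam) + lam
    return np.mean(np.where(y == 0, nll_zero, nll_pos))


# Simulated data: one feature plus an intercept column placed last
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([rng.normal(size=n), np.ones(n)])
true_beta = np.array([0.5, -0.5, 0.3, 1.0])  # [beta for p, beta for lambda]
p_true = expit(X @ true_beta[:2])
lam_true = np.exp(X @ true_beta[2:])
y = rng.binomial(1, p_true) * rng.poisson(lam_true)

res = minimize(zip_mean_nll, np.zeros(4), args=(X, y), method="L-BFGS-B")
print(res.x)  # expected to land near true_beta on this simulated sample
```

Supplying the analytic Jacobian, as done in the post, avoids the extra function evaluations that finite differences require, which matters more as the number of features grows.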

    The whole class is implemented in this file from the shared repo.
    After a hyperparameter tuning phase to find the best regularization hyperparameters, I finally computed the chosen metrics on the test set. The fitting time is displayed alongside the metrics.

    Benchmark results

    Zero-inflated models, both ZIP and hurdle, achieve better metrics than the naïve model, linear regression, and standard Poisson regression. I initially expected a larger performance gap, given that the empirical histogram of the observed Y more closely resembles a ZIP distribution than a Poisson distribution. The improvement, however, comes at the cost of longer fitting times, particularly for the ZIP model. For this use case, hurdle models appear to offer the best compromise, delivering strong performance while keeping training time relatively low.

    One possible reason for the relatively modest improvement may be that the data does not strictly follow a ZIP distribution. To investigate this, I ran another benchmark using the same models on a synthetic dataset specifically generated to follow a ZIP distribution. This dataset was designed to have roughly the same number of observations and features as the original one, but with a target variable that follows a ZIP distribution by design.

    Benchmark results for a fake ZIP-distributed dataset

    When the target truly follows a ZIP distribution, the ZIP model outperforms all the other models considered. It is also worth noting that, in this synthetic setup, the features are no longer sparse (by design), which may help explain the reduction in fitting time.

    Conclusions

    Before choosing a statistical model, it is essential to carefully analyze the dataset rather than relying solely on prior assumptions about its characteristics. Examining the empirical distribution, for example through histograms, often reveals insights that guide the choice of an appropriate probability model.

    This is particularly important for zero-inflated data, where standard models may struggle. A synthetic example with a zero-inflated Poisson (ZIP) distribution shows how the right model can provide a much better fit than the alternatives, even when those alternatives are not entirely misguided.

    For zero-inflated datasets, models such as the zero-inflated Poisson or hurdle models are especially useful. While both can capture excess zeros effectively, hurdle models often offer comparable performance with faster training.

    Further reading

    While working on this topic and writing the post, I found this Medium post [7], which I highly recommend.

    References


