on a regression problem. I knew that the target I wanted to design a predictive model for was countable (i.e. 0, 1, 2, …). Therefore, I immediately thought of choosing a Generalized Linear Model (GLM) with a relevant discrete distribution, like the Poisson distribution or the Negative Binomial distribution. But everything did not go as well as expected. I mistook haste for speed.
Zero-inflated data
To begin with, let us look at a dataset suitable for the post. I have chosen the results of the NextGen National Household Travel Survey [1]. The variable of interest, named “BIKETRANSIT”, is the number of “days in last 30 days biking used”, so it is an integer between 0 and 30, with 30 corresponding to daily users. Here is a histogram of the variable in question.
We can clearly see that the count data is zero-inflated. Most of the respondents have not used a bike a single day over the last 30 days. I also noticed an interesting pattern: more people tend to report bike use on exactly 5, 10, 15, 20, 25, or 30 days than on the adjacent numbers. This is probably because respondents prefer round numbers when they are unsure of the exact count. Whatever the reason, in this post we will focus mainly on the issue of zero inflation by comparing models designed for zero-inflated count data.
Several survey fields were selected as independent variables to explain the number of bike days (e.g., age, gender, worker category, education level, household size, and district characteristics). I intentionally excluded features that count the number of days spent on other activities (such as using taxis or shared bikes), since some of them are highly correlated with the outcome of interest. I want the model to remain realistic: predicting bike usage over 30 days based on taxi, car, or public transport usage over the same period would not provide meaningful insights.
Poisson regression limits
Before introducing the zero-inflated model, I would like to illustrate the limits of Poisson regression, which I first considered for this dataset. I have not looked at the Negative Binomial distribution in this section. Poisson regression assumes that the dependent random variable Y follows a Poisson distribution, conditional on the independent variables X and the parameters β.
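For reference, this assumption can be written as follows. The log link between λ and the features is an assumption on my side; it is the usual choice for Poisson regression and it matches the exponential used for λ later in the post.

```latex
P(Y = y \mid X, \beta) = \frac{\lambda^{y} e^{-\lambda}}{y!},
\qquad \lambda = \exp(X \beta),
\qquad y \in \{0, 1, 2, \dots\}
```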

So, let’s take a look at some empirical distributions of Y∣X,β. Since I included many features, it is difficult to find a large number of observations with exactly the same values of X. To address this, I used a clustering algorithm, AgglomerativeClustering from scikit-learn [2], to group observations with similar feature profiles.
First, I preprocessed the data so that it can feed both the regression models and the clustering algorithm. I do not want to spend too much time explaining all the preprocessing steps, as this post does not focus on them. The full preprocessing code is available in a repo [8]. In short, I encoded the categorical features using one-hot encoding. I also applied several preprocessing steps to the other features: imputing missing values, clipping outliers, and applying transformation functions where appropriate. Finally, I performed clustering on the transformed dataset.
from sklearn.cluster import AgglomerativeClustering
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline(
    [
        # Standardize first: numerical features such as age span a wider range than
        # the one-hot encoded features, and clustering relies on distances.
        ("scaler", StandardScaler()),
        # 100 clusters, so that the biggest groups still contain many observations.
        ("cluster", AgglomerativeClustering(n_clusters=100)),
    ]
)
# X_train_preprocessed is a numerical DataFrame, after encoding the categorical features
cluster_id = pipe.fit_predict(X_train_preprocessed)
Then I estimated the parameter λ of the Poisson distribution for each cluster, using the unbiased estimator given by the mean of the observed values in that cluster.
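As a minimal sketch of this step, assuming the target and the cluster labels are available as two arrays of equal length, the per-cluster estimate is just a grouped mean. The function name `poisson_rate_per_cluster` and the toy data are hypothetical and only serve the illustration.

```python
import numpy as np


def poisson_rate_per_cluster(y, cluster_id):
    """Return {cluster label: (n_obs, mean of y)} for every cluster."""
    y = np.asarray(y, dtype=float)
    cluster_id = np.asarray(cluster_id)
    labels, inverse, counts = np.unique(
        cluster_id, return_inverse=True, return_counts=True
    )
    # Sum of y inside each cluster, then divide by the cluster size.
    sums = np.bincount(inverse, weights=y, minlength=labels.size)
    return {
        int(label): (int(n), float(total / n))
        for label, n, total in zip(labels, counts, sums)
    }


# Toy data (made up): three clusters of different sizes.
rates = poisson_rate_per_cluster(
    y=[0, 0, 3, 1, 2, 0], cluster_id=[0, 0, 0, 1, 1, 2]
)
```

Keeping the cluster size next to the estimate makes it easy to restrict the analysis to the largest groups. A cluster whose mean is zero, like the last one in the toy data, cannot be fitted with a Poisson distribution.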

I then plotted the empirical histograms along with the probability mass functions of the fitted Poisson distributions for several groups of observations. To assess the quality of the fit, I computed both the cross-entropy and the entropy, noting that entropy serves as a lower bound for cross-entropy according to Gibbs’ inequality [3]. A good model should produce a cross-entropy value close to the entropy (though slightly larger).
For this analysis, I focused on three of the largest groups, since parameter estimation is more reliable with larger sample sizes. This is particularly important here because the data is skewed due to zero inflation, making it necessary to collect many observations. Among the groups, two contain bike users, while one group (228 respondents) reported no bike usage at all. For this last group, no Poisson distribution was fitted, since the Poisson parameter must be strictly greater than zero. Finally, I used a vertical log scale in the plots to account for the zero inflation.
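The comparison between entropy and cross-entropy can be sketched as follows for one group of observations. This is an illustration under my own assumptions (natural logarithm, empirical frequencies as the reference distribution, hypothetical function names), not necessarily the exact computation behind the figures.

```python
import math

import numpy as np


def poisson_log_pmf(k, lam):
    """Log of the Poisson probability mass function at integer k."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def entropy_and_cross_entropy(y, lam):
    """Entropy of the empirical distribution of y and its cross-entropy
    against a Poisson(lam) distribution, both in nats."""
    values, counts = np.unique(np.asarray(y, dtype=int), return_counts=True)
    q = counts / counts.sum()  # empirical frequencies
    entropy = -float(np.sum(q * np.log(q)))
    log_p = np.array([poisson_log_pmf(int(k), lam) for k in values])
    cross_entropy = -float(np.sum(q * log_p))
    return entropy, cross_entropy


# Toy zero-heavy sample (made up); lambda is estimated by the sample mean.
y_toy = [0] * 8 + [1, 5]
h, ce = entropy_and_cross_entropy(y_toy, lam=np.mean(y_toy))
```

On this toy sample the cross-entropy is roughly twice the entropy, which is the kind of gap that signals a poor fit.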

I find it difficult to evaluate the quality of the fitted distribution by looking only at the entropy and the cross-entropy. However, I can see that the histogram and the probability mass function differ a lot. This is why I then considered the Zero-Inflated Poisson (ZIP) distribution.
Models adapted to zero-inflated data
Models designed for zero-inflated data aim to capture both the high probability of zeros and the comparatively low probabilities of other events. I explored two main families of such models:
- “Zero-inflated models […] model the zeros using a two-component mixture model. […] The probability of the variable being zero is determined by both the main distribution and the mixture weight”. “A zero-inflated model can only increase the probability of P(x = 0)” [5]. For notation, I use the following setup (slightly different from Wikipedia and other sources). Let X1 be a hidden variable following a Bernoulli distribution. In my notation, the probability of success is p (whereas Wikipedia uses 1-π). Let X2 be another hidden variable, independent of X1, following a distribution that allows zeros with nonzero probability. For my use case, I assume X2 is discrete. The observed variable is then defined as X=X1*X2, which leads to the following probability mass function: P(X=0) = (1−p) + p·P(X2=0), and P(X=x) = p·P(X2=x) for x > 0.
Note that X1 and X2 are only partially hidden. When X=0, we cannot know the values of X1 and X2, but as soon as X>0, both variables are known (X1=1 and X2=X).
- Hurdle models model the observable “random variable […] using two parts, the first of which is the probability of attaining the value 0, and the second part models the probability of the non-zero values” [5]. Unlike zero-inflated models, the second component must follow a distribution in which the probability of zero is strictly zero. Using the same notation as before, X1 models whether the observation is zero or non-zero (typically via a Bernoulli distribution). X2 follows a distribution that assigns no probability mass to zero. As a result, the probability mass function is: P(X=0) = 1−p, and P(X=x) = p·P(X2=x) for x > 0.
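To make the difference between the two families concrete, here is a small sketch that uses a Poisson distribution for X2 in the zero-inflated case and a zero-truncated Poisson in the hurdle case. The function names and the parameter values are mine and purely illustrative.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability mass function, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def zero_inflated_poisson_pmf(k, p, lam):
    """X = X1 * X2 with X1 ~ Bernoulli(p) and X2 ~ Poisson(lam)."""
    base = p * poisson_pmf(k, lam)
    return (1.0 - p) + base if k == 0 else base


def hurdle_poisson_pmf(k, p, lam):
    """X1 ~ Bernoulli(p) decides zero vs non-zero; X2 is a zero-truncated Poisson."""
    if k == 0:
        return 1.0 - p
    return p * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))


p, lam = 0.3, 4.0
zip_total = sum(zero_inflated_poisson_pmf(k, p, lam) for k in range(100))
hurdle_total = sum(hurdle_poisson_pmf(k, p, lam) for k in range(100))
```

With the same p, the zero-inflated model puts more mass on zero than the hurdle model, because the Poisson component can also produce zeros. In the hurdle model, the probability of zero is controlled by p alone.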
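Zero-Inflated Poisson model
Let us take a look at the Zero-Inflated Poisson model [4]. With X2 following a Poisson distribution of parameter λ, the ZIP probability mass function is: P(X=0) = (1−p) + p·e^(−λ), and P(X=x) = p·λ^x·e^(−λ)/x! for x > 0.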

It is now possible to extend the previous histograms and Poisson-fitted probability mass functions by adding the ZIP-fitted probability mass functions. To do this, estimators of the two parameters, p and λ, are required. I used the method of moments to derive these estimators: the first two moments provide a system of two equations with two unknowns, which can then be solved: E[X] = pλ and E[X²] = pλ(1+λ).

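So the parameter estimators are: λ̂ = m2/m1 − 1 and p̂ = m1/λ̂, where m1 and m2 denote the empirical first and second moments of the observations.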
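A minimal sketch of these method-of-moments estimators is given below. It assumes the ZIP parametrization used in this post (X = X1*X2, with X1 following a Bernoulli distribution of parameter p and X2 a Poisson distribution of parameter λ), which gives E[X] = pλ and E[X²] = pλ(1+λ). The function name is hypothetical.

```python
import numpy as np


def zip_method_of_moments(y):
    """Estimate (p, lam) of a ZIP distribution from the first two empirical moments."""
    y = np.asarray(y, dtype=float)
    m1 = y.mean()
    m2 = np.mean(y ** 2)
    if m1 <= 0 or m2 <= m1:
        # lam = m2 / m1 - 1 would be undefined or non-positive.
        raise ValueError("the sample does not allow a positive lambda estimate")
    lam = m2 / m1 - 1.0
    p = m1 / lam
    return float(p), float(lam)


# Toy sample (made up): m1 = 1.2 and m2 = 4.0.
p_hat, lam_hat = zip_method_of_moments([0, 0, 0, 2, 4])
```

Nothing in these formulas forces the estimate of p to stay below 1. When the sample variance is smaller than the sample mean, the estimate exceeds 1, which suggests that a ZIP distribution is not a sensible choice for that group.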

Finally, I plotted the same two figures again, adding the probability mass functions of the fitted ZIP distributions as well as the cross-entropy measures.

Both visual inspection and cross-entropy values show that the ZIP model fits the observed data better than the Poisson model. This provides an objective and quantifiable reason to prefer ZIP regression over Poisson regression.
Model comparison
Let us now compare several models. I split the data into training and test sets, but it was not immediately clear which evaluation metrics would be most appropriate. For instance, should I rely on Poisson deviance, even though the data is zero-inflated? Or mean squared error, which heavily penalizes outliers? In the end, I chose to use several metrics to better capture model performance: mean absolute error, Poisson deviance, and correlation. The models I evaluated are:
- A naïve model predicting the mean value of the training set,
- Linear regression (lr),
- Poisson regression (pr),
- Zero-inflated Poisson regression (zip),
- A chained Logistic–Poisson regression (hurdle model, lr_pr),
- A chained Logistic–Zero-Truncated Poisson regression (hurdle model, lr_tpr).
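The three metrics mentioned above can be sketched with NumPy only, as shown below. I assume here the mean Poisson deviance (with the convention 0·log 0 = 0) and the Pearson correlation; the post does not say which correlation coefficient was used, and the function name is hypothetical.

```python
import numpy as np


def evaluate_count_predictions(y_true, y_pred):
    """Mean absolute error, mean Poisson deviance and Pearson correlation.

    y_pred must be strictly positive for the deviance to be defined.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    # y * log(y / mu), using a ratio of 1 (log = 0) wherever y == 0.
    ratio = np.where(y_true > 0, y_true / y_pred, 1.0)
    deviance = 2.0 * np.mean(y_true * np.log(ratio) - y_true + y_pred)
    corr = np.corrcoef(y_true, y_pred)[0, 1]
    return {
        "mae": float(mae),
        "poisson_deviance": float(deviance),
        "correlation": float(corr),
    }


# Toy example (made-up numbers).
scores = evaluate_count_predictions([0, 1, 2, 5], [0.5, 1.0, 2.0, 4.0])
```

The correlation is not defined for a constant prediction, so it cannot be reported for the naïve model.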
ZIP model
Let us look at the ZIP regression implementation. First, the negative log-likelihood of the observed data, noted y, is:
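Written out from the ZIP probability mass function, with p_i and λ_i the parameters attached to observation i, the negative log-likelihood takes the form below. This is my own reconstruction from the probability mass function and from the loss code shown further down, which drops the constant term log(y_i!).

```latex
-\log L(y) =
  -\sum_{i:\, y_i = 0} \log\!\bigl( (1 - p_i) + p_i e^{-\lambda_i} \bigr)
  -\sum_{i:\, y_i > 0} \bigl( \log p_i + y_i \log \lambda_i - \lambda_i - \log y_i! \bigr)
```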

The marginal likelihood of the observed data, P(Y=y), can be expressed analytically without the integral formulation of the joint distribution, P(Y=y, X1=x1). So it can be optimized directly, without needing the expectation-maximization (EM) algorithm [6]. The two distribution parameters p and λ are functions of the features X and the model parameters β to be learnt. I chose to define p as the sigmoid of the dot product between X and β, and λ as the exponential of the dot product between X and β. To make the model more flexible, I use separate sets of parameters β: one for p and another for λ.

Moreover, I added a prior on the parameters β to regularize the model, which is especially useful for the Poisson part, for which there are few observations because of the zero inflation. I assumed a Normal prior, hence the L2 regularization terms added to the loss function. I assumed two different priors, one on the β of the Bernoulli model and one on the β of the Poisson model, hence the two α hyperparameters, stored as the alpha_b and alpha_p attributes of the model. I optimized these values through hyperparameter optimization.
I created a class that inherits from scikit-learn’s BaseEstimator. The Python implementation of the loss function is shown below (implemented within the class, hence the self argument):
# Requires: import numpy as np; from scipy.special import xlogy.
# sigmoid and exp are helper objects from the repo exposing .val and .jac.
def _loss(self, beta: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    n_feat = X.shape[1]
    # split beta into two parts: one for bernoulli p and one for poisson lambda
    beta_p = beta[:n_feat]
    beta_lam = beta[n_feat:]
    # get bernoulli p and poisson lambda
    p = sigmoid.val(beta_p, X)
    lam = exp.val(beta_lam, X)
    # initialize negative log-likelihood
    out = 0
    # y == 0
    y_e0_mask = np.where(y == 0)[0]
    out += np.sum(-np.log((1 - p) + p * np.exp(-lam))[y_e0_mask])
    # y > 0 (the constant term log(y!) is dropped)
    y_gt0_mask = np.where(y > 0)[0]
    out += np.sum(-np.log(p)[y_gt0_mask])
    out += np.sum(-xlogy(y, lam)[y_gt0_mask])
    out += np.sum(lam[y_gt0_mask])
    # prior: L2 penalty on each half of beta, intercepts (last column) excluded
    mask_b = np.ones_like(beta)
    mask_b[n_feat:] = 0
    mask_p = np.ones_like(beta)
    mask_p[:n_feat] = 0
    if self.fit_intercept:
        mask_b[n_feat - 1] = 0
        mask_p[2 * n_feat - 1] = 0
    out += 0.5 * self.alpha_b * np.sum((beta * mask_b) ** 2)
    out += 0.5 * self.alpha_p * np.sum((beta * mask_p) ** 2)
    return out
In order to optimize the loss, I also computed the Jacobian of the loss function.
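The gradient can be written as follows, with ℓ_i the negative log-likelihood of observation i, x_i its feature vector, β^(p) and β^(λ) the two halves of β, and σ the sigmoid. This is a reconstruction consistent with the loss above. In the code, the intercept entries are excluded from the two penalty terms when fit_intercept is set.

```latex
p_i = \sigma\bigl(x_i^\top \beta^{(p)}\bigr), \qquad
\lambda_i = \exp\bigl(x_i^\top \beta^{(\lambda)}\bigr)

\frac{\partial \ell_i}{\partial p_i} =
\begin{cases}
\dfrac{1 - e^{-\lambda_i}}{(1 - p_i) + p_i e^{-\lambda_i}} & \text{if } y_i = 0 \\[2ex]
-\dfrac{1}{p_i} & \text{if } y_i > 0
\end{cases}
\qquad
\frac{\partial \ell_i}{\partial \lambda_i} =
\begin{cases}
\dfrac{p_i e^{-\lambda_i}}{(1 - p_i) + p_i e^{-\lambda_i}} & \text{if } y_i = 0 \\[2ex]
1 - \dfrac{y_i}{\lambda_i} & \text{if } y_i > 0
\end{cases}

\nabla_{\beta^{(p)}} \mathcal{L} =
  \sum_i \frac{\partial \ell_i}{\partial p_i}\, p_i (1 - p_i)\, x_i + \alpha_b\, \beta^{(p)},
\qquad
\nabla_{\beta^{(\lambda)}} \mathcal{L} =
  \sum_i \frac{\partial \ell_i}{\partial \lambda_i}\, \lambda_i\, x_i + \alpha_p\, \beta^{(\lambda)}
```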

The Python implementation is:
def _jac(self, beta: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    n_feat = X.shape[1]
    # split beta into two parts: one for bernoulli p and one for poisson lambda
    beta_p = beta[:n_feat]
    beta_lam = beta[n_feat:]
    # get bernoulli p and poisson lambda
    p = sigmoid.val(beta_p, X)
    lam = exp.val(beta_lam, X)
    # y == 0 & beta_p
    jac_e0_p = np.expand_dims(
        np.where(
            y == 0,
            (1 - np.exp(-lam)) / ((1 - p) + p * np.exp(-lam)),
            np.zeros_like(y),
        ),
        axis=1,
    ) * sigmoid.jac(beta_p, X)
    # y == 0 & beta_lam
    jac_e0_lam = np.expand_dims(
        np.where(
            y == 0,
            p * np.exp(-lam) / ((1 - p) + p * np.exp(-lam)),
            np.zeros_like(y),
        ),
        axis=1,
    ) * exp.jac(beta_lam, X)
    # y > 0 & beta_p
    jac_gt0_p = np.expand_dims(
        np.where(y > 0, -1 / p, np.zeros_like(y)), axis=1
    ) * sigmoid.jac(beta_p, X)
    # y > 0 & beta_lam
    jac_gt0_lam = np.expand_dims(
        np.where(y > 0, 1 - y / lam, np.zeros_like(y)), axis=1
    ) * exp.jac(beta_lam, X)
    # per-observation jacobian, one row per observation
    out = np.concatenate((jac_e0_p + jac_gt0_p, jac_e0_lam + jac_gt0_lam), axis=1)
    # jac for prior (same masks as in the loss)
    mask_b = np.ones_like(beta)
    mask_b[n_feat:] = 0
    mask_p = np.ones_like(beta)
    mask_p[:n_feat] = 0
    if self.fit_intercept:
        mask_b[n_feat - 1] = 0
        mask_p[2 * n_feat - 1] = 0
    return (
        np.sum(out, axis=0)
        + self.alpha_b * beta * mask_b
        + self.alpha_p * beta * mask_p
    )
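A hand-written Jacobian is easy to get wrong, so a finite-difference check is a cheap safeguard. The sketch below is a standalone, unregularized re-implementation of the ZIP loss and gradient written for this check. It does not use the class or the sigmoid and exp helpers from the repo, and all names in it are hypothetical.

```python
import numpy as np


def zip_nll(beta, X, y):
    """Unregularized ZIP negative log-likelihood (constant log(y!) dropped)."""
    n_feat = X.shape[1]
    p = 1.0 / (1.0 + np.exp(-(X @ beta[:n_feat])))
    lam = np.exp(X @ beta[n_feat:])
    ll_zero = np.log((1.0 - p) + p * np.exp(-lam))
    ll_pos = np.log(p) + y * np.log(lam) - lam
    return -np.sum(np.where(y == 0, ll_zero, ll_pos))


def zip_nll_grad(beta, X, y):
    """Analytical gradient of zip_nll with respect to beta."""
    n_feat = X.shape[1]
    p = 1.0 / (1.0 + np.exp(-(X @ beta[:n_feat])))
    lam = np.exp(X @ beta[n_feat:])
    denom = (1.0 - p) + p * np.exp(-lam)
    d_p = np.where(y == 0, (1.0 - np.exp(-lam)) / denom, -1.0 / p)
    d_lam = np.where(y == 0, p * np.exp(-lam) / denom, 1.0 - y / lam)
    grad_p = X.T @ (d_p * p * (1.0 - p))  # chain rule through the sigmoid
    grad_lam = X.T @ (d_lam * lam)  # chain rule through the exponential
    return np.concatenate([grad_p, grad_lam])


def max_gradient_error(beta, X, y, eps=1e-6):
    """Largest absolute gap between central differences and the analytical gradient."""
    numeric = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps
        numeric[j] = (zip_nll(beta + step, X, y) - zip_nll(beta - step, X, y)) / (2 * eps)
    return float(np.max(np.abs(numeric - zip_nll_grad(beta, X, y))))


# Toy problem (made up): one feature plus an intercept column.
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.normal(size=50), np.ones(50)])
y_toy = rng.poisson(2.0, size=50) * rng.binomial(1, 0.4, size=50)
err = max_gradient_error(rng.normal(scale=0.1, size=4), X_toy, y_toy)
```

A gap far above the numerical noise of the finite differences would point to a sign or indexing mistake in one of the four partial derivatives.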
Unfortunately, the loss function is not convex, so a local minimum is not guaranteed to be a global minimum. I chose the limited-memory implementation of Broyden-Fletcher-Goldfarb-Shanno (L-BFGS-B) from scipy because it is faster than the gradient descent methods that I tested.
# Requires: from scipy.optimize import minimize
res = minimize(
    self._loss,
    np.zeros(2 * n_feat),  # start from beta = 0
    args=(X, y),
    jac=self._jac,
    method="L-BFGS-B",
)
The whole class is available in this file from the shared repo.
After performing a hyperparameter tuning phase to get the best regularization hyperparameters, I finally computed the selected metrics on the test set. The fitting time is displayed alongside the metrics.

Zero-inflated models, both ZIP and hurdle, achieve better metrics than the naïve model, linear regression, and standard Poisson regression. I initially expected a larger performance gap, given that the empirical histogram of the observed Y more closely resembles a ZIP distribution than a Poisson distribution. The improvement, however, comes at the cost of longer fitting times, particularly for the ZIP model. For this use case, hurdle models appear to offer the best compromise, delivering strong performance while keeping training time relatively low.
One possible reason for the relatively modest improvement may be that the data does not strictly follow a ZIP distribution. To investigate this, I ran another benchmark using the same models on a synthetic dataset specifically generated to follow a ZIP distribution. This dataset was designed to have roughly the same number of observations and features as the original one, but with a target variable that follows a ZIP distribution by design.
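One possible way to generate such a dataset is sketched below: dense Gaussian features, random coefficients, then a draw of X1 and X2 following the ZIP definition used in this post. The sizes, the scales of the coefficients, and the function name are assumptions made for illustration only; the generator actually used for the benchmark may differ.

```python
import numpy as np


def make_zip_dataset(n_samples, n_features, seed=0):
    """Draw features and a target that follows a ZIP regression model by design."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))  # dense features
    beta_p = rng.normal(scale=0.5, size=n_features)
    beta_lam = rng.normal(scale=0.3, size=n_features)
    p = 1.0 / (1.0 + np.exp(-(X @ beta_p)))  # Bernoulli parameter
    lam = np.exp(X @ beta_lam)  # Poisson parameter
    x1 = rng.binomial(1, p)
    x2 = rng.poisson(lam)
    return X, x1 * x2, beta_p, beta_lam


X_syn, y_syn, _, _ = make_zip_dataset(n_samples=5_000, n_features=5)
zero_share = float(np.mean(y_syn == 0))
```

Returning the true coefficients makes it possible to compare them with the fitted ones, which is an additional sanity check for the ZIP implementation.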

When the target truly follows a ZIP distribution, the ZIP model outperforms all the other models considered. It is also worth noting that, in this synthetic setup, the features are not sparse (by design), which may help explain the reduction in fitting time.
Conclusions
Before choosing a statistical model, it is crucial to carefully analyze the dataset rather than relying solely on prior assumptions about its characteristics. Examining the empirical distribution, for example through histograms, often reveals insights that guide the choice of an appropriate probability model.
This is particularly important for zero-inflated data, where standard models may struggle. A synthetic example with a zero-inflated Poisson (ZIP) distribution shows how the right model can provide a much better fit compared to alternatives, even when those alternatives are not entirely misguided.
For zero-inflated datasets, models such as the zero-inflated Poisson or hurdle models are especially useful. While both can capture excess zeros effectively, hurdle models often offer comparable performance with faster training.
Further reading
While working on this topic and writing the post, I found this Medium post [7] that I highly recommend.