Zero-inflated negative binomial regression model (ZINB)
Introduction
This post is primarily application-oriented, so I'll skip the theory behind the model. The zero-inflated negative binomial (ZINB) regression model is suited to count data that are both overdispersed (the variance is larger than the mean) and zero-inflated. Examples include the number of hospitalizations or the number of disease attacks per person: a large share of people may never have had the event at all, which is what zero inflation means. Problem C of the 2025 MCM competition asks us to predict which countries will win their first Olympic medal, and that scenario also calls for a zero-inflated model. Here I have simplified the model to a black box. Its input is a set of feature variables.
Once you know this, you need three more sets of parameters. Two of them come from the first part of the model, which is the negative binomial (NB) regression:
Regression coefficients
Dispersion parameter
The last is the set of logistic regression coefficients, which belongs to the second part of the model (the zero-inflation component).
Joint Modeling Formula:
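For reference, one common way to write the ZINB model is sketched below, using the NB2 parameterization (which, as far as I know, matches the default `p=2` of `ZeroInflatedNegativeBinomialP` in statsmodels). The symbols are my own notation and an assumption, not necessarily the ones this post originally used: $x_i$ is the feature vector, $\beta$ the NB regression coefficients, $\alpha$ the dispersion parameter, and $\gamma$ the logistic regression coefficients.

```latex
\[
\mu_i = \exp\!\left(x_i^{\top}\beta\right), \qquad
\pi_i = \frac{1}{1 + \exp\!\left(-x_i^{\top}\gamma\right)}
\]

\[
P(Y_i = y) =
\begin{cases}
\pi_i + (1 - \pi_i)\,(1 + \alpha\mu_i)^{-1/\alpha}, & y = 0, \\[6pt]
(1 - \pi_i)\,
\dfrac{\Gamma(y + 1/\alpha)}{\Gamma(1/\alpha)\, y!}
\left(\dfrac{1}{1 + \alpha\mu_i}\right)^{1/\alpha}
\left(\dfrac{\alpha\mu_i}{1 + \alpha\mu_i}\right)^{y}, & y = 1, 2, \dots
\end{cases}
\]
```

Here $\pi_i$ is the probability of a structural ("always zero") observation, and $\mu_i$ is the mean of the negative binomial part, whose variance is $\mu_i + \alpha\mu_i^2$.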
Code
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP
# This file contains all the independent and dependent fields
file_path = "final.csv"
data = pd.read_csv(file_path)
independent_vars = [
    'Participation_Count',
    'k_medal_ratio',
    'total',
    'gold',
    'discipline_Count',
    'Athlete_Count',
    'discipline_Event_ratio',
    'Avg_Participation',
    'Gender_Ratio',
    'host',
    'super',
    'Consecutive_Medals',
    'Disruption_Count',
    'young_ratio'
]
dependent_vars = ['Total']
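Before fitting, it may be worth confirming that the target really shows the two traits mentioned in the introduction: overdispersion and excess zeros. The snippet below is a minimal sketch of such a check using only the standard library. The helper name `zero_inflation_check` and the toy counts are made up for illustration and are not part of the original analysis.

```python
import math
from statistics import mean, pvariance


def zero_inflation_check(counts):
    """Summarize how far a list of counts departs from a plain Poisson model."""
    m = mean(counts)
    v = pvariance(counts)
    observed_zero = sum(1 for c in counts if c == 0) / len(counts)
    # A Poisson variable with the same mean would be zero with probability exp(-mean)
    poisson_zero = math.exp(-m)
    return {
        "mean": m,
        "variance": v,
        # A ratio well above 1 suggests overdispersion
        "dispersion_ratio": v / m if m > 0 else float("nan"),
        "observed_zero_share": observed_zero,
        "poisson_zero_share": poisson_zero,
    }


# Toy counts, invented for illustration only
toy_counts = [0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 5, 12, 0, 0, 40]
report = zero_inflation_check(toy_counts)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

With the data from this post, the same function could be called as `zero_inflation_check(data['Total'].tolist())`. A variance far above the mean, together with many more zeros than the Poisson share, is the pattern that motivates ZINB.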
If you have more than one dependent variable, set up a zero-inflated negative binomial regression model for each dependent variable:
for target in dependent_vars:
    print(f"Analyzing: {target}")
    # Extract X and y
    X = data[independent_vars]
    y = data[target]
    # Add a constant term
    X = sm.add_constant(X)
    # Construct a zero-inflated negative binomial regression model;
    # the same design matrix X is used for the count part and the zero-inflation part
    model = ZeroInflatedNegativeBinomialP(y, X, exog_infl=X)
    result = model.fit()
    # Results
    summary = result.summary2().tables[1]
    # exp(coefficient) for every row of the coefficient table
    summary['OR'] = np.exp(summary['Coef.'])
    # Print regression coefficients, standard errors, z-values, p-values, OR values
    print(summary[['Coef.', 'Std.Err.', 'z', 'P>|z|', 'OR']])
    print("\n")
Notice that five columns are printed here: Coef., Std.Err., z, P>|z|, and OR. Here's what they mean:
Column | Meaning |
---|---|
Regression coefficient | Reflects the strength and direction of the influence of the independent variable on the dependent variable |
Standard error | Measures the uncertainty of the estimated regression coefficient. The smaller this value, the more precise the regression coefficient estimate |
z-value | The coefficient divided by its standard error. Used to test whether the regression coefficient is significantly different from 0 (i.e., whether the variable is statistically significant for Y). The larger the absolute value of z, the stronger the evidence |
p-value | The probability of seeing a z-value at least this extreme if the true coefficient were 0. A small value (conventionally below 0.05) means the coefficient is significantly different from zero |
OR | exp(Coef.). For the zero-inflation (logistic) rows, this is the odds ratio describing the effect of the independent variable on the probability of zero inflation. For the count-model rows, the same quantity is a rate ratio on the expected count |
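To connect these coefficients back to the original question (will a country win at least one medal?), the sketch below turns a set of fitted parameters into P(Y ≥ 1). It assumes the NB2 form of the count part and a logit inflation part. The function name and all numeric values are hypothetical, chosen only to show the arithmetic. In statsmodels output, the inflation coefficients are, to my knowledge, the rows prefixed with `inflate_`, and the dispersion parameter is the row named `alpha`.

```python
import math


def zinb_prob_at_least_one(x, beta, gamma, alpha):
    """P(Y >= 1) for one observation under a ZINB model (NB2 count part, logit inflation).

    x     : feature values, including the leading 1.0 for the constant term
    beta  : coefficients of the negative binomial (count) part
    gamma : coefficients of the logistic (zero-inflation) part
    alpha : dispersion parameter of the negative binomial part (> 0)
    """
    # Mean of the count part: mu = exp(x . beta)
    mu = math.exp(sum(xi * bi for xi, bi in zip(x, beta)))
    # Probability of a structural zero: pi = sigmoid(x . gamma)
    pi = 1.0 / (1.0 + math.exp(-sum(xi * gi for xi, gi in zip(x, gamma))))
    # Probability that the negative binomial part itself produces a zero
    nb_zero = (1.0 + alpha * mu) ** (-1.0 / alpha)
    # A positive count requires: not a structural zero AND a non-zero NB draw
    return (1.0 - pi) * (1.0 - nb_zero)


# Hypothetical values, for illustration only
x = [1.0, 2.0]        # constant term plus one feature
beta = [0.1, 0.4]     # count-part coefficients
gamma = [1.0, -0.8]   # inflation-part coefficients
alpha = 0.5           # dispersion

p = zinb_prob_at_least_one(x, beta, gamma, alpha)
print(f"P(at least one medal) = {p:.3f}")
```

A negative inflation coefficient (OR below 1) lowers the odds of being a structural zero, so raising that feature increases P(Y ≥ 1), all else being equal.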