Zero-inflated negative binomial regression model (ZINB)

Fuufhjn...

Introduction

This post is primarily application-oriented, so I'll skip the theory behind the model. The zero-inflated negative binomial (ZINB) regression model is suited to count data that are both overdispersed (the variance is larger than the mean) and zero-inflated. Typical examples are the number of hospitalizations or the number of disease episodes per person: a large share of subjects never experience the event at all, so the data contain more zeros than an ordinary count model would predict. This is called zero inflation. In Problem C of the 2025 MCM competition, we need to predict which countries will win their first Olympic medal, and this scenario also calls for a zero-inflated model.

Here I treat the model as a black box. Its inputs are a set of feature variables $X$ (the independent variables) and a dependent variable $Y$, which is the zero-inflated count to be predicted. Its output is the probability that the dependent variable equals 0.
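A quick way to see whether a count variable shows these two symptoms is to compare its variance with its mean, and its observed share of zeros with the share a plain Poisson model with the same mean would expect. The sketch below does this on simulated data; the sample size, the 40% share of structural zeros, and the NB parameters are made-up values for illustration, not taken from the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical simulation: 40% structural zeros, the rest negative binomial counts
structural_zero = rng.random(n) < 0.4
nb_counts = rng.negative_binomial(2, 0.4, size=n)  # NB part with mean 2 * 0.6 / 0.4 = 3
y = np.where(structural_zero, 0, nb_counts)

mean, var = y.mean(), y.var()
observed_zero_share = (y == 0).mean()
poisson_zero_share = np.exp(-mean)  # P(Y = 0) under a Poisson with the same mean

print(f"mean = {mean:.2f}, variance = {var:.2f}")
print(f"observed zeros = {observed_zero_share:.2%}, "
      f"Poisson-expected zeros = {poisson_zero_share:.2%}")
```

A variance well above the mean points to overdispersion, and an observed zero share well above the Poisson-expected one points to zero inflation. On real data, `y` would simply be the dependent column.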

Beyond that, you need to know three parameters. Two of them belong to the count part of the model, the negative binomial (NB) regression:

Regression coefficients $β$: describe the effect of the independent variables on the count;

Dispersion parameter $θ$: controls the overdispersion.

The third, the logistic regression coefficients $γ$, belongs to the zero-inflation part and determines the probability $π$ that an observation is an excess ("always zero") observation.
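To make the roles of the three parameters concrete, the sketch below turns one row of covariates into the NB mean $μ$ and the zero-inflation probability $π$. It assumes the usual log link for the count part and logit link for the zero-inflation part; every numeric value is invented for illustration.

```python
import numpy as np

# Hypothetical design row: constant term plus two covariates
x = np.array([1.0, 0.5, 2.0])

beta = np.array([0.2, 0.8, -0.1])    # count-part coefficients (assumed values)
gamma = np.array([-1.0, 0.4, 0.3])   # zero-inflation coefficients (assumed values)
theta = 1.5                          # NB dispersion parameter (assumed value)

mu = np.exp(x @ beta)                     # log link: expected count of the NB part
pi = 1.0 / (1.0 + np.exp(-(x @ gamma)))   # logit link: probability of an excess zero

# NB variance under the common "NB2" form: mu + mu^2 / theta
nb_variance = mu + mu**2 / theta

print(f"mu = {mu:.3f}, pi = {pi:.3f}, NB variance = {nb_variance:.3f}")
```

Note that, as far as I know, statsmodels reports the dispersion as `alpha`, which under this NB2 form corresponds to $1/θ$ rather than $θ$ itself.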

Joint Modeling Formula:

$$
P(Y = y) =
\begin{cases}
π + (1 - π)\, P_{NB}(0), & y = 0 \\
(1 - π)\, P_{NB}(y), & y > 0
\end{cases}
$$
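The formula can be checked numerically. The sketch below implements it with `scipy.stats.nbinom`, assuming the NB part is parameterised by a mean $μ$ and the dispersion $θ$ (so that scipy's `n = θ` and `p = θ / (θ + μ)`). The function name `zinb_pmf` and the parameter values are illustrative only.

```python
import numpy as np
from scipy.stats import nbinom

def zinb_pmf(y, pi, mu, theta):
    """ZINB probability mass at y for zero-inflation probability pi,
    NB mean mu and NB dispersion theta (mean/dispersion form is an assumption)."""
    y = np.asarray(y)
    nb = nbinom.pmf(y, theta, theta / (theta + mu))
    # Zeros get the extra mass pi; positive counts are scaled by (1 - pi)
    return np.where(y == 0, pi + (1 - pi) * nb, (1 - pi) * nb)

pi, mu, theta = 0.3, 2.0, 1.5   # illustrative values
ys = np.arange(0, 500)
probs = zinb_pmf(ys, pi, mu, theta)
nb_zero = nbinom.pmf(0, theta, theta / (theta + mu))

print("P(Y = 0) under ZINB:    ", probs[0])
print("P(Y = 0) under plain NB:", nb_zero)
print("total probability:      ", probs.sum())
```

The zero probability is always larger than under the plain NB model with the same $μ$ and $θ$, and the probabilities still sum to 1, which is a handy sanity check on the two-branch formula.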

Code

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

# This file contains all the independent and dependent fields
file_path = "final.csv"
data = pd.read_csv(file_path)

independent_vars = [
    'Participation_Count',
    'k_medal_ratio',
    'total',
    'gold',
    'discipline_Count',
    'Athlete_Count',
    'discipline_Event_ratio',
    'Avg_Participation',
    'Gender_Ratio',
    'host',
    'super',
    'Consecutive_Medals',
    'Disruption_Count',
    'young_ratio'
]
dependent_vars = ['Total']

If you have more than one dependent variable, fit a separate zero-inflated negative binomial regression model for each one:

for target in dependent_vars:
    print(f"Analyzing: {target}")

    # Extract X and y
    X = data[independent_vars]
    y = data[target]

    # Add a constant term
    X = sm.add_constant(X)

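    # Construct a zero-inflated negative binomial regression model.
    # exog_infl=X means the zero-inflation part uses the same covariates as the count part
    model = ZeroInflatedNegativeBinomialP(y, X, exog_infl=X)
    result = model.fit()

    # Coefficient table: rows prefixed with "inflate_" belong to the zero-inflation part,
    # the remaining rows to the count part
    summary = result.summary2().tables[1]
    # exp(coef): an odds ratio for the inflate_ rows, a rate ratio for the count rows
    summary['OR'] = np.exp(summary['Coef.'])

    # Print regression coefficients, standard errors, z-values, p-values, OR values
    print(summary[['Coef.', 'Std.Err.', 'z', 'P>|z|', 'OR']])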
    print("\n")

Notice that five columns are printed here: Coef., Std.Err., z, P>|z|, and OR. Here is what they mean:

Regression coefficient (Coef.)	The strength and direction of the independent variable's effect on the dependent variable
Standard error (Std.Err.)	The uncertainty of the estimated regression coefficient. The smaller this value, the more precise the estimate
z-value (z)	The coefficient divided by its standard error, used to test whether the coefficient differs from 0. The larger \|z\| is, the stronger the evidence that the variable affects $Y$
p-value (P>\|z\|)	The probability of seeing a \|z\| at least this large if the true coefficient were 0. A small value (commonly below 0.05) indicates that the coefficient is significantly different from zero
OR	exp(Coef.). For rows of the zero-inflation part it is an odds ratio: the factor by which the odds of an excess zero change per unit increase in the independent variable. For rows of the count part, exp(Coef.) is a rate ratio rather than an odds ratio
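The relationship between these columns can be reproduced by hand. The sketch below starts from a coefficient and its standard error and derives the z-value, the two-sided p-value, and exp(Coef.); the two input numbers are hypothetical and do not come from the fitted model.

```python
import numpy as np
from scipy.stats import norm

coef, std_err = 0.5, 0.2   # hypothetical values, not from the fitted model

z = coef / std_err                 # Wald z statistic
p_value = 2 * norm.sf(abs(z))      # two-sided p-value from the standard normal
exp_coef = np.exp(coef)            # the "OR" column: exp(Coef.)

print(f"z = {z:.2f}, p = {p_value:.4f}, exp(coef) = {exp_coef:.3f}")
```

This also shows why z and p carry the same information: the p-value is just the tail probability of \|z\|, so a larger \|z\| always means a smaller p-value.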
