Saturday, January 11, 2020

Run Logit Regressions in Python and Beware of Regression vs Forecasting Differences


I don't have access to Stata on all my computer since Stata licenses cost $$. Excel would work for basic OLS but it's a lot more involve to get it to do logit. Python is a workaround and it can do a lot more as well.

However, there's a difference in using "logit" to forecast (typically done via sklearn) versus doing a econometrics logitic regression. For econometrics analysis, sklearn should not be used. For the latter purpose, we generally try to estimate coefficients and interpret causality via hypothesis testing. Forecasting via sklearn often ignores this aspect. Furtherore, the coefficients you get via sklearn are often different from what you would get running it under Stata or any econometrics software.

I have written a Colab/Python notebook on this, and you can find it here (link). But since you will need to sign into a google account to view and run the code, I included the code below for quick reference. Note I had to break up the line "model = logit('diagnosis ~ ...", that should all be one line.

To use sklearn to forecast with logit:

import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression  
from sklearn.metrics import confusion_matrix
data = pd.read_csv("fertility_Diagnosis_mod.txt")
  # You can get the data at 
# https://archive.ics.uci.edu/ml/datasets/Fertility. 
# I added variable labels and recoded the dependent variable (N=0,O=1)

X = data.iloc[:,:-1].values
Y = data.iloc[:,-1].values
y = Y

X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.15,
random_state=5)
classifier = LogisticRegression(class_weight='balanced')
classifier.fit(X_train,y_train)

y_pred = classifier.predict(X_test)
classifier.coef_  

To use statsmodel for logit regression:

from statsmodels.formula.api import ols
from statsmodels.formula.api import logit

datadata = pd.read_csv('fertility_Diagnosis_mod.txt')
model = logit('diagnosis ~ season + age + diseases + accident + 
surgery + fever + alcohol_intake + smoking_habit + 
sitting_hours', datadata).fit()
print(model.summary())