The EDRR Blog: Run Logit Regressions in Python and Beware of Regression vs Forecasting Differences

Saturday, January 11, 2020

Run Logit Regressions in Python and Beware of Regression vs Forecasting Differences

I don't have access to Stata on all my computer since Stata licenses cost $$. Excel would work for basic OLS but it's a lot more involve to get it to do logit. Python is a workaround and it can do a lot more as well.

However, there's a difference in using "logit" to forecast (typically done via sklearn) versus doing a econometrics logitic regression. For econometrics analysis, sklearn should not be used. For the latter purpose, we generally try to estimate coefficients and interpret causality via hypothesis testing. Forecasting via sklearn often ignores this aspect. Furtherore, the coefficients you get via sklearn are often different from what you would get running it under Stata or any econometrics software.

I have written a Colab/Python notebook on this, and you can find it here (link). But since you will need to sign into a google account to view and run the code, I included the code below for quick reference. Note I had to break up the line "model = logit('diagnosis ~ ...", that should all be one line.

To use sklearn to forecast with logit:

import numpy as np

import pandas as pd

from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression  

from sklearn.metrics import confusion_matrix

data = pd.read_csv("fertility_Diagnosis_mod.txt")

  # You can get the data at 

  #   https://archive.ics.uci.edu/ml/datasets/Fertility. 

  # I added variable labels and recoded the dependent variable (N=0,O=1)

X = data.iloc[:,:-1].values

Y = data.iloc[:,-1].values

y = Y

X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.15,

random_state=5)

classifier = LogisticRegression(class_weight='balanced')

classifier.fit(X_train,y_train)

y_pred = classifier.predict(X_test)

classifier.coef_   

To use statsmodel for logit regression:

from statsmodels.formula.api import ols

from statsmodels.formula.api import logit

datadata = pd.read_csv('fertility_Diagnosis_mod.txt')

model = logit('diagnosis ~ season + age + diseases + accident + 

  surgery + fever + alcohol_intake + smoking_habit + 

  sitting_hours', datadata).fit()

print(model.summary())

More of My Stuffs

Browser-based Offline Pomodoro Timer
EDRR/p/et-timer.html?25.Pomodoro
A browser-based timer you can set via your url text box. The link above is a demo that works online. Instructions to create it so you can use it offline is here.

Dividend Aristocrat Tableau Viz
EDRR/2015/10/fuzzy-dividend-aristocrats-companion.html

Get information on dividend aristocrat stocks to help with buying decisions? Try a Tableau Viz I build using financial data parsed from Yahoo Finance. (This is still a nice demo but because Yahoo has made getting its financial data very difficult I no longer update this.)