Machine Learning: Lasso Regression using Python – Alcohol intake based on physical and social attributes

Lasso regression, short for “Least Absolute Shrinkage and Selection Operator”, is a penalized regression method that is often used to select a subset of variables. It is a supervised machine learning method.


The shrinkage process identifies the variables most strongly associated with the chosen target variable. We will use the same target and explanatory variables as in the post on Random Forests, which lets us compare how different methods rank the variables that affect the target.
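To see the shrinkage in action before applying it to real data, here is a minimal, self-contained sketch (my illustration with made-up toy data, not part of the original analysis): as the penalty weight alpha grows, lasso drives the coefficients of uninformative predictors exactly to zero.

import numpy as np
from sklearn.linear_model import Lasso

# toy data: five candidate predictors, only the first two informative
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.randn(100)

for alpha in (0.01, 0.1, 1.0):
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    # larger alpha -> more coefficients shrunk exactly to zero
    print(alpha, np.round(coefs, 2))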

The variables are again:
1. SEX
2. HISPANIC OR LATINO ORIGIN (S1Q1C)
3. “AMERICAN INDIAN OR ALASKA NATIVE” CHECKED IN MULTIRACE CODE (S1Q1D1)
4. “ASIAN” CHECKED IN MULTIRACE CODE (S1Q1D2)
5. “BLACK OR AFRICAN AMERICAN” CHECKED IN MULTIRACE CODE (S1Q1D3)
6. “NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER” CHECKED IN MULTIRACE CODE (S1Q1D4)
7. “WHITE” CHECKED IN MULTIRACE CODE (S1Q1D5)
8. NUMBER OF CHILDREN EVER HAD, INCLUDING ADOPTIVE, STEP AND FOSTER CHILDREN (S1Q5A)
9. PRESENT SITUATION INCLUDES WORKING FULL TIME (35+ HOURS A WEEK) (S1Q7A1)
10. PERSONALLY RECEIVED FOOD STAMPS IN LAST 12 MONTHS (S1Q14A)

The target is set to “DRANK AT LEAST 12 ALCOHOLIC DRINKS IN LAST 12 MONTHS”.

Data were randomly split into a training set that included 70% of the observations (N=3201) and a test set that included 30% of the observations (N=1701). The syntax is as follows:

# -*- coding: UTF-8 -*-

#from pandas import Series, DataFrame
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn
from sklearn.linear_model import LassoLarsCV

#Remember to replace file directory with the active folder
os.chdir("file directory")
#Load the dataset
data = pd.read_csv("nesarc_pds.csv", low_memory=False)
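#NESARC: National Epidemiologic Survey on Alcohol and Related Conditions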

#upper-case all DataFrame column names
data.columns = [col.upper() for col in data.columns]

#listwise-delete rows with any missing values
data_clean = data.dropna()

#Select the explanatory variables and the target
predvar = data_clean[['SEX','S1Q1C','S1Q1D1','S1Q1D2','S1Q1D3',
                      'S1Q1D4','S1Q1D5','S1Q5A','S1Q7A1','S1Q14A']]

#S2AQ2: drank at least 12 alcoholic drinks in last 12 months
target = data_clean.S2AQ2

#standardize predictors to have mean=0 and sd=1, so the penalty
#treats every coefficient on the same scale
from sklearn import preprocessing

predictors = predvar.copy()
for col in predictors.columns:
    predictors[col] = preprocessing.scale(predictors[col].astype('float64'))

# split data into train and test sets (70%/30%)
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, target, test_size=.3, random_state=123)

# specify the lasso regression model: LassoLarsCV computes the lasso path
# with least-angle regression and picks alpha by 10-fold cross-validation
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

# print variable names and regression coefficients; coefficients shrunk
# exactly to zero mark variables dropped from the model
print(dict(zip(predictors.columns, model.coef_)))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean squared error for each cross-validation fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')  # cv_mse_path_ in older scikit-learn
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
plt.show()  # display both figures when running as a script

# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)

The output is as follows.

[Output screenshot: variable names with their lasso regression coefficients]

As you can see, the three variables with the largest weights are, in order, gender, working full time, and number of children. The most heavily weighted variables are similar to those identified by the random forest. All three coefficient paths stand out in the first graph below, indicated by blue, yellow and magenta respectively.
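To read the ranking directly off the fitted model rather than eyeballing the printed dictionary, a small addition like the following (my sketch, not in the original post; it reuses the predictors and model variables from the script above) sorts predictors by the absolute size of their standardized coefficients:

# rank predictors by absolute standardized coefficient (illustrative addition)
coef_table = sorted(zip(predictors.columns, model.coef_),
                    key=lambda pair: abs(pair[1]), reverse=True)
for name, coef in coef_table:
    print('%s: %.4f' % (name, coef))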

[Figure: Regression Coefficients Progression for Lasso Paths]

The second graph shows that prediction error decreases as more variables are added to the model. However, as more variables are added, the reduction in mean squared error becomes negligible. A model with too few predictors risks underfitting, while a model with too many predictors risks overfitting.
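A rough way to check where this model falls between those two extremes is to compare the training and test errors already computed above. This gap check is my addition, not part of the original post: a test MSE close to the training MSE suggests the model is not badly overfitted.

# compare training and test error (uses train_error/test_error from the script above)
gap = test_error - train_error
print('train MSE: %.4f  test MSE: %.4f  gap: %.4f'
      % (train_error, test_error, gap))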

It is interesting that people with full-time jobs and fewer children tend to drink more. Perhaps this is because they have more resources.
