Machine Learning: Random Forests using Python – Alcohol intake based on physical and social attributes

The decision tree in the previous posts is useful for exploring how variables predict a particular target or response. However, a single tree is unstable: small changes in the data can lead to a very different tree. A random forest addresses this by growing many decision trees, each on a random sample of the data and considering a random subset of the variables at each split, and then combining their predictions. The forest also reports which variables have the highest importance across all of the trees.
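To make the instability point concrete, here is a minimal sketch that refits a single tree and a small forest on resampled training data and measures how often their test predictions change. It uses synthetic data from `make_classification` rather than the survey, and the helper name `instability` is made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration, not the NESARC survey)
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)


def instability(make_model, n_repeats=10, seed=0):
    """Refit on bootstrap resamples of the training data and return the
    share of test rows whose predicted class is not the same in every fit."""
    rng = np.random.RandomState(seed)
    all_preds = []
    for _ in range(n_repeats):
        # Draw training rows with replacement to mimic a slightly different sample
        rows = rng.randint(0, len(X_train), size=len(X_train))
        model = make_model().fit(X_train[rows], y_train[rows])
        all_preds.append(model.predict(X_test))
    all_preds = np.array(all_preds)
    # A test row counts as unstable if any refit disagrees with the first refit
    return float(np.mean((all_preds != all_preds[0]).any(axis=0)))


tree_instability = instability(lambda: DecisionTreeClassifier(random_state=0))
forest_instability = instability(
    lambda: RandomForestClassifier(n_estimators=25, random_state=0))

print("single tree:", tree_instability)
print("random forest:", forest_instability)
```

On most datasets the forest's predictions should change less between resamples than the single tree's, although the exact numbers depend on the data and the seeds.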

In this example, I have used the U.S. National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) dataset, which covers alcohol intake. I have specified the following 10 explanatory variables:
1. SEX
2. HISPANIC OR LATINO ORIGIN (S1Q1C)
3. “AMERICAN INDIAN OR ALASKA NATIVE” CHECKED IN MULTIRACE CODE (S1Q1D1)
4. “ASIAN” CHECKED IN MULTIRACE CODE (S1Q1D2)
5. “BLACK OR AFRICAN AMERICAN” CHECKED IN MULTIRACE CODE (S1Q1D3)
6. “NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER” CHECKED IN MULTIRACE CODE (S1Q1D4)
7. “WHITE” CHECKED IN MULTIRACE CODE (S1Q1D5)
8. NUMBER OF CHILDREN EVER HAD, INCLUDING ADOPTIVE, STEP AND FOSTER CHILDREN (S1Q5A)
9. PRESENT SITUATION INCLUDES WORKING FULL TIME (35+ HOURS A WEEK) (S1Q7A1)
10. PERSONALLY RECEIVED FOOD STAMPS IN LAST 12 MONTHS (S1Q14A)

The target is set to “DRANK AT LEAST 12 ALCOHOLIC DRINKS IN LAST 12 MONTHS” (S2AQ2).

The code is as follows:

# -*- coding: UTF-8 -*-

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
# train_test_split lives in sklearn.model_selection; the old
# sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# Feature importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

# Remember to replace "file directory" with the folder that contains the dataset
os.chdir("file directory")

#Load the dataset

AH_data = pd.read_csv("nesarc_pds.csv", low_memory=False)

data_clean = AH_data.dropna()

data_clean.dtypes
data_clean.describe()

#Split into training and testing sets
predictors = data_clean[['SEX','S1Q1C','S1Q1D1','S1Q1D2','S1Q1D3','S1Q1D4','S1Q1D5','S1Q5A','S1Q7A1','S1Q14A']]
targets = data_clean.S2AQ2

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape

#Build model on training data
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

# Print the results so they also appear when the code runs as a script
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)

# Run the forest with 1 to 25 trees and record the test accuracy for each size
trees = range(1, 26)
accuracy = np.zeros(25)

for idx, n_trees in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n_trees)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
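The listing above prints the importance scores as a bare array, so it is not obvious which score belongs to which variable. A minimal sketch of labelling and ranking them follows. It uses synthetic data and made-up column names (`feature_a` to `feature_e`) in place of the NESARC columns, and it reads the scores from a `RandomForestClassifier` rather than from the Extra Trees model used above; both expose `feature_importances_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the predictors (assumption: made-up data and names)
names = ["feature_a", "feature_b", "feature_c", "feature_d", "feature_e"]
X_values, y = make_classification(n_samples=500, n_features=5,
                                  n_informative=3, n_redundant=0,
                                  random_state=0)
X = pd.DataFrame(X_values, columns=names)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X, y)

# Pair each score with its column name and sort from most to least important
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the real data, the same two lines at the end could be applied to `model.feature_importances_` and `predictors.columns` from the listing above.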

The results are as follows.


The explanatory variables with the highest relative importance scores were sex, the number of children ever had, and whether the individual was working a full-time job. The accuracy of the random forest was 63%. This means that 63 percent of the respondents in the test set were classified correctly, either as having or as not having drunk at least 12 alcoholic drinks in the last 12 months.

True positive is 5196
True negative is 5675
False negative is 3213
False positive is 3140
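
As a quick check, the 63% accuracy can be recomputed from the four counts above. The sketch below also derives precision and recall, which the post does not report; they are included only to show what else the same counts give.

```python
# Counts reported above
true_positive = 5196
true_negative = 5675
false_negative = 3213
false_positive = 3140

total = true_positive + true_negative + false_negative + false_positive

# Accuracy: share of all test cases classified correctly, in either class
accuracy = (true_positive + true_negative) / total

# Precision: of the cases predicted positive, the share that were positive
precision = true_positive / (true_positive + false_positive)

# Recall: of the cases that were positive, the share predicted positive
recall = true_positive / (true_positive + false_negative)

print(f"test cases: {total}")
print(f"accuracy:   {accuracy:.3f}")
print(f"precision:  {precision:.3f}")
print(f"recall:     {recall:.3f}")
```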
