Machine Learning: Random Forests using Python – Alcohol intake based on physical and social attributes

The decision tree in the previous posts is useful for exploring how variables predict a particular target or response. However, small changes in the data can lead to very different trees. Like a decision tree, a random forest assesses variables with respect to the data, but it grows many trees on random subsets of the observations and variables and combines them, which gives more stable results and indicates which variables have the highest importance.
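To make the idea concrete, here is a minimal plain-Python sketch of two ingredients of a random forest: drawing a bootstrap sample for each tree, and combining the trees' predictions by majority vote. The function names and the toy votes are made up for this illustration. It is a conceptual sketch only, not how scikit-learn does it internally (its `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes).

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as is done for each tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the class label predicted by the most trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)      # fixed seed so the sketch is repeatable
rows = list(range(10))      # stand-in for 10 observations
sample = bootstrap_sample(rows, rng)
print(sample)               # some rows repeat, others are left out

# Hypothetical predictions from five trees for one respondent
tree_votes = ["yes", "no", "yes", "yes", "no"]
print(majority_vote(tree_votes))  # yes (three votes against two)
```

Because each tree sees a slightly different sample, a single unusual observation has less influence on the combined prediction than it would on one tree.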

In this example, I use the U.S. National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) dataset, which looks at alcohol intake. I have specified the following 10 explanatory variables:
1. SEX
2. HISPANIC OR LATINO ORIGIN (S1Q1C)
3. “AMERICAN INDIAN OR ALASKA NATIVE” CHECKED IN MULTIRACE CODE (S1Q1D1)
4. “ASIAN” CHECKED IN MULTIRACE CODE (S1Q1D2)
5. “BLACK OR AFRICAN AMERICAN” CHECKED IN MULTIRACE CODE (S1Q1D3)
6. “NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER” CHECKED IN MULTIRACE CODE (S1Q1D4)
7. “WHITE” CHECKED IN MULTIRACE CODE (S1Q1D5)
8. NUMBER OF CHILDREN EVER HAD, INCLUDING ADOPTIVE, STEP AND FOSTER CHILDREN (S1Q5A)
9. PRESENT SITUATION INCLUDES WORKING FULL TIME (35+ HOURS A WEEK) (S1Q7A1)
10. PERSONALLY RECEIVED FOOD STAMPS IN LAST 12 MONTHS (S1Q14A)

The target is set to “DRANK AT LEAST 12 ALCOHOLIC DRINKS IN LAST 12 MONTHS” (S2AQ2).

 

The code is as follows:

[code language="python"]
# -*- coding: utf-8 -*-
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.metrics
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

# Remember to replace "file directory" with the folder that holds the dataset
os.chdir("file directory")

# Load the dataset and drop rows with missing values
AH_data = pd.read_csv("nesarc_pds.csv", low_memory=False)
data_clean = AH_data.dropna()
print(data_clean.dtypes)
print(data_clean.describe())

# Split into training and testing sets (60% / 40%)
predictors = data_clean[['SEX', 'S1Q1C', 'S1Q1D1', 'S1Q1D2', 'S1Q1D3',
                         'S1Q1D4', 'S1Q1D5', 'S1Q5A', 'S1Q7A1', 'S1Q14A']]
targets = data_clean.S2AQ2
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, targets, test_size=.4)
print(pred_train.shape, pred_test.shape, tar_train.shape, tar_test.shape)

# Build the random forest on the training data
classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

# Fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# Display the relative importance of each attribute
print(model.feature_importances_)

# Fit forests with 1 to 25 trees and record the test accuracy of each,
# so the x axis of the plot is the actual number of trees
trees = range(1, 26)
accuracy = np.zeros(len(trees))
for idx, n_trees in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n_trees)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
plt.show()
 [/code]
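The array printed by `model.feature_importances_` has no labels, so it helps to pair each score with its column name before reading it. A minimal sketch, assuming the scores come back in the same column order as `predictors`. The helper `rank_importances` and the numbers in the usage example are hypothetical placeholders, not results from this analysis:

```python
def rank_importances(names, importances):
    """Pair each name with its score, sorted from most to least important."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    return sorted(zip(names, importances),
                  key=lambda pair: pair[1], reverse=True)


# Placeholder scores for illustration only. With the fitted model, call
# rank_importances(list(predictors.columns), model.feature_importances_)
example_names = ['SEX', 'S1Q1C', 'S1Q1D1']
example_scores = [0.2, 0.5, 0.3]

for name, score in rank_importances(example_names, example_scores):
    print(f"{name:8s} {score:.2f}")
```

Sorting the pairs makes it easy to see at a glance which variables the model relied on most.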

The results are as follows.

The explanatory variables with the highest relative importance scores, as reported by the Extra Trees model, were sex, the number of children ever had, and whether the individual was working full time. The accuracy of the random forest on the test set was 63%. This means that 63 percent of the test observations were classified correctly, counting both respondents who had drunk at least 12 alcoholic drinks in the last 12 months and those who had not. The confusion matrix counts were:

True positives: 5196
True negatives: 5675
False negatives: 3213
False positives: 3140
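
The 63% accuracy can be checked directly from these four counts. The sketch below does that and also derives precision and recall, which this post does not report. They are included only to show what else the same counts can tell you.

```python
# Counts reported above for the random forest on the test set
tp, tn, fn, fp = 5196, 5675, 3213, 3140

total = tp + tn + fn + fp
accuracy = (tp + tn) / total   # share of all test observations classified correctly
precision = tp / (tp + fp)     # share of predicted positives that were correct
recall = tp / (tp + fn)        # share of actual positives that were found

print(f"test observations: {total}")
print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
```

The accuracy works out to (5196 + 5675) / 17224, or about 0.63, which matches the figure quoted above.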
