Machine Learning: Random Forests using Python – Alcohol intake based on physical and social attributes

The decision tree in the previous posts is useful for exploring how variables predict a particular target or response. However, small changes in the data can lead to very different trees. A random forest addresses this by growing many decision trees, each on a random sample of the data and of the variables, and combining their predictions. It also scores how much each variable contributes to those predictions, which shows which variables have the highest importance.
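To illustrate why combining many trees can be more stable than relying on one, here is a minimal sketch of majority voting. The labels and the three sets of "tree" predictions are hypothetical toy values written by hand, not output from any fitted model, and `majority_vote` and `accuracy` are helper names introduced only for this illustration.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label among the votes of the individual trees."""
    return Counter(votes).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Share of cases where the predicted label equals the actual label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical toy data: true labels for five respondents
truth = [1, 0, 1, 1, 0]

# Hypothetical predictions from three trees; each is wrong on a different respondent
tree_predictions = [
    [0, 0, 1, 1, 0],  # wrong on respondent 0
    [1, 1, 1, 1, 0],  # wrong on respondent 1
    [1, 0, 1, 0, 0],  # wrong on respondent 3
]

# Combine the trees: one vote per tree for each respondent
forest_prediction = [majority_vote(votes) for votes in zip(*tree_predictions)]

tree_accuracies = [accuracy(p, truth) for p in tree_predictions]
forest_accuracy = accuracy(forest_prediction, truth)

print("Single trees:", tree_accuracies)
print("Majority vote:", forest_accuracy)
```

Because the three toy trees make their mistakes on different respondents, the vote outperforms each of them. Real forests only approximate this, since their trees are never fully independent.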

In this example, I have used the U.S. National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) dataset to look at alcohol intake, with the following 10 explanatory variables:
1. SEX
2. HISPANIC OR LATINO ORIGIN (S1Q1C)
3. “AMERICAN INDIAN OR ALASKA NATIVE” CHECKED IN MULTIRACE CODE (S1Q1D1)
4. “ASIAN” CHECKED IN MULTIRACE CODE (S1Q1D2)
5. “BLACK OR AFRICAN AMERICAN” CHECKED IN MULTIRACE CODE (S1Q1D3)
6. “NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER” CHECKED IN MULTIRACE CODE (S1Q1D4)
7. “WHITE” CHECKED IN MULTIRACE CODE (S1Q1D5)
8. NUMBER OF CHILDREN EVER HAD, INCLUDING ADOPTIVE, STEP AND FOSTER CHILDREN (S1Q5A)
9. PRESENT SITUATION INCLUDES WORKING FULL TIME (35+ HOURS A WEEK) (S1Q7A1)
10. PERSONALLY RECEIVED FOOD STAMPS IN LAST 12 MONTHS (S1Q14A)

The target is set to “DRANK AT LEAST 12 ALCOHOLIC DRINKS IN LAST 12 MONTHS” (S2AQ2).

The syntax is provided as follows:

[code language="python"]
# -*- coding: utf-8 -*-
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.metrics
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
# train_test_split now lives in sklearn.model_selection
# (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split

# Remember to replace "file directory" with the active folder
os.chdir("file directory")

# Load the dataset and drop rows with missing values
AH_data = pd.read_csv("nesarc_pds.csv", low_memory=False)
data_clean = AH_data.dropna()
print(data_clean.dtypes)
print(data_clean.describe())

# Split into training (60%) and testing (40%) sets
predictors = data_clean[['SEX','S1Q1C','S1Q1D1','S1Q1D2','S1Q1D3','S1Q1D4','S1Q1D5','S1Q5A','S1Q7A1','S1Q14A']]
targets = data_clean.S2AQ2
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
print(pred_train.shape, pred_test.shape, tar_train.shape, tar_test.shape)

# Build a random forest of 25 trees on the training data
classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)

# Evaluate the forest on the test set
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

# Fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# Display the relative importance of each attribute
print(model.feature_importances_)

# Test-set accuracy for forests of 1 to 25 trees
trees = range(1, 26)
accuracy = np.zeros(len(trees))
for idx, n_trees in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n_trees)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

# Plot accuracy against the number of trees
plt.cla()
plt.plot(trees, accuracy)
plt.show()
 [/code]
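The importance scores are printed as a bare array, in the same order as the predictor columns, so it helps to pair each score with its variable name and sort them. The helper below is a minimal sketch; `rank_importances` is a name introduced here for illustration, and the names and scores in the example are placeholders, not results from this analysis.

```python
def rank_importances(names, importances):
    """Pair each variable name with its importance score, highest first."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)


# Placeholder names and scores for illustration only (NOT results from this post)
example_names = ["var_a", "var_b", "var_c"]
example_scores = [0.2, 0.5, 0.3]

for name, score in rank_importances(example_names, example_scores):
    print(f"{name:8s} {score:.3f}")
```

With the fitted model from the syntax above, the call would be `rank_importances(list(predictors.columns), model.feature_importances_)`.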

The results are as follows.

The explanatory variables with the highest relative importance scores were sex, the number of children ever had, and whether the individual was working a full-time job. The accuracy of the random forest was 63%. This means that 63 percent of the test sample was classified correctly, whether as having drunk at least 12 alcoholic drinks in the last 12 months or as not having done so.

True positives: 5196
True negatives: 5675
False negatives: 3213
False positives: 3140
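The 63% accuracy can be checked directly from these four counts. The short calculation below uses only the numbers reported above; precision, recall and specificity are extra metrics not reported in the post, derived here from the standard definitions and from the positive/negative labelling as given.

```python
# Counts reported above for the test set
tp, tn, fn, fp = 5196, 5675, 3213, 3140

total = tp + tn + fn + fp

# Accuracy: share of all test cases classified correctly (either class)
accuracy = (tp + tn) / total

# Additional metrics derived from the same counts (not reported in the post)
precision = tp / (tp + fp)    # of those predicted positive, share truly positive
recall = tp / (tp + fn)       # of those truly positive, share predicted positive
specificity = tn / (tn + fp)  # of those truly negative, share predicted negative

print(f"Test cases:  {total}")
print(f"Accuracy:    {accuracy:.2f}")
print(f"Precision:   {precision:.2f}")
print(f"Recall:      {recall:.2f}")
print(f"Specificity: {specificity:.2f}")
```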
