The decision tree in the previous posts is useful for exploring how variables can predict a particular target or response. However, a single tree is unstable: small changes in the data can lead to a very different tree. Like decision trees, random forests assess variables with respect to the data, but they grow many trees on random subsets of the data and combine their results. This gives more stable predictions and a score showing which variables have the highest importance.
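To make the idea concrete: a random forest trains each tree on a bootstrap sample (rows drawn with replacement) and then combines the trees' predictions. The following is a minimal, standard-library sketch of those two steps, not the scikit-learn implementation; the function names bootstrap_sample and majority_vote are hypothetical. The classic formulation takes a majority vote, whereas scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as is done for each tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the most common class label among the individual tree votes."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
rows = list(range(10))            # stand-in for 10 rows of a dataset
sample = bootstrap_sample(rows, rng)

# The sample has the same size as the data but typically repeats some rows
print(len(sample), len(set(sample)))

# Five hypothetical trees vote on one observation
print(majority_vote([1, 0, 1, 1, 0]))
```

Because every tree sees a slightly different sample, no single quirk of the data dominates the combined prediction.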
In this example, I have used the U.S. National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) dataset, which looks at alcohol intake. I have specified the following 10 explanatory variables:
1. SEX (SEX)
2. HISPANIC OR LATINO ORIGIN (S1Q1C)
3. “AMERICAN INDIAN OR ALASKA NATIVE” CHECKED IN MULTIRACE CODE (S1Q1D1)
4. “ASIAN” CHECKED IN MULTIRACE CODE (S1Q1D2)
5. “BLACK OR AFRICAN AMERICAN” CHECKED IN MULTIRACE CODE (S1Q1D3)
6. “NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER” CHECKED IN MULTIRACE CODE (S1Q1D4)
7. “WHITE” CHECKED IN MULTIRACE CODE (S1Q1D5)
8. NUMBER OF CHILDREN EVER HAD, INCLUDING ADOPTIVE, STEP AND FOSTER CHILDREN (S1Q5A)
9. PRESENT SITUATION INCLUDES WORKING FULL TIME (35+ HOURS A WEEK) (S1Q7A1)
10. PERSONALLY RECEIVED FOOD STAMPS IN LAST 12 MONTHS (S1Q14A)
The target is set to “DRANK AT LEAST 12 ALCOHOLIC DRINKS IN LAST 12 MONTHS” (S2AQ2).
The code is as follows:
# -*- coding: UTF-8 -*-
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.metrics
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
# sklearn.cross_validation was removed from scikit-learn; use model_selection
from sklearn.model_selection import train_test_split

# Remember to replace file directory with the active folder
os.chdir("file directory")

# Load the dataset and drop rows with missing values
AH_data = pd.read_csv("nesarc_pds.csv", low_memory=False)
data_clean = AH_data.dropna()
print(data_clean.dtypes)
print(data_clean.describe())

# Split into training and testing sets (60% / 40%)
predictors = data_clean[['SEX', 'S1Q1C', 'S1Q1D1', 'S1Q1D2', 'S1Q1D3',
                         'S1Q1D4', 'S1Q1D5', 'S1Q5A', 'S1Q7A1', 'S1Q14A']]
targets = data_clean.S2AQ2
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, targets, test_size=.4)
print(pred_train.shape, pred_test.shape, tar_train.shape, tar_test.shape)

# Build model on training data
classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

# Fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)
# Display the relative importance of each attribute
print(model.feature_importances_)

# Accuracy on the test set for forests of 1 to 25 trees
trees = range(1, 26)
accuracy = np.zeros(25)
for idx, n_trees in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n_trees)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
plt.show()
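Printing model.feature_importances_ gives a bare array of scores in the same order as the predictor columns, which is hard to read. Below is a minimal sketch of pairing each score with its column name and sorting from most to least important. The helper name rank_importances is hypothetical, and the names and scores in the demo are placeholders, not results from the NESARC model.

```python
def rank_importances(names, scores):
    """Pair each predictor name with its importance score, highest first."""
    if len(names) != len(scores):
        raise ValueError("names and scores must have the same length")
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Placeholder values for illustration only (NOT output from the real model).
# With the script's variables you would instead call:
#   rank_importances(list(predictors.columns), model.feature_importances_)
example_names = ['var_a', 'var_b', 'var_c']
example_scores = [0.2, 0.5, 0.3]

ranked = rank_importances(example_names, example_scores)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```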
The results are as follows.
The explanatory variables with the highest relative importance scores (taken from the Extra Trees model) were sex, the number of children ever had, and whether the individual was working a full-time job. The accuracy of the random forest was 63%. This means that 63 percent of the test sample was classified correctly, either as having drunk at least 12 alcoholic drinks in the last 12 months or as not having done so.
True positive is 5196
True negative is 5675
False negative is 3213
False positive is 3140
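The 63% accuracy can be checked directly from these four counts, since accuracy is the share of correct classifications (true positives plus true negatives) among all test cases. A quick check, using only the numbers reported above:

```python
# Confusion-matrix counts reported above
true_positive = 5196
true_negative = 5675
false_negative = 3213
false_positive = 3140

# Total number of test cases
total = true_positive + true_negative + false_negative + false_positive

# Accuracy = correctly classified cases / all cases
accuracy = (true_positive + true_negative) / total

print(f"test cases: {total}")
print(f"accuracy: {accuracy:.3f}")
```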