Machine Learning: Making Decision Trees using Python – Adolescent Sex and Parenting

A decision tree can predict a particular target or response. I built the decision tree below using machine learning to test several relationships found in the National Longitudinal Study of Adolescent Health (Add Health), a survey conducted in the United States.

[Figure: final decision tree (tree.png)]

The syntax is provided at the end of the post.

In this example, I demonstrate one way to create a decision tree in Python. The model initially included the following variables:
1. Gender
2. Whether parents decide what you wear
3. Whether parents decide on the people you hang out with
4. Whether parents decide on which television programs you watch
5. Whether parents decide on what you eat
6. Over the last week, if at least one parent was present during dinner
7. Closeness to mother (either biological or adoptive)
8. Whether the individual kissed a non-family member
9. Whether the individual held hands with a non-family member
The initial decision tree was too large to include. A pruned version is shown below.

[Figure: pruned decision tree (treepic)]

The tree was then selectively pruned further to give the final image at the start of the post. Three variables were selected: sex, whether the individual has kissed a non-family member, and whether parents decide what they wear. The four boxes at the bottom represent the results. Boxes where the percentage of respondents who have had sexual intercourse exceeds the baseline of 61% are highlighted in blue.

The modified dataset is provided here.

Only one category fell below the baseline, at 56%. This box is coloured white and represents individuals who have not kissed a non-family member and whose parents give them freedom in what they wear. It would be interesting to evaluate the trust relationship between the individual and the parent(s) further.
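As a quick sanity check on that 61% figure, the baseline can be computed directly from the cleaned data. A minimal sketch, using the data_clean DataFrame from the script below and assuming the target H1CO1 is coded 1 = yes, 0 = no (an assumption about the codebook):

#Baseline: share of respondents who report having had sexual intercourse
#Assumes H1CO1 is coded 1 = yes, 0 = no (an assumption about the codebook)
baseline = (data_clean['H1CO1'] == 1).mean()
print("Baseline rate: {:.0%}".format(baseline))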

[Screenshot: confusion matrix and accuracy output]

The overall accuracy is about 61%. Repeating the steps will produce broadly similar accuracy, but you will notice that it fluctuates slightly, along with the true positives, true negatives, false positives and false negatives. This is explored further in the next post.
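One way to see that fluctuation is to repeat the 60/40 split with different random seeds and compare the accuracies. A minimal sketch, reusing the predictors and targets defined in the script below (train_test_split is imported here from sklearn.model_selection, its current location):

#Repeat the train/test split with different seeds to see how accuracy varies
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

for seed in range(5):
    p_train, p_test, t_train, t_test = train_test_split(
        predictors, targets, test_size=.4, random_state=seed)
    model = DecisionTreeClassifier().fit(p_train, t_train)
    print(seed, accuracy_score(t_test, model.predict(p_test)))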

The syntax is as follows. It generates a .dot file, which can then be turned into a PNG image or a PDF either from Python itself or with a simple dot command:

# -*- coding: UTF-8 -*-

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  #was sklearn.cross_validation in older scikit-learn versions
from sklearn.tree import DecisionTreeClassifier
#from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

#Remember to replace file directory with the active folder
os.chdir("file directory")
"""
Data Engineering and Analysis
"""
#Load the dataset

AH_data = pd.read_csv("addhealth_pds.csv")

#Drop rows with missing values
data_clean = AH_data.dropna()

#Inspect variable types and summary statistics
data_clean.dtypes
data_clean.describe()

"""
Modeling and Prediction
"""
#Split into training and testing sets

#Predictors: the three variables used in the final tree (sex, kissing a
#non-family member, and parental say over what the individual wears)
predictors = data_clean[['BIO_SEX','H1LR2','H1WP3']]

#Target: whether the individual has ever had sexual intercourse
targets = data_clean.H1CO1

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

#Check the dimensions of the training and test sets
pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape

#Build model on training data
classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)

predictions=classifier.predict(pred_test)

#Confusion matrix and overall accuracy on the test set
print(sklearn.metrics.confusion_matrix(tar_test,predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

#Displaying the decision tree: write the fitted model to a Graphviz .dot file
from sklearn import tree
tree.export_graphviz(classifier, out_file='out.dot')

Subsequently, run the following command in a terminal to convert the .dot file into a PNG:

dot -Tpng out.dot -o treepic.png
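To do the same conversion from Python itself instead of the terminal, one option (not part of the original script; it assumes the pydotplus package and the Graphviz binaries are installed) is:

#Convert the exported .dot file into a PNG without leaving Python
import pydotplus

graph = pydotplus.graph_from_dot_file('out.dot')
graph.write_png('treepic.png')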

Pruning

Pruning can be done either by selecting the best variables manually or by letting the machine do it for you. In the latter case, pass max_leaf_nodes=n to DecisionTreeClassifier, which will generate a completely different tree, like so:

classifier=DecisionTreeClassifier(max_leaf_nodes=5)
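The pruned classifier is then refitted and exported in the same way as before; a brief sketch (out_pruned.dot is just an illustrative filename):

#Refit the pruned tree and write it to a new .dot file
classifier = classifier.fit(pred_train, tar_train)
tree.export_graphviz(classifier, out_file='out_pruned.dot')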

After removing cases where individuals did not answer either yes or no in the question “Have you kissed a non-family member?”, the tree is now as follows:

[Figure: pruned decision tree after removing non-responses (treepicpruned)]
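For reference, a minimal sketch of that filtering step, assuming the kissing item is the H1LR2 column and that yes/no answers are coded 1/0 while non-responses use other codes (assumptions about the Add Health codebook):

#Keep only rows where the kissing question was answered yes (1) or no (0)
#The 0/1 codes are an assumption about the codebook
data_clean = data_clean[data_clean['H1LR2'].isin([0, 1])]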
