Machine Learning: K-Means Cluster Analysis using Python – Alcohol intake based on physical and social attributes

Plotting data on the graph is like looking at a bunch of stars. They all look the same and the data is difficult to interpret. What if there was a method to colour the stars?

cluster.jpg

Red stars are associated with red flowers and blue stars are associated with yellow flowers

Cluster analysis groups observations into subsets based on the similarity of responses on various relational variables, called clusters. The data is plotted on a graph and different clusters are coloured and separated to make sense of data. Today we will be using the same data set (NESARC wave 1) as the previous two examples: 1, 2

The target will be alcohol intake (≥12 drinks in 12 months, lower score = higher intake) and the cluster variables are again as follows:
1. SEX
2. HISPANIC OR LATINO ORIGIN (S1Q1C)
3. “AMERICAN INDIAN OR ALASKA NATIVE” CHECKED IN MULTIRACE CODE (S1Q1D1)
4. “ASIAN” CHECKED IN MULTIRACE CODE (S1Q1D2)
5. “BLACK OR AFRICAN AMERICAN” CHECKED IN MULTIRACE CODE (S1Q1D3)
6. “NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER” CHECKED IN MULTIRACE CODE (S1Q1D4)
7. “WHITE” CHECKED IN MULTIRACE CODE (S1Q1D5)
8. NUMBER OF CHILDREN EVER HAD, INCLUDING ADOPTIVE, STEP AND FOSTER CHILDREN (S1Q5A)
9. PRESENT SITUATION INCLUDES WORKING FULL TIME (35+ HOURS A WEEK) (S1Q7A1)
10. PERSONALLY RECEIVED FOOD STAMPS IN LAST 12 MONTHS (S1Q14A)

The syntax is as follows:

# -*- coding: UTF-8 -*-

#from pandas import Series, DataFrame
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans
import os

#Remember to replace file directory with the active folder
os.chdir("file directory")

#Load the dataset
data = pd.read_csv("nesarc_pds.csv", low_memory=False)

#upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)

data_clean = data.dropna()

#Clustering Variables
cluster = data_clean[['SEX','S1Q1C','S1Q1D1','S1Q1D2','S1Q1D3','S1Q1D4','S1Q1D5','S1Q5A','S1Q7A1','S1Q14A']]
cluster.describe()
#target = data_clean.S2AQ2

# standardize clustering variables to have mean=0 and sd=1
clustervar=cluster.copy()
clustervar['SEX']=preprocessing.scale(clustervar['SEX'].astype('float64'))
clustervar['S1Q1C']=preprocessing.scale(clustervar['S1Q1C'].astype('float64'))
clustervar['S1Q1D1']=preprocessing.scale(clustervar['S1Q1D1'].astype('float64'))
clustervar['S1Q1D2']=preprocessing.scale(clustervar['S1Q1D2'].astype('float64'))
clustervar['S1Q1D3']=preprocessing.scale(clustervar['S1Q1D3'].astype('float64'))
clustervar['S1Q1D4']=preprocessing.scale(clustervar['S1Q1D4'].astype('float64'))
clustervar['S1Q1D5']=preprocessing.scale(clustervar['S1Q1D5'].astype('float64'))
clustervar['S1Q5A']=preprocessing.scale(clustervar['S1Q5A'].astype('float64'))
clustervar['S1Q7A1']=preprocessing.scale(clustervar['S1Q7A1'].astype('float64'))
clustervar['S1Q14A']=preprocessing.scale(clustervar['S1Q14A'].astype('float64'))

# split data into train and test sets
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=111)

# k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
clusters=range(1,10)
meandist=[]

for k in clusters:
model=KMeans(n_clusters=k)
model.fit(clus_train)
clusassign=model.predict(clus_train)
meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
/ clus_train.shape[0])

#Plot
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
plt.show()

# 2 cluster solution
model2=KMeans(n_clusters=2)
model2.fit(clus_train)
clusassign=model2.predict(clus_train)

# plot clusters
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model2.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 2 Clusters')
plt.show()

#merge cluster assignment with clustering variables to examine cluster variable means by cluster

clus_train.reset_index(inplace=True)
cluslist=list(clus_train['index'])
labels=list(model2.labels_)
newlist=dict(zip(cluslist, labels))
newlist
newclus=DataFrame.from_dict(newlist, orient='index')
newclus
newclus.columns = ['cluster']

# now do the same for the cluster assignment variable
newclus.reset_index(inplace=True)
merged_train=pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
merged_train.cluster.value_counts()

# calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print ("Clustering variable means by cluster")
print(clustergrp)

# first have to merge ALC with clustering variables and cluster assignment data
alc_data=data_clean['S2AQ2']
# split GPA data into train and test sets
alc_train, alc_test = train_test_split(alc_data, test_size=.3, random_state=111)
alc_train1=pd.DataFrame(alc_train)
alc_train1.reset_index(inplace=True)
merged_train_all=pd.merge(alc_train1, merged_train, on='index')
sub1 = merged_train_all[['S2AQ2', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

alcmod = smf.ols(formula='S2AQ2 ~ C(cluster)', data=sub1).fit()
print (alcmod.summary())

print ('Means for Higher Alcohol Intake By Cluster')
m1= sub1.groupby('cluster').mean()
print (m1)

print ('standard deviations for ALC by cluster')
m2= sub1.groupby('cluster').std()
print (m2)

mc1 = multi.MultiComparison(sub1['S2AQ2'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())

A series of k-means cluster analyses were conducted on the training data specifying k=1-9 clusters, using Euclidean distance. The elbow curve demonstrates the variance in clustering variables and shows sharp turning point at 2 clusters, although 4-cluster-analysis is also suggested. For the purpose of simplicity, 2-cluster-analysis is obtained although the number of clusters should be determined by canonical discriminant analyses.

Screen Shot 2017-08-03 at 12.50.45 PM

The second graph is a plot of the first two canonical variables for the clustering variables by cluster. This shows that the clusters are distinct with little overlap.

The output is as follows:

cluster2

The clusters are named Cluster 0 and Cluster 1.

Cluster 0
Main characteristics include being white, having fewer children and being employed.

Cluster 1
Main characteristics include being black, having more children, being unemployed and receiving food stamps.

cluster1

The difference between the groups is determined to be significant and cluster 0 is associated with higher alcohol intake, having a lower number assigned according to the codebook.

As per the previous results, the cluster consisting of whites with fewer children, being employed and not having received food stamps suggests that it is the availability of resources that determines higher alcohol intake. However, having more alcohol does not indicate dependence and having 12 drinks in 12 months remains acceptable by societal standards.

Advertisements
This entry was posted in Blog/Updates, Software, Tutorials and tagged , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s