# k-Means Cluster Analysis – Machine Learning


This is the final lesson of the fourth course in the Data Analysis and Interpretation Specialization, offered by Wesleyan University through Coursera.

If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development, and that I am posing the general question: does urbanization drive economic growth?

For this assignment, the goal is to run a k-means cluster analysis using the variables Urban Population, Urban Population Growth, GDP Growth, Population Growth, Employment Rate, and Energy Use per Capita in 2007. GDP per Capita in 2007 is used as the validation variable. I am trying to identify whether there are clusters of characteristics associated with certain values of GDP per Capita, based on national data from 2007.

As before, the data is split into 70% training data and 30% test data; the k-means cluster analysis is run only on the training set.
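A minimal sketch of this split-then-cluster step, using synthetic data in place of the standardized country indicators (all names and values here are illustrative, not the original dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

# Synthetic stand-in for 150 countries x 6 standardized clustering variables.
rng = np.random.default_rng(123)
X = rng.normal(size=(150, 6))

# 70% training / 30% test split, mirroring the write-up.
X_train, X_test = train_test_split(X, test_size=0.3, random_state=123)

# k-means is fit on the training portion only.
model = KMeans(n_clusters=2, random_state=123, n_init=10)
model.fit(X_train)
print(X_train.shape, X_test.shape, model.cluster_centers_.shape)
```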

The elbow curve shows that 2, 3, or 4 clusters could be interpreted, though it is inconclusive. I decided to analyze 2 clusters, because the greatest change in the elbow curve occurs at k = 2. Comparing the scatterplots of the 3-cluster and 2-cluster solutions, the 3-cluster solution shows noticeably less separation between clusters. With 2 clusters, the second cluster is much more spread out, but the first cluster is fairly tightly bunched and shows little overlap; the second cluster has much more in-cluster variance.

Clustering Variable Means by Cluster

| cluster | index | UrbanPop2007 | UrbanPopGrowth2007 | GDP2007 | GDPGrowth2007 | PopGrow2007 | Employment2007 | Energy2007 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 128.867470 | -0.209996 | -0.106216 | 3622.924772 | 0.070663 | -0.136930 | -0.154707 | -0.271868 |
| 1 | 114.470588 | 1.021819 | 0.134405 | 29775.197180 | -0.431318 | 0.309997 | 0.117959 | 1.271157 |
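The 2-D scatterplots mentioned above project the standardized clustering variables onto their first two principal components so the cluster assignments can be visualized. A minimal, self-contained sketch with synthetic data (the array here is a stand-in for the real training set):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardized clustering variables.
rng = np.random.default_rng(123)
X = rng.normal(size=(100, 6))

# Fit the 2-cluster solution, then project onto two principal
# components so the cluster labels can be plotted in 2-D.
model = KMeans(n_clusters=2, random_state=123, n_init=10).fit(X)
coords = PCA(n_components=2).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], c=model.labels_)
plt.xlabel("Canonical variable 1")
plt.ylabel("Canonical variable 2")
plt.title("Scatterplot of Canonical Variables for 2 Clusters")
print(coords.shape)
```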

Looking at the clustering variable means, there are clear differences between the two clusters. Once again, Urban Population and Energy Use per Capita appear to have the strongest associations with GDP per Capita. The first cluster has low Urban Population and Energy per Capita means associated with a very low mean GDP per Capita, while the second cluster has high Urban Population and Energy per Capita means associated with a comparatively high mean GDP per Capita. This suggests that countries that are highly urbanized and consume more energy tend to have higher GDP per Capita, which is logical considering that the most technologically advanced nations tend to be richer, more urbanized, and much more energy-intensive in their industry. The question then becomes: do high urbanization and high energy consumption result in higher GDP per Capita, or are they simply characteristics of nations with higher GDP per Capita?

To externally validate the clusters, an ANOVA was run and demonstrated a significant difference between the two clusters: the p-value was <0.0001, and the cluster means were very different (3622.92 and 29775.20). However, the standard deviations were large, reflecting the high in-cluster variance of both clusters. This may well be a result of the small sample size: with the training set at 70% of the data, only about 100 observations are available.
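This validation step amounts to a one-way ANOVA of the validation variable on the cluster labels. A sketch with made-up numbers that echo the pattern reported above (in the real analysis, `sub1` comes from the merged training data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical GDP-per-capita values for two clusters with very
# different means; illustrative only, not the real country data.
rng = np.random.default_rng(0)
gdp = np.concatenate([rng.normal(3600, 2000, 60),     # low-GDP cluster
                      rng.normal(29800, 15000, 40)])  # high-GDP cluster
sub1 = pd.DataFrame({"GDP2007": gdp, "cluster": [0] * 60 + [1] * 40})

# One-way ANOVA: does mean GDP per capita differ across the clusters?
anova_mod = smf.ols("GDP2007 ~ C(cluster)", data=sub1).fit()
print(anova_mod.f_pvalue)
print(sub1.groupby("cluster")["GDP2007"].agg(["mean", "std"]))
```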

This is my code in Python:

```python
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
from sklearn import preprocessing
from sklearn.cluster import KMeans

os.chdir(r"C:\Users\William Hsu\Desktop\www.iamlliw.com\Data Analysis Course\Python")

urbandata = pd.read_csv("urbandata.csv")  # file name is a placeholder for the country-level dataset
urbandata = urbandata.replace(0, np.nan)

RegData = urbandata[['UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007',
                     'PopGrow2007', 'Employment2007', 'Energy2007']]
#RegData['UrbanPop2010'] = RegData['UrbanPop2010'] - RegData['UrbanPop2010'].mean()
RegData = RegData.dropna()

Data = RegData.copy()
Data.describe()

# Standardize the clustering variables to mean 0 and standard deviation 1.
# GDP2007 is left on its original scale; it serves as the validation variable.
for col in ['UrbanPop2007', 'UrbanPopGrowth2007', 'GDPGrowth2007',
            'PopGrow2007', 'Employment2007', 'Energy2007']:
    Data[col] = preprocessing.scale(Data[col].astype('float64'))

print(Data.describe())

# 70% training / 30% test split
clus_train, clus_test = train_test_split(Data, test_size=.3, random_state=123)

# Elbow method: average within-cluster distance for k = 1..9
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])

plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')
plt.show()

# 3-cluster solution
model3 = KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)

# Plot the clusters on the first two canonical (principal) components
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.figure(figsize=(12, 8))
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()

# 2-cluster solution
model2 = KMeans(n_clusters=2)
model2.fit(clus_train)
clusassign = model2.predict(clus_train)

plot_columns = pca_2.fit_transform(clus_train)
plt.figure(figsize=(12, 8))
plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model2.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 2 Clusters')
plt.show()

# Attach the 2-cluster assignments back to the training observations
clus_train.reset_index(level=0, inplace=True)
cluslist = list(clus_train['index'])
labels = list(model2.labels_)
newlist = dict(zip(cluslist, labels))
newclus = DataFrame.from_dict(newlist, orient='index')
newclus.columns = ['cluster']

newclus.reset_index(level=0, inplace=True)
merged_train = pd.merge(clus_train, newclus, on='index')
merged_train.cluster.value_counts()

clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)

# External validation: ANOVA of GDP per capita across the clusters
sub1 = merged_train[['GDP2007', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

GDPmod = smf.ols(formula='GDP2007 ~ C(cluster)', data=sub1).fit()
print(GDPmod.summary())

print('means for GDP by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

print('standard deviations for GDP by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

mc1 = multi.MultiComparison(sub1['GDP2007'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
```