
Decision Trees – Machine Learning

Machine Learning Data Analysis

This is the start of the fourth course in my Data Analysis and Interpretation Specialization, offered by Wesleyan University through Coursera.

If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development, and that I am posing the general question of whether urbanization drives economic growth.

Now that I have started working, I do not have as much time, so for this course I decided to focus solely on Python instead of both Python and SAS as in the past. I am not abandoning SAS, but I will probably take the time to learn it after this course ends.

For this assignment, the goal is to create a decision tree that correctly classifies samples according to a binary, categorical response variable. For my response variable, I created a categorical variable from GDP per Capita 2007 with two levels: countries with a GDP per Capita 2007 of 10000 or lower are coded 0 (low), and countries above 10000 are coded 1 (high).
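As an aside, the same recoding can be written in one line of pandas; this is a compact equivalent of the apply-based helper function in the full code further below, assuming the same RegData frame and column names.

# one-line equivalent of the GDPCat helper used below:
# 1 where GDP per Capita 2007 is above 10000 (high), 0 otherwise (low)
RegData['GDPCat'] = (RegData['GDP2007'] > 10000).astype(int)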

When my test sample is set at 40%, the result is 58 test samples and 85 training samples out of 143 total, with 6 explanatory variables: Urban Population 2007, Urban Population Growth 2007, GDP Growth 2007, Population Growth 2007, Employment Rate 2007 and Energy Use 2007.

This is demonstrated in the output below:
pred_train.shape = (85, 6)
pred_test.shape  = (58, 6)
tar_train.shape  = (85,)
tar_test.shape   = (58,)

[Figure: decision tree output, picture_out1.png]

Classification Matrix:
[[41  1]
 [ 2 14]]

Accuracy Score = 0.948275862069

The classification matrix shows that the model classified 41 negatives and 14 positives correctly, with two false negatives and one false positive. The accuracy score of roughly 95% means that the model classified 95% of the test sample correctly as either a high-income or a low-income country. The model therefore fits the sample very well, though because the sample is quite small by most measures, there could be issues when the model is applied to a larger sample or over time. Of course, my current work is not longitudinal, which limits that kind of application.
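As a quick sanity check, the accuracy score can be recomputed by hand from the classification matrix: 41 + 14 correct classifications out of 58 test samples.

# sanity check: accuracy recomputed from the confusion-matrix counts above
correct = 41 + 14               # true negatives + true positives
total = 41 + 1 + 2 + 14         # all 58 test samples
print(float(correct) / total)   # 0.948275862069...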

Lastly, the decision tree seems to suggest a relationship between energy use, GDP growth, and employment rate as predictors of GDP per Capita, rather than urbanization variables such as urban population and urban population growth. However, energy use is directly related to the degree of urbanization, so there may be hidden relationships not on display here.
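One way to make that reading more concrete is to check the tree's feature importances, which summarize how much each variable contributes to the splits. This is a minimal sketch, assuming the fitted classifier and the predictor column names from the code below:

# rank the explanatory variables by their contribution to the tree's splits (Gini importance)
import pandas
feature_names = ['UrbanPop2007', 'UrbanPopGrowth2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']
importances = pandas.Series(classifier.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))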

This is my code in Python:

from pandas import Series, DataFrame
import pandas
import numpy
import os
import matplotlib.pyplot as plt

# note: train_test_split moved to sklearn.model_selection in newer scikit-learn releases
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

import sklearn.metrics

# set the working directory and load the dataset
os.chdir(r"C:\Users\William Hsu\Desktop\www.iamlliw.com\Data Analysis Course\Python")

urbandata = pandas.read_csv('Data1.csv', low_memory=False)

# treat zeros as missing values
urbandata = urbandata.replace(0, numpy.nan)

# keep only the variables used in this analysis and drop rows with missing values
RegData = urbandata[['Country', 'UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]
#RegData['UrbanPop2010'] = RegData['UrbanPop2010'] - RegData['UrbanPop2010'].mean()
RegData = RegData.dropna()

# code GDP per Capita 2007 into two levels: 0 (low, <= 10000) and 1 (high, > 10000)
def GDPCat(row):
    if row['GDP2007'] <= 10000:
        return 0
    elif row['GDP2007'] > 10000:
        return 1

RegData['GDPCat'] = RegData.apply(lambda row: GDPCat(row), axis=1)

print(RegData.dtypes)
print(RegData.describe())

predictors = RegData[['UrbanPop2007', 'UrbanPopGrowth2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]

targets = RegData.GDPCat

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

print (pred_train.shape)
print (pred_test.shape)
print (tar_train.shape)
print (tar_test.shape)

# build the decision tree model on the training sample
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train,tar_train)

# evaluate the fitted tree on the test sample
predictions = classifier.predict(pred_test)

print (sklearn.metrics.confusion_matrix(tar_test, predictions))

print (sklearn.metrics.accuracy_score(tar_test, predictions))

from sklearn import tree

from io import StringIO

from IPython.display import Image
# export the fitted tree to Graphviz DOT format
out = StringIO()
tree.export_graphviz(classifier, feature_names=('UrbanPop2007', 'UrbanPopGrowth2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007'), out_file=out)

import pydotplus
graph=pydotplus.graph_from_dot_data(out.getvalue())
Image(graph.create_png())

# save the rendered tree to a PNG file
with open('picture_out1.png', 'wb') as f:
    f.write(graph.create_png())
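Given the concern above about the small sample, a quick way to see how stable that roughly 95% accuracy is would be cross-validation. This is only a rough sketch, not part of the assignment, reusing the predictors and targets defined in the script above (newer scikit-learn releases provide cross_val_score in sklearn.model_selection rather than sklearn.cross_validation):

# rough stability check: 5-fold cross-validation of the same decision tree
from sklearn.cross_validation import cross_val_score

cv_scores = cross_val_score(DecisionTreeClassifier(), predictors, targets, cv=5)
print(cv_scores)           # accuracy on each fold
print(cv_scores.mean())    # average accuracy across the folds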
