Random Forests – Machine Learning

Machine Learning Data Analysis

This is the second lesson of the fourth course of my Data Analysis and Interpretation Specialization by Wesleyan University through Coursera.

If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development, and that my general question is whether urbanization drives economic growth.

For this assignment, the goal is to create a random forest that ranks the importance of my explanatory variables: Urban Population, Urban Population Growth, GDP Growth, Population Growth, Employment Rate, and Energy Use per Capita, all measured in 2007. For my response variable, I created a binary categorical variable from GDP per Capita 2007: countries with a GDP per Capita of 10000 or less are coded 0 (low), and countries with a GDP per Capita above 10000 are coded 1 (high).
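The same recoding can be written as a single vectorized comparison. This is only a sketch: the four rows are made-up values for illustration, not rows from my data set, and the column names simply mirror the ones used in my code further down.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from Data1.csv.
sample = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "GDP2007": [850.0, 10000.0, 10000.01, 42000.0],
})

THRESHOLD = 10000  # cutoff between "low" and "high" used in this post

# The comparison yields True/False, which astype(int) turns into 1/0.
# A value exactly at the threshold is coded 0 (low).
sample["GDPCat"] = (sample["GDP2007"] > THRESHOLD).astype(int)

print(sample)
```

The point of the boundary rows is to show that a country sitting exactly at 10000 falls into the low group.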

Just as in the last assignment, with the test sample set at 40%, the split yields 58 test samples and 85 training samples out of 143 total, each with 6 explanatory variables: Urban Population 2007, Urban Population Growth 2007, GDP Growth 2007, Population Growth 2007, Employment Rate 2007 and Energy Use per Capita 2007.
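The 58/85 split follows from the arithmetic if the test count is rounded up, which is my understanding of how scikit-learn handles a fractional test_size. The helper below, split_sizes, is a hypothetical name of my own and not a library function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test count is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(143, 0.4)
print(n_train, n_test)
```

Since 40% of 143 is 57.2, rounding up gives 58 test rows and leaves 85 for training, which matches the shapes printed in the output.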

This is demonstrated in the output below:
pred_train.shape = (85, 6)
pred_test.shape  = (58, 6)
tar_train.shape  = (85,)
tar_test.shape   = (58,)

Classification Matrix
[41    3]
[ 4   10]

Accuracy Score = 0.879310344828

The classification matrix shows that the model correctly classified 41 negatives (low income) and 10 positives (high income), with four false negatives and three false positives. The accuracy score of 88% means that the model correctly classified 88% of the test sample (51 of 58 countries) as either a high income or a low income country.
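To make the link between the matrix and the accuracy score explicit, here is a small sketch that recomputes it by hand. It assumes the scikit-learn layout, where rows are the true classes and columns are the predicted classes. Sensitivity and specificity are extra measures I did not report above; they are included only to show how the errors split across the two classes.

```python
# Classification matrix from the output above.
# Rows are true classes (0 = low, 1 = high); columns are predicted classes.
matrix = [[41, 3],
          [4, 10]]

tn, fp = matrix[0]  # true negatives, false positives
fn, tp = matrix[1]  # false negatives, true positives
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
sensitivity = tp / (tp + fn)  # share of high income countries found
specificity = tn / (tn + fp)  # share of low income countries found

print(f"accuracy    = {accuracy:.3f}")
print(f"sensitivity = {sensitivity:.3f}")
print(f"specificity = {specificity:.3f}")
```

The model is noticeably better at recognizing low income countries than high income ones, which is not visible from the accuracy score alone.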

Measure of Importance of Explanatory Variables (Importance Scores)
Urban Population        = 0.26979787
Urban Population Growth = 0.10315226
GDP Growth              = 0.14031019
Population Growth       = 0.04450653
Employment Rate         = 0.07961939
Energy Use per Capita   = 0.36261374

From these measures, Energy Use per Capita is the most important variable in predicting GDP per Capita, and Population Growth is the least important. Urban Population is second in importance, which provides some support for the idea that urban agglomerations drive economic growth. Note that in the code below, these scores come from an ExtraTreesClassifier fitted on the training sample rather than from the random forest itself.
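The ranking is easier to read when the scores are sorted. The sketch below hard-codes the scores from the output above and pairs them with the variable labels by hand, in the same order as the predictor columns.

```python
# Importance scores copied from the output above.
importances = {
    "Urban Population": 0.26979787,
    "Urban Population Growth": 0.10315226,
    "GDP Growth": 0.14031019,
    "Population Growth": 0.04450653,
    "Employment Rate": 0.07961939,
    "Energy Use per Capita": 0.36261374,
}

# Sort from most to least important.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(f"{name:<24} {score:.3f}")

# The scores are relative shares, so they should add up to roughly 1.
total = sum(importances.values())
print(f"total = {total:.4f}")
```

Because the scores sum to about 1, each one can be read as a share: Energy Use per Capita and Urban Population together account for roughly 63% of the total importance.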

Finally, let’s look at the number of trees needed to generate a reasonably accurate result:

RandomForest.png

The graph shows that accuracy climbs to a maximum of 95% at ten trees, then falls to between 90% and 93% as more trees are added. This means that with the current data and parameters, at least ten decision trees need to be generated for the best model.
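Because both the train/test split and the forest are random, a curve like this changes from run to run. One way to steady it, which I did not do for this assignment, is to repeat the experiment over several seeds and average the accuracy. The sketch below uses synthetic data from make_classification as a stand-in for my data set, and the choice of five seeds is arbitrary, so the numbers it prints say nothing about the real results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as my data: 143 rows, 6 predictors.
X, y = make_classification(n_samples=143, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)

max_trees = 25
seeds = range(5)  # arbitrary number of repeats, chosen for illustration
accuracy = np.zeros((len(seeds), max_trees))

for s, seed in enumerate(seeds):
    # A fixed random_state makes each split and each forest repeatable.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=seed)
    for n in range(1, max_trees + 1):
        forest = RandomForestClassifier(n_estimators=n, random_state=seed)
        forest.fit(X_train, y_train)
        accuracy[s, n - 1] = accuracy_score(y_test, forest.predict(X_test))

# Average over the seeds: one value per forest size from 1 to 25 trees.
mean_accuracy = accuracy.mean(axis=0)
print(mean_accuracy.round(3))
```

Plotting mean_accuracy against the number of trees should give a smoother curve than a single run, which makes it easier to judge where adding trees stops helping.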

This is my code in Python:

import os

import matplotlib.pyplot as plt
import numpy
import pandas
import sklearn.metrics
from sklearn.ensemble import ExtraTreesClassifier
# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module has been removed from scikit-learn.
from sklearn.model_selection import train_test_split

# Raw string so the backslashes in the Windows path are not read as escapes
os.chdir(r"C:\Users\William Hsu\Desktop\www.iamlliw.com\Data Analysis Course\Python")

urbandata = pandas.read_csv('Data1.csv', low_memory=False)

urbandata = urbandata.replace(0, numpy.nan)

RegData = urbandata[['Country', 'UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]
#RegData['UrbanPop2010'] = RegData['UrbanPop2010'] - RegData['UrbanPop2010'].mean()
RegData = RegData.dropna()

def GDPCat(row):
    # 0 = low income (GDP per capita <= 10000), 1 = high income (> 10000)
    if row['GDP2007'] <= 10000:
        return 0
    elif row['GDP2007'] > 10000:
        return 1

RegData['GDPCat'] = RegData.apply(lambda row: GDPCat(row), axis=1)

print (RegData.dtypes)
print (RegData.describe())

predictors = RegData[['UrbanPop2007', 'UrbanPopGrowth2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]

targets = RegData.GDPCat

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

print (pred_train.shape)
print (pred_test.shape)
print (tar_train.shape)
print (tar_test.shape)

#Build Model on Training Sample, then predict on Test Sample
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

print (sklearn.metrics.confusion_matrix(tar_test, predictions))

print (sklearn.metrics.accuracy_score(tar_test, predictions))

#Fit an extra-trees ensemble to get the importance scores
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)

print(model.feature_importances_)

#Fit forests with 1 to 25 trees and record the test accuracy of each
trees = range(1, 26)  # x-axis values match n_estimators (1 to 25)
accuracy = numpy.zeros(25)

for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx + 1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)
