*Machine Learning for Data Analysis*

*This is the second lesson of the fourth course of my Data Analysis and Interpretation Specialization by Wesleyan University through Coursera.*

*If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development, and that I am posing the general question of whether urbanization drives economic growth.*

For this assignment, the goal is to create a random forest that measures the relative importance of my explanatory variables: Urban Population, Urban Population Growth, GDP Growth, Population Growth, Employment Rate, and Energy Use per Capita in 2007. For my response variable, I created a categorical variable from GDP per Capita 2007 with two levels: countries where GDP per Capita 2007 is below 10000 are coded 0 (low), and countries where it is above 10000 are coded 1 (high).
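The recode can be sketched with a simple threshold. Here is a minimal example using pandas with made-up GDP values for illustration only (my actual code below applies an equivalent function row by row):

```python
import numpy as np
import pandas as pd

# made-up GDP per capita values, for illustration only
df = pd.DataFrame({'GDP2007': [2500.0, 48000.0, 9999.0, 10001.0]})

# 0 = low (<= 10000), 1 = high (> 10000)
df['GDPCat'] = np.where(df['GDP2007'] > 10000, 1, 0)
print(df['GDPCat'].tolist())
```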

Just as in the last assignment, when my test sample is set at 40%, the result is 58 test samples and 85 training samples out of 143 total, with 6 explanatory variables: Urban Population 2007, Urban Population Growth 2007, GDP Growth 2007, Population Growth 2007, Employment Rate 2007 and Energy Use 2007.

This is demonstrated in the output below:

pred_train.shape = (85, 6)

pred_test.shape = (58, 6)

tar_train.shape = (85,)

tar_test.shape = (58,)
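The 58/85 split follows from rounding the 40% test fraction up: ceil(0.4 × 143) = 58 test samples, leaving 85 for training. A minimal numpy sketch of the shuffle-and-split idea (the seed is arbitrary and for illustration only):

```python
import numpy as np

n = 143                          # total observations in the dataset
rng = np.random.default_rng(0)   # arbitrary seed, for illustration
idx = rng.permutation(n)

n_test = int(np.ceil(n * 0.4))   # the test fraction is rounded up: 58
test_idx, train_idx = idx[:n_test], idx[n_test:]
print(len(train_idx), len(test_idx))
```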

Classification Matrix

[41 3]

[ 4 10]

Accuracy Score = 0.879310344828

The classification matrix shows that the model classified 41 negatives and 10 positives correctly, with four false negatives and three false positives. The accuracy score of 88% means that the model correctly classified 88% of the test sample as either high-income or low-income countries.
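These figures can be checked directly from the confusion matrix. A quick sketch, using the numbers from the output above:

```python
import numpy as np

# confusion matrix from the output above: rows = actual, cols = predicted
cm = np.array([[41, 3],    # actual low:  41 true negatives, 3 false positives
               [4, 10]])   # actual high: 4 false negatives, 10 true positives

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tn + tp) / cm.sum()   # (41 + 10) / 58
print(round(accuracy, 3))
```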

Measure of Importance of Explanatory Variables (Importance Scores)

Urban Population = 0.26979787

Urban Population Growth = 0.10315226

GDP Growth = 0.14031019

Population Growth = 0.04450653

Employment Rate = 0.07961939

Energy Use per Capita = 0.36261374

From these measures, the random forest results show that Energy Use per Capita is actually the most important variable in predicting GDP per Capita, while Population Growth is the least important. Urban Population ranks second, which lends some support to the idea that urban agglomeration drives economic growth.
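The raw importance array is easier to read when paired with the predictor names and sorted. A minimal sketch using the scores reported above:

```python
# importance scores as reported above, paired with their predictors
names = ['UrbanPop2007', 'UrbanPopGrowth2007', 'GDPGrowth2007',
         'PopGrow2007', 'Employment2007', 'Energy2007']
scores = [0.26979787, 0.10315226, 0.14031019,
          0.04450653, 0.07961939, 0.36261374]

# sort predictors from most to least important
ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f'{name}: {score:.3f}')
```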

Finally, let’s look at how many trees are needed to generate a reasonably accurate result:

The graph shows that accuracy climbs to a maximum of about 95% at ten trees, then falls back to between 90% and 93% as further trees are added. This means that, with the current data and parameters, at least ten decision trees need to be generated for the best model.
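The best tree count can also be read off the accuracy curve programmatically with argmax. A minimal sketch with made-up accuracy values for illustration (the real values come from the loop in the code below):

```python
import numpy as np

# hypothetical accuracies for models with 1..5 trees, for illustration only
accuracy = np.array([0.84, 0.88, 0.90, 0.95, 0.93])

best_n = int(np.argmax(accuracy)) + 1   # +1: index 0 holds the 1-tree model
print(best_n)
```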

**This is my code in Python:**

from pandas import Series, DataFrame

import pandas

import numpy

import os

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import classification_report

import sklearn.metrics

from sklearn import datasets

from sklearn.ensemble import ExtraTreesClassifier

os.chdir("C:\\Users\\William Hsu\\Desktop\\www.iamlliw.com\\Data Analysis Course\\Python")

urbandata = pandas.read_csv('Data1.csv', low_memory=False)

urbandata = urbandata.replace(0, numpy.nan)

RegData = urbandata[['Country', 'UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]

#RegData['UrbanPop2010'] = RegData['UrbanPop2010'] - RegData['UrbanPop2010'].mean()

RegData = RegData.dropna()

def GDPCat(row):
    if row['GDP2007'] <= 10000:
        return 0
    elif row['GDP2007'] > 10000:
        return 1

RegData['GDPCat'] = RegData.apply(lambda row: GDPCat(row), axis=1)

RegData.dtypes

print (RegData.describe())

predictors = RegData[['UrbanPop2007', 'UrbanPopGrowth2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]

targets = RegData.GDPCat

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

print (pred_train.shape)

print (pred_test.shape)

print (tar_train.shape)

print (tar_test.shape)

#Build Model on Test Sample

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=25)

classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

print (sklearn.metrics.confusion_matrix(tar_test, predictions))

print (sklearn.metrics.accuracy_score(tar_test, predictions))

model = ExtraTreesClassifier()

model.fit(pred_train, tar_train)

print(model.feature_importances_)

trees = range(1, 26)   # tree counts actually fitted (n_estimators = idx+1 below)

accuracy = numpy.zeros(25)

for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx+1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()

plt.plot(trees, accuracy)

plt.show()