This is the start of the fourth course of my Data Analysis and Interpretation Specialization by Wesleyan University through Coursera.
If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development, and that I am posing the general question of whether urbanization drives economic growth.
Now that I have started working, I do not have as much time, so for this course I decided to focus solely on Python instead of both Python and SAS as in the past. I am not abandoning SAS; I will probably take the time to return to it after this course ends.
For this assignment, the goal is to create a decision tree that correctly classifies samples according to a binary categorical response variable. For my response variable, I created a categorical variable from GDP per Capita 2007, split into two levels: countries where GDP per Capita 2007 is at or below 10,000 are coded 0 (low), and countries where it is above 10,000 are coded 1 (high).
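As a minimal sketch of this binning step (the toy values below are my own, not the course dataset), a vectorized pandas comparison produces the same 0/1 coding as an explicit if/else function:

```python
import pandas as pd

# Hypothetical GDP per capita values for illustration only
gdp = pd.Series([1500, 42000, 9800, 10001], name='GDP2007')

# True where GDP per capita exceeds 10000, cast to 1 (high) / 0 (low)
gdp_cat = (gdp > 10000).astype(int)
print(gdp_cat.tolist())  # [0, 1, 0, 1]
```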
When my test sample is set at 40%, the result is 58 test samples and 85 training samples out of 143 total, with 6 explanatory variables: Urban Population 2007, Urban Population Growth 2007, GDP Growth 2007, Population Growth 2007, Employment Rate 2007 and Energy Use 2007.
This is demonstrated in the output below:
pred_train.shape = (85, 6)
pred_test.shape = (58, 6)
tar_train.shape = (85,)
tar_test.shape = (58,)
[[41  1]
 [ 2 14]]
Accuracy Score = 0.948275862069
The confusion matrix showed that the model classified 41 negatives and 14 positives correctly, with two false negatives and one false positive. The accuracy score of 95% means that the model classified 95% of the test samples correctly as either a high-income or a low-income country. The model, then, fits the sample very well, though because this is a very small sample by most measures, there could be issues when the model is applied to a larger sample or over time. Of course, my current work is not longitudinal, which limits its application.
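As a quick arithmetic check, the reported accuracy follows directly from the confusion-matrix counts above (41 true negatives, 1 false positive, 2 false negatives, 14 true positives):

```python
# Accuracy = correct classifications / all test samples
tn, fp, fn, tp = 41, 1, 2, 14
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(round(accuracy, 12))  # 0.948275862069
```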
Lastly, the decision tree seems to suggest a relationship between energy use, GDP growth, and employment rate as predictors of GDP per Capita, rather than urbanization variables such as urban population and urban population growth. However, energy use is directly related to the degree of urbanization, so there may be hidden relationships not on display here.
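One way to check which predictors drive the tree's splits is the classifier's feature_importances_ attribute. A minimal sketch on synthetic data (the toy predictors below are my own, not the course dataset): the target depends only on the third feature, so that feature should receive all the importance.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 3)             # three toy predictors
y = (X[:, 2] > 0.5).astype(int)  # target depends only on the third one

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Importances sum to 1; the informative feature should dominate
print(clf.feature_importances_)
```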
This is my code in Python:
import os
import numpy
import pandas
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
import sklearn.metrics
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
os.chdir(r"C:\Users\William Hsu\Desktop\www.iamlliw.com\Data Analysis Course\Python")
urbandata = pandas.read_csv('Data1.csv', low_memory=False)
urbandata = urbandata.replace(0, numpy.nan)
RegData = urbandata[['Country', 'UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]
#RegData['UrbanPop2010'] = RegData['UrbanPop2010'] - RegData['UrbanPop2010'].mean()
RegData = RegData.dropna()
# Code GDP per capita as 0 (low, at or below 10000) or 1 (high, above 10000)
def GDPCat(row):
    if row['GDP2007'] <= 10000:
        return 0
    elif row['GDP2007'] > 10000:
        return 1
RegData['GDPCat'] = RegData.apply(lambda row: GDPCat(row), axis=1)
predictors = RegData[['UrbanPop2007', 'UrbanPopGrowth2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]
targets = RegData.GDPCat
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
#Build Model on Test Sample
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train,tar_train)
predictions = classifier.predict(pred_test)
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))
from sklearn import tree
from io import StringIO
from IPython.display import Image
out = StringIO()
tree.export_graphviz(classifier, feature_names=['UrbanPop2007', 'UrbanPopGrowth2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007'], out_file=out)
import pydotplus  # assumes pydotplus is installed to render the dot output
graph = pydotplus.graph_from_dot_data(out.getvalue())
with open('picture_out1.png', 'wb') as f:
    f.write(graph.create_png())