Again, for those who have not read my previous posts, I am currently enrolled in a Data Analysis Specialization with Wesleyan University through Coursera. With data from Gapminder, I am exploring a broad and basic question: does urbanization drive economic growth? For those of you interested in reading my literature review to gain a background on this project, please visit this page.
Even though I am learning both SAS and Python, I will only be presenting my work in SAS here. (Python code included as reference).The reason is that at the moment, for my current abilities, SAS produces output that is easier to present.
For this last assignment, I will be presenting my data in visual form – through graphing variables individuals and to relate my explanatory variable (urbanization rate) with my response variable (GDP growth rate).
Before I present my work, I must admit to a mistake I made in the previous assignments. As I am compiling my own data downloaded from Gapminder instead of the data provided by the course, I needed to ensure that the raw data was compiled correctly. Yet, I only realized that data was lined up incorrectly when I started this assignment. Each of the variables used a different list of countries, so when I copied and pasted the data in Excel, I had actually matched up data with the wrong countries. The only variable that held correct data matched with the right countries was GDP per Capita in 2010. However, this was a great opportunity for me to learn how to merge data in Excel from different worksheets. Turns out, all I needed to use was the VLOOKUP function. I believe that learning cannot occur without application and making mistakes – there is no learning if everything is perfect.
Well, after fixing my raw data, everything ran smoothly. I actually acquired more data points, instead of 126 observations, I now have 164 to work with.
This is my code in SAS:
FILENAME REFFILE “/home/wfhsu.taiwan0/my_courses/Data1.xlsx” TERMSTR=CR;
PROC IMPORT DATAFILE=REFFILE
PROC CONTENTS DATA=Gapminder2010; RUN;
/*to importat data from Excel file upload*/
LIBNAME mydata “/saswork/SAS_work2EC30000E95E_odaws04-prod-us/SAS_work9F7B0000E95E_odaws04-prod-us ” access=readonly;
/*LIBNAME describes where the file is located = look in which folder for data*/
DATA new; set Gapminder2010 ;
/*DATA names a new data set for the program to use and which data set in the folder to use as raw data*/
LABEL ‘GDP per Capita 2010’n=”GDP2010″;
LABEL ‘GDP Growth 2010’n=”GDPGrowth2010″;
LABEL ‘Urban Population 2010’n=”UrbanPop2010″;
LABEL ‘Urban Pop Growth’n=”UrbanPopGrowth2010″;
LABEL ‘Pop in Large Cities’n=”UrbanAgg2007″;
IF GDP2010=”0″ THEN GDP2010=”.” ;
IF GDPGrowth2010=”0″ THEN GDPGrowth2010=”.” ;
IF UrbanPop2010=”0″ THEN UrbanPop2010=”.” ;
IF UrbanPopGrowth2010=”0″ THEN UrbanPopGrowth2010=”.” ;
/*Statements remove 0 from data – there are no 0’s in source data*/
IF GDP2010 ne . ;
IF GDPGrowth2010 ne . ;
IF UrbanPop2010 ne . ;
IF UrbanPopGrowth2010 ;
/*Statements remove empty or missing data from table*/
PROC SORT; by country;
PROC PRINT; VAR country GDP2010 GDPGrowth2010 UrbanPop2010 UrbanPopGrowth2010;
/*Print function to show data for listed variables*/
PROC FREQ; TABLES GDP2010 GDPGrowth2010 UrbanPop2010 UrbanPopGrowth2010 UrbanAgg2007;
PROC GCHART; VBAR GDP2010 ;
PROC GCHART; VBAR GDPGrowth2010 ;
PROC GCHART; VBAR UrbanPop2010 ;
PROC GCHART; VBAR UrbanPopGrowth2010 ;
PROC GCHART; VBAR UrbanAgg2007 ;
var GDP2010 GDPGrowth2010 UrbanPop2010 UrbanPopGrowth2010 UrbanAgg2007;
PROC GPLOT; PLOT GDPGrowth2010*UrbanPopGrowth2010;
The code seems much cleaner – another benefit of cleaning up the raw data.
This is my code in Python:
import matplotlib.pyplot as plt
gapminder = pandas.read_csv(‘Data1.csv’, low_memory=False)
gapminder[‘GDP2010’] = pandas.to_numeric(gapminder[‘GDP2010’])
gapminder[‘GDPGrowth2010’] = pandas.to_numeric(gapminder[‘GDPGrowth2010’])
gapminder[‘UrbanPop2010’] = pandas.to_numeric(gapminder[‘UrbanPop2010’])
gapminder[‘UrbanPopGrowth2010’] = pandas.to_numeric(gapminder[‘UrbanPopGrowth2010’])
gapminder[‘UrbanAgg2007’] = pandas.to_numeric(gapminder[‘UrbanAgg2007’])
#replace 0 with NaN values, and dropna() removes rows with any NaN values
gapminder[‘GDP2010’] = gapminder[‘GDP2010’].replace(0,numpy.nan)
gapminder[‘GDPGrowth2010’] = gapminder[‘GDPGrowth2010’].replace(0,numpy.nan)
gapminder[‘UrbanPop2010’] = gapminder[‘UrbanPop2010’].replace(0,numpy.nan)
gapminder[‘UrbanPopGrowth2010’] = gapminder[‘UrbanPopGrowth2010’].replace(0,numpy.nan)
gapminder = gapminder.dropna(subset = [‘GDP2010′,’GDPGrowth2010′,’UrbanPop2010′,’UrbanPopGrowth2010’])
gapminder = gapminder[[‘Country’,’GDP2010′, ‘GDPGrowth2010’, ‘UrbanPop2010’, ‘UrbanPopGrowth2010’, ‘UrbanAgg2007’]]
GDPDes = gapminder[‘GDP2010’].describe()
GDPGrowthDes = gapminder[‘GDPGrowth2010’].describe()
UrbanPopDes = gapminder[‘UrbanPop2010’].describe()
UrbanGrowthDes = gapminder[‘UrbanPopGrowth2010’].describe()
plt.title(‘Distribution of GDP in 2010’)
plt.xlabel(‘GDP Growth 2010’)
plt.title(‘Distribution of GDP Growth in 2010’)
seaborn.distplot(gapminder[‘UrbanPop2010’].dropna(), bins=16, kde=False)
plt.xlabel(‘Urban Population 2010’)
plt.title(‘Distribution of Urban Population in 2010’)
plt.xlabel(‘Urban Population Growth 2010’)
plt.title(‘Distribution of Urban Population Growth in 2010’)
plt.xlabel(‘Urban Population in Large Cities 2007’)
plt.title(‘Distribution of Urban Population in Large Cities in 2007′)
seaborn.regplot(x=’UrbanPopGrowth2010′, y=’GDPGrowth2010’, fit_reg=False, data=gapminder)
plt.xlabel(‘Urban Population Growth 2010’)
plt.ylabel(‘GDP Growth 2010’)
plt.title(‘Urbanization and Economic Growth’)
From the graphs, GDP growth rates and urban population growth rates demonstrate a normal distribution, with both skewed to the left, meaning countries tend to cluster closer to a lower rate of growth.
The GDP growth statistics seemed to be skewed by an outlier at 23.39% growth in 2010. This was nearly double the next highest growth rate at 13.05%. Surprisingly, this fastest growing economy was Macau, China. Like Hong Kong, it is not technically a country, but a special administrative region with a semi-autonomous political structure. It was set up to re-integrate the former colony back into China.
Mean = 2.86
Median = 2.59
SD = 3.54
Range = 29.70
Urban Population Growth
Mean = 2.03
Median = 1.85
SD = 1.59
Range = 9.84
Looking at the GDP variable, it is not surprising to see it as a unimodal graph skewed far to the left. The fact has been known for a long time that most of the world live in economic conditions unseen in the Western world for much of the past century. This can be confirmed by looking at the median value of 2228.06 and the mean of 7541.82.
Mean = 7541.82 USD (2000 value)
Median = 2228.06
SD = 11523.35
Range = 62931
On the other hand, urban population, appear to be a bimodal graph skewed to the right. This is also not surprising as most of the world now live in urbanized areas.
Mean = 55.75% (expressed as percent of total population)
Median = 55.85
SD = 23.56
Range = 89.35
Lastly, because I am interested in the relationship between urbanization rate and economic growth, I plotted GDP growth rate against urban population growth rate. The resulting scatter plot seems to suggest a positive relationship between the two. However, the clustering seems to be to loose to be significant. I am looking forward to the next assignments in the Data Analysis Tools course to better elucidate this relationship, if any. As my fellow classmates suggested, I may need to categorize countries for better analysis. For the time being, I still believe I do not know enough yet to do so; I rather work with data points as individual countries until further notice.
If you are interested in my current project, please look forward to the coming posts as I will be continuing my investigation into the relationships between urbanization and the economy.