*Last Lesson in Data Analysis Tools…*

If you have not read my previous posts, I am currently enrolled in a Data Analysis Specialization with Wesleyan University through Coursera. With data from Gapminder, I am exploring a broad and basic question: * does urbanization drive economic growth? *For those of you interested in reading my literature review to gain a background on this project, please visit this page.

This is the last lesson in the Data Analysis Tools course. After analyzing for correlations between variables, this assignment focuses on moderating variables. A moderating variable is one that influences the strength and direction of the association between the explanatory and response variables.

Last time, I established that there were correlations between the amount of urbanization, as measured by percentage of total population in cities with over 1 million people, urban population growth, and GDP per capita. Additionally, I found that there was a correlation between total populations in cities and urban population growth. I suspect that one of these two variables might be a moderating variable.

I first looked at total population in cities over 1 million people as a moderating variable by categorizing it into three groups: low urbanization, medium urbanization, and high urbanization. The cut-offs were <20%, <37%, and >37%. These cut-offs were based on my Chi-Square test cut-offs. By grouping them as such, I found that there was a negative correlation between urban population growth and GDP per capita in countries with low urbanization or less than 20% of their total population live in large cities.

r = -0.428

p-value = 0.00212

I then categorized urban population growth to look at the correlation between total urban population and GDP per capita. I used cut-offs of 1.5% and 3% for urban population growth based on the quantile data from earlier assignments. I found that were was a strong correlation when data is grouped as such. The data suggests that at a medium urban population growth rate, between 1.5 to 3%, there was a strong positive correlation with GDP per capita. The other two groups have close to significant positive correlations, but both p-values were slightly above 0.06. If I included data from more years than 2010, maybe there will be a significant correlation. Perhaps both urban variables affect each other.

r = 0.895

p-value = 8.89e-13

Lastly, I flipped the equation around and look at GDP per capita as a moderating variable in the correlation between urban population and urban population growth. I categorized GDP into four groups based on its quantile statistics, using cut-offs at 700, 2000, and 7000 USD. In this case, I found absolutely no correlations when GDP per capita is used as a moderating variable.

I believe the issue here is related to the fact that my data comes from a snapshot in time – I am only using data from 2010 (or 2007 in the case of urban population in large cities). To develop my project and gain a better understanding of the relationships between urban population and GDP, I should look to using data from a longer time period. That would be the next step in my learning.

**This is my code in Python:**

import pandas

import numpy

import seaborn

import scipy.stats

import matplotlib.pyplot as plt

gapminder = pandas.read_csv(‘Data1.csv’, low_memory=False)

gapminder[‘GDP2010’] = pandas.to_numeric(gapminder[‘GDP2010’])

gapminder[‘GDPGrowth2010’] = pandas.to_numeric(gapminder[‘GDPGrowth2010’])

gapminder[‘UrbanPop2010’] = pandas.to_numeric(gapminder[‘UrbanPop2010’])

gapminder[‘UrbanPopGrowth2010’] = pandas.to_numeric(gapminder[‘UrbanPopGrowth2010’])

gapminder[‘UrbanAgg2007’] = pandas.to_numeric(gapminder[‘UrbanAgg2007’])

#replace 0 with NaN values, and dropna() removes rows with any NaN values

gapminder[‘GDP2010’] = gapminder[‘GDP2010’].replace(0,numpy.nan)

gapminder[‘GDPGrowth2010’] = gapminder[‘GDPGrowth2010’].replace(0,numpy.nan)

gapminder[‘UrbanPop2010’] = gapminder[‘UrbanPop2010’].replace(0,numpy.nan)

gapminder[‘UrbanPopGrowth2010’] = gapminder[‘UrbanPopGrowth2010’].replace(0,numpy.nan)

gapminder[‘UrbanAgg2007’] = gapminder[‘UrbanAgg2007’].replace(0, numpy.nan)

gapminder = gapminder.dropna(subset = [‘GDP2010′,’GDPGrowth2010′,’UrbanPop2010′,’UrbanPopGrowth2010’, ‘UrbanAgg2007’])

gapminder = gapminder[[‘Country’,’GDP2010′, ‘GDPGrowth2010’, ‘UrbanPop2010’, ‘UrbanPopGrowth2010’, ‘UrbanAgg2007’]]

def UrbanRate (row):

if row[‘UrbanAgg2007’] <= 20:

return 1

elif row[‘UrbanAgg2007’] <= 37:

return 2

elif row[‘UrbanAgg2007’] > 37:

return 3

gapminder[‘UrbanRate’] = gapminder.apply (lambda row: UrbanRate (row), axis = 1)

Urban1 = gapminder[(gapminder[‘UrbanRate’]==1)]

Urban2 = gapminder[(gapminder[‘UrbanRate’]==2)]

Urban3 = gapminder[(gapminder[‘UrbanRate’]==3)]

print (scipy.stats.pearsonr(Urban1[‘UrbanPopGrowth2010’], Urban1[‘GDP2010’]))

print (scipy.stats.pearsonr(Urban2[‘UrbanPopGrowth2010’], Urban2[‘GDP2010’]))

print (scipy.stats.pearsonr(Urban3[‘UrbanPopGrowth2010’], Urban3[‘GDP2010′]))

seaborn.regplot(x=’UrbanPopGrowth2010′, y=’GDP2010’, fit_reg=True, data=Urban1)

plt.xlabel(‘Urban Population Growth in 2010’)

plt.ylabel(‘GDP Per Capita in 2010’)

plt.title(‘Urbanization and Economic Growth’)

**Second Moderator Test (using urban population growth rate):**

gapminder = gapminder.dropna(subset = [‘GDP2010′,’GDPGrowth2010′,’UrbanPop2010′,’UrbanPopGrowth2010’, ‘UrbanAgg2007’])

gapminder = gapminder[[‘Country’,’GDP2010′, ‘GDPGrowth2010’, ‘UrbanPop2010’, ‘UrbanPopGrowth2010’, ‘UrbanAgg2007’]]

def PopRate (row):

if row[‘UrbanPopGrowth2010’] <= 1.5:

return 1

elif row[‘UrbanPopGrowth2010’] <= 3:

return 2

elif row[‘UrbanPopGrowth2010’] > 3:

return 3

gapminder[‘PopRate’] = gapminder.apply (lambda row: PopRate (row), axis = 1)

chk = gapminder[‘PopRate’].value_counts(sort=False, dropna=False)

print (chk)

Urban1 = gapminder[(gapminder[‘PopRate’]==1)]

Urban2 = gapminder[(gapminder[‘PopRate’]==2)]

Urban3 = gapminder[(gapminder[‘PopRate’]==3)]

print (scipy.stats.pearsonr(Urban1[‘UrbanAgg2007’], Urban1[‘GDP2010’]))

print (scipy.stats.pearsonr(Urban2[‘UrbanAgg2007’], Urban2[‘GDP2010’]))

print (scipy.stats.pearsonr(Urban3[‘UrbanAgg2007’], Urban3[‘GDP2010′]))

seaborn.regplot(x=’UrbanAgg2007′, y=’GDP2010’, fit_reg=True, data=Urban2)

plt.xlabel(‘Urban Population in Large Cities’)

plt.ylabel(‘GDP Per Capita in 2010’)

plt.title(‘Urbanization and Economic Growth’)

**Third Moderator Test (using GDP per capita):**

gapminder = gapminder.dropna(subset = [‘GDP2010′,’GDPGrowth2010′,’UrbanPop2010′,’UrbanPopGrowth2010’, ‘UrbanAgg2007’])

gapminder = gapminder[[‘Country’,’GDP2010′, ‘GDPGrowth2010’, ‘UrbanPop2010’, ‘UrbanPopGrowth2010’, ‘UrbanAgg2007’]]

def Wealth (row):

if row[‘GDP2010’] <= 700:

return 1

elif row[‘GDP2010’] <= 2000:

return 2

elif row[‘GDP2010’] <= 7000:

return 3

elif row[‘GDP2010’] > 7000:

return 4

gapminder[‘Wealth’] = gapminder.apply (lambda row: Wealth (row), axis = 1)

chk = gapminder[‘Wealth’].value_counts(sort=False, dropna=False)

print (chk)

Urban1 = gapminder[(gapminder[‘Wealth’]==1)]

Urban2 = gapminder[(gapminder[‘Wealth’]==2)]

Urban3 = gapminder[(gapminder[‘Wealth’]==3)]

Urban4 = gapminder[(gapminder[‘Wealth’]==4)]

print (scipy.stats.pearsonr(Urban1[‘UrbanPopGrowth2010’], Urban1[‘UrbanAgg2007’]))

print (scipy.stats.pearsonr(Urban2[‘UrbanPopGrowth2010’], Urban2[‘UrbanAgg2007’]))

print (scipy.stats.pearsonr(Urban3[‘UrbanPopGrowth2010’], Urban3[‘UrbanAgg2007’]))

print (scipy.stats.pearsonr(Urban4[‘UrbanPopGrowth2010’], Urban4[‘UrbanAgg2007′]))

seaborn.regplot(x=’UrbanPopGrowth2010′, y=’UrbanAgg2007’, fit_reg=False, data=gapminder)

plt.xlabel(‘Urban Population Growth in 2010’)

plt.ylabel(‘Urban Population in Large Cities’)

plt.title(‘Urbanization and Economic Growth’)

**This is my code in SAS (only the first moderator test):**

FILENAME REFFILE “/home/wfhsu.taiwan0/my_courses/Data1.xlsx” TERMSTR=CR;

PROC IMPORT DATAFILE=REFFILE

DBMS=XLSX

OUT=Gapminder2010;

GETNAMES=YES;

RUN;

PROC CONTENTS DATA=Gapminder2010; RUN;

LIBNAME mydata “/saswork/SAS_work2EC30000E95E_odaws04-prod-us/SAS_work9F7B0000E95E_odaws04-prod-us ” access=readonly;

DATA new; set Gapminder2010 ;

LABEL ‘GDP per Capita 2010’n=”GDP2010″;

LABEL ‘GDP Growth 2010’n=”GDPGrowth2010″;

LABEL ‘Urban Population 2010’n=”UrbanPop2010″;

LABEL ‘Urban Pop Growth’n=”UrbanPopGrowth2010″;

LABEL ‘Pop in Large Cities’n=”UrbanAgg2007″;

IF GDP2010=”0″ THEN GDP2010=”.” ;

IF GDPGrowth2010=”0″ THEN GDPGrowth2010=”.” ;

IF UrbanPop2010=”0″ THEN UrbanPop2010=”.” ;

IF UrbanPopGrowth2010=”0″ THEN UrbanPopGrowth2010=”.” ;

IF UrbanAgg2007=”0″ THEN UrbanAgg2007=”.”;

IF GDP2010 ne . ;

IF GDPGrowth2010 ne . ;

IF UrbanPop2010 ne . ;

IF UrbanPopGrowth2010 ne . ;

IF UrbanAgg2007 = “.” THEN UrbanRate = “.”;

ELSE IF UrbanAgg2007 LE 20 THEN UrbanRate = 1;

ELSE IF UrbanAgg2007 LE 37 THEN UrbanRate = 2;

ELSE IF UrbanAgg2007 GT 37 THEN UrbanRate = 3;

IF UrbanRate ne . ;

PROC SORT; by country;

PROC SORT; by UrbanRate;

PROC PRINT; VAR country GDP2010 GDPGrowth2010 UrbanPop2010 UrbanPopGrowth2010 UrbanAgg2007 UrbanRate;

SYMBOL1 C=BLUE I=R V=DOT;

PROC GPLOT; PLOT GDP2010*UrbanPopGrowth2010; by UrbanRate;

PROC CORR; VAR UrbanPopGrowth2010 GDP2010; by UrbanRate;

RUN;