*Continuing with Regression Modelling in Practice…*

*If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development, and that my general question is whether urbanization drives economic growth.*

Through the past two courses, Data Analysis Tools and Data Management and Visualization, I established that there was a correlation between urban population and GDP per capita.

For this assignment, my primary explanatory variable is the urban population growth rate and my response variable is GDP per capita. Both figures are from 2010.

**This is my code in Python:**

import pandas

import numpy

import scipy.stats

import seaborn

import matplotlib.pyplot as plt

import statsmodels.formula.api as smf

import statsmodels.stats.multicomp as multi

gapminder = pandas.read_csv('Data1.csv', low_memory=False)

gapminder['GDP2010'] = gapminder['GDP2010'].replace(0, numpy.nan)

gapminder['GDPGrowth2010'] = gapminder['GDPGrowth2010'].replace(0, numpy.nan)

gapminder['UrbanPop2010'] = gapminder['UrbanPop2010'].replace(0, numpy.nan)

gapminder['UrbanPopGrowth2010'] = gapminder['UrbanPopGrowth2010'].replace(0, numpy.nan)

gapminder = gapminder[['Country', 'UrbanPop2010', 'UrbanPopGrowth2010', 'GDP2010', 'GDPGrowth2010']]

gapminder = gapminder.dropna()

PopDes = gapminder['UrbanPopGrowth2010'].describe()

print(PopDes)

RegData = gapminder[['Country', 'UrbanPopGrowth2010', 'GDP2010']].copy()  # copy avoids SettingWithCopyWarning

RegData['UrbanPopGrowth2010'] = RegData['UrbanPopGrowth2010'] - RegData['UrbanPopGrowth2010'].mean()  # center on the mean

print(RegData.describe())

UrbanReg = smf.ols(formula='GDP2010 ~ UrbanPopGrowth2010', data=RegData).fit()

print(UrbanReg.summary())

seaborn.regplot(x='UrbanPopGrowth2010', y='GDP2010', fit_reg=True, data=RegData)

plt.xlabel('Urban Population Growth (Centered)')

plt.ylabel('GDP Per Capita 2010')

plt.title('Urbanization and Economic Growth')

plt.show()

PopDes = gapminder['UrbanPop2010'].describe()

print(PopDes)

RegData = gapminder[['Country', 'UrbanPop2010', 'GDP2010']].copy()  # copy avoids SettingWithCopyWarning

RegData['UrbanPop2010'] = RegData['UrbanPop2010'] - RegData['UrbanPop2010'].mean()  # center on the mean

print(RegData.describe())

UrbanReg = smf.ols(formula='GDP2010 ~ UrbanPop2010', data=RegData).fit()

print(UrbanReg.summary())

plt.figure()  # start a new figure so this plot does not overlay the previous one

seaborn.regplot(x='UrbanPop2010', y='GDP2010', fit_reg=True, data=RegData)

plt.xlabel('Urban Population 2010 (Centered)')

plt.ylabel('GDP Per Capita 2010')

plt.title('Urbanization and Economic Growth')

plt.show()

print('Urbanization and GDP per Capita')

print(scipy.stats.pearsonr(RegData['UrbanPop2010'], RegData['GDP2010']))  # needs scipy.stats to be imported

The mean of urban population growth before centering is 2.034. I centered the variable by subtracting its mean, using RegData['UrbanPopGrowth2010'] - RegData['UrbanPopGrowth2010'].mean(), so that the centered mean is 0. These are the statistics of the variable after centering:

count 164

mean -1.733031e-16 (effectively 0)

std 1.585665

min -3.513519

25% -1.201703

50% -0.1883302

75% 1.026519

max 6.323352
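To make the centering step concrete, here is a minimal sketch using only the Python standard library. The growth rates in it are hypothetical values chosen for illustration, not figures from my dataset. It also shows why the centered mean above prints as a tiny number such as -1.733031e-16 rather than an exact 0: that is floating-point rounding.

```python
from statistics import mean

# Hypothetical urban population growth rates (percent per year), not the real data.
growth = [0.5, 1.2, 2.0, 3.1, 4.4]

# Centering: subtract the sample mean from every observation.
growth_mean = mean(growth)
centered = [g - growth_mean for g in growth]

print(growth_mean)
print(centered)

# The centered mean is zero up to floating-point rounding error.
print(abs(mean(centered)) < 1e-12)
```

Centering only shifts the values. The spread (std, and the distances between quartiles) stays the same, which is why only the location statistics change after centering.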

The regression output is:

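From this table, the intercept is 7541.82 and the slope of the best-fit line is -2383.484. Because the explanatory variable is centered, the intercept is the predicted GDP per capita for a country with average urban population growth. The negative slope means a negative association between GDP per capita and urban population growth. With a p-value below 0.01, the relationship is statistically significant, and urban population growth explains 10.8% of the variability in GDP per capita (R-squared = 0.108).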
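To see what centering does to a fitted line, here is a small least-squares calculation written out by hand. It is a sketch on hypothetical numbers, not my dataset, and `simple_ols` is a helper I made up for illustration rather than a statsmodels function. The slope and R-squared should come out the same for the raw and centered variable, while the centered intercept should equal the mean of the response.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope, r_squared) for the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot

# Hypothetical data: growth rate (%) and GDP per capita (USD).
growth = [0.5, 1.0, 2.0, 3.0, 4.5]
gdp = [30000, 22000, 9000, 4000, 1500]

growth_mean = mean(growth)
centered = [g - growth_mean for g in growth]

raw_fit = simple_ols(growth, gdp)
centered_fit = simple_ols(centered, gdp)

print(raw_fit)       # intercept is the prediction at growth = 0
print(centered_fit)  # intercept is the prediction at average growth
```

In other words, centering changes what the intercept means, not how well the line fits.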

For every one-percentage-point increase in the urban population growth rate, the model predicts a decrease of about 2,383 USD in a country's GDP per capita.
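The fitted line can be turned into a small prediction function. This is only a sketch built from the coefficients and the pre-centering mean reported above. The function name `predict_gdp` and the example growth rates are my own choices for illustration.

```python
INTERCEPT = 7541.82   # reported intercept: predicted GDP per capita at average growth
SLOPE = -2383.484     # reported slope: change in GDP per capita per percentage point
MEAN_GROWTH = 2.034   # reported mean urban population growth before centering

def predict_gdp(growth_rate):
    """Predicted GDP per capita (USD) for an uncentered growth rate in percent."""
    centered = growth_rate - MEAN_GROWTH
    return INTERCEPT + SLOPE * centered

# One point below the mean, at the mean, and one point above the mean.
for rate in (1.034, 2.034, 3.034):
    print(rate, round(predict_gdp(rate), 2))
```

Since the model explains only about 10.8% of the variability, these predictions are rough. The line even drops below zero for the fastest-growing countries in the sample, which is a reminder that a straight line is a coarse description of this relationship.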

To better understand this correlation, I ran the same program using urban population as a percentage of total population as the explanatory variable. The correlation between urban population and GDP per capita is even stronger: the p-value is again below 0.01, the R-squared value is 0.389, and the r value of 0.6238 indicates a strong positive relationship.
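In a regression with a single explanatory variable, R-squared is the square of the Pearson correlation, so the two reported figures can be checked against each other. The sketch below does that arithmetic. The correlation for the growth regression is derived from its R-squared and the sign of its slope; it is not a value I computed from the data.

```python
import math

r_urban = 0.6238    # reported Pearson r for urban population share
r2_urban = 0.389    # reported R-squared for the same regression
r2_growth = 0.108   # reported R-squared for the urban growth regression

# Squaring r should reproduce the reported R-squared to three decimals.
print(round(r_urban ** 2, 3))

# Derived, not reported: the growth slope is negative, so r takes the negative root.
r_growth = -math.sqrt(r2_growth)
print(round(r_growth, 3))
```

So the urban population share is roughly twice as strongly correlated with GDP per capita as the urban growth rate is, and in the opposite direction.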

The data suggest that the more urbanized a country is, the richer it tends to be, while the countries that are urbanizing fastest tend to be poorer. Perhaps the poorer countries are urbanizing to catch up with the rich. Interestingly, though, the data do not suggest a relationship between urbanization and GDP growth (%).

Hopefully, I can slowly uncover more of these relationships. I believe the correlations will become clearer once I add more time-series data and develop this into a longitudinal study.

**This is my code in SAS (only for Urban Population Growth regression):**

FILENAME REFFILE "/home/wfhsu.taiwan0/my_courses/Data1.xlsx" TERMSTR=CR;

PROC IMPORT DATAFILE=REFFILE

DBMS=XLSX

OUT=Gapminder2010;

GETNAMES=YES;

RUN;

PROC CONTENTS DATA=Gapminder2010; RUN;

LIBNAME mydata "/saswork/SAS_work2EC30000E95E_odaws04-prod-us/SAS_work9F7B0000E95E_odaws04-prod-us " access=readonly;

DATA new; set Gapminder2010 ;

LABEL 'GDP per Capita 2010'n="GDP2010";

LABEL 'GDP Growth 2010'n="GDPGrowth2010";

LABEL 'Urban Population 2010'n="UrbanPop2010";

LABEL 'Urban Pop Growth'n="UrbanPopGrowth2010";

LABEL 'Pop in Large Cities'n="UrbanAgg2007";

IF GDP2010=0 THEN GDP2010=. ;

IF GDPGrowth2010=0 THEN GDPGrowth2010=. ;

IF UrbanPop2010=0 THEN UrbanPop2010=. ;

IF UrbanPopGrowth2010=0 THEN UrbanPopGrowth2010=. ;

IF GDP2010 ne . ;

IF GDPGrowth2010 ne . ;

IF UrbanPop2010 ne . ;

IF UrbanPopGrowth2010 ne . ;

UrbanPopGrowth2010 = UrbanPopGrowth2010 - 2.034;

PROC SORT; by country;

PROC PRINT; VAR country GDP2010 GDPGrowth2010 UrbanPop2010 UrbanPopGrowth2010;

SYMBOL1 C=BLUE I=R V=DOT;

PROC GPLOT; PLOT GDP2010*UrbanPopGrowth2010 ;

PROC GLM; model GDP2010=UrbanPopGrowth2010;

RUN;