*Continuing with Regression Modelling in Practice…*

*If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development, and that my general question is whether urbanization drives economic growth.*

In the past two courses, Data Analysis Tools and Data Management and Visualization, I established that urban population and GDP per capita are correlated. For this assignment, I decided to look at another measure of economic development: employment rate.

However, because data for 2010 is unavailable for some of the new variables I wanted to include, I decided to use data from 2007, the most recent year with the most complete data across all my variables. For each variable, I downloaded the data directly from Gapminder, extracted the values for 2007, and compiled them into a new CSV file. I define my response variable as *Employment Rate in 2007*. With the data set adjusted to the right year, I obtained the results described in the rest of this post.
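As a rough illustration of that compilation step, here is a minimal sketch. It assumes each Gapminder indicator arrives as a wide table (one row per country, one column per year); the two small tables, their values, and the helper `extract_year` are made up for the example and are not the files actually used.

```python
import pandas as pd

# Hypothetical stand-ins for two Gapminder indicator tables (invented values).
urban_pop = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "2006": [50.0, 60.0, 70.0],
    "2007": [51.0, 61.0, 71.0],
})
employment = pd.DataFrame({
    "Country": ["A", "B", "D"],
    "2006": [55.0, 65.0, 45.0],
    "2007": [56.0, 66.0, 46.0],
})


def extract_year(table, year, new_name):
    """Keep the country column and a single year column, renamed to new_name."""
    return table[["Country", year]].rename(columns={year: new_name})


# An outer merge keeps countries that are missing from one table (as NaN),
# so a later dropna() decides which rows are removed.
data2007 = extract_year(urban_pop, "2007", "UrbanPop2007").merge(
    extract_year(employment, "2007", "Employment2007"),
    on="Country",
    how="outer",
)
print(data2007)
# data2007.to_csv("Data1.csv", index=False)  # would write the compiled file
```

Keeping incomplete countries until the regression step makes it easy to see how many observations each additional variable costs.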

I first ran a multiple regression with Employment Rate as the response variable and Urban Population, Urban Population Growth, Energy Use per Capita, GDP per Capita, and GDP Growth as explanatory variables. Energy Use and GDP Growth were not significantly associated with Employment Rate, with *p-values* above 0.05.

After removing Energy Use and GDP Growth as explanatory variables, I ran the multiple regression again and Urban Population, Urban Population Growth, and GDP per Capita continued to show significance, with *p-values* < 0.05.

I then created scatter plots for each of the explanatory variables:

The resulting graphs showed apparent outliers in the Urban Population Growth data. They also showed that GDP per Capita, though statistically significant with a *p-value* of 0.002, does not display a clear visual relationship with Employment Rate: most of the data cluster to the left with no visible pattern. The outliers appear to be influential. After I removed them, the Beta coefficient changed quite a bit.
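To see why a handful of extreme points can move a coefficient that much, here is a minimal sketch on made-up data (not the Gapminder values). The cutoff of 10 mirrors the filter used in the code further down; everything else, including the helper `ols_slope`, is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 ordinary observations with a true slope of +2.
x = rng.uniform(0, 5, size=50)
y = 60 + 2.0 * x + rng.normal(0, 3, size=50)

# Three extreme points far to the right and well below the trend.
x_all = np.append(x, [12.0, 14.0, 16.0])
y_all = np.append(y, [40.0, 38.0, 35.0])


def ols_slope(x, y):
    """Slope of a simple least-squares line with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]


slope_all = ols_slope(x_all, y_all)

keep = x_all < 10  # drop the extreme points
slope_trimmed = ols_slope(x_all[keep], y_all[keep])

print(f"slope with outliers:    {slope_all:.3f}")
print(f"slope without outliers: {slope_trimmed:.3f}")
```

Fitting the model both ways and reporting both slopes is a simple way to show how much the conclusion depends on the removed points.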

This leaves only Urban Population and Urban Population Growth (after removing outliers) as explanatory variables significantly associated with Employment Rate. **Neither of these variables appears to confound the other, as both retain p-values well below 0.05.** Urban Population showed a negative association with a Beta coefficient of -0.0447, while Urban Population Growth showed a positive association with a Beta coefficient of 2.2785. I also added a quadratic term for Urban Population Growth to check for a curvilinear relationship, but it was not significant, with a *p-value* much larger than 0.05.
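For contrast, this is roughly what confounding would look like: a coefficient that shifts noticeably once the second variable enters the model. The sketch below uses made-up data in which `x2` is deliberately tied to `x1`; the names and numbers are assumptions and do not come from the Gapminder data.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)  # x2 deliberately tied to x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)


def ols_coefs(y, *predictors):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


b1_alone = ols_coefs(y, x1)[1]      # x1 as the only predictor
b1_joint = ols_coefs(y, x1, x2)[1]  # x1 with x2 in the model
shift = b1_alone - b1_joint

print(f"x1 alone: {b1_alone:.2f}, x1 with x2: {b1_joint:.2f}, shift: {shift:.2f}")
```

When the two coefficients stay close and both terms remain significant, as reported above, there is little sign that one variable is standing in for the other.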

With that in mind, I looked at the residuals for the model with Urban Population and Urban Population Growth. In the Q-Q plot, the residuals follow the fit line fairly closely, suggesting an approximately normal distribution, though there are slight deviations at lower and higher values.

Looking at the standardized residuals, there are a few outliers, but none is more than 2.5 standard deviations from the mean. Both explanatory variables also showed a fairly consistent spread of residuals across all values. However, the residuals are generally not close to y=0, meaning that the fit is not great.
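The check on standardized residuals can be sketched with plain NumPy. The example below uses made-up data and divides each residual by the residual standard error, which, as far as I know, is the same kind of quantity that `resid_pearson` reports for an ordinary least-squares fit; treat that correspondence as an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Design matrix: intercept plus two made-up predictors.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([0.0, -0.5, 2.0]) + rng.normal(0, 4, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Residual standard error: sqrt(SSR / residual degrees of freedom).
df_resid = n - X.shape[1]
std_resid = resid / np.sqrt(resid @ resid / df_resid)

share_beyond_2 = np.mean(np.abs(std_resid) > 2)
share_beyond_2_5 = np.mean(np.abs(std_resid) > 2.5)

print(f"share beyond 2 SD:   {share_beyond_2:.1%}")
print(f"share beyond 2.5 SD: {share_beyond_2_5:.1%}")
```

For roughly normal residuals, about 5% of observations are expected beyond 2 standard deviations and about 1% beyond 2.5, so a few points past 2 are not alarming on their own.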

In the partial regression plots, there is a general trend, though again the cluster of data points is not close to the line, meaning that the explanatory power of these variables is limited. This is reflected in an r-squared value of 0.162, which means only 16.2% of the variability in Employment Rate can be predicted from Urban Population and Urban Population Growth.
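The r-squared value is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch on made-up data (the variable names and coefficients are assumptions chosen so that the noise dominates, as it does in my model):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 120

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 0.5 * x1 + 1.0 * x2 + rng.normal(0, 3, size=n)  # noisy response

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

ss_res = np.sum((y - fitted) ** 2)    # variation left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r_squared = 1 - ss_res / ss_tot

print(f"R-squared: {r_squared:.3f}")
```

A value near 0.16 therefore says that about 84% of the variation in the response is left in the residuals.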

Lastly, looking at the leverage plot, there are no new outliers beyond those already seen in the residual plots, and only one data point had a greater-than-average influence on the result. However, the effect is not pronounced, as its leverage value is only 0.15. Most of the data points cluster between leverage values of 0 and 0.05.
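Leverage values are the diagonal of the hat matrix, and their average equals the number of model parameters divided by the number of observations, which gives a baseline for "greater than average". A minimal sketch on made-up predictors (the sample size and data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 80

# Intercept plus two made-up predictors, so p = 3 parameters.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
p = X.shape[1]

# h_ii = x_i (X'X)^-1 x_i', computed without forming the full n x n hat matrix.
XtX_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

mean_leverage = leverage.mean()  # equals p / n
high = np.flatnonzero(leverage > 2 * mean_leverage)

print(f"mean leverage: {mean_leverage:.4f} (p/n = {p / n:.4f})")
print(f"observations above twice the mean: {high.tolist()}")
```

The cutoff of twice the mean is only a common rule of thumb; the point is that a leverage value has to be judged against p/n for the model at hand.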

These results suggest that though there is a significant association between urbanization and economic development (here measured as employment rate), there could be other explanatory variables that would improve the explanatory power of this model. With a low r-squared value and residuals that are not tightly clustered around 0, the predictive power of Urban Population and Urban Population Growth for Employment Rate is relatively low. Still, this lends some support to my hypothesis that urbanization and economic development are related.

**This is my code in Python:**

import pandas

import numpy

import seaborn

import scipy

import matplotlib.pyplot as plt

import statsmodels.api as sm

import statsmodels.formula.api as smf

import statsmodels.stats.multicomp as multi

urbandata = pandas.read_csv('Data1.csv', low_memory=False)

urbandata = urbandata.replace(0, numpy.nan)  #zeros are treated as missing values

RegData = urbandata[['Country', 'UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]

#RegData['UrbanPop2010'] = RegData['UrbanPop2010'] - RegData['UrbanPop2010'].mean()

RegData = RegData.dropna().copy()  #keep complete cases only; copy so the assignment below modifies RegData itself

RegData[['UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']] = RegData[['UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']] - RegData[['UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']].mean()  #center every numeric column on its mean

print (RegData.describe())

print (RegData)

UrbanReg = smf.ols(formula='Employment2007 ~ UrbanPop2007 + UrbanPopGrowth2007 + Energy2007 + GDP2007 + GDPGrowth2007', data=RegData).fit()

print (UrbanReg.summary())

EmployReg = smf.ols(formula='Employment2007 ~ UrbanPop2007 + UrbanPopGrowth2007 + GDP2007', data=RegData).fit()

print (EmployReg.summary())

UrbanPopG = plt.figure(figsize=(8,5))

UrbanPopG = seaborn.regplot(x='UrbanPop2007', y='Employment2007', scatter=True, data=RegData)

plt.xlabel('Urban Population 2007(Centered)')

plt.ylabel('Employment Rate 2007')

plt.title('Urbanization and Economic Development')

UrbanPopGrowthG = plt.figure(figsize=(8,5))

UrbanPopGrowthG = seaborn.regplot(x='UrbanPopGrowth2007', y='Employment2007', scatter=True, data=RegData)

plt.xlabel('Urban Population Growth 2007(Centered)')

plt.ylabel('Employment Rate 2007')

plt.title('Urbanization and Economic Development')

GDPG = plt.figure(figsize=(8,5))

GDPG = seaborn.regplot(x='GDP2007', y='Employment2007', scatter=True, data=RegData)

plt.xlabel('GDP per Capita 2007(Centered)')

plt.ylabel('Employment Rate 2007')

plt.title('Economic Development')

NoOut = RegData[(RegData['UrbanPopGrowth2007'] < 10)]  #outlier cutoff; note it is applied to the centered values

Reg1 = smf.ols(formula='Employment2007 ~ UrbanPop2007 + UrbanPopGrowth2007', data=NoOut).fit()

print (Reg1.summary())

Reg2 = smf.ols(formula='Employment2007 ~ UrbanPop2007 + UrbanPopGrowth2007 + I(UrbanPopGrowth2007**2)', data=NoOut).fit()  #adds the quadratic term

print (Reg2.summary())

Reg3 = smf.ols(formula='Employment2007 ~ UrbanPop2007', data=NoOut).fit()

print (Reg3.summary())

Reg4 = smf.ols(formula='Employment2007 ~ UrbanPopGrowth2007', data=NoOut).fit()

print (Reg4.summary())

NoOutGraph = plt.figure(figsize=(8,5))

NoOutGraph = seaborn.regplot(x='UrbanPopGrowth2007', y='Employment2007', scatter=True, data=NoOut)

plt.xlabel('Urban Population Growth 2007(Centered/No Outliers)')

plt.ylabel('Employment Rate 2007')

plt.title('Urbanization and Economic Development')

Resid = plt.figure(figsize=(8,5))

Resid = sm.qqplot(Reg1.resid, line='r')

stres = pandas.DataFrame(Reg1.resid_pearson)

plt.figure(figsize=(8,5))  #new figure so the points are not drawn on top of the Q-Q plot

fig2 = plt.plot(stres, 'o', ls='None')

l = plt.axhline(y=0, color='r')

plt.ylabel('Standardized Residual')

plt.xlabel('Observation Number')

print (fig2)

fig3 = plt.figure(figsize=(12,8))

fig3 = sm.graphics.plot_regress_exog(Reg1, "UrbanPop2007", fig=fig3)

fig4 = plt.figure(figsize=(12,8))

fig4 = sm.graphics.plot_regress_exog(Reg1, "UrbanPopGrowth2007", fig=fig4)

UrbanPopLev = sm.graphics.influence_plot(Reg1, size=8)

print (UrbanPopLev)

**This is my code in SAS (run after removing the three outliers in the Urban Population Growth data; I used PROC PLOT instead of GPLOT for the standardized residuals, as GPLOT will not let me plot non-numeric data):**

FILENAME REFFILE "/home/wfhsu.taiwan0/my_courses/Data1.xlsx" TERMSTR=CR;

PROC IMPORT DATAFILE=REFFILE

DBMS=XLSX

OUT=Gapminder2007;

GETNAMES=YES;

RUN;

PROC CONTENTS DATA=Gapminder2007;

RUN;

DATA new; set Gapminder2007;

LIBNAME mydata "/saswork/SAS_work2EC30000E95E_odaws04-prod-us/SAS_work9F7B0000E95E_odaws04-prod-us " access=readonly;

/* Treat zeros as missing values (numeric missing is a bare period) */

IF GDP2007=0 THEN GDP2007=.;

IF GDPGrowth2007=0 THEN GDPGrowth2007=.;

IF UrbanPop2007=0 THEN UrbanPop2007=.;

IF UrbanPopGrowth2007=0 THEN UrbanPopGrowth2007=.;

IF UrbanAgg2007=0 THEN UrbanAgg2007=.;

IF Employment2007=0 THEN Employment2007=.;

IF Energy2007=0 THEN Energy2007=.;

/* Drop the Urban Population Growth outliers */

IF UrbanPopGrowth2007 GT 10 THEN UrbanPopGrowth2007=.;

/* Keep complete cases only */

IF GDP2007 ne . ;

IF GDPGrowth2007 ne . ;

IF UrbanPop2007 ne . ;

IF UrbanPopGrowth2007 ne . ;

IF Employment2007 ne . ;

IF Energy2007 ne . ;

PROC SORT; by country;

SYMBOL1 C=BLUE I=R V=DOT;

PROC GPLOT; PLOT Employment2007*UrbanPop2007;

SYMBOL1 C=BLUE I=R V=DOT;

PROC GPLOT; PLOT Employment2007*UrbanPopGrowth2007;

SYMBOL1 C=BLUE I=R V=DOT;

PROC GPLOT; PLOT Employment2007*GDP2007;

PROC Means;

var UrbanPop2007 UrbanPopGrowth2007;

RUN;

Data Centered;

Set new;

UrbanPop2007=UrbanPop2007 - 58.5474243;

UrbanPopGrowth2007=UrbanPopGrowth2007 - 2.0261345;

RUN;

PROC Means;

var UrbanPop2007 UrbanPopGrowth2007;

RUN;

PROC GLM PLOTS(unpack)=all;

model Employment2007=UrbanPop2007 UrbanPopGrowth2007;

output residual=res student=stdres out=results;

RUN;

PROC PLOT;

label stdres="Standardized Residual" country="Country";

plot stdres*country/vref=0;

RUN;

PROC REG plots=partial;

model Employment2007=UrbanPop2007 UrbanPopGrowth2007/partial;

RUN;