*Last lesson of Regression Modelling in Practice…*

*If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development, and that my general question is whether urbanization drives economic growth.*

Over the past two courses, Data Analysis Tools and Data Management and Visualization, I looked at the relationship between urbanization and economic development and established that urban population and GDP per capita are correlated. For this last assignment in the course *Regression Modelling in Practice*, I am again examining GDP per Capita as the response variable. I am using the data set I created from Gapminder in the last assignment, which, as I explained there, is more complete for the year 2007 than for 2010.

Because a logistic regression needs a categorical response variable with two levels (it can take multiple explanatory variables), I had to bin GDP per Capita into two categories and recode them:

0 = Countries with a GDP per Capita of $10,000 USD (2000) or less

1 = Countries with a GDP per Capita greater than $10,000 USD (2000)
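The recoding rule can be sketched in a few lines of plain Python. This is a minimal illustration rather than the code used for the analysis: the function name `gdp_category` and the sample values are hypothetical, and the cutoff is treated as inclusive on the lower side to match the `<=` comparison in my script.

```python
import math

THRESHOLD = 10_000  # GDP per capita cutoff in constant 2000 USD


def gdp_category(gdp_per_capita):
    """Return 0 at or below the threshold, 1 above it, None if missing."""
    if gdp_per_capita is None or math.isnan(gdp_per_capita):
        return None
    return 1 if gdp_per_capita > THRESHOLD else 0


# Hypothetical values, for illustration only (not taken from the data set)
sample = [850.0, 10_000.0, 10_000.01, 42_500.0, float("nan")]
categories = [gdp_category(v) for v in sample]
print(categories)
```

Returning `None` for missing values keeps them out of both categories, which mirrors dropping incomplete rows before fitting the model.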

I started off by using Urban Population, Urban Population Growth, Employment Rate, and Energy Use per Capita as explanatory variables. Only Urban Population and Energy Use per Capita were significantly associated with GDP per Capita, with *p-values* of 0.003 and 0.000 respectively (a reported 0.000 means the value is below 0.001, not exactly zero).
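The *p-values* in a logistic regression summary come from a Wald z-statistic (the coefficient divided by its standard error). As a rough sketch, assuming a standard normal reference distribution, a two-sided *p-value* can be computed from z with the standard library alone. The z values below are hypothetical and are not taken from my output; `two_sided_p` is an illustrative helper name.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z-statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical z-statistics, not taken from my regression output
for z in (1.0, 1.96, 3.0, 5.0):
    p = two_sided_p(z)
    print(f"z = {z:4.2f}  p = {p:.6f}  shown to 3 decimals as {p:.3f}")
```

A large z-statistic gives a *p-value* that rounds to 0.000 at three decimals even though it is not exactly zero.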

After removing the non-significant explanatory variables, the *p-values* of Urban Population and Energy Use per Capita remained at 0.003 and 0.000 respectively. Furthermore, when I ran the logistic regression with each explanatory variable by itself, each *p-value* came out at 0.000, far below 0.05. This suggests that there is no confounding between Urban Population and Energy Use per Capita as explanatory variables.

The confidence intervals and odds ratios are as follows:

The confidence interval for Urban Population is narrow and the odds ratio is only slightly above one. An odds ratio of one means that there is no association between the two variables. This suggests that although the odds of a high GDP per Capita increase as Urban Population increases, the association is weak.
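An odds ratio is the exponential of the logit coefficient, and the confidence limits are exponentiated in the same way, which is what `numpy.exp(GDPConf)` does in my script. A small sketch of that conversion follows. The coefficient and interval are made-up numbers chosen only to show the arithmetic, and `to_odds_ratio` is a hypothetical helper.

```python
import math


def to_odds_ratio(coef, ci_low, ci_high):
    """Exponentiate a logit coefficient and its confidence limits."""
    return math.exp(coef), math.exp(ci_low), math.exp(ci_high)


# Hypothetical coefficient and 95% limits on the log-odds scale
odds_ratio, lower, upper = to_odds_ratio(0.05, 0.02, 0.08)
print(f"OR = {odds_ratio:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")

# A coefficient of zero corresponds to an odds ratio of exactly one
print(to_odds_ratio(0.0, 0.0, 0.0))
```

If the exponentiated interval contains one, the data are consistent with no association at that confidence level.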

On the other hand, with an odds ratio of 3.168, there is a strong, positive association between Energy Use per Capita and GDP per Capita. The confidence interval is relatively wide and does not overlap with that of Urban Population. This suggests that the association with Energy Use per Capita is stronger than the association with Urban Population.
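Because an odds ratio describes a one-unit increase in the explanatory variable, its size depends on the units that variable is measured in. One way to read it over a larger or smaller change is to raise it to the power of that change. The sketch below uses the 3.168 reported above; the helper name `odds_multiplier` and the step sizes are assumptions for illustration.

```python
def odds_multiplier(odds_ratio, units):
    """Multiplicative change in odds for a change of `units` in the predictor."""
    return odds_ratio ** units


energy_or = 3.168  # odds ratio for Energy Use per Capita from the model above

for step in (0.5, 1, 2):  # hypothetical step sizes
    multiplier = odds_multiplier(energy_or, step)
    print(f"change of {step}: odds multiplied by {multiplier:.3f}")
```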

With these results, there is still support for my hypothesis that urbanization has a positive correlation with economic development. However, the explanatory power of urbanization might be weaker than I had originally expected.

**This is my code in Python:**

import pandas

import numpy

import seaborn

import scipy

import matplotlib.pyplot as plt

import statsmodels.api as sm

import statsmodels.formula.api as smf

import statsmodels.stats.multicomp as multi

urbandata = pandas.read_csv('Data1.csv', low_memory=False)

urbandata = urbandata.replace(0, numpy.nan)

RegData = urbandata[['Country', 'UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]

#RegData['UrbanPop2010'] = RegData['UrbanPop2010'] - RegData['UrbanPop2010'].mean()

RegData = RegData.dropna()

GDPGD = RegData['GDP2007'].describe()

print (GDPGD)

GDPDistG = plt.figure(figsize=(8,5))

# histplot replaces distplot, which is deprecated in recent seaborn releases

GDPDistG = seaborn.histplot(RegData['GDP2007'].dropna(), kde=False)

plt.xlabel('GDP 2007')

plt.title('Distribution of GDP 2007')

# Recode GDP per capita: 0 = at or below 10,000, 1 = above 10,000

def GDPCat (row):

    if row['GDP2007'] <= 10000:

        return 0

    elif row['GDP2007'] > 10000:

        return 1

RegData['GDPCat'] = RegData.apply (lambda row: GDPCat (row), axis = 1 )

chk = RegData['GDPCat'].value_counts(sort=False, dropna=False)

print (chk)

LogReg = smf.logit(formula='GDPCat ~ UrbanPop2007 + UrbanPopGrowth2007 + Employment2007 + Energy2007', data=RegData).fit()

print (LogReg.summary())

LogReg2 = smf.logit(formula='GDPCat ~ UrbanPop2007 + Energy2007', data=RegData).fit()

print (LogReg2.summary())

LogReg3 = smf.logit(formula='GDPCat ~ Energy2007', data=RegData).fit()

print (LogReg3.summary())

LogReg4 = smf.logit(formula='GDPCat ~ UrbanPop2007', data=RegData).fit()

print (LogReg4.summary())

GDPParams = LogReg2.params

GDPConf = LogReg2.conf_int()

GDPConf['OR'] = LogReg2.params

GDPConf.columns = ['Lower CI', 'Upper CI', 'OR']

print (numpy.exp(GDPConf))

**This is my code in SAS:**

FILENAME REFFILE "/home/wfhsu.taiwan0/my_courses/Data1.xlsx" TERMSTR=CR;

PROC IMPORT DATAFILE=REFFILE

DBMS=XLSX

OUT=Gapminder2007;

GETNAMES=YES;

RUN;

PROC CONTENTS DATA=Gapminder2007;

RUN;

DATA new; set Gapminder2007;

LIBNAME mydata "/saswork/SAS_work2EC30000E95E_odaws04-prod-us/SAS_work9F7B0000E95E_odaws04-prod-us " access=readonly;

/* Treat zeros as missing; the variables are numeric, so compare with 0 and assign . */

IF GDP2007=0 THEN GDP2007=. ;

IF GDPGrowth2007=0 THEN GDPGrowth2007=. ;

IF UrbanPop2007=0 THEN UrbanPop2007=. ;

IF UrbanPopGrowth2007=0 THEN UrbanPopGrowth2007=. ;

IF UrbanAgg2007=0 THEN UrbanAgg2007=. ;

IF Employment2007=0 THEN Employment2007=. ;

IF Energy2007=0 THEN Energy2007=. ;

IF GDP2007 ne . ;

IF GDPGrowth2007 ne . ;

IF UrbanPop2007 ne . ;

IF UrbanPopGrowth2007 ne . ;

IF Employment2007 ne . ;

IF Energy2007 ne . ;

IF GDP2007 = . THEN GDPCat = . ;

ELSE IF GDP2007 LE 10000 THEN GDPCat = 0;

ELSE IF GDP2007 GT 10000 THEN GDPCat = 1;

PROC Univariate; VAR GDP2007;

PROC GCHART; VBAR GDP2007;

PROC LOGISTIC descending; model GDPCat=UrbanPop2007 UrbanPopGrowth2007 Employment2007 Energy2007;

PROC LOGISTIC descending; model GDPCat=UrbanPop2007 Energy2007;

RUN;