# A Revelation…Through ANOVA

Now that I finished the first course in the Data Analysis and Interpretations Specialization, this is the start of the second called Data Analysis Tools

If you have not read my previous posts, I am currently enrolled in a Data Analysis Specialization with Wesleyan University through Coursera. With data from Gapminder, I am exploring a broad and basic question: does urbanization drive economic growth? For those of you interested in reading my literature review to gain a background on this project, please visit this page.

Since I have been presenting my SAS work in the previous course, I will be presenting my Python work for this course. I actually enjoy working with Python, it seems to have more flexibility and I am more used to its language, having some experience with R during graduate school. As always, I am also including the other set of code for reference, so please see my SAS code at the very bottom of the post.

There will be two parts to this presentation. The first part will be a discussion of what happened with the analysis of my data using urban population growth rate and GDP growth rate in 2010. The second part will be the presentation of a successful and significant ANOVA test.

This is my Python code for Part I:

import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

gapminder[‘GDP2010’] = pandas.to_numeric(gapminder[‘GDP2010’])
gapminder[‘GDPGrowth2010’] = pandas.to_numeric(gapminder[‘GDPGrowth2010’])
gapminder[‘UrbanPop2010’] = pandas.to_numeric(gapminder[‘UrbanPop2010’])
gapminder[‘UrbanPopGrowth2010’] = pandas.to_numeric(gapminder[‘UrbanPopGrowth2010’])
gapminder[‘Pop2010’] = pandas.to_numeric(gapminder[‘Pop2010’])
gapminder[‘UrbanAgg2007’] = pandas.to_numeric(gapminder[‘UrbanAgg2007’])

#replace 0 with NaN values, and dropna() removes rows with any NaN values
gapminder[‘GDP2010’] = gapminder[‘GDP2010’].replace(0,numpy.nan)
gapminder[‘GDPGrowth2010’] = gapminder[‘GDPGrowth2010’].replace(0,numpy.nan)
gapminder[‘UrbanPop2010’] = gapminder[‘UrbanPop2010’].replace(0,numpy.nan)
gapminder[‘UrbanPopGrowth2010’] = gapminder[‘UrbanPopGrowth2010’].replace(0,numpy.nan)
gapminder[‘Pop2010’] = gapminder[‘Pop2010’].replace(0,numpy.nan)

gapminder = gapminder.dropna(subset = [‘GDP2010′,’GDPGrowth2010′,’UrbanPop2010′,’UrbanPopGrowth2010′,’Pop2010’])
gapminder = gapminder[[‘Country’,’GDP2010′, ‘GDPGrowth2010’, ‘UrbanPop2010’, ‘UrbanPopGrowth2010’, ‘Pop2010’, ‘UrbanAgg2007’]]

print (gapminder)

GDPDes = gapminder[‘GDP2010’].describe()
print (GDPDes)

GDPGrowthDes = gapminder[‘GDPGrowth2010’].describe()
print (GDPGrowthDes)

UrbanPopDes = gapminder[‘UrbanPop2010’].describe()
print (UrbanPopDes)

UrbanGrowthDes = gapminder[‘UrbanPopGrowth2010’].describe()
print (UrbanGrowthDes)

PopGrowthDes = gapminder[‘Pop2010’].describe()
print (PopGrowthDes)

bins = [-1.5, 0, 1.5, 3, 4.5]
gapminder[‘UrbanRate4’]=pandas.cut(gapminder.UrbanPopGrowth2010, bins, labels=[“Neg”, “Low”, “Medium”, “High”])
test = gapminder[‘UrbanRate4’].value_counts(sort=False)
print(test)

test = gapminder[‘UrbanRate4′].describe()
print (test)

seaborn.factorplot(x=’UrbanRate4′, y=’GDPGrowth2010’, data=gapminder, kind=”bar”, ci=None, size=10, aspect=1)
plt.xlabel(“Urban Growth Rate”)
plt.ylabel(“GDP Growth in 2010 (Mean % Growth)”)
plt.title(“Urban Growth Rate Against GDP Growth in 2010”)

urbanbin = smf.ols(formula = ‘GDPGrowth2010 ~ UrbanRate4’, data=gapminder).fit()
print (urbanbin.summary())

groupmean = gapminder.groupby(‘UrbanRate4’).mean()
print (groupmean)

In order to conduct an ANOVA test, I created four categories for my urban population growth data based on the mean, range, and standard deviation. Then I charted the data as bar graphs with mean growth rate for each category. The graph seemed to suggest a relationship between urban population growth rate and GDP growth rate based on 2010 data. However, when I ran the ANOVA test, my P-value came out to be 0.0694. This is greater than the 0.05 threshold under which the null hypothesis can be rejected, suggesting that there are no significant differences between the means of each Urban Growth Rate category. Furthermore, by looking at the scatter plot from my previous assignment (which did not demonstrate significant correlation), it calls into question if there are actually relationships between urbanization and economic growth. Perhaps there is another variable that affects urban growth and GDP growth. This leads to part two of my work. Instead of looking at urban growth rate, I used the UrbanAgg2007 variable instead. Perhaps the level of urbanization still matters. I looked into absolute degree instead of growth rate. UrbanAgg2007 is how much of a country’s total population is in cities with over 1 million people. Here, I aggregated the data using the pandas.cut function to produce three categorical groups for comparison. Using the quantile data, I used intervals of [0, 20, 37, 54] based on the mean and standard deviation (23.79 and 17.26 respectively). Below is the graph and descriptive metrics of this comparison: Name: UrbanAgg2007
count 97.000000
mean 23.795882
std 17.265522
min 0.000000
25% 12.036148
50% 19.385539
75% 31.485011
max 100.000000

From this graph, there seems to be a significant difference between the three groups, particularly between the Medium and High/Low groups. The data included is also fairly representative as there are 91 out of the 97 observations that fit into the three categorical groups. I then ran the ANOVA analysis and Tukey Honestly Significant Difference Test with significant results.

This is my Python code for Part II:

import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

gapminder[‘GDP2010’] = pandas.to_numeric(gapminder[‘GDP2010’])
gapminder[‘GDPGrowth2010’] = pandas.to_numeric(gapminder[‘GDPGrowth2010’])
gapminder[‘UrbanPop2010’] = pandas.to_numeric(gapminder[‘UrbanPop2010’])
gapminder[‘UrbanPopGrowth2010’] = pandas.to_numeric(gapminder[‘UrbanPopGrowth2010’])
gapminder[‘Pop2010’] = pandas.to_numeric(gapminder[‘Pop2010’])
gapminder[‘UrbanAgg2007’] = pandas.to_numeric(gapminder[‘UrbanAgg2007’])

#replace 0 with NaN values, and dropna() removes rows with any NaN values
gapminder[‘GDP2010’] = gapminder[‘GDP2010’].replace(0,numpy.nan)
gapminder[‘GDPGrowth2010’] = gapminder[‘GDPGrowth2010’].replace(0,numpy.nan)
gapminder[‘UrbanPop2010’] = gapminder[‘UrbanPop2010’].replace(0,numpy.nan)
gapminder[‘UrbanPopGrowth2010’] = gapminder[‘UrbanPopGrowth2010’].replace(0,numpy.nan)
gapminder[‘Pop2010’] = gapminder[‘Pop2010’].replace(0,numpy.nan)

gapminder = gapminder.dropna(subset = [‘GDP2010′,’GDPGrowth2010′,’UrbanPop2010′,’UrbanPopGrowth2010′,’Pop2010’])
gapminder = gapminder[[‘Country’,’GDP2010′, ‘GDPGrowth2010’, ‘UrbanPop2010’, ‘UrbanPopGrowth2010’, ‘Pop2010’, ‘UrbanAgg2007’]]

print (gapminder)

GDPDes = gapminder[‘GDP2010’].describe()
print (GDPDes)

GDPGrowthDes = gapminder[‘GDPGrowth2010’].describe()
print (GDPGrowthDes)

UrbanPopDes = gapminder[‘UrbanPop2010’].describe()
print (UrbanPopDes)

UrbanGrowthDes = gapminder[‘UrbanPopGrowth2010’].describe()
print (UrbanGrowthDes)

UrbanDes = gapminder[‘UrbanAgg2007’].describe()
UrbanMedian = gapminder[‘UrbanAgg2007’].median()
print (UrbanDes)
print (UrbanMedian)

gapminder[‘UrbanRate4’]=pandas.cut(gapminder.UrbanAgg2007, [0, 20, 37, 54], labels=[“Low”, “Medium”, “High”])
test = gapminder[‘UrbanRate4’].value_counts(sort=False)
print(test)

#To remove NaN values from new variable UrbanRate4
gapminder = gapminder.dropna(subset = [‘GDP2010′,’GDPGrowth2010′,’UrbanPop2010′,’UrbanPopGrowth2010′,’Pop2010’, ‘UrbanAgg2007’, ‘UrbanRate4’])

print(gapminder)

UrbanRate = gapminder[‘UrbanRate4′].describe()
print (UrbanRate)

seaborn.factorplot(x=’UrbanRate4′, y=’GDPGrowth2010’, data=gapminder, kind=”bar”, ci=None, size=10, aspect=1)
plt.xlabel(“Urbanization (% Total Pop in Cities Over 1 Million People)”)
plt.ylabel(“GDP Growth in 2010 (Mean % Growth)”)
plt.title(“Urbanization Against GDP Growth in 2010”)

urbanbin = smf.ols(formula = ‘GDPGrowth2010 ~ UrbanRate4’, data=gapminder).fit()
print (urbanbin.summary())

Post = multi.MultiComparison(gapminder[‘GDPGrowth2010’], gapminder[‘UrbanRate4’])
PostHoc = Post.tukeyhsd()
print (PostHoc.summary())

groupmean = gapminder.groupby(‘UrbanRate4’).mean()
print (groupmean)

The result:

OLS Regression Results
==============================================================================
R-squared: 0.094
R-squared: 0.073
F-statistic: 4.557
Prob (F-statistic): 0.0131
No. Observations: 91
================================================================================

Multiple Comparison of Means – Tukey HSD,FWER=0.05
=============================================
group1 group2     meandiff    lower       upper    reject
———————————————
High    Low          -0.8289    -3.2843   1.6265    False
High    Medium    -2.6635    -5.2463   -0.0807  True
Low     Medium    -1.8346    -3.5235   -0.1457  True

Here, the P-value from the ANOVA test is 0.0131, which suggests that the null hypothesis can be rejected and that there is a correlation between absolute degree of urbanization and economic growth. Furthermore, the significant difference is between the Low (49 countries) – Medium (31 countries) and Medium – High (11 countries) groups. One caveat remains. Instead of a large sample size (164 countries with the urban population growth data), I am now working with only 91 observations. This might be a problem I need to solve in the near future, but for now, I must say this was a successful conclusion to my assignment.

Notes:

I encountered a fair amount of difficulty trying to write this program. The biggest issues was that I did not realize the UrbanRate4 variable I created with pandas.cut function had NaN values which stopped the Tukey Test function from working properly.

Another thing I learned is that data can easily be manipulated to fit your needs. As a scientist, I always knew it was possible, but I did not realize how frighteningly easy it was. To be honest, I could have made the urban population growth data work with ANOVA depending on intervals I used for the pandas.cut function. I did not want to do that because it would not be answering my own question truthfully and realistically.

I believe any research and analysis must be grounded in real-world observations and data instead of manipulated.

This is my SAS code:

FILENAME REFFILE “/home/wfhsu.taiwan0/my_courses/Data1.xlsx” TERMSTR=CR;
PROC IMPORT DATAFILE=REFFILE
DBMS=XLSX
OUT=Gapminder2010;
GETNAMES=YES;
RUN;

PROC CONTENTS DATA=Gapminder2010; RUN;
/*to importat data from Excel file upload*/

/*LIBNAME describes where the file is located = look in which folder for data*/

DATA new; set Gapminder2010 ;
/*DATA names a new data set for the program to use and which data set in the folder to use as raw data*/

IF GDP2010=”0″ THEN GDP2010=”.” ;
IF GDPGrowth2010=”0″ THEN GDPGrowth2010=”.” ;
IF UrbanPop2010=”0″ THEN UrbanPop2010=”.” ;
IF UrbanPopGrowth2010=”0″ THEN UrbanPopGrowth2010=”.” ;
IF UrbanAgg2007=”0″ THEN UrbanAgg2007=”.” ;
/*Statements remove 0 from data – there are no 0’s in source data*/

IF GDP2010 ne . ;
IF GDPGrowth2010 ne . ;
IF UrbanPop2010 ne . ;
IF UrbanPopGrowth2010 ne . ;
IF UrbanAgg2007 ne . ;
/*Statements remove empty or missing data from table*/

IF UrbanAgg2007 le 20 THEN UrbanRate = 1;
ELSE IF UrbanAgg2007 le 37 THEN UrbanRate = 2;
ELSE IF UrbanAgg2007 le 54 THEN UrbanRate = 3;
/*Creates categorical groups*/

PROC SORT; by country;
PROC PRINT; VAR country GDP2010 GDPGrowth2010 UrbanPop2010 UrbanPopGrowth2010 UrbanRate;
/*Print function to show data for listed variables*/

PROC FREQ; TABLES GDP2010 GDPGrowth2010 UrbanPop2010 UrbanPopGrowth2010 UrbanAgg2007 UrbanRate;

PROC GCHART; VBAR UrbanRate/discrete type=mean SUMVAR=GDPGrowth2010;

PROC ANOVA; CLASS UrbanRate;
MODEL GDPGrowth2010=UrbanRate;
MEANS UrbanRate/DUNCAN;

RUN;