# Chi-Square Testing…*Warning: It’s Painful*

Continuing with Data Analysis Tools

If you have not read my previous posts, I am currently enrolled in a Data Analysis Specialization with Wesleyan University through Coursera. With data from Gapminder, I am exploring a broad and basic question: does urbanization drive economic growth? For those of you interested in reading my literature review to gain a background on this project, please visit this page.

For this assignment, I had to run Chi-Square tests on my variables. As always, both my Python and SAS code are posted below. Since all my data are quantitative, I first had to categorize them. Because I had found a relationship between the absolute measure of urbanization (population in cities with over 1 million people) and GDP growth rate, I decided to categorize GDP growth rate. Additionally, I wanted to see whether there is a relationship between urbanization and the absolute measure of GDP (GDP per capita).

To categorize GDP per capita, I used cut-offs of 5000, 10000, and 100000 to produce three distinct ranks: a country is poor if its GDP per capita is at or below \$5000 (in 2000 USD), medium if it is between 5000 and 10000, and rich if it is above 10000 (with 100000 as the ceiling). This was based on the GDP statistics from earlier assignments (mean of 7541, median of 2228, SD of 11523). I believe these cut-offs reflect the fact that most countries are poor (63 out of 99) and that any country at roughly one standard deviation or above is rich.

As for GDP growth rate, I used cut-offs of -7, 0, 2, 4, and 25. I included negative growth values because I am interested in whether there is a relationship there. Otherwise, since most economists seem to agree that 2 to 4% GDP growth is ideal, anything below that is low economic growth and anything above it is high. These cut-offs actually divided the countries fairly evenly between low, medium, and high growth rates.

GDP Growth Rate – # of Countries by Ranking

| Ranking | # of Countries |
|---------|----------------|
| None    | 12             |
| Low     | 29             |
| Medium  | 20             |
| High    | 38             |
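To make the binning rule concrete, here is a small sketch of how `pandas.cut` treats the growth-rate cut-offs. The values are made up for illustration and are not Gapminder data. By default each bin is open on the left and closed on the right, so a value exactly equal to the lowest edge (-7) or above the highest edge (25) ends up missing.

```python
import pandas

# Toy GDP growth values (hypothetical, not taken from the Gapminder data)
growth = pandas.Series([-7.0, -3.2, 0.0, 1.5, 2.0, 3.1, 4.0, 9.8, 30.0])

# Same bin edges and labels as used for GDP growth rate in this post
ranks = pandas.cut(growth, [-7, 0, 2, 4, 25],
                   labels=["None", "Low", "Medium", "High"])

# Show each value next to the rank it was assigned
print(pandas.DataFrame({"growth": growth, "rank": ranks}))

# Values outside (-7, 25] are left uncategorized (NaN)
print("uncategorized:", ranks.isna().sum())
```

If a country sitting exactly on the lowest edge should be kept, `pandas.cut` accepts `include_lowest=True`.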

Using the crosstab function, I created two tables for my variables: one between urbanization and GDP, and one between urbanization and GDP growth. As the tables show, there appears to be some relationship between urbanization and both the absolute GDP and the GDP growth rate of a country. This was confirmed by the chi-square tests, with p-values of 0.00217 and 0.009138 respectively. Because both p-values are below the 0.05 threshold, I need to perform post-hoc tests.
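For readers who have not run this test before, here is a minimal sketch of a chi-square test of independence on a 3×3 table. The counts are invented for illustration and are not my results. `scipy.stats.chi2_contingency` returns the chi-square statistic, the p-value, the degrees of freedom, and the expected counts under independence.

```python
import pandas
from scipy import stats

# Hypothetical counts: rows are GDP ranks, columns are urbanization ranks
table = pandas.DataFrame(
    [[30, 20, 10],
     [10, 15, 10],
     [5, 10, 20]],
    index=["Poor", "Middle", "Rich"],
    columns=["Low", "Medium", "High"],
)

# Column percentages: share of each GDP rank within an urbanization rank
print(table / table.sum(axis=0))

# Chi-square test of independence on the whole table
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi-square = {chi2:.3f}, p = {p:.5f}, dof = {dof}")

# Expected counts if GDP rank and urbanization rank were unrelated
print(pandas.DataFrame(expected, index=table.index, columns=table.columns).round(1))
```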

This is the hard, or rather tedious, part of the process. With Python and SAS, there are no easy formulas or programs to run to perform all the pair-wise comparisons simultaneously. I must admit, I am not very familiar with Chi-Square tests and their post-hoc tests, so if anyone sees a mistake, please correct me! It will be much appreciated, especially since I am working with multiple levels in both my response and explanatory variables.
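One way to cut down on the repetition might be to loop over every pair of levels. The following is only a sketch under assumptions, not the code I ran: the helper name `pairwise_chi2` is hypothetical, the levels are assumed to be coded as integers, and the counts are made up.

```python
from itertools import combinations

import pandas
from scipy import stats


def pairwise_chi2(df, row_var, col_var, alpha=0.05):
    """Run a chi-square test on every 2x2 sub-table of row_var x col_var."""
    row_levels = sorted(df[row_var].dropna().unique())
    col_levels = sorted(df[col_var].dropna().unique())
    results = []
    for r_pair in combinations(row_levels, 2):
        for c_pair in combinations(col_levels, 2):
            # Keep only the rows belonging to the two levels being compared
            keep = df[row_var].isin(list(r_pair)) & df[col_var].isin(list(c_pair))
            table = pandas.crosstab(df.loc[keep, row_var], df.loc[keep, col_var])
            chi2, p, dof, expected = stats.chi2_contingency(table)
            results.append({"rows": r_pair, "cols": c_pair, "chi2": chi2, "p": p})
    out = pandas.DataFrame(results)
    # Bonferroni adjustment: divide alpha by the number of comparisons
    out["significant"] = out["p"] < alpha / len(out)
    return out


# Hypothetical counts keyed by (GDP level, urbanization level)
counts = {(1, 1): 30, (1, 2): 20, (1, 3): 10,
          (2, 1): 10, (2, 2): 15, (2, 3): 10,
          (3, 1): 5, (3, 2): 10, (3, 3): 20}
rows = [pair for pair, n in counts.items() for _ in range(n)]
toy = pandas.DataFrame(rows, columns=["GDP", "UrbanRate"])

result = pairwise_chi2(toy, "GDP", "UrbanRate")
print(result)
```

Note that `chi2_contingency` applies Yates' continuity correction to 2×2 tables by default, so the values from a loop like this should match the one-at-a-time approach as long as both use the same defaults.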

With GDP and urbanization rate, I worked with a 3×3 table and nine pair-wise comparisons. This meant an adjusted p-value threshold of 0.05/9, or 0.0056. After running the program, two comparisons were significant at that threshold, and both contrast countries with a low urbanization rate against those with a high urbanization rate.

URCOMP1v3
GDPCOMP1v2
chi-square value              p value
10.536585365853659        0.0011703444685277223

URCOMP1v3
GDPCOMP1v3
chi-square value              p value
7.9856792531434628       0.0047148802335932527

On the other hand, GDP growth rate was a 3×4 table and had 18 pair-wise comparisons. The adjusted p-value threshold is 0.05/18, or 0.0028. Here only one comparison stood out from the rest: the one between countries with low urbanization rates and those with medium urbanization rates. It is a borderline result, though, since its p-value of 0.002796 is under the rounded 0.0028 but marginally above the unrounded 0.05/18 ≈ 0.002778. The effect is most pronounced in producing either negative GDP growth rates or very high GDP growth rates.

URCOMP1v2
GRCOMP1v4
chi-square value              p value
8.9358990147783235       0.0027961980818559736
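The comparison counts and adjusted thresholds used for both tables can be checked with a few lines of arithmetic. This is a small sketch; the function name `bonferroni_threshold` is my own and not part of any library. Each pair-wise comparison picks two levels from one variable and two from the other, so the count is the product of the two "choose 2" values.

```python
from math import comb


def bonferroni_threshold(n_row_levels, n_col_levels, alpha=0.05):
    """Return (number of 2x2 pair-wise comparisons, Bonferroni-adjusted alpha)."""
    n_comparisons = comb(n_row_levels, 2) * comb(n_col_levels, 2)
    return n_comparisons, alpha / n_comparisons


for name, shape in [("GDP x urbanization", (3, 3)),
                    ("GDP growth x urbanization", (4, 3))]:
    k, adjusted = bonferroni_threshold(*shape)
    print(f"{name}: {k} comparisons, adjusted threshold = {adjusted:.6f}")
```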

These findings seem to agree with the ANOVA findings, where there is a relationship between urbanization and the economy. This effect is most significant in countries with either low urbanization or high urbanization. I believe that as we move on to quantitative-to-quantitative tests, more light can be shed on the relationships. By categorizing the data, I influence the analysis because I am choosing the cut-off points. So, these results may not reflect the true nature of the relationship between urbanization and economic growth.
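To illustrate how much the choice of cut-off points can matter, here is a sketch on synthetic data (randomly generated, not Gapminder). The same two quantitative variables are binned twice, with only the cut-offs of the response changed, and the chi-square p-value is recomputed each time. The helper name `chi2_p` and both sets of edges are assumptions made for the example.

```python
import numpy
import pandas
from scipy import stats

rng = numpy.random.default_rng(0)

# Synthetic data: y loosely increases with x (an assumption for illustration)
x = pandas.Series(rng.uniform(0, 50, size=300))
y = pandas.Series(0.05 * x + rng.normal(0, 2, size=300))


def chi2_p(y_edges):
    """Bin x with fixed edges and y with the given edges, then test independence."""
    x_cat = pandas.cut(x, [0, 20, 37, 54])
    y_cat = pandas.cut(y, y_edges)
    table = pandas.crosstab(x_cat, y_cat)
    chi2, p, dof, expected = stats.chi2_contingency(table)
    return p


# Two different, equally arbitrary, sets of cut-offs for y
p_a = chi2_p([-numpy.inf, 0, 2, numpy.inf])
p_b = chi2_p([-numpy.inf, 1, 4, numpy.inf])
print("p-value with cut-offs 0 and 2:", p_a)
print("p-value with cut-offs 1 and 4:", p_b)
```

The two p-values come from identical underlying data, so any difference between them is due to the binning alone.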

This is my code in Python:

import pandas
import numpy
import scipy.stats

#gapminder is assumed to be already loaded as a pandas DataFrame
#Keep only countries with complete data for the variables of interest
gapminder = gapminder.dropna(subset=['GDP2010', 'GDPGrowth2010', 'UrbanPop2010', 'UrbanPopGrowth2010', 'UrbanAgg2007'])

#Categorize urbanization rate
gapminder['UrbanRate4'] = pandas.cut(gapminder.UrbanAgg2007, [0, 20, 37, 54], labels=["Low", "Medium", "High"])
UrbanRate = gapminder['UrbanRate4'].value_counts(sort=False)
print(UrbanRate)

#Categorize GDP per capita
gapminder['GDP'] = pandas.cut(gapminder.GDP2010, [0, 5000, 10000, 100000], labels=["Poor", "Middle", "Rich"])
GDP = gapminder['GDP'].value_counts(sort=False)
print(GDP)

#Categorize GDP growth rate
gapminder['GDPRate4'] = pandas.cut(gapminder.GDPGrowth2010, [-7, 0, 2, 4, 25], labels=["None", "Low", "Medium", "High"])
GDPRate = gapminder['GDPRate4'].value_counts(sort=False)
print(GDPRate)

#Drop countries that fell outside the cut-offs
gapminder = gapminder.dropna(subset=['UrbanRate4', 'GDP', 'GDPRate4'])

#Crosstab for table of values
cross1 = pandas.crosstab(gapminder['GDP'], gapminder['UrbanRate4'])
print(cross1)

cross2 = pandas.crosstab(gapminder['GDPRate4'], gapminder['UrbanRate4'])
print(cross2)

#Sum Column and Divide each value in cell by sum for percentages
sum1 = cross1.sum(axis=0)
pct1 = cross1/sum1
print(pct1)

sum2 = cross2.sum(axis=0)
pct2 = cross2/sum2
print(pct2)

#Chi-Square Test
print('chi-square value, p value, degrees of freedom, expected counts')
cs1 = scipy.stats.chi2_contingency(cross1)
print(cs1)

print('chi-square value, p value, degrees of freedom, expected counts')
cs2 = scipy.stats.chi2_contingency(cross2)
print(cs2)
First Post-Hoc Test (GDP and Urbanization Rate):

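#Numeric level codes (1 = lowest category) so the recode dictionaries below can match on them
urban_level = gapminder['UrbanRate4'].cat.codes + 1
gdp_level = gapminder['GDP'].cat.codes + 1

#Each recode keeps two levels; all other levels become missing and drop out of the crosstab
recode12 = {1: 1, 2: 2}
recode13 = {1: 1, 3: 3}
recode23 = {2: 2, 3: 3}

gapminder['GDPCOMP1v2'] = gdp_level.map(recode12)
gapminder['GDPCOMP1v3'] = gdp_level.map(recode13)
gapminder['GDPCOMP2v3'] = gdp_level.map(recode23)

#Post Hoc Test 1:2 v 1:2
gapminder['URCOMP1v2'] = urban_level.map(recode12)

ct1 = pandas.crosstab(gapminder['GDPCOMP1v2'], gapminder['URCOMP1v2'])
print(ct1)

sum1 = ct1.sum(axis=0)
pct1 = ct1/sum1
print(pct1)

print('chi-square value, p value, degrees of freedom, expected counts')
ct1 = scipy.stats.chi2_contingency(ct1)
print(ct1)

#Post Hoc Test 1:2 v 1:3
ct2 = pandas.crosstab(gapminder['GDPCOMP1v3'], gapminder['URCOMP1v2'])
print(ct2)

sum2 = ct2.sum(axis=0)
pct2 = ct2/sum2
print(pct2)

print('chi-square value, p value, degrees of freedom, expected counts')
ct2 = scipy.stats.chi2_contingency(ct2)
print(ct2)

#Post Hoc Test 1:2 v 2:3
ct3 = pandas.crosstab(gapminder['GDPCOMP2v3'], gapminder['URCOMP1v2'])
print(ct3)

sum3 = ct3.sum(axis=0)
pct3 = ct3/sum3
print(pct3)

print('chi-square value, p value, degrees of freedom, expected counts')
ct3 = scipy.stats.chi2_contingency(ct3)
print(ct3)

#Post Hoc Test 1:3 v 1:2
gapminder['URCOMP1v3'] = urban_level.map(recode13)

ct4 = pandas.crosstab(gapminder['GDPCOMP1v2'], gapminder['URCOMP1v3'])
print(ct4)

sum4 = ct4.sum(axis=0)
pct4 = ct4/sum4
print(pct4)

print('chi-square value, p value, degrees of freedom, expected counts')
ct4 = scipy.stats.chi2_contingency(ct4)
print(ct4)

#Post Hoc Test 1:3 v 1:3
ct5 = pandas.crosstab(gapminder['GDPCOMP1v3'], gapminder['URCOMP1v3'])
print(ct5)

sum5 = ct5.sum(axis=0)
pct5 = ct5/sum5
print(pct5)

print('chi-square value, p value, degrees of freedom, expected counts')
ct5 = scipy.stats.chi2_contingency(ct5)
print(ct5)

#Post Hoc Test 1:3 v 2:3
ct6 = pandas.crosstab(gapminder['GDPCOMP2v3'], gapminder['URCOMP1v3'])
print(ct6)

sum6 = ct6.sum(axis=0)
pct6 = ct6/sum6
print(pct6)

print('chi-square value, p value, degrees of freedom, expected counts')
ct6 = scipy.stats.chi2_contingency(ct6)
print(ct6)

#Post Hoc Test 2:3 v 1:2
gapminder['URCOMP2v3'] = urban_level.map(recode23)

ct7 = pandas.crosstab(gapminder['GDPCOMP1v2'], gapminder['URCOMP2v3'])
print(ct7)

sum7 = ct7.sum(axis=0)
pct7 = ct7/sum7
print(pct7)

print('chi-square value, p value, degrees of freedom, expected counts')
ct7 = scipy.stats.chi2_contingency(ct7)
print(ct7)

#Post Hoc Test 2:3 v 1:3
ct8 = pandas.crosstab(gapminder['GDPCOMP1v3'], gapminder['URCOMP2v3'])
print(ct8)

sum8 = ct8.sum(axis=0)
pct8 = ct8/sum8
print(pct8)

print('chi-square value, p value, degrees of freedom, expected counts')
ct8 = scipy.stats.chi2_contingency(ct8)
print(ct8)

#Post Hoc Test 2:3 v 2:3
ct9 = pandas.crosstab(gapminder['GDPCOMP2v3'], gapminder['URCOMP2v3'])
print(ct9)

sum9 = ct9.sum(axis=0)
pct9 = ct9/sum9
print(pct9)

print('chi-square value, p value, degrees of freedom, expected counts')
ct9 = scipy.stats.chi2_contingency(ct9)
print(ct9)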

Second Post-Hoc Test (GDP Growth Rate and Urbanization Rate):

#Numeric level codes (1 = lowest category) so the recode dictionaries below can match on them
urban_level = gapminder['UrbanRate4'].cat.codes + 1
growth_level = gapminder['GDPRate4'].cat.codes + 1

#Post Hoc Test 1:2 v 1:2
recode12 = {1: 1, 2: 2}
gapminder['URCOMP1v2'] = urban_level.map(recode12)
gapminder['GRCOMP1v2'] = growth_level.map(recode12)

ct1 = pandas.crosstab(gapminder['GRCOMP1v2'], gapminder['URCOMP1v2'])
print(ct1)

sum1 = ct1.sum(axis=0)
pct1 = ct1/sum1
print(pct1)

print('chi-square value, p value, degrees of freedom, expected counts')
ct1 = scipy.stats.chi2_contingency(ct1)
print(ct1)

#Post Hoc Test 1:2 v 1:3
recode13 = {1: 1, 3: 3}
gapminder['GRCOMP1v3'] = growth_level.map(recode13)

ct2 = pandas.crosstab(gapminder['GRCOMP1v3'], gapminder['URCOMP1v2'])
print(ct2)

sum2 = ct2.sum(axis=0)
pct2 = ct2/sum2
print(pct2)

print('chi-square value, p value, degrees of freedom, expected counts')
ct2 = scipy.stats.chi2_contingency(ct2)
print(ct2)

#Post Hoc Test 1:2 v 1:4
recode14 = {1: 1, 4: 4}
gapminder['GRCOMP1v4'] = growth_level.map(recode14)

ct3 = pandas.crosstab(gapminder['GRCOMP1v4'], gapminder['URCOMP1v2'])
print(ct3)

sum3 = ct3.sum(axis=0)
pct3 = ct3/sum3
print(pct3)

print('chi-square value, p value, degrees of freedom, expected counts')
ct3 = scipy.stats.chi2_contingency(ct3)
print(ct3)

#Post Hoc Test 1:2 v 2:3
recode23 = {2: 2, 3: 3}
gapminder['GRCOMP2v3'] = growth_level.map(recode23)

ct4 = pandas.crosstab(gapminder['GRCOMP2v3'], gapminder['URCOMP1v2'])
print(ct4)

sum4 = ct4.sum(axis=0)
pct4 = ct4/sum4
print(pct4)

print('chi-square value, p value, degrees of freedom, expected counts')
ct4 = scipy.stats.chi2_contingency(ct4)
print(ct4)

#Post Hoc Test 1:2 v 2:4
recode24 = {2: 2, 4: 4}
gapminder['GRCOMP2v4'] = growth_level.map(recode24)

ct5 = pandas.crosstab(gapminder['GRCOMP2v4'], gapminder['URCOMP1v2'])
print(ct5)

sum5 = ct5.sum(axis=0)
pct5 = ct5/sum5
print(pct5)

print('chi-square value, p value, degrees of freedom, expected counts')
ct5 = scipy.stats.chi2_contingency(ct5)
print(ct5)

#Post Hoc Test 1:2 v 3:4
recode34 = {3: 3, 4: 4}
gapminder['GRCOMP3v4'] = growth_level.map(recode34)

ct6 = pandas.crosstab(gapminder['GRCOMP3v4'], gapminder['URCOMP1v2'])
print(ct6)

sum6 = ct6.sum(axis=0)
pct6 = ct6/sum6
print(pct6)

print('chi-square value, p value, degrees of freedom, expected counts')
ct6 = scipy.stats.chi2_contingency(ct6)
print(ct6)

#Post Hoc Test 1:3 v 1:2
gapminder['URCOMP1v3'] = urban_level.map(recode13)

ct7 = pandas.crosstab(gapminder['GRCOMP1v2'], gapminder['URCOMP1v3'])
print(ct7)

sum7 = ct7.sum(axis=0)
pct7 = ct7/sum7
print(pct7)

print('chi-square value, p value, degrees of freedom, expected counts')
ct7 = scipy.stats.chi2_contingency(ct7)
print(ct7)

#Post Hoc Test 1:3 v 1:3
ct8 = pandas.crosstab(gapminder['GRCOMP1v3'], gapminder['URCOMP1v3'])
print(ct8)

sum8 = ct8.sum(axis=0)
pct8 = ct8/sum8
print(pct8)

print('chi-square value, p value, degrees of freedom, expected counts')
ct8 = scipy.stats.chi2_contingency(ct8)
print(ct8)

#Post Hoc Test 1:3 v 1:4
ct9 = pandas.crosstab(gapminder['GRCOMP1v4'], gapminder['URCOMP1v3'])
print(ct9)

sum9 = ct9.sum(axis=0)
pct9 = ct9/sum9
print(pct9)

print('chi-square value, p value, degrees of freedom, expected counts')
ct9 = scipy.stats.chi2_contingency(ct9)
print(ct9)

#Post Hoc Test 1:3 v 2:3
ct10 = pandas.crosstab(gapminder['GRCOMP2v3'], gapminder['URCOMP1v3'])
print(ct10)

sum10 = ct10.sum(axis=0)
pct10 = ct10/sum10
print(pct10)

print('chi-square value, p value, degrees of freedom, expected counts')
ct10 = scipy.stats.chi2_contingency(ct10)
print(ct10)

#Post Hoc Test 1:3 v 2:4
ct11 = pandas.crosstab(gapminder['GRCOMP2v4'], gapminder['URCOMP1v3'])
print(ct11)

sum11 = ct11.sum(axis=0)
pct11 = ct11/sum11
print(pct11)

print('chi-square value, p value, degrees of freedom, expected counts')
ct11 = scipy.stats.chi2_contingency(ct11)
print(ct11)

#Post Hoc Test 1:3 v 3:4
ct12 = pandas.crosstab(gapminder['GRCOMP3v4'], gapminder['URCOMP1v3'])
print(ct12)

sum12 = ct12.sum(axis=0)
pct12 = ct12/sum12
print(pct12)

print('chi-square value, p value, degrees of freedom, expected counts')
ct12 = scipy.stats.chi2_contingency(ct12)
print(ct12)

#Post Hoc Test 2:3 v 1:2
gapminder['URCOMP2v3'] = urban_level.map(recode23)

ct13 = pandas.crosstab(gapminder['GRCOMP1v2'], gapminder['URCOMP2v3'])
print(ct13)

sum13 = ct13.sum(axis=0)
pct13 = ct13/sum13
print(pct13)

print('chi-square value, p value, degrees of freedom, expected counts')
ct13 = scipy.stats.chi2_contingency(ct13)
print(ct13)

#Post Hoc Test 2:3 v 1:3
ct14 = pandas.crosstab(gapminder['GRCOMP1v3'], gapminder['URCOMP2v3'])
print(ct14)

sum14 = ct14.sum(axis=0)
pct14 = ct14/sum14
print(pct14)

print('chi-square value, p value, degrees of freedom, expected counts')
ct14 = scipy.stats.chi2_contingency(ct14)
print(ct14)

#Post Hoc Test 2:3 v 1:4
ct15 = pandas.crosstab(gapminder['GRCOMP1v4'], gapminder['URCOMP2v3'])
print(ct15)

sum15 = ct15.sum(axis=0)
pct15 = ct15/sum15
print(pct15)

print('chi-square value, p value, degrees of freedom, expected counts')
ct15 = scipy.stats.chi2_contingency(ct15)
print(ct15)

#Post Hoc Test 2:3 v 2:3
ct16 = pandas.crosstab(gapminder['GRCOMP2v3'], gapminder['URCOMP2v3'])
print(ct16)

sum16 = ct16.sum(axis=0)
pct16 = ct16/sum16
print(pct16)

print('chi-square value, p value, degrees of freedom, expected counts')
ct16 = scipy.stats.chi2_contingency(ct16)
print(ct16)

#Post Hoc Test 2:3 v 2:4
ct17 = pandas.crosstab(gapminder['GRCOMP2v4'], gapminder['URCOMP2v3'])
print(ct17)

sum17 = ct17.sum(axis=0)
pct17 = ct17/sum17
print(pct17)

print('chi-square value, p value, degrees of freedom, expected counts')
ct17 = scipy.stats.chi2_contingency(ct17)
print(ct17)

#Post Hoc Test 2:3 v 3:4
ct18 = pandas.crosstab(gapminder['GRCOMP3v4'], gapminder['URCOMP2v3'])
print(ct18)

sum18 = ct18.sum(axis=0)
pct18 = ct18/sum18
print(pct18)

print('chi-square value, p value, degrees of freedom, expected counts')
ct18 = scipy.stats.chi2_contingency(ct18)
print(ct18)

This is my code in SAS (GDP and Urbanization Rate Only):

FILENAME REFFILE "/home/wfhsu.taiwan0/my_courses/Data1.xlsx" TERMSTR=CR;
PROC IMPORT DATAFILE=REFFILE
DBMS=XLSX
OUT=Gapminder2010;
GETNAMES=YES;
RUN;

PROC CONTENTS DATA=Gapminder2010; RUN;

DATA new; set Gapminder2010 ;

/* Treat zeros as missing values (the variables are numeric) */
IF GDP2010=0 THEN GDP2010=. ;
IF GDPGrowth2010=0 THEN GDPGrowth2010=. ;
IF UrbanPop2010=0 THEN UrbanPop2010=. ;
IF UrbanPopGrowth2010=0 THEN UrbanPopGrowth2010=. ;
IF UrbanAgg2007=0 THEN UrbanAgg2007=. ;

IF CMISS(of _all_) THEN delete;

IF UrbanAgg2007 le 20 THEN UrbanRate = 1;
ELSE IF UrbanAgg2007 le 37 THEN UrbanRate = 2;
ELSE IF UrbanAgg2007 le 54 THEN UrbanRate = 3;

IF GDP2010 le 5000 THEN GDP = 1;
ELSE IF GDP2010 le 10000 THEN GDP = 2;
ELSE IF GDP2010 le 100000 THEN GDP = 3;

PROC SORT; by country;
PROC PRINT; VAR country GDP2010 GDPGrowth2010 UrbanPop2010 UrbanPopGrowth2010 UrbanRate GDP;

PROC FREQ; TABLES GDP*UrbanRate/CHISQ;

DATA COMPARIONS1; SET NEW;
IF UrbanRate = 1 or UrbanRate = 2;
IF GDP = 1 or GDP = 2;
PROC SORT; By Country;
PROC FREQ; TABLES GDP*UrbanRate/CHISQ;

DATA COMPARIONS2; SET NEW;
IF UrbanRate = 1 or UrbanRate = 2;
IF GDP = 1 or GDP = 3;
PROC SORT; By Country;
PROC FREQ; TABLES GDP*UrbanRate/CHISQ;

DATA COMPARIONS3; SET NEW;
IF UrbanRate = 1 or UrbanRate = 2;
IF GDP = 2 or GDP = 3;
PROC SORT; By Country;
PROC FREQ; TABLES GDP*UrbanRate/CHISQ;

DATA COMPARIONS4; SET NEW;
IF UrbanRate = 1 or UrbanRate = 3;
IF GDP = 1 or GDP = 2;
PROC SORT; By Country;
PROC FREQ; TABLES GDP*UrbanRate/CHISQ;

DATA COMPARIONS5; SET NEW;
IF UrbanRate = 1 or UrbanRate = 3;
IF GDP = 1 or GDP = 3;
PROC SORT; By Country;
PROC FREQ; TABLES GDP*UrbanRate/CHISQ;

DATA COMPARIONS6; SET NEW;
IF UrbanRate = 1 or UrbanRate = 3;
IF GDP = 2 or GDP = 3;
PROC SORT; By Country;
PROC FREQ; TABLES GDP*UrbanRate/CHISQ;

DATA COMPARIONS7; SET NEW;
IF UrbanRate = 2 or UrbanRate = 3;
IF GDP = 1 or GDP = 2;
PROC SORT; By Country;
PROC FREQ; TABLES GDP*UrbanRate/CHISQ;

DATA COMPARIONS8; SET NEW;
IF UrbanRate = 2 or UrbanRate = 3;
IF GDP = 1 or GDP = 3;
PROC SORT; By Country;
PROC FREQ; TABLES GDP*UrbanRate/CHISQ;

DATA COMPARIONS9; SET NEW;
IF UrbanRate = 2 or UrbanRate = 3;
IF GDP = 2 or GDP = 3;
PROC SORT; By Country;
PROC FREQ; TABLES GDP*UrbanRate/CHISQ;

RUN;