Career, Cities, Coursera, Data Analysis, Personal, Research, Society
Leave a Comment

Data Management – Missing Values

Continuing my course in Data Management and Visualization with Coursera…

For those that haven’t followed on my previous posts, I am currently enrolled in a Data Analysis Specialization with Wesleyan University through Coursera. With data from Gapminder, I am exploring a broad and basic question: does urbanization drive economic growth? For those of you interested in reading my literature review to gain a background in this project, please visit this page.

Though I am working in SAS and Python in an attempt to learn both, I will only be presenting my work in SAS here (though I will also include my Python code for reference). The output format in SAS is easier on the eyes in my opinion.

For this third assignment, I was originally going to try to calculate the urban population growth rate for 2010. Instead, I found the data available on Gapminder so I did not have to create a secondary variable for population growth. After inserting the latest variable, I decided to code out missing data. Since my focus is on GDP, GDP Growth, Urban Population, and Urban Population Growth, I had to make sure only countries that had all four variables were included in the data set. Furthermore, though the histograms showed distinct clusters of countries, the distributions are not consistent across these four variables. Thus, I have yet to decide how to subset the data. Perhaps after further analysis I will have a better idea. For now, the only step I took was to make sure I only had countries with all four variables in my data set.

This is my code in SAS:

FILENAME REFFILE “/home/wfhsu.taiwan0/my_courses/Data1.xlsx” TERMSTR=CR;
PROC IMPORT DATAFILE=REFFILE
DBMS=XLSX
OUT=Gapminder2010;
GETNAMES=YES;
RUN;

PROC CONTENTS DATA=Gapminder2010; RUN;
/*to importat data from Excel file upload*/

LIBNAME mydata “/saswork/SAS_work2EC30000E95E_odaws04-prod-us/SAS_work9F7B0000E95E_odaws04-prod-us ” access=readonly;
/*LIBNAME describes where the file is located = look in which folder for data*/

DATA new; set Gapminder2010 ;
/*DATA names a new data set for the program to use and which data set in the folder to use as raw data*/

LABEL ‘GDP per Capita 2010’n=”GDP Per Capita”
LABEL ‘GDP Growth 2010’n=”GDP Growth”
LABEL ‘Urban Population 2010’n=”Urban Population”
LABEL ‘Urban Pop Growth’n=”Urban Pop Growth”
LABEL ‘Urban Pop in Agglomerations over’n=”Urban Pop in Large Agglomerations”;

IF ‘GDP per Capita 2010’n ne .;
IF ‘GDP Growth 2010’n ne .;
IF ‘Urban Population 2010’n ne .;
IF ‘Urban Pop Growth’n ne .;
/*Statements remove empty or missing data from table*/

PROC SORT; by country;
PROC PRINT; VAR country ‘GDP per Capita 2010’n ‘GDP Growth 2010’n ‘Urban Population 2010’n ‘Urban Pop Growth’n;
/*Print function to show data for listed variables*/

PROC FREQ; TABLES ‘GDP per Capita 2010’n ‘GDP Growth 2010’n ‘Urban Population 2010’n ‘Urban Pop Growth’n ‘Urban Pop in Agglomerations over’n;
PROC UNIVARIATE;
ods select Histogram;
var ‘GDP per Capita 2010’n ‘GDP Growth 2010’n ‘Urban Population 2010’n ‘Urban Pop Growth’n ‘Urban Pop in Agglomerations over’n;
histogram ‘GDP per Capita 2010’n ‘GDP Growth 2010’n ‘Urban Population 2010’n ‘Urban Pop Growth’n ‘Urban Pop in Agglomerations over’n;
RUN;

This is my code in Python:

import pandas
import numpy

gapminder = pandas.read_csv(‘Data1.csv’, low_memory=False)

gapminder[‘GDP per Capita 2010’] = pandas.to_numeric(gapminder[‘GDP per Capita 2010’])
gapminder[‘GDP Growth 2010’] = pandas.to_numeric(gapminder[‘GDP Growth 2010’])
gapminder[‘Urban Population 2010’] = pandas.to_numeric(gapminder[‘Urban Population 2010’])
gapminder[‘Urban Pop Growth’] = pandas.to_numeric(gapminder[‘Urban Pop Growth’])
gapminder[‘Urban Pop in Agglomerations over 1 Million People 2007’] = pandas.to_numeric(gapminder[‘Urban Pop in Agglomerations over 1 Million People 2007’])

#dropna() removes rows with any NaN values
gapminder = gapminder.dropna(subset = [‘GDP per Capita 2010′,’GDP Growth 2010′,’Urban Population 2010′,’Urban Pop Growth’])
gapminder = gapminder[[‘Country’,’GDP per Capita 2010′, ‘GDP Growth 2010’, ‘Urban Population 2010’, ‘Urban Pop Growth’, ‘Urban Pop in Agglomerations over 1 Million People 2007’]]

print (gapminder)

print(“GDP per Capita”)
g1 = gapminder[“GDP per Capita 2010”].value_counts(sort=False)
print (g1)

print(“Percentages of GDP per Capita”)
g2 = gapminder[“GDP per Capita 2010”].value_counts(sort=False, normalize=True)
print (g2)

print(“GDP Growth”)
g3 = gapminder[“GDP Growth 2010”].value_counts(sort=False)
print (g3)

print(“Percentages of GDP Growth”)
g4 = gapminder[“GDP Growth 2010”].value_counts(sort=False, normalize=True)
print (g4)

print(“Urban Population”)
g5 = gapminder[“Urban Population 2010”].value_counts(sort=False)
print (g5)

print(“Percentages of Urban Population”)
g6 = gapminder[“Urban Population 2010”].value_counts(sort=False, normalize=True)
print (g6)

print(“Urban Populatin Growth”)
g7 = gapminder[“Urban Pop Growth”].value_counts(sort=False)
print (g7)

print(“Percentages of Urban Pop Growth”)
g8 = gapminder[“Urban Pop Growth”].value_counts(sort=False, normalize=True)
print (g8)

print(“Urban Population in Large Aggolmerations”)
g9 = gapminder[“Urban Pop in Agglomerations over 1 Million People 2007”].value_counts(sort=False)
print (g9)

print(“Percentages of Urban Pop in Large Agglomerations”)
g10 = gapminder[“Urban Pop in Agglomerations over 1 Million People 2007”].value_counts(sort=False, normalize=True)
print (g10)

Count Table:

count

Frequency Tables:

GDPCapita2010clean

GDPGrowthClean

UrbanPopClean

UrbanPopGrowthClean

disturbanpopgrowth

The other distribution graphs can be seen here.

Unlike the first frequency tables, these do not show missing values. Furthermore, because some countries might have data for GDP but not other variables or only a combination of variables instead of all four, the original missing values counts were not very informative (there were values of 93, 96, and 65). By coding out countries without data for all four variables, I ended up with 126 observations out of the original 275. This measure is more meaningful – 149 countries do not have data for all four variables. Finally because all four variables have unique values for each country, each value has a frequency of 1.

Though I am still intent on working in the Urban Population in Agglomerations Over 1 Million People, that variable has the greatest missing values count at 167. If I coded out countries that were also missing this data, I would have ended up with a list of only 64 countries, which would be much smaller with more uncertainty if I am to perform analysis. I am still unsure how to proceed with this part of the data set.

Lastly, it is interesting to see the Urban Population Growth in 2010 taking a normal distribution. This is very different from the distribution tables of the other variables (except GDP Growth, which seemed to also take on a normal distribution). For example, GDP had a large cluster of countries on the low end of the spectrum while Urban Population was fairly evenly distributed across the board. I am quite eager to try to unpack these relationships, if there are any, between urbanization and economic growth.

I must say, this has been a really fun project – learning SAS and Python while exploring a topic I have been interested in for the last three to four years.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s