# All posts tagged: Urbanization

## k-Means Cluster Analysis – Machine Learning

Machine Learning Data Analysis

This is the last lesson of the fourth course of my Data Analysis and Interpretation Specialization by Wesleyan University through Coursera. If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development and am posing the general question of whether urbanization drives economic growth. For this assignment, the goal is to run a k-Means Cluster Analysis using my variables: Urban Population, Urban Population Growth, GDP Growth, Population Growth, Employment Rate, and Energy Use per Capita in 2007. Here, GDP per Capita in 2007 is used as the validation variable. I am trying to identify whether there are clusters of characteristics that are associated with certain values of GDP per Capita, based on national data from 2007. As before, the data is split into 70% training data and 30% test data. However, the k-means cluster analysis will only be run on the training data set. The Elbow Curve Graph shows that 2, 3, and 4 clusters could be interpreted, though it is …
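To make the elbow-curve idea concrete, here is a minimal sketch in plain Python. It is not the code from the assignment: the two-dimensional points, the helper names `kmeans` and `mean_distance`, and the three made-up cluster centers are all assumptions for illustration. It splits the data 70/30, fits k-means on the training portion only, and prints the average distance to the nearest centroid for k = 1 to 6, which is the quantity an elbow curve plots.

```python
import math
import random


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns the final list of centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members)
                                     for col in zip(*members)))
            else:
                updated.append(centroids[c])  # empty cluster: keep old centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids


def mean_distance(points, centroids):
    """Average distance from each point to its closest centroid."""
    return sum(min(math.dist(p, c) for c in centroids)
               for p in points) / len(points)


# Hypothetical data: three well-separated blobs of 40 points each.
rng = random.Random(42)
centers = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in centers for _ in range(40)]
rng.shuffle(points)

# 70% training data, 30% test data; clustering uses the training set only.
n_train = round(0.7 * len(points))
train, test = points[:n_train], points[n_train:]

elbow = {k: mean_distance(train, kmeans(train, k)) for k in range(1, 7)}
for k, d in elbow.items():
    print(k, round(d, 3))
```

The "elbow" is the value of k after which the average distance stops dropping sharply. With real indicators the variables would normally be standardized first so that no single variable dominates the distance.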

## Lasso Regression – Machine Learning

Machine Learning Data Analysis

This is the third lesson of the fourth course of my Data Analysis and Interpretation Specialization by Wesleyan University through Coursera. If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development and am posing the general question of whether urbanization drives economic growth. For this assignment, the goal is to run a Lasso Regression that identifies the impact of each of my explanatory variables: Urban Population, Urban Population Growth, GDP Growth, Population Growth, Employment Rate, and Energy Use per Capita in 2007. Because lasso is a linear regression model, I am able to use a quantitative response variable. Unlike the previous lesson, I can use GDP per Capita 2007 as is, without having to convert it into a categorical variable. This time, the training data set is 70% and the test data set is 30% of the original data, which means there are 100 observations in my training data set vs. 43 in my test data set. `pred_train.shape = (100, 6)` …
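The split sizes quoted in these posts can be checked with a little arithmetic. The sketch below assumes the test share is rounded up to a whole observation and the remainder goes to training, which to my knowledge is how scikit-learn's `train_test_split` behaves; `split_sizes` is a hypothetical helper, not part of any library.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), rounding the test share up (assumed rule)."""
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test


# 143 observations in total, as stated in the posts.
sizes_30 = split_sizes(143, 0.3)  # 70% training / 30% test
sizes_40 = split_sizes(143, 0.4)  # 60% training / 40% test
print(sizes_30)
print(sizes_40)
```

Under that rounding rule, a 30% test share of 143 observations gives 100 training and 43 test observations, and a 40% test share gives 85 and 58, which matches the numbers reported for the lasso and random forest assignments.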

## Random Forests – Machine Learning

Machine Learning Data Analysis

This is the second lesson of the fourth course of my Data Analysis and Interpretation Specialization by Wesleyan University through Coursera. If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development and am posing the general question of whether urbanization drives economic growth. For this assignment, the goal is to create a random forest that identifies the varying importance of my explanatory variables: Urban Population, Urban Population Growth, GDP Growth, Population Growth, Employment Rate, and Energy Use per Capita in 2007. For my response variable, I created a categorical variable from GDP per Capita 2007. I separated the data into two levels: countries with GDP per Capita 2007 lower than 10000 are coded 0 (low), and countries with GDP per Capita 2007 higher than 10000 are coded 1 (high). Just as in the last assignment, when my test sample is set at 40%, the result is 58 test samples and 85 training samples out of 143 total, with …
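The recoding of GDP per Capita into a two-level response can be sketched as below. The function name `gdp_level` and the sample values are hypothetical, and the post does not say how a value of exactly 10000 is handled, so this sketch assumes it falls into the low group.

```python
def gdp_level(gdp_per_capita, threshold=10000):
    """Return 1 (high) above the threshold, otherwise 0 (low).

    Assumption: a value exactly equal to the threshold is coded 0.
    """
    return 1 if gdp_per_capita > threshold else 0


# Hypothetical GDP per capita values, not taken from the Gapminder data.
sample_gdp = [850.0, 9999.0, 10001.0, 42000.0]
levels = [gdp_level(value) for value in sample_gdp]
print(levels)

# Class balance is worth checking before fitting a classifier.
share_high = sum(levels) / len(levels)
print(share_high)
```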

## Decision Trees – Machine Learning

Machine Learning Data Analysis

This is the start of the fourth course of my Data Analysis and Interpretation Specialization by Wesleyan University through Coursera. If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development and am posing the general question of whether urbanization drives economic growth. Now that I have started working, I do not have as much time. For this course, I decided to focus solely on Python, instead of both Python and SAS as in the past. I am not abandoning SAS, but I will probably take the time to learn SAS after this course ends. For this assignment, the goal is to create a decision tree that correctly classifies samples according to a binary, categorical response variable. For my response variable, I created a categorical variable from GDP per Capita 2007. I separated the data into two levels: countries with GDP per Capita 2007 lower than 10000 are coded 0 (low), and countries with GDP per Capita 2007 …

## Logistic Regression on Economic Development

Last lesson of Regression Modeling in Practice… If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development and am posing the general question of whether urbanization drives economic growth. Through the past two courses, Data Analysis Tools and Data Management and Visualization, I looked at the correlation between urbanization and economic development and established that there was a correlation between urban population and GDP per capita. For this last assignment in the course Regression Modeling in Practice, I am again examining GDP per Capita as the response variable. I am using the new data set I created in the last assignment from Gapminder, which, as I explained, holds a more complete set of data when I use the year 2007 instead of 2010. Because a logistic regression is performed on a categorical response variable with two levels and multiple explanatory variables, I had to bin GDP per Capita into two levels and recode them: 0 = Countries with a GDP per Capita less than …

## Employment and Urbanization

Continuing with Regression Modeling in Practice… If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development and am posing the general question of whether urbanization drives economic growth. Through the past two courses, Data Analysis Tools and Data Management and Visualization, I looked at the correlation between urbanization and economic development and established that there was a correlation between urban population and GDP per capita. For this assignment, I decided to look at another measure of economic development: employment rate. However, because data for 2010 is unavailable for some of the new variables I wanted to include, I decided to use data from the year 2007. It is the most recent year for which I have the most data across all my variables. For each of the variables, I downloaded data directly from Gapminder, extracted the relevant information for 2007, and compiled a new CSV file. I define my response variable as Employment Rate in 2007. Now that my data …

## Basic Regression on Urban Population Growth and GDP per Capita

Continuing with Regression Modeling in Practice… If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development and am posing the general question of whether urbanization drives economic growth. Through the past two courses, Data Analysis Tools and Data Management and Visualization, I established that there was a correlation between urban population and GDP per capita. For this assignment, my primary explanatory variable is Urban Population Growth rate and my response variable is GDP per capita; both figures are from 2010. This is my code in Python:

```python
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

gapminder = pandas.read_csv('Data1.csv', low_memory=False)

# Treat zeros as missing values
gapminder['GDP2010'] = gapminder['GDP2010'].replace(0, numpy.nan)
gapminder['GDPGrowth2010'] = gapminder['GDPGrowth2010'].replace(0, numpy.nan)
gapminder['UrbanPop2010'] = gapminder['UrbanPop2010'].replace(0, numpy.nan)
gapminder['UrbanPopGrowth2010'] = gapminder['UrbanPopGrowth2010'].replace(0, numpy.nan)

gapminder = gapminder[['Country', 'UrbanPop2010', 'UrbanPopGrowth2010', 'GDP2010', 'GDPGrowth2010']]
gapminder = gapminder.dropna()

PopDes = gapminder['UrbanPopGrowth2010'].describe()
print(PopDes)

# Center the explanatory variable on its mean (copy to avoid modifying a slice)
RegData = gapminder[['Country', 'UrbanPopGrowth2010', 'GDP2010']].copy()
RegData['UrbanPopGrowth2010'] = RegData['UrbanPopGrowth2010'] - RegData['UrbanPopGrowth2010'].mean()
print(RegData.describe())

UrbanReg = smf.ols(formula='GDP2010 ~ UrbanPopGrowth2010', data=RegData).fit()
print(UrbanReg.summary())

seaborn.regplot(x='UrbanPopGrowth2010', y='GDP2010', fit_reg=True, data=RegData)
# The excerpt is cut off here: plt.xlabel('Urban Population Growth …
```
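The code above subtracts the mean from the explanatory variable before fitting. As a rough illustration of why that helps, the sketch below fits a one-variable least-squares line by hand on made-up numbers (the `growth` and `gdp` lists and the `ols_fit` helper are hypothetical, not the Gapminder data or the statsmodels API). Centering leaves the slope unchanged and turns the intercept into the mean of the response, that is, the predicted GDP per capita at the average urban population growth rate.

```python
def ols_fit(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical values for illustration only.
growth = [0.5, 1.2, 2.0, 3.1, 4.4]                 # urban population growth, %
gdp = [30000.0, 22000.0, 15000.0, 6000.0, 2000.0]  # GDP per capita

mean_growth = sum(growth) / len(growth)
centered = [g - mean_growth for g in growth]

raw_intercept, raw_slope = ols_fit(growth, gdp)
centered_intercept, centered_slope = ols_fit(centered, gdp)

print(raw_intercept, raw_slope)
print(centered_intercept, centered_slope)
```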

## In Speaking of Data: Gapminder

This is the start of the third course, Regression Modeling in Practice, in the Data Analysis and Interpretation Specialization by Wesleyan University through Coursera. The first assignment is to provide a description of the data I have been working with: what the sample is, how the data was collected, and how I managed the data. If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development and am posing the general question of whether urbanization drives economic growth. My sample consists of countries, territories, and other political entities such as disputed territories, dependent territories, or semi-autonomous city-states like Hong Kong. According to Gapminder, from which my data was downloaded, this list consists of 193 UN Nations, 51 other entities, 4 French overseas territories, 10 former states, and 2 ad-hoc areas, totaling 260 (or N=260). However, because not every entity has data in the indicators I am using, the number of entities in my work is reduced to 164 (or N=164). In the case …

## The Moderating Variable

Last Lesson in Data Analysis Tools… If you have not read my previous posts, I am currently enrolled in a Data Analysis Specialization with Wesleyan University through Coursera. With data from Gapminder, I am exploring a broad and basic question: does urbanization drive economic growth? For those of you interested in reading my literature review to gain a background on this project, please visit this page. This is the last lesson in the Data Analysis Tools course. After analyzing for correlations between variables, this assignment focuses on moderating variables. A moderating variable is one that influences the strength and direction of the association between the explanatory and response variables. Last time, I established that there were correlations between the amount of urbanization, as measured by percentage of total population in cities with over 1 million people, urban population growth, and GDP per capita. Additionally, I found that there was a correlation between total populations in cities and urban population growth. I suspect that one of these two variables might be a moderating variable. I first looked at total …
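One common way to probe for moderation is to compute the correlation separately within each level of the suspected moderator and compare the results. The sketch below does that by hand on made-up numbers. The grouping labels, the values, and the `pearson_r` helper are all hypothetical, and it assumes the quantitative moderator has already been binned into two groups.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (explanatory, response) pairs, grouped by moderator level.
groups = {
    "low moderator": ([1, 2, 3, 4, 5], [2, 4, 5, 8, 10]),
    "high moderator": ([1, 2, 3, 4, 5], [9, 7, 8, 4, 3]),
}

r_by_group = {name: pearson_r(x, y) for name, (x, y) in groups.items()}
for name, r in r_by_group.items():
    print(name, round(r, 3))
```

If the correlation differs noticeably in strength or sign between the groups, as it does in this toy data, the grouping variable is a candidate moderator. A formal analysis would also check the p-value within each group.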

## Correlations! Urbanization and Economic Development in Rich and Poor Countries

Continuing with Data Analysis Tools… If you have not read my previous posts, I am currently enrolled in a Data Analysis Specialization with Wesleyan University through Coursera. With data from Gapminder, I am exploring a broad and basic question: does urbanization drive economic growth? For those of you interested in reading my literature review to gain a background on this project, please visit this page. Finally! Quantitative to quantitative variable analysis! This is the lesson I have been waiting for. With my interest in urbanization and economic development, the data I pulled from Gapminder are all quantitative. As I previously mentioned, I do not like categorizing quantitative data because I believe it introduces too much subjectivity. Unless the data is qualitative to begin with, it makes little sense to categorize data. Compared to the other types of correlation tests, Pearson’s Correlation was relatively easy to perform in both Python and SAS. I looked at the relationships between urbanization rate, as measured by both urban population growth rate and percentage of population in large cities with over 1 …