Capstone Project: Methods

For those following my blog on my Data Analysis and Interpretation Specialization by Wesleyan University through Coursera, this is the final course and the Capstone project. Unlike previous courses, I will move away from urbanization data and try to tackle one of the problems provided by the course’s industry partner.

This is my introduction.

Below is our second assignment – the data management and analysis methods.



Out of the 211 sovereignties recognized by the World Bank, eight (N = 8) were chosen for this study. Countries that have the Ensure Environmental Sustainability goal were selected: the three countries with the lowest GDP per capita (Burundi, Ethiopia, Liberia), the three countries with the highest GDP per capita (Canada, Ireland, United States), and two from around the median (Estonia, Seychelles). In addition to identifying associations between variables and the four sustainability indicators, this selection was also used to investigate how variable relationships differ between countries at different levels of economic development.

Each country, depending on available data, has between 26 and 43 indicators for analysis, with 36 years of data from 1972 to 2007. If a variable has missing data, the most recently recorded value is used. For example, for Burundi, the Achieve Universal Primary Education indicator has missing data for the years 1994 to 1999, so the value recorded in 2000 was used to fill in the missing records. If an indicator is missing more than half its data, it is not used for analysis. Handling missing data this way can over-simplify the trends and fluctuations of an indicator over the years, but the method is simple and avoids having to build a model to estimate the missing values.
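To make the two rules above concrete, here is a minimal sketch in plain Python. The function name `fill_indicator` and the sample numbers are hypothetical, not taken from my actual data. It assumes, following the Burundi example, that a gap is filled from the next recorded year; filling a trailing gap from the last earlier value is an extra assumption of mine.

```python
from typing import Optional


def fill_indicator(values: list[Optional[float]]) -> Optional[list[float]]:
    """Fill gaps in one yearly series; return None if the indicator is dropped."""
    n_missing = sum(v is None for v in values)
    # Rule 1: drop the indicator when more than half of its years are missing.
    if n_missing > len(values) / 2:
        return None

    filled = list(values)
    # Rule 2 (backward pass): copy the next recorded value into each gap,
    # as in the Burundi example (1994-1999 filled from 2000).
    next_seen = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = next_seen
        else:
            next_seen = filled[i]
    # Assumption (forward pass): a gap at the end of the series has no later
    # value, so it takes the last earlier value instead.
    last_seen = None
    for i in range(len(filled)):
        if filled[i] is None:
            filled[i] = last_seen
        else:
            last_seen = filled[i]
    return filled


# Made-up series with a gap in the middle and one at the end.
series = [50.0, None, None, 62.0, 63.0, None]
print(fill_indicator(series))
```

An indicator with three of four years missing, such as `[None, None, None, 1.0]`, would be dropped (the function returns `None`).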


The response variables in question are the Ensure Environmental Sustainability indicator (as an overall measure), Forest Area (% of total land area), Terrestrial and Marine Protected Areas (% of total territorial area), Terrestrial Protected Areas (% of total land area), Improved Sanitation Facilities (% of population with access), and Improved Water Source (% of rural and % of urban population with access). These are the indicators as defined by the Ensure Environmental Sustainability goal.

The main predictors included Agricultural Land (% of land area), Fertility Rate (births per woman), Foreign Direct Investment (% of GDP), Household Final Consumption Expenditure per capita (constant 2005 US$), Population Growth (annual %), GDP per capita (constant 2005 US$), GDP per capita Growth (annual %), Industry Value Added (% of GDP), Urban Population (% of total population), and Adjusted Savings: Net Forest Depletion (% of GNI). Due to the differences in data availability for each country, additional predictors may be included. All of these variables are quantitative.


For each predictor, scatter plots were examined for trends over the years from 1972 to 2007. The Pearson correlation test was used to assess bivariate associations between the predictors and the response variables.
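As an illustration of the bivariate step, the sketch below runs a Pearson correlation test with `scipy.stats.pearsonr` on one predictor and one response. The numbers are synthetic and generated only for the example; the variable names `urban_population` and `forest_area` simply borrow two indicator names from the lists above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
years = np.arange(1972, 2008)  # 36 annual observations, 1972 to 2007

# Synthetic data (not real World Bank values): a rising predictor and a
# response that falls as the predictor rises, plus some noise.
urban_population = np.linspace(20.0, 35.0, years.size) + rng.normal(0, 0.5, years.size)
forest_area = 40.0 - 0.6 * urban_population + rng.normal(0, 0.5, years.size)

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = stats.pearsonr(urban_population, forest_area)
print(f"r = {r:.3f}, p = {p_value:.3g}")
```

In the real analysis this would be repeated for every predictor and response pair within a country.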

With such a large number of predictor variables, lasso regression with the least angle regression algorithm was used to identify the subset of variables most strongly associated with each response variable. At each step of the selection process, the lasso shrinks the regression coefficients of less useful predictors toward zero; predictors whose coefficients are reduced to zero are excluded, which leaves the predictors most strongly associated with the response variable. Each predictor was standardized to have a mean of zero and a standard deviation of one prior to running the analysis. The data were randomly split into a training set (70% of the observations) and a test set (the remaining 30%). The model was estimated on the training set using k-fold cross validation with 10 folds and then evaluated on the test set. The regression coefficients identified the predictors used in the final model and how strongly each predictor was associated with the response variables.
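The steps in this paragraph can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not my actual analysis: the array sizes, the random seeds, and the choice of which columns drive the response are all assumptions made for the example. `LassoLarsCV` fits the lasso with the least angle regression algorithm and picks the penalty by cross validation.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_years, n_predictors = 36, 10

# Synthetic predictors; the response depends on columns 0 and 3 only.
X = rng.normal(size=(n_years, n_predictors))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n_years)

# Standardize every predictor to mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)

# Random 70% / 30% split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_std, y, test_size=0.3, random_state=42
)

# Lasso via least angle regression, with 10-fold cross validation.
model = LassoLarsCV(cv=10).fit(X_train, y_train)

# Predictors with non-zero coefficients are the ones the lasso retained.
selected = np.flatnonzero(model.coef_)
print("selected predictor columns:", selected)
print("coefficients:", model.coef_[selected])
print("test R^2:", model.score(X_test, y_test))
```

Here the scaler is applied to all observations before the split, matching the order described above; fitting it on the training set only is a common alternative.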

Multiple regression analysis was used for an independent analysis of the predictor variables that were selected by the lasso regression analysis.
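For illustration, a multiple regression on a small set of selected predictors can be sketched as an ordinary least squares fit with NumPy. The data are synthetic and the two predictors stand in for whatever the lasso retained; a statistics package would normally be used instead to also get standard errors and p-values.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years = 36

# Two hypothetical lasso-selected predictors and a synthetic response.
X_selected = rng.normal(size=(n_years, 2))
y = 1.5 + 3.0 * X_selected[:, 0] - 2.0 * X_selected[:, 1] + rng.normal(scale=0.5, size=n_years)

# Add an intercept column and solve the least squares problem.
design = np.column_stack([np.ones(n_years), X_selected])
beta, _, _, _ = np.linalg.lstsq(design, y, rcond=None)

# R^2: share of the response variance explained by the fitted model.
fitted = design @ beta
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("intercept and slopes:", beta)
print("R^2:", r_squared)
```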
