Coursera, Data Analysis, Projects, Research, Society, World Affairs
Leave a Comment

Data Analysis and Interpretation Capstone

So, this is the end. It took six months, but today I completed and was certified for the Data Analysis and Interpretation Specialization by Wesleyan University through Coursera. When I first started in October 2015, I had no idea how to write code in Python, let alone produce graphs and run statistical analysis. It has been a fun experience learning how to write code in Python and learning the different kinds of statistical methods. Ironically, I learned these after I left graduate school. One would think that these are method courses you would take in school.

For the Capstone Project, I do wish the data was more complete and over a longer period of time. It is difficult to run analysis on data that only goes back as far as 1972 and in many cases, missing records for many years in between. The results can be quite misleading, as it pointed to fertility rate as being highly correlated with environmental sustainability. However, fertility rate, in many cases is contingent on many different factors that are both quantitative and qualitative. It is difficult to untangle the relationships.

Furthermore, I long held the belief that each country is very different. I believe this project actually points that out. Every country had a different subset of correlated variables, though there were similarities between countries of similar GDP per capita.

Anyhow, I look forward to my next adventure. What follows is my Capstone Project Report and my code in Python for one of the countries in question (Ethiopia).


Predicting Variables Associated with Environmental Sustainability


Using data provided by the World Bank, through DrivenData, this study looks to identify factors associated with the Ensure Environmental Sustainability goal defined as by the United Nations as one of the United Nations Millennium Development Goals (MDGs). The four indicators that comprise this goal are forestation, protected ecosystems, access to improved sources of water and access to improved sanitation facilities. Some hypothetical explanatory variables are Gross National Income, Forest Area, CO2 Emissions, Employment, Foreign Direct Investments, Household Final Consumption Expenditure, Adult Literacy Rate, Urban Population, Investments in Energy, and Energy Use. A mix of both economic and social factors will be examined for associations with the UN-MDG indicator of environmental sustainability. After the associated variables are identified, they will be used to create a model to predict data for the years 2008 and 2012.

As a social/urban scientist interested in analyzing and planning for better urban environments, I am always looking for data and analysis that can influence the development of urban environments that limit environmental impacts and maximize livability. I hope that through the understanding of the relationships between various social and economic variables and their effects on the environment, policy makers can create better policies and make informed decisions to positively benefit development and to improve the environmental conditions in countries around the world.

With better predictive models and better understanding of the relationships between the society, the economy, and the environment, organizations such as the World Bank and the United Nations can then create more specific economic or social solutions, for example investments in energy, to alleviate poverty and improve environmental conditions around the world.



Out of the 211 World Bank recognized sovereignties, 8 (N=8) were chosen for this study. Countries that has the Ensure Environmental Sustainability goal were selection: three countries with the lowest GDP per capita (Burundi, Ethiopia, Liberia), three countries with the highest GDP per capita (Canada, Ireland, United States), and two from the median (Estonia, Seychelles). In addition to identifying associations between variables and the Ensure Environmental Sustainability indicators, this selection was used to also investigate how variable relationships differ in countries with varying degrees of economic development.

In this project, though the World Bank has compiled more than 450 possible indicators, only between 26 and 43 indicators were chosen for each country, with data from 1972 to 2007. If a variable has missing data, the most recently recorded data is used. For example, for Burundi, the Achieve Universal Primary Education indicator has missing data for the years 1994 to 1999. The most recently recorded data from 2000 was used to fill in the missing records. In the case that an indicator is missing more than half its data, the indicator will not be used for analysis. This management of missing data can result in over-simplification of the trends and fluctuations of the indicator over the years, but this method is simple and effective without having to create a model to extrapolate for missing data.


The response variable in question is the Ensure Environmental Sustainability indicator, which is an overall measure consisting of Forest Area (% of total land area), Terrestrial and Marine Protect Areas (% of total territorial area), Terrestrial Protected Areas (% of total land area), Improved Sanitation Facilities (% of population with access), and Improved Water Source (% of rural and % of urban population with access). These are the indicators as defined by the Ensure Environmental Sustainability Goal.

The main predictors included Agricultural Land (% of land area), Fertility Rate (births per woman), Foreign Direct Investment (% of GDP), Household Final Consumption Expenditure per capita (constant 2005 US$), Population Growth (annual %), GDP per capita (constant 2005 US$), GDP per capita Growth (annual %), Industry Value Added (% of GDP), Urban Population (% of total population), and Adjust Savings: Net Forest Depletion (% of GNI). Due to the differences in data availability for each country, additional predictors may be included. All of these variables are quantitative.

The following is the complete list of possible predicators:

Adjusted net national income per capita (constant 2005 US$)
Adjusted savings: carbon dioxide damage (% of GNI)
Adjusted savings: consumption of fixed capital (% of GNI)
Adjusted savings: energy depletion (% of GNI)
Adjusted savings: natural resources depletion (% of GNI)
Adjusted savings: net forest depletion (% of GNI)
Adjusted savings: particulate emission damage (% of GNI)
Agricultural land (% of land area)
Alternative and nuclear energy (% of total energy use)
Birth rate, crude (per 1,000 people)
CO2 emissions (metric tons per capita)
Cereal production (metric tons)
Cereal yield (kg per hectare)
Electric power consumption (kWh per capita)
Electricity production (kWh)
Electricity production from renewable sources (kWh)
Energy use (kg of oil equivalent per capita)
Fertility rate, total (births per woman)
Foreign direct investment, net inflows (% of GDP)
Fossil fuel energy consumption (% of total)
GDP per capita (constant 2005 US$)
GDP per capita growth (annual %)
Household final consumption expenditure per capita (constant 2005 US$)
Industry, value added (% of GDP)
Industry, value added (annual % growth)
Organic water pollutant (BOD) emissions (kg per day)
Marine protected areas (% of territorial waters)
Population density (people per sq. km of land area)
Population growth (annual %)
Research and development expenditure (% of GDP)
Researchers in R&D (per million people)
Rural population (% of total population)
Rural population growth (annual %)
Terrestrial and marine protected areas (% of total territorial area)
Terrestrial protected areas (% of total land area)
Urban population (% of total)
Urban population growth (annual %)


For each country, the Ensure Environmental Index was plotted to examine the trends over the years between 1972 and 2007. The distributions of the Ensure Environmental Sustainability Index were evaluated through descriptive statistics.

With such a large number of predictor variables, lasso regression with least angle regression algorithm was used to identify the subset of variables most correlated with each response variable. This analysis allows for the exclusion of variables that have regression variables reduced to zero at each step of the selection process and allows the identification of the predictors most strongly associated with the response variable. Each of the predictors were standardized to have a mean of zero (AVG = 0) and a standard deviation of one (SD = 1) prior to running the analysis. The lasso regression model was tested on a training set of a random sample of 70% of the total data and a test data set of the remaining 30%. The k-fold cross validation, specifying 10 folds, was performed. The regression coefficients identified the predictors used in the final model and how strongly each predictor associated with the response variables.

For each identified predictor, their scatter plots were examined for trends over the years from 1972 to 2007. Plots of both the predictor and response variable were used to visualize their relationships and lines of fit. Bivariate correlation analysis, using the Pearson correlation test, was conducted on each predictor variable.


Only the results for Burundi, Ethiopia, and Liberia will be reported, as the other countries demonstrated no change or very slight change in the ensure environmental sustainability index.

Descriptive Statistics:

The following table shows the descriptive statistics for the Ensure Environmental Sustainability Index for each of the selected countries, starting from the lowest GDP per capita group to the highest.

The standard deviations are much greater for the lowest GDP per capita group compared to the others. In three countries, Seychelles, Canada, and Ireland, no change in the value of the index was observed. It would appear that countries that reach a certain GDP per capita will have achieved a mean Ensure Environmental Sustainability Index value above 0.9 and demonstrate little change.


The following graphs are the Ensure Sustainability Index for Burundi, Ethiopia, and Liberia:







Bivariate and Lasso Regression Analysis:

Lasso Regression was performed on each of the country’s ensure environmental sustainability index and their predictors. As Seychelles, Canada, and Ireland had index values that did not change, there were no observed correlations.

Each country demonstrated a different set of predictors that correlated with the ensure environmental sustainability index. However, in the low GDP per capita group, all three countries showed very strong correlations between fertility rate and the ensure environmental sustainability index (as demonstrated by the following graphs). The fertility rate predictor all had correlation coefficients on twice the order of magnitude compared to the other predictors.


The follow table shows the correlation coefficients for the fertility rate predictor along with the mean squared errors for both the training and test data sets.


In all three countries, as the fertility rate lowered, the ensure environmental sustainability index value rose. This main predictor accounts for above 90% of the variance observed in the ensure environmental sustainability index. However, the mean squared errors differed between the test and training data sets. This suggests that the predicative accuracy of the model lowered when applied to the test data set.


Overview and Implications:

Lasso regression analysis was used to identify predictors for each country’s Ensure Environmental Sustainability Index. By choosing eight countries from different GDP per capita levels, sub-group differences became apparent. For the countries around or above the global median GDP per capita, the Ensure Environmental Index values actually showed little to no change over the years between 1972 and 2007. This meant that there were no demonstrated correlations between the various possible predictors and the index for these countries. Seychelles, Canada, and Ireland all had a standard deviation of zero in their index values. Meanwhile, United States had slight change, with a standard deviation of 0.00221 and the predictor identified was particulate emissions damage, calculated as a percentage of the gross national income.

Only Burundi, Ethiopia, and Liberia (the Low GDP per Capita group) had signification results from the lasso regression. For these countries, fertility rate was the strongest predictor, accounting for over 90% of the variance in the Ensure Environmental Sustainability Index. The trend demonstrated that as fertility rate declined, the Ensure Environmental Sustainability Index rose. These suggests that there are significant differences between countries of varying degrees of economic development and in low GDP per capita countries, fertility rate demonstrates strong correlations with the Ensure Environmental Sustainability Index.

Despite these results for countries with low GDP per capita, there are serious limitations to both the data and the model. With so many possible predicators (each country has more than 450 World Bank created indicators, of which only around 30 each are selected for this project), it would be inappropriate to implement policy that focuses on fertility rate in hopes of creating more environmental sustainability.


There are several serious limitations that must be accounted for in the interpretation of the results of this project. First, data accuracy is limited. Many indicators have missing data between the years of 1972 and 2007. The collection of the data depends on the government agency and the quality and accuracy of the data may not be comparable between different countries. Even the Ensure Environmental Sustainability Index has limitations, with most countries having values for only 18 out of the 35 possible years. Second, a 35 year time frame is quite short for data analysis. This meant that each country only has a maximum of 35 data points for each indicator and in many cases much less due to lack of data. Variances and outliers have much greater effect on the data analysis. Third, fertility rate is associated with a number of other variables and may not be causative for environmental sustainability. It is known that wealthier countries, with associated higher education and other social conditions, tend to have lower fertility rates. In this case, it would be an oversimplification to focus on fertility rate as a predictor, despite the results. Quantitative data analysis without a qualitative evaluation of a country’s condition is too limited. Lastly, the number of countries included in this project is very small (N=8). There are 211 recognized sovereignties in this data set. Each GDP per capita group can be expanded with more samples.

Future Directions:

To gain a more complete picture for the various relationships between the World Bank indicators, more indicators and countries should be included in the future to develop a more solid algorithm. Based on the results of this current project, there are likely significant differences between countries of different wealth groups, as measured by GDP per capita. The subset of predicators that most strongly correlate with the Ensure Environmental Sustainability Index is likely to be very different depending on the GDP per capita levels. Furthermore, the results demonstrate that despite a common, strongly correlated predictor in the Low GDP per capita group, each country is unique. With that in mind, future efforts to develop a better understanding of environmental sustainability will require more longitudinal data, more accurate and higher quality data, and qualitative data to generate a more complete picture of each country.


Python Code (Ethiopia):

import pandas
import numpy as np
import scipy
import os

#Graphing and Regression
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import statsmodels.api as sm
#Lasso Regression
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV

os.chdir(‘C:\\Users\William Hsu\Desktop\\Capstone’)
Ethiopia = pandas.read_csv(‘Ethiopia_Clean.csv’, low_memory=False)

print (Ethiopia[‘AA’].describe())

Ethiopia = Ethiopia.fillna(method=’ffill’)

Cap = Ethiopia-Ethiopia.mean()

#Lasso Regression
CapReg = Cap
del CapReg[‘Year’]
RegData = CapReg.copy()
from sklearn import preprocessing

predictors = RegData[[‘A’, ‘B’, ‘C’, ‘D’, ‘E’, ‘F’, ‘G’, ‘H’, ‘I’, ‘J’, ‘K’, ‘L’, ‘M’, ‘N’, ‘O’, ‘P’, ‘Q’, ‘R’, ‘S’, ‘T’, ‘U’, ‘W’, ‘X’, ‘Y’, ‘Z’, ‘AB’, ‘AC’, ‘AD’, ‘AE’, ‘AF’, ‘AG’, ‘AH’,
‘AI’, ‘AL’, ‘AM’]]

print (predictors.describe())

targets = Cap[[‘AA’]]

#Split into Training and Test Data Sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.3, random_state=123)

print (pred_train.shape)
print (pred_test.shape)
print (tar_train.shape)
print (tar_test.shape)

#Specify Regression Model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

#Print Regression Coefficients
dict(zip(predictors.columns, model.coef_))
print(predictors.columns, model.coef_)

#Plot Coefficient Progression
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle=’–‘, color=’k’, label=’alpha cv’)
plt.ylabel(‘Regression Coefficients’)
plt.title(‘Regression Coefficients Pregression for Lasso Paths’)

#Plot MSE for Each Fold
plt.plot(m_log_alphascv, model.cv_mse_path_, ‘:’)
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=1), ‘k’, label=’Average across the folds’, linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle=’–‘, color=’k’, label=’alpha cv’)
plt.ylabel(‘Mean Squared Error’)
plt.title(‘Mean Squared Error on Each Fold’)

#MSE from Test and Training Data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print(‘Training Data MSE’)
print(‘Test Data MSE’)

#R-Squared for Test and Training Data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print(‘Training Data RSquared’)
print(‘Test Data RSquared’)
#Multiple Regression Analysis on Correlated Predicators
MRegTest = smf.ols(formula=’AA ~ A + D + L + N + T + AB + AE’, data=Cap).fit()
print (MRegTest.summary())

MRegTest2 = smf.ols(formula=’AE ~ A + D + L + T + AE’, data=Cap).fit()
print (MRegTest2.summary())

MRegTest3 = smf.ols(formula=’AA ~ A + D + L + N + T + AB + AE’, data=Cap).fit()
print (MRegTest.summary())

#Residual Plots
Resid = plt.figure(figsize=(8,5))
Resid = sm.qqplot(MRegTest3.resid, line=’r’)

stres = plt.figure(figsize=(8,5))
stres = pandas.DataFrame(MRegTest3.resid_pearson)
fig2 = plt.plot(stres, ‘o’, ls=’None’)
l= plt.axhline(y=0, color=’r’)
plt.ylabel(‘Standardized Residual’)
plt.xlabel(‘Observation Number’)
print (fig2)

MDG = plt.figure(figsize=(8,5))
MDG =, size=8)
print (MDG)

#Basic Correlations, Pearson R Value
print(scipy.stats.pearsonr(Cap[‘A’], Cap[‘AA’]))
print(scipy.stats.pearsonr(Cap[‘D’], Cap[‘AA’]))
print(scipy.stats.pearsonr(Cap[‘L’], Cap[‘AA’]))
print(scipy.stats.pearsonr(Cap[‘N’], Cap[‘AA’]))
print(scipy.stats.pearsonr(Cap[‘T’], Cap[‘AA’]))
print(scipy.stats.pearsonr(Cap[‘AB’], Cap[‘AA’]))
print(scipy.stats.pearsonr(Cap[‘AE’], Cap[‘AA’]))
#Scatter Plots of Identified Predicators
scatA = plt.figure(figsize=(8,5))
scatA = seaborn.regplot(x=’Year’, y=’A’, fit_reg=True, data=Ethiopia)
plt.ylabel(‘Universal Primary Education (% of Population)’)
plt.title(‘University Primary Education Since 1972′)

scatD = plt.figure(figsize=(8,5))
scatD = seaborn.regplot(x=’Year’, y=’D’, fit_reg=True, data=Ethiopia)
plt.ylabel(‘Carbon Dioxide Damage (%GNI)’)
plt.title(‘Carbon Dioxide Since 1972′)

scatL = plt.figure(figsize=(8,5))
scatL = seaborn.regplot(x=’Year’, y=’L’, fit_reg=True, data=Ethiopia)
plt.ylabel(‘Birth Rate (per 1000 People)’)
plt.title(‘Birth Rate Since 1972′)

scatN = plt.figure(figsize=(8,5))
scatN = seaborn.regplot(x=’Year’, y=’N’, fit_reg=True, data=Ethiopia)
plt.ylabel(‘Cereal Production (Metric Tons)’)
plt.title(‘Cereal Production Since 1972′)

scatT = plt.figure(figsize=(8,5))
scatT = seaborn.regplot(x=’Year’, y=’T’, fit_reg=True, data=Ethiopia)
plt.ylabel(‘Fertility Rate (Births per Woman)’)
plt.title(‘Fertility Rate Since 1972 (Ethiopia)’)

scatAB = plt.figure(figsize=(8,5))
scatAB = seaborn.regplot(x=’Year’, y=’AB’, fit_reg=True, data=Ethiopia)
plt.ylabel(‘Industry Value Added (%GNI)’)
plt.title(‘Industry Value Added Since 1972′)

scatAE = plt.figure(figsize=(8,5))
scatAE = seaborn.regplot(x=’Year’, y=’AE’, fit_reg=True, data=Ethiopia)
plt.ylabel(‘Population Density (per Sq.KM.)’)
plt.title(‘Population Density Since 1972′)

scatAA = plt.figure(figsize=(8,5))
scatAA = seaborn.regplot(x=’Year’, y=’AA’, fit_reg=False, data=Ethiopia)
plt.ylabel(‘Ensure Environmental Sustainability’)
plt.title(‘Ensure Environmental Sustainability Index Since 1972′)

FerEn = plt.figure(figsize=(8,5))
FerEn = seaborn.regplot(x=’AA’, y=’T’, fit_reg=True, data=Ethiopia)
plt.xlabel(‘Ensure Environmental Sustainability’)
plt.ylabel(‘Fertility Rate (Births per Woman)’)
plt.title(‘Fertility Rate Since 1972 (Ethiopia)’)


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s