Latest Posts

Analyzing Density Bonus Developments in the City of Los Angeles

On February 22, 2016, I started the GIS Specialization Course with UC Davis through Coursera. For those of you who have paid attention, I have started the final course of the specialization: Geospatial Analysis Project. As with other Coursera specializations, this is a Capstone project that is the culmination of the previous courses.

For this project, I have to propose, design, analyze, and present a geospatial analysis project from start to finish. This week requires the creation of my project proposal, which is as follows (if any of you have suggestions on data sources and/or analysis, please feel free to comment):

What is Density Bonus?
Density Bonus is a program through which a developer can apply for a project with a unit density greater than that allowed by the current land use zoning, as calculated from unit floor area and floor area ratio (FAR). In exchange for the higher density, the developer must set aside a certain number of units as affordable by restricting rent levels or sale prices to targeted income levels based on the Area Median Income (AMI). To facilitate these projects and lower their costs, the developer is granted between one and three development reliefs, such as a reduced parking requirement or an increase in building height, based on the percentage of affordable units set aside.

Background Information
In January 2005, SB 1818, which amended the State of California’s Density Bonus program, became effective. This change in policy mandated that local jurisdictions bring their ordinances into conformance with the State requirements. Until local jurisdictions update their laws, the State policy applies. The City of Los Angeles enacted its version of the Density Bonus program in February 2008. In other words, Density Bonus has been in effect in Los Angeles from 2005 onwards. Around 620 projects (based on data up to mid-September) have been recorded and entitled by the Department of City Planning. Furthermore, the number of applications has risen steadily from year to year, which illustrates the growing popularity of this program.

As the U.S. economy continues to recover from the 2008 recession, Los Angeles has been facing a growing housing crisis and greater development pressure. Density Bonus has become a way to fast-track housing development in a city where the zoning is often very restrictive. With the City’s own Density Bonus program nearing a decade of existence, it becomes imperative to understand if these new projects are built where needed and if these projects are in fact contributing to greater housing equality.

This project proposes to 1) locate where Density Bonus projects are being entitled and 2) analyze the demographic and physical conditions behind those applications.

Research Question/Hypothesis and Expected Results
For this project, the research question is: where are density bonus projects being proposed, and what are the underlying demographic and physical conditions of their locations?

Despite the affordability requirements, the expected result is that density bonus projects will likely be located in higher than average income areas with better public amenities, such as transit and parks, as they provide for greater return on investments.

Potential Data Sources
This project will use publicly available data from the City of Los Angeles Department of City Planning (LADCP), the Los Angeles County Metropolitan Transportation Authority (Metro), the LA County GIS Data Portal, and the U.S. Census Bureau.

Entitlement data is typically made available through public requests to the Department of City Planning, while GIS data on city boundaries, transit stops, and parks are freely available through the GIS portals of the LADCP, Metro, and LA County.

Demographic data, at the census tract and census block levels, is freely available through the U.S. Census Bureau.

Overview of Data and Sources:

Entitlements – LADCP
Transit – Metro
Parks – LA County
Demographics – U.S. Census Bureau

Method
The planned method of analysis will require several steps:

  • Intersect demographic data at the census block or census tract level with the Los Angeles city boundaries. This will keep only those census blocks or tracts that fall within the City.
  • Create buffers around transit stops and parks based on pedestrian sheds, defined as a five-minute walking distance or a quarter-mile.
  • Use overlay analysis with buffers and entitlement locations to evaluate proximity of projects to amenities.
  • Spatial join the entitlements (points) with the demographic data (polygons) to evaluate income levels.
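
The buffer and spatial join steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the workflow the project will actually use: coordinates are assumed to be planar and measured in feet, and the function names, tract polygons, incomes, and project locations are all hypothetical. A GIS package would normally handle projections and geometry instead.

```python
import math

QUARTER_MILE_FT = 1320.0  # pedestrian shed radius named in the method


def within_buffer(point, sites, radius=QUARTER_MILE_FT):
    """True if point lies within radius of any site (planar coordinates in feet)."""
    return any(math.dist(point, site) <= radius for site in sites)


def point_in_polygon(point, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def join_projects(projects, tracts, transit_stops, parks):
    """Attach tract income and amenity proximity flags to each project point."""
    rows = []
    for name, pt in projects.items():
        tract = next((t for t in tracts if point_in_polygon(pt, t["polygon"])), None)
        rows.append({
            "project": name,
            "median_income": tract["median_income"] if tract else None,
            "near_transit": within_buffer(pt, transit_stops),
            "near_park": within_buffer(pt, parks),
        })
    return rows


# Hypothetical data: two square tracts, two projects, one stop, one park.
tracts = [
    {"median_income": 40000, "polygon": [(0, 0), (5000, 0), (5000, 5000), (0, 5000)]},
    {"median_income": 90000, "polygon": [(5000, 0), (10000, 0), (10000, 5000), (5000, 5000)]},
]
projects = {"P1": (1000, 1000), "P2": (7000, 2000)}
transit_stops = [(1500, 1000)]
parks = [(7000, 3000)]

for row in join_projects(projects, tracts, transit_stops, parks):
    print(row)
```

Running the join once per entitlement year would give the year-by-year comparison described in the method.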

With years available for the entitlement data, this analysis can be performed for every year, starting from 2005, to evaluate changes in development project trends. One of the challenges of this project will be joining the points data and the polygons data to create products with meaningful indicators. In many ways, this project is the reverse of a site suitability study: this is an attempt to deduce the considerations behind density bonus projects.

The Bus (in America)

It’s always darker in here.
There are days without light.
Even those loud colors
are subdued on the upholstery.

Then there are the bangs,
shocks and impacts,
direct hits of the road
rattling up backs and spines.

Rain dampens the floor
with sun baked crumbs,
gums and who knows what
left behind, left forgotten.

The morning swell through the doors
of untold routines and responsibilities.
These weary eyes and ears
time for signs to disembark.

Yes, all is trapped,
on routes dictated by stops.
Outside the window,
single passengers throttle by.

It struggles to navigate
the sea of more nimble cars.
It struggles to maintain
a timely pace.

Waiting could mean five minutes
or twenty, with a near miss.
Sometimes, a short sprint is required.
It doesn’t wait.

A suit and tie is rare
among sweaters and hoodies,
just as an unwashed shirt
always lingers in the corners.

It’s a decisive non-decision,
collectively by those whose
only way to get somewhere,
is trapped together with some bodies.

– Fu Lien Hsu
Oct 27, 2016

Unexpectedly

Goes by hand
Hand that used to be sand.

It goes to a dozen numbers
Around.

Watching it
Slowly.

Turning away
Quickly.

Yet
It brings everything.

The one thing
that matters.

To be suddenly caught
Standstill.

Like the wind it knocks
Over.

Or, pull the metaphorical rug
Under.

A Short About Policy Making

Have you ever wondered about the city you live in: its history, its planning, its development?

The Guardian has an incredible 50-part series on the history of urbanization from around the world. The more you read about cities, the more they become a metaphor for life – patience, plans, foundations, and changes. Any sort of urban development can take years or decades. As the saying goes, “Rome wasn’t built in a day”. Furthermore, even the best-laid plans can be easily swept aside by unforeseen circumstances or self-created consequences. Yet, without plans and goals, a city will cease to exist.

Therein lies the paradox of urban planning (and of life) – each action results in an infinite possibility of reactions. You want to capture the current circumstances and anticipate future change, but it is always an impossibility. You create that which you hope to contain, and yet what you hope to contain is based on projections, assumptions, and visions that can easily fall apart in an instant…

The building of cities always serves as an expression of the political will of those who rule or of the collective actions of the locals. As such, planning as a field is highly political and rife with social visions. Each decision that is made typically serves a purpose, whether it is symbolic, idealistic, or practical. Often, the decisions made are not for the benefit of the public or the future.

One poignantly felt example is the land use designations and zoning here in Los Angeles. Anti-development forces are strong here; the land use and zoning rules are among the most restrictive in the nation. This is the result of forces that limit policy changes to create more housing and to make housing more affordable – the paradox of a liberal conservative population. In one of the bastions of liberal, progressive thought in the Western United States, it is surprising how people support policies until those policies affect development in their backyard. Furthermore, it is amazing how people disregard data and trends, and how people disregard affordability. The impact of neighborhood councils in limiting policy is huge.

An analogy a colleague of mine used was that of the relationship between a doctor and a patient. The locals are the patient and planners are the doctor. We inquire about the issues and problems while trying to come up with solutions. Yet, in a normal patient-doctor relationship, the patient only has limited input in the doctor’s prescription. Meanwhile, in Los Angeles, the local community has outsized power in decision-making, sometimes overruling the expert advice of planners. This results in policies that are toothless, aimless, and limited. Furthermore, politicians wishing to stay in their elected positions must cater to their constituents, despite the regular outcry that they bend to the will of developers.

In a true democracy, public forums and public knowledge are essential. Relating to my last post, the problem persists because we do not live in a true democracy. Most people do not have the education, critical thinking skills, or time to properly make decisions. This results in inefficient policy making and ineffective policies.

Perhaps this is an unsolvable problem, though I have other thoughts on this subject…for now let’s conclude with a question:

How can we move closer to a true democracy with a properly educated and critical thinking public?

To Remember To Forget

A face without name
As I walk on a path in the hills
Counting the rocks
Under the beating sun

I cannot remember
I remember only to forget
That face
What’s the name?

There are memories
Faint visions but the face
The face is always sharp
In focus

Each step I take
I chew on words
Ruminate on these images
Fading

A figure waves
In the distance
As I look up from the rocks
And stop counting my steps

I was so close
But I already forget
What was it that I tried
So hard to remember?

We walk towards the sunset
Down the hill
Into the forest
And wait for the stars.

– Fu Lien Hsu
June 28, 2016

Half-Year Review – All that Work (Say It, Do It)

About four months ago, I wrote about goals to achieve by the time my birthday rolled around. Well, my birthday came and went on the 8th of June and here I am today, looking back and looking forward. Before we start, big shout out to my friends with their surprise cakes.

2016-06-19 21.50.11.jpg

First, let’s review my goals and how much I completed:

  1. Learn Spanish – I am at 50% fluency on Duolingo!
  2. Learn to write Chinese – I am halfway there. Still in progress
  3. Complete Data Science Specialization with Coursera – Completed!
  4. Complete Half of the GIS Specialization – Yup, got them certificates!
  5. Complete a site suitability study – Done, I actually completed two but only one is public on LoopNet! Check under attachments.
  6. Write and submit poetry to Poetry Foundation – Done, but haven’t heard back yet.
  7. Work out regularly – Almost every day, and I run on Tuesdays and Thursdays. Also started bouldering and biking from Venice Beach to Santa Monica.
  8. Practice violin every day – Doing that as well, though I have been on a break the last couple of weeks.
  9. Start volunteering – I have been volunteering at the Wildlife Waystation. Back to them biology roots! I am also involved with Union Station Homeless Services’s Young Leaders Society!
  10. Talk to family more – Working on it, but definitely talking to them more. Even my baby sister. Look at what my baby sister said:

    2016-06-14 19.53.45

    My mom said, “She was very sad the other day, and suddenly asked me: I miss brother, when can I see him?” Then my mom said “So you are very important to her!”

    I must be doing something right.

  11. Play more chess – This is hard because it requires another person…I did play for a while with a couple buddies on FB but that got stale after a while.
  12. Find a new job – Yup! Started a new job on May 31st.

10/12. Not bad at all. Not bad at all. One goal for every month. I’m definitely at October, hahaha.

To be honest, I was very happy to finally speak more with my little sister. I missed out on so much of her growing up. We had a really good, hour-long chat on the phone the other day and she updated me on a lot, including her own personal thoughts. She is growing up so fast. The best part was, she melted my heart by not hesitating to say “I miss you” before she hung up. This coming from a teenager. Imagine that…

Starting a new job was definitely the highlight of the past half year. My new job has been fulfilling and exciting so far. I am learning a whole lot and I am doing what I went to school for – urban planning. Cities are what I love and I am so glad to be able to work on my passion. My hope is that this is the first step towards a long career in the field of urban development.

There are several things on that list I am looking to continue for the rest of the year: working out, playing the violin, volunteering, talking to family, and learning Spanish and Chinese writing.

I have yet to come up with goals for the rest of the year, but I have two very exciting projects currently in the works. Please look forward to them. They are personal passion projects that I hope will bring joy and happiness to my family and friends.

Lastly, again and again I am reminded that it is important to treat people with kindness and be genuine. I need to try harder to put myself in the other person’s shoes, be kinder, and more patient.

Love, peace, Will out.

2016-06-08 19.29.35-1

Headwinds of Life

They come, resistance.
Pressure shifts, moving from high to low.

Sometimes, a breeze. Other times, a tempest.
How did the pressure build?

Have you experienced the winds in a storm?
It blows you back.

If you try to fly a kite, the string may snap.
Even trees bow or crack.

So do you hide?

You can turn around and make them tailwinds.
If you are ok with moving in a different direction? For now.

Or just wait. For how long?

Either way, you will arrive.
They say, “All roads lead to Rome.”

Words about Life

Are you nervous?
Yes.

Are you scared?
Yes.

So why?
I don’t know. Why what?

Why do you keep going?
There is hope. There is change.

Do things really change?
Always. In this second, you already are not who you were.

So that’s hope?
No. Hope is believing in and building for that change.

If you believe, then why are you nervous and scared?
There is the unknown. We all fear the unknown.

What is unknown?
You never know what the future holds.

Then what do you do?
You hope.

A Lot

I’ve been reading a lot            “People change”
I’ve been thinking a lot           “only time will tell”
I’ve been writing a lot             “who am I?”
I’ve been…

Laughing all about the same
But really silent on the name

I am drawing blanks to describe
The only things that come to mind
As I soak in the Californian sun
On the beaches where we used to run

I still got that sand in my car
From days I no longer remember
The past is the past
So do I really want to talk?

I’ve been reading a lot            “People change”
I’ve been thinking a lot           “only time will tell”
I’ve been writing a lot             “what I have done”
I’ve been…

Working night and day
Trying to be a better man

The writing is not on the wall
When I still got time to have it all
My patience grows from a seed
Slowly becoming a grand tree

I go forward with a plan
Working pieces like new bricks
As I build a town to call my own
And one day I’ll give you a tour.

Data Analysis and Interpretation Capstone

So, this is the end. It took six months, but today I completed and was certified for the Data Analysis and Interpretation Specialization by Wesleyan University through Coursera. When I first started in October 2015, I had no idea how to write code in Python, let alone produce graphs and run statistical analysis. It has been a fun experience learning how to write code in Python and learning the different kinds of statistical methods. Ironically, I learned these after I left graduate school. One would think that these are method courses you would take in school.

For the Capstone Project, I do wish the data was more complete and covered a longer period of time. It is difficult to run analysis on data that only goes back as far as 1972 and that, in many cases, is missing records for many of the years in between. The results can be quite misleading, as they pointed to fertility rate as being highly correlated with environmental sustainability. However, fertility rate is in many cases contingent on many different factors that are both quantitative and qualitative. It is difficult to untangle the relationships.

Furthermore, I long held the belief that each country is very different. I believe this project actually points that out. Every country had a different subset of correlated variables, though there were similarities between countries of similar GDP per capita.

Anyhow, I look forward to my next adventure. What follows is my Capstone Project Report and my code in Python for one of the countries in question (Ethiopia).

 

Predicting Variables Associated with Environmental Sustainability

Introduction

Using data provided by the World Bank, through DrivenData, this study looks to identify factors associated with the Ensure Environmental Sustainability goal, defined by the United Nations as one of its Millennium Development Goals (MDGs). The four indicators that comprise this goal are forestation, protected ecosystems, access to improved sources of water, and access to improved sanitation facilities. Some hypothetical explanatory variables are Gross National Income, Forest Area, CO2 Emissions, Employment, Foreign Direct Investments, Household Final Consumption Expenditure, Adult Literacy Rate, Urban Population, Investments in Energy, and Energy Use. A mix of both economic and social factors will be examined for associations with the UN-MDG indicator of environmental sustainability. After the associated variables are identified, they will be used to create a model to predict data for the years 2008 and 2012.

As a social/urban scientist interested in analyzing and planning for better urban environments, I am always looking for data and analysis that can influence the development of urban environments that limit environmental impacts and maximize livability. I hope that through the understanding of the relationships between various social and economic variables and their effects on the environment, policy makers can create better policies and make informed decisions to positively benefit development and to improve the environmental conditions in countries around the world.

With better predictive models and better understanding of the relationships between the society, the economy, and the environment, organizations such as the World Bank and the United Nations can then create more specific economic or social solutions, for example investments in energy, to alleviate poverty and improve environmental conditions around the world.

Methods

Sample:

Out of the 211 World Bank recognized sovereignties, 8 (N=8) were chosen for this study. Countries that have the Ensure Environmental Sustainability goal were selected: three countries with the lowest GDP per capita (Burundi, Ethiopia, Liberia), three countries with the highest GDP per capita (Canada, Ireland, United States), and two from the median (Estonia, Seychelles). In addition to identifying associations between variables and the Ensure Environmental Sustainability indicators, this selection was also used to investigate how variable relationships differ in countries with varying degrees of economic development.

In this project, though the World Bank has compiled more than 450 possible indicators, only between 26 and 43 indicators were chosen for each country, with data from 1972 to 2007. If a variable has missing data, the most recently recorded data is used. For example, for Burundi, the Achieve Universal Primary Education indicator has missing data for the years 1994 to 1999. The most recently recorded data from 2000 was used to fill in the missing records. In the case that an indicator is missing more than half its data, the indicator will not be used for analysis. This management of missing data can result in over-simplification of the trends and fluctuations of the indicator over the years, but this method is simple and effective without having to create a model to extrapolate for missing data.
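
The missing-data rule above can be sketched as follows. This is a minimal illustration, assuming that "most recently recorded" means carrying the last observed value forward, which is what the `fillna(method='ffill')` call in the Python listing at the end of this post does. The Burundi example reads as filling from a later year, so the fill direction here is an assumption. The function names and sample values are hypothetical.

```python
def forward_fill(values):
    """Replace each None with the last recorded value before it.

    Gaps at the very start have nothing to carry forward and stay None.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def prepare_indicator(values):
    """Drop an indicator missing more than half its records, else fill the gaps."""
    missing = sum(v is None for v in values)
    if missing > len(values) / 2:
        return None  # too sparse to analyse
    return forward_fill(values)


# Hypothetical yearly series with gaps.
print(prepare_indicator([41.0, None, None, 44.5, None, 47.0]))
print(prepare_indicator([None, None, None, 12.0]))
```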

Measures:

The response variable in question is the Ensure Environmental Sustainability indicator, which is an overall measure consisting of Forest Area (% of total land area), Terrestrial and Marine Protected Areas (% of total territorial area), Terrestrial Protected Areas (% of total land area), Improved Sanitation Facilities (% of population with access), and Improved Water Source (% of rural and % of urban population with access). These are the indicators as defined by the Ensure Environmental Sustainability Goal.

The main predictors included Agricultural Land (% of land area), Fertility Rate (births per woman), Foreign Direct Investment (% of GDP), Household Final Consumption Expenditure per capita (constant 2005 US$), Population Growth (annual %), GDP per capita (constant 2005 US$), GDP per capita Growth (annual %), Industry Value Added (% of GDP), Urban Population (% of total population), and Adjusted Savings: Net Forest Depletion (% of GNI). Due to the differences in data availability for each country, additional predictors may be included. All of these variables are quantitative.

The following is the complete list of possible predictors:

Adjusted net national income per capita (constant 2005 US$)
Adjusted savings: carbon dioxide damage (% of GNI)
Adjusted savings: consumption of fixed capital (% of GNI)
Adjusted savings: energy depletion (% of GNI)
Adjusted savings: natural resources depletion (% of GNI)
Adjusted savings: net forest depletion (% of GNI)
Adjusted savings: particulate emission damage (% of GNI)
Agricultural land (% of land area)
Alternative and nuclear energy (% of total energy use)
Birth rate, crude (per 1,000 people)
CO2 emissions (metric tons per capita)
Cereal production (metric tons)
Cereal yield (kg per hectare)
Electric power consumption (kWh per capita)
Electricity production (kWh)
Electricity production from renewable sources (kWh)
Energy use (kg of oil equivalent per capita)
Fertility rate, total (births per woman)
Foreign direct investment, net inflows (% of GDP)
Fossil fuel energy consumption (% of total)
GDP per capita (constant 2005 US$)
GDP per capita growth (annual %)
Household final consumption expenditure per capita (constant 2005 US$)
Industry, value added (% of GDP)
Industry, value added (annual % growth)
Organic water pollutant (BOD) emissions (kg per day)
Marine protected areas (% of territorial waters)
Population density (people per sq. km of land area)
Population growth (annual %)
Research and development expenditure (% of GDP)
Researchers in R&D (per million people)
Rural population (% of total population)
Rural population growth (annual %)
Terrestrial and marine protected areas (% of total territorial area)
Terrestrial protected areas (% of total land area)
Urban population (% of total)
Urban population growth (annual %)

Analyses:

For each country, the Ensure Environmental Sustainability Index was plotted to examine the trends over the years between 1972 and 2007. The distributions of the Ensure Environmental Sustainability Index were evaluated through descriptive statistics.

With such a large number of predictor variables, lasso regression with the least angle regression algorithm was used to identify the subset of variables most strongly associated with each response variable. This analysis excludes variables whose regression coefficients are reduced to zero at each step of the selection process and identifies the predictors most strongly associated with the response variable. Each of the predictors was standardized to have a mean of zero and a standard deviation of one prior to running the analysis. The lasso regression model was estimated on a training set consisting of a random sample of 70% of the total data and evaluated on a test set of the remaining 30%. K-fold cross validation, specifying 10 folds, was performed. The regression coefficients identified the predictors used in the final model and how strongly each predictor was associated with the response variables.
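
To show how the lasso penalty pushes coefficients to exactly zero, here is a small sketch. It deliberately swaps in coordinate descent with soft-thresholding rather than the least angle regression algorithm used in this project, because it is shorter to write out. The function names and the toy data are hypothetical, and the data are assumed to be centred already.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values inside [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/2n)*||y - X b||^2 + alpha*||b||_1 by cycling over coefficients.

    X is a list of rows of centred predictors, y a list of centred responses.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # sum of column j times the residual that ignores column j
            z = 0.0    # sum of squares of column j
            for i in range(n):
                partial = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, alpha) / (z / n) if z else 0.0
    return beta


# Hypothetical, already centred toy data: y depends only on the first column.
X = [[-1.0, 1.0], [0.0, -2.0], [1.0, 1.0]]
y = [-2.0, 0.0, 2.0]

for alpha in (0.0, 0.5, 2.0):
    print(alpha, lasso_coordinate_descent(X, y, alpha))
```

As the penalty `alpha` grows, the first coefficient shrinks and eventually drops to zero, while the irrelevant second coefficient stays at zero throughout. This is the behavior that lets lasso act as a variable selection step.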

For each identified predictor, scatter plots were examined for trends over the years from 1972 to 2007. Plots of both the predictor and response variable were used to visualize their relationships and lines of fit. Bivariate correlation analysis, using the Pearson correlation test, was conducted on each predictor variable.
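
For reference, the Pearson correlation coefficient can be computed directly from standardized values, which ties it to the standardization step described earlier. The sketch below uses made-up numbers rather than the World Bank data, and the function names are hypothetical.

```python
import math


def standardize(xs):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / sd for x in xs]


def pearson_r(xs, ys):
    """Pearson r as the average product of paired z-scores."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Hypothetical series: a falling fertility rate and a rising index.
fertility = [7.0, 6.5, 6.0, 5.5]
index = [0.2, 0.3, 0.4, 0.5]

print(pearson_r(fertility, index))
```

A value near -1 means the two series move in opposite directions along a nearly straight line. It says nothing about which one, if either, drives the other.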

Results

Only the results for Burundi, Ethiopia, and Liberia will be reported, as the other countries demonstrated no change or only very slight change in the Ensure Environmental Sustainability Index.

Descriptive Statistics:

The following table shows the descriptive statistics for the Ensure Environmental Sustainability Index for each of the selected countries, starting from the lowest GDP per capita group to the highest.

The standard deviations are much greater for the lowest GDP per capita group compared to the others. In three countries, Seychelles, Canada, and Ireland, no change in the value of the index was observed. It would appear that countries that reach a certain GDP per capita will have achieved a mean Ensure Environmental Sustainability Index value above 0.9 and demonstrate little change.

Table1

The following graphs show the Ensure Environmental Sustainability Index for Burundi, Ethiopia, and Liberia:

Burundi:

BurundiIndex

Ethiopia:

EthiopiaIndex

Liberia:

LiberiaIndex

Bivariate and Lasso Regression Analysis:

Lasso regression was performed on each country’s Ensure Environmental Sustainability Index and its predictors. As Seychelles, Canada, and Ireland had index values that did not change, there were no observed correlations.

Each country demonstrated a different set of predictors that correlated with the Ensure Environmental Sustainability Index. However, in the low GDP per capita group, all three countries showed very strong correlations between fertility rate and the index (as demonstrated by the following graphs). In all three, the coefficient for the fertility rate predictor was far larger in magnitude than those of the other predictors.

BFertility
EFertility
LFertility

The following table shows the correlation coefficients for the fertility rate predictor along with the mean squared errors for both the training and test data sets.

Table2

In all three countries, as the fertility rate fell, the Ensure Environmental Sustainability Index value rose. This main predictor accounts for more than 90% of the variance observed in the index. However, the mean squared errors differed between the test and training data sets. This suggests that the predictive accuracy of the model was lower when applied to the test data set.

Conclusion

Overview and Implications:

Lasso regression analysis was used to identify predictors for each country’s Ensure Environmental Sustainability Index. By choosing eight countries from different GDP per capita levels, sub-group differences became apparent. For the countries around or above the global median GDP per capita, the index values showed little to no change over the years between 1972 and 2007. This meant that there were no demonstrated correlations between the various possible predictors and the index for these countries. Seychelles, Canada, and Ireland all had a standard deviation of zero in their index values. Meanwhile, the United States had slight change, with a standard deviation of 0.00221, and the predictor identified was particulate emissions damage, calculated as a percentage of the gross national income.

Only Burundi, Ethiopia, and Liberia (the Low GDP per Capita group) had significant results from the lasso regression. For these countries, fertility rate was the strongest predictor, accounting for over 90% of the variance in the Ensure Environmental Sustainability Index. The trend demonstrated that as fertility rate declined, the index rose. This suggests that there are significant differences between countries of varying degrees of economic development and that, in low GDP per capita countries, fertility rate correlates strongly with the Ensure Environmental Sustainability Index.

Despite these results for countries with low GDP per capita, there are serious limitations to both the data and the model. With so many possible predictors (each country has more than 450 World Bank created indicators, of which only around 30 each are selected for this project), it would be inappropriate to implement policy that focuses on fertility rate in hopes of creating more environmental sustainability.

Limitations:

There are several serious limitations that must be accounted for in the interpretation of the results of this project. First, data accuracy is limited. Many indicators have missing data between the years of 1972 and 2007. The collection of the data depends on the government agency, and the quality and accuracy of the data may not be comparable between different countries. Even the Ensure Environmental Sustainability Index has limitations, with most countries having values for only 18 out of the 35 possible years. Second, a 35-year time frame is quite short for data analysis. Each country has at most 35 data points for each indicator, and in many cases far fewer due to lack of data, so variances and outliers have a much greater effect on the analysis. Third, fertility rate is associated with a number of other variables and may not be causative for environmental sustainability. It is known that wealthier countries, with associated higher education and other social conditions, tend to have lower fertility rates. In this case, it would be an oversimplification to focus on fertility rate as a predictor, despite the results. Quantitative data analysis without a qualitative evaluation of a country’s condition is too limited. Lastly, the number of countries included in this project is very small (N=8). There are 211 recognized sovereignties in this data set. Each GDP per capita group can be expanded with more samples.

Future Directions:

To gain a more complete picture of the various relationships between the World Bank indicators, more indicators and countries should be included in the future to develop a more robust model. Based on the results of this current project, there are likely significant differences between countries of different wealth groups, as measured by GDP per capita. The subset of predictors that most strongly correlate with the Ensure Environmental Sustainability Index is likely to be very different depending on the GDP per capita levels. Furthermore, the results demonstrate that despite a common, strongly correlated predictor in the Low GDP per capita group, each country is unique. With that in mind, future efforts to develop a better understanding of environmental sustainability will require more longitudinal data, more accurate and higher quality data, and qualitative data to generate a more complete picture of each country.

 

Python Code (Ethiopia):

import pandas
import numpy as np
import scipy.stats  # the stats submodule must be imported explicitly for pearsonr
import os

#Graphing and Regression
import seaborn
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
import statsmodels.api as sm
#Lasso Regression
from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation
from sklearn.linear_model import LassoLarsCV

#Raw string keeps the Windows backslashes literal
os.chdir(r'C:\Users\William Hsu\Desktop\www.iamlliw.com\Capstone')
Ethiopia = pandas.read_csv('Ethiopia_Clean.csv', low_memory=False)

print(Ethiopia['AA'].describe())

#Forward fill missing years with the last recorded value
Ethiopia = Ethiopia.ffill()

#Center every column on its mean
Cap = Ethiopia - Ethiopia.mean()

#Lasso Regression
#CapReg is the same object as Cap, so this also drops 'Year' from Cap
CapReg = Cap
del CapReg['Year']
RegData = CapReg.copy()
from sklearn import preprocessing

predictors = RegData[['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'W', 'X', 'Y', 'Z', 'AB', 'AC', 'AD', 'AE', 'AF', 'AG', 'AH',
'AI', 'AL', 'AM']]

print(predictors.describe())

#Target as a 1-D Series, which is the shape LassoLarsCV expects
targets = Cap['AA']

#Split into Training and Test Data Sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.3, random_state=123)

print (pred_train.shape)
print (pred_test.shape)
print (tar_train.shape)
print (tar_test.shape)

#Specify Regression Model
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

#Print Regression Coefficients
print(dict(zip(predictors.columns, model.coef_)))

#Plot Coefficient Progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha cv')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

#Plot MSE for Each Fold
#mse_path_ is the current name of the attribute formerly called cv_mse_path_
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha cv')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean Squared Error')
plt.title('Mean Squared Error on Each Fold')

#MSE from Test and Training Data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('Training Data MSE')
print(train_error)
print('Test Data MSE')
print(test_error)

#R-Squared for Test and Training Data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('Training Data RSquared')
print(rsquared_train)
print('Test Data RSquared')
print(rsquared_test)

#Multiple Regression Analysis on Correlated Predictors
MRegTest = smf.ols(formula='AA ~ A + D + L + N + T + AB + AE', data=Cap).fit()
print(MRegTest.summary())

#Reduced model without N and AB (response assumed to be AA, the target used throughout)
MRegTest2 = smf.ols(formula='AA ~ A + D + L + T + AE', data=Cap).fit()
print(MRegTest2.summary())

MRegTest3 = smf.ols(formula='AA ~ A + D + L + N + T + AB + AE', data=Cap).fit()
print(MRegTest3.summary())

#Residual Plots
Resid = plt.figure(figsize=(8,5))
Resid = sm.qqplot(MRegTest3.resid, line='r')

stres = plt.figure(figsize=(8,5))
stres = pandas.DataFrame(MRegTest3.resid_pearson)
fig2 = plt.plot(stres, 'o', ls='None')
l = plt.axhline(y=0, color='r')
plt.ylabel('Standardized Residual')
plt.xlabel('Observation Number')
print(fig2)

MDG = plt.figure(figsize=(8,5))
MDG = sm.graphics.influence_plot(MRegTest3, size=8)
print(MDG)

#Basic Correlations, Pearson R Value
print(scipy.stats.pearsonr(Cap['A'], Cap['AA']))
print(scipy.stats.pearsonr(Cap['D'], Cap['AA']))
print(scipy.stats.pearsonr(Cap['L'], Cap['AA']))
print(scipy.stats.pearsonr(Cap['N'], Cap['AA']))
print(scipy.stats.pearsonr(Cap['T'], Cap['AA']))
print(scipy.stats.pearsonr(Cap['AB'], Cap['AA']))
print(scipy.stats.pearsonr(Cap['AE'], Cap['AA']))

#Scatter Plots of Identified Predictors
scatA = plt.figure(figsize=(8,5))
scatA = seaborn.regplot(x='Year', y='A', fit_reg=True, data=Ethiopia)
plt.xlabel('Year')
plt.ylabel('Universal Primary Education (% of Population)')
plt.title('Universal Primary Education Since 1972')

scatD = plt.figure(figsize=(8,5))
scatD = seaborn.regplot(x='Year', y='D', fit_reg=True, data=Ethiopia)
plt.xlabel('Year')
plt.ylabel('Carbon Dioxide Damage (%GNI)')
plt.title('Carbon Dioxide Since 1972')

scatL = plt.figure(figsize=(8,5))
scatL = seaborn.regplot(x='Year', y='L', fit_reg=True, data=Ethiopia)
plt.xlabel('Year')
plt.ylabel('Birth Rate (per 1000 People)')
plt.title('Birth Rate Since 1972')

scatN = plt.figure(figsize=(8,5))
scatN = seaborn.regplot(x='Year', y='N', fit_reg=True, data=Ethiopia)
plt.xlabel('Year')
plt.ylabel('Cereal Production (Metric Tons)')
plt.title('Cereal Production Since 1972')

scatT = plt.figure(figsize=(8,5))
scatT = seaborn.regplot(x='Year', y='T', fit_reg=True, data=Ethiopia)
plt.xlabel('Year')
plt.ylabel('Fertility Rate (Births per Woman)')
plt.title('Fertility Rate Since 1972 (Ethiopia)')

scatAB = plt.figure(figsize=(8,5))
scatAB = seaborn.regplot(x='Year', y='AB', fit_reg=True, data=Ethiopia)
plt.xlabel('Year')
plt.ylabel('Industry Value Added (%GNI)')
plt.title('Industry Value Added Since 1972')

scatAE = plt.figure(figsize=(8,5))
scatAE = seaborn.regplot(x='Year', y='AE', fit_reg=True, data=Ethiopia)
plt.xlabel('Year')
plt.ylabel('Population Density (per Sq.KM.)')
plt.title('Population Density Since 1972')

scatAA = plt.figure(figsize=(8,5))
scatAA = seaborn.regplot(x='Year', y='AA', fit_reg=False, data=Ethiopia)
plt.xlabel('Year')
plt.ylabel('Ensure Environmental Sustainability')
plt.title('Ensure Environmental Sustainability Index Since 1972')

#Fertility Rate Against the Index
FerEn = plt.figure(figsize=(8,5))
FerEn = seaborn.regplot(x='AA', y='T', fit_reg=True, data=Ethiopia)
plt.xlabel('Ensure Environmental Sustainability')
plt.ylabel('Fertility Rate (Births per Woman)')
plt.title('Fertility Rate vs. Ensure Environmental Sustainability Index (Ethiopia)')