Lasso Regression – Machine Learning

Machine Learning Data Analysis

This is the third lesson of the fourth course of my Data Analysis and Interpretation Specialization by Wesleyan University through Coursera.

If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development, and that I am asking the general question of whether urbanization drives economic growth.

For this assignment, the goal is to run a Lasso Regression that identifies which of my explanatory variables matter most for predicting GDP per Capita in 2007: Urban Population, Urban Population Growth, GDP Growth, Population Growth, Employment Rate, and Energy Use per Capita, all measured in 2007. Because Lasso is a linear regression model, it can take a quantitative response variable. Unlike in the previous lesson, I can use GDP per Capita 2007 as is, without having to convert it into a categorical variable.
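For reference, the Lasso estimate minimizes the usual least-squares loss plus a penalty on the absolute size of the coefficients. The form below follows the objective given in the scikit-learn documentation; the notation is introduced here for illustration only (n observations, response y, standardized predictor matrix X, coefficient vector w, penalty weight alpha), and the intercept is left out for brevity.

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\lVert w \rVert_1
```

A larger alpha pushes more coefficients to exactly zero. LassoLarsCV chooses alpha by cross-validation.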

This time, the training data set is 70% and the test data set is 30% of the original data, which means there are 100 observations in my training data set vs. 43 in my test data set.

pred_train.shape = (100, 6)
pred_test.shape  = (43, 6)
tar_train.shape  = (100,)
tar_test.shape   = (43,)
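The 100/43 split follows from applying test_size=.3 to the 143 complete observations. Here is a minimal sketch of the arithmetic, assuming scikit-learn rounds the test set size up and gives the remainder to the training set (that is my understanding of its behaviour for a fractional test_size, not something stated in the course):

```python
import math

n_samples = 100 + 43   # complete observations left after dropna(), from the shapes above
test_size = 0.3

# Assumption: the test set size is rounded up and the training set gets the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```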

Running the Lasso Regression gave the following coefficients:

Urban Population        = 3057.54737912
Urban Population Growth = 0
GDP Growth              = -289.06265017
Population Growth       = 0
Employment Rate         = 0
Energy Use per Capita   = 4961.13569776
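To make the ranking explicit, the coefficients above can be filtered and sorted by absolute size. This is a small sketch with the reported values typed in by hand; the names coefficients, selected and ranked are only for illustration.

```python
# Coefficients as reported above (the predictors were standardized before fitting)
coefficients = {
    "Urban Population": 3057.54737912,
    "Urban Population Growth": 0.0,
    "GDP Growth": -289.06265017,
    "Population Growth": 0.0,
    "Employment Rate": 0.0,
    "Energy Use per Capita": 4961.13569776,
}

# Keep the predictors the Lasso retained (non-zero coefficients)
selected = {name: coef for name, coef in coefficients.items() if coef != 0}

# Rank them by absolute magnitude, largest first
ranked = sorted(selected, key=lambda name: abs(selected[name]), reverse=True)

for name in ranked:
    print(f"{name}: {selected[name]:.2f}")
```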

[Figure: Regression Coefficients Progression for Lasso Paths (RegressionCoefLasso)]

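Again, like the results of my Random Forest analysis, Energy Use per Capita has the most impact on predicting GDP per Capita, followed by Urban Population. Because the predictors were standardized before fitting, the coefficient magnitudes can be compared directly. GDP Growth has a much smaller, negative coefficient. In this instance, however, Urban Population Growth, Population Growth, and Employment Rate all have a coefficient of 0: the Lasso penalty shrank them to zero, so they are excluded from the model completely. This means they add little predictive value once the other variables are in the model, not necessarily that they are unrelated to GDP per Capita.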
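The reason a Lasso can set coefficients to exactly 0, rather than merely making them small, is easiest to see in a simplified case. With a single standardized predictor, the Lasso solution is the ordinary least-squares coefficient passed through a soft-thresholding function. The sketch below illustrates that function only. It is not how LassoLarsCV computes its path (that uses the LARS algorithm), and the function name and example numbers are made up for illustration.

```python
def soft_threshold(ols_coef, penalty):
    """Shrink a coefficient toward zero by `penalty`; return 0 if it is no larger than that."""
    if ols_coef > penalty:
        return ols_coef - penalty
    if ols_coef < -penalty:
        return ols_coef + penalty
    return 0.0

# Hypothetical least-squares coefficients, all under the same penalty of 1.0
for coef in (3.0, -2.5, 0.4, -0.9):
    print(coef, "->", soft_threshold(coef, 1.0))
```

Large coefficients survive with reduced size, while small ones are set to exactly 0, which is the pattern seen in the results above.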

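Looking at the Mean Squared Error and R-squared values, the training data and the test data are fairly similar. The results suggest that the model explains about 53% of the variance in the training data set and 59% in the test data set. However, the mean squared error values are very high. MSE is measured in squared units of GDP per Capita, so taking the square root gives a typical prediction error of roughly 7,600 on the training data and 8,200 on the test data, in the same units as GDP per Capita. That could suggest a fairly poor model for predicting GDP per Capita using Urban Population, GDP Growth, and Energy Use per Capita.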

Training Data MSE = 57945645.8152
Test Data MSE = 67792139.0988
Training Data RSquared = 0.52963714125
Test Data RSquared = 0.586598813704
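Because MSE is in squared units, it is easier to judge after taking the square root. The sketch below does that with the values reported above. It also uses the identity R-squared = 1 - MSE / Var(target) to back out the spread of GDP per Capita in each data set for comparison. The variable names are only for illustration.

```python
import math

# Values reported above
mse_train, mse_test = 57945645.8152, 67792139.0988
r2_train, r2_test = 0.52963714125, 0.586598813704

# Root mean squared error: typical prediction error in units of GDP per Capita
rmse_train = math.sqrt(mse_train)
rmse_test = math.sqrt(mse_test)

# R-squared = 1 - MSE / Var(target), so the target's standard deviation
# on the same data set can be recovered from the two reported numbers
sd_train = math.sqrt(mse_train / (1 - r2_train))
sd_test = math.sqrt(mse_test / (1 - r2_test))

print(f"train: RMSE {rmse_train:.0f} vs. target SD {sd_train:.0f}")
print(f"test:  RMSE {rmse_test:.0f} vs. target SD {sd_test:.0f}")
```

An RMSE smaller than the target's standard deviation is another way of reading the R-squared values: the model does better than always predicting the mean, but the remaining error is still large.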

[Figure: Mean Squared Error on Each Fold (MSELasso.png)]

This is my code in Python:

import pandas
import numpy as np
import matplotlib.pyplot as plt
import os

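# train_test_split lives in sklearn.model_selection; the old sklearn.cross_validation module was removed
from sklearn.model_selection import train_test_split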
from sklearn.linear_model import LassoLarsCV

# Raw string so the backslashes in the Windows path are taken literally
os.chdir(r"C:\Users\William Hsu\Desktop\www.iamlliw.com\Data Analysis Course\Python")

urbandata = pandas.read_csv('Data1.csv', low_memory=False)

# Treat zeros as missing values
urbandata = urbandata.replace(0, np.nan)

RegData = urbandata[['Country', 'UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]
#RegData['UrbanPop2010'] = RegData['UrbanPop2010'] - RegData['UrbanPop2010'].mean()
# Keep only countries with complete data for every variable
RegData = RegData.dropna()

Data = RegData.copy()
from sklearn import preprocessing

# Standardize each predictor to mean 0 and standard deviation 1
Data['UrbanPop2007'] = preprocessing.scale(Data['UrbanPop2007'].astype('float64'))
Data['UrbanPopGrowth2007'] = preprocessing.scale(Data['UrbanPopGrowth2007'].astype('float64'))
Data['GDPGrowth2007'] = preprocessing.scale(Data['GDPGrowth2007'].astype('float64'))
Data['PopGrow2007'] = preprocessing.scale(Data['PopGrow2007'].astype('float64'))
Data['Employment2007'] = preprocessing.scale(Data['Employment2007'].astype('float64'))
Data['Energy2007'] = preprocessing.scale(Data['Energy2007'].astype('float64'))

print (Data.dtypes)
print (Data.describe())

predictors = Data[['UrbanPop2007', 'UrbanPopGrowth2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]

targets = Data.GDP2007

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.3, random_state=123)

print (pred_train.shape)
print (pred_test.shape)
print (tar_train.shape)
print (tar_test.shape)

model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

# Pair each predictor name with its fitted coefficient
print(dict(zip(predictors.columns, model.coef_)))

print(predictors.columns, model.coef_)

# Plot how each coefficient changes along the Lasso path
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha cv')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# Plot the cross-validated mean squared error for each fold
# (mse_path_ was called cv_mse_path_ in older scikit-learn releases)
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha cv')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean Squared Error')
plt.title('Mean Squared Error on Each Fold')
plt.show()

from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('Training Data MSE')
print(train_error)
print('Test Data MSE')
print(test_error)

rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('Training Data RSquared')
print(rsquared_train)
print('Test Data RSquared')
print(rsquared_test)
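The preprocessing.scale calls in the listing standardize each predictor to mean 0 and standard deviation 1, so that the Lasso penalty treats all predictors on the same footing. As a rough sketch of what that does for one column, here is a version using only the standard library and made-up numbers. The function name standardize is hypothetical; scikit-learn's own implementation works on arrays and, as far as I know, uses the population standard deviation as this sketch does.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical urban population percentages for five countries
urban_pop = [25.0, 40.0, 55.0, 70.0, 85.0]
scaled = standardize(urban_pop)
print([round(v, 3) for v in scaled])
```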
