*Machine Learning Data Analysis*

*This is the third lesson of the fourth course of the Data Analysis and Interpretation Specialization from Wesleyan University on Coursera.*

*If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development, and that I am posing the general question of whether urbanization drives economic growth.*

For this assignment, the goal is to run a lasso regression that identifies the impact of each of my explanatory variables (Urban Population, Urban Population Growth, GDP Growth, Population Growth, Employment Rate, and Energy Use per Capita, all for 2007) on GDP per Capita. Because lasso is a linear regression model, the response variable can be quantitative: unlike the previous lesson, I can use GDP per Capita 2007 as is, without converting it into a categorical variable.
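One practical note: because the lasso applies the same penalty to every coefficient, the predictors need to be on a comparable scale before fitting, which is why the code below standardizes each one with `preprocessing.scale`. A minimal sketch of what that standardization does:

```python
import numpy as np
from sklearn import preprocessing

x = np.array([10.0, 20.0, 30.0, 40.0])

# Center to mean 0 and rescale to standard deviation 1
z = preprocessing.scale(x)
print(z.mean(), z.std())  # mean ~0.0, standard deviation ~1.0
```

Without this step, a variable measured in large units (like Energy Use per Capita) and one measured in percentages (like GDP Growth) would be penalized very differently.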

This time, the training set holds 70% of the data and the test set the remaining 30%, which works out to 100 observations in my training set vs. 43 in my test set.

pred_train.shape = (100, 6)

pred_test.shape = (43, 6)

tar_train.shape = (100,)

tar_test.shape = (43,)
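These shapes follow directly from how scikit-learn splits the data: with `test_size=.3`, the test set gets ceil(0.3 × 143) = 43 observations and the training set the remaining 100. A quick check with dummy data of the same dimensions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((143, 6))  # same shape as the predictor matrix
y = np.zeros(143)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.3, random_state=123)
print(X_train.shape, X_test.shape)  # (100, 6) (43, 6)
```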

Running the Lasso Regression gave the following coefficients:

Urban Population = 3057.54737912

Urban Population Growth = 0

GDP Growth = -289.06265017

Population Growth = 0

Employment Rate = 0

Energy Use per Capita = 4961.13569776

Again, like the results of my Random Forest analysis, Energy Use has the most impact on predicting GDP per Capita, followed by Urban Population. In this instance, however, the lasso shrank the coefficients for Urban Population Growth, Population Growth, and Employment Rate to exactly zero, excluding them from the model completely: given the other predictors, they appear to add nothing to the prediction of GDP per Capita.
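This exact-zero behaviour is the defining feature of the lasso penalty: coefficients whose contribution falls below a data-driven threshold are shrunk all the way to zero, not just toward it. A synthetic illustration (not the assignment's data) using scikit-learn's plain `Lasso`:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
# y depends only on the first two columns; the last two are pure noise predictors
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = Lasso(alpha=0.5).fit(X, y)
print(model.coef_)  # the two noise predictors get coefficients of exactly 0.0
```

An ordinary least-squares fit would instead give the noise columns small but nonzero coefficients, which is why lasso doubles as a variable-selection method.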

Looking at the mean squared error and R-squared values, the training and test data perform fairly similarly: the model explains about 53% of the variance in the training set and 59% in the test set. However, the mean squared error values are very high, which could suggest that the model predicts GDP per Capita poorly from Urban Population and Energy Use per Capita alone.
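One way to put those MSE figures on an interpretable scale is to take the square root, which converts them back into the units of GDP per Capita. A quick check using the values reported below:

```python
import math

train_mse = 57945645.8152
test_mse = 67792139.0988

# RMSE is in the same units as GDP per Capita, unlike the squared-unit MSE
rmse_train = math.sqrt(train_mse)
rmse_test = math.sqrt(test_mse)
print(round(rmse_train), round(rmse_test))  # 7612 8234
```

So the model's typical prediction error is on the order of 7,600 to 8,200 in GDP per Capita, which gives a more concrete sense of the fit than the raw MSE values.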

Training Data MSE = 57945645.8152

Test Data MSE = 67792139.0988

Training Data RSquared = 0.52963714125

Test Data RSquared = 0.586598813704

**This is my code in Python:**

import pandas
import numpy as np
import matplotlib.pyplot as plt
import os
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn
from sklearn.linear_model import LassoLarsCV
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error

os.chdir(r"C:\Users\William Hsu\Desktop\www.iamlliw.com\Data Analysis Course\Python")

urbandata = pandas.read_csv('Data1.csv', low_memory=False)
urbandata = urbandata.replace(0, np.nan)

RegData = urbandata[['Country', 'UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]
#RegData['UrbanPop2010'] = RegData['UrbanPop2010'] - RegData['UrbanPop2010'].mean()
RegData = RegData.dropna()
Data = RegData.copy()

# Standardize the predictors so the lasso penalty treats them all equally
Data['UrbanPop2007'] = preprocessing.scale(Data['UrbanPop2007'].astype('float64'))
Data['UrbanPopGrowth2007'] = preprocessing.scale(Data['UrbanPopGrowth2007'].astype('float64'))
Data['GDPGrowth2007'] = preprocessing.scale(Data['GDPGrowth2007'].astype('float64'))
Data['PopGrow2007'] = preprocessing.scale(Data['PopGrow2007'].astype('float64'))
Data['Employment2007'] = preprocessing.scale(Data['Employment2007'].astype('float64'))
Data['Energy2007'] = preprocessing.scale(Data['Energy2007'].astype('float64'))

print(Data.dtypes)
print(Data.describe())

predictors = Data[['UrbanPop2007', 'UrbanPopGrowth2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]
targets = Data.GDP2007

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.3, random_state=123)
print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)

# Fit the lasso with the penalty parameter chosen by 10-fold cross-validation
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)
print(dict(zip(predictors.columns, model.coef_)))

# Plot the coefficient progression along the lasso path
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficient Progression for Lasso Paths')

# Plot cross-validated MSE along the path (mse_path_ replaced cv_mse_path_ in newer scikit-learn)
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean Squared Error')
plt.title('Mean Squared Error on Each Fold')

# Evaluate fit on the training and test sets
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('Training Data MSE')
print(train_error)
print('Test Data MSE')
print(test_error)

rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('Training Data RSquared')
print(rsquared_train)
print('Test Data RSquared')
print(rsquared_test)