Latest Posts

Courtyard by Marriott – Arc Atlantic Regional Center

As Arc Atlantic Regional Center’s Project Coordinator, I work on various aspects of the company’s Courtyard by Marriott hotel development in Monterey Park. My responsibilities are split among project coordination, marketing, and marketing data analysis.

I took the initiative to create the company’s online presence. In collaboration with Suen Labs, I redeveloped Arc Atlantic Regional Center’s website, which launched on February 26. I created both the Chinese and English content for the site and provided design direction for its appearance.

[Image: SN.png]

My other marketing work includes creating media items such as monthly newsletters and press releases, and I coordinate with local media outlets such as TV stations and newspapers to create advertisements. I also provide Chinese-English translation for our marketing materials as needed, such as adding Chinese subtitles to the introduction video for investors.

[Image: JanAd – our January advertisement in World Journal LA, a local Chinese newspaper]

In terms of project coordination, I help the Project Manager contact the utilities and contractors relevant to the construction and development of the project. For example, I reached out to Southern California Edison and AT&T for their underground cable and pipeline information for our shoring plans. I review zoning policies and provide research on design guidelines for the hotel. I also draft correspondence and create contact and client information worksheets.

Lastly, I provide data analysis on Arc Atlantic Regional Center’s marketing campaigns. I created the template for the client information database. From the information in the database, I create monthly reports on the number of calls and appointments that result. These reports include insights on the source of information, in order to identify the most effective marketing platforms, and on the type of callers – whether they are looking to invest for themselves, for a friend, or for family.
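As a rough sketch of how a monthly report like this could be put together with pandas – assuming a hypothetical export of the client database with columns such as Source, CallerType, and Appointment, which are not the actual worksheet fields – the breakdown by marketing platform and caller type might look like this:

import pandas as pd

# Hypothetical export of the client information database (illustrative file name)
calls = pd.read_csv('client_log.csv')

# Calls and resulting appointments broken down by marketing source
by_source = calls.groupby('Source').agg(
    calls=('CallerType', 'size'),
    appointments=('Appointment', 'sum'),  # assumes Appointment is coded 0/1
)
by_source['conversion_rate'] = by_source['appointments'] / by_source['calls']

# Breakdown of caller types (self, friend, family)
by_caller = calls['CallerType'].value_counts()

print(by_source)
print(by_caller)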

Short Post: The Issues – Educate and Compromise

Recently, I have had several conversations with my friends back in Taiwan. A common thread of discussion is how hopeless they feel about social conditions getting better.

Since the recent presidential election, which resulted in the first female president in the country’s history, there has been a wave of euphoria. A large majority of the population, especially millennials, felt that the fall of the KMT was key to social change in Taiwan.

Yet the problems facing Taiwan, especially those facing my generation, are complex and difficult to untangle and dissect. One thing we all agreed on is that many people are unable to think for themselves. Many young people comment and criticize based on misinformation or a lack of independent thought. Part of the issue is how the media presents information, and part of the issue is how people consume media in Taiwan. Though media consumption is not a problem unique to Taiwan, Taiwanese culture tends to cause people to disregard and dismiss opposing opinions and voices.

I feel education is the best way to tackle these problems, but not formal education. The current education system in Taiwan has been broken for a while now, having gone through several reforms in the last five to ten years. Young people need to be able to calmly and rationally discuss problems without being clouded by bias and judgment. Young people need to be able to independently find information and assess its quality. These skills are typically acquired during college, but not every college and university fosters that quality of thought.

Instead of complaining about how change is impossible, why don’t we start reaching out and having conversations with one another? We should start conversations with people who have different worldviews and perspectives. We should start conversations with people who have different political ideologies. Of course, we need to start by educating ourselves on how to communicate, how to think, and how to evaluate. Is it possible to completely eliminate bias? Probably not, but we need to try. There will always be people who will never be able to communicate. There will always be people who are unwilling or unable to compromise or see the other side. Yet if enough of us can do it, I believe we can make a difference.

Social change can only occur when we communicate and compromise.

Perhaps we have forgotten how as a society, but it’s never too late to learn again.

WLM Financial Marketing and Branding

From September 2015 to January 2016, I worked as a Marketing Coordinator/Analyst for the real-estate broker WLM Financial. Based in Inglewood, CA, the company focused on providing first-time home buyers with the financial advice and loans needed to purchase their homes.

Using my knowledge of GIS and demographics, I identified the locations of their target markets. I proposed ten cities in the Los Angeles metropolitan area into which they could expand marketing operations. On the broker side of the business, I looked at home sales data, mortgage data, and property prices to identify other states where WLM Financial could apply for broker licenses.
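As a rough illustration only – the actual analysis used GIS layers and demographic data not reproduced here – a hypothetical table of LA metro cities (with assumed columns such as city, renter_share, pop_growth, and median_income) could be ranked with a simple composite score in pandas:

import pandas as pd

# Hypothetical demographic table: one row per city in the LA metro area
cities = pd.read_csv('la_metro_demographics.csv')

# Illustrative composite score: favor a larger share of renters, faster
# population growth, and more moderate incomes (weights are assumptions)
cities['score'] = (
    0.4 * cities['renter_share'].rank(pct=True)
    + 0.3 * cities['pop_growth'].rank(pct=True)
    + 0.3 * cities['median_income'].rank(pct=True, ascending=False)
)

# Top ten candidate cities for expanded marketing operations
print(cities.sort_values('score', ascending=False).head(10)[['city', 'score']])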

[Image: Hsu_Portfolio_Page_04]

After getting to know their operations, targets, and goals better, I created a marketing plan and a branding plan for the company.

In terms of brand building, I used their existing website and Facebook page as starting points and set goals to be reached by July 2016 and July 2017. I created a social media schedule for posting selected content to generate more views and reach a greater local audience. The original goal for July 2016 – 700 likes on their Facebook page – has already been reached as of today. During my time there, I also helped triple their unique daily reach from 1,000 to 3,000.

In terms of marketing, I identified a need for the company to build a reliable database and to collect appropriate information from their clients in order to understand the effectiveness of their marketing and the types of clients who tend to be successful in their loan applications. I created a plan for the database and a clear list of the information to be collected; database logs and templates were created from that plan. I also created a disclaimer template to be used in contracts to inform clients that their information is collected and that its privacy is maintained. Lastly, recognizing the need for a system of checks and balances to ensure the security of the database, I created a plan detailing database authorizations.
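A simplified sketch of what such an intake log could look like in pandas, with illustrative field names rather than the actual fields from the plan:

import pandas as pd

# Hypothetical client intake log: one row per client inquiry
columns = [
    'date', 'client_name', 'contact', 'referral_source',
    'loan_type', 'application_status', 'consent_on_file',
]
intake_log = pd.DataFrame(columns=columns)

# Example entry (illustrative only)
intake_log.loc[0] = [
    '2016-01-15', 'Jane Doe', 'jane@example.com', 'Facebook',
    'FHA', 'submitted', True,
]
print(intake_log)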

With the implementation of these plans, WLM Financial has been more effective in reaching its clients and in creating a professional, friendly company image in the local community. As of February 2016, the targets for July 2016 detailed in the branding plan have been reached.

Compassion is Light

After being reminded of some personal philosophies earlier this morning, I want to take the time to explain one of them.

I wrote this poem seven years ago, after a dinner conversation with my buddies about relationships with people:

An act in progress
is the star within you.

Simplified to four stages:
Red dwarf
Blue giant
Supernova
Black hole

Bring out the light in those around you
or engulf them with your own.
What is it that you want –
a harmony
or an overwhelming shell.

Self-confidence comes with the brightness
you emit
and cause others to emit.

Centered on yourself
you drain all around you as a black hole
pulls in even light.

Drowning in the ever amassing matter
accumulated throughout years of darkness.

None can save you but yourself.

So you, I am talking to
pick yourself up
and act in the manner you believe to be best.

I will expand on what I wrote in poetry form.

If we imagine each of us as a star (which I believe is appropriate, since we are basically made of stardust), we all have different kinds of shine. As stars, we are always giving off a shine in our social relations – the impression you make on others, the actions you take that affect how others feel, the compassion or empathy you demonstrate, and so on.

Now, just as there are different intensities of stars, there are different intensities of how we present ourselves. Some people, like supernovas, outshine and drown out everything and everyone around them in an explosion of their person.

Others, whose gravity about themselves is so intense, are like black holes. They suck in all the light of those around them, ever darkening their world.

There are also those that we all enjoy being around. They are stars that don’t outshine. They are stars that don’t have super gravity collapsing on themselves. Their shine is just warm and bright enough that they bring out the light in others. They complement the light in others without taking too much or giving too much. They are stars who are sure of what they are, just confident enough in their kind of light.

Obviously, this is a somewhat simplified view of relationships, but I think stars are good metaphors because we are all stars in our own right. We are all good in our own ways, but as social creatures, it’s about what we give off to others – and we can choose and decide that! Some of us need to restrain our light, some of us need to brighten, and yet others, like supernovas and black holes, need to become whole (no pun intended).

Now, why did I title this “Compassion is Light”? Well, the light you see in yourself and the light you can give off ultimately come from the love and confidence you have in yourself and in the world. Once we find our own light, we can shine it on the less fortunate and bring their light out. Our self-confidence and love allow us to truly care about those around us. With that warm and bright light, we can brighten those around us as they start to brighten themselves.

If our hearts can be light (in all meanings of that word) we can be compassionate and empathetic to the world. Our world needs more of that.

So, what kind of a star are you?

[Image via NASA]

Half-Year Goals – 2016

As I write this, almost one-sixth of the year has already passed. This also means…as fast as time flies, my birthday is right around the corner.

In the spirit of summer babies and half years, I have set out a list of goals I want to achieve by the time I turn 27. A man/woman without goals is not living but merely surviving.

Since I moved to Los Angeles last summer, much of my time has been consumed by job hunting. I have been stressed out because of a lack of productivity, a lack of stability in terms of my visa, and a lack of direction. I stopped being who I was: cheerful, fun, adventurous, enterprising, passionate, curious, to name a few generic/unique traits…

In order to rectify that, I am going to set some goals. I want to learn new skills and continue to become a better person.

I know that every year I am gaining more experience. Every year, I have grown and become a better man. This year, I am writing this down and posting it up so I can hold myself even more accountable; I have always been a man of my word.

If I need external motivators and motivation, so be it – until the motivation can come from within. In life, we all struggle with ourselves and our own demons. Some people can deal with it alone; some people need some hands along the way.

So this is my list, what’s yours?

  1. Learn Spanish and be at least 50% Fluent on Duolingo  (currently 19%) – I want to become trilingual eventually
  2. Learn to write Chinese (500 words) – I can type, read, and speak fluently but I probably only know how to write around 50 words right now
  3. Complete Data Analysis and Interpretation Specialization with Coursera
  4. Complete half of the GIS Specialization with Coursera
  5. Complete a site suitability study – Bakersfield project
  6. Write and submit poetry to http://www.poetryfoundation.org – Feb 25th
  7. Work out regularly
  8. Practice the violin every day for at least 15 minutes
  9. Start volunteering – Orientation is on April 5th
  10. Talk to my parents, sisters, and family more
  11. Play more chess
  12. Find a new job – definitely the hardest goal to achieve

You might wonder how that list has anything to do with becoming a better person. It’s all about being anchored, in order to bring love to those around you.

“The only thing we have to fear…is fear itself.” – Franklin D. Roosevelt


k-Means Cluster Analysis – Machine Learning

Machine Learning Data Analysis

This is the last lesson of the fourth course of my Data Analysis and Interpretation Specialization by Wesleyan University through Coursera.

If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development and am posing the general question of whether urbanization drives economic growth.

For this assignment, the goal is to run a k-means cluster analysis using my variables: Urban Population, Urban Population Growth, GDP Growth, Population Growth, Employment Rate, and Energy Use per Capita in 2007. GDP per Capita in 2007 is used as the validation variable. I am trying to identify whether there are clusters of characteristics associated with certain values of GDP per Capita, based on national data from 2007.

As before, the data is split into 70% training data and 30% test data. However, the k-means cluster analysis will only be run on the training data set.

The elbow curve graph shows that 2, 3, or 4 clusters could be interpreted, though it is inconclusive. I decided to analyze 2 clusters because I believe the greatest change in the elbow curve occurs at k = 2.

[Figure: Elbow1.png – selecting k with the elbow method]

If we look at the scatter plots for both the 3- and 2-cluster solutions, it is obvious that with 3 clusters there is less separation. With 2 clusters, though the second cluster is much more spread out, the first cluster is fairly bunched together and contains little overlap. The second cluster has much more in-cluster variance.

[Figure: 3Clusters.png – scatterplot of canonical variables for 3 clusters]

[Figure: 2Clusters.png – scatterplot of canonical variables for 2 clusters]

Clustering Variable Means by Cluster

cluster   index        UrbanPop2007   UrbanPopGrowth2007   GDP2007        GDPGrowth2007   PopGrow2007   Employment2007   Energy2007
0         128.867470   -0.209996      -0.106216            3622.924772    0.070663        -0.136930     -0.154707        -0.271868
1         114.470588    1.021819       0.134405            29775.197180   -0.431318        0.309997      0.117959         1.271157

If we look at the clustering variable means, it is obvious that there are significant differences between the two clusters. Furthermore, it would appear again that Urban Population and Energy Use per Capita have the greatest associations with GDP per Capita. The first cluster has low Urban Population and Energy Use per Capita means associated with a very low GDP per Capita, while the second cluster has high Urban Population and Energy Use per Capita means associated with a comparatively high GDP per Capita. This suggests that countries that are highly urbanized and consume more energy have higher GDP per Capita. This is logical considering that the most technologically advanced nations on Earth tend to be much richer, more urbanized, and much more energy intensive in terms of industry. The question then becomes: does high urbanization and high energy consumption result in higher GDP per Capita, or are they just characteristics of nations with higher GDP per Capita?

To run an external validation of the clusters, an ANOVA was run, which demonstrated significant differences between the two clusters. The p-value was <0.0001, and the two clusters had very different mean values: 3622.92 and 29775.20. However, the standard deviations were large, demonstrating the high in-cluster variance in both clusters. This could very well be the result of the small sample size: with the training data set at 70% of the sample, there are only 100 observations.

This is my code in Python:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

from sklearn.cross_validation import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans

os.chdir(r"C:\Users\William Hsu\Desktop\www.iamlliw.com\Data Analysis Course\Python")

urbandata = pd.read_csv('Data1.csv', low_memory=False)

# Treat zeros as missing values
urbandata = urbandata.replace(0, np.nan)

RegData = urbandata[['UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]
#RegData['UrbanPop2010'] = RegData['UrbanPop2010'] - RegData['UrbanPop2010'].mean()
RegData = RegData.dropna()

Data = RegData[['UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]
Data.describe()

# Standardize the clustering variables to mean 0 and standard deviation 1
Data['UrbanPop2007'] = preprocessing.scale(Data['UrbanPop2007'].astype('float64'))
Data['UrbanPopGrowth2007'] = preprocessing.scale(Data['UrbanPopGrowth2007'].astype('float64'))
Data['GDPGrowth2007'] = preprocessing.scale(Data['GDPGrowth2007'].astype('float64'))
Data['PopGrow2007'] = preprocessing.scale(Data['PopGrow2007'].astype('float64'))
Data['Employment2007'] = preprocessing.scale(Data['Employment2007'].astype('float64'))
Data['Energy2007'] = preprocessing.scale(Data['Energy2007'].astype('float64'))

print(Data.describe())

# 70% training / 30% test split
clus_train, clus_test = train_test_split(Data, test_size=.3, random_state=123)

# Elbow method: average distance to cluster centers for k = 1..9
from scipy.spatial.distance import cdist
clusters = range(1, 10)
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])

plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')

# 3-cluster solution
model3 = KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign = model3.predict(clus_train)

# Plot the clusters on the first two canonical (principal) components
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
fig2 = plt.figure(figsize=(12, 8))
fig2 = plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model3.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()

# 2-cluster solution
model2 = KMeans(n_clusters=2)
model2.fit(clus_train)
clusassign = model2.predict(clus_train)

pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
fig3 = plt.figure(figsize=(12, 8))
fig3 = plt.scatter(x=plot_columns[:, 0], y=plot_columns[:, 1], c=model2.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 2 Clusters')
plt.show()

# Attach the 2-cluster labels back to the training observations
clus_train.reset_index(level=0, inplace=True)
cluslist = list(clus_train['index'])
labels = list(model2.labels_)
newlist = dict(zip(cluslist, labels))
newlist
newclus = DataFrame.from_dict(newlist, orient='index')
newclus
newclus.columns = ['cluster']

newclus.reset_index(level=0, inplace=True)
merged_train = pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
merged_train.cluster.value_counts()

# Clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)

# External validation: ANOVA of GDP per Capita across the two clusters
GDP = RegData['GDP2007']
GDP_train, GDP_test = train_test_split(GDP, test_size=.3, random_state=123)
GDP_train1 = pd.DataFrame(GDP_train)
GDP_train1.reset_index(level=0, inplace=True)
sub1 = merged_train[['GDP2007', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

GDPmod = smf.ols(formula='GDP2007 ~ C(cluster)', data=sub1).fit()
print(GDPmod.summary())

print('means for GDP by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

print('standard deviations for GDP by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

mc1 = multi.MultiComparison(sub1['GDP2007'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())

Lasso Regression – Machine Learning

Machine Learning Data Analysis

This is the third lesson of the fourth course of my Data Analysis and Interpretation Specialization by Wesleyan University through Coursera.

If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development and am posing the general question of whether urbanization drives economic growth.

For this assignment, the goal is to run a lasso regression that identifies the impact of each of my explanatory variables: Urban Population, Urban Population Growth, GDP Growth, Population Growth, Employment Rate, and Energy Use per Capita in 2007. As it is a linear regression model, I am able to use a quantitative response variable. Unlike in the previous lesson, I can use GDP per Capita 2007 as is, without having to convert it into a categorical variable.

This time, the training data set is 70% and the test data set is 30% of the original data, which means there are 100 observations in my training data set and 43 in my test data set.

pred_train.shape = (100, 6)
pred_test.shape = (43, 6)
tar_train.shape = (100,)
tar_test.shape = (43,)

Running the Lasso Regression gave the following coefficients:

Urban Population = 3057.54737912
Urban Population Growth = 0
GDP Growth = -289.06265017
Population Growth = 0
Employment Rate = 0
Energy Use per Capita = 4961.13569776

[Figure: RegressionCoefLasso – regression coefficient progression for lasso paths]

Again, like the results of my random forest analysis, Energy Use per Capita has the most impact on predicting GDP per Capita, followed by Urban Population. In this instance, however, it would appear that, with coefficients of 0, Urban Population Growth, Population Growth, and Employment Rate have no effect on predicting GDP per Capita; they are excluded from the model completely.

Looking at the mean squared error and R-squared values, the training data and the test data are fairly similar. The results suggest that the model explains about 53% and 59% of the variance in the training and test data sets, respectively. However, the mean squared error values are very high, which could suggest that the model is a poor predictor of GDP per Capita using Urban Population and Energy Use per Capita.

Training Data MSE = 57945645.8152
Test Data MSE = 67792139.0988
Training Data RSquared = 0.52963714125
Test Data RSquared = 0.586598813704

[Figure: MSELasso.png – mean squared error on each fold]

This is my code in Python:

import pandas
import numpy as np
import matplotlib.pyplot as plt
import os

from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV

os.chdir(r"C:\Users\William Hsu\Desktop\www.iamlliw.com\Data Analysis Course\Python")

urbandata = pandas.read_csv('Data1.csv', low_memory=False)

# Treat zeros as missing values
urbandata = urbandata.replace(0, np.nan)

RegData = urbandata[['Country', 'UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]
#RegData['UrbanPop2010'] = RegData['UrbanPop2010'] - RegData['UrbanPop2010'].mean()
RegData = RegData.dropna()

Data = RegData.copy()
from sklearn import preprocessing

# Standardize the predictors to mean 0 and standard deviation 1
Data['UrbanPop2007'] = preprocessing.scale(Data['UrbanPop2007'].astype('float64'))
Data['UrbanPopGrowth2007'] = preprocessing.scale(Data['UrbanPopGrowth2007'].astype('float64'))
Data['GDPGrowth2007'] = preprocessing.scale(Data['GDPGrowth2007'].astype('float64'))
Data['PopGrow2007'] = preprocessing.scale(Data['PopGrow2007'].astype('float64'))
Data['Employment2007'] = preprocessing.scale(Data['Employment2007'].astype('float64'))
Data['Energy2007'] = preprocessing.scale(Data['Energy2007'].astype('float64'))

Data.dtypes
print(Data.describe())

predictors = Data[['UrbanPop2007', 'UrbanPopGrowth2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]

targets = Data.GDP2007

# 70% training / 30% test split
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.3, random_state=123)

print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)

# Lasso regression via least angle regression with 10-fold cross-validation
model = LassoLarsCV(cv=10, precompute=False).fit(pred_train, tar_train)

dict(zip(predictors.columns, model.coef_))

print(predictors.columns, model.coef_)

# Plot the coefficient progression along the lasso path
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha cv')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficient Progression for Lasso Paths')

# Plot mean squared error for each cross-validation fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k', label='alpha cv')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean Squared Error')
plt.title('Mean Squared Error on Each Fold')

# Mean squared error for the training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('Training Data MSE')
print(train_error)
print('Test Data MSE')
print(test_error)

# R-squared for the training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('Training Data RSquared')
print(rsquared_train)
print('Test Data RSquared')
print(rsquared_test)

Random Forests – Machine Learning

Machine Learning Data Analysis

This is the second lesson of the fourth course of my Data Analysis and Interpretation Specialization by Wesleyan University through Coursera.

If you have been following along with my work, you will know that I am interested in the relationship between urbanization and economic development and am posing the general question of whether urbanization drives economic growth.

For this assignment, the goal is to create a random forest that identifies the relative importance of my explanatory variables: Urban Population, Urban Population Growth, GDP Growth, Population Growth, Employment Rate, and Energy Use per Capita in 2007. For my response variable, I created a categorical variable from GDP per Capita 2007, separating the data into two levels: countries where GDP per Capita 2007 is 10000 or lower are coded 0 (low), and countries where it is higher than 10000 are coded 1 (high).

Just as in the last assignment, with my test sample set at 40%, the result is 58 test samples and 85 training samples out of 143 total, with 6 explanatory variables: Urban Population 2007, Urban Population Growth 2007, GDP Growth 2007, Population Growth 2007, Employment Rate 2007, and Energy Use 2007.

This is demonstrated in the output below:
pred_train.shape = (85, 6)
pred_test.shape = (58, 6)
tar_train.shape = (85,)
tar_test.shape = (58,)

Classification Matrix
[41    3]
[ 4   10]

Accuracy Score = 0.879310344828

The classification matrix shows that the model classified 41 negatives and 10 positives correctly, while there were four false negatives and three false positives. With an accuracy score of about 88%, the model classified 88% of the test sample correctly as either a high-income or a low-income country.
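As a quick arithmetic check, the accuracy score follows directly from the classification matrix: accuracy = (41 + 10) / (41 + 3 + 4 + 10) = 51 / 58 ≈ 0.879.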

Measure of Importance of Explanatory Variables (Importance Scores)
Urban Population = 0.26979787
Urban Population Growth = 0.10315226
GDP Growth = 0.14031019
Population Growth = 0.04450653
Employment Rate = 0.07961939
Energy Use per Capita = 0.36261374

From these measures, the random forest results show that Energy Use per Capita is actually the most important variable in predicting GDP per Capita, and Population Growth is the least important. Urban Population is second in importance, which provides some support for the idea that urban agglomeration drives economic growth.

Finally, let’s look at the number of trees needed to generate a reasonably accurate result:

[Figure: RandomForest.png – accuracy vs. number of trees]

The graph shows that accuracy climbs to a maximum of about 95% at ten trees and then settles between 90% and 93% with successive trees. This means that with the current data and parameters, at least ten decision trees need to be generated and interpreted for the best model.

This is my code in Python:

from pandas import Series, DataFrame
import pandas
import numpy
import os
import matplotlib.pyplot as plt

from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

import sklearn.metrics

from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

os.chdir(r"C:\Users\William Hsu\Desktop\www.iamlliw.com\Data Analysis Course\Python")

urbandata = pandas.read_csv('Data1.csv', low_memory=False)

# Treat zeros as missing values
urbandata = urbandata.replace(0, numpy.nan)

RegData = urbandata[['Country', 'UrbanPop2007', 'UrbanPopGrowth2007', 'GDP2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]
#RegData['UrbanPop2010'] = RegData['UrbanPop2010'] - RegData['UrbanPop2010'].mean()
RegData = RegData.dropna()

# Binary response: 0 = GDP per Capita <= 10000 (low), 1 = GDP per Capita > 10000 (high)
def GDPCat(row):
    if row['GDP2007'] <= 10000:
        return 0
    elif row['GDP2007'] > 10000:
        return 1

RegData['GDPCat'] = RegData.apply(lambda row: GDPCat(row), axis=1)

RegData.dtypes
print(RegData.describe())

predictors = RegData[['UrbanPop2007', 'UrbanPopGrowth2007', 'GDPGrowth2007', 'PopGrow2007', 'Employment2007', 'Energy2007']]

targets = RegData.GDPCat

# 60% training / 40% test split
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)

#Build Model on Test Sample
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=25)
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test, predictions))

print(sklearn.metrics.accuracy_score(tar_test, predictions))

# Fit an extra trees model to measure feature importance
model = ExtraTreesClassifier()
model.fit(pred_train, tar_train)

print(model.feature_importances_)

# Accuracy as a function of the number of trees, from 1 to 25
trees = range(25)
accuracy = numpy.zeros(25)

for idx in range(len(trees)):
    classifier = RandomForestClassifier(n_estimators=idx+1)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)

I Forgot Hope, Is What Makes the World Beautiful

*very very rough hahaha kind of rusty.

I believe that we are born pure
Clean without evil
But why do these things happen

So I descended into the darkness
In my search for the meaning
Of why

Fog and mist crept in
Covered every path
Colored everything gray

Trapped within the shadows
Trapped within my cave
Within my own head

So I saw nothing
Save for darkness
And I accepted it as fact

 

I came out tainted
Forgotten my belief
With no reminders of what came first

But back into the light
The sun shines
Cleared away the shades

Then I remembered
Hope is what makes the world beautiful
Because without hope there is no love.