2016-10-06

Running a Random Forest

It's time for Random Forest!

This post is my second assignment for Coursera's MOOC Machine Learning for Data Analysis. I'm going to show how to use Random Forest, a very powerful algorithm, to predict a binary outcome variable.

The data


Again, I am using GapMinder's dataset. I selected the features below to make up my training data:
  • armedforcesrate
  • co2emissions
  • internetuserate
  • employrate
  • femaleemployrate
  • urbanrate
  • oilperperson
  • relectricperperson

And my target variable is 'aboveavg', which means that a country's income per person is above the global average.
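
To make the definition of the target concrete, here is a minimal sketch of building such a flag with pandas. The income values are made up for illustration and are not taken from the GapMinder data. It also shows a detail worth knowing when the column has missing values: `mean()` skips them, while dividing the sum by the row count does not.

```python
import numpy as np
import pandas as pd

# Hypothetical income values, not taken from the GapMinder data
income = pd.Series([1000.0, 3000.0, np.nan, 8000.0], name="incomeperperson")

# Series.mean() skips missing values: (1000 + 3000 + 8000) / 3 = 4000
mean_known = income.mean()

# sum() also skips NaN, but len() counts every row: 12000 / 4 = 3000
mean_all_rows = income.sum() / len(income)

# Comparisons against NaN are False, so a missing income is never flagged
aboveavg = income > mean_known
print(aboveavg.tolist())
```

Which of the two averages to use as the threshold is a modelling choice. The code at the end of this post divides the sum by the row count.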

The number of estimators


To find the most accurate random forest classifier, I tried 25 different configurations, with the number of estimators going from 1 to 25, and chose the best one.
I did this with a loop, setting the number of estimators on each run and measuring the accuracy on the test set.
The best result came early, using only 10 estimators. The plot below gives a sense of it:

[Plot: accuracy against number of estimators]
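
The search described above can be sketched as follows. This is only an illustration under stated assumptions: it uses synthetic data from scikit-learn's `make_classification` as a stand-in for the GapMinder features, and the names `tree_counts` and `best_n` are mine. It picks the best number of estimators with `np.argmax` instead of reading it off the plot.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 8 features, binary target (not the GapMinder data)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

tree_counts = list(range(1, 26))  # 1 to 25 estimators
accuracy = np.zeros(len(tree_counts))

for i, n_trees in enumerate(tree_counts):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=123)
    forest.fit(X_train, y_train)
    accuracy[i] = accuracy_score(y_test, forest.predict(X_test))

# np.argmax returns the first index of the maximum,
# so ties go to the smaller forest
best_n = tree_counts[int(np.argmax(accuracy))]
print(best_n, accuracy.max())
```

Keep in mind that a single train/test split gives a noisy accuracy estimate, so the winning number can change from one split to another.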


The feature importances


For the final step, I want to see the importance of each feature. So I produced this table after running the random forest with the best number of estimators, 10.

feature              importance
internetuserate      0.27930588
armedforcesrate      0.03128309
co2emissions         0.06441896
femaleemployrate     0.04944427
oilperperson         0.17901002
relectricperperson   0.11783898
employrate           0.07382482
urbanrate            0.20487397

Finally, as you can see, the most important features in my random forest were 'internetuserate' and 'urbanrate'.
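
The classifier returns the importances as a bare array in the same order as the predictor columns, so it helps to pair them with the feature names and sort them. Here is a small sketch of that, using the values from the table above. The variable names are only for illustration.

```python
import pandas as pd

# Feature names in the order of the predictor columns, with the
# importances reported in the table above
features = ['internetuserate', 'armedforcesrate', 'co2emissions',
            'femaleemployrate', 'oilperperson', 'relectricperperson',
            'employrate', 'urbanrate']
importances = [0.27930588, 0.03128309, 0.06441896, 0.04944427,
               0.17901002, 0.11783898, 0.07382482, 0.20487397]

# Label each importance with its feature and sort from most to least important
ranking = pd.Series(importances, index=features).sort_values(ascending=False)
print(ranking)
```

With a fitted model, the same ranking should come from `pd.Series(classifier.feature_importances_, index=predictors.columns)`. The importances of a scikit-learn forest add up to 1, so each value can be read as a share of the total.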

The code


import pandas as pd
import numpy as np
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
import sklearn.metrics
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

data = pd.read_csv("gapminder.csv")
# .copy() avoids pandas' SettingWithCopyWarning when the columns are reassigned below
data_clean = data.dropna().copy()

# convert to numeric; entries that cannot be parsed become NaN
# (pd.to_numeric replaces the removed convert_objects)
numeric_columns = ['armedforcesrate', 'co2emissions', 'internetuserate',
                   'employrate', 'femaleemployrate', 'urbanrate',
                   'oilperperson', 'relectricperperson', 'incomeperperson']
for column in numeric_columns:
    data_clean[column] = pd.to_numeric(data_clean[column], errors='coerce')

# the sum is divided by the row count, so rows with a missing income still count
averageincome = data_clean['incomeperperson'].sum() / len(data_clean['incomeperperson'])
data_clean['aboveavg'] = data_clean['incomeperperson'] > averageincome

# remaining missing values are treated as 0
data_clean = data_clean.fillna(0)

predictors = data_clean[['internetuserate'
                        ,'armedforcesrate'
                        ,'co2emissions'
                        ,'femaleemployrate'
                        ,'oilperperson'
                        ,'relectricperperson'
                        ,'employrate'
                        ,'urbanrate']]

targets = data_clean['aboveavg']

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

#find the best number of estimators (1 to 25)
trees = list(range(1, 26))
accuracy = np.zeros(len(trees))

for idx, n_trees in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n_trees, random_state=123)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    acc = sklearn.metrics.accuracy_score(tar_test, predictions)
    accuracy[idx] = acc
    print('acc: {}  n_estimators: {}'.format(acc, n_trees))

#graphic to make things cool
plt.cla()
plt.plot(trees, accuracy)
plt.xlabel('# estimators')
plt.ylabel('accuracy')
plt.show()
# the classifier has no column_names() method; the names come from the predictors
print(list(predictors.columns))

#using the best number of estimators to see the feature importances
classifier = RandomForestClassifier(n_estimators=10, random_state=123)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
print('accuracy for 10 estimators: {}'.format(sklearn.metrics.accuracy_score(tar_test, predictions)))
# importances come out in the same order as the predictor columns
print(classifier.feature_importances_)

