The data
Again, I am using GapMinder's dataset. I selected the features below to make up my training data:
- armedforcesrate
- co2emissions
- internetuserate
- employrate
- femaleemployrate
- urbanrate
- oilperperson
- relectricperperson
My target variable is 'aboveavg', which indicates whether a country's income per person is above the global average.
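To make the target concrete, here is a minimal sketch of how such a boolean column can be derived with pandas. The four-row frame and its income values are made up for illustration and are not GapMinder data. It uses `.mean()`, which matches the sum-divided-by-length calculation in my code below whenever no values are missing.

```python
import pandas as pd

# Toy values standing in for GapMinder's incomeperperson column (illustrative only).
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "incomeperperson": [500.0, 1500.0, 4000.0, 10000.0],
})

# Average income over all rows in the frame.
average_income = toy["incomeperperson"].mean()

# Boolean target: True only when a country is strictly above the average.
toy["aboveavg"] = toy["incomeperperson"] > average_income

print(average_income)
print(toy)
```

Because the comparison is strict, a country sitting exactly on the average counts as not above it.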
The number of estimators
To find the most accurate random forest classifier, I tried 25 different configurations and chose the best one.
I did this with a loop, setting the number of estimators from 1 to 25 and recording the test accuracy of each run.
The best result happens early, using only 10 estimators, as the plot of accuracy against the number of estimators (produced by the code at the end of this post) shows.
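The search itself can be sketched in a few lines. This version is only an illustration: it runs on synthetic data from `make_classification` rather than on GapMinder, and the names `tree_counts` and `best_n` are my own for this example. It also fixes the `random_state` of the split so the sweep is repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the eight predictors and the boolean target.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Try forests with 1 to 25 trees and record the test accuracy of each.
tree_counts = range(1, 26)
accuracy = np.zeros(len(tree_counts))
for i, n_trees in enumerate(tree_counts):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=123)
    model.fit(X_train, y_train)
    accuracy[i] = accuracy_score(y_test, model.predict(X_test))

# argmax returns the first maximum, so ties go to the smaller forest.
best_n = tree_counts[int(np.argmax(accuracy))]
print("best number of estimators:", best_n)
```

Picking the winner on the same test set used to report accuracy gives an optimistic number; a separate validation split or cross-validation would be a more careful choice.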
The feature importances
For the final step, I want to see the importance of each feature, so I produced this table after running the random forest with the best number of estimators.
| feature | importance |
|---|---|
| internetuserate | 0.27930588 |
| armedforcesrate | 0.03128309 |
| co2emissions | 0.06441896 |
| femaleemployrate | 0.04944427 |
| oilperperson | 0.17901002 |
| relectricperperson | 0.11783898 |
| employrate | 0.07382482 |
| urbanrate | 0.20487397 |
Finally, as you can see, the most important features in my random forest were 'internetuserate' (about 0.28) and 'urbanrate' (about 0.20).
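A ranking like this is easier to read when the importances are paired with their feature names and sorted. The sketch below does that with a pandas Series, using the values from the table above typed in by hand; with a fitted classifier you would pass `classifier.feature_importances_` and the predictor column names instead.

```python
import pandas as pd

# Importances copied from the table above, in the order of the predictor columns.
features = [
    "internetuserate", "armedforcesrate", "co2emissions", "femaleemployrate",
    "oilperperson", "relectricperperson", "employrate", "urbanrate",
]
values = [
    0.27930588, 0.03128309, 0.06441896, 0.04944427,
    0.17901002, 0.11783898, 0.07382482, 0.20487397,
]

# Label each importance with its feature and sort from most to least important.
importances = pd.Series(values, index=features).sort_values(ascending=False)
print(importances)

top_two = importances.index[:2].tolist()
print(top_two)
```

The importances of a random forest sum to 1, so each value can be read as that feature's share of the total.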
The code
import pandas as pd
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
import sklearn.metrics
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
data = pd.read_csv("gapminder.csv")
# copy so the column assignments below do not write into a view of data
data_clean = data.dropna().copy()
# convert_objects is no longer available in current pandas;
# pd.to_numeric with errors='coerce' turns non-numeric entries into NaN
numeric_columns = ['armedforcesrate', 'co2emissions', 'internetuserate',
                   'employrate', 'femaleemployrate', 'urbanrate',
                   'oilperperson', 'relectricperperson', 'incomeperperson']
for column in numeric_columns:
    data_clean[column] = pd.to_numeric(data_clean[column], errors='coerce')
# note: sum() skips NaN values but len() counts every row
averageincome = data_clean['incomeperperson'].sum() / len(data_clean['incomeperperson'])
data_clean['aboveavg'] = data_clean['incomeperperson'] > averageincome
# remaining missing values are treated as 0
data_clean = data_clean.fillna(0)
predictors = data_clean[['internetuserate'
,'armedforcesrate'
,'co2emissions'
,'femaleemployrate'
,'oilperperson'
,'relectricperperson'
,'employrate'
,'urbanrate']]
targets = data_clean['aboveavg']
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
# find the best number of estimators
trees = range(1, 26)
accuracy = np.zeros(len(trees))
for idx, n_trees in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n_trees, random_state=123)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    acc = sklearn.metrics.accuracy_score(tar_test, predictions)
    accuracy[idx] = acc
    print('acc: ', acc, ' estimators: ', n_trees)
# plot accuracy against the number of estimators
plt.cla()
plt.plot(list(trees), accuracy)
plt.xlabel('# estimators')
plt.ylabel('accuracy')
plt.show()
# feature names, in the same order as feature_importances_
print(list(predictors.columns))
# use the best number of estimators to see the feature importances
classifier = RandomForestClassifier(n_estimators=10, random_state=123)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
print('accuracy for 10 estimators: ', sklearn.metrics.accuracy_score(tar_test, predictions))
print(classifier.feature_importances_)
