Introduction
Using the GapMinder dataset, we are going to predict whether or not a country has an income per person above the world average.
Classification tree
The choice of tools
I am using Visual Studio 2015 with Python 2.7. The libraries required to run this code are:
- pandas
- numpy
- matplotlib
- sklearn
- pydotplus
- IPython
All of them are free and easy to find online.
The data
As in the previous posts, I have chosen the GapMinder dataset to build my tree. You can learn more about it at www.gapminder.org.
The code
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
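# train_test_split lives in sklearn.model_selection (the old sklearn.cross_validation module was removed)
from sklearn.model_selection import train_test_split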
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import classification_report
import sklearn.metrics
import pydotplus
import io
from io import BytesIO
from IPython.display import Image
data = pd.read_csv("gapminder.csv")
# Drop rows that pandas already recognises as missing; .copy() avoids SettingWithCopyWarning below
data_clean = data.dropna().copy()
# Convert every column used below to numeric; values that cannot be parsed become NaN
numeric_columns = ['incomeperperson', 'alcconsumption', 'armedforcesrate',
                   'breastcancerper100th', 'co2emissions', 'internetuserate',
                   'employrate', 'femaleemployrate', 'urbanrate', 'hivrate',
                   'lifeexpectancy', 'oilperperson', 'polityscore',
                   'relectricperperson', 'suicideper100th']
for column in numeric_columns:
    data_clean[column] = pd.to_numeric(data_clean[column], errors='coerce')
# Target: True when a country's income per person is above the average
averageincome = data_clean['incomeperperson'].sum() / len(data_clean['incomeperperson'])
data_clean['aboveavg'] = data_clean['incomeperperson'] > averageincome
# Replace the remaining NaN values with 0 so the classifier can handle them
data_clean = data_clean.fillna(0)
predictors = data_clean[[
    'internetuserate', 'alcconsumption', 'armedforcesrate',
    'breastcancerper100th', 'co2emissions', 'femaleemployrate',
    'hivrate', 'lifeexpectancy', 'oilperperson', 'polityscore',
    'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate'
]]
targets = data_clean['aboveavg']
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)
classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
# print(...) with a single argument works in both Python 2.7 and Python 3
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))
# out_file=None makes export_graphviz return the DOT source as a string
dot_data = tree.export_graphviz(classifier, out_file=None,
                                feature_names=list(predictors.columns))
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf('tree.pdf')
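One detail of the code above is worth a small illustration: averageincome is computed as sum() divided by len(). In pandas, sum() skips NaN values while len() still counts them, so any missing incomes pull the average down compared with mean(). The sketch below uses made-up numbers (not GapMinder values) to show the difference. Whether this matters for the real file depends on how many income values are missing, which the post does not state.

```python
import pandas as pd

# Hypothetical incomes; NaN stands in for a country with no income value
income = pd.Series([1000.0, 3500.0, float('nan'), 7500.0])

# Same expression as in the post: NaN is skipped by sum() but counted by len()
average_with_len = income.sum() / len(income)   # 12000 / 4

# mean() skips NaN in both the numerator and the denominator
average_with_mean = income.mean()               # 12000 / 3

# The middle country (3500) changes class depending on which average is used
above_with_len = (income > average_with_len).tolist()
above_with_mean = (income > average_with_mean).tolist()

print(average_with_len)
print(average_with_mean)
print(above_with_len)
print(above_with_mean)
```

If the stricter definition is preferred, replacing the sum()/len() expression with .mean() would be one option, but the results reported in this post were produced with the original expression.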
The results
[[62 8]
[ 4 12]]
This means that 74 of the 86 test observations were predicted correctly: 62 as 0 (income per person below the world average) and 12 as 1 (above the world average). Another 8 were predicted as 1 but are actually 0 (false positives), and 4 were predicted as 0 but are actually 1 (false negatives).
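As a quick check of the arithmetic, the usual summary metrics can be derived by hand from the four cells of the matrix. This is only a sketch of the calculation. It assumes the scikit-learn convention that rows are actual classes and columns are predicted classes, and the variable names are my own.

```python
# Confusion matrix from the post: rows are actual classes, columns are predicted classes
matrix = [[62, 8],
          [4, 12]]

tn, fp = matrix[0]   # actual 0: predicted 0 (true negative), predicted 1 (false positive)
fn, tp = matrix[1]   # actual 1: predicted 0 (false negative), predicted 1 (true positive)

total = tn + fp + fn + tp
accuracy = (tn + tp) / float(total)    # share of all predictions that are correct
precision = tp / float(tp + fp)        # share of predicted 1s that really are 1
recall = tp / float(tp + fn)           # share of actual 1s that were found

print(round(accuracy, 3))
print(precision)
print(recall)
```

With these counts the accuracy is 74/86 (about 0.86), the precision for class 1 is 12/20 = 0.6 and the recall is 12/16 = 0.75, so the tree is noticeably better at recognising below-average countries than above-average ones.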
The tree's first split is on internetuserate, making it the most significant variable selected, followed by relectricperperson on the left branch and alcconsumption on the right.
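Besides reading the PDF, the same information can be taken from the fitted model itself. The sketch below is an illustration on hypothetical toy data, not the GapMinder data: the values and the two-column layout are assumptions made up for the example. It uses the feature_importances_ attribute and the tree_.feature array of a fitted DecisionTreeClassifier to find the root split and rank the variables.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: column 0 separates the classes, column 1 is constant
features = [[10, 1], [20, 1], [80, 1], [90, 1]]
labels = [0, 0, 1, 1]
feature_names = ['internetuserate', 'urbanrate']

toy_classifier = DecisionTreeClassifier(random_state=0).fit(features, labels)

# Node 0 is the root; tree_.feature holds the column index each node splits on
root_feature = feature_names[toy_classifier.tree_.feature[0]]
print('root split: %s' % root_feature)

# Pair each name with its importance and list the most important first
ranked = sorted(zip(feature_names, toy_classifier.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print('%s: %.3f' % (name, importance))
```

On the tree from this post, the same two lookups would be applied to classifier, with list(predictors.columns) as the names. Because train_test_split is random, the ranking can differ between runs unless a random_state is fixed.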
