2016-09-29

Statistical inference with classification tree

Today's post is an assignment for the Coursera MOOC "Machine Learning for Data Analysis".

Introduction


This post shows how to create a classification tree with Python, along with the code and its results.
Using the GapMinder dataset, we are going to predict whether or not a country has an income per person above the world average.

Classification tree


A classification tree is a model, structured as a tree of simple questions, used to predict a categorical variable (a binary one in this post) from preprocessed data. It lets us answer simple questions like "will it rain today?" by analysing other variables such as humidity, wind, solar radiance, temperature, and many others.
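As a rough illustration of the idea, the sketch below hand-builds a tiny tree for the rain question. The variables, thresholds and the predict helper are invented for this example; a real tree learns its questions from data.

```python
# A hand-built toy tree: each internal node asks one question about a single
# variable, and each leaf holds the predicted class.
# The variables and thresholds are invented for illustration only.
toy_tree = {
    "feature": "humidity",
    "threshold": 70.0,
    "left": {"leaf": "no rain"},           # humidity <= 70
    "right": {                             # humidity > 70
        "feature": "wind",
        "threshold": 20.0,
        "left": {"leaf": "rain"},          # wind <= 20
        "right": {"leaf": "no rain"},      # wind > 20
    },
}


def predict(node, observation):
    """Walk from the root to a leaf, following one branch per question."""
    while "leaf" not in node:
        if observation[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]


print(predict(toy_tree, {"humidity": 85.0, "wind": 10.0}))  # rain
print(predict(toy_tree, {"humidity": 40.0, "wind": 10.0}))  # no rain
```

Every prediction is just a short walk from the root to a leaf, which is what makes trees easy to read.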

The choice of tools


I have chosen Python and its libraries to create my classification tree because of its flexibility and similarity to other programming languages.
I am using Visual Studio 2015 with Python 2.7. The libraries required to run this code are:

  • pandas
  • numpy
  • matplotlib
  • sklearn
  • pydotplus
  • IPython

All of them are free and can easily be found with a Google search.

The data


As in the previous posts, I have chosen the GapMinder dataset to create my tree. You can get to know it better at their website, www.gapminder.org

The code


from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
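# sklearn.model_selection replaces the older sklearn.cross_validation module (scikit-learn 0.18 and later)
from sklearn.model_selection import train_test_split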
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import classification_report
import sklearn.metrics
import pydotplus
import io
from io import BytesIO
from IPython.display import Image

data = pd.read_csv("gapminder.csv")
data_clean = data.dropna()

# Convert every column used below to numeric; entries that cannot be parsed become NaN.
# pd.to_numeric replaces the deprecated convert_objects(convert_numeric=True).
numeric_columns = ['incomeperperson', 'alcconsumption', 'armedforcesrate',
                   'breastcancerper100th', 'co2emissions', 'internetuserate',
                   'employrate', 'femaleemployrate', 'urbanrate', 'hivrate',
                   'lifeexpectancy', 'oilperperson', 'polityscore',
                   'relectricperperson', 'suicideper100th']
for column in numeric_columns:
    data_clean[column] = pd.to_numeric(data_clean[column], errors='coerce')

# Flag the countries whose income per person is above the average.
averageincome = data_clean['incomeperperson'].sum() / len(data_clean['incomeperperson'])
data_clean['aboveavg'] = data_clean['incomeperperson'] > averageincome

data_clean = data_clean.fillna(0)

predictors = data_clean[['internetuserate'
                         ,'alcconsumption','armedforcesrate'
                         ,'breastcancerper100th'
                         ,'co2emissions'
                         ,'femaleemployrate','hivrate','lifeexpectancy','oilperperson',
                         'polityscore','relectricperperson','suicideper100th','employrate','urbanrate'
                        ]]

targets = data_clean['aboveavg']

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

# Rows of the confusion matrix are the actual classes, columns the predicted ones.
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

out = BytesIO()
# Reuse the predictor column names so the list is defined in one place.
tree.export_graphviz(classifier, out_file=out, feature_names=list(predictors.columns))
graph=pydotplus.graph_from_dot_data(out.getvalue())
graph.write_pdf('tree.pdf')

If you are not familiar with Python, or with any part of the code above, please refer to Coursera's "Machine Learning for Data Analysis" course.
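One detail worth spelling out: the train_test_split call keeps 60% of the countries for training and holds out 40% for testing, chosen at random, so the numbers can change from run to run. The sketch below only illustrates that idea in plain Python and is not scikit-learn's implementation; split_rows and its seed argument are names made up for this example. In scikit-learn itself, passing random_state to train_test_split makes the split reproducible in a similar way.

```python
import random


def split_rows(rows, test_size=0.4, seed=0):
    """Shuffle a copy of the rows, then slice off the test share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # same seed, same shuffle
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))                      # stand-in for 10 countries
train, test = split_rows(rows, test_size=0.4, seed=0)
print("train: %d, test: %d" % (len(train), len(test)))
```

Calling split_rows again with the same seed returns the same two lists, which is handy when comparing runs.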

The results


The model classified 86% of the test observations correctly (accuracy score of 0.86), producing the following confusion matrix:
[[62  8]
 [ 4 12]]
This means that 74 of the 86 test observations were predicted correctly: 62 as 0 (income per person below the world average) and 12 as 1 (above the world average). Another 8 were predicted as 1 but are actually 0 (false positives), and 4 were predicted as 0 but are actually 1 (false negatives).
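To make the arithmetic explicit, the sketch below recomputes the accuracy from the four cells of the matrix and adds a few related rates. It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, in the order 0, 1); the variable names are mine.

```python
# Confusion matrix reported above: rows are actual classes, columns are
# predicted classes, in the order [0, 1] (scikit-learn's layout).
matrix = [[62, 8],
          [4, 12]]

tn, fp = matrix[0]
fn, tp = matrix[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / float(total)      # share of all predictions that are correct
sensitivity = tp / float(tp + fn)        # above-average countries correctly found
specificity = tn / float(tn + fp)        # below-average countries correctly found
precision = tp / float(tp + fp)          # predicted "above" that really are above

print("total=%d accuracy=%.3f" % (total, accuracy))
print("sensitivity=%.3f specificity=%.3f precision=%.3f"
      % (sensitivity, specificity, precision))
```

For context, a model that always predicted 0 would get 70 of these 86 observations right (about 0.81), so the accuracy is best read alongside the sensitivity of 12/16 = 0.75.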

The tree rendered with Graphviz is the following:




The tree shows that the most significant variable selected was internetuserate, followed by relectricperperson on the left branch and alcconsumption on the right.
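Why does one variable end up at the top? With its default settings, DecisionTreeClassifier scores candidate splits by Gini impurity and prefers the one that reduces it the most. The sketch below is a simplified illustration of that score, not scikit-learn's implementation; the function names and the six data points are made up.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 - p0^2 - p1^2."""
    if not labels:
        return 0.0
    p1 = sum(labels) / float(len(labels))
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def impurity_decrease(values, labels, threshold):
    """Parent impurity minus the size-weighted impurity of the two children."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = float(len(labels))
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Made-up numbers for six countries, for illustration only.
internet_use = [5, 12, 20, 60, 75, 90]
above_avg = [0, 0, 0, 1, 1, 1]

print(impurity_decrease(internet_use, above_avg, threshold=40))  # 0.5, a perfect split
print(impurity_decrease(internet_use, above_avg, threshold=10))  # a much smaller decrease
```

In this toy data the threshold of 40 separates the two classes completely, so it would be chosen over the threshold of 10.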
