2016-09-29

Statistical inference with a classification tree

Today's post is an assignment for the Coursera MOOC "Machine Learning for Data Analysis".

Introduction


This post shows how to create a classification tree with Python, along with the code and its results.
Using the GapMinder dataset, we are going to predict whether or not a country has an income per person above the world average.

Classification tree


A classification tree is a data structure used to predict a categorical variable (in this post, a binary one) from a set of explanatory variables. It lets us answer simple questions like "will it rain today?" by analysing other variables such as humidity, wind, solar radiation, temperature, and many others.
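
To make the idea concrete, here is a minimal sketch of a classification tree written by hand as nested dictionaries. The variable names, the thresholds and the rain example itself are invented for illustration; a real tree learns its splits from the data instead of having them typed in.

```python
# A hand-built classification tree for the "will it rain today?" question.
# Thresholds and variable names are hypothetical, chosen only for illustration.
rain_tree = {
    "feature": "humidity", "threshold": 70.0,
    "left": {"label": False},              # humidity <= 70: predict no rain
    "right": {                             # humidity > 70: ask about the wind
        "feature": "wind", "threshold": 20.0,
        "left": {"label": True},           # humid and calm: predict rain
        "right": {"label": False},         # humid but windy: predict no rain
    },
}


def predict(node, observation):
    """Walk from the root to a leaf and return the label stored in that leaf."""
    while "label" not in node:
        goes_left = observation[node["feature"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["label"]


humid_calm_day = {"humidity": 85.0, "wind": 5.0}
dry_day = {"humidity": 40.0, "wind": 5.0}
print(predict(rain_tree, humid_calm_day))
print(predict(rain_tree, dry_day))
```

Each inner node asks one yes/no question about a single variable, and each leaf holds the predicted class. The tree built later in this post has the same shape, only with splits chosen by the algorithm.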

The choice of tools


I chose Python and its libraries to create my classification tree because of its flexibility and its similarity to other programming languages.
I am using Visual Studio 2015 with Python 2.7. The libraries required to run this code are:

  • pandas
  • numpy
  • matplotlib
  • sklearn
  • pydotplus
  • IPython

All of them are free and can easily be found with Google.

The data


As in the previous posts, I chose the GapMinder dataset to create my tree. You can get to know it better at their website, www.gapminder.org

The code


from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import classification_report
import sklearn.metrics
import pydotplus
import io
from io import BytesIO
from IPython.display import Image

data = pd.read_csv("gapminder.csv")
data_clean = data.dropna()

data_clean['incomeperperson'] = data_clean['incomeperperson'].convert_objects(convert_numeric=True)
averageincome = data_clean['incomeperperson'].sum() / len(data_clean['incomeperperson'])
data_clean['aboveavg'] = data_clean['incomeperperson'] > averageincome

data_clean['alcconsumption'] = data_clean['alcconsumption'].convert_objects(convert_numeric=True)
data_clean['armedforcesrate'] = data_clean['armedforcesrate'].convert_objects(convert_numeric=True)
data_clean['breastcancerper100th'] = data_clean['breastcancerper100th'].convert_objects(convert_numeric=True)
data_clean['co2emissions'] = data_clean['co2emissions'].convert_objects(convert_numeric=True)
data_clean['internetuserate'] = data_clean['internetuserate'].convert_objects(convert_numeric=True)
data_clean['employrate'] = data_clean['employrate'].convert_objects(convert_numeric=True)
data_clean['femaleemployrate'] = data_clean['femaleemployrate'].convert_objects(convert_numeric=True)
data_clean['urbanrate'] = data_clean['urbanrate'].convert_objects(convert_numeric=True)

data_clean['hivrate'] = data_clean['hivrate'].convert_objects(convert_numeric=True)
data_clean['lifeexpectancy'] = data_clean['lifeexpectancy'].convert_objects(convert_numeric=True)
data_clean['oilperperson'] = data_clean['oilperperson'].convert_objects(convert_numeric=True)
data_clean['polityscore'] = data_clean['polityscore'].convert_objects(convert_numeric=True)
data_clean['relectricperperson'] = data_clean['relectricperperson'].convert_objects(convert_numeric=True)
data_clean['suicideper100th'] = data_clean['suicideper100th'].convert_objects(convert_numeric=True)

data_clean = data_clean.fillna(0)

predictors = data_clean[['internetuserate'
                         ,'alcconsumption','armedforcesrate'
                         ,'breastcancerper100th'
                         ,'co2emissions'
                         ,'femaleemployrate','hivrate','lifeexpectancy','oilperperson',
                         'polityscore','relectricperperson','suicideper100th','employrate','urbanrate'
                        ]]

targets = data_clean['aboveavg']

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

print sklearn.metrics.confusion_matrix(tar_test, predictions)
print sklearn.metrics.accuracy_score(tar_test, predictions)

out = BytesIO()
tree.export_graphviz(classifier, out_file=out, feature_names=['internetuserate'
                         ,'alcconsumption','armedforcesrate'
                         ,'breastcancerper100th'
                         ,'co2emissions'
                         ,'femaleemployrate','hivrate','lifeexpectancy','oilperperson',
                         'polityscore','relectricperperson','suicideper100th','employrate','urbanrate'
                        ])
graph=pydotplus.graph_from_dot_data(out.getvalue())
graph.write_pdf('tree.pdf')

If you are not familiar with Python, or with any part of the code above, please refer to Coursera's "Machine Learning for Data Analysis" course.
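
If you run this on a recent pandas, the convert_objects calls will fail, because that method was deprecated and later removed. The sketch below shows the same conversion step with pandas.to_numeric. The small data frame is a made-up stand-in for gapminder.csv, and it assumes that missing values are stored as blank strings.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of gapminder.csv, assuming that
# missing values are stored as blank strings rather than as empty cells.
data = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "incomeperperson": ["1000", " ", "3000", "8000"],
    "internetuserate": ["10.5", "40.0", " ", "90.1"],
})

numeric_columns = ["incomeperperson", "internetuserate"]
for column in numeric_columns:
    # errors="coerce" turns every value that is not a number into NaN.
    data[column] = pd.to_numeric(data[column], errors="coerce")

# mean() skips NaN values, so the average uses only the known incomes.
average_income = data["incomeperperson"].mean()
# A comparison with NaN is False, so country B is not flagged as above average.
data["aboveavg"] = data["incomeperperson"] > average_income
print(data)
```

Note that the code in this post divides the sum by the number of rows, so rows with a missing income pull its average down, while mean() ignores them. In current scikit-learn, train_test_split is imported from sklearn.model_selection instead of sklearn.cross_validation.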

The results


This model predicted about 86% of the test observations correctly (an accuracy score of 0.86), producing the following confusion matrix:
[[62  8]
 [ 4 12]]
This means that 74 observations were predicted correctly: 62 as 0 (income per person below the world average) and 12 as 1 (above the world average). Another 8 were predicted as 1 but are actually 0 (false positives), and 4 were predicted as 0 but are actually 1 (false negatives).
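
The arithmetic behind these numbers can be checked with a few lines of plain Python. The matrix is the one printed above; precision and recall are two extra measures that were not part of the original output and are added here only as an illustration.

```python
# Confusion matrix reported above: rows are the true class, columns the prediction.
confusion = [[62, 8],
             [4, 12]]

true_negative, false_positive = confusion[0]
false_negative, true_positive = confusion[1]
total = true_negative + false_positive + false_negative + true_positive

# Share of all test observations that were classified correctly.
accuracy = (true_negative + true_positive) / float(total)
# Of the countries predicted as above average, how many really are?
precision = true_positive / float(true_positive + false_positive)
# Of the countries that really are above average, how many were found?
recall = true_positive / float(true_positive + false_negative)

print("accuracy  = {:.3f}".format(accuracy))
print("precision = {:.3f}".format(precision))
print("recall    = {:.3f}".format(recall))
```

Because only 16 of the 86 test countries are above average, accuracy alone can look better than the model really is for that smaller class, which is why the other two measures are worth a look.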

The tree display using Graphviz is the following:




The tree shows that the most significant variable selected was internetuserate, followed by relectricperperson on the left branch and alcconsumption on the right.

2016-09-28

Analyzing global income per person, internet use, employ rate and female employ rate


This is the second assignment for the Coursera MOOC 'Data Management and Visualization', provided by Wesleyan University.


Introduction

The objective of this post is to analyse the frequency distributions of some variables from the GapMinder data set.

The data

The variables chosen for this analysis are:
  • Income per person
  • Internet use rate
  • Employ rate
  • Female employ rate

The python code

import pandas
import numpy
data = pandas.read_csv('gapminder.csv')
data['incomeperperson'] = data['incomeperperson'].convert_objects(convert_numeric=True)
data['internetuserate'] = data['internetuserate'].convert_objects(convert_numeric=True)
data['employrate'] = data['employrate'].convert_objects(convert_numeric=True)
data['femaleemployrate'] = data['femaleemployrate'].convert_objects(convert_numeric=True)
data['urbanrate'] = data['urbanrate'].convert_objects(convert_numeric=True)
data['incomeGROUP'] = pandas.cut(data.incomeperperson, [0, 100, 1000, 10000, 20000, 40000, 60000, 100000])
print 'Income per person - 7 categories'
print data['incomeGROUP'].value_counts(sort=False, dropna=True)
data['internetuseGROUP'] = pandas.qcut(data.internetuserate, 4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"])
print 'Internet use rate - 4 quartiles'
print data['internetuseGROUP'].value_counts(sort=False, dropna=True)
print pandas.crosstab(data.incomeGROUP, data.internetuseGROUP)
data.employGROUP = pandas.qcut(data.employrate, 4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"])
data.fememployGROUP = pandas.qcut(data.femaleemployrate, 4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"])
print pandas.crosstab(data.employGROUP, data.fememployGROUP)

This code uses the pandas library to load, manage, analyse and display the data.
The functions used are:

  • read_csv
  • convert_objects
  • cut
  • qcut
  • value_counts
  • crosstab

The results

Income per person - 7 categories
(0, 100]            0
(100, 1000]        54
(1000, 10000]      89
(10000, 20000]     17
(20000, 40000]     26
(40000, 60000]      1
(60000, 100000]     2

Internet use rate - 4 quartiles
1=0%tile     48
2=25%tile    48
3=50%tile    48
4=75%tile    48

internetuseGROUP  1=0%tile  2=25%tile  3=50%tile  4=75%tile
incomeGROUP                                              
(0, 100]                 0          0          0          0
(100, 1000]             36         15          1          0
(1000, 10000]           11         32         33         10
(10000, 20000]           0          0         10          7
(20000, 40000]           0          0          0         25
(40000, 60000]           0          0          0          1
(60000, 100000]          0          0          0          2

femaleemployrate  1=0%tile  2=25%tile  3=50%tile  4=75%tile
employrate                                              
1=0%tile                34         11          0          0
2=25%tile                6         22         16          0
3=50%tile                4          9         21         10
4=75%tile                1          2          7         35

Conclusions

As shown in the tables, there are only two countries with an income per person larger than 60 thousand dollars, while more than 67% of them (143 countries) are at or below 10 thousand dollars per year.

The internet use rate, displayed in the second table, is then crossed with income per person. It is interesting to notice that these two variables are positively associated: the internet use rate increases with income per person.

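The last table crosses the employment rate with the female employment rate. It shows that 89 countries have a female employment rate below the 50th percentile. Since only 178 countries provide this information, that is half of the sample, which is what a split into quartiles produces by construction. The table also shows that the two rates are strongly associated, since most countries fall on the diagonal. I read this as a hint that the global workforce is composed mainly of men, although the table alone does not compare male and female rates directly.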
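
The counts quoted in these conclusions can be recomputed from the result tables. The sketch below copies the numbers by hand from the output above; it does not read gapminder.csv.

```python
# Counts copied by hand from the result tables above.
income_counts = [0, 54, 89, 17, 26, 1, 2]   # the seven income groups, lowest first
employ_by_female = [                         # rows: employrate quartile
    [34, 11, 0, 0],                          # columns: femaleemployrate quartile
    [6, 22, 16, 0],
    [4, 9, 21, 10],
    [1, 2, 7, 35],
]

countries_with_income = sum(income_counts)
at_most_10k = sum(income_counts[:3])
print("{} of {} countries at or below 10,000: {:.1%}".format(
    at_most_10k, countries_with_income,
    at_most_10k / float(countries_with_income)))

countries_with_rates = sum(sum(row) for row in employ_by_female)
# Countries in the two lowest female employment quartiles (first two columns).
lower_female_half = sum(row[0] + row[1] for row in employ_by_female)
# Countries that fall in the same quartile for both rates.
on_diagonal = sum(employ_by_female[i][i] for i in range(4))
print("{} of {} countries in the lower female quartiles".format(
    lower_female_half, countries_with_rates))
print("{} of {} countries on the diagonal".format(
    on_diagonal, countries_with_rates))
```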

2016-09-27

The internet access rate and the exportation of high technology

This is the first assignment for the Coursera MOOC 'Data Management and Visualization', provided by Wesleyan University.

Choosing the data

After reading the available codebooks, I decided that the most relevant one for me is the GapMinder dataset.
GapMinder offers a view of the world through data, with information about income per person, employment rate, internet use rate, urban rate, and more, which will be used to perform an analysis and try to answer the question below.

The question

Is the internet access rate correlated with the high technology export rate from countries around the globe?

As a Brazilian, I am particularly interested in determining whether or not the internet use rate is related to the export of high technology.
In my current opinion, my country needs to export more industrialized goods, rather than mainly exporting commodities.

While trying to understand how the internet can improve my country's exports, I will also have to work with these questions:

  • What is the impact of high technology exports on countries' economies?
  • How much of the high technology being exported is exported exclusively through the internet?

The codebook

The following columns were taken into account in my analysis:

incomeperperson: 2010 Gross Domestic Product per capita in constant 2000 US$ - Inflation, but not the differences in the cost of living between countries, has been taken into account. [World Bank World Development Indicators]

employrate: 2007 total employees age 15+ (% of population) - Percentage of total population, age above 15, that has been employed during the given year. [International Labour Organization]

internetuserate: 2010 Internet users (per 100 people) - Internet users are people with access to the worldwide network. [World Bank]
"Internet users are individuals who have used the Internet (from any location) in the last 12 months. Internet can be used via a computer, mobile phone, personal digital assistant, games machine, digital TV etc."

relectricperperson: 2008 residential electricity consumption, per person (kWh) - The amount of residential electricity consumption per person during the given year, counted in kilowatt-hours (kWh). [International Energy Agency]

urbanrate: 2008 urban population (% of total) - Urban population refers to people living in urban areas as defined by national statistical offices (calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects) [World Bank]

htexp2010: High-technology exports (% of manufactured exports) - High-technology exports are products with high R&D intensity, such as in aerospace, computers, pharmaceuticals, scientific instruments, and electrical machinery. [World Bank]

htc2010: High-technology exports (current US$) - High-technology exports are products with high R&D intensity, such as in aerospace, computers, pharmaceuticals, scientific instruments, and electrical machinery. Data are in current U.S. dollars. [World Bank]

The second question

Is the amount of high technology exported related to income per person?
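
Both questions ask whether two variables move together. One common way to look at that, which is not yet part of this assignment, is the Pearson correlation coefficient. The sketch below computes it in plain Python. The column names follow the codebook, but the values are invented and say nothing about the real data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    # Sum of the products of the deviations from each mean.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Invented values for five imaginary countries, for illustration only.
internetuserate = [5.0, 20.0, 40.0, 60.0, 80.0]
htexp2010 = [1.0, 3.0, 8.0, 7.0, 15.0]

r = pearson_r(internetuserate, htexp2010)
print("r = {:.2f}".format(r))
```

A value close to 1 means the two variables rise together, a value close to -1 means one falls as the other rises, and a value near 0 means no linear relationship. A correlation by itself does not say which variable drives the other.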


The literature found

In order to better understand these two questions, I searched Google Scholar using the following search terms: internet use and exportation of technology

And the results led me to the following readings:
  • Borich, Robert A. "Globalization of the US Defense Industrial Base: Developing Procurement Sources Abroad Through Exporting Advanced Military Technology." Public Contract Law Journal (2002): 623-677.
  • Javalgi, Rajshekhar G., Charles L. Martin, and Patricia R. Todd. "The export of e-services in the age of technology transformation: challenges and implications for international service providers." Journal of Services Marketing 18.7 (2004): 560-573.
  • Hsiao, Chun Hua, and Chyan Yang. "The intellectual development of the technology acceptance model: A co-citation analysis." International Journal of Information Management 31.2 (2011): 128-136.