2016-10-06

Running a Random Forest

It's time for Random Forest!

This post is the second assignment of Coursera's MOOC Machine Learning for Data Analysis. I'm going to show how to use Random Forest, a very powerful algorithm, to predict a binary outcome.

The data


Again, I am using GapMinder's dataset. I selected the features below to compose my training data:
  • armedforcesrate
  • co2emissions
  • internetuserate
  • employrate
  • femaleemployrate
  • urbanrate
  • oilperperson
  • relectricperperson

And my target variable is 'aboveavg', which indicates whether a country has an income per person above the global average.

The number of estimators


In order to find the most accurate random forest classifier, I tried 25 different configurations and chose the best one.
This was done with a loop, setting the number of estimators (from 1 to 25) for each run.
The best result happens early, using only 10 estimators. The accuracy plot generated by the code gives a sense of it.
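
The selection step itself can be sketched in a few lines. This is only an illustration: the accuracy values are made up, and best_n_estimators is a hypothetical helper, not part of the assignment code. It assumes the scores are stored in order, starting with a forest of one tree.

```python
# Hypothetical accuracy scores, one per forest size (1 tree, 2 trees, ...).
# These numbers are invented for illustration only.
accuracy = [0.78, 0.81, 0.80, 0.84, 0.84, 0.83]


def best_n_estimators(scores):
    """Return the smallest number of trees that reaches the top score."""
    # max() keeps the first maximum it finds, so ties go to the smaller forest
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index + 1  # index 0 corresponds to a forest of 1 tree


best = best_n_estimators(accuracy)
print('best number of estimators: {}'.format(best))
```

Preferring the smallest forest among ties keeps the model cheaper to train without giving up accuracy on the test split.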


The feature importances


For the final step, I want to see the importance of each feature. So I produced this table after running the random forest with my best estimator number.

feature             importance
internetuserate     0.27930588
armedforcesrate     0.03128309
co2emissions        0.06441896
femaleemployrate    0.04944427
oilperperson        0.17901002
relectricperperson  0.11783898
employrate          0.07382482
urbanrate           0.20487397

Finally, as you can see, the most important features in my random forest were 'urbanrate' and 'internetuserate'.
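
To make the ranking easier to read, the importances can be paired with the feature names and sorted. A small sketch using the values from the table above; the names features, importances and ranked are mine, and the order of the list matches the order of the predictor columns in the code.

```python
features = ['internetuserate', 'armedforcesrate', 'co2emissions',
            'femaleemployrate', 'oilperperson', 'relectricperperson',
            'employrate', 'urbanrate']
importances = [0.27930588, 0.03128309, 0.06441896, 0.04944427,
               0.17901002, 0.11783898, 0.07382482, 0.20487397]

# pair each feature with its importance and sort from most to least important
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print('{:<20} {:.4f}'.format(name, score))

# the importances of a random forest are normalized, so they add up to about 1
total = sum(importances)
```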

The code


import pandas as pd
import numpy as np
import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pylab as plt

data = pd.read_csv("gapminder.csv")
data_clean = data.dropna().copy()

features = ['internetuserate'
            ,'armedforcesrate'
            ,'co2emissions'
            ,'femaleemployrate'
            ,'oilperperson'
            ,'relectricperperson'
            ,'employrate'
            ,'urbanrate']

#convert the predictors and the income to numeric (unparseable values become NaN)
for column in features + ['incomeperperson']:
    data_clean[column] = pd.to_numeric(data_clean[column], errors='coerce')

#target: True when the income per person is above the average
averageincome = data_clean['incomeperperson'].sum() / len(data_clean['incomeperperson'])
data_clean['aboveavg'] = data_clean['incomeperperson'] > averageincome

data_clean = data_clean.fillna(0)

predictors = data_clean[features]
targets = data_clean['aboveavg']

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

#find the best number of estimators, from 1 to 25
trees = list(range(1, 26))
accuracy = np.zeros(len(trees))

for idx, n_estimators in enumerate(trees):
    classifier = RandomForestClassifier(n_estimators=n_estimators, random_state=123)
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    acc = sklearn.metrics.accuracy_score(tar_test, predictions)
    accuracy[idx] = acc
    print('acc: {} estimators: {}'.format(acc, n_estimators))

#graphic to make things cool
plt.cla()
plt.plot(trees, accuracy)
plt.xlabel('# estimators')
plt.ylabel('accuracy')
plt.show()

#using best estimator to see feature importances
classifier = RandomForestClassifier(n_estimators=10, random_state=123)
classifier = classifier.fit(pred_train, tar_train)
predictions = classifier.predict(pred_test)
print('accuracy for 10 estimators: {}'.format(sklearn.metrics.accuracy_score(tar_test, predictions)))
#importances come in the same order as the columns in features
print(classifier.feature_importances_)


2016-10-01

Data Management Decisions Making

Here we go again!
This post is the third assignment of Coursera's MOOC Data Management and Visualization. Today we are going to explore data management decisions, and you will see how to analyse information with Python.

The data


My dataset is the GapMinder csv file, imported and converted to numeric variables:
import pandas

data = pandas.read_csv('gapminder.csv')
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')

The Python code above uses the three variables incomeperperson, internetuserate, and employrate. If you want to know more about them, check the Code Book from GapMinder's dataset.

The definition of classes


Income Per Person


I decided to classify the countries by the amount of dollars per person, which shows how many countries are rich, average or poor.
The code to make this classification is the following:

def getIppClass(income):
    if(income > 70000):
        return 1
    elif(income > 50000):
        return 2
    elif(income > 30000):
        return 3
    elif(income > 10000):
        return 4
    else:
        return 5

Internet use rate


While classifying the countries by the percentage of population with Internet access, I decided to create four groups.
The code to make this classification is the following:

def getInternetClass(internet):
    if(internet > 70):
        return 4
    if(internet > 50):
        return 3
    if(internet > 30):
        return 2
    else:
        return 1

Employ rate


And finally, to analyze the employ rate, I divided the countries into four classes using the following code:

def getEmployClass(employrate):
    if(employrate > 70):
        return 4
    if(employrate > 50):
        return 3
    if(employrate > 30):
        return 2
    else:
        return 1
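
Since the Internet and employ classes share the same cut points, both functions could be replaced by one helper. This is just a sketch of an alternative, not the code used for the results in this post: classify and RATE_THRESHOLDS are hypothetical names, and bisect_left comes from Python's standard library.

```python
from bisect import bisect_left

# cut points shared by the Internet use rate and employ rate classes
RATE_THRESHOLDS = [30, 50, 70]


def classify(value, thresholds):
    """Return 1 for values up to the first threshold, 2 for the next band, and so on."""
    # bisect_left counts how many thresholds are strictly below the value
    return bisect_left(thresholds, value) + 1


for rate in [10, 30, 31, 50, 70, 71]:
    print('{} -> class {}'.format(rate, classify(rate, RATE_THRESHOLDS)))
```

A value exactly on a cut point (30, 50 or 70) stays in the lower class, which matches the strict "greater than" comparisons in the functions above.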

The results and full code

My program generates the following output:

counts for income per person classes
5    166
4     31
3     12
2      2
1      2

percentage for income per person classes
5    0.779343
4    0.145540
3    0.056338
2    0.009390
1    0.009390

counts for Internet use rate classes
1    114
2     43
3     23
4     33

percentage for Internet use rate classes
1    0.535211
2    0.201878
3    0.107981
4    0.154930

counts for employ rate classes
1     35
2     37
3    114
4     27

percentage for employ rate classes
1    0.164319
2    0.173709
3    0.535211
4    0.126761
The first and second tables show that in almost 78% of countries (166) the income per person is below US$ 10,000 per year, and in only two of them (0.939%) it is above US$ 70,000 per year. There are many more poor countries than rich ones.
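
The percentages can be checked by hand from the counts. A minimal sketch using the income per person counts printed above; the variable names are mine.

```python
# counts for income per person classes, copied from the first table
counts = {5: 166, 4: 31, 3: 12, 2: 2, 1: 2}

total = sum(counts.values())

# share of each class, the same thing value_counts(normalize=True) reports
shares = {cls: n / float(total) for cls, n in counts.items()}

for cls in sorted(shares, reverse=True):
    print('{}    {:.6f}'.format(cls, shares[cls]))
```

Multiplying a share by 100 gives the percentage, so 0.009390 is about 0.94% of the countries.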

The third and fourth tables show that in 53% of countries (114) the Internet use rate is at or below 30%, and in 15% of them it is above 70%.

Finally, the last two tables show the employ rate. In 53% of the countries (114) the employ rate is above 50% and at most 70%. On the other hand, there are only 27 countries where more than 70% of the population is employed.

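The frequency counts use the dropna option, but it has no effect here: a missing value (NaN) fails every comparison in the classification functions and falls into the else branch. Countries with missing data are therefore counted in the lowest class (5 for income, 1 for the other two variables), which is why every table adds up to the same 213 countries.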
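
How the classification functions treat a missing value is easy to check. The sketch below repeats getEmployClass and adds getEmployClassKeepMissing, a hypothetical variant that is not part of the assignment code, which passes NaN through so that dropna can really drop it.

```python
import math


def getEmployClass(employrate):
    # same thresholds as the function in this post
    if employrate > 70:
        return 4
    if employrate > 50:
        return 3
    if employrate > 30:
        return 2
    return 1


def getEmployClassKeepMissing(employrate):
    # hypothetical variant: a missing value stays missing
    if math.isnan(employrate):
        return float('nan')
    return getEmployClass(employrate)


missing = float('nan')
print(getEmployClass(missing))             # NaN fails every comparison: class 1
print(getEmployClassKeepMissing(missing))  # stays NaN
```

With the second function, the counts for class 1 would only include countries that really have an employ rate of 30% or less.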

Here is the full python code:

import pandas

data = pandas.read_csv('gapminder.csv')
# convert the three variables to numeric (unparseable values become NaN)
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')

# note: a missing value (NaN) fails every comparison and falls into the else branch
def getIppClass(income):
    if income > 70000:
        return 1
    elif income > 50000:
        return 2
    elif income > 30000:
        return 3
    elif income > 10000:
        return 4
    else:
        return 5

result = data['incomeperperson'].apply(getIppClass)
print('counts for income per person classes')
print(result.value_counts(dropna=True))
print('percentage for income per person classes')
print(result.value_counts(normalize=True, dropna=True))

def getInternetClass(internet):
    if internet > 70:
        return 4
    if internet > 50:
        return 3
    if internet > 30:
        return 2
    else:
        return 1

result = data['internetuserate'].apply(getInternetClass)
print('counts for Internet use rate classes')
print(result.value_counts(sort=False, dropna=True))
print('percentage for Internet use rate classes')
print(result.value_counts(sort=False, normalize=True, dropna=True))

def getEmployClass(employrate):
    if employrate > 70:
        return 4
    if employrate > 50:
        return 3
    if employrate > 30:
        return 2
    else:
        return 1

result = data['employrate'].apply(getEmployClass)
print('counts for employ rate classes')
print(result.value_counts(sort=False, dropna=True))
print('percentage for employ rate classes')
print(result.value_counts(sort=False, normalize=True, dropna=True))

2016-09-29

Statistical inference with classification tree

Today's post is an assignment for the Coursera MOOC "Machine Learning for Data Analysis".

Introduction


This post will show how to create a classification tree with Python, along with the code and its results.
Using the GapMinder dataset, we are going to predict whether or not a country has an income per person above the world average.

Classification tree


A classification tree is a tree-shaped model used to predict a categorical variable (a binary one in this post), based on preprocessed data. It allows us to answer simple questions like "will it rain today?" by analysing other variables such as humidity, wind, solar radiation, temperature, and many others.
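
Once trained, a classification tree is nothing more than a chain of yes/no questions. The toy function below writes one by hand for the rain example; the variables and thresholds are made up for illustration, while a real tree learns its questions from the training data.

```python
def will_it_rain(humidity, wind_speed):
    """A hand-written tree with invented thresholds, for illustration only."""
    if humidity > 80:           # first question: is the air very humid?
        if wind_speed > 20:     # second question: is the wind strong?
            return False
        return True
    return False                # dry air: predict no rain


print(will_it_rain(humidity=90, wind_speed=5))   # humid and calm
print(will_it_rain(humidity=40, wind_speed=5))   # dry and calm
```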

The choice of tools


I have chosen Python and its libraries to create my classification tree, because of its flexibility and similarity to other programming languages.
I am using Visual Studio 2015, with Python 2.7. The libraries required to run this code are:

  • pandas
  • numpy
  • matplotlib
  • sklearn
  • pydotplus
  • IPython

All of them are free, and can be easily found with a Google search.

The data


I have chosen, as in the previous posts, the GapMinder dataset to create my tree. You can get to know it better at their website www.gapminder.org

The code


import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import sklearn.metrics
import pydotplus
from IPython.display import Image

data = pd.read_csv("gapminder.csv")
data_clean = data.dropna().copy()

features = ['internetuserate'
            ,'alcconsumption','armedforcesrate'
            ,'breastcancerper100th'
            ,'co2emissions'
            ,'femaleemployrate','hivrate','lifeexpectancy','oilperperson'
            ,'polityscore','relectricperperson','suicideper100th','employrate','urbanrate'
           ]

# convert the predictors and the income to numeric (unparseable values become NaN)
for column in features + ['incomeperperson']:
    data_clean[column] = pd.to_numeric(data_clean[column], errors='coerce')

# target: True when the income per person is above the average
averageincome = data_clean['incomeperperson'].sum() / len(data_clean['incomeperperson'])
data_clean['aboveavg'] = data_clean['incomeperperson'] > averageincome

data_clean = data_clean.fillna(0)

predictors = data_clean[features]
targets = data_clean['aboveavg']

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

classifier = DecisionTreeClassifier()
classifier = classifier.fit(pred_train, tar_train)

predictions = classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

# export the fitted tree in Graphviz dot format and render it to a PDF file
dot_data = tree.export_graphviz(classifier, out_file=None, feature_names=features)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf('tree.pdf')

In case you are not familiar with Python, or with any part of the code above, please refer to Coursera's Machine Learning for Data Analysis course.

The results


This model predicted 86% of the test observations correctly (an accuracy of 0.86), producing the following confusion matrix:
[[62  8]
 [ 4 12]]
This means that 74 observations were predicted correctly: 62 as 0 (income per person below the world average) and 12 as 1 (above the world average). Another 8 were predicted as 1 but are actually 0 (false positives), and 4 were predicted as 0 but are actually 1 (false negatives).
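
The accuracy and a couple of related measures can be recomputed directly from the confusion matrix. A small sketch using the numbers above, assuming scikit-learn's layout (rows are the true classes, columns are the predicted ones); precision and recall were not part of the original assignment output.

```python
# confusion matrix from this post: rows are true classes, columns are predictions
confusion = [[62, 8],
             [4, 12]]

(tn, fp), (fn, tp) = confusion
total = tn + fp + fn + tp

accuracy = (tn + tp) / float(total)   # share of all observations predicted correctly
precision = tp / float(tp + fp)       # of the countries predicted above average, how many are
recall = tp / float(tp + fn)          # of the countries really above average, how many were found

print('accuracy:  {:.4f}'.format(accuracy))
print('precision: {:.4f}'.format(precision))
print('recall:    {:.4f}'.format(recall))
```

The precision and recall for the "above average" class are lower than the overall accuracy, because most of the correct predictions come from the much larger "below average" class.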

The tree display using Graphviz is the following:




The tree shows that the most significant variable selected was internetuserate, followed by relectricperperson on the left branch and alcconsumption on the right.

2016-09-28

Analyzing global income per person, internet use, employ rate and female employ rate


This is the second assignment for Coursera's MOOC 'Data Management and Visualization', provided by Wesleyan University.


Introduction

The objective of this post is to analyse the frequency of some variables, from GapMinder's data set.

The data

The variables chosen for this analysis are:
  • Income per person
  • Internet use rate
  • Employ rate
  • Female employ rate

The python code

import pandas

data = pandas.read_csv('gapminder.csv')
# convert the variables to numeric (unparseable values become NaN)
for column in ['incomeperperson', 'internetuserate', 'employrate', 'femaleemployrate', 'urbanrate']:
    data[column] = pandas.to_numeric(data[column], errors='coerce')

data['incomeGROUP'] = pandas.cut(data.incomeperperson, [0, 100, 1000, 10000, 20000, 40000, 60000, 100000])
print('Income per person - 7 categories')
print(data['incomeGROUP'].value_counts(sort=False, dropna=True))

quartiles = ["1=0%tile", "2=25%tile", "3=50%tile", "4=75%tile"]
data['internetuseGROUP'] = pandas.qcut(data.internetuserate, 4, labels=quartiles)
print('Internet use rate - 4 quartiles')
print(data['internetuseGROUP'].value_counts(sort=False, dropna=True))
print(pandas.crosstab(data.incomeGROUP, data.internetuseGROUP))

# kept as separate Series so the crosstab is labelled with the original variable names
employGROUP = pandas.qcut(data.employrate, 4, labels=quartiles)
fememployGROUP = pandas.qcut(data.femaleemployrate, 4, labels=quartiles)
print(pandas.crosstab(employGROUP, fememployGROUP))

This code uses the pandas library to load, manage, analyse and display the data.
The functions used are:

  • read_csv
  • to_numeric
  • cut
  • qcut
  • value_counts
  • crosstab

The results

Income per person - 7 categories
(0, 100]            0
(100, 1000]        54
(1000, 10000]      89
(10000, 20000]     17
(20000, 40000]     26
(40000, 60000]      1
(60000, 100000]     2

Internet use rate - 4 quartiles
1=0%tile     48
2=25%tile    48
3=50%tile    48
4=75%tile    48
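
The four equal counts above are what quantile-based binning (qcut) produces, in contrast to fixed edges (cut), where the group sizes depend on how the data is spread. The sketch below shows the idea with Python's standard library on made-up numbers; it is a conceptual illustration, not how pandas implements these functions, and bin_counts is a hypothetical helper.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

# toy, heavily skewed data (invented for illustration)
values = [1, 2, 3, 4, 10, 20, 50, 100]


def bin_counts(data, edges):
    """Count how many values fall in each right-closed bin defined by the edges."""
    counts = Counter(bisect_left(edges, v) for v in data)
    return [counts[i] for i in range(len(edges) + 1)]


# fixed edges, in the spirit of cut: group sizes follow the data
fixed_counts = bin_counts(values, [25, 50, 75])

# edges taken from the data's own quartiles, in the spirit of qcut
quartile_edges = quantiles(values, n=4, method='inclusive')
quartile_counts = bin_counts(values, quartile_edges)

print('fixed edges:    {}'.format(fixed_counts))
print('quartile edges: {}'.format(quartile_counts))
```

With fixed edges most of the toy values pile up in the first group, while the quartile edges split them evenly.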

internetuseGROUP  1=0%tile  2=25%tile  3=50%tile  4=75%tile
incomeGROUP                                              
(0, 100]                 0          0          0          0
(100, 1000]             36         15          1          0
(1000, 10000]           11         32         33         10
(10000, 20000]           0          0         10          7
(20000, 40000]           0          0          0         25
(40000, 60000]           0          0          0          1
(60000, 100000]          0          0          0          2

femaleemployrate  1=0%tile  2=25%tile  3=50%tile  4=75%tile
employrate                                              
1=0%tile                34         11          0          0
2=25%tile                6         22         16          0
3=50%tile                4          9         21         10
4=75%tile                1          2          7         35

Conclusions

As shown in the tables, there are only two countries with income per person larger than 60 thousand dollars, while more than 67% of the countries are below 10 thousand dollars per year.
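
The share of countries below 10 thousand dollars can be recomputed from the first table. In this sketch the counts come from the results above, while all_rows = 213 is my assumption for the total number of rows in the file, including countries with no income data.

```python
# counts per income group, copied from the first table
income_counts = [0, 54, 89, 17, 26, 1, 2]

below_10k = sum(income_counts[:3])   # the three groups up to 10,000 dollars
with_data = sum(income_counts)       # countries that have an income value
all_rows = 213                       # assumption: every row in the file, missing values included

share_of_all = below_10k / float(all_rows)
share_with_data = below_10k / float(with_data)

print('share of all rows:            {:.4f}'.format(share_of_all))
print('share of countries with data: {:.4f}'.format(share_with_data))
```

Under that assumption the 67% figure is the share over all rows; among the countries that report an income, the share is closer to 76%.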

The internet use rate, displayed in the second table, is crossed with income per person in the third. It is interesting to notice that these two variables are positively correlated, because the internet use rate increases with the income per person.

The last table suggests that the global workforce is composed mainly of men. The employ rate crossed with the female employ rate shows that 89 countries have their female workforce below the 50th percentile. Considering that only 178 countries provide this information, in half of our sample the workforce is mainly male.
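
The totals quoted for the last table can be checked by adding up its columns. A short sketch with the crosstab values copied from the results; the variable names are mine.

```python
# employ rate quartiles (rows) crossed with female employ rate quartiles (columns)
crosstab = [[34, 11, 0, 0],
            [6, 22, 16, 0],
            [4, 9, 21, 10],
            [1, 2, 7, 35]]

# countries in each female employ rate quartile
column_totals = [sum(column) for column in zip(*crosstab)]
total = sum(column_totals)

# countries in the two lowest female employ rate quartiles
below_median = column_totals[0] + column_totals[1]

# countries that sit in the same quartile for both variables
same_quartile = sum(crosstab[i][i] for i in range(4))

print(column_totals, total, below_median, same_quartile)
```

Each quartile group holds roughly a quarter of the 178 countries, and most countries fall in the same quartile for both rates, which shows that the two variables move together.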

2016-09-27

The internet access rate and the exportation of high technology

This is the first assignment for Coursera's MOOC 'Data Management and Visualization', provided by Wesleyan University.

Choosing the data

After reading the available codebooks, I decided that the most relevant one for me is the GapMinder dataset.
GapMinder is a world view, with information about income per person, employ rate, internet use rate, urban rate, and more, that will be used to perform an analysis and try to answer the question below.

The question

Is the internet access rate correlated with the high technology export rate from countries around the globe?

As a Brazilian, I am particularly interested in determining whether or not the internet use rate is related to the exportation of high technology.
In my current opinion, my country needs to export more industrialized goods, rather than mainly exporting commodities.

While trying to understand how the internet can improve my country's exports, I will also have to work with these questions:

  • What is the impact of high technology exports on countries' economies?
  • What amount of high technology is being exported exclusively through the internet?

The codebook

The following columns were taken into account to perform my analysis:

incomeperperson: 2010 Gross Domestic Product per capita in constant 2000 US$ - The inflation but not the differences in the cost of living between countries has been taken into account. [World Bank Work Development Indicators]

employrate: 2007 total employees age 15+ (% of population) - Percentage of total population, age above 15, that has been employed during the given year. [International Labour Organization]

internetuserate: 2010 Internet users (per 100 people) - Internet users are people with access to the worldwide network. [World Bank]
"Internet users are individuals who have used the Internet (from any location) in the last 12 months. Internet can be used via a computer, mobile phone, personal digital assistant, games machine, digital TV etc."

relectricperperson: 2008 residential electricity consumption, per person (kWh) - The amount of residential electricity consumption per person during the given year, counted in kilowatt-hours (kWh). [International Energy Agency]

urbanrate: 2008 urban population (% of total) - Urban population refers to people living in urban areas as defined by national statistical offices (calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects) [World Bank]

htexp2010: High-technology exports (% of manufactured exports) - High-technology exports are products with high R&D intensity, such as in aerospace, computers, pharmaceuticals, scientific instruments, and electrical machinery. [World Bank]

htc2010: High-technology exports (current US$) - High-technology exports are products with high R&D intensity, such as in aerospace, computers, pharmaceuticals, scientific instruments, and electrical machinery. Data are in current U.S. dollars. [World Bank]

The second question

Is the high technology exportation amount related to the Income Per Person?


The literature found

In order to better understand these two questions, I searched Google Scholar using the following search terms: internet use and exportation of technology

And the results led me to the following readings:
  • Borich, Robert A. "Globalization of the US Defense Industrial Base: Developing Procurement Sources Abroad Through Exporting Advanced Military Technology." Public Contract Law Journal (2002): 623-677.
  • Javalgi, Rajshekhar G., Charles L. Martin, and Patricia R. Todd. "The export of e-services in the age of technology transformation: challenges and implications for international service providers." Journal of Services Marketing 18.7 (2004): 560-573.
  • Hsiao, Chun Hua, and Chyan Yang. "The intellectual development of the technology acceptance model: A co-citation analysis." International Journal of Information Management 31.2 (2011): 128-136.