This is the second assignment for the Coursera MOOC 'Data Management and Visualization', provided by Wesleyan University.
Introduction
The objective of this post is to analyse the frequency distributions of some variables from the GapMinder data set.
The data
The variables chosen for this analysis are:
- Income per person
- Internet use rate
- Employment rate
- Female employment rate
The python code
import pandas

# Load the GapMinder data set
data = pandas.read_csv('gapminder.csv')

# Convert the columns to numeric; values that cannot be parsed become NaN
for column in ['incomeperperson', 'internetuserate', 'employrate', 'femaleemployrate', 'urbanrate']:
    data[column] = pandas.to_numeric(data[column], errors='coerce')

# Income per person: 7 categories with hand-picked edges
data['incomeGROUP'] = pandas.cut(data['incomeperperson'], [0, 100, 1000, 10000, 20000, 40000, 60000, 100000])
print('Income per person - 7 categories')
print(data['incomeGROUP'].value_counts(sort=False, dropna=True))

# Internet use rate: 4 quartiles
quartile_labels = ["1=0%tile", "2=25%tile", "3=50%tile", "4=75%tile"]
data['internetuseGROUP'] = pandas.qcut(data['internetuserate'], 4, labels=quartile_labels)
print('Internet use rate - 4 quartiles')
print(data['internetuseGROUP'].value_counts(sort=False, dropna=True))

# Income categories crossed with internet use quartiles
print(pandas.crosstab(data['incomeGROUP'], data['internetuseGROUP']))

# Employment rate crossed with female employment rate, both in quartiles
employ_group = pandas.qcut(data['employrate'], 4, labels=quartile_labels)
fem_employ_group = pandas.qcut(data['femaleemployrate'], 4, labels=quartile_labels)
print(pandas.crosstab(employ_group, fem_employ_group))
The functions used are:
- read_csv
- to_numeric (in place of convert_objects, which recent pandas versions no longer provide)
- cut
- qcut
- value_counts
- crosstab
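To make the difference between cut and qcut concrete, here is a minimal sketch on a small made-up series. The values and the names `values`, `by_edges` and `by_quartile` are illustrative assumptions, not taken from the GapMinder file. cut uses the bin edges you supply, while qcut derives the edges from the data so that each bin holds roughly the same number of observations.

```python
import pandas

# Hypothetical values, chosen only to illustrate the two functions
values = pandas.Series([50, 150, 800, 2500, 9000, 15000, 30000, 70000])

# cut: fixed bin edges supplied by hand, so bin sizes can be uneven
by_edges = pandas.cut(values, [0, 1000, 10000, 100000])
edge_counts = by_edges.value_counts(sort=False)
print(edge_counts)

# qcut: edges taken from the quantiles of the data, so bins are equally filled
by_quartile = pandas.qcut(values, 4, labels=["Q1", "Q2", "Q3", "Q4"])
quartile_counts = by_quartile.value_counts(sort=False)
print(quartile_counts)
```

This is why the income table has very uneven counts (the edges were chosen by hand), while every quartile group of the internet use rate has the same size.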
The results
Income per person - 7 categories
(0, 100] 0
(100, 1000] 54
(1000, 10000] 89
(10000, 20000] 17
(20000, 40000] 26
(40000, 60000] 1
(60000, 100000] 2
Internet use rate - 4 quartiles
1=0%tile 48
2=25%tile 48
3=50%tile 48
4=75%tile 48
internetuseGROUP  1=0%tile  2=25%tile  3=50%tile  4=75%tile
incomeGROUP
(0, 100]                 0          0          0          0
(100, 1000]             36         15          1          0
(1000, 10000]           11         32         33         10
(10000, 20000]           0          0         10          7
(20000, 40000]           0          0          0         25
(40000, 60000]           0          0          0          1
(60000, 100000]          0          0          0          2
femaleemployrate  1=0%tile  2=25%tile  3=50%tile  4=75%tile
employrate
1=0%tile                34         11          0          0
2=25%tile                6         22         16          0
3=50%tile                4          9         21         10
4=75%tile                1          2          7         35
Conclusions
As shown in the tables, there are only two countries with income per person larger than 60 thousand dollars, while more than 67% are below 10 thousand dollars per year (143 of the 189 countries with income data).
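The shares quoted here can be recomputed from the first frequency table. The sketch below only copies the printed counts into a dictionary (the name `income_counts` is mine) and does the arithmetic. The percentage is relative to the countries that have an income value, so it can differ from a share taken over every row in the file.

```python
# Counts copied from the "Income per person - 7 categories" table
income_counts = {
    "(0, 100]": 0,
    "(100, 1000]": 54,
    "(1000, 10000]": 89,
    "(10000, 20000]": 17,
    "(20000, 40000]": 26,
    "(40000, 60000]": 1,
    "(60000, 100000]": 2,
}

# Countries with a known income per person
total = sum(income_counts.values())

# Countries in the three lowest bands, i.e. at or below 10 thousand dollars
below_10k = (income_counts["(0, 100]"]
             + income_counts["(100, 1000]"]
             + income_counts["(1000, 10000]"])
above_60k = income_counts["(60000, 100000]"]
share_below_10k = 100.0 * below_10k / total

print("Countries with income data:", total)
print("Below 10 thousand dollars: %d (%.1f%%)" % (below_10k, share_below_10k))
print("Above 60 thousand dollars:", above_60k)
```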
The internet use rate, summarised in the second table, is crossed with income per person in the third. It is interesting to notice that these two variables appear to be positively correlated, because the internet use rate increases with the income per person.
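One way to check this reading is to turn the crosstab counts into row percentages, so that each income band shows how its countries are spread over the internet use quartiles. The sketch below rebuilds the table by hand from the printed counts. The names `counts` and `row_pct` are illustrative, and the empty (0, 100] band is dropped to avoid dividing by zero.

```python
import pandas

quartiles = ["1=0%tile", "2=25%tile", "3=50%tile", "4=75%tile"]
bands = ["(0, 100]", "(100, 1000]", "(1000, 10000]", "(10000, 20000]",
         "(20000, 40000]", "(40000, 60000]", "(60000, 100000]"]

# Counts copied from the income x internet use crosstab in the results
counts = pandas.DataFrame(
    [[0, 0, 0, 0],
     [36, 15, 1, 0],
     [11, 32, 33, 10],
     [0, 0, 10, 7],
     [0, 0, 0, 25],
     [0, 0, 0, 1],
     [0, 0, 0, 2]],
    index=bands, columns=quartiles)

# Keep only income bands that contain at least one country
counts = counts[counts.sum(axis=1) > 0]

# Share of each income band falling in each internet use quartile
row_pct = counts.div(counts.sum(axis=1), axis=0) * 100
print(row_pct.round(1))
```

If the association is positive, the weight of each row should shift to the right as income grows, which is what the percentages show. This is only a descriptive check, not a correlation coefficient.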
The last table crosses the employment rate with the female employment rate. It shows that 89 countries have a female employment rate below the 50th percentile. Considering that only 178 countries provide this information, that is half of our sample, which I read as a sign that the global workforce is composed mainly of men. Note, however, that a quartile split places about half of the countries below the median by construction, so this reading is tentative.
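The figures of 89 and 178 can be recovered from the last crosstab. The sketch below copies the printed counts (the names `table`, `total`, `below_median_female` and `on_diagonal` are mine) and also counts how many countries sit on the diagonal, where both rates fall in the same quartile.

```python
import pandas

labels = ["1=0%tile", "2=25%tile", "3=50%tile", "4=75%tile"]

# Rows: employment rate quartile; columns: female employment rate quartile
table = pandas.DataFrame(
    [[34, 11, 0, 0],
     [6, 22, 16, 0],
     [4, 9, 21, 10],
     [1, 2, 7, 35]],
    index=labels, columns=labels)

# Countries that report both rates
total = int(table.values.sum())

# Countries in the two lowest female employment quartiles
below_median_female = int(table[labels[:2]].values.sum())

# Countries whose two rates fall in the same quartile
on_diagonal = sum(int(table.iloc[i, i]) for i in range(len(labels)))

print("Countries with both rates:", total)
print("Female employment rate below the median:", below_median_female)
print("Same quartile for both rates:", on_diagonal)
```

Since the groups are quartiles of the female employment rate, the lower two groups hold about half of the countries regardless of how high or low the rates are. The concentration on the diagonal suggests that the two rates tend to move together.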