2016-10-01

Making Data Management Decisions

Here we go again!
This post is the third assignment for Coursera's MOOC Data Management and Visualization. Today we are going to explore making data management decisions, and you will see how to group numeric variables into classes and count them with Python.

The data


My dataset is the GapMinder csv file, imported and converted to numeric variables:
import pandas

data = pandas.read_csv('gapminder.csv')
# convert_objects was removed from pandas; to_numeric with errors='coerce'
# does the same job and turns unparseable entries into NaN
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')
The Python code above uses the three variables incomeperperson, internetuserate, and employrate. If you want to know more about them, check the Code Book for GapMinder's dataset.
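
To see what the numeric conversion does to blank entries, here is a small sketch. It assumes a made-up three-row sample in place of gapminder.csv (the values are hypothetical, not real GapMinder figures) and uses pandas.to_numeric with errors='coerce', which marks anything unparseable as NaN:

```python
import io
import pandas

# Hypothetical sample standing in for gapminder.csv; the values are made up.
sample_csv = io.StringIO(
    "country,incomeperperson,internetuserate,employrate\n"
    "A,1200.5,12.3,55.1\n"
    "B, ,81.0,61.5\n"
    "C,35000.0, ,\n"
)
sample = pandas.read_csv(sample_csv)

columns = ['incomeperperson', 'internetuserate', 'employrate']
for column in columns:
    # errors='coerce' replaces blank or non-numeric entries with NaN
    sample[column] = pandas.to_numeric(sample[column], errors='coerce')

# Number of missing values per variable
missing = sample[columns].isnull().sum()
print(missing)
```

Counting the missing values per variable right after the conversion is a cheap way to know how many countries have no usable value before any classification is done.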

The definition of classes


Income Per Person


I decided to classify the countries by income per person in dollars, which shows how many countries are rich, average or poor.
The code for this classification is the following:

def getIppClass(income):
    if(income > 70000):
        return 1
    elif(income > 50000):
        return 2
    elif(income > 30000):
        return 3
    elif(income > 10000):
        return 4
    else:
        return 5
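
The same binning can also be expressed with pandas.cut, which has the side benefit of leaving missing values as NaN instead of assigning them a class. This is only a sketch of an alternative, not what the program in this post uses; the incomes below are made-up values and the names edges and labels are illustrative:

```python
import numpy
import pandas

# Hypothetical incomes (made-up values), including one missing entry
incomes = pandas.Series(
    [500.0, 10000.0, 10000.01, 45000.0, 60000.0, 90000.0, numpy.nan]
)

# Bin edges match the thresholds in getIppClass.
# With the default right=True each bin is (lower, upper],
# so exactly 10000 is still class 5 and anything above it is class 4.
edges = [-numpy.inf, 10000, 30000, 50000, 70000, numpy.inf]
labels = [5, 4, 3, 2, 1]

classes = pandas.cut(incomes, bins=edges, labels=labels)
print(classes)
```

The result is a categorical series, so a later value_counts with dropna=True would leave the missing country out of the table.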

Internet use rate


To classify the countries by the percentage of the population with Internet access, I decided to create four groups.
The code for this classification is the following:

def getInternetClass(internet):
    if(internet > 70):
        return 4
    if(internet > 50):
        return 3
    if(internet > 30):
        return 2
    else:
        return 1

Employ rate


And finally, to analyze the employ rate, I divided the countries into four classes using the following code:

def getEmployClass(employrate):
    if(employrate > 70):
        return 4
    if(employrate > 50):
        return 3
    if(employrate > 30):
        return 2
    else:
        return 1
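
Since getInternetClass and getEmployClass use the same thresholds (70, 50 and 30), a single function could serve both variables. The version below is a possible refactoring, not the code used in this post: get_rate_class is a hypothetical name, and unlike the two functions above it is written to return NaN for a missing rate.

```python
import math


def get_rate_class(rate):
    """Map a percentage to a class from 1 (30 or less) to 4 (above 70)."""
    if math.isnan(rate):
        return float('nan')  # keep missing values missing
    if rate > 70:
        return 4
    elif rate > 50:
        return 3
    elif rate > 30:
        return 2
    else:
        return 1


# Check the class boundaries with a few made-up rates
for rate in [25.0, 30.0, 30.1, 50.0, 70.0, 70.1]:
    print(rate, get_rate_class(rate))
```

Because every test is a strict "greater than", a rate of exactly 30, 50 or 70 belongs to the lower of the two neighbouring classes.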

The results and full code

My program generates the following output:

counts for income per person classes
5    166
4     31
3     12
2      2
1      2

percentage for income per person classes
5    0.779343
4    0.145540
3    0.056338
2    0.009390
1    0.009390

counts for Internet use rate classes
1    114
2     43
3     23
4     33

percentage for Internet use rate classes
1    0.535211
2    0.201878
3    0.107981
4    0.154930

counts for employ rate classes
1     35
2     37
3    114
4     27

percentage for employ rate classes
1    0.164319
2    0.173709
3    0.535211
4    0.126761
The first two tables show that in almost 78% of countries (166), the income per person is US$ 10,000 per year or less. In only two of them (about 0.94%) is the income per person above US$ 70,000 per year. There are many more poor countries than rich ones.

The third and fourth tables show that in 53% of countries (114) the Internet use rate is 30% or less, and in 15% of them (33) it is above 70%.

Finally, the last two tables show the employ rate. In 53% of the countries (114), the employ rate is above 50% and no higher than 70%. On the other hand, only 27 countries (about 13%) have an employ rate above 70%, and 35 countries (about 16%) fall into class 1, the group for rates of 30% or less.

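The frequency counts were computed with the dropna option. Note, however, that every table above adds up to the same total of 213. The class functions return a class for every input, and a missing value (NaN) fails every > comparison, so any missing values end up in the else branch (class 5 for income, class 1 for the two rates) instead of being dropped from the counts.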
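
One thing worth checking here is what dropna actually sees. It only drops NaN values from the series being counted, so it depends on whether the class function passes them through. The sketch below uses made-up rates (not GapMinder data) with the same logic as getEmployClass, and compares counting everything against classifying only the non-missing rates:

```python
import numpy
import pandas


def getEmployClass(employrate):
    # Same logic as the classifier in this post
    if employrate > 70:
        return 4
    if employrate > 50:
        return 3
    if employrate > 30:
        return 2
    else:
        return 1


# Hypothetical rates: three known values and two missing ones
rates = pandas.Series([75.0, 55.0, 40.0, numpy.nan, numpy.nan])

# NaN > 70, NaN > 50 and NaN > 30 are all False, so NaN becomes class 1
classes = rates.apply(getEmployClass)
counts = classes.value_counts(dropna=True).sort_index()
print(counts)

# Dropping the missing rates first keeps them out of every class
counts_known = rates.dropna().apply(getEmployClass).value_counts().sort_index()
print(counts_known)
```

In the first table the two missing rates are counted as class 1; in the second they are excluded, which is closer to what dropna is meant to achieve.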

Here is the full Python code:

import pandas

data = pandas.read_csv('gapminder.csv')

# Coerce the three variables to numeric; unparseable entries become NaN
# (convert_objects was removed from pandas, to_numeric replaces it)
data['incomeperperson'] = pandas.to_numeric(data['incomeperperson'], errors='coerce')
data['internetuserate'] = pandas.to_numeric(data['internetuserate'], errors='coerce')
data['employrate'] = pandas.to_numeric(data['employrate'], errors='coerce')

# Income per person: class 1 (above 70,000) down to class 5 (10,000 or less)
def getIppClass(income):
    if(income > 70000):
        return 1
    elif(income > 50000):
        return 2
    elif(income > 30000):
        return 3
    elif(income > 10000):
        return 4
    else:
        return 5

result = data['incomeperperson'].apply(getIppClass)
print('counts for income per person classes')
print(result.value_counts(dropna=True))
print('percentage for income per person classes')
print(result.value_counts(normalize=True, dropna=True))

# Internet use rate: class 4 (above 70%) down to class 1 (30% or less)
def getInternetClass(internet):
    if(internet > 70):
        return 4
    if(internet > 50):
        return 3
    if(internet > 30):
        return 2
    else:
        return 1

# sort_index() lists the classes in order from 1 to 4
result = data['internetuserate'].apply(getInternetClass)
print('counts for internet use rate classes')
print(result.value_counts(sort=False, dropna=True).sort_index())
print('percentage for internet use rate classes')
print(result.value_counts(sort=False, normalize=True, dropna=True).sort_index())

# Employ rate: class 4 (above 70%) down to class 1 (30% or less)
def getEmployClass(employrate):
    if(employrate > 70):
        return 4
    if(employrate > 50):
        return 3
    if(employrate > 30):
        return 2
    else:
        return 1

result = data['employrate'].apply(getEmployClass)
print('counts for employ rate classes')
print(result.value_counts(sort=False, dropna=True).sort_index())
print('percentage for employ rate classes')
print(result.value_counts(sort=False, normalize=True, dropna=True).sort_index())
