Estimation of Probabilities from Datasets
In this notebook a small dataset of employees is given. Each employee is described by:
sex: m for male and f for female
number of years in the company: integer
income: h (high), m (medium) or l (low)
division: s (sales), d (design), b (backoffice) and m (marketing)
The dataset is defined below and represented as a pandas dataframe.
import pandas as pd
dataDict={"sex":["m","m","f","m","f","f","f","m","m","m"],
"years":[10,2,4,4,5,1,7,2,4,1],
"income":["h","m","m","l","m","l","m","m","h","m"],
"division":["s","d","d","b","b","m","s","d","d","d"]
}
data=pd.DataFrame(dataDict)
data
index | sex | years | income | division |
---|---|---|---|---|
0 | m | 10 | h | s |
1 | m | 2 | m | d |
2 | f | 4 | m | d |
3 | m | 4 | l | b |
4 | f | 5 | m | b |
5 | f | 1 | l | m |
6 | f | 7 | m | s |
7 | m | 2 | m | d |
8 | m | 4 | h | d |
9 | m | 1 | m | d |
The following probabilities shall be estimated from the given dataset:
probability for male and high income -> \(P(male,high)\)
probability for male, low income and backoffice -> \(P(male,low,backoffice)\)
probability that a male has high income -> \(P(high|male)\)
probability that a female has high income -> \(P(high|female)\)
probability that an employee with high income is female -> \(P(female|high)\)
probability that a male with medium income works in division design -> \(P(design|male,medium)\)
probability that an employee in division design is a male with medium income -> \(P(male,medium|design)\)
probability that a male who has been with the company for at least 4 years has medium income -> \(P(medium|male,\geq4)\)
For calculating joint and conditional probabilities, the pandas function crosstab() can be applied. It creates a table containing the frequencies of all value combinations of two or more random variables. Moreover, with the normalize argument of crosstab(), the joint or conditional probabilities of all value combinations can be calculated instead of the frequencies. This is demonstrated below:
First we calculate the frequencies of all value-combinations of the variables sex and income:
pd.crosstab(data["sex"],data["income"])
income | h | l | m |
---|---|---|---|
sex | |||
f | 0 | 1 | 3 |
m | 2 | 1 | 3 |
Next, we set the argument normalize="all" in the same call. The result is the complete joint probability distribution of these two variables.
pd.crosstab(data["sex"],data["income"],normalize="all")
income | h | l | m |
---|---|---|---|
sex | |||
f | 0.0 | 0.1 | 0.3 |
m | 0.2 | 0.1 | 0.3 |
From the table calculated above, we can derive the answer for question 1: \(P(male,high)=0.2\)
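Instead of reading the value off the printed table, it can also be looked up by label. The following is a minimal sketch that rebuilds the two relevant columns of the dataset; the variable names `joint` and `p_male_high` are chosen here for illustration only.

```python
import pandas as pd

# Same toy data as above, restricted to the two columns needed here.
data = pd.DataFrame({
    "sex": ["m", "m", "f", "m", "f", "f", "f", "m", "m", "m"],
    "income": ["h", "m", "m", "l", "m", "l", "m", "m", "h", "m"],
})

# Joint distribution P(sex, income): every cell is count / total number of rows.
joint = pd.crosstab(data["sex"], data["income"], normalize="all")

# Row label selects the sex, column label selects the income class.
p_male_high = joint.loc["m", "h"]
print(p_male_high)
```

Since the table is a full joint distribution, all of its cells sum to 1.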
Next, we set the argument normalize="index". The calculated values are the conditional probabilities \(P(income|sex)\):
pd.crosstab(data["sex"],data["income"],normalize="index")
income | h | l | m |
---|---|---|---|
sex | |||
f | 0.000000 | 0.250000 | 0.75 |
m | 0.333333 | 0.166667 | 0.50 |
The table calculated above contains the answers to questions 3 and 4: \(P(high|male)=0.333\) and \(P(high|female)=0\)
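The row-normalized table is related to the joint distribution by \(P(income|sex)=P(sex,income)/P(sex)\). As a hedged sanity check, the sketch below divides each row of the joint table by its marginal and compares the result with the output of normalize="index"; the names `p_sex`, `cond_manual` and `cond_crosstab` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "sex": ["m", "m", "f", "m", "f", "f", "f", "m", "m", "m"],
    "income": ["h", "m", "m", "l", "m", "l", "m", "m", "h", "m"],
})

joint = pd.crosstab(data["sex"], data["income"], normalize="all")

# Marginal P(sex): sum the joint probabilities over all income classes.
p_sex = joint.sum(axis=1)

# Conditional P(income | sex): divide each row by its marginal.
cond_manual = joint.div(p_sex, axis=0)

# The same table as produced directly by crosstab.
cond_crosstab = pd.crosstab(data["sex"], data["income"], normalize="index")

print(np.allclose(cond_manual.to_numpy(), cond_crosstab.to_numpy()))
```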
In order to calculate the conditional probabilities of type \(P(sex|income)\), we can apply the same crosstab() function, but now with normalize="columns".
pd.crosstab(data["sex"],data["income"],normalize="columns")
income | h | l | m |
---|---|---|---|
sex | |||
f | 0.0 | 0.5 | 0.5 |
m | 1.0 | 0.5 | 0.5 |
This table contains the answer to question 5: \(P(female|high)=0\)
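The column-normalized values can be cross-checked with Bayes' rule, \(P(sex|high)=P(high|sex)\,P(sex)/P(high)\). The sketch below is one possible way to do this with the estimates from the dataset; all variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "sex": ["m", "m", "f", "m", "f", "f", "f", "m", "m", "m"],
    "income": ["h", "m", "m", "l", "m", "l", "m", "m", "h", "m"],
})

# Likelihood P(high | sex): column "h" of the row-normalized table.
p_high_given_sex = pd.crosstab(data["sex"], data["income"], normalize="index")["h"]

# Prior P(sex) and evidence P(high), both estimated as relative frequencies.
p_sex = data["sex"].value_counts(normalize=True)
p_high = (data["income"] == "h").mean()

# Bayes' rule; the two Series are aligned on their labels "f" and "m".
p_sex_given_high = p_high_given_sex * p_sex / p_high

# Reference values from crosstab with normalize="columns".
reference = pd.crosstab(data["sex"], data["income"], normalize="columns")["h"]
print(p_sex_given_high["f"], reference["f"])
```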
The crosstab() function can also be applied to more than two variables, as demonstrated below:
pd.crosstab([data["sex"],data["income"]],data["division"],normalize="all")
sex | income | division b | division d | division m | division s |
---|---|---|---|---|---|
f | l | 0.0 | 0.0 | 0.1 | 0.0 |
f | m | 0.1 | 0.1 | 0.0 | 0.1 |
m | h | 0.0 | 0.1 | 0.0 | 0.1 |
m | l | 0.1 | 0.0 | 0.0 | 0.0 |
m | m | 0.0 | 0.3 | 0.0 | 0.0 |
From the table calculated above, we can derive the answer for question 2: \(P(male,low,backoffice)=0.1\)
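With two row variables the result has a two-level row index, so a single cell is addressed with a tuple for the row and a label for the column. The following sketch illustrates this lookup; `joint3` and `p_male_low_backoffice` are names introduced here for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "sex": ["m", "m", "f", "m", "f", "f", "f", "m", "m", "m"],
    "income": ["h", "m", "m", "l", "m", "l", "m", "m", "h", "m"],
    "division": ["s", "d", "d", "b", "b", "m", "s", "d", "d", "d"],
})

# Joint distribution P(sex, income, division).
joint3 = pd.crosstab([data["sex"], data["income"]], data["division"], normalize="all")

# Row key is the tuple (sex, income), column key is the division.
p_male_low_backoffice = joint3.loc[("m", "l"), "b"]
print(p_male_low_backoffice)

# Combinations that never occur in the data (here: female with high income)
# have no row at all, so a direct .loc lookup for them would raise a KeyError.
print(("f", "h") in joint3.index)
```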
pd.crosstab([data["sex"],data["income"]],data["division"],normalize="index")
sex | income | division b | division d | division m | division s |
---|---|---|---|---|---|
f | l | 0.000000 | 0.000000 | 1.0 | 0.000000 |
f | m | 0.333333 | 0.333333 | 0.0 | 0.333333 |
m | h | 0.000000 | 0.500000 | 0.0 | 0.500000 |
m | l | 1.000000 | 0.000000 | 0.0 | 0.000000 |
m | m | 0.000000 | 1.000000 | 0.0 | 0.000000 |
The table calculated above contains the answer to question 6: \(P(design|male,medium)=1\)
In order to calculate the answer for question 7, we set normalize="columns":
pd.crosstab([data["sex"],data["income"]],data["division"],normalize="columns")
sex | income | division b | division d | division m | division s |
---|---|---|---|---|---|
f | l | 0.0 | 0.0 | 1.0 | 0.0 |
f | m | 0.5 | 0.2 | 0.0 | 0.5 |
m | h | 0.0 | 0.2 | 0.0 | 0.5 |
m | l | 0.5 | 0.0 | 0.0 | 0.0 |
m | m | 0.0 | 0.6 | 0.0 | 0.0 |
From this table we derive the answer for question 7: \(P(male,medium|design)=0.6\)
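A conditional probability can also be estimated without crosstab() by first restricting the dataset to the rows that satisfy the condition and then taking the relative frequency of the event within that subset. The sketch below applies this to \(P(male,medium|design)\); the names `design` and `p_male_medium_given_design` are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "sex": ["m", "m", "f", "m", "f", "f", "f", "m", "m", "m"],
    "income": ["h", "m", "m", "l", "m", "l", "m", "m", "h", "m"],
    "division": ["s", "d", "d", "b", "b", "m", "s", "d", "d", "d"],
})

# Condition: keep only employees of the design division.
design = data[data["division"] == "d"]

# Event: male AND medium income; the mean of a boolean Series is the
# fraction of rows for which it is True.
is_male_medium = (design["sex"] == "m") & (design["income"] == "m")
p_male_medium_given_design = is_male_medium.mean()
print(p_male_medium_given_design)
```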
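To answer question 8, we apply the crosstab() function as shown below and inspect column m in the rows for male employees with at least 4 years in the company (years 4 and 10). With normalize="index" each row is conditioned on a different (sex, years) pair, so these entries cannot in general simply be added; here both are 0, hence \(P(medium|male,\geq4)=0\):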
pd.crosstab([data["sex"],data["years"]],data["income"],normalize="index")
sex | years | income h | income l | income m |
---|---|---|---|---|
f | 1 | 0.0 | 1.0 | 0.0 |
f | 4 | 0.0 | 0.0 | 1.0 |
f | 5 | 0.0 | 0.0 | 1.0 |
f | 7 | 0.0 | 0.0 | 1.0 |
m | 1 | 0.0 | 0.0 | 1.0 |
m | 2 | 0.0 | 0.0 | 1.0 |
m | 4 | 0.5 | 0.5 | 0.0 |
m | 10 | 1.0 | 0.0 | 0.0 |