Estimation of Probabilities from Datasets

In this notebook a small dataset of employees is given. Each employee is described by:

sex: m for male and f for female
number of years in the company: integer
income: h (high), m (medium) or l (low)
division: s (sales), d (design), b (backoffice) and m (marketing)

The dataset is defined below and represented as a pandas dataframe.

import pandas as pd

dataDict={"sex":["m","m","f","m","f","f","f","m","m","m"],
          "years":[10,2,4,4,5,1,7,2,4,1],
          "income":["h","m","m","l","m","l","m","m","h","m"],
          "division":["s","d","d","b","b","m","s","d","d","d"]
         }
data=pd.DataFrame(dataDict)
data

	sex	years	income	division
0	m	10	h	s
1	m	2	m	d
2	f	4	m	d
3	m	4	l	b
4	f	5	m	b
5	f	1	l	m
6	f	7	m	s
7	m	2	m	d
8	m	4	h	d
9	m	1	m	d

The following probabilities shall be estimated from the given dataset:

1. probability for male and high income -> \(P(male,high)\)
2. probability for male, low income and backoffice -> \(P(male,low,backoffice)\)
3. probability that a male has high income -> \(P(high|male)\)
4. probability that a female has high income -> \(P(high|female)\)
5. probability that an employee with high income is female -> \(P(female|high)\)
6. probability that a male with medium income works in division design -> \(P(design|male,medium)\)
7. probability that an employee in division design is a male with medium income -> \(P(male,medium|design)\)
8. probability that a male who has been with the company for at least 4 years has medium income -> \(P(medium|male,\geq4)\)
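All of these quantities can be estimated as relative frequencies. As a brief sketch, let \(N\) denote the number of rows in the dataset and \(N(\cdot)\) the number of rows that match the given values. This notation is an assumption introduced here for illustration and is not used elsewhere in the notebook:

\[ P(a,b) \approx \frac{N(a,b)}{N}, \qquad P(a|b) \approx \frac{N(a,b)}{N(b)} \]

For example, 2 of the 10 employees are male with high income and 6 of the 10 employees are male, so \(P(male,high) \approx 2/10\) and \(P(high|male) \approx 2/6\).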

Joint and conditional probabilities can be calculated with the pandas function crosstab(). It creates a table that contains the frequencies of all value combinations of two or more random variables. With the argument normalize, crosstab() returns the joint or conditional probabilities of all value combinations instead of the frequencies. This is demonstrated below:

First we calculate the frequencies of all value-combinations of the variables sex and income:

pd.crosstab(data["sex"],data["income"])

income	h	l	m
sex
f	0	1	3
m	2	1	3

Next, we set the argument normalize="all" in the same function call. The result is the complete joint probability distribution of these two variables.

pd.crosstab(data["sex"],data["income"],normalize="all")

income	h	l	m
sex
f	0.0	0.1	0.3
m	0.2	0.1	0.3

From the table calculated above, we can derive the answer for question 1:

\[ P(male,high)=0.2 \]
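As a cross-check, the same joint probability can be estimated without crosstab() by counting matching rows directly. The following is a minimal sketch; the variable names mask and p_male_high are chosen here for illustration only.

```python
import pandas as pd

# Same dataset as defined at the top of the notebook.
data = pd.DataFrame({
    "sex": ["m", "m", "f", "m", "f", "f", "f", "m", "m", "m"],
    "years": [10, 2, 4, 4, 5, 1, 7, 2, 4, 1],
    "income": ["h", "m", "m", "l", "m", "l", "m", "m", "h", "m"],
    "division": ["s", "d", "d", "b", "b", "m", "s", "d", "d", "d"],
})

# Boolean mask: True for rows that are male AND have high income.
mask = (data["sex"] == "m") & (data["income"] == "h")

# The mean of a boolean Series is the relative frequency of True values,
# i.e. the estimate of the joint probability P(male, high).
p_male_high = mask.mean()
print(p_male_high)
```

Two of the ten rows match, so the estimate agrees with the entry in row m, column h of the normalized table.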

Next, we set the argument normalize="index". The calculated values are the conditional probabilities \(P(income|sex)\):

pd.crosstab(data["sex"],data["income"],normalize="index")

income	h	l	m
sex
f	0.000000	0.250000	0.75
m	0.333333	0.166667	0.50

The table calculated above contains the answers to questions 3 and 4:

\[ P(high|male)=0.333 \]

and

\[ P(high|female)=0 \]
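The relation between the two normalizations can be made explicit: dividing each joint probability by the marginal probability of the row variable should reproduce the normalize="index" table. The sketch below illustrates this under the assumption that the marginal \(P(sex)\) is obtained by summing the joint table over its columns; the variable names are illustrative.

```python
import pandas as pd

# Same dataset as defined at the top of the notebook.
data = pd.DataFrame({
    "sex": ["m", "m", "f", "m", "f", "f", "f", "m", "m", "m"],
    "years": [10, 2, 4, 4, 5, 1, 7, 2, 4, 1],
    "income": ["h", "m", "m", "l", "m", "l", "m", "m", "h", "m"],
    "division": ["s", "d", "d", "b", "b", "m", "s", "d", "d", "d"],
})

# Joint distribution P(sex, income).
joint = pd.crosstab(data["sex"], data["income"], normalize="all")

# Marginal P(sex): sum the joint probabilities over all income values.
p_sex = joint.sum(axis=1)

# Conditional P(income | sex) = P(sex, income) / P(sex), computed row by row.
cond_manual = joint.div(p_sex, axis=0)

# The same table as returned directly by crosstab().
cond_crosstab = pd.crosstab(data["sex"], data["income"], normalize="index")

print(cond_manual.round(3))
```

Each row of a table normalized with normalize="index" sums to 1, because it is a probability distribution over income for one fixed value of sex.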

In order to calculate the conditional probabilities of type \(P(sex|income)\), we apply the same crosstab() call, but now with normalize="columns":

pd.crosstab(data["sex"],data["income"],normalize="columns")

income	h	l	m
sex
f	0.0	0.5	0.5
m	1.0	0.5	0.5

This table contains the answer to question 5:

\[ P(female|high)=0 \]
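An alternative way to obtain a single conditional distribution is to filter the rows that satisfy the condition and count the remaining values. The sketch below does this for the condition income = high, using value_counts(normalize=True); the variable names are illustrative.

```python
import pandas as pd

# Same dataset as defined at the top of the notebook.
data = pd.DataFrame({
    "sex": ["m", "m", "f", "m", "f", "f", "f", "m", "m", "m"],
    "years": [10, 2, 4, 4, 5, 1, 7, 2, 4, 1],
    "income": ["h", "m", "m", "l", "m", "l", "m", "m", "h", "m"],
    "division": ["s", "d", "d", "b", "b", "m", "s", "d", "d", "d"],
})

# Keep only the sex column of employees with high income.
high_earners = data.loc[data["income"] == "h", "sex"]

# Relative frequencies of sex within this subset estimate P(sex | high).
p_sex_given_high = high_earners.value_counts(normalize=True)

# Values that never occur in the subset are missing from the result,
# so a default of 0.0 is supplied when looking them up.
p_female_given_high = p_sex_given_high.get("f", 0.0)
p_male_given_high = p_sex_given_high.get("m", 0.0)
print(p_female_given_high, p_male_given_high)
```

Unlike the crosstab() table, the filtered result contains no entry for f at all, since no female employee in this dataset has high income. With only two high earners in the data, this estimate of 0 should be read with caution.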

The crosstab() function can also be applied to more than two variables, as demonstrated below:

pd.crosstab([data["sex"],data["income"]],data["division"],normalize="all")

	division	b	d	m	s
sex	income
f	l	0.0	0.0	0.1	0.0
f	m	0.1	0.1	0.0	0.1
m	h	0.0	0.1	0.0	0.1
	l	0.1	0.0	0.0	0.0
	m	0.0	0.3	0.0	0.0

From the table calculated above, we can derive the answer for question 2:

\[ P(male,low,backoffice)=0.1 \]
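Rather than reading the value off the printed table, it can be looked up programmatically. When a list of columns is passed as the first argument, the rows of the result are indexed by (sex, income) pairs, so a tuple is used as the row key. This is a minimal sketch with illustrative variable names.

```python
import pandas as pd

# Same dataset as defined at the top of the notebook.
data = pd.DataFrame({
    "sex": ["m", "m", "f", "m", "f", "f", "f", "m", "m", "m"],
    "years": [10, 2, 4, 4, 5, 1, 7, 2, 4, 1],
    "income": ["h", "m", "m", "l", "m", "l", "m", "m", "h", "m"],
    "division": ["s", "d", "d", "b", "b", "m", "s", "d", "d", "d"],
})

# Joint distribution P(sex, income, division).
joint3 = pd.crosstab(
    [data["sex"], data["income"]], data["division"], normalize="all"
)

# Row key is the tuple (sex, income); column key is the division.
p_male_low_backoffice = joint3.loc[("m", "l"), "b"]

# Sanity check: all entries of a joint distribution sum to 1.
total = joint3.to_numpy().sum()
print(p_male_low_backoffice, total)
```

Note that value combinations that never occur in the data, such as female with high income, do not appear as rows in this table.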

Setting normalize="index" yields the conditional probabilities \(P(division|sex,income)\):

pd.crosstab([data["sex"],data["income"]],data["division"],normalize="index")

	division	b	d	m	s
sex	income
f	l	0.000000	0.000000	1.0	0.000000
f	m	0.333333	0.333333	0.0	0.333333
m	h	0.000000	0.500000	0.0	0.500000
	l	1.000000	0.000000	0.0	0.000000
	m	0.000000	1.000000	0.0	0.000000

The table calculated above contains the answer to question 6:

\[ P(design|male,medium)=1.0 \]

In order to calculate the answer for question 7, we set normalize="columns":

pd.crosstab([data["sex"],data["income"]],data["division"],normalize="columns")

	division	b	d	m	s
sex	income
f	l	0.0	0.0	1.0	0.0
f	m	0.5	0.2	0.0	0.5
m	h	0.0	0.2	0.0	0.5
	l	0.5	0.0	0.0	0.0
	m	0.0	0.6	0.0	0.0

From this table we derive the answer for question 7:

\[ P(male,medium|design)=0.6 \]
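Questions 6 and 7 involve the same events with the roles of event and condition swapped, which is why their answers differ. The following sketch recomputes both values by filtering rows; the variable names are illustrative.

```python
import pandas as pd

# Same dataset as defined at the top of the notebook.
data = pd.DataFrame({
    "sex": ["m", "m", "f", "m", "f", "f", "f", "m", "m", "m"],
    "years": [10, 2, 4, 4, 5, 1, 7, 2, 4, 1],
    "income": ["h", "m", "m", "l", "m", "l", "m", "m", "h", "m"],
    "division": ["s", "d", "d", "b", "b", "m", "s", "d", "d", "d"],
})

is_male_medium = (data["sex"] == "m") & (data["income"] == "m")
is_design = data["division"] == "d"

# Question 6: among males with medium income, the fraction working in design.
p_design_given_male_medium = is_design[is_male_medium].mean()

# Question 7: among design employees, the fraction of males with medium income.
p_male_medium_given_design = is_male_medium[is_design].mean()

print(p_design_given_male_medium, p_male_medium_given_design)
```

All three males with medium income work in design, but only three of the five design employees are males with medium income.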

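To answer question 8, we apply crosstab() as shown below and inspect column m in the rows that belong to males with at least 4 years in the company (years 4 and 10). Conditional probabilities with different conditions cannot simply be added. Each one must be weighted by the probability of its condition within the group. Of the three males with at least 4 years in the company, two have exactly 4 years and one has 10 years:

\[ P(medium|male,\geq 4)= P(medium|male,4) \cdot P(4|male,\geq 4) + P(medium|male,10) \cdot P(10|male,\geq 4) = 0 \cdot \frac{2}{3} + 0 \cdot \frac{1}{3} = 0 \]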

pd.crosstab([data["sex"],data["years"]],data["income"],normalize="index")

	income	h	l	m
sex	years
f	1	0.0	1.0	0.0
	4	0.0	0.0	1.0
	5	0.0	0.0	1.0
	7	0.0	0.0	1.0
m	1	0.0	0.0	1.0
	2	0.0	0.0	1.0
	4	0.5	0.5	0.0
	10	1.0	0.0	0.0
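The weighting step can be avoided by conditioning on the combined event directly. The sketch below shows two ways to do this: filtering the rows with years >= 4, and grouping the years into two bins before calling crosstab(). The bin labels "<4" and ">=4" and the column name tenure are assumptions made for this illustration.

```python
import pandas as pd

# Same dataset as defined at the top of the notebook.
data = pd.DataFrame({
    "sex": ["m", "m", "f", "m", "f", "f", "f", "m", "m", "m"],
    "years": [10, 2, 4, 4, 5, 1, 7, 2, 4, 1],
    "income": ["h", "m", "m", "l", "m", "l", "m", "m", "h", "m"],
    "division": ["s", "d", "d", "b", "b", "m", "s", "d", "d", "d"],
})

# Variant 1: filter the males with at least 4 years and count medium incomes.
senior_males = data[(data["sex"] == "m") & (data["years"] >= 4)]
p_direct = (senior_males["income"] == "m").mean()

# Variant 2: replace the exact number of years by a two-valued label
# (hypothetical labels) and condition on (sex, tenure) with crosstab().
tenure = (data["years"] >= 4).map({True: ">=4", False: "<4"}).rename("tenure")
table = pd.crosstab([data["sex"], tenure], data["income"], normalize="index")
p_from_table = table.loc[("m", ">=4"), "m"]

print(p_direct, p_from_table)
```

Both variants condition on the whole group of males with at least 4 years at once, so no weights are needed.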
