{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Estimation of Probabilities from Datasets\n",
"In this notebook a small dataset of employess is given. Each employee is described by:\n",
"* sex: *m* for male and *f* for female\n",
"* number of years in the company: integer\n",
"* income: *h* (high), *m* (medium) or *l* (low)\n",
"* division: *s* (sales), *d* (design), *b* (backoffice) and *m* (marketing)\n",
"\n",
"The dataset is defined below and represented as a pandas dataframe."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"dataDict={\"sex\":[\"m\",\"m\",\"f\",\"m\",\"f\",\"f\",\"f\",\"m\",\"m\",\"m\"],\n",
" \"years\":[10,2,4,4,5,1,7,2,4,1],\n",
" \"income\":[\"h\",\"m\",\"m\",\"l\",\"m\",\"l\",\"m\",\"m\",\"h\",\"m\"],\n",
" \"division\":[\"s\",\"d\",\"d\",\"b\",\"b\",\"m\",\"s\",\"d\",\"d\",\"d\"]\n",
" }\n",
"data=pd.DataFrame(dataDict)\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following probabilities shall be estimated from the given dataset:\n",
"1. propability for *male* and *high* income -> $P(male,high)$\n",
"2. probability for *male*, *low* income and *backoffice* -> $P(male,low,backoffice)$\n",
"3. probability that a *male* has *high* income -> $P(high|male)$\n",
"4. probability that a *female* has *high* income -> $P(high|female)$\n",
"5. probability that an employee with *high* income is *female* -> $P(female|high)$\n",
"6. probability that a *male* with *medium* income works in division *design* -> $P(design|male,medium)$\n",
"7. probability that an employee in division *design* is a *male* with *high* income -> $P(male,medium|design)$\n",
"8. probability that a *male* which is at least 4 years in the company has *medium* income -> $P(medium|male,\\geq4)$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For calculating joint probabilities and conditional probabilities the [pandas method crosstab()](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.crosstab.html) can be applied. **This method creates a table in which the frequencies of all value-combinations of two or more random variables can be determined.** Moreover, by applying the argument *normalize* of the `crosstab()`-method it is possible to calculate instead of the frequencies the joint probabilities or conditional probabilities of all value-combinations. This is demonstrated below: "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we calculate the frequencies of all value-combinations of the variables *sex* and *income*: "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" income | \n",
" h | \n",
" l | \n",
" m | \n",
"
\n",
" \n",
" sex | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" f | \n",
" 0 | \n",
" 1 | \n",
" 3 | \n",
"
\n",
" \n",
" m | \n",
" 2 | \n",
" 1 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"income h l m\n",
"sex \n",
"f 0 1 3\n",
"m 2 1 3"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(data[\"sex\"],data[\"income\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we, set the argument `normalize=\"all\"` in the same method call. The result is the complete joint probability distribution of these two variables."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" income | \n",
" h | \n",
" l | \n",
" m | \n",
"
\n",
" \n",
" sex | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" f | \n",
" 0.0 | \n",
" 0.1 | \n",
" 0.3 | \n",
"
\n",
" \n",
" m | \n",
" 0.2 | \n",
" 0.1 | \n",
" 0.3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"income h l m\n",
"sex \n",
"f 0.0 0.1 0.3\n",
"m 0.2 0.1 0.3"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(data[\"sex\"],data[\"income\"],normalize=\"all\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the table calculated above, we can derive the answer for question 1:\n",
"\n",
"$$\n",
"P(male,high)=0.2\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we set the argument `normalize=\"index\"`. The calculated values are the conditional probabilities $P(income|sex)$:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" income | \n",
" h | \n",
" l | \n",
" m | \n",
"
\n",
" \n",
" sex | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" f | \n",
" 0.000000 | \n",
" 0.250000 | \n",
" 0.75 | \n",
"
\n",
" \n",
" m | \n",
" 0.333333 | \n",
" 0.166667 | \n",
" 0.50 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"income h l m\n",
"sex \n",
"f 0.000000 0.250000 0.75\n",
"m 0.333333 0.166667 0.50"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(data[\"sex\"],data[\"income\"],normalize=\"index\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The table calculated above contains the solutions for question 2 and 3:\n",
"\n",
"$$\n",
"P(high|male)=0.333\n",
"$$\n",
"\n",
"and\n",
"\n",
"$$\n",
"P(high|female)=0\n",
"$$\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to calculate the conditional probabilities of type $P(sex|income)$ we can apply the same `crosstab()`, but now with `normalize=\"columns\"`. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" income | \n",
" h | \n",
" l | \n",
" m | \n",
"
\n",
" \n",
" sex | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" f | \n",
" 0.0 | \n",
" 0.5 | \n",
" 0.5 | \n",
"
\n",
" \n",
" m | \n",
" 1.0 | \n",
" 0.5 | \n",
" 0.5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"income h l m\n",
"sex \n",
"f 0.0 0.5 0.5\n",
"m 1.0 0.5 0.5"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab(data[\"sex\"],data[\"income\"],normalize=\"columns\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This table contains the answer to question 5:\n",
"\n",
"$$\n",
"P(female|high)=0\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `crosstab()`-method can also be applied for more than two variables, as demonstrated below:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true,
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" division | \n",
" b | \n",
" d | \n",
" m | \n",
" s | \n",
"
\n",
" \n",
" sex | \n",
" income | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" f | \n",
" l | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.1 | \n",
" 0.0 | \n",
"
\n",
" \n",
" m | \n",
" 0.1 | \n",
" 0.1 | \n",
" 0.0 | \n",
" 0.1 | \n",
"
\n",
" \n",
" m | \n",
" h | \n",
" 0.0 | \n",
" 0.1 | \n",
" 0.0 | \n",
" 0.1 | \n",
"
\n",
" \n",
" l | \n",
" 0.1 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" m | \n",
" 0.0 | \n",
" 0.3 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"division b d m s\n",
"sex income \n",
"f l 0.0 0.0 0.1 0.0\n",
" m 0.1 0.1 0.0 0.1\n",
"m h 0.0 0.1 0.0 0.1\n",
" l 0.1 0.0 0.0 0.0\n",
" m 0.0 0.3 0.0 0.0"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab([data[\"sex\"],data[\"income\"]],data[\"division\"],normalize=\"all\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the table calculated above, we can derive the answer for question 2:\n",
"\n",
"$$\n",
"P(male,low,backoffice)=0.1\n",
"$$"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" division | \n",
" b | \n",
" d | \n",
" m | \n",
" s | \n",
"
\n",
" \n",
" sex | \n",
" income | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" f | \n",
" l | \n",
" 0.000000 | \n",
" 0.000000 | \n",
" 1.0 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" m | \n",
" 0.333333 | \n",
" 0.333333 | \n",
" 0.0 | \n",
" 0.333333 | \n",
"
\n",
" \n",
" m | \n",
" h | \n",
" 0.000000 | \n",
" 0.500000 | \n",
" 0.0 | \n",
" 0.500000 | \n",
"
\n",
" \n",
" l | \n",
" 1.000000 | \n",
" 0.000000 | \n",
" 0.0 | \n",
" 0.000000 | \n",
"
\n",
" \n",
" m | \n",
" 0.000000 | \n",
" 1.000000 | \n",
" 0.0 | \n",
" 0.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"division b d m s\n",
"sex income \n",
"f l 0.000000 0.000000 1.0 0.000000\n",
" m 0.333333 0.333333 0.0 0.333333\n",
"m h 0.000000 0.500000 0.0 0.500000\n",
" l 1.000000 0.000000 0.0 0.000000\n",
" m 0.000000 1.000000 0.0 0.000000"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab([data[\"sex\"],data[\"income\"]],data[\"division\"],normalize=\"index\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The table calculated above, contains the answer of question 6:\n",
"\n",
"$$\n",
"P(design|male,medium)=1.0\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to calculate the answer for question 7, we set `normalize=\"columns\"`:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" division | \n",
" b | \n",
" d | \n",
" m | \n",
" s | \n",
"
\n",
" \n",
" sex | \n",
" income | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" f | \n",
" l | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" m | \n",
" 0.5 | \n",
" 0.2 | \n",
" 0.0 | \n",
" 0.5 | \n",
"
\n",
" \n",
" m | \n",
" h | \n",
" 0.0 | \n",
" 0.2 | \n",
" 0.0 | \n",
" 0.5 | \n",
"
\n",
" \n",
" l | \n",
" 0.5 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" m | \n",
" 0.0 | \n",
" 0.6 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"division b d m s\n",
"sex income \n",
"f l 0.0 0.0 1.0 0.0\n",
" m 0.5 0.2 0.0 0.5\n",
"m h 0.0 0.2 0.0 0.5\n",
" l 0.5 0.0 0.0 0.0\n",
" m 0.0 0.6 0.0 0.0"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab([data[\"sex\"],data[\"income\"]],data[\"division\"],normalize=\"columns\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From this table we derive the answer for question 7:\n",
"\n",
"$$\n",
"P(male,medium|design)=0.6\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For calculating the answer of question 8, we apply the `crosstab()`-method as described below and add the two values in column `m` which belong to rows that belong to *male* and at least 4 years:\n",
"\n",
"$$\n",
"P(medium|male,\\geq 4)= P(medium|male,4) + P(medium|male,10) =0 + 0 = 0 \n",
"$$"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"tags": [
"hide-input"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" income | \n",
" h | \n",
" l | \n",
" m | \n",
"
\n",
" \n",
" sex | \n",
" years | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" f | \n",
" 1 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 5 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 7 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" m | \n",
" 1 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.5 | \n",
" 0.5 | \n",
" 0.0 | \n",
"
\n",
" \n",
" 10 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 0.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
"income h l m\n",
"sex years \n",
"f 1 0.0 1.0 0.0\n",
" 4 0.0 0.0 1.0\n",
" 5 0.0 0.0 1.0\n",
" 7 0.0 0.0 1.0\n",
"m 1 0.0 0.0 1.0\n",
" 2 0.0 0.0 1.0\n",
" 4 0.5 0.5 0.0\n",
" 10 1.0 0.0 0.0"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.crosstab([data[\"sex\"],data[\"years\"]],data[\"income\"],normalize=\"index\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}