{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Estimation of Probabilities from Datasets\n", "In this notebook a small dataset of employess is given. Each employee is described by:\n", "* sex: *m* for male and *f* for female\n", "* number of years in the company: integer\n", "* income: *h* (high), *m* (medium) or *l* (low)\n", "* division: *s* (sales), *d* (design), *b* (backoffice) and *m* (marketing)\n", "\n", "The dataset is defined below and represented as a pandas dataframe." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "dataDict={\"sex\":[\"m\",\"m\",\"f\",\"m\",\"f\",\"f\",\"f\",\"m\",\"m\",\"m\"],\n", " \"years\":[10,2,4,4,5,1,7,2,4,1],\n", " \"income\":[\"h\",\"m\",\"m\",\"l\",\"m\",\"l\",\"m\",\"m\",\"h\",\"m\"],\n", " \"division\":[\"s\",\"d\",\"d\",\"b\",\"b\",\"m\",\"s\",\"d\",\"d\",\"d\"]\n", " }\n", "data=pd.DataFrame(dataDict)\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following probabilities shall be estimated from the given dataset:\n", "1. propability for *male* and *high* income -> $P(male,high)$\n", "2. probability for *male*, *low* income and *backoffice* -> $P(male,low,backoffice)$\n", "3. probability that a *male* has *high* income -> $P(high|male)$\n", "4. probability that a *female* has *high* income -> $P(high|female)$\n", "5. probability that an employee with *high* income is *female* -> $P(female|high)$\n", "6. probability that a *male* with *medium* income works in division *design* -> $P(design|male,medium)$\n", "7. probability that an employee in division *design* is a *male* with *high* income -> $P(male,medium|design)$\n", "8. probability that a *male* which is at least 4 years in the company has *medium* income -> $P(medium|male,\\geq4)$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For calculating joint probabilities and conditional probabilities the [pandas method crosstab()](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.crosstab.html) can be applied. **This method creates a table in which the frequencies of all value-combinations of two or more random variables can be determined.** Moreover, by applying the argument *normalize* of the `crosstab()`-method it is possible to calculate instead of the frequencies the joint probabilities or conditional probabilities of all value-combinations. This is demonstrated below: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we calculate the frequencies of all value-combinations of the variables *sex* and *income*: " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
incomehlm
sex
f013
m213
\n", "
" ], "text/plain": [ "income h l m\n", "sex \n", "f 0 1 3\n", "m 2 1 3" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(data[\"sex\"],data[\"income\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we, set the argument `normalize=\"all\"` in the same method call. The result is the complete joint probability distribution of these two variables." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
incomehlm
sex
f0.00.10.3
m0.20.10.3
\n", "
" ], "text/plain": [ "income h l m\n", "sex \n", "f 0.0 0.1 0.3\n", "m 0.2 0.1 0.3" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(data[\"sex\"],data[\"income\"],normalize=\"all\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the table calculated above, we can derive the answer for question 1:\n", "\n", "$$\n", "P(male,high)=0.2\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we set the argument `normalize=\"index\"`. The calculated values are the conditional probabilities $P(income|sex)$:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
incomehlm
sex
f0.0000000.2500000.75
m0.3333330.1666670.50
\n", "
" ], "text/plain": [ "income h l m\n", "sex \n", "f 0.000000 0.250000 0.75\n", "m 0.333333 0.166667 0.50" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(data[\"sex\"],data[\"income\"],normalize=\"index\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table calculated above contains the solutions for question 2 and 3:\n", "\n", "$$\n", "P(high|male)=0.333\n", "$$\n", "\n", "and\n", "\n", "$$\n", "P(high|female)=0\n", "$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to calculate the conditional probabilities of type $P(sex|income)$ we can apply the same `crosstab()`, but now with `normalize=\"columns\"`. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
incomehlm
sex
f0.00.50.5
m1.00.50.5
\n", "
" ], "text/plain": [ "income h l m\n", "sex \n", "f 0.0 0.5 0.5\n", "m 1.0 0.5 0.5" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab(data[\"sex\"],data[\"income\"],normalize=\"columns\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This table contains the answer to question 5:\n", "\n", "$$\n", "P(female|high)=0\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `crosstab()`-method can also be applied for more than two variables, as demonstrated below:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true, "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
divisionbdms
sexincome
fl0.00.00.10.0
m0.10.10.00.1
mh0.00.10.00.1
l0.10.00.00.0
m0.00.30.00.0
\n", "
" ], "text/plain": [ "division b d m s\n", "sex income \n", "f l 0.0 0.0 0.1 0.0\n", " m 0.1 0.1 0.0 0.1\n", "m h 0.0 0.1 0.0 0.1\n", " l 0.1 0.0 0.0 0.0\n", " m 0.0 0.3 0.0 0.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab([data[\"sex\"],data[\"income\"]],data[\"division\"],normalize=\"all\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the table calculated above, we can derive the answer for question 2:\n", "\n", "$$\n", "P(male,low,backoffice)=0.1\n", "$$" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
divisionbdms
sexincome
fl0.0000000.0000001.00.000000
m0.3333330.3333330.00.333333
mh0.0000000.5000000.00.500000
l1.0000000.0000000.00.000000
m0.0000001.0000000.00.000000
\n", "
" ], "text/plain": [ "division b d m s\n", "sex income \n", "f l 0.000000 0.000000 1.0 0.000000\n", " m 0.333333 0.333333 0.0 0.333333\n", "m h 0.000000 0.500000 0.0 0.500000\n", " l 1.000000 0.000000 0.0 0.000000\n", " m 0.000000 1.000000 0.0 0.000000" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab([data[\"sex\"],data[\"income\"]],data[\"division\"],normalize=\"index\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The table calculated above, contains the answer of question 6:\n", "\n", "$$\n", "P(design|male,medium)=1.0\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to calculate the answer for question 7, we set `normalize=\"columns\"`:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
divisionbdms
sexincome
fl0.00.01.00.0
m0.50.20.00.5
mh0.00.20.00.5
l0.50.00.00.0
m0.00.60.00.0
\n", "
" ], "text/plain": [ "division b d m s\n", "sex income \n", "f l 0.0 0.0 1.0 0.0\n", " m 0.5 0.2 0.0 0.5\n", "m h 0.0 0.2 0.0 0.5\n", " l 0.5 0.0 0.0 0.0\n", " m 0.0 0.6 0.0 0.0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab([data[\"sex\"],data[\"income\"]],data[\"division\"],normalize=\"columns\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this table we derive the answer for question 7:\n", "\n", "$$\n", "P(male,medium|design)=0.6\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For calculating the answer of question 8, we apply the `crosstab()`-method as described below and add the two values in column `m` which belong to rows that belong to *male* and at least 4 years:\n", "\n", "$$\n", "P(medium|male,\\geq 4)= P(medium|male,4) + P(medium|male,10) =0 + 0 = 0 \n", "$$" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
incomehlm
sexyears
f10.01.00.0
40.00.01.0
50.00.01.0
70.00.01.0
m10.00.01.0
20.00.01.0
40.50.50.0
101.00.00.0
\n", "
" ], "text/plain": [ "income h l m\n", "sex years \n", "f 1 0.0 1.0 0.0\n", " 4 0.0 0.0 1.0\n", " 5 0.0 0.0 1.0\n", " 7 0.0 0.0 1.0\n", "m 1 0.0 0.0 1.0\n", " 2 0.0 0.0 1.0\n", " 4 0.5 0.5 0.0\n", " 10 1.0 0.0 0.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.crosstab([data[\"sex\"],data[\"years\"]],data[\"income\"],normalize=\"index\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }