{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example Linear Regression\n", "\n", "In this example (generalized) linear regression, as introduced [in the previous section](LinReg) is implemented and applied for estimating a function $f()$ that maps the speed of long distance runners to their heartrate. \n", "\n", "$$\n", "heartrate = f(speed)\n", "$$\n", "\n", "For training the model, a set of 30 samples is applied, each containing the speed (in m/s) of a runner and the heartrate measured at this speed. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Note that in this example input data consists of the single feature *speed*, i.e. it is 1-dimensional (d=1). All functions implemented below are tailored to this one-dimensional case." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Required Modules:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "#%matplotlib inline\n", "import numpy as np\n", "import pandas as pd\n", "import math\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read data from file. The first column contains an ID, the second column is the speed in m/s and the third column is the heartrate in beats/s." ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>speed</th><th>heartrate</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>1</th><td>4.50</td><td>155.15</td></tr>\n",
"<tr><th>2</th><td>5.00</td><td>166.68</td></tr>\n",
"<tr><th>3</th><td>4.50</td><td>164.37</td></tr>\n",
"<tr><th>4</th><td>5.25</td><td>160.82</td></tr>\n",
"<tr><th>5</th><td>4.50</td><td>148.51</td></tr>\n",
"<tr><th>6</th><td>4.75</td><td>169.83</td></tr>\n",
"<tr><th>7</th><td>5.00</td><td>188.01</td></tr>\n",
"<tr><th>8</th><td>5.50</td><td>187.90</td></tr>\n",
"<tr><th>9</th><td>4.50</td><td>157.96</td></tr>\n",
"<tr><th>10</th><td>5.25</td><td>178.29</td></tr>\n",
"<tr><th>11</th><td>5.25</td><td>179.55</td></tr>\n",
"<tr><th>12</th><td>5.50</td><td>203.81</td></tr>\n",
"<tr><th>13</th><td>4.00</td><td>150.74</td></tr>\n",
"<tr><th>14</th><td>4.50</td><td>171.78</td></tr>\n",
"<tr><th>15</th><td>5.00</td><td>172.52</td></tr>\n",
"<tr><th>16</th><td>4.25</td><td>148.92</td></tr>\n",
"<tr><th>17</th><td>4.25</td><td>160.02</td></tr>\n",
"<tr><th>18</th><td>5.00</td><td>183.83</td></tr>\n",
"<tr><th>19</th><td>5.00</td><td>156.67</td></tr>\n",
"<tr><th>20</th><td>5.00</td><td>162.40</td></tr>\n",
"<tr><th>21</th><td>5.00</td><td>171.39</td></tr>\n",
"<tr><th>22</th><td>4.00</td><td>156.10</td></tr>\n",
"<tr><th>23</th><td>4.50</td><td>153.15</td></tr>\n",
"<tr><th>24</th><td>4.75</td><td>161.35</td></tr>\n",
"<tr><th>25</th><td>4.25</td><td>163.16</td></tr>\n",
"<tr><th>26</th><td>5.25</td><td>165.42</td></tr>\n",
"<tr><th>27</th><td>5.25</td><td>189.25</td></tr>\n",
"<tr><th>28</th><td>4.25</td><td>165.56</td></tr>\n",
"<tr><th>29</th><td>5.25</td><td>172.35</td></tr>\n",
"<tr><th>30</th><td>4.00</td><td>158.07</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " speed heartrate\n", "1 4.50 155.15\n", "2 5.00 166.68\n", "3 4.50 164.37\n", "4 5.25 160.82\n", "5 4.50 148.51\n", "6 4.75 169.83\n", "7 5.00 188.01\n", "8 5.50 187.90\n", "9 4.50 157.96\n", "10 5.25 178.29\n", "11 5.25 179.55\n", "12 5.50 203.81\n", "13 4.00 150.74\n", "14 4.50 171.78\n", "15 5.00 172.52\n", "16 4.25 148.92\n", "17 4.25 160.02\n", "18 5.00 183.83\n", "19 5.00 156.67\n", "20 5.00 162.40\n", "21 5.00 171.39\n", "22 4.00 156.10\n", "23 4.50 153.15\n", "24 4.75 161.35\n", "25 4.25 163.16\n", "26 5.25 165.42\n", "27 5.25 189.25\n", "28 4.25 165.56\n", "29 5.25 172.35\n", "30 4.00 158.07" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataframe=pd.read_csv(\"HeartRate.csv\",header=None,sep=\";\",decimal=\",\",index_col=0,names=[\"speed\",\"heartrate\"])\n", "dataframe" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of samples: 30\n" ] } ], "source": [ "numdata=dataframe.shape[0]\n", "print(\"Number of samples: \",numdata)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the function `calculateWeights(X,r,deg)` the weights are calculated by applying the already introduced equation\n", "\n", "$$\n", "w=\\left( D^T D\\right)^{-1} D^T r\n", "$$\n", "\n", "The function is tailored to the case, where input data consists of only a single feature. However, the function is implemented such that, it can not only be applied to learn a linear function, but a polynomial of arbitrary degree. The degree of the polynomial can be set by the `deg`-argument of the function." 
] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [], "source": [ "def calculateWeights(X,r,deg):\n", "    # Build the matrix D: row p holds the powers X[p]**0 ... X[p]**deg\n", "    numdata=X.shape[0]\n", "    D=np.zeros((numdata,deg+1))\n", "    for p in range(numdata):\n", "        for ex in range(deg+1):\n", "            D[p][ex]=math.pow(float(X[p]),ex)\n", "    DT=np.transpose(D)\n", "    DTD=np.dot(DT,D)\n", "    y=np.dot(DT,r)\n", "    # Solve the normal equations (D^T D) w = D^T r for w\n", "    w=np.linalg.lstsq(DTD,y,rcond=None)[0]\n", "    return w" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [], "source": [ "features=dataframe[\"speed\"].values\n", "targets=dataframe[\"heartrate\"].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learn linear function\n", "First, we learn the best linear function \n", "\n", "$$\n", "heartrate = w_0+w_1 \\cdot speed\n", "$$\n", "\n", "by setting the `deg` argument of the function `calculateWeights()` to 1:" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Calculated weights:\n", "w0 = 67.68\n", "w1 = 20.93\n" ] } ], "source": [ "degree=1\n", "w=calculateWeights(features,targets,degree)\n", "print('Calculated weights:')\n", "for i in range(len(w)):\n", "    print(\"w%d = %3.2f\"%(i,w[i]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The learned model and the training samples are plotted below:" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(10,8))\n", "plt.scatter(features,targets,marker='o', color='red')\n", "plt.title('heartrate vs. speed of long distance runners')\n", "plt.xlabel('speed in m/s')\n", "plt.ylabel('heartrate in bps')\n", "RES=0.05 # resolution of speed-axis\n", "# plot calculated linear regression \n", "minS=np.min(features)\n", "maxS=np.max(features)\n", "speedrange=np.arange(minS,maxS+RES,RES)\n", "hrrange=np.zeros(speedrange.shape[0])\n", "for si,s in enumerate(speedrange):\n", " hrrange[si]=np.sum([w[d]*s**d for d in range(degree+1)])\n", "plt.plot(speedrange,hrrange)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally the mean absolute distance (MAD) and the Mean Square Error (MSE) are calculated." ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7.544456857402351\n", "MAD = 7.544456857402351\n", "MSE = 84.61650344232515\n" ] } ], "source": [ "pred=np.zeros(numdata)\n", "for si,x in enumerate(features):\n", " pred[si]=np.sum([w[d]*x**d for d in range(degree+1)])\n", " \n", "mad=1.0/numdata*np.sum(np.abs(pred-targets))\n", "mse=1.0/numdata*np.sum((pred-targets)**2)\n", "print(mad) \n", "print('MAD = ',mad) \n", "print('MSE = ',mse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that here the metrics MAD and MSE have been calculated on the training data. Hence, the corresponding values describe how well the model is fitted to training data. But these values are useless for determining how good the model will perform on new data. Usually in Machine Learning performance metrics such as MAD and MSE are calculated on test-data. But in this example we haven't split the set of labeled data into a training- and a test-partition." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learn quadratic function\n", "\n", "In order to learn the best quadratic function\n", "\n", "$$\n", "heartrate = w_0+w_1 \\cdot speed +w_2 \\cdot (speed)^2\n", "$$\n", "\n", "we repeat the steps for `deg=2`:" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Calculated weights:\n", "w0 = 445.13\n", "w1 = -140.19\n", "w2 = 17.04\n" ] } ], "source": [ "degree=2\n", "w=calculateWeights(features,targets,degree)\n", "print('Calculated weights:')\n", "for i in range(len(w)):\n", " print(\"w%d = %3.2f\"%(i,w[i]))" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(10,8))\n", "plt.scatter(features,targets,marker='o', color='red')\n", "plt.title('heartrate vs. speed of long distance runners')\n", "plt.xlabel('speed in m/s')\n", "plt.ylabel('heartrate in bps')\n", "RES=0.05 # resolution of speed-axis\n", "# plot calculated linear regression \n", "minS=np.min(features)\n", "maxS=np.max(features)\n", "speedrange=np.arange(minS,maxS+RES,RES)\n", "hrrange=np.zeros(speedrange.shape[0])\n", "for si,s in enumerate(speedrange):\n", " hrrange[si]=np.sum([w[d]*s**d for d in range(degree+1)])\n", "plt.plot(speedrange,hrrange)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6.914621088469685\n", "MAD = 6.914621088469685\n", "MSE = 74.97805153085808\n" ] } ], "source": [ "pred=np.zeros(numdata)\n", "for si,x in enumerate(features):\n", " pred[si]=np.sum([w[d]*x**d for d in range(degree+1)])\n", " \n", "mad=1.0/numdata*np.sum(np.abs(pred-targets))\n", "mse=1.0/numdata*np.sum((pred-targets)**2)\n", "print(mad) \n", "print('MAD = ',mad) \n", "print('MSE = ',mse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learn cubic function\n", "\n", "In order to learn the best cubic function\n", "\n", "$$\n", "heartrate = w_0+w_1 \\cdot speed +w_2 \\cdot (speed)^2 +w_3 \\cdot (speed)^3\n", "$$\n", "\n", "we repeat the steps for `deg=3`:" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Calculated weights:\n", "w0 = -1374.83\n", "w1 = 1025.96\n", "w2 = -230.63\n", "w3 = 17.44\n" ] } ], "source": [ "degree=3\n", "w=calculateWeights(features,targets,degree)\n", "print('Calculated weights:')\n", "for i in range(len(w)):\n", " print(\"w%d = %3.2f\"%(i,w[i]))" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, 
"outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(10,8))\n", "plt.scatter(features,targets,marker='o', color='red')\n", "plt.title('heartrate vs. speed of long distance runners')\n", "plt.xlabel('speed in m/s')\n", "plt.ylabel('heartrate in bps')\n", "RES=0.05 # resolution of speed-axis\n", "# plot calculated linear regression \n", "minS=np.min(features)\n", "maxS=np.max(features)\n", "speedrange=np.arange(minS,maxS+RES,RES)\n", "hrrange=np.zeros(speedrange.shape[0])\n", "for si,s in enumerate(speedrange):\n", " hrrange[si]=np.sum([w[d]*s**d for d in range(degree+1)])\n", "plt.plot(speedrange,hrrange)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "6.950713112061212\n", "MAD = 6.950713112061212\n", "MSE = 72.8395306512783\n" ] } ], "source": [ "pred=np.zeros(numdata)\n", "for si,x in enumerate(features):\n", " pred[si]=np.sum([w[d]*x**d for d in range(degree+1)])\n", " \n", "mad=1.0/numdata*np.sum(np.abs(pred-targets))\n", "mse=1.0/numdata*np.sum((pred-targets)**2)\n", "print(mad) \n", "print('MAD = ',mad) \n", "print('MSE = ',mse)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Same solution, now using Scikit Learn" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "degree=3" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LinearRegression()\n", "Degree = 3\n", "Learned coefficients w0, w1, w2, ....:\n", "[-1374.8269195 1025.96432274 -230.63229039 17.43724445]\n" ] } ], "source": [ "from sklearn import linear_model\n", "speed=np.transpose(np.atleast_2d(dataframe.values[:,0]))\n", "for d in range(1,degree):\n", " newcol=np.transpose(np.atleast_2d(np.power(speed[:,0],d+1)))\n", " speed=np.concatenate((speed,newcol),axis=1)\n", 
"heartrate=dataframe.values[:,1]\n", "\n", "# Train Linear Regression Model\n", "reg=linear_model.LinearRegression()\n", "reg.fit(speed,heartrate)\n", "print(reg)\n", "\n", "# Parameters of Trained Model \n", "print(\"Degree = \",degree)\n", "print(\"Learned coefficients w0, w1, w2, ....:\")\n", "wlist=[reg.intercept_]\n", "wlist.extend(reg.coef_)\n", "w=np.array(wlist)\n", "print(w)" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Plot training samples\n", "plt.figure(figsize=(10,8))\n", "plt.scatter(speed[:,0],heartrate,marker='o', color='red')\n", "plt.title('heartrate vs. speed of long distance runners')\n", "plt.xlabel('speed in m/s')\n", "plt.ylabel('heartrate in bps')\n", "#plt.hold(True)\n", "for si,s in enumerate(speedrange):\n", " hrrange[si]=np.sum([w[d]*s**d for d in range(degree+1)])\n", "plt.plot(speedrange,hrrange)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" }, "nav_menu": {}, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }