{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bayes and Naive Bayes Classifier\n", "\n", "In this notebook a parametric classifier for 1-dimensional input data is developed. The task is to predict the **category of car ($C_i$)** a customer will purchase, given the customer's **annual income ($x$)**. \n", "\n", "The classification is realized by applying Bayes' theorem, which in this context is:\n", "\n", "$$\n", "P(C_i|x)=\\frac{p(x|C_i)P(C_i)}{p(x)} = \\frac{p(x|C_i)P(C_i)}{\\sum_k p(x|C_k)P(C_k)}\n", "$$\n", "\n", "In the **training phase** the Gaussian-distributed likelihood $p(x|C_i)$ and the a-priori probability $P(C_i)$ of each of the 3 car classes $C_i$ are estimated from a sample of 27 training instances, each containing the annual income and the purchased car of a former customer. The file containing the training data can be obtained from [here](AutoKunden.txt) " ] }, { "cell_type": "markdown", "metadata": { "tags": [ "hide-input" ] }, "source": [ "Required Python modules:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:08:05.397000Z", "start_time": "2017-10-25T19:08:05.378000Z" }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "np.set_printoptions(precision=5,suppress=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Access labeled data\n", "Read customer data from file. Each row in the file represents one customer. 
The first column is the customer ID, the second column is the annual income of the customer and the third column is the class of car he or she bought: \n", "\n", "* 0 = Low Class\n", "* 1 = Middle Class\n", "* 2 = Premium Class" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "autoDF=pd.read_csv(\"AutoKunden.csv\",index_col=0) # the customer ID (first column) becomes the index\n", "autoDF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data above is used to train the classifier. **The trained model is then applied to classify customers whose annual incomes are given in the list below:**" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:08:13.451000Z", "start_time": "2017-10-25T19:08:13.446000Z" } }, "outputs": [], "source": [ "AnnualIncomeList=[25000,29000,63000,69000] # annual incomes of the customers to be classified" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training\n", "In the training phase the likelihood function $p(x|C_i)$ and the a-priori probability $P(C_i)$ must be determined for each car class $C_i$. It is assumed that the likelihoods are Gaussian (normal) distributions. Hence, for each class the **mean** and the **standard deviation** must be estimated from the given data. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>class</th><th colspan=\"2\">income</th><th>apriori</th></tr>\n",
"    <tr><th></th><th>count</th><th>mean</th><th>std</th><th></th></tr>\n",
"    <tr><th>class</th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>8</td><td>19651.625</td><td>5099.443609</td><td>0.296296</td></tr>\n",
"    <tr><th>1</th><td>12</td><td>42385.000</td><td>17406.072645</td><td>0.444444</td></tr>\n",
"    <tr><th>2</th><td>7</td><td>77884.000</td><td>10666.250044</td><td>0.259259</td></tr>\n",
"  </tbody>\n",
"</table>" ], "text/plain": [ " class income apriori\n", " count mean std \n", "class \n", "0 8 19651.625 5099.443609 0.296296\n", "1 12 42385.000 17406.072645 0.444444\n", "2 7 77884.000 10666.250044 0.259259" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "classStats=autoDF.groupby(by=\"class\").agg({\"class\":\"count\",\"income\":[\"mean\",\"std\"]})\n", "classStats[\"apriori\"]=classStats[\"class\",\"count\"].apply(lambda x:x/autoDF.shape[0])\n", "classStats" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(10,8))\n", "Aposteriori=[]\n", "x=np.arange(0,100000,100) # income grid from 0 to 99900 in steps of 100\n", "for c in classStats.index:\n", "    p=classStats[\"apriori\"].values[c]\n", "    m=classStats[\"income\"][\"mean\"].values[c]\n", "    s=classStats[\"income\"][\"std\"].values[c]\n", "    likelihood = 1.0/(s * np.sqrt(2 * np.pi))*np.exp( - (x - m)**2 / (2 * s**2) )\n", "    aposterioriMod=p*likelihood # numerator of Bayes' theorem (not yet normalized by p(x))\n", "    Aposteriori.append(aposterioriMod)\n", "    plt.plot(x,aposterioriMod,label='class '+str(c))\n", "plt.grid(True)\n", "for AnnualIncome in AnnualIncomeList: # plot vertical lines at the annual incomes for which classification is required\n", "    plt.axvline(x=AnnualIncome,color='m',ls='dashed')\n", "plt.legend()\n", "plt.xlabel(\"Annual Income\")\n", "plt.ylabel(\"Likelihood times prior\")\n", "plt.title(\"Likelihood times A-Priori Probability for all 3 classes\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification (Inference Phase)\n", "\n", "Once the model is trained, the likelihood $p(x|C_i)$ and the a-priori probability $P(C_i)$ are known for all 3 classes $C_i$. 
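As a minimal sketch of how these two quantities combine, the snippet below evaluates Bayes' theorem for a single income value. The class means, standard deviations and priors are copied from the `classStats` table above; the helper name `posterior` is an assumption introduced only for this illustration:

```python
import numpy as np

# Class statistics copied from the classStats table above (pandas std, ddof=1)
means = np.array([19651.625, 42385.0, 77884.0])
stds = np.array([5099.443609, 17406.072645, 10666.250044])
priors = np.array([8, 12, 7]) / 27.0

def posterior(x):
    # Hypothetical helper: Gaussian likelihood p(x|C_i) of income x for each class
    likelihood = np.exp(-(x - means) ** 2 / (2 * stds ** 2)) / (stds * np.sqrt(2 * np.pi))
    joint = likelihood * priors      # numerator of Bayes' theorem
    return joint / joint.sum()       # normalize by the evidence p(x)

post = posterior(25000)
print(np.round(post, 4), 'predicted class:', int(np.argmax(post)))
```

For an income of 25000 the largest posterior belongs to class 0, which agrees with the grid-based classification in this notebook.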
\n", "\n", "The most probable class is then calculated as follows: \n", "\n", "$$\n", "C_{pred} = \\underset{C_i}{\\operatorname{argmax}}\\left( \\frac{p(x|C_i) P(C_i)}{p(x)}\\right) = \\underset{C_i}{\\operatorname{argmax}}\\left( \\frac{p(x|C_i)P(C_i)}{\\sum_k p(x|C_k)P(C_k)}\\right) \n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the code-cell below, customers with annual incomes of 25000, 29000, 63000 and 69000 Euro are classified by the learned model:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:08:21.387000Z", "start_time": "2017-10-25T19:08:21.363000Z" }, "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--------------------\n", "Annual Income = 25000.00\n", "A-posteriori probability of class 0 = 0.6837\n", "A-posteriori probability of class 1 = 0.3163\n", "A-posteriori probability of class 2 = 0.0000\n", "Most probable class for customer with income 25000.00 Euro is 0 \n", "--------------------\n", "Annual Income = 29000.00\n", "A-posteriori probability of class 0 = 0.3630\n", "A-posteriori probability of class 1 = 0.6370\n", "A-posteriori probability of class 2 = 0.0000\n", "Most probable class for customer with income 29000.00 Euro is 1 \n", "--------------------\n", "Annual Income = 63000.00\n", "A-posteriori probability of class 0 = 0.0000\n", "A-posteriori probability of class 1 = 0.5797\n", "A-posteriori probability of class 2 = 0.4203\n", "Most probable class for customer with income 63000.00 Euro is 1 \n", "--------------------\n", "Annual Income = 69000.00\n", "A-posteriori probability of class 0 = 0.0000\n", "A-posteriori probability of class 1 = 0.3159\n", "A-posteriori probability of class 2 = 0.6841\n", "Most probable class for customer with income 69000.00 Euro is 2 \n" ] } ], "source": [ "for AnnualIncome in AnnualIncomeList:\n", "    print('-'*20)\n", "    print(\"Annual Income = %7.2f\"%AnnualIncome)\n", "    idx=int(round(AnnualIncome/100)) # index of this income on the grid x (step size 100)\n", "    proVal=[a[idx] for a in Aposteriori] # likelihood times prior for each class\n", "    sumProbs=np.sum(proVal) # evidence p(x)\n", "    for c,p in enumerate(proVal):\n", "        print('A-posteriori probability of class %d = %1.4f'% (c,p/sumProbs))\n", "    print('Most probable class for customer with income %5.2f Euro is %d '% (AnnualIncome,np.argmax(np.array(proVal))))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bayesian Classification with Scikit-Learn\n", "For Bayesian classification Scikit-Learn provides Naive Bayes classifiers for Gaussian-, Bernoulli- and Multinomial-distributed data. In the example above 1-dimensional Gaussian-distributed input data has been used. In this case the Scikit-Learn Naive Bayes classifier for Gaussian-distributed data, `GaussianNB`, learns essentially the same model as the classifier implemented in the previous sections of this notebook. This is demonstrated in the following code-cells:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:08:26.005000Z", "start_time": "2017-10-25T19:08:26Z" }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "from sklearn.naive_bayes import GaussianNB\n", "\n", "Income = np.atleast_2d(autoDF.values[:,0]).T\n", "labels = autoDF.values[:,1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train the Naive Bayes Classifier:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:08:30.599000Z", "start_time": "2017-10-25T19:08:30.588000Z" }, "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/plain": [ "GaussianNB()" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf=GaussianNB()\n", "clf.fit(Income,labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The parameters mean and standard deviation of the learned likelihoods are:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:08:33.088000Z", "start_time": "2017-10-25T19:08:33.079000Z" }, "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Learned mean values for each of the 3 classes: \n", " [[19651.625]\n", " [42385. ]\n", " [77884. ]]\n", "Learned standard deviations for each of the 3 classes: \n", " [[ 4770.09278]\n", " [16665.04581]\n", " [ 9875.02871]]\n", "Note that the std differs slightly from the values above, because pandas' std divides by (N-1) whereas GaussianNB divides by N\n" ] } ], "source": [ "print(\"Learned mean values for each of the 3 classes: \\n\",clf.theta_)\n", "# var_ holds the per-class variances; it replaces sigma_, which was removed in scikit-learn 1.2\n", "print(\"Learned standard deviations for each of the 3 classes: \\n\",np.sqrt(clf.var_))\n", "print(\"Note that the std differs slightly from the values above, because pandas' std divides by (N-1) whereas GaussianNB divides by N\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use the trained model for predictions" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:08:37.348000Z", "start_time": "2017-10-25T19:08:37.336000Z" }, "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Most probable class for annual income of 25000.-Euro is 0\n", "Most probable class for annual income of 29000.-Euro is 1\n", "Most probable class for annual income of 63000.-Euro is 1\n", "Most probable class for annual income of 69000.-Euro is 2\n" ] } ], "source": [ "Income=np.atleast_2d(AnnualIncomeList).T\n", "predictions=clf.predict(Income)\n", "for inc,pre in zip(AnnualIncomeList,predictions):\n", "    print(\"Most probable class for annual income of %7d.-Euro is %2d\"%(inc,pre))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `predict(input)`-method returns the estimated class for the given input. 
If the a-posteriori probability $P(C_i|\\mathbf{x})$ is of interest, the `predict_proba(input)`-method can be applied:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:08:40.840000Z", "start_time": "2017-10-25T19:08:40.818000Z" }, "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A-Posteriori for class 0: 0.682 ; class 1: 0.318 ; class 2: 0.000 for user with income 25000\n", "A-Posteriori for class 0: 0.320 ; class 1: 0.680 ; class 2: 0.000 for user with income 29000\n", "A-Posteriori for class 0: 0.000 ; class 1: 0.595 ; class 2: 0.405 for user with income 63000\n", "A-Posteriori for class 0: 0.000 ; class 1: 0.298 ; class 2: 0.702 for user with income 69000\n" ] } ], "source": [ "predictionsProb=clf.predict_proba(Income)\n", "for i,inc in enumerate(AnnualIncomeList):\n", "    print(\"A-Posteriori for class 0: %1.3f ; class 1: %1.3f ; class 2: %1.3f for user with income %7d\"%(predictionsProb[i,0], predictionsProb[i,1],predictionsProb[i,2],inc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model Accuracy on training data" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:08:44.170000Z", "start_time": "2017-10-25T19:08:44.164000Z" }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "Income=np.atleast_2d(autoDF.values[:,0]).T\n", "predictions=clf.predict(Income)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:08:44.998000Z", "start_time": "2017-10-25T19:08:44.991000Z" }, "scrolled": true, "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ True False True True True False True True True True True False\n", " True True True True False True True True True False True True\n", " False True True]\n" ] } ], "source": [ "correctClassification=predictions==labels\n", "print(correctClassification)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:08:46.949000Z", "start_time": "2017-10-25T19:08:46.943000Z" }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "numCorrect=np.sum(correctClassification)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:08:49.122000Z", "start_time": "2017-10-25T19:08:49.114000Z" }, "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy on training data is: 0.778\n" ] } ], "source": [ "accuracyTrain=float(numCorrect)/autoDF.shape[0]\n", "print(\"Accuracy on training data is: %1.3f\"%accuracyTrain)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:08:49.951000Z", "start_time": "2017-10-25T19:08:49.940000Z" }, "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/plain": [ "array([[7, 1, 0],\n", " [3, 7, 2],\n", " [0, 0, 7]])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import confusion_matrix\n", "confusion_matrix(y_true=labels,y_pred=predictions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the confusion matrix the entry $C_{i,j}$ in row $i$, column $j$ is the number of instances that belong to class $i$ but were predicted as class $j$. For example, the confusion matrix above shows that 3 instances of true class $1$ were predicted as class $0$. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross Validation\n", "The accuracy on training data should not be used for model evaluation. Instead, a model should be evaluated by determining the accuracy (or other performance measures) on data that was not used for training. 
Since only few labeled instances are available in this example, cross-validation is applied to estimate the model's accuracy:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:08:54.030000Z", "start_time": "2017-10-25T19:08:54.008000Z" }, "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0.66667 0.66667 0.8 0.8 0.8 ]\n" ] } ], "source": [ "from sklearn.model_selection import cross_val_score\n", "clf=GaussianNB()\n", "print(cross_val_score(clf,Income,labels,cv=5)) # accuracy of each of the 5 folds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Naive Bayes Classifier for Multidimensional data\n", "In the playground example above the input features were only one-dimensional: the only input feature was the annual income of a customer. The 1-dimensional case is quite unusual in practice. In the code-cell below a **Naive Bayes Classifier** is evaluated for multidimensional data. This is just to demonstrate that the same process as applied above for the simple dataset can also be applied to arbitrarily complex datasets.\n", "\n", "Again we start from Bayes' theorem:\n", "\n", "$$\n", "P(C_i|\\mathbf{x})=\\frac{p(\\mathbf{x}|C_i)P(C_i)}{P(\\mathbf{x})}.\n", "$$\n", "\n", "However, the crucial difference to the simple example above is that the input is not constituted by a single random variable $X$, but by many random variables $X_1,X_2,\\ldots,X_N$. I.e., a concrete input is a vector \n", "\n", "$$\n", "\\mathbf{x}=(x_1,x_2,\\ldots,x_N)\n", "$$\n", "\n", "The problem is then: **Of what type is the N-dimensional likelihood $p(\\mathbf{x}|C_i)$ and how to estimate this likelihood?**\n", "\n", "For the general case, where some of the input variables are discrete and others are numeric, there is no simple parametric model for the joint likelihood. Therefore, one **naively** assumes that, given the class, all input variables $X_j$ are independent of each other. 
Then the N-dimensional likelihood $p(\\mathbf{x}|C_i)$ can be factorised into $N$ 1-dimensional likelihoods, and these 1-dimensional likelihoods can be easily estimated from the given training data (as shown above). This is the widely applied **Naive Bayes Classifier:**\n", "\n", "$$\n", "P(C_i|\\mathbf{x})=\\frac{p(\\mathbf{x}|C_i)}{P(\\mathbf{x})}P(C_i) = \\frac{\\prod_{j=1}^N p(x_j|C_i)}{P(\\mathbf{x})} P(C_i)\n", "$$\n", "\n", " \n", "Below, we use the [wine dataset](wine.data). This dataset is described [here](wine.names.txt). It is also relatively small, but it contains multidimensional data. \n", "\n", "\n", "The dataset contains the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of $N=13$ constituents found in each of the three types of wine. The task is to predict the wine type (first column of the dataset) from the 13 features that have been obtained in the chemical analysis." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th><th>11</th><th>12</th><th>13</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>14.23</td><td>1.71</td><td>2.43</td><td>15.6</td><td>127</td><td>2.80</td><td>3.06</td><td>0.28</td><td>2.29</td><td>5.64</td><td>1.04</td><td>3.92</td><td>1065</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>13.20</td><td>1.78</td><td>2.14</td><td>11.2</td><td>100</td><td>2.65</td><td>2.76</td><td>0.26</td><td>1.28</td><td>4.38</td><td>1.05</td><td>3.40</td><td>1050</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>13.16</td><td>2.36</td><td>2.67</td><td>18.6</td><td>101</td><td>2.80</td><td>3.24</td><td>0.30</td><td>2.81</td><td>5.68</td><td>1.03</td><td>3.17</td><td>1185</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>14.37</td><td>1.95</td><td>2.50</td><td>16.8</td><td>113</td><td>3.85</td><td>3.49</td><td>0.24</td><td>2.18</td><td>7.80</td><td>0.86</td><td>3.45</td><td>1480</td></tr>\n",
"    <tr><th>4</th><td>1</td><td>13.24</td><td>2.59</td><td>2.87</td><td>21.0</td><td>118</td><td>2.80</td><td>2.69</td><td>0.39</td><td>1.82</td><td>4.32</td><td>1.04</td><td>2.93</td><td>735</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>173</th><td>3</td><td>13.71</td><td>5.65</td><td>2.45</td><td>20.5</td><td>95</td><td>1.68</td><td>0.61</td><td>0.52</td><td>1.06</td><td>7.70</td><td>0.64</td><td>1.74</td><td>740</td></tr>\n",
"    <tr><th>174</th><td>3</td><td>13.40</td><td>3.91</td><td>2.48</td><td>23.0</td><td>102</td><td>1.80</td><td>0.75</td><td>0.43</td><td>1.41</td><td>7.30</td><td>0.70</td><td>1.56</td><td>750</td></tr>\n",
"    <tr><th>175</th><td>3</td><td>13.27</td><td>4.28</td><td>2.26</td><td>20.0</td><td>120</td><td>1.59</td><td>0.69</td><td>0.43</td><td>1.35</td><td>10.20</td><td>0.59</td><td>1.56</td><td>835</td></tr>\n",
"    <tr><th>176</th><td>3</td><td>13.17</td><td>2.59</td><td>2.37</td><td>20.0</td><td>120</td><td>1.65</td><td>0.68</td><td>0.53</td><td>1.46</td><td>9.30</td><td>0.60</td><td>1.62</td><td>840</td></tr>\n",
"    <tr><th>177</th><td>3</td><td>14.13</td><td>4.10</td><td>2.74</td><td>24.5</td><td>96</td><td>2.05</td><td>0.76</td><td>0.56</td><td>1.35</td><td>9.20</td><td>0.61</td><td>1.60</td><td>560</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>178 rows × 14 columns</p>
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9 10 11 \\\n", "0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 \n", "1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 \n", "2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 \n", "3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 \n", "4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 \n", ".. .. ... ... ... ... ... ... ... ... ... ... ... \n", "173 3 13.71 5.65 2.45 20.5 95 1.68 0.61 0.52 1.06 7.70 0.64 \n", "174 3 13.40 3.91 2.48 23.0 102 1.80 0.75 0.43 1.41 7.30 0.70 \n", "175 3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.43 1.35 10.20 0.59 \n", "176 3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.53 1.46 9.30 0.60 \n", "177 3 14.13 4.10 2.74 24.5 96 2.05 0.76 0.56 1.35 9.20 0.61 \n", "\n", " 12 13 \n", "0 3.92 1065 \n", "1 3.40 1050 \n", "2 3.17 1185 \n", "3 3.45 1480 \n", "4 2.93 735 \n", ".. ... ... \n", "173 1.74 740 \n", "174 1.56 750 \n", "175 1.56 835 \n", "176 1.62 840 \n", "177 1.60 560 \n", "\n", "[178 rows x 14 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "wineDataFrame=pd.read_csv(\"wine.data\",header=None)\n", "wineDataFrame" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:09:02.356000Z", "start_time": "2017-10-25T19:09:02.351000Z" }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "wineData=wineDataFrame.values\n", "print(wineData.shape)\n", "\n", "features=wineData[:,1:] #features are in columns 1 to end\n", "labels=wineData[:,0] #class label is in column 0" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2017-10-25T19:09:03.207000Z", "start_time": "2017-10-25T19:09:03.168000Z" }, "tags": [ "hide-input" ] }, "outputs": [], "source": [ "clf=GaussianNB()\n", "acc=cross_val_score(clf,features,labels,cv=5)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "tags": [ 
"hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Accuracy is 0.9663492063492063\n" ] } ], "source": [ "print(\"Mean Accuracy is \",acc.mean())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "nav_menu": {}, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false }, "toc_position": { "height": "485px", "left": "0px", "right": "1068px", "top": "125px", "width": "212px" } }, "nbformat": 4, "nbformat_minor": 1 }