Linear Regression
For a given set of labeled training data

\[
T = \{ (\mathbf{x}_t, r_t) \}_{t=1}^{N},
\]

where the targets \(r_t\) are numeric values, usually from \(\mathcal{R}\), and the inputs \(\mathbf{x}_t=(x_{1,t}, \ldots, x_{d,t})\) are numeric vectors of length \(d\), the goal of regression is to learn a function which maps the input vectors to the target values.
We usually assume that the targets \(r\) can be calculated as the sum of a deterministic function \(f(\mathbf{x})\) and a non-deterministic noise term \(n\),

\[
r = f(\mathbf{x}) + n,
\]

and we would like to find a good approximation \(g(\mathbf{x}|\Theta)\) for the unknown deterministic part \(f(\mathbf{x})\).
In linear regression we assume that the approximation \(g(\mathbf{x}|\Theta)\) is a linear function of type

\[
g(\mathbf{x}|\Theta) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_d x_d. \tag{1}
\]

This means that we assume a certain type of function and we want to learn the parameters

\[
\Theta = \{ w_0, w_1, \ldots, w_d \}
\]

from data, such that the corresponding \(g(\mathbf{x}|\Theta)\) is a good estimate for \(f(\mathbf{x})\).
Concerning the non-deterministic part \(n\), one often assumes that it is a Gaussian-distributed random variable with mean \(\mu=0\) and standard deviation \(\sigma\). In this case the posterior \(p(r|\mathbf{x})\) is also a Gaussian distribution with standard deviation \(\sigma\) and mean \(\mu=g(\mathbf{x} \mid \Theta)\).
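To make this data model concrete, here is a minimal NumPy sketch that generates training data according to \(r = f(\mathbf{x}) + n\); the one-dimensional function \(f\), the noise level \(\sigma\) and all numbers are chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical deterministic part f(x); chosen only for this illustration
def f(x):
    return 1.5 * x + 0.5

sigma = 0.3   # assumed noise standard deviation
N = 100       # number of training samples

# Inputs x_t and targets r_t = f(x_t) + n, with Gaussian noise n ~ N(0, sigma)
x = rng.uniform(0, 2, size=N)
r = f(x) + rng.normal(loc=0.0, scale=sigma, size=N)
```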
Maximum-Likelihood Estimation
Maximum-Likelihood Estimation (MLE) estimates the parameters \(\Theta\) such that the corresponding \(g(\mathbf{x}|\Theta)\) is the function which most likely generates the given set of training data \(T\).
Under the assumption that the noise \(n\) is a Gaussian-distributed variable with mean \(\mu=0\), one can prove that the MLE approach can be realized by minimizing the Sum of Squared Error (SSE) loss function:

\[
E(\Theta | T) = \sum_{t=1}^{N} \left( r_t - g(\mathbf{x}_t|\Theta) \right)^2. \tag{2}
\]

This means that the learning task is: determine the parameters \(w_i \in \Theta\) which minimize the SSE (function (2)).
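As a minimal sketch, the SSE loss of the linear model (1) could be computed as follows; the array names X and r are placeholders for the training inputs and targets:

```python
import numpy as np

def sse(w, X, r):
    """Sum of squared errors of the linear model g(x|w) = w[0] + w[1]*x_1 + ... + w[d]*x_d.

    X: array of shape (N, d), one input vector per row
    r: array of shape (N,), the numeric targets
    w: array of shape (d + 1,), with the bias w_0 in w[0]
    """
    predictions = w[0] + X @ w[1:]
    return np.sum((r - predictions) ** 2)
```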
In general, in order to find the value \(x\) at which a function \(f(x)\) is minimal, we have to calculate the first derivative \(f'(x)=\frac{\partial f}{\partial x}\) and determine its zeros.
Here we have an error function \(E(\Theta | T)\), which does not depend on a single variable only, but on all weights \(w_i \in \Theta\). Hence, we have to determine all partial derivatives

\[
\frac{\partial E}{\partial w_i} \quad \mbox{for } i = 0, 1, \ldots, d
\]

and set them equal to zero. This yields a system of \(d+1\) linear equations, which can be written as a matrix multiplication

\[
\mathbf{A} \cdot \mathbf{w} = \mathbf{y},
\]

and solving for \(\mathbf{w}\) yields:

\[
\mathbf{w} = \mathbf{A}^{-1} \cdot \mathbf{y}, \tag{4}
\]

where

\[
\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{pmatrix},
\qquad
\mathbf{A} = \left( \sum_{t=1}^{N} x_{i,t} \, x_{j,t} \right)_{i,j = 0, \ldots, d},
\qquad
\mathbf{y} = \left( \sum_{t=1}^{N} r_t \, x_{i,t} \right)_{i = 0, \ldots, d},
\]

with the convention \(x_{0,t} = 1\) for all \(t\).
For an efficient solution one usually calculates \(\mathbf{A}\) and \(\mathbf{y}\) as follows:

\[
\mathbf{A} = \mathbf{D}^T \cdot \mathbf{D},
\qquad
\mathbf{y} = \mathbf{D}^T \cdot \mathbf{r},
\]

where

\[
\mathbf{D} = \begin{pmatrix}
1 & x_{1,1} & x_{2,1} & \cdots & x_{d,1} \\
1 & x_{1,2} & x_{2,2} & \cdots & x_{d,2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{1,N} & x_{2,N} & \cdots & x_{d,N}
\end{pmatrix}
\qquad \mbox{and} \qquad
\mathbf{r} = \begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_N \end{pmatrix}
\]

are the matrix of all training inputs (one row per training sample, with a leading \(1\) for the bias weight \(w_0\)) and the vector of all training targets.
By expressing \(\mathbf{A}\) and \(\mathbf{y}\) in equation (4) in terms of \(\mathbf{D}\) and \(\mathbf{r}\), the weights which minimize the loss function are:

\[
\mathbf{w} = \left( \mathbf{D}^T \cdot \mathbf{D} \right)^{-1} \cdot \mathbf{D}^T \cdot \mathbf{r}. \tag{5}
\]
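Equation (5) can be sketched directly in NumPy; the function name fit_ols and the array names X and r are chosen here only for illustration:

```python
import numpy as np

def fit_ols(X, r):
    """Ordinary Least Squares: solve D^T D w = D^T r for the weights w.

    X: array of shape (N, d), one training input per row
    r: array of shape (N,), the training targets
    Returns w of shape (d + 1,), with the bias weight w_0 in w[0].
    """
    N = X.shape[0]
    D = np.hstack([np.ones((N, 1)), X])   # prepend the constant column for w_0
    # Solving the normal equations is numerically preferable to forming the inverse explicitly
    w = np.linalg.solve(D.T @ D, D.T @ r)
    return w
```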
Note
The approach described above is the classical form of linear regression. In the literature it is often called Ordinary Least Squares (OLS). In scikit-learn the name of the corresponding OLS estimator is simply LinearRegression.
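A short usage sketch of scikit-learn's LinearRegression on synthetic data (the data itself is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic one-dimensional training data (illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(100, 1))
r = 1.5 * X[:, 0] + 0.5 + rng.normal(scale=0.3, size=100)

ols = LinearRegression()          # fits the bias w_0 by default (fit_intercept=True)
ols.fit(X, r)

print(ols.intercept_)             # estimate of w_0
print(ols.coef_)                  # estimates of w_1, ..., w_d
print(ols.predict([[1.0]]))       # g(x|Theta) for a new input
```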
Generalized Linear Regression
With linear regression one can learn not only linear functions \(g(\mathbf{x})\) of type (1). Since we are free to preprocess the input vectors \(\mathbf{x}\) with an arbitrary number \(z\) of preprocessing functions \(\Phi_i\) of arbitrary type (linear and non-linear), a Generalized Linear Regression of type

\[
g(\mathbf{x}|\Theta) = w_0 + w_1 \Phi_1(\mathbf{x}) + w_2 \Phi_2(\mathbf{x}) + \ldots + w_z \Phi_z(\mathbf{x})
\]

can be learned. Note that this is still called linear regression, because the model is still linear in the parameters \(w_i\).
Example: Generalized Linear Regression with quadratic basis functions
Assume that the input vectors are of dimension \(d=2\), i.e. \(\mathbf{x}=(x_1,x_2)\), and we would like to learn a quadratic function. For this we can define a set of functions \(\Phi_i\):

\[
\Phi_1(\mathbf{x}) = x_1, \quad
\Phi_2(\mathbf{x}) = x_2, \quad
\Phi_3(\mathbf{x}) = x_1^2, \quad
\Phi_4(\mathbf{x}) = x_2^2, \quad
\Phi_5(\mathbf{x}) = x_1 \cdot x_2.
\]

The learning task is then to determine the weights \(w_i\) of the polynomial

\[
g(\mathbf{x}|\Theta) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 \cdot x_2,
\]

which yields the minimum loss.
Finding the weights \(w_i\) which minimize the loss function can be performed in the same way as described above. One just has to replace all \(x_{i,t}\) by \(\Phi_i(\mathbf{x}_t)\). In particular, \(\mathbf{w}\) can be calculated as in equation (5), but now the matrix \(\mathbf{D}\) is

\[
\mathbf{D} = \begin{pmatrix}
1 & \Phi_1(\mathbf{x}_1) & \Phi_2(\mathbf{x}_1) & \cdots & \Phi_z(\mathbf{x}_1) \\
1 & \Phi_1(\mathbf{x}_2) & \Phi_2(\mathbf{x}_2) & \cdots & \Phi_z(\mathbf{x}_2) \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \Phi_1(\mathbf{x}_N) & \Phi_2(\mathbf{x}_N) & \cdots & \Phi_z(\mathbf{x}_N)
\end{pmatrix}.
\]
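In scikit-learn this preprocessing can be sketched with PolynomialFeatures, which generates such polynomial terms automatically; the synthetic data below is only for illustration:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic 2-dimensional inputs and targets (illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
r = 1.0 + 2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

# PolynomialFeatures(degree=2) produces x1, x2, x1^2, x1*x2, x2^2;
# LinearRegression then learns the weights w_i of the quadratic model
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(X, r)
print(model.named_steps["linearregression"].coef_)
```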
Regularisation
In Machine Learning, regularisation is a technique to avoid overfitting. With regularisation the weights are learned such that they not only minimize a loss function on the training data (e.g. the Mean Squared Error) but simultaneously have absolute values that are as low as possible. This additional restriction - absolute values of the weights shall be low - yields better generalisation, because functions with lower coefficients \(w_i\) are smoother. However, the challenge is to find a good trade-off between minimizing the error on the training data and minimizing the weights \(w_i\). If too much emphasis is put on the weight minimisation, the learned function may be too simple, i.e. it is underfitted to the training data.
The different regularisation methods described below learn the weights by minimizing the training error and a regularisation term simultaneously:

\[
\min_{\mathbf{w}} \; E(\mathbf{w}, T) + \lambda \cdot R(\mathbf{w}), \tag{10}
\]

where \(E(\mathbf{w}, T)\) is the error function on the training data \(T\) and \(R(\mathbf{w})\) is the regularisation term.
The different techniques described below all perform linear regression, but differ in the regularisation term used. The hyperparameter \(\lambda\) controls the trade-off between error minimisation and weight minimisation.
Ridge Regression
In Ridge Regression the error function \(E(\mathbf{w},T)\) in equation (10) is the MSE and the regularisation term is the squared L2-norm, i.e. Ridge Regression minimizes

\[
\frac{1}{N} \sum_{t=1}^{N} \left( r_t - g(\mathbf{x}_t|\mathbf{w}) \right)^2 + \lambda \cdot ||\mathbf{w}||_2^2,
\]

where the p-norm \(||\mathbf{w}||_p\) is defined to be

\[
||\mathbf{w}||_p = \left( \sum_{i} |w_i|^p \right)^{\frac{1}{p}}.
\]
For Ridge Regression, scikit-learn provides the class Ridge.
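A minimal usage sketch of Ridge; in scikit-learn the regularisation strength \(\lambda\) corresponds to the constructor argument alpha, and the synthetic data is only for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Small synthetic data set (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
r = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

# alpha plays the role of the regularisation strength lambda above
ridge = Ridge(alpha=1.0)
ridge.fit(X, r)
print(ridge.coef_, ridge.intercept_)
```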
Lasso
Lasso regularisation yields sparse weights. This means that many weights are zero or very close to zero, and only the weights which belong to the most important features are non-zero. This sparsity is achieved by applying the L1-norm as regularisation term.
For Lasso Regression, scikit-learn provides the class Lasso.
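A minimal sketch with Lasso on synthetic data in which only two of six features are relevant; the printed coefficient vector typically contains exact zeros for the irrelevant features:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only features 0 and 2 influence the target (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
r = 3.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1)   # alpha is the regularisation strength
lasso.fit(X, r)
print(lasso.coef_)         # weights of irrelevant features are driven to (almost) exactly zero
```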
Elastic-Net
Elastic-Net regularisation applies a regularisation term, which is a weighted sum of L1- and L2-norm.
For Elastic-Net Regression, scikit-learn provides the class ElasticNet.
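A minimal sketch with ElasticNet; alpha sets the overall regularisation strength and l1_ratio weights the L1 part against the L2 part of the penalty (l1_ratio=1.0 corresponds to Lasso, l1_ratio=0.0 to a Ridge-like penalty); the data is again synthetic:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
r = 3.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# alpha: overall regularisation strength, l1_ratio: mix between L1 and L2 penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, r)
print(enet.coef_, enet.intercept_)
```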