This is a revisit on regressions (linear regression, logistic regression, and softmax regression), focused on model inference.

Linear regression

Problem: Given a data instance \(\mathbb{x} = (x_1, x_2, \cdots, x_n)\), we want to determine the corresponding regression value \(y\), with the regression model \(\mathrm{M}\) where the linear regression weights/coefficients are \(\mathbb{\omega} = (\omega_1, \omega_2, \cdots, \omega_n)\).

Prediction formula:

Cost function: For each instance \(i\) the cost function is:

Total cost function:

In order to obtain a better model, we want to minimize the total cost, as \(\underset{\mathbb{\omega}^*}{\arg\min}\thinspace\text{cost}(\mathbb{\omega})\).

Though there is closed form solution to this problem, here I’d like to introduce the gradient descent method, a iterative way for finding a better model in each iteration.

Gradient of each instance is:

Logistic regression

Problem: Given a data instance \(\mathbb{x} = (x_1, x_2, \cdots, x_n)\), we want to do binary classification for label \(y\) with the regression model \(\mathrm{M}\) where the logistic regression weights/coefficients are \(\mathbb{\omega} = (\omega_1, \omega_2, \cdots, \omega_n)\).

Prediction formula for \( y \in \{0, 1\} \):

Prediction formula for \( y \in \{-1, 1\} \):

Cost function: For each instance when \( y \in \{0, 1\} \):

Cost function: For each instance when \( y \in \{-1, 1\} \):

Gradient of each instance when \( y \in \{0, 1\} \):

Gradient of each instance when \( y \in \{-1, 1\} \):

Softmax regression

Problem: Given a data instance \(\mathbb{x} = (x_1, x_2, \cdots, x_n) \), we want to do multiclass classification for label \(y\) with the regression model \(\mathrm{M}\) where the softmax regression weights/coefficients are \(\mathbb{\Omega}\), now it is a matrix instead of a vector.

where each \(\mathbb{\omega_k} = (\omega_{k1}, \omega_{k2}, \cdots, \omega_{kn})\).

Prediction formula with \( y \in {1, 2, \cdots, K} \):

Cost function for each instance:

Derivative of each instance w.r.t each \(\mathbb{\omega}_k\) is:

Gradient of each instance:

Special property of parameterization:

Softmax regression has an unusual property that it has a redundant set of parameters.

Explained as bellow:

Given a softmax model \(\mathrm{M}\) with weights \(\mathbb{\Omega}\), for each \(\mathbb{\omega}_k\), we subtract an arbitrary fixed vector \(\mathbb{\psi}\) from it, and so that we get a new model \(\mathrm{M}’\) with weights \(\mathbb{\Omega}’\). Now the prediction becomes:

Subtracting \(\mathbb{\psi}\) from every \(\mathbb{\omega}_k\) gives us identical prediction result, which shows that softmax regression’s parameters are redundant, or say, overparameterized. It means that for any hypothesis we fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function mapping from input \(\mathbb{x}\) to the predictions.

It also means that if the cost function is minimized by \((\mathbb{\omega}_1, \mathbb{\omega}_2, \cdots, \mathbb{\omega}_K)\), then it is also minimized by \((\mathbb{\omega}_1 - \mathbb{\psi}, \mathbb{\omega}_2 - \mathbb{\psi}, \cdots, \mathbb{\omega}_K - \mathbb{\psi})\) for any value of \(\mathbb{\psi}\). (Interestingly, the cost function is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular, which will cause a straightforward implementation of Newton’s method to run into numerical problems, thus penalty is needed for one reason.)

So we can create \(K - 1\) binary logistic regression models by choosing one class as reference or pivot.

If the last class \(K\) is selected as the reference, then the probability of the reference class is:

Since the general form of the probability is

Now, with the class \(K\) as the reference class, \(\omega_K = (0, 0, \cdots, 0)^T\), we have

In the end, the prediction formula for class \(k \lt K\) becomes:

Notice: the choice of reference class is not important in maximum likelihood. With penalised maximum likelihood, or bayesian inference, it can often be more useful to leave the probabilities over-parameterised, let the penalty choose a way of handing the over-parameterisation. This is because most penalty functions/priors are not invariant with respect to the choice of reference class.