Posted on June 4, 2017

Basic Principles of Machine Learning,
Regressions and Classifications

Ryan J. Kung

ryankung(at)ieee.org

I Introduction

1.1 What is Machine Learning[1]

Two definitions of Machine Learning are offered:

Arthur Samuel described it as: “the field of study that gives computers the ability to learn without being explicitly programmed.” This is an older, informal definition.

Tom Mitchell provides a more modern definition: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

1.1.1 Supervised Learning and Unsupervised Learning

Supervised Learning[2]

Unsupervised Learning[3]

1.2 Model and Cost Function

Input and Output

Hypothesis Function

Note that the \(hypothesis\ function\) is denoted as \(\phi(x)\) in the Deep Learning book.

Cost Function[4]

Cost Function :: MSE

This function is otherwise called the “Squared error function”, or “Mean squared error”.

\[\begin{equation} J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i}- y_{i} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2 \end{equation}\]
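As a small illustration, this Octave sketch (a hypothetical computeCost helper) evaluates the cost above, assuming X is the \(m \times (n+1)\) design matrix with a leading column of ones:

function J = computeCost(X, y, theta)
  % Squared-error cost: y is m x 1, theta is (n+1) x 1.
  m = length(y);
  errors = X * theta - y;               % h_theta(x^(i)) - y^(i) for every i
  J = (1 / (2 * m)) * sum(errors .^ 2);
end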

1.3 Gradient Descent

To minimize the cost function \(J(\theta)\), we keep changing \(\theta\) to reduce \(J(\theta)\).

repeat until convergence { \[\begin{equation} \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \end{equation}\]

}

Where \(\alpha\) is the learning rate, and all of the \(\theta_j\) should be updated simultaneously.
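To make the simultaneous update concrete, here is a minimal Octave sketch for the squared-error cost above (a hypothetical gradientDescent helper; X again carries a leading column of ones):

function theta = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    grad = (1 / m) * X' * (X * theta - y); % all partial derivatives at once
    theta = theta - alpha * grad;          % every theta_j updated simultaneously
  end
end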

II Linear Regression with Multiple Variables

2.1 Multiple Features

Notation:

Hypothesis:

We suppose that:

\[\begin{align*} x_0^{(i)}=1\\ \end{align*}\]

So:

\[\begin{align*} h_{\theta}(x) &= \theta_0 + \theta_1x_1 + \theta_2x_2 +...+\theta_nx_n\\ &=\theta_0x_0 + \theta_1x_1 + \theta_2x_2 +...+\theta_nx_n \\&=\begin{bmatrix}\theta_0 \hspace{2em} \theta_1 \hspace{2em} ... \hspace{2em} \theta_n\end{bmatrix}\begin{bmatrix}x_0 \newline x_1 \newline \vdots \newline x_n\end{bmatrix}\\ &= \Theta^T X \end{align*}\]

Feature Scaling

\[\begin{align*} x_i := \dfrac{x_i - \mu_i}{s_i} \end{align*}\]

Where \(\mu_i\) is the average of all values for feature \(i\), and \(s_i\) is either the range of values \((\max - \min)\) or the standard deviation.[5]
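A minimal Octave sketch of this normalization, assuming X holds the raw features (one column per feature) and relying on Octave's automatic broadcasting:

mu = mean(X);            % 1 x n row vector of per-feature means
s = max(X) - min(X);     % per-feature range; std(X) works as well
X_norm = (X - mu) ./ s;  % normalize every row of X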

Learning Rate

Plot \(J(\theta)\) against the iteration number to make sure gradient descent is working correctly: \(J(\theta)\) should decrease on every iteration and flatten out as it converges.

To choose \(\alpha\) try:

…, 0.001, 0.01, …, 0.1, 1, …

Polynomial Regression

Our hypothesis function need not be linear (a straight line) if that does not fit the data well.

We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).

Linear Function -> Quadratic Function -> Cubic Function -> …
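For instance, to fit a cubic in a single feature \(x\) (an \(m \times 1\) column), one can hand-build the higher-order columns; note that feature scaling becomes important here, since \(x^3\) can dwarf \(x\):

X_poly = [ones(m, 1), x, x .^ 2, x .^ 3]; % h = theta_0 + theta_1 x + theta_2 x^2 + theta_3 x^3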

2.2 Computing Parameters Analytically

Normal Equation

Normal Equation: a method to solve for \(\theta\) analytically.

Given the cost

\[\begin{align*} J(\theta)=\frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})^2 \end{align*}\]

set

\[\begin{align*} \frac{\partial}{\partial \theta_j} J(\theta) = 0 \end{align*}\]

(for every \(j\)) and solve for \(\theta\), which gives

\[\begin{align*} \theta = (X^TX)^{-1}X^Ty \end{align*}\]

In Octave:

theta = pinv(X' * X) * X' * y;

The following is a comparison of gradient descent and the normal equation:

Gradient Descent | Normal Equation
--- | ---
Need to choose \(\alpha\) | No need to choose \(\alpha\)
Needs many iterations | No need to iterate
\(O(kn^2)\) | \(O(n^3)\), need to calculate inverse of \(X^TX\)
Works well when \(n\) is large | Slow if \(n\) is very large

Normal Equation Non-invertibility

If \(X^TX\) is non-invertible, the common causes are redundant (linearly dependent) features or too many features (\(m \leq n\)); deleting some features or using regularization fixes this. In Octave, pinv computes the pseudo-inverse and still yields a usable \(\theta\).

III Logistic Regression

3.1 Classification and Representation

Classification

\(y \in \{0, 1\}\)

Threshold the classifier output \(h_{\theta}(x)\) at 0.5:

if \(h_{\theta}(x) \geq 0.5\), predict “\(y=1\)”

if \(h_{\theta}(x) < 0.5\), predict “\(y=0\)”

For classification, \(y = 0, 1\), but linear regression's \(h(x)\) can be \(>1\) or \(<0\); logistic regression keeps the hypothesis in \([0, 1]\).

Hypothesis Representation

Logistic Regression Model

Want

\[\begin{align*} h_{\theta}(x) \in [0, 1] \end{align*}\] \[\begin{align*} h_{\theta}(x)=g(\theta^Tx) \end{align*}\]

With the sigmoid function (or logistic function; also called the activation function in the neural network case):

\[\begin{align*} g(z)=\frac{1}{1+e^{-z}}\\ h_{\theta}(x)=\frac{1}{1+e^{-\theta^Tx}} \end{align*}\] \[\begin{align*} h_{\theta}(x)=P(y=1\ |\ x;\theta) \end{align*}\]
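A one-line Octave sketch of the sigmoid; exp acts element-wise, so this works on vectors and matrices too:

function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z)); % element-wise logistic function
end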

Decision Boundary

predict “\(y=1\)” if \(h_{\theta}(x)\geq0.5\)

predict “\(y=0\)” if \(h_{\theta}(x)<0.5\)

\(g(z)\geq0.5\) when \(z\geq0\)

\(h_{\theta}(x)=g(\theta^Tx)\geq0.5\) whenever \(\theta^Tx\geq 0\)

Predict \(y=1\) if \(\theta^Tx\geq 0\)

Cost Function

\[\begin{align*} J(\theta) = \frac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_{\theta}(x^{(i)}), y^{(i)}) \end{align*}\]

If we used the linear regression cost function as the cost function for logistic regression, it would be a non-convex function of \(\theta\).

Logistic regression cost function:

\[\begin{align*}& J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \; & \text{if y = 1} \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \; & \text{if y = 0}\end{align*}\]

which can be rewritten as:

\[\begin{align*} J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))] \end{align*}\]

This cost function can be derived from statistics using the principle of maximum likelihood estimation[7].
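A vectorized Octave sketch of this cost, assuming the sigmoid helper above:

h = sigmoid(X * theta);                               % m x 1 vector of predictions
J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)); % both log terms at once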

Advanced Optimization

“Conjugate gradient”, “BFGS”, and “L-BFGS” are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they’re already tested and highly optimized. Octave provides them.
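In Octave these optimizers are reached through fminunc. A usage sketch, assuming a hypothetical costFunction(theta, X, y) that returns both the cost and its gradient:

options = optimset('GradObj', 'on', 'MaxIter', 400); % we supply the gradient ourselves
initial_theta = zeros(n + 1, 1);
[theta, cost] = fminunc(@(t) costFunction(t, X, y), initial_theta, options);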

3.2 Multiclass Classification

One-vs-all: train a separate binary logistic regression classifier \(h_\theta^{(i)}(x)\) for each class \(i\), then predict the class whose classifier outputs the highest probability:

\[\begin{align*}& y \in \lbrace0, 1 ... n\rbrace \newline& h_\theta^{(0)}(x) = P(y = 0 | x ; \theta) \newline& h_\theta^{(1)}(x) = P(y = 1 | x ; \theta) \newline& \cdots \newline& h_\theta^{(n)}(x) = P(y = n | x ; \theta) \newline& \mathrm{prediction} = \max_i( h_\theta ^{(i)}(x) )\newline\end{align*}\]
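A prediction sketch in Octave, assuming all_theta stacks one row of fitted parameters per class:

probs = sigmoid(X * all_theta');            % m x num_labels matrix of probabilities
[max_prob, prediction] = max(probs, [], 2); % index of the most probable class per row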

3.3 The Problem of Overfitting

overfitting is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

There are two main options to address the issue of overfitting: reduce the number of features (manually, or with a model selection algorithm), or use regularization, which keeps all the features but shrinks the parameters \(\theta_j\).

Since we may have a lot of features (a hundred, maybe) and don't know in advance which \(\theta_j\) to shrink, the basic idea of regularization is to penalize the summation over all of \(\theta_1, ..., \theta_n\).

Note that we do not shrink \(\theta_0\); in practice this makes very little difference to the result.

\[\begin{align*} min_\theta\ \dfrac{1}{2m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\ \sum_{j=1}^n \theta_j^2 \end{align*}\]

The λ, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated.

Regularized Linear Regression

On Gradient Descent
\[\begin{align*} & \text{Repeat}\ \lbrace \newline & \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \newline & \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] &\ \ \ \ \ \ \ \ \ \ j \in \lbrace 1,2...n\rbrace\newline & \rbrace\\ \end{align*}\]

The update for \(\theta_j\) can also be written as: \(\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\)

The factor in the first term, \(1 - \alpha\frac{\lambda}{m}\), will usually be less than 1, because \(\alpha\lambda/m\) is positive; and if the learning rate is small and \(m\) is large, it is only slightly less than 1[6].

Because \(1 - \alpha\frac{\lambda}{m}\) is less than 1, every iteration shrinks \(\theta_j\) a little before taking the usual gradient step, so over the iterations \(\theta_j\) is steadily pushed toward smaller values.
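A vectorized Octave sketch of one regularized update; theta(1) holds \(\theta_0\) and is not shrunk:

grad = (1 / m) * X' * (X * theta - y); % unregularized gradient for all j
theta(1) = theta(1) - alpha * grad(1); % theta_0: ordinary update
theta(2:end) = theta(2:end) * (1 - alpha * lambda / m) - alpha * grad(2:end);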

On Normal Equation
\[\begin{align*}& \theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty \newline& \text{where}\ \ L = \begin{bmatrix} 0 & & & & \newline & 1 & & & \newline & & 1 & & \newline & & & \ddots & \newline & & & & 1 \newline\end{bmatrix}\end{align*}\]

Note that \(L\) is an \((n+1) \times (n+1)\) matrix.

If \(m < n\), then \(X^TX\) is non-invertible. However, when we add the term \(\lambda \cdot L\), \(X^TX + \lambda \cdot L\) becomes invertible.
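An Octave sketch; \(L\) is the identity with its top-left entry zeroed:

L = eye(n + 1);
L(1, 1) = 0;                                % do not regularize theta_0
theta = pinv(X' * X + lambda * L) * X' * y;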

Regularized Logistic Regression

Cost Function
\[\begin{align*} J(\theta) = - \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2 \end{align*}\]

Gradient Descent

\[\begin{align*} & \text{Repeat}\ \lbrace \newline & \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \newline & \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] &\ \ \ \ \ \ \ \ \ \ j \in \lbrace 1,2...n\rbrace\newline & \rbrace\\ \end{align*}\]

IV Neural Network

\(a_i^{(j)}\)= “activation” of unit \(i\) in layer \(j\).

\(\Theta^{(j)}\) = matrix of weights controlling function mapping from layer \(j\) to layer \(j+1\).

If a network has \(s_j\) units in layer \(j\) and \(s_{j+1}\) units in layer \(j+1\), then \(\Theta^{(j)}\) will be of dimension \(s_{j+1} \times (s_j+1)\)

\(L\) = total number of layers in the neural network

\(s_l\) = number of units (not counting the bias unit) in layer \(l\)

\(K\) = the number of units in the output layer.

Cost Function

\[\begin{gather*} J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2\end{gather*}\]

This is similar to the logistic regression cost function:

\[\begin{gather*} J(\theta) = - \frac{1}{m} \sum_{i=1}^m [ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2 \end{gather*}\]

Note:

Focusing on a single example \((x^{(i)}, y^{(i)})\), the case of a single output unit, and ignoring regularization (\(\lambda=0\)):

\[\begin{gather*} cost(i) = y^{(i)}\log h_{\Theta}(x^{(i)}) + (1-y^{(i)})\log (1-h_{\Theta}(x^{(i)})) \end{gather*}\]

We can think of \(cost(i)\approx (h_{\Theta}(x^{(i)})-y^{(i)})^2\), which plays the role of the \(MSE\), as a measure of how well the network is doing on example \(i\).

Gradient computation

\[\begin{gather*} J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2\end{gather*}\]

Goal:

\[\begin{gather*} \min_\Theta J(\Theta) \end{gather*}\]

Need code to compute:

\[\begin{gather*} J(\Theta)\ {and}\ \dfrac{\partial}{\partial \Theta_{i,j}^{(l)}}J(\Theta) \end{gather*}\]

Backpropagation algorithm:

Backpropagation is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression.

Intuition: \(\delta_j^{(l)}\) = “error” of cost for \(a_j^{(l)}\) (the activation value on node \(j\) in layer \(l\)):

Formally, \(\delta_j^{(l)}=\frac{\partial}{\partial{z_j^{(l)}}}cost(i)\) (for \(j \geq 0\)), where:

\[\begin{gather*} cost(i) = y^{(i)}\log h_{\Theta}(x^{(i)}) + (1-y^{(i)})\log (1-h_{\Theta}(x^{(i)})) \end{gather*}\]
As an example, applying the backpropagation algorithm to a network with \(L=4\):
\[\begin{gather*} \delta_j^{(4)}=a_j^{(4)}-y_j=(h_{\Theta}(x))_j-y_j \end{gather*}\]

Each of \(\delta^{(4)}\), \(a^{(4)}\) and \(y\) is a vector whose dimension is equal to \(K\).

\[\begin{gather*} \delta^{(3)}=(\Theta^{(3)})^T\delta^{(4)}.*g'(z^{(3)}) \end{gather*}\] \[\begin{gather*} \delta^{(2)}=(\Theta^{(2)})^T\delta^{(3)}.*g'(z^{(2)}) \end{gather*}\]

The term \(g'(z^{(l)})\) is the derivative of the activation function \(g\) evaluated at the input values given by \(z^{(l)}\); for the sigmoid, \(g'(z^{(l)}) = a^{(l)} .* (1-a^{(l)})\).
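A sketch of this derivative in Octave (a hypothetical sigmoidGradient helper, reusing the sigmoid above):

function gprime = sigmoidGradient(z)
  g = sigmoid(z);
  gprime = g .* (1 - g); % element-wise sigmoid derivative
end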

So we can describe the algorithm as:

Training set \(\lbrace(x^{(1)}, y^{(1)}),...,(x^{(m)},y^{(m)})\rbrace\)

Set \(\Delta^{(l)}_{ij}=0\) (for all \(l, i, j)\)

For \(i=1\) to \(m\) {

— Set \(a^{(1)} = x^{(i)}\)

— Perform forward propagation to compute \(a^{(l)}\) for \(l=2,3,...,L\)

— Using \(y^{(i)}\), compute \(\delta^{(L)}=a^{(L)}-y^{(i)}\)

— Compute \(\delta^{(L-1)},\delta^{(L-2)},...,\delta^{(2)}\)

— \(\Delta^{(l)}_{ij}:=\Delta^{(l)}_{ij}+a^{(l)}_j\delta^{(l+1)}_i\) (vectorized: \(\Delta^{(l)}:=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T\))

}

\(D^{(l)}_{ij}:=\frac{1}{m}\left(\Delta^{(l)}_{ij}+\lambda\Theta^{(l)}_{ij}\right)\) if \(j \neq 0\)

\(D^{(l)}_{ij}:=\frac{1}{m}\Delta^{(l)}_{ij}\) if \(j = 0\)

Finally we got:

\[\begin{gather*} \frac{\partial}{\partial{\Theta}_{ij}^{(l)}}J(\Theta)=D^{(l)}_{ij} \end{gather*}\]
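To make the inner loop concrete, here is an Octave sketch for a single training example in a 3-layer network, assuming weight matrices Theta1 and Theta2, the sigmoid/sigmoidGradient helpers above, and accumulators Delta1, Delta2 initialized to zeros:

a1 = [1; x];                               % input activations plus bias unit
z2 = Theta1 * a1;
a2 = [1; sigmoid(z2)];                     % hidden activations plus bias unit
z3 = Theta2 * a2;
a3 = sigmoid(z3);                          % output h_Theta(x), dimension K
delta3 = a3 - y;                           % output-layer error
d2 = Theta2' * delta3;
delta2 = d2(2:end) .* sigmoidGradient(z2); % drop the bias row
Delta1 = Delta1 + delta2 * a1';            % accumulate gradient contributions
Delta2 = Delta2 + delta3 * a2';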

Gradient Checking

Gradient checking will assure that our backpropagation works as intended. We can approximate the derivative of our cost function with:

\[\begin{gather*} \dfrac{\partial}{\partial\Theta}J(\Theta) \approx \dfrac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon} \end{gather*}\]

With multiple theta matrices, we can approximate the derivative with respect to \(\Theta_j\) as follows:

\[\begin{gather*} \dfrac{\partial}{\partial\Theta_j}J(\Theta) \approx \dfrac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}\end{gather*}\]

A small value for \(\epsilon\) such as \(\epsilon=10^{-4}\) guarantees that the math works out properly. If the value for \(\epsilon\) is too small, we can end up with numerical problems.

Hence, we are only adding or subtracting epsilon to the \(\Theta_j\) matrix. In Octave we can do it as follows:

epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;   % nudge the i-th parameter up
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;  % and down
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end;

We previously saw how to calculate the deltaVector. So once we compute our gradApprox vector, we can check that \(gradApprox \approx deltaVector\).

Once you have verified once that your backpropagation algorithm is correct, you don’t need to compute gradApprox again. The code to compute gradApprox can be very slow.