Posted on June 4, 2017

Basic Principles of Machine Learning,
Regressions and Classifications

Ryan J. Kung

ryankung(at)ieee.org

I Introduction

1.1 What is Machine Learning[1]

Two definitions of Machine Learning are offered:

Arthur Samuel described it as: “the field of study that gives computers the ability to learn without being explicitly programmed.” This is an older, informal definition.

Tom Mitchell provides a more modern definition: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

1.1.1 Supervised Learning and Unsupervised Learning

Supervised Learning[2]

Unsupervised Learning[3]

1.2 Model and Cost Function

Input and Output

Hypothesis Function

Note that the hypothesis function is denoted as ϕ(x) in the Deep Learning book.

Cost Function[4]

Cost Function :: MSE

This function is otherwise called the “Squared error function”, or “Mean squared error”.

$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2 = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right)^2$$
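As a quick Octave sketch (predictions and y are hypothetical m x 1 vectors holding $h_\theta(x_i)$ and the labels), this cost is one vectorized line:

m = length(y);
J = sum((predictions - y) .^ 2) / (2 * m);  % squared errors averaged over m, with the 1/2 factor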

1.3 Gradient Descent

To minimize the cost function $J(\theta)$, we keep changing $\theta$ to reduce $J(\theta)$.

repeat until convergence {

$$\theta_j := \theta_j - \alpha\,\frac{\partial}{\partial\theta_j}J(\theta)$$

}

Where $\alpha$ is the learning rate, and all $\theta_j$ should be updated simultaneously.
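A minimal Octave sketch of batch gradient descent for linear regression (assuming a design matrix X whose first column is all ones, a label vector y, and a learning rate alpha; the names are illustrative):

% X: m x (n+1) design matrix with a leading column of ones.
% y: m x 1 targets; theta: (n+1) x 1 initial parameters.
function theta = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    % X * theta - y is the vector of prediction errors; X' * errors sums
    % error * feature over all examples, so every theta_j is updated
    % simultaneously from the same old theta.
    theta = theta - (alpha / m) * (X' * (X * theta - y));
  end
end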

II Linear Regression with Multiple Variables

2.1 Multi Features

Notation:

Hypothesis:

We suppose that:

$$x_0^{(i)} = 1$$

So:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \begin{bmatrix}\theta_0 & \theta_1 & \dots & \theta_n\end{bmatrix}\begin{bmatrix}x_0 \\ x_1 \\ \vdots \\ x_n\end{bmatrix} = \Theta^T X$$

Feature Scaling

$$x_i := \frac{x_i - \mu_i}{s_i}$$

Where $\mu_i$ is the average of all values for feature $i$, and $s_i$ is the range of values ($\max - \min$), or alternatively the standard deviation.[5]
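In Octave, mean normalization of a feature matrix X (bias column excluded; illustrative names) takes a few vectorized lines:

mu = mean(X);            % 1 x n row vector of feature averages
s = max(X) - min(X);     % feature ranges; use std(X) for the standard-deviation variant
X_norm = (X - mu) ./ s;  % broadcasting scales each feature column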

Learning Rate

Plot $J(\theta)$ as a function of the number of iterations to make sure gradient descent is working correctly: $J(\theta)$ should decrease on every iteration.

To choose α try:

…, 0.001, 0.01, .., 0.1, 1, …

Polynomial Regression

Our hypothesis function need not be linear (a straight line) if that does not fit the data well.

We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).

Linear Function -> Quadratic Function -> Cubic Function -> …
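For instance, a cubic fit just adds powers of the original feature as new columns (a sketch; x is a single m x 1 feature column, and scaling matters here because x, x^2, and x^3 have very different ranges):

% h_theta(x) = theta0 + theta1*x + theta2*x^2 + theta3*x^3
X_poly = [ones(m, 1), x, x.^2, x.^3];
% Apply feature scaling to the power columns before running gradient descent.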

2.2 Computing Parameters Analytically

Normal Equation

Normal Equation: Method to solve for θ analytically.

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Set, for every $j$:

$$\frac{\partial}{\partial\theta_j}J(\theta) = 0$$

and solve for $\theta$, which gives the closed form implemented below:

$$\theta = (X^TX)^{-1}X^Ty$$

In Octave:

pinv(X'*X)*X'*Y

The following is a comparison of gradient descent and the normal equation:

Gradient Descent | Normal Equation
Need to choose $\alpha$ | No need to choose $\alpha$
Needs many iterations | No need to iterate
$O(kn^2)$ | $O(n^3)$, need to calculate the inverse of $X^TX$
Works well when $n$ is large | Slow if $n$ is very large

Normal Equation Noninvertibility

If $X^TX$ is non-invertible, the common causes are redundant features (linearly dependent columns) or too many features ($m \leq n$); deleting some features or using regularization addresses this, and Octave's pinv still returns a usable $\theta$.

III Logistic Regression

3.1 Classification and Representation

Classification

$$y \in \{0, 1\}$$

Threshold the classifier output $h_\theta(x)$ at 0.5:

if $h_\theta(x) \geq 0.5$, predict “y=1”

if $h_\theta(x) < 0.5$, predict “y=0”

For classification $y \in \{0, 1\}$, but a linear regression hypothesis $h(x)$ can output values $> 1$ or $< 0$, so a hypothesis bounded to $[0, 1]$ is needed.

Hypothesis Representation

Logistic Regression Model

Want

$$h_\theta(x) \in [0, 1], \quad h_\theta(x) = g(\theta^Tx)$$

With the sigmoid function (or logistic function, also called the activation function in the neural network case):

$$g(z) = \frac{1}{1 + e^{-z}}, \quad h_\theta(x) = \frac{1}{1 + e^{-\theta^Tx}}, \quad h_\theta(x) = P(y = 1 \mid x; \theta)$$
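A direct Octave translation of the logistic function (it works elementwise, so z may be a scalar, vector, or matrix):

function g = sigmoid(z)
  % Logistic (sigmoid) function, applied elementwise.
  g = 1 ./ (1 + exp(-z));
end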

Decision Boundary

predict “y=1” if $h_\theta(x) \geq 0.5$

predict “y=0” if $h_\theta(x) < 0.5$

$g(z) \geq 0.5$ when $z \geq 0$

$h_\theta(x) = g(\theta^Tx) \geq 0.5$ whenever $\theta^Tx \geq 0$

So: predict “y=1” if $\theta^Tx \geq 0$.

Cost Function

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right)$$

If we used the linear regression cost function as the cost function of logistic regression, it would be a non-convex function of $\theta$.

Logistic regression cost function:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right)$$

$$\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x)) \quad \text{if } y = 1$$

$$\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x)) \quad \text{if } y = 0$$

This can be rewritten as:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]$$

This cost function can be derived from statistics using the principle of maximum likelihood estimation[7].
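A vectorized Octave version of this cost, as a sketch (it assumes the sigmoid function above and a design matrix X with a leading bias column):

function J = logisticCost(theta, X, y)
  % Unregularized logistic regression cost over all m examples.
  m = length(y);
  h = sigmoid(X * theta);  % m x 1 vector of predicted probabilities
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
end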

Advanced Optimization

“Conjugate gradient”, “BFGS”, and “L-BFGS” are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. We suggest that you should not write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they’re already tested and highly optimized. Octave provides them.
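In Octave this looks roughly like the following (a sketch: costFunction is a user-supplied function, assumed here, that returns both the cost and its gradient; n is the number of features):

% 'GradObj', 'on' tells fminunc that costFunction also returns the gradient.
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(n + 1, 1);
% fminunc chooses the step size itself, so no alpha is needed.
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);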

3.2 Multiclass Classification

One-vs-all: train a separate binary logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$, then predict with the most confident one (see the sketch after the equations below):

$$y \in \{0, 1, \dots, n\}$$
$$h_\theta^{(0)}(x) = P(y = 0 \mid x; \theta)$$
$$h_\theta^{(1)}(x) = P(y = 1 \mid x; \theta)$$
$$\cdots$$
$$h_\theta^{(n)}(x) = P(y = n \mid x; \theta)$$
$$\mathrm{prediction} = \max_i\left(h_\theta^{(i)}(x)\right)$$
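In Octave, if all_theta (a hypothetical name) stacks one trained classifier per row, prediction is a max over the per-class probabilities (sigmoid as defined above):

probs = sigmoid(X * all_theta');      % m x num_labels matrix of class probabilities
[~, prediction] = max(probs, [], 2);  % index of the most confident classifier per example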

3.3 The Problem of Overfitting

Overfitting is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

There are two main options to address the issue of overfitting: reduce the number of features, or use regularization, which keeps all the features but shrinks the parameters $\theta_j$. The following focuses on regularization.

Since we may have a lot of features (a hundred, say) and don't know which $\theta_j$ to shrink, the basic idea is to add a summation over all of $\theta_1, \dots, \theta_n$ to the cost.

Note that we do not shrink $\theta_0$; excluding it makes very little difference to the result.

$$\min_\theta\ \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$

The λ, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated.

Regularized Linear Regression

On Gradient Descent:

Repeat {
$$\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$$
$$\theta_j := \theta_j - \alpha\left[\left(\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}\right) + \frac{\lambda}{m}\theta_j\right] \quad j \in \{1, 2, \dots, n\}$$
}

The update for $\theta_j$ can be rewritten as:

$$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

The factor $1 - \alpha\frac{\lambda}{m}$ in the first term is usually a little less than 1: $\alpha\lambda/m$ is positive, and with a small learning rate and a large $m$ it is usually pretty small[6].

Because $1 - \alpha\frac{\lambda}{m}$ is less than 1, $\theta_j$ shrinks a little on every iteration.

On Normal Equation:

$$\theta = \left(X^TX + \lambda L\right)^{-1}X^Ty \quad \text{where}\quad L = \begin{bmatrix}0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1\end{bmatrix}$$

Note that $L$ is an $(n+1) \times (n+1)$ matrix with a 0 in the top-left corner and 1s down the rest of the diagonal.

If $m < n$, then $X^TX$ is non-invertible. However, when we add the term $\lambda \cdot L$, then $X^TX + \lambda L$ becomes invertible.
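This translates directly to Octave (a sketch; n is the number of features, so the matrices are (n+1) x (n+1) including the bias term):

L = eye(n + 1);
L(1, 1) = 0;  % do not penalize theta_0
theta = pinv(X' * X + lambda * L) * X' * y;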

Regularized Logistic Regression

Cost Function
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

Gradient Descent

Repeat {
$$\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$$
$$\theta_j := \theta_j - \alpha\left[\left(\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}\right) + \frac{\lambda}{m}\theta_j\right] \quad j \in \{1, 2, \dots, n\}$$
}
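As a sketch, the corresponding vectorized gradient in Octave (assuming sigmoid and the usual X, y, theta, lambda) keeps $\theta_0$ out of the penalty term:

h = sigmoid(X * theta);
grad = (1 / m) * (X' * (h - y));             % unregularized gradient for every j
grad(2:end) += (lambda / m) * theta(2:end);  % add the penalty, skipping theta_0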

IV Neural Network

$a_i^{(j)}$ = “activation” of unit $i$ in layer $j$.

$\Theta^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j+1$.

If the network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$.

$L$ = total number of layers in the neural network.

$s_l$ = number of units (not counting the bias unit) in layer $l$.

$K$ = the number of units in the output layer.

Cost Function

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log\left((h_\Theta(x^{(i)}))_k\right) + \left(1 - y_k^{(i)}\right)\log\left(1 - (h_\Theta(x^{(i)}))_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{j,i}^{(l)}\right)^2$$

This is similar to the cost function for logistic regression:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

Note:

Focusing on a single example $(x^{(i)}, y^{(i)})$, the case of a single output unit, and ignoring regularization ($\lambda = 0$):

$$\mathrm{cost}(i) = y^{(i)}\log h_\Theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\Theta(x^{(i)})\right)$$

We can think of $\mathrm{cost}(i) \approx \left(h_\Theta(x^{(i)}) - y^{(i)}\right)^2$, analogous to a squared error, to judge how well the network is doing on example $i$.

Gradient computation

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log\left((h_\Theta(x^{(i)}))_k\right) + \left(1 - y_k^{(i)}\right)\log\left(1 - (h_\Theta(x^{(i)}))_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{j,i}^{(l)}\right)^2$$

Goal:

$$\min_\Theta J(\Theta)$$

Need code to compute:

$J(\Theta)$ and $\dfrac{\partial}{\partial\Theta_{i,j}^{(l)}}J(\Theta)$

Backpropagation algorithm:

Backpropagation is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression.

Intuition: $\delta_j^{(l)}$ = “error” of cost for $a_j^{(l)}$ (the activation value of node $j$ in layer $l$):

Formally, $\delta_j^{(l)} = \dfrac{\partial}{\partial z_j^{(l)}}\mathrm{cost}(i)$ (for $j \geq 0$), where:

$$\mathrm{cost}(i) = y^{(i)}\log h_\Theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\Theta(x^{(i)})\right)$$

In an example of applying the backpropagation algorithm to an ($L = 4$) network:

$$\delta_j^{(4)} = a_j^{(4)} - y_j = (h_\Theta(x))_j - y_j$$

Each of $\delta^{(4)}$, $a^{(4)}$, and $y$ is a vector whose dimension is equal to $K$.

$$\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} \;.*\; g'(z^{(3)})$$
$$\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)} \;.*\; g'(z^{(2)})$$

The term $g'(z^{(l)})$ is the derivative of the activation function $g$ evaluated at the input values given by $z^{(l)}$; for the sigmoid, $g'(z^{(l)}) = a^{(l)} \,.*\, (1 - a^{(l)})$.

So we can describe the algorithm as:

Training set: $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$

Set $\Delta_{ij}^{(l)} = 0$ (for all $l, i, j$)

For $i = 1$ to $m$ {

— Set $a^{(1)} = x^{(i)}$

— Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$

— Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$

— Compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$

— $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\,\delta_i^{(l+1)}$

}

$$D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} + \lambda\,\Theta_{ij}^{(l)} \quad \text{if } j \neq 0$$

$$D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} \quad \text{if } j = 0$$

Finally we get:

$$\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta) = D_{ij}^{(l)}$$
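A compact Octave sketch for a single-hidden-layer network (assumptions: sigmoid activations using the function above, weight matrices Theta1 and Theta2, an m x n input matrix X, and a K x m matrix Y of one-hot labels; regularization omitted):

Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for i = 1:m
  % Forward propagation for example i.
  a1 = [1; X(i, :)'];         % input activations plus bias unit
  z2 = Theta1 * a1;
  a2 = [1; sigmoid(z2)];
  a3 = sigmoid(Theta2 * a2);  % h_Theta(x^(i))
  % Backward pass: output error, then hidden-layer error.
  d3 = a3 - Y(:, i);          % delta^(L) = a^(L) - y^(i)
  % Drop the bias row of Theta2; g'(z) = sigmoid(z) .* (1 - sigmoid(z)).
  d2 = (Theta2(:, 2:end)' * d3) .* sigmoid(z2) .* (1 - sigmoid(z2));
  % Accumulate Delta_ij := Delta_ij + a_j * delta_i as outer products.
  Delta2 += d3 * a2';
  Delta1 += d2 * a1';
end
D1 = Delta1 / m;  % unregularized gradient dJ/dTheta1
D2 = Delta2 / m;  % unregularized gradient dJ/dTheta2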

Gradient Checking

Gradient checking will assure that our backpropagation works as intended. We can approximate the derivative of our cost function with:

$$\frac{\partial}{\partial\Theta}J(\Theta) \approx \frac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$$

With multiple theta matrices, we can approximate the derivative with respect to Θj as follows:

$$\frac{\partial}{\partial\Theta_j}J(\Theta) \approx \frac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}$$

A small value for $\epsilon$ such as $\epsilon = 10^{-4}$ guarantees that the math works out properly. If the value for $\epsilon$ is too small, we can end up with numerical problems.

Hence, we are only adding or subtracting epsilon to the Θj matrix. In octave we can do it as follows:

epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;    % nudge the i-th parameter up
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;   % and down
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);  % centered difference
end;

We previously saw how to calculate the deltaVector. So once we compute our gradApprox vector, we can check that gradApprox $\approx$ deltaVector.

Once you have verified once that your backpropagation algorithm is correct, you don’t need to compute gradApprox again. The code to compute gradApprox can be very slow.