Svm andrew ng


Question concerning SVMs in machine learning course CS229 by Andrew Ng

On page 12, the author uses the claim that the gradient of the Lagrangian with respect to the non-constraint variables is zero. Why is this true? When we're trying to minimize the Lagrangian for fixed $\alpha$ (a constraint variable), how does it follow that the minimum is attained at a stationary point, and that the function isn't perhaps unbounded? Does it follow from convexity of the Lagrangian (convexity implies that a local minimum is global)? Or does it have something to do with the KKT conditions?
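A sketch of why the stationarity condition is legitimate in the SVM case, assuming the page in question treats the hard-margin SVM Lagrangian in its standard form (the notation below is the usual one and may differ slightly from the notes):

```latex
\mathcal{L}(w,b,\alpha) = \tfrac{1}{2}\|w\|^2
  - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)}\left(w^T x^{(i)} + b\right) - 1 \right],
  \qquad \alpha_i \ge 0

\nabla_w \mathcal{L} = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0
  \;\Longrightarrow\; w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}

\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{m} \alpha_i y^{(i)}
```

For fixed $\alpha$ the Lagrangian is a strictly convex quadratic in $w$ (its Hessian is the identity), so the unique stationary point in $w$ is the global minimizer. In $b$ it is linear: if $\sum_i \alpha_i y^{(i)} \neq 0$ the infimum over $b$ is $-\infty$, so such $\alpha$ can never attain the outer maximum. This is why $\sum_i \alpha_i y^{(i)} = 0$ shows up as a constraint in the dual rather than as a consequence of minimization.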

Could someone give me a counterexample of type:

$\textrm{max}_{\lambda} \textrm{min}_x f(x,\lambda) \neq \textrm{max}_{\lambda} \tilde{f}(\lambda)$

where $\tilde{f}(\lambda)$ is chosen to be $f(x_0,\lambda)$ for some $x_0$ that is a local minimum of $f$ with $\lambda$ held fixed (assuming that a local minimum exists for each $\lambda$). I think a function like $f(x,\lambda) = x^3$ could be an example.
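One possible counterexample of this type, offered as a sketch (the function below is chosen for illustration; note that $x^3$ itself has only an inflection point at $x = 0$, not a local minimum):

```latex
f(x,\lambda) = x^3 - 3x - \lambda^2

\frac{\partial f}{\partial x} = 3x^2 - 3 = 0 \;\Longrightarrow\; x = \pm 1,
\qquad \frac{\partial^2 f}{\partial x^2}\Big|_{x=1} = 6 > 0

\tilde{f}(\lambda) = f(1,\lambda) = -2 - \lambda^2,
\qquad \max_{\lambda} \tilde{f}(\lambda) = -2

\min_x f(x,\lambda) = -\infty \ \text{for every } \lambda
\;\Longrightarrow\; \max_{\lambda} \min_x f(x,\lambda) = -\infty \neq -2
```

Here $x_0 = 1$ is a genuine local minimum for every $\lambda$, yet $f$ is unbounded below in $x$, so replacing the inner minimization by the value at a local minimum changes the answer.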

Alternatively, in the equation above, is there some sort of neat condition on $\tilde{f}$ that could imply an equality? Suppose that the actual extremum is reached at $f(x^*,\lambda^*)=M$ and that we're interested in somehow defining $\tilde{f}$ based on $f$, using $\tilde{f}(\lambda) = f(x_0,\lambda)$ for some choice of $x_0$.

Working with $x_0$ being chosen as some (existing) stationary point with respect to a fixed $\lambda$, and assuming that $x^*$ is the unique stationary point with respect to $\lambda^*$ where the minimum is reached with respect to fixed $\lambda^*$, all that's necessary is for the value of $f$ at all the other stationary points to be lower than $M$. But I'm not sure if there's some sort of neat condition that would imply that.

asked Jan 19 '20 at 22:19

John P



Support Vector Machine

You will try using different values of the C parameter with SVMs. Informally, the C parameter is a positive value that controls the penalty for misclassified training examples. A large C parameter tells the SVM to try to classify all the examples correctly. C plays a role similar to 1 / λ, where λ is the regularization parameter that we were using previously for logistic regression.

The next part of the code will run the SVM training (with C = 1). When C = 1, you should find that the SVM puts the decision boundary in the gap between the two datasets and misclassifies the data point on the far left (Figure 2).

Your task is to try different values of C on this dataset. Specifically, you should change the value of C in the script to C = 100 and run the SVM training again. When C = 100, you should find that the SVM now classifies every single example correctly, but has a decision boundary that does not appear to be a natural fit for the data (Figure 3).
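The effect of C described above can be sketched in Python with scikit-learn. This is only an illustration: the toy points below are made up (they are not the exercise's Example Dataset 1), and `fit_linear_svm` and `results` are hypothetical names. The sketch compares the training accuracy and the norm of the learned weight vector for C = 1 and C = 100; a larger norm corresponds to a narrower margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two clusters plus one positive example (the last row)
# that sits close to the negative cluster, playing the role of the outlier.
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [1.5, 1.5], [2.0, 2.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0], [4.5, 4.5], [2.6, 2.6]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])


def fit_linear_svm(C):
    """Train a linear-kernel SVM with the given penalty parameter C."""
    clf = SVC(kernel="linear", C=C)
    clf.fit(X, y)
    return clf


results = {}
for C in (1, 100):
    clf = fit_linear_svm(C)
    results[C] = {
        "train_accuracy": clf.score(X, y),
        # margin width is 2 / ||w||, so a larger norm means a narrower margin
        "w_norm": float(np.linalg.norm(clf.coef_)),
    }
    print(C, results[C])
```

With the larger C the classifier is pushed to fit the outlier as well, at the price of a narrower margin; whether the C = 1 model misclassifies the outlier depends on the data.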

SVM with Gaussian Kernels

In this part of the exercise, you will be using SVMs to do non-linear classification. In particular, you will be using SVMs with Gaussian kernels on datasets that are not linearly separable.


Example Dataset 2

Example Dataset 3


Stanford Machine Learning


The following notes represent a complete, stand alone interpretation of Stanford's machine learning course presented by Professor Andrew Ng and originally posted on the website during the fall 2011 semester. The topics covered are shown below, although for a more detailed summary see lecture 19. The only content not covered here is the Octave/MATLAB programming.

All diagrams are my own or are directly taken from the lectures, full credit to Professor Ng for a truly exceptional lecture course.

What are these notes?

Originally written as a way for me personally to help solidify and document the concepts, these notes have grown into a reasonably complete block of reference material spanning the course in its entirety in just over 40 000 words and a lot of diagrams! The target audience was originally me, but more broadly, can be someone familiar with programming although no assumption regarding statistics, calculus or linear algebra is made. We go from the very introduction of machine learning to neural networks, recommender systems and even pipeline design. The one thing I will say is that a lot of the later topics build on those of earlier sections, so it's generally advisable to work through in chronological order.

The notes were written in Evernote, and then exported to HTML automatically. As a result I take no credit/blame for the web formatting.

How can you help!?

If you notice errors or typos, inconsistencies or things that are unclear please tell me and I'll update them. It would be hugely appreciated!
You can find me at alex[AT]holehouse[DOT]org

As requested, I've added everything (including this index file) to a .RAR archive, which can be downloaded below. For some reason Linux boxes seem to have trouble unraring the archive into separate subdirectories, which I think is because the directories are created as html-linked folders. Whatever the case, if you're using Linux and getting a "Need to override" error when extracting, I'd recommend using this zipped version instead (thanks to Mike for pointing this out). They're identical bar the compression method. [Files updated 5th June].

RAR archive - (~20 MB)

Zip archive - (~20 MB)

A changelog can be found here - Anything in the log has already been updated in the online content, but the archives may not have been - check the timestamp above.



C19 Machine Learning lectures Hilary 2015               Andrew Zisserman

Recommended books:

  • Christopher M. Bishop, "Pattern Recognition and Machine Learning" , Springer (2006), ISBN 0-38-731073-8.
  • Hastie, Tibshirani, Friedman, "Elements of Statistical Learning", Second Edition, Springer, 2009. Pdf available online.
  • Ian H. Witten and Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques" , Second Edition, 2005.
  • David MacKay, "Information Theory, Inference, and Learning Algorithms" Which is freely available online!
  • Tom Mitchell, "Machine Learning" , McGraw Hill, 1997

Web resources

Recommended Machine Learning Courses on the Web:


Support Vector Machines:


Random forests:



Dimensionality Reduction:

Software and data:


Andrew ng svm

Machine Learning theory and applications using Octave or Python.

1. Large Margin Classification

I would like to give full credit to the respective authors, as these are my personal Python notebooks taken from deep learning courses from Andrew Ng, Data School and Udemy :) This is a simple Python notebook hosted generously through Github Pages that is on my main personal notes repository. They are meant for my personal review, but I have open-sourced my repository of personal notes as a lot of people found it useful.

1a. Optimization Objective

  • So far we have seen mainly 2 algorithms, logistic regression and neural networks. There are more important aspects of machine learning:
    • The amount of training data
    • Skill of applying the algorithms
  • The SVM sometimes gives a cleaner and more powerful way to learn parameters
    • This is the last supervised learning algorithm in this introduction to machine learning
  • Alternative view of logistic regression
    • If we want hθ = 1, we need z » 0
    • If we want hθ = 0, we need z « 0
    • If y = 1, only the first term would matter
      • Graph on the left
      • When z is large, cost function would be small
      • Magenta curve is a close approximation of the log cost function
    • If y = 0, only the second term would matter
      • Magenta curve is a close approximation of the log cost function
    • Diagram of cost contributions (y-axis)
  • Support Vector Machine
    • Changes to logistic regression equation
      • We replace the first and second terms of logistic regression with the respective cost functions
      • We remove (1 / m) because it does not matter
      • Instead of A + λB, we use CA + B
        • Parameter C similar to the role (1 / λ)
        • When C = (1 / λ), the two optimization equations would give same parameters θ
  • Compared to logistic regression, it does not output a probability
    • We get a direct prediction of 1 or 0 instead
      • If θTx >= 0
        • hθ(x) = 1
      • If θTx < 0
        • hθ(x) = 0
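The CA + B objective above can be sketched in NumPy. This is a minimal illustration under stated assumptions: the notes only say the two cost terms are piecewise-linear approximations of the log costs, so the hinge forms max(0, 1 - z) and max(0, 1 + z) used here are an assumption, and the function names and toy data are hypothetical.

```python
import numpy as np


def cost1(z):
    """Cost used when y = 1 (assumed hinge form): zero once z >= 1."""
    return np.maximum(0.0, 1.0 - z)


def cost0(z):
    """Cost used when y = 0 (assumed hinge form): zero once z <= -1."""
    return np.maximum(0.0, 1.0 + z)


def svm_objective(theta, X, y, C):
    """C * A + B, where A sums the per-example costs and B is the
    regularization term. X carries a leading column of ones and
    theta[0] (the intercept) is not regularized."""
    z = X @ theta
    A = np.sum(y * cost1(z) + (1 - y) * cost0(z))
    B = 0.5 * np.sum(theta[1:] ** 2)
    return C * A + B


def predict(theta, X):
    """Direct 0/1 prediction: 1 when theta^T x >= 0, else 0."""
    return (X @ theta >= 0).astype(int)


# Hypothetical toy data: one positive and one negative example.
X = np.array([[1.0, 2.0, 0.0], [1.0, -2.0, 0.0]])
y = np.array([1, 0])

theta_wide = np.array([0.0, 1.0, 0.0])     # both examples satisfy |z| >= 1
theta_narrow = np.array([0.0, 0.25, 0.0])  # both examples violate the margin

print(svm_objective(theta_wide, X, y, C=1.0))      # only B contributes
print(svm_objective(theta_narrow, X, y, C=1.0))
print(svm_objective(theta_narrow, X, y, C=100.0))  # violations dominate
print(predict(theta_wide, X))
```

Raising C leaves B untouched and scales A, which is why a large C pushes the optimizer toward parameters with no margin violations.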

1b. Large Margin Intuition

1c. Mathematics of Large Margin Classification

  • Vector inner product
    • Brief details
      • u_transpose * v is also called the inner product
      • length of u, ||u|| = hypotenuse calculated using Pythagoras’ Theorem
    • If we project vector v on vector u (green line)
      • p = signed length of the projection of vector v onto u
        • p can be positive or negative
        • p would be negative when the angle between v and u is more than 90 degrees
        • p would be positive when the angle between v and u is less than 90 degrees
      • u_transpose * v = p . ||u|| = u1 v1 + u2 v2 = v_transpose * u
  • SVM decision boundary: introduction
    • We set the number of features, n, to 2
    • As you can see, what the SVM is minimizing is the squared norm (squared length) of the parameter vector θ, ||θ||^2
  • SVM decision boundary: projections and hypothesis
    • When θ0 = 0, the decision boundary passes through the origin
    • θ will always be at 90 degrees to the decision boundary
    • Decision boundary choice 1: graph on the left
      • p1 is the projection of the x1 example on θ (red)
        • p1 . ||θ|| >= 1
        • p1 is small, so for this to be true ||θ|| has to be large
      • p2 is the projection of the x2 example on θ (magenta)
        • p2 . ||θ|| <= -1
        • p2 is small in magnitude, so for this to be true ||θ|| has to be large
      • But our purpose is to minimise ||θ||^2
        • This decision boundary choice does not appear to be suitable
    • Decision boundary choice 2: graph on the right
      • p1 is the projection of the x1 example on θ (red)
        • p1 is much bigger, so the norm of θ, ||θ||, can be smaller
      • p2 is the projection of the x2 example on θ (magenta)
        • p2 is much bigger in magnitude, so the norm of θ, ||θ||, can be smaller
      • Hence ||θ||^2 would be smaller
      • And this is why the SVM would choose this decision boundary
      • Magnitude of the margin is the value of p1, p2, p3 and so on
        • The SVM ends up with a large margin because maximizing the margin is what allows it to minimize the squared norm of θ, ||θ||^2
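The projection argument above can be checked numerically. The following NumPy sketch uses made-up vectors and hypothetical helper names; it assumes θ0 = 0 and labels in {0, 1}, as in the notes. It first verifies that the inner product equals p times the norm of u, then computes the smallest norm of θ that satisfies the margin constraints for two candidate directions of θ.

```python
import numpy as np


def signed_projection(v, u):
    """Signed length p of the projection of v onto u, so that u.v = p * ||u||."""
    return float(np.dot(u, v) / np.linalg.norm(u))


u = np.array([4.0, 2.0])
v = np.array([3.0, 5.0])   # angle with u is below 90 degrees
w = np.array([-3.0, 1.0])  # angle with u is above 90 degrees

p_v = signed_projection(v, u)  # positive
p_w = signed_projection(w, u)  # negative


def smallest_theta_norm(direction, X, y):
    """Smallest ||theta|| along `direction` such that p_i * ||theta|| >= 1
    for y = 1 and p_i * ||theta|| <= -1 for y = 0 (theta_0 assumed to be 0)."""
    unit = direction / np.linalg.norm(direction)
    p = X @ unit                      # projections of every example on theta
    signed = np.where(y == 1, p, -p)  # must all be positive to separate
    return float(1.0 / signed.min())


# Hypothetical examples: one positive, one negative.
X = np.array([[2.0, 2.0], [-2.0, -2.0]])
y = np.array([1, 0])

norm_choice1 = smallest_theta_norm(np.array([1.0, -0.8]), X, y)  # small projections
norm_choice2 = smallest_theta_norm(np.array([1.0, 1.0]), X, y)   # large projections
print(p_v, p_w, norm_choice1, norm_choice2)
```

The direction with the larger projections admits the smaller norm of θ, which is the sense in which minimizing ||θ||^2 favours the large-margin boundary.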

2. Kernels

2a. Kernels I

  • Non-linear decision boundary
    • Given the data, is there a different or better choice of the features f1, f2, f3 … fn?
    • We also see that using high order polynomials is computationally expensive
  • Gaussian kernel
    • We will manually pick 3 landmarks (points)
    • Given an example x, we will define the features as a measure of similarity between x and the landmarks
      • f1 = similarity(x, l(1))
      • f2 = similarity(x, l(2))
      • f3 = similarity(x, l(3))
    • The similarity function used here is the Gaussian kernel
      • This kernel is often denoted as k(x, l(i))
  • Kernels and similarity
  • Kernel Example
    • As you increase sigma square
      • As you move away from l1, the value of the feature falls away much more slowly
  • Kernel Example 2
    • For the first point (magenta), you will predict 1 because hθ >= 0
    • For the second point (cyan), you will predict 0 because hθ < 0
  • We can learn complex non-linear decision boundaries
    • We predict positive when we’re close to the landmarks
    • We predict negative when we’re far away from the landmarks
  • Questions we have yet to answer
    • How do we get these landmarks?
    • How do we choose these landmarks?
    • What other similarity functions can we use beside the Gaussian kernel?
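The landmark features above can be sketched in NumPy using the Gaussian kernel exp(-||x - l||^2 / (2σ^2)) from the course. The landmark coordinates, the θ values and the helper names are hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian_kernel(x, landmark, sigma):
    """Similarity between x and a landmark: exp(-||x - l||^2 / (2 * sigma^2))."""
    x = np.asarray(x, dtype=float)
    landmark = np.asarray(landmark, dtype=float)
    return float(np.exp(-np.sum((x - landmark) ** 2) / (2.0 * sigma ** 2)))


# Three manually picked landmarks (hypothetical coordinates).
landmarks = [[3.0, 2.0], [1.0, 4.0], [5.0, 5.0]]


def kernel_features(x, sigma=1.0):
    """f1, f2, f3 = similarity of x to each landmark."""
    return np.array([gaussian_kernel(x, l, sigma) for l in landmarks])


def predict(x, theta, sigma=1.0):
    """Predict 1 when theta0 + theta1*f1 + theta2*f2 + theta3*f3 >= 0."""
    f = kernel_features(x, sigma)
    return int(theta[0] + np.dot(theta[1:], f) >= 0)


theta = np.array([-0.5, 1.0, 1.0, 0.0])  # hypothetical parameters

near_point = [3.0, 2.5]  # close to the first landmark
far_point = [8.0, 0.0]   # far from every landmark

print(kernel_features(near_point))
print(predict(near_point, theta), predict(far_point, theta))
```

A point sitting on a landmark gets a feature value of exactly 1, and increasing sigma makes the feature fall away more slowly with distance.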

2b. Kernels II

3. SVMs in Practice

Tags: machine_learning

Support Vector Machine (SVM) in 2 minutes

Andrew Ng’s Machine Learning Course in Python (Support Vector Machines)

There are two parts in this assignment. First, we will implement Support Vector Machines (SVM) on several 2D datasets to build an intuition for the algorithm and how it works. Next, we will use SVM on email datasets to try to classify spam emails.

To load the dataset, loadmat from scipy.io is used to open the .mat files

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat

mat = loadmat("ex6data1.mat")
X = mat["X"]
y = mat["y"]

Plotting of the dataset

m,n = X.shape[0],X.shape[1]
# boolean masks selecting the positive (y == 1) and negative (y == 0) examples
pos,neg= (y==1).reshape(m,1), (y==0).reshape(m,1)

We start off with a simple dataset that has a clear linear boundary between the training examples.

As recommended in the lecture, we try not to code SVM from scratch but instead make use of a highly optimized library such as sklearn for this assignment. The official documentation can be found here.

from sklearn.svm import SVC
classifier = SVC(kernel="linear")
classifier.fit(X,np.ravel(y))

Since this is a linear classification problem, we will not be using any kernel for this task. This is equivalent to using the linear kernel in SVC (note that the default kernel setting for SVC is “rbf”, which stands for radial basis function). The np.ravel() function here returns an array of shape (m,), which is what SVC requires.

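plt.scatter(X[neg[:,0],0],X[neg[:,0],1],c="y",marker="o",s=50)

# plotting the decision boundary
X_1,X_2 = np.meshgrid(np.linspace(X[:,0].min(),X[:,0].max(),num=100),np.linspace(X[:,1].min(),X[:,1].max(),num=100))
# draw the contour where the predicted label switches between 0 and 1
plt.contour(X_1,X_2,classifier.predict(np.c_[X_1.ravel(),X_2.ravel()]).reshape(X_1.shape),levels=[0.5],colors="b")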

With the default setting of C = 1.0 (remember C = 1/λ), this is the decision boundary we obtained.

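# Test C = 100
classifier2 = SVC(C=100,kernel="linear")
classifier2.fit(X,np.ravel(y))
plt.scatter(X[neg[:,0],0],X[neg[:,0],1],c="y",marker="o",s=50)

# plotting the decision boundary
X_3,X_4 = np.meshgrid(np.linspace(X[:,0].min(),X[:,0].max(),num=100),np.linspace(X[:,1].min(),X[:,1].max(),num=100))
# draw the contour where the predicted label switches between 0 and 1
plt.contour(X_3,X_4,classifier2.predict(np.c_[X_3.ravel(),X_4.ravel()]).reshape(X_3.shape),levels=[0.5],colors="b")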

Changing to C = 100 gave a decision boundary that overfits the training examples.

Next, we will look at a dataset that is not linearly separable. Here is where kernels come into play to provide us with the functionality of a non-linear classifier. For those having difficulties comprehending the concept of kernels, this article I found gave a pretty good intuition and some mathematical explanation of kernels. For this part of the assignment, we were required to complete the function to aid in the implementation of SVM with Gaussian kernels. I will be skipping this step as SVC contains its own Gaussian kernel implementation in the form of the radial basis function (rbf). Here is the Wikipedia page with the equation for rbf; as you can see, it is identical to the Gaussian kernel function from the course.

Loading and plotting of example dataset 2

mat2 = loadmat("ex6data2.mat")
X2 = mat2["X"]
y2 = mat2["y"]

m2,n2 = X2.shape[0],X2.shape[1]
pos2,neg2= (y2==1).reshape(m2,1), (y2==0).reshape(m2,1)

To implement SVM with Gaussian kernels

classifier3 = SVC(kernel="rbf",gamma=30)
classifier3.fit(X2,y2.ravel())

As for the parameters of SVC with the rbf kernel, it uses gamma instead of sigma. The documentation of the parameters can be found here. I found that gamma is similar to 1/σ but not exactly; I hope some domain expert can give me insights into the interpretation of this gamma term. For this dataset, I found that a gamma value of 30 shows the most resemblance to the optimized parameters in the assignment (sigma was 0.1 in the course).
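On the interpretation of gamma: scikit-learn's rbf kernel is exp(-gamma ||x - x'||^2), while the course's Gaussian kernel is exp(-||x - l||^2 / (2σ^2)), so matching the two formulas suggests gamma = 1 / (2σ^2). The sketch below checks this numerically with `sklearn.metrics.pairwise.rbf_kernel`; the helper names and the two sample points are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def course_gaussian_kernel(x, l, sigma):
    """Gaussian kernel as written in the course: exp(-||x - l||^2 / (2 sigma^2))."""
    return float(np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2)))


def sigma_to_gamma(sigma):
    """gamma that makes exp(-gamma * d^2) equal to the course kernel."""
    return 1.0 / (2.0 * sigma ** 2)


sigma = 0.1
x = np.array([[0.30, 0.45]])  # hypothetical points
l = np.array([[0.35, 0.40]])

gamma = sigma_to_gamma(sigma)
k_course = course_gaussian_kernel(x[0], l[0], sigma)
k_sklearn = float(rbf_kernel(x, l, gamma=gamma)[0, 0])

print(gamma, k_course, k_sklearn)
```

Under this mapping sigma = 0.1 corresponds to gamma = 50. Other settings, such as C, also shape the boundary, so a different gamma can still look visually closest to a given figure.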

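plt.scatter(X2[neg2[:,0],0],X2[neg2[:,0],1],c="y",marker="o")

# plotting the decision boundary
X_5,X_6 = np.meshgrid(np.linspace(X2[:,0].min(),X2[:,0].max(),num=100),np.linspace(X2[:,1].min(),X2[:,1].max(),num=100))
# draw the contour where the predicted label switches between 0 and 1
plt.contour(X_5,X_6,classifier3.predict(np.c_[X_5.ravel(),X_6.ravel()]).reshape(X_5.shape),levels=[0.5],colors="b")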

As for the last dataset in this part, we perform a simple hyperparameter tuning to determine the best C and gamma values to use.

Loading and plotting of examples dataset 3

mat3 = loadmat("ex6data3.mat")
X3 = mat3["X"]
y3 = mat3["y"]
Xval = mat3["Xval"]
yval = mat3["yval"]

m3,n3 = X3.shape[0],X3.shape[1]
pos3,neg3= (y3==1).reshape(m3,1), (y3==0).reshape(m3,1)

def dataset3Params(X, y, Xval, yval,vals):
    """
    Returns your choice of C and sigma. You should complete this function to return the optimal C and
    sigma based on a cross-validation set.
    """
    # note: this version searches over gamma = 1/vals rather than sigma
    acc = 0
    best_c = 0
    best_gamma = 0
    for i in vals:
        C= i
        for j in vals:
            gamma = 1/j
            classifier = SVC(C=C,gamma=gamma)
            classifier.fit(X,y)
            # accuracy on the cross-validation set
            score = classifier.score(Xval,yval)
            if score>acc:
                acc =score
                best_c =C
                best_gamma = gamma
    return best_c, best_gamma

dataset3Params iterates through the list of vals given to the function, setting C to each value in vals and gamma to 1/vals. An SVC model is constructed using each combination of parameters and the accuracy on the validation set is computed. Based on the accuracy, the best model is chosen and the values for the respective C and gamma are returned.

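vals = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
C, gamma = dataset3Params(X3, y3.ravel(), Xval, yval.ravel(),vals)
classifier4 = SVC(C=C,gamma=gamma)
classifier4.fit(X3,y3.ravel())
plt.scatter(X3[neg3[:,0],0],X3[neg3[:,0],1],c="y",marker="o",s=50)

# plotting the decision boundary
X_7,X_8 = np.meshgrid(np.linspace(X3[:,0].min(),X3[:,0].max(),num=100),np.linspace(X3[:,1].min(),X3[:,1].max(),num=100))
# draw the contour where the predicted label switches between 0 and 1
plt.contour(X_7,X_8,classifier4.predict(np.c_[X_7.ravel(),X_8.ravel()]).reshape(X_7.shape),levels=[0.5],colors="b")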

The optimal values are 0.3 for C and 100 for gamma; this results in a decision boundary similar to the one in the assignment.


Now discussing:

Support Vector Machines

Prompt Output

Figure Output

    Example Dataset 1

    SVM Decision Boundary with C = 1 (Example Dataset 1)

   SVM Decision Boundary with C = 100 (Example Dataset 1)

    SVM Decision Boundary with C = 1000 (Example Dataset 1)

    Example Dataset 2

    SVM (Gaussian Kernel) Decision Boundary (Example Dataset 2)

    Example Dataset 3

    SVM (Gaussian Kernel) Decision Boundary (Example Dataset 3)





