Support Vector Machine (SVM)

Anh-Thi Dinh

What's the idea of SVM?

SVM (also called Maximum Margin Classifier) is an algorithm that takes the data as input and outputs a line/hyperplane that separates the classes if possible.
Suppose that we need to separate two classes of a dataset. The task is to find a line that separates them. However, there are countless lines that can do that. How can we choose the best one?
An idea of support vectors (samples on the margin) and SVM (find the optimal hyperplane).
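For reference, the "maximum margin" idea can be written as an optimization problem (a standard textbook formulation, not tied to any particular library): among the hyperplanes $$\mathbf{w}^T\mathbf{x} + b = 0$$ that separate the two classes $$y_i \in \{-1, +1\}$$, SVM picks the one with the largest margin $$2/\|\mathbf{w}\|$$, i.e.

$$\min_{\mathbf{w}, b} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 \ \text{for all } i.$$

The training points for which the constraint holds with equality lie exactly on the margin; these are the support vectors.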

Using SVM with kernel trick

Most of the time, the classes in a dataset are not linearly separable, so we cannot separate them easily in the input space. In that case we apply the kernel trick first (transform the data from the current dimension to a higher dimension) and then use SVM in that new space.
An idea of kernel and SVM. Transform from 1D to 2D. Data is not linearly separable in the input space but it is linearly separable in the feature space obtained by a kernel.
An idea of kernel and SVM. Transform from 2D to 3D. Data is not linearly separable in the input space but it is linearly separable in the feature space obtained by a kernel.
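As a toy illustration of the 1D → 2D case (the setup here is only illustrative): if the positive class lies in an interval around the origin and the negative class outside it, no single threshold on $$x$$ separates them, but the map $$\phi(x) = (x, x^2)$$ does, because in the new space the classes can be split by a horizontal line $$x^2 = c$$.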
A kernel is a dot product in some feature space: $$K(\mathbf{x}, \mathbf{z}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{z})$$.
It also measures the similarity between two points $$\mathbf{x}$$ and $$\mathbf{z}$$.
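A quick check that such a kernel really corresponds to a feature space (a standard textbook computation): in 2D, take $$\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1x_2, x_2^2)$$; then

$$\phi(\mathbf{x}) \cdot \phi(\mathbf{z}) = x_1^2z_1^2 + 2x_1x_2z_1z_2 + x_2^2z_2^2 = (\mathbf{x}^T\mathbf{z})^2,$$

so the quadratic kernel computes a dot product in a 3-dimensional feature space without ever building $$\phi(\mathbf{x})$$ explicitly.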
We have some popular kernels:
  • Linear kernel: $$K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T\mathbf{z}$$. We use kernel = 'linear' in sklearn.svm.SVC. Linear kernels are rarely used in practice.
  • Gaussian kernel (or Radial Basis Function -- RBF): $$K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)$$. It's used the most. We use kernel = 'rbf' (the default) with keyword gamma for $$\gamma$$ (must be greater than 0) in sklearn.svm.SVC.
  • Exponential kernel: $$K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x}-\mathbf{z}\|)$$.
  • Polynomial kernel: $$K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + r)^d$$. We use kernel = 'poly' with keyword degree for $$d$$ and coef0 for $$r$$ in sklearn.svm.SVC. It's more popular than RBF in NLP. The most common degree is $$d = 2$$ (quadratic), since larger degrees tend to overfit on NLP problems. (ref)
  • Hybrid kernel: $$K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + r)^d \exp(-\gamma\|\mathbf{x}-\mathbf{z}\|^2)$$ (a product of the polynomial and Gaussian kernels).
  • Sigmoid kernel: $$K(\mathbf{x}, \mathbf{z}) = \tanh(\gamma\,\mathbf{x}^T\mathbf{z} + r)$$. We use kernel = 'sigmoid' with keyword coef0 for $$r$$ in sklearn.svm.SVC.
We can also define a custom kernel by passing a callable to scikit-learn (a minimal sketch follows below).
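A minimal sketch of such a custom kernel, assuming the scikit-learn convention that the callable receives two sample matrices and returns the Gram matrix between them (the function name my_linear_kernel and the iris data are only for illustration):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

def my_linear_kernel(X, Z):
    # Gram matrix of the linear kernel K(x, z) = x^T z,
    # with shape (n_samples_X, n_samples_Z)
    return np.dot(X, Z.T)

X, y = load_iris(return_X_y=True)
clf = SVC(kernel=my_linear_kernel).fit(X, y)
clf.predict(X[:5])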
Choose whichever kernel performs best on cross-validation data, as Andrew Ng advises in his ML course.
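A minimal sketch of that advice (the dataset and the list of kernels are only illustrative): fit the same SVC with different kernels and keep the one with the best cross-validation score.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# compare kernels by their mean cross-validation accuracy
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, scores.mean())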

Good or Bad?

Advantages:
  • Compared to both logistic regression and NN, an SVM sometimes gives a cleaner way of learning non-linear functions.
  • SVM is better than an NN with 1 layer (the Perceptron Learning Algorithm) thanks to maximizing the margin between the 2 classes.
  • Accurate in high-dimensional spaces & memory efficient.
  • Good accuracy and faster prediction compared to the Naïve Bayes algorithm. (ref)
Disadvantages:
  • Prone to overfitting if the number of features is larger than the number of samples.
  • Doesn't provide probability estimates directly (see the sketch after this list).
  • Not efficient if your data is very big!
  • Works poorly with overlapping classes.
  • Sensitive to the type of kernel used.
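Regarding the probability point: scikit-learn's SVC can still return class probabilities, but only by fitting an extra calibration step (Platt scaling) on top of the SVM, enabled with probability=True, which makes training slower. A minimal sketch (the iris data is only used to make it runnable):

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# probability=True adds an internal cross-validated calibration step (Platt scaling)
clf = SVC(kernel='rbf', probability=True).fit(X, y)
clf.predict_proba(X[:5])  # class probabilities instead of hard labels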

What is SVM used for?

Some points: (ref)
  • Classification, regression and outliers detection.
  • Face detection.
  • Text and hypertext categorization.
  • Detecting spam.
  • Classification of images.
  • Bioinformatics.

Using SVM with Scikit-learn

from sklearn.datasets import load_iris
from sklearn.svm import SVC

# a small toy dataset, just so the example runs end to end
X, y = load_iris(return_X_y=True)

svc = SVC(kernel='linear')  # default = 'rbf' (Gaussian kernel)
# other kernels: 'poly', 'sigmoid', 'precomputed' or a callable

svc = svc.fit(X, y)
svc.predict(X)

# gives the support vectors
svc.support_vectors_
There are other parameters of sklearn.svm.SVC.
⚠️
In the case of linear SVM, we can also use sklearn.svm.LinearSVC. It's similar to sklearn.svm.SVC with kernel='linear' but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples. (ref)
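A minimal sketch of the LinearSVC variant (same toy data as above; the parameter values are mostly the defaults made explicit, with max_iter raised only to avoid convergence warnings on unscaled data):

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# liblinear-based linear SVM; exposes penalty/loss choices and scales better to many samples
clf = LinearSVC(C=1.0, penalty='l2', loss='squared_hinge', max_iter=10000)
clf.fit(X, y)
clf.predict(X[:5])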

Meaning of some parameters

The regularization parameter (C, default C=1.0): if C is larger, the hyperplane has a smaller margin but does a better job of classifying every training point, and vice versa. This is how you control the trade-off between a smooth decision boundary and the misclassification term.
  • Higher values of C → a higher possibility of overfitting; as C grows, the soft-margin SVM approaches the hard-margin SVM.
  • Lower values of C → a higher possibility of underfitting; we admit more misclassifications in the training data.
We use this in the case of not linearly separable data; it's also called the soft-margin linear SVM.
An illustration of using C. Bigger C, smaller margin. (ref)
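A small experiment to see the effect of C, assuming a linear kernel on a toy dataset (the exact counts depend on the data): a larger C usually leaves fewer points on or inside the margin, i.e. fewer support vectors.

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # n_support_ = number of support vectors per class; fewer usually means a harder margin
    print(C, clf.n_support_)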
Gamma (gamma, default gamma='auto', which uses 1/n_features): it determines how far the influence of a single training example reaches, i.e. how many points are effectively taken into account when constructing the hyperplane.
An illustration of using gamma. In the high-gamma case, we only consider points near the hyperplane, which may cause overfitting.
Bigger gamma, more chance of overfitting (in an XOR problem).
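In practice, C and gamma are usually tuned together on cross-validation data; a minimal sketch with GridSearchCV (the grid values and the iris data are only illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)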
 

References

  • Chris Albon -- Notes about Support Vector Machines.