Kernel method —its basics

The basics of a kernel method is explained and a kernel function is naturally introduced.

T Miyamoto
4 min readJan 30, 2022

Intro

In a machine learning context, kernel method is a convenient method to analyze datasets, and it’s worthwhile to understand the math behind the method. The purpose of this article is to explain the basics of a kernel method.

Here we have a definition of a kernel function.

Suppose set X is an input space, the set of all possible input values. Then, the kernel function for a kernel method is a map from 2 input spaces to the set of real numbers. And a kernel function must be a positive semi-definite.

There are many types of kernel functions.

Linear kernel

When input vector x contains many zeros (like text data), a linear kernel can be used.

Polynomial kernel

It is used for classifying image data.

Gaussian kernel

This is a general-purpose kernel function used when there is no prior knowledge about the data.

Now we want to see how a kernel function is naturally introduced in the context of a kernel method.

Problem Setup

Suppose we have N data points

where x is a vector and t is a scalar. Now we want to apply a simple non-linear regression to the data points:

where w is the set of model parameters for the regression analysis, and phi(x) is a vector function of x. Then we define the loss function by:

where lambda is the parameter for regularization.

Kernel Method

Now we want to minimize the loss function. To do this, we first differentiate the loss function with regard to the model parameters w and equate it to zero. This allows us to have:

This means that we have the following expression for the model parameters :

Conveniently, we introduce a matrix Phi such that

to get a simple expression for the model parameters:

With the matrix, we can see that

and so we get the following expression for the loss function:

Then we introduce an N-by-N matrix K which is defined by

and we call it a gram matrix. The elements of the gram matrix are:

where k(x_i, x_j) is called the kernel function for this Kernel method. Also, we can express the loss function as

with the gram matrix.

Moreover, taking advantage of the fact that

we obtain

where I_N is the N-by-N identity matrix. Therefore the model parameters are expressed by the gram matrix, matrix Phi, parameter lambda, and the data point t.

Thus, we obtain the formula for the non-linear regression

where k is a vector vector function for an unknown point x:

that is

Example

I just put a graph as an example of a kernel method, where a Gaussian kernel is used, changing the parameter lambda. We can see that as the lambda decreases the inferred line tends to be fitted more to the data points.

--

--