Wednesday 27 May 2015

Kernel Smoothing

The generic regression problem is to learn a function $f$ which maps an input $X$ to its desired output $Y$. In many cases, however, we require $f$ to be smooth, meaning that continuously varying input should produce continuously varying output. Kernel smoothing methods are directed towards this subset of regression problems.

Regression can be thought of as computing an expectation of the output conditioned on the input
\begin{equation}
\hat{f}(X) = \mathbb{E}[Y|X]
\end{equation}

An approximation to this expectation is the k-nearest-neighbor average: simply find the $k$ inputs in the training data set closest to the query input and average their corresponding outputs. This estimate is piecewise constant, and therefore not smooth. It holds constant as long as the set of nearest neighbors is unchanged, but jumps abruptly as soon as even one of the $k$ neighbors changes.
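To make this concrete, here is a minimal sketch of the k-nearest-neighbor average for one-dimensional inputs (the function name, the use of NumPy, and the default $k$ are illustrative choices, not taken from any particular library):

```python
import numpy as np

def knn_average(x_query, x_train, t_train, k=5):
    """k-nearest-neighbor estimate: average the targets of the k
    training inputs closest to the query input."""
    dists = np.abs(x_train - x_query)    # distances for 1-D inputs
    nearest = np.argsort(dists)[:k]      # indices of the k closest points
    return t_train[nearest].mean()
```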

Nadaraya and Watson proposed a solution to this problem in 1964 (apparently independently), and unsurprisingly their approach is now known as Nadaraya-Watson kernel regression. The idea is simply to take a weighted average in which training points whose inputs lie closer to the query input have a larger say in what the output should be.

This heuristic is motivated mathematically in Section 6.3.1 of Bishop's "Pattern Recognition and Machine Learning". A sketch of the derivation is presented below:

We have training input-output pairs $\{x_n,t_n\}$. Using this data, the joint distribution of the input and output variables is modeled by a Parzen density estimator $p(x,t)$, which is a smooth non-parametric density estimation model

\begin{equation}
p(x,t) = \frac{1}{N}\sum_n q(x-x_n,t-t_n)
\end{equation}

where $q$ is a component density function. The Parzen density estimator approximates the joint density as an average of component densities, one centered at each training data point.
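As a concrete illustration, here is a small sketch of a Parzen estimate of the joint density, taking the component density $q$ to be an isotropic Gaussian of width $h$ (the Gaussian choice and the bandwidth are assumptions made here for illustration; any valid component density would do):

```python
import numpy as np

def parzen_joint_density(x, t, x_train, t_train, h=0.3):
    """Parzen estimate of p(x, t): the average of N component densities,
    each an isotropic 2-D Gaussian of width h centered on a training pair."""
    sq_dist = (x - x_train) ** 2 + (t - t_train) ** 2
    q = np.exp(-0.5 * sq_dist / h**2) / (2 * np.pi * h**2)  # Gaussian components
    return q.mean()                                          # (1/N) * sum over n
```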

Now the conditional expectation can be written down in terms of $p(x,t)$, which can in turn be replaced by the Parzen density estimate. Assuming the component density functions have zero mean, and after a quick change of variables, we get the Nadaraya-Watson model, which is

\begin{equation}
f(x) = \frac{\sum_n g(x-x_n)t_n}{\sum_m g(x-x_m)}
\end{equation}

where $g(x)=\int_{-\infty}^{\infty} q(x,t)\,dt$. This can be written more succinctly as

\begin{equation}
f(x) = \sum_n k(x,x_n)t_n
\end{equation}

where $k(x,x_n) = \frac{g(x-x_n)}{\sum_m g(x-x_m)}$.
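Putting the pieces together, here is a minimal sketch of Nadaraya-Watson regression with a Gaussian kernel standing in for $g$ (the Gaussian and the bandwidth $h$ are illustrative assumptions; other kernels such as the Epanechnikov work just as well):

```python
import numpy as np

def nadaraya_watson(x_query, x_train, t_train, h=0.1):
    """Nadaraya-Watson prediction: a weighted average of the training
    targets with weights k(x, x_n) = g(x - x_n) / sum_m g(x - x_m),
    where g is taken to be a Gaussian of width h."""
    g = np.exp(-0.5 * ((x_query - x_train) / h) ** 2)
    weights = g / g.sum()                 # normalized kernel weights
    return np.sum(weights * t_train)

# Example on noisy samples of a sine curve
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 1.0, 50))
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, 50)
print(nadaraya_watson(0.25, x_train, t_train))  # smooth estimate near sin(pi/2) = 1
```

Unlike the k-nearest-neighbor average, this prediction varies smoothly with the query input, since every training point contributes a smoothly varying weight.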

A visual comparison of the k-nearest-neighbor average and Nadaraya-Watson kernel smoothing with the Epanechnikov kernel is presented in Figure 6.1 of the second edition of "The Elements of Statistical Learning".