The generic regression problem is to learn a function $f$ which maps input $X$ to its desired output $Y$. However in many cases we require $f$ to be smooth, meaning that continuously varying input should effectuate continuously varying output. Kernel smoothing methods are directed towards this subset of regression problems.

\hat{f}(X) = \mathbb{E}[Y|X]

\end{equation}

\begin{equation}

p(x,t) = \frac{1}{N}\sum_n q(x-x_n,t-t_n)

\end{equation}

\begin{equation}

f(x) = \frac{\sum_n g(x-x_n)t_n}{\sum_m g(x-x_m)}

\end{equation}

\begin{equation}

f(x) = \sum_n k(x,x_n)t_n

\end{equation}

Regression can be thought of as computing an expectation of the output conditioned on the input

\begin{equation}\hat{f}(X) = \mathbb{E}[Y|X]

\end{equation}

An approximation to this expectation is the

*k-*nearest-neighbor average. Simply find $k$ inputs in the training data set closest to the query input and average their corresponding outputs. This estimate is piece-wise continuous. It holds constant while the nearest-neighbors remain unchanged but can change abruptly even when one of the*k*-neighbors changes.
Nadaraya and Watson came up with a solution to the problem in 1964 (perhaps independently) and to no surprise their approach is now known as Nadaraya-Watson kernel regression. The idea behind this was simply to do a weighted average where training data points with inputs closer to the query input would have a larger say as to what the output should be.

This heuristic has been motivated mathematically in Section 6.3.1 of Bishop's "Pattern Recognition and Machine Learning". A sketch of the result is presented below:

We have training input output pairs $\{x_n,t_n\}$. Using this data the joint distribution of input output variables is modeled by Parzen density estimator $p(x,t)$, which is a smooth non-parametric density estimation model

p(x,t) = \frac{1}{N}\sum_n q(x-x_n,t-t_n)

\end{equation}

where $q$ is a component density function. Parzen density estimator is basically trying to approximate the density function as a sum of density functions each sitting at a training data point for all given training points.

Now the conditional expectation can be written down in terms of $p(x,t)$ which can in turn be replaced by the Parzen density estimate. An assumption about component density functions having zero mean and a quick change of variable later, we get the Nadaraya-Watson model which is

f(x) = \frac{\sum_n g(x-x_n)t_n}{\sum_m g(x-x_m)}

\end{equation}

where, $g(x)=\int_{-\infty}^{\infty} q(x,t)dt$. This can be more succinctly be written as

f(x) = \sum_n k(x,x_n)t_n

\end{equation}

where, $k(x,x_n) = \frac{g(x-x_n)}{\sum_m g(x-x_m)}$.

A visual comparison of

*k*-nearest-neighbor and kernel smoothing using Nadaraya-Watson model with Epanechnikov Kernel is presented in Figure 6.1 in the second edition of "The Elements of Statistical Learning".