Third post of our series on classification from scratch, following the previous post introducing smoothing techniques, with (b)-splines. Consider here kernel based techniques. Note that here, we do not use the “logistic” model… it is purely non-parametric.

## kernel based estimated, from scratch

I like kernels because they are somehow very intuitive. With GLMs, the goal is to estimate \hat{m}(\mathbf{x})=\mathbb{E}(Y|\mathbf{X}=\mathbf{x}). Heuritically, we want to compute the (conditional) expected value on the neighborhood of \mathbf{x}. If we consider some spatial model, where \mathbf{x} is the location, we want the expected value of some variable Y, “on the neighborhood” of \mathbf{x}. A natural approach is to use some administrative region (county, departement, region, etc). This means that we have a partition of \mathcal{X} (the space with the variable(s) lies). This will yield the regressogram, introduced in Tukey (1961). For convenience, assume some interval / rectangle / box type of partition. In the univariate case, consider \hat{m}_{\mathbf{a}}(x)=\frac{\sum_{i=1}^n \mathbf{1}(x_i\in[a_j,a_{j+1}))y_i}{\sum_{i=1}^n \mathbf{1}(x_i\in[a_j,a_{j+1}))}or the moving regressogram \hat{m}(x)=\frac{\sum_{i=1}^n \mathbf{1}(x_i\in[x\pm h])y_i}{\sum_{i=1}^n \mathbf{1}(x_i\in[x\pm h])}In that case, the neighborhood is defined as the interval (x\pm h). That’s nice, but clearly very simplistic. If \mathbf{x}_i=\mathbf{x} and \mathbf{x}_j=\mathbf{x}-h+\varepsilon (with \varepsilon>0), both observations are used to compute the conditional expected value. But if \mathbf{x}_{j'}=\mathbf{x}-h-\varepsilon, only \mathbf{x}_i is considered. Even if the distance between \mathbf{x}_{j} and \mathbf{x}_{j'} is extremely extremely small. Thus, a natural idea is to use weights that are function of the distance between \mathbf{x}_{i}‘s and \mathbf{x}.Use\tilde{m}(x)=\frac{\sum_{i=1}^ny_i\cdot k_h\left({x-x_i}\right)}{\sum_{i=1}^nk_h\left({x-x_i}\right)}where (classically)k_h(x)=k\left(\frac{x}{h}\right)for some kernel k (a non-negative function that integrates to one) and some bandwidth h. Usually, kernels are denoted with capital letter K, but I prefer to use k, because it can be interpreted as the density of some random noise we add to all observations (independently).

Actually, one can derive that estimate by using kernel-based estimators of densities. Recall that\tilde{f}(\mathbf{y})=\frac{1}{n|\mathbf{H}|^{1/2}}\sum_{i=1}^n k\left(\mathbf{H}^{-1/2}(\mathbf{y}-\mathbf{y}_i)\right)

Now, use the fact that the expected value can be defined asm(x)=\int yf(y|x)dy=\frac{\int y f(y,x)dy}{\int f(y,x)dy}Consider now a bivariate (product) kernel to estimate the joint density. The numerator is estimated by\frac{1}{nh}\sum_{i=1}^n\int y_i k\left(t,\frac{x-x_i}{h}\right)dt=\frac{1}{nh}\sum_{i=1}^ny_i \kappa\left(\frac{x-x_i}{h}\right)while the denominator is estimated by\frac{1}{nh^2}\sum_{i=1}^n \int k\left(\frac{y-y_i}{h},\frac{x-x_i}{h}\right)=\frac{1}{nh}\sum_{i=1}^n\kappa\left(\frac{x-x_i}{h}\right)In a general setting, we still use product kernels between Y and \mathbf{X} and write \widehat{m}_{\mathbf{H}}(\mathbf{x})=\displaystyle{\frac{\sum_{i=1}^ny_i\cdot k_{\mathbf{H}}(\mathbf{x}_i-\mathbf{x})}{\sum_{i=1}^n k_{\mathbf{H}}(\mathbf{x}_i-\mathbf{x})}}for some symmetric positive definite bandwidth matrix \mathbf{H}, and k_{\mathbf{H}}(\mathbf{x})=\det[\mathbf{H}]^{-1}k(\mathbf{H}^{-1}\mathbf{x})

Now that we know what kernel estimates are, let us use them. For instance, assume that k is the density of the \mathcal{N}(0,1) distribution. At point x, with a bandwidth h we get the following code

mean_x = function(x,bw){ w = dnorm((myocarde$INSYS-x)/bw, mean=0,sd=1) weighted.mean(myocarde$PRONO,w)} u = seq(5,55,length=201) v = Vectorize(function(x) mean_x(x,3))(u) plot(u,v,ylim=0:1,type="l",col="red") points(myocarde$INSYS,myocarde$PRONO,pch=19) |

and of course, we can change the bandwidth.

v = Vectorize(function(x) mean_x(x,2))(u) plot(u,v,ylim=0:1,type="l",col="red") points(myocarde$INSYS,myocarde$PRONO,pch=19) |

We observe what we can read in any textbook : with a smaller bandwidth, we get more variance, less bias. “More variance” means here more variability (since the neighborhood is smaller, there are less points to compute the average, and the estimate is more volatile), and “less bias” in the sense that the expected value is supposed to be compute at point x, so the smaller the neighborhood, the better.

## Using ksmooth R function

Actually, there is a function in R to compute this kernel regression.

reg = ksmooth(myocarde$INSYS,myocarde$PRONO,"normal",bandwidth = 2*exp(1)) plot(reg$x,reg$y,ylim=0:1,type="l",col="red",lwd=2,xlab="INSYS",ylab="") points(myocarde$INSYS,myocarde$PRONO,pch=19) |

We can replicate our previous estimate. Nevertheless, the output is not a function, but two series of vectors. That’s nice to get a graph, but that’s all we get. Furthermore, as we can see, the bandwidth is not exactly the same as the one we used before. I did not find any information online, so I tried to replicate the function we wrote before

g=function(bk=3){ reg = ksmooth(myocarde$INSYS,myocarde$PRONO,"normal",bandwidth = bk) f=function(bm){ v = Vectorize(function(x) mean_x(x,bm))(reg$x) z=reg$y-v sum((z[!is.na(z)])^2)} optim(bk,f)$par} x=seq(1,10,by=.1) y=Vectorize(g)(x) plot(x,y) abline(0,exp(-1),col="red") abline(0,.37,col="blue") |

There is a slope of 0.37, which is actually e^{-1}. Coincidence ? I don’t know to be honest…

## Application in higher dimension

Consider now our bivariate dataset, and consider some product of univariate (Gaussian) kernels

u = seq(0,1,length=101) p = function(x,y){ bw1 = .2; bw2 = .2 w = dnorm((df$x1-x)/bw1, mean=0,sd=1)* dnorm((df$x2-y)/bw2, mean=0,sd=1) weighted.mean(df$y=="1",w) } v = outer(u,u,Vectorize(p)) image(u,u,v,col=clr10,breaks=(0:10)/10) points(df$x1,df$x2,pch=19,cex=1.5,col="white") points(df$x1,df$x2,pch=c(1,19)[1+(df$y=="1")],cex=1.5) contour(u,u,v,levels = .5,add=TRUE) |

We get the following prediction

Here, the different colors are probabilities.

## k-nearest neighbors

An alternative is to consider a neighborhood not defined using a distance to point \mathbf{x} but the k-neighbors, with the n observations we got.\tilde{m}_k(\mathbf{x})=\frac{1}{n}\sum_{i=1}^n\omega_{i,k}(\mathbf{x})y_i

where \omega_{i,k}(\mathbf{x})=n/k if i\in\mathcal{I}_{\mathbf{x}}^k with

\mathcal{I}_{\mathbf{x}}^k=\{i:\mathbf{x}_i\text{ one of the }k\text{ nearest observations to }\mathbf{x}\}

The difficult part here is that we need a valid distance. If units are very different on each component, using the Euclidean distance will be meaningless. So, quite naturally, let us consider here the Mahalanobis distance

Sigma = var(myocarde[,1:7]) Sigma_Inv = solve(Sigma) d2_mahalanobis = function(x,y,Sinv){as.numeric(x-y)%*%Sinv%*%t(x-y)} k_closest = function(i,k){ vect_dist = function(j) d2_mahalanobis(myocarde[i,1:7],myocarde[j,1:7],Sigma_Inv) vect = Vectorize(vect_dist)((1:nrow(myocarde))) which((rank(vect)))} |

Here we have a function to find the k closest neighbor for some observation. Then two things can be done to get a prediction. The goal is to predict a class, so we can think of using a majority rule : the prediction for y_i is the same as the one the majority of the neighbors.

k_majority = function(k){ Y=rep(NA,nrow(myocarde)) for(i in 1:length(Y)) Y[i] = sort(myocarde$PRONO[k_closest(i,k)])[(k+1)/2] return(Y)} |

But we can also compute the proportion of black points among the closest neighbors. It can actually be interpreted as the probability to be black (that’s actually what was said at the beginning of this post, with kernels),

k_mean = function(k){ Y=rep(NA,nrow(myocarde)) for(i in 1:length(Y)) Y[i] = mean(myocarde$PRONO[k_closest(i,k)]) return(Y)} |

We can see on our dataset the observation, the prediction based on the majority rule, and the proportion of dead individuals among the 7 closest neighbors

cbind(OBSERVED=myocarde$PRONO, MAJORITY=k_majority(7),PROPORTION=k_mean(7)) OBSERVED MAJORITY PROPORTION [1,] 1 1 0.7142857 [2,] 0 1 0.5714286 [3,] 0 0 0.1428571 [4,] 1 1 0.5714286 [5,] 0 1 0.7142857 [6,] 0 0 0.2857143 [7,] 1 1 0.7142857 [8,] 1 0 0.4285714 [9,] 1 1 0.7142857 [10,] 1 1 0.8571429 [11,] 1 1 1.0000000 [12,] 1 1 1.0000000 |

Here, we got a prediction for an observed point, located at \boldsymbol{x}_i, but actually, it is possible to seek the k closest neighbors of any point \boldsymbol{x}. Back on our univariate example (to get a graph), we have

mean_x = function(x,k=9){ w = rank(abs(myocarde$INSYS-x),ties.method ="random") mean(myocarde$PRONO[which(w<=9)])} u=seq(5,55,length=201) v=Vectorize(function(x) mean_x(x,3))(u) plot(u,v,ylim=0:1,type="l",col="red",lwd=2,xlab="INSYS",ylab="") points(myocarde$INSYS,myocarde$PRONO,pch=19) |

That’s not very smooth, but we do not have a lot of points either.

If we use that technique on our two-dimensional dataset, we obtain the following

Sigma_Inv = solve(var(df[,c("x1","x2")])) u = seq(0,1,length=51) p = function(x,y){ k = 6 vect_dist = function(j) d2_mahalanobis(c(x,y),df[j,c("x1","x2")],Sigma_Inv) vect = Vectorize(vect_dist)(1:nrow(df)) idx = which(rank(vect)<=k) return(mean((df$y==1)[idx]))} v = outer(u,u,Vectorize(p)) image(u,u,v,xlab="Variable 1",ylab="Variable 2",col=clr10,breaks=(0:10)/10) points(df$x1,df$x2,pch=19,cex=1.5,col="white") points(df$x1,df$x2,pch=c(1,19)[1+z],cex=1.5) contour(u,u,v,levels = .5,add=TRUE) |

This is the idea of local inference, using either kernel on a neighborhood of \mathbf{x} or simply using the k nearest neighbors. Next time, we will investigate penalized logistic regressions, to be continued…

Great work 🙂 Please mention part four (4/8) link in this post.