I am going through a book on statistical learning and ran into a problem concerning k nearest neighbor methods. The book says that using least squares to determine the optimal $k$ will lead to $k=1$. I tried proving this, but I am not sure how to differentiate the summation part, i.e. $\sum_{x_{i} \in N_{k}(x)} Y_{i}$, since $k$ is involved in the summation.
To help me out, could someone please show me how to differentiate the kNN formula itself with respect to $k$?
Thanks! PS: the formula for kNN is:
$\hat{Y}(x)=\frac{1}{k}\sum_{x_{i} \in N_{k}(x)} Y_{i}$
Here is a rough sketch of what I have in mind. Note that it is only one way of formulating the problem.
Firstly, note that a real machine learning problem is a discrete problem, in the sense that some data points with their corresponding labels (classes) are given to you, and you have to guess the label of a new data point according to the rule
$$\hat{Y}(X)=\frac{1}{k}\sum_{X_{i} \in N_{k}(X)} Y_{i}$$
So we are given labeled data points $T=\{(X_i,Y_i)\mid X_i \in \mathbb{R}^n,\ Y_i \in\{0,1,\dots,c-1\}\}$, where we have $c$ classes.
At this point, we would like to define a continuous function $f:\mathbb{R}^n \rightarrow\mathbb{R}$ such that its floor, $\lfloor f(X)\rfloor$, gives the correct class whenever a point $X_i$ from the set $T$ is given as input.
For any optimisation, you need to define a cost function; one possibility is the following.
$$Cost=\sum_{i}\Big(\hat{Y}(X_i)-\lfloor f(X_i)\rfloor\Big)^2=\sum_{i}\bigg(\frac{1}{k}\sum_{X_{j} \in N_{k}(X_i)} Y_{j}-\lfloor f(X_i)\rfloor\bigg)^2$$
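To see the claim numerically before passing to the continuum, here is a small Python sketch (my own illustration, not from the original post): evaluate a 1-D kNN regressor on the very points it was fit on, and the least-squares cost is minimised at $k=1$, where it is exactly zero.

```python
import random

random.seed(0)
X = [random.uniform(0.0, 10.0) for _ in range(30)]   # 1-D inputs
Y = [xi + random.gauss(0.0, 0.5) for xi in X]        # noisy labels

def knn_predict(x, k):
    """Average the labels of the k training points nearest to x."""
    nearest = sorted(range(len(X)), key=lambda i: abs(X[i] - x))[:k]
    return sum(Y[i] for i in nearest) / k

def training_cost(k):
    """Least-squares cost of kNN evaluated on its own training data."""
    return sum((knn_predict(x, k) - y) ** 2 for x, y in zip(X, Y))

costs = {k: training_cost(k) for k in range(1, 11)}
best_k = min(costs, key=costs.get)
# With k = 1, each training point is its own nearest neighbour, so the
# prediction reproduces every Y_i exactly and the cost is 0.
```

This is exactly why least squares on the training data cannot be used to select $k$: it always prefers the interpolating choice $k=1$.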
In order to use calculus to analyse the problem, we need to make further modifications.
$$Cost=\int_{X \in \mathbb{R}^n} \bigg(\frac{1}{V(S_r(X))}\int_{T \in S_r(X)}f(T)dT -f(X) \bigg )^2 dX$$
Here $S_r(X)$ is the open ball centred at $X$ with radius $r$, and $V(S_r(X))$ is its volume; $V(S_r(X))$ is supposed to be the continuous counterpart of $k$. The volume satisfies
$$V(S_r(X))=\alpha r^n,$$
where $\alpha$ is a constant depending only on the dimension $n$.
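For concreteness, $\alpha$ is the volume of the unit ball in $\mathbb{R}^n$ (a standard fact, not stated in the original):
$$\alpha=\frac{\pi^{n/2}}{\Gamma\!\left(\frac{n}{2}+1\right)},$$
so for example $\alpha=2$ for $n=1$, $\alpha=\pi$ for $n=2$, and $\alpha=\frac{4\pi}{3}$ for $n=3$.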
We may therefore rewrite the cost function as
$$Cost=\int_{X \in \mathbb{R}^n} \bigg( \frac{1}{\alpha r^n} \int_{T \in S_r(X)}f(T)dT -f(X) \bigg )^2 dX$$
The term $\int_{T \in S_r(X)}f(T)dT$ can be approximated by
$$\int_{T \in \mathbb{R}^n} \delta_{X,r}(T)\,f(T)\,dT$$
where $\delta_{X,r}$ is a Gaussian density with mean $X$ and standard deviation $r$, so a point $T$ is weighted according to its Euclidean distance $d(T,X)$ from $X$. As $r$ goes to $0$, the term $\int_{T \in S_r(X)}f(T)\,dT$ behaves like $V(S_r(X))\,f(X)=\alpha r^n f(X)$.
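As a sanity check on this limit, here is a small Python sketch (my own illustration) in one dimension, where $S_r(x)=(x-r,x+r)$ and $\alpha=2$: the normalised average $\frac{1}{2r}\int_{x-r}^{x+r}f(t)\,dt$ approaches $f(x)$ as $r \to 0$.

```python
import math

def local_average(f, x, r, m=2001):
    """Approximate (1 / V(S_r(x))) * integral of f over S_r(x) in one
    dimension, where S_r(x) = (x - r, x + r) and V = 2r, by averaging
    f over an even grid on the interval."""
    step = 2.0 * r / (m - 1)
    return sum(f(x - r + i * step) for i in range(m)) / m

x0 = 0.7
errors = [abs(local_average(math.cos, x0, r) - math.cos(x0))
          for r in (0.5, 0.1, 0.01)]
# The error shrinks with r: the integral over S_r(x) behaves like
# V(S_r(x)) * f(x), so the normalised average tends to f(x) as r -> 0.
```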
According to the last two equations, you should come to the conclusion that $r=0$ gives the minimum: as $r \to 0$ the inner average $\frac{1}{\alpha r^n}\int_{T \in S_r(X)}f(T)\,dT$ tends to $f(X)$, so the integrand of the cost vanishes. Note that $r=0$ is equivalent to $k=1$, which means that you want to consider only the point $X$ itself for evaluating the label $\hat{Y}(X)$.
You can also take the derivative of the cost function with respect to $r$ and set it equal to $0$, but that is a tedious process.
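For a concrete 1-D instance, the tedium can be avoided. Taking $f=\cos$, the local average over $S_r(x)$ works out to $\cos(x)\sin(r)/r$ (my own computation, not from the original post), so the pointwise cost is $\cos^2(x)\,g(r)$ with $g(r)=(\sin(r)/r-1)^2$. A finite-difference sketch in Python confirms that $g'(r)>0$ for $r>0$, so the cost is minimised as $r \to 0$:

```python
import math

def g(r):
    """Pointwise cost factor (sin(r)/r - 1)**2 for the 1-D example f = cos:
    the local average of cos over (x - r, x + r) is cos(x) * sin(r) / r,
    so the squared error at x is cos(x)**2 * g(r)."""
    return (math.sin(r) / r - 1.0) ** 2

def dg(r, h=1e-6):
    """Central finite-difference approximation of g'(r)."""
    return (g(r + h) - g(r - h)) / (2.0 * h)

# g'(r) > 0 on (0, 1]: the cost increases with r, so the minimum sits
# at r -> 0, i.e. at the continuous analogue of k = 1.
derivs = [dg(r) for r in (0.1, 0.5, 1.0)]
```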