Assume $Y$ is a continuously differentiable function of $X$. Given i.i.d. data $(x_i,y_i)_{i=1}^n$, I would like to estimate $E\left[g\left(\left.\frac{\partial Y}{\partial X}\right|_{X=X_0}\right)\right]$ for some known nonlinear function $g$ ($g$ can be assumed continuously differentiable if needed, for example $g(x)=x^2$).
This question is an extension of Estimating the expectation of a derivative, where $g$ was linear, so the derivative and the integral could be exchanged and kernel techniques could be used.
I am interested to know how the expectation can be estimated when $g$ is nonlinear and the integral and derivative cannot be exchanged. The underlying motivation for the question is estimation of $\frac{\partial Y}{\partial X}$ when there is randomness.
My idea was to do something similar to kernel estimation: give each point a weight based on how close it is to $X_0$, sample two points at a time according to those weights, say $(x_i,y_i)$ and $(x_j,y_j)$, calculate $g\left(\frac{y_i-y_j}{x_i-x_j}\right)$, repeat this many times, and take the sample average.
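A minimal sketch of this sampling scheme, assuming a Gaussian weight function and a bandwidth `h` (both are illustrative choices, not part of the question; the function name `pairwise_slope_estimate` is hypothetical):

```python
import numpy as np

def pairwise_slope_estimate(x, y, x0, g, h=0.2, n_draws=5000, seed=0):
    """Average g(secant slope) over pairs sampled with kernel weights around x0."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumed Gaussian weights: points close to x0 are sampled more often.
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    p = w / w.sum()
    # Draw the two members of each pair independently from the weights.
    i = rng.choice(len(x), size=n_draws, p=p)
    j = rng.choice(len(x), size=n_draws, p=p)
    # Drop pairs with identical x, where the secant slope is undefined.
    keep = x[i] != x[j]
    slopes = (y[i[keep]] - y[j[keep]]) / (x[i[keep]] - x[j[keep]])
    return float(np.mean(g(slopes)))
```

For noiseless linear data every secant slope equals the true slope, so with $y=3x$ and $g(s)=s^2$ the sketch returns a value very close to 9. With noisy data, pairs whose $x$ values are close together produce very large slopes, so the result will depend strongly on `h`.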
The derivative is, in general, a pointwise property.
In a linear regression $y=mx+n$, the derivative is the same throughout the sample: $y'=m$ (i.e. the slope).
However, in the general case you need to take one of two approaches:
Have a prior
If you know your data is going to behave in a certain way, use your assumptions as a model.
For example, if you assume that $y=ax^2+bx+c$, then your model parameters are $a$, $b$ and $c$.
Fit your data to this model using maximum likelihood or Bayesian inference, and the derivative will be $y'=2a_*x+b_*$, where $(a_*,b_*,c_*)$ are your best-fit parameters.
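As a rough illustration of the quadratic example, the sketch below fits the model by ordinary least squares with `np.polyfit`. This coincides with maximum likelihood only under the added assumption of i.i.d. Gaussian noise; the function name `quadratic_fit_derivative` is hypothetical.

```python
import numpy as np

def quadratic_fit_derivative(x, y, x0):
    """Fit y = a*x^2 + b*x + c by least squares and return y'(x0) = 2*a*x0 + b."""
    # np.polyfit returns coefficients from the highest degree down: a, b, c.
    a, b, c = np.polyfit(x, y, deg=2)
    return 2.0 * a * x0 + b
```

The estimate of $g(y'(X_0))$ is then simply `g(quadratic_fit_derivative(x, y, x0))`.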
Local linearity
Spline techniques are a standard way to model general functions.
Splines (very roughly speaking) divide your data into segments, model each segment separately, and stitch the models together into a continuous function.
For example, in your case you could assume that the points with $0<x<1$ behave linearly as $y_1=a_1x+b_1$, and those with $1<x<2$ behave as $y_2=a_2x+b_2$.
The common practice is to use linear, quadratic, or cubic splines, depending on the properties you want the function to have at the "stitch points".
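One possible way to get a derivative out of a cubic spline is sketched below, assuming SciPy is available and that the $x$ values are sorted and distinct. The function name `spline_derivative` and the default smoothing value are illustrative assumptions; `s=0` interpolates the data exactly, which is only sensible when there is no noise.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def spline_derivative(x, y, x0, smoothing=0.0):
    """Fit a cubic spline to (x, y) and evaluate its first derivative at x0."""
    # k=3 gives a cubic spline; s controls smoothing (s=0 means interpolation).
    spline = UnivariateSpline(x, y, k=3, s=smoothing)
    # derivative() returns a new spline representing the first derivative.
    return float(spline.derivative()(x0))
```

With noisy data a positive `smoothing` value is needed, and the derivative estimate can be sensitive to that choice.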
I would suggest modelling a general function as a linear approximation over a moving window; the derivative at each point is then the slope of the fit in the window around that point.
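A minimal sketch of the moving-window idea, assuming a symmetric window of half-width `half_width` around the point of interest (the window shape, its width, and the name `local_linear_slope` are illustrative assumptions):

```python
import numpy as np

def local_linear_slope(x, y, x0, half_width=0.25):
    """Slope of a least-squares line fitted to the points within half_width of x0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Keep only the points inside the window centred on x0.
    mask = np.abs(x - x0) <= half_width
    if mask.sum() < 2:
        raise ValueError("need at least two points inside the window")
    # np.polyfit with deg=1 returns (slope, intercept).
    slope, _intercept = np.polyfit(x[mask], y[mask], deg=1)
    return float(slope)
```

Applying `g` to the returned slope gives a plug-in value for $g(y'(X_0))$. A narrower window reduces the bias from curvature but makes the slope noisier.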