Estimating the expectation of a derivative


Assume $Y$ is a continuously differentiable function of $X$. Given i.i.d. data $(x_i,y_i)_{i=1}^n$, I would like to estimate $E\left[\left.\frac{\partial Y}{\partial X}\right|_{X=X_0}\right]$.

What got me thinking about this problem was estimation of the coefficients in a linear regression using $E\left[\frac{\partial Y}{\partial X}\right]$ (I know it is not the best way to estimate the coefficients, and may even be a bad way to do so).

From the question "Derivative of a random variable w.r.t. a deterministic variable", I know that $\frac{\partial Y}{\partial X}$ makes sense, but I'm trying to understand how to estimate $\frac{\partial Y}{\partial X}$ when there is randomness (without randomness, estimation can be done by finite differences, for example).

I don't have any good ideas for estimating $E\left[\left.\frac{\partial Y}{\partial X}\right|_{X=X_0}\right]$, but if I were forced to give a method, I would weight the points around $X_0$ by how close they are to $X_0$, repeatedly sample two points at a time according to those weights, take the first difference of each pair, and then take the sample average over many such draws.

References / summary of techniques / specifics are greatly appreciated.

Best Answer

This looks like a good opportunity to apply Kernel Regression (part of the vast field of Nonparametric Regression).

You actually described the basic idea in your last full paragraph. You will be approximating $Y=f(x)$ using a kernel-weighted sum of the points around $x$, weighted by their distance from that point.

There are a number of possible kernels, but since you want $Y$ to be smooth, we should choose a kernel that uses all data points in its calculation (so there are no discontinuities). A familiar choice is the Gaussian kernel $K_{\sigma}$ with bandwidth $\sigma$:

$$K_{\sigma}(x):=\frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{x^2}{2\sigma^2}}$$

If we use the Nadaraya-Watson kernel estimator, we can see that our approximation will have the following form for a set of sample pairs $(x_i,y_i) \in S$:

$$E[Y|X=x_0]\approx f(x_0;S,\sigma)=\frac{\sum\limits_{(x_i,y_i)\in S} K_{\sigma}(x_i-x_0)y_i}{\sum\limits_{(x_i,y_i)\in S} K_{\sigma}(x_i-x_0)}$$

We can approximate $E\left[\left.\frac{\partial Y}{\partial X}\right|_{X=X_0}\right]$ by taking the derivative of $f(x_0;S,\sigma)$ wrt. $x_0$:

$$E\left[\left.\frac{\partial Y}{\partial X}\right|_{X=X_0}\right] \approx \frac{\partial}{\partial x_0}\frac{\sum\limits_{(x_i,y_i)\in S} K_{\sigma}(x_i-x_0)y_i}{\sum\limits_{(x_i,y_i)\in S} K_{\sigma}(x_i-x_0)}$$

Not necessarily a pretty formula, but it will be (a) differentiable and (b) take into account all your data, not just data near your point of interest.

I don't have time right now to work out the whole derivative, but you can apply the quotient rule yourself to see what it will look like.
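Since the quotient rule is left as an exercise, here is a sketch of what the resulting computation might look like in code (the function name is hypothetical; note that for the Gaussian kernel above, $\frac{\partial}{\partial x_0}K_\sigma(x_i-x_0) = \frac{x_i-x_0}{\sigma^2}K_\sigma(x_i-x_0)$, and the normalizing constant cancels in the ratio):

```python
import numpy as np

def nw_estimate_and_derivative(x0, x, y, sigma):
    """Nadaraya-Watson estimate of E[Y|X=x0] and its derivative w.r.t. x0,
    using a Gaussian kernel with bandwidth sigma (hypothetical helper)."""
    u = x - x0
    w = np.exp(-u**2 / (2 * sigma**2))   # kernel weights; 1/(sqrt(2*pi)*sigma) cancels
    dw = w * u / sigma**2                # derivative of each weight w.r.t. x0
    S0, S1 = w.sum(), (w * y).sum()
    f = S1 / S0                          # estimate of E[Y|X=x0]
    df = ((dw * y).sum() * S0 - S1 * dw.sum()) / S0**2   # quotient rule
    return f, df
```

On linear data sampled symmetrically around $x_0$, this recovers the slope almost exactly, which connects back to the linear-regression motivation in the question.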


Response to OP Comment

The OP has stated that they want a "model free" estimate of the derivative at a given point. Unfortunately, the existence of the derivative presumes some model. The nonparametric approach above makes very weak assumptions about the nature of this function, based essentially on just the observed data and some basic smoothing transformations (to ensure that the derivative makes sense).

Also, to clarify: the linked post about the derivative of a random variable wrt a deterministic variable assumes that the underlying point in the sample space is held fixed and that there is no interaction term between the deterministic and random parts of $Y$ (i.e., we are thinking of $Y$ as a linear function of $(\omega,x)$, so $\frac{\partial Y}{\partial x}$ ends up being an ordinary function). If this is not the case, then you've entered the realm of stochastic differential equations, and it's not simple at all. Now, if you have a bunch of data, you cannot assume that $\omega$ is the same for each point, so we need methods that can handle this. My approach above is model-free in this regard; I'll expand on this below.

For notational simplicity, let's define the conditional random variable $Y_z:=(Y|X=z)$. Then we can model the derivative as follows:

$$\frac{d}{dz}Y_z = \lim_{z'\to z^+} \frac{Y_{z'}-Y_{z}}{z'-z} := Y'_z$$

Let's define the "secant" random variable as: $$\delta Y_{z,z'} := \frac{Y_{z'}-Y_{z}}{z'-z}$$

The expected value of $\delta Y_{z,z'}$ follows from linearity of expectation:

$$E[\delta Y_{z,z'}] = \frac{E[Y|X=z'] - E[Y|X=z]}{z'-z} $$

Since you are assuming that $Y$ is a continuously differentiable function of $X$, we know that $E[Y_z]$ is a smooth, univariate function of $z$, to which we can apply simple Calculus I concepts:

$$ \lim_{z'\to z^+} E[\delta Y_{z,z'}] = E\left[\lim_{z'\to z^+} \delta Y_{z,z'}\right] = E[Y'_z]$$

But, we also have:

$$ \lim_{z'\to z^+} E[\delta Y_{z,z'}] = \lim_{z'\to z^+} \frac{E[Y|X=z'] - E[Y|X=z]}{z'-z}= \frac{d}{dz}E[Y_z]$$

Therefore, $E[Y'_z]=\frac{d}{dz}E[Y_z]$. The latter we were able to get nonparametrically using kernel estimators.
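As an illustrative sanity check, here is a quick simulation under a hypothetical model $Y_z = z^2 + \varepsilon$ with additive noise, so the same $\omega$ (i.e., the same draw of $\varepsilon$) can be shared between $z$ and $z'$:

```python
import numpy as np

# Hypothetical model: Y_z = z^2 + eps, so Y'_z = 2z and E[Y'_z] = 2z.
rng = np.random.default_rng(0)
z, dz = 1.5, 1e-4
eps = rng.normal(size=100_000)       # shared noise: same omega at z and z + dz
Y_z  = z**2 + eps
Y_zp = (z + dz)**2 + eps
secant = (Y_zp - Y_z) / dz           # the "secant" random variable delta Y_{z,z'}
# Its sample mean approximates E[Y'_z] = d/dz E[Y_z] = 2z
print(secant.mean())                 # close to 3.0
```

Because the noise here is additive, it cancels in the difference, echoing the "no interaction term" caveat above; with multiplicative or state-dependent noise the secant would itself remain random.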


Response No. 2 - Discussion of convergence

I'll start by stepping through the first convergence result:

$$ \lim_{z'\to z^+} E[\delta Y_{z,z'}] = E\left[\lim_{z'\to z^+} \delta Y_{z,z'}\right] = E[Y'_z]$$

Given all the subscripts involved, it may help to recast this using a notation that clarifies the variables:

$E[Y|X=z]\equiv E[Y_z]$ is a function of $z$ only, so let's call it $g(z)$. Then, we can see that:

$$\lim_{z'\to z^+} E[\delta Y_{z,z'}] = \lim_{z'\to z^+} \frac{g(z')-g(z)}{z'-z} \equiv \frac{dg}{dz}:=g'(z)\equiv \frac{d}{dz}E[Y_z](z)$$

So, this step is just the basic application of calculus (assuming the derivative even exists, which your question presumes).

Since you care about a particular $z=x_0$, the limit will just be a number, say, $c=g'(x_0)$.

Next, we have the issue of exchanging the limit and the integral. Let $f_{Y|X=z}(y):=h_z(y)$ be the density of $Y|X=z$. We need to show:

$$\lim_{z' \to z^+} \int \frac{y[h_{z'}(y)-h_{z}(y)]}{z'-z}dy= \int \lim_{z' \to z^+} \frac{y[h_{z'}(y)-h_{z}(y)]}{z'-z}dy$$

Indeed, this is where things are more technically complicated. In general, you can't interchange limit and integration. But, we can if we satisfy the Dominated Convergence Theorem:

Let's define a sequence $z_n$ with $z_n\geq z$ and $\lim_{n\to \infty} z_n = z$, and let:

$$f_n(y):=\frac{y[h_{z_n}(y)-h_{z}(y)]}{z_n-z}$$

Then we need to show:

$$\exists g(y): |f_n(y)|\leq g(y)\;\forall n,y\; \textrm{and}\; \int |g|dy <\infty$$

If we assume that $\delta Y_{z,z'} \xrightarrow{d} Y'_z$ and both possess smooth probability distributions (no discontinuities), then $h_{z_n} \to h_z$ pointwise. Now we need to construct a bounding function $g$.

At this point we need additional assumptions on the distribution of $Y|X$. For example, if $Y$ is bounded, then issues of integrability go away. For unbounded $Y$, we need some information on how $h_{z'}$ converges to $h_z$: is it just pointwise, or uniform as well? (The latter is needed to allow exchange of the limit and integral.)

So, I guess there was a bit more than simple Calc I here... and the result may not be satisfactory in a theoretical sense.

Here's a practical suggestion:

Construct a series of relationships $Y(X)$ that cover the range of relationship types you expect (e.g., exponential, linear, log, sinusoidal, sigmoid). Program a script to generate random samples of $(Y,X)$ (with some distribution or lattice on $X$), have it estimate the expected value of the derivative using the kernel method, and compare that to the theoretical expected derivative (you will need to decide on a distribution for the stochastic/random part of $Y(X)$).

You will be able to do this comparison across an entire range of $X$ and over many repetitions. This will give you a reliable way to see whether the method works for the kinds of problems you expect. Unfortunately, very few methods work optimally in all cases, but the kernel method should get you close for reasonably behaved functions.
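A minimal sketch of such a script, assuming a sinusoidal relationship $Y = \sin(X) + \varepsilon$ (so the theoretical expected derivative is $\cos(x)$) and the Nadaraya-Watson derivative described above; the function name, sample size, bandwidth, and noise level are all illustrative choices:

```python
import numpy as np

def nw_derivative(x0, x, y, sigma):
    """Nadaraya-Watson derivative estimate at x0 (Gaussian kernel, quotient rule)."""
    u = x - x0
    w = np.exp(-u**2 / (2 * sigma**2))   # kernel weights (constant factor cancels)
    dw = w * u / sigma**2                # d/dx0 of each weight
    S0, S1 = w.sum(), (w * y).sum()
    return ((dw * y).sum() * S0 - S1 * dw.sum()) / S0**2

rng = np.random.default_rng(42)
n, noise_sd, bandwidth = 2000, 0.1, 0.3          # assumed settings, not prescriptive
x = rng.uniform(0, 2 * np.pi, n)
y = np.sin(x) + noise_sd * rng.normal(size=n)    # Y = sin(X) + additive noise

grid = np.linspace(1.0, 5.0, 9)                  # interior points, away from boundary bias
est = np.array([nw_derivative(g, x, y, bandwidth) for g in grid])
err = np.abs(est - np.cos(grid))                 # compare to theoretical E[dY/dX] = cos(x)
print(err.max())
```

Sweeping the bandwidth and noise level in this harness is a quick way to see the bias-variance tradeoff the kernel method entails.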