Minimization of Expected Value


I'd like to know how to minimize, with respect to $\hat{y}(x)$, $$ \DeclareMathOperator{\Tr}{Tr} \mathbb{E}_{p(x,y)}[(\hat{y}(x)-y)^2 + (\hat{y}(x)-y)\Tr(\nabla^2_x\hat{y}(x)) + ||\nabla_x\hat{y}(x)||^2_2], $$ where $x$ is a vector and $y$ and $\hat{y}(x)$ are scalars.

I googled functional derivatives and read a little about the calculus of variations. I now know how to minimize functionals of the form $$ \theta(y(t)) = \int_0^T F(t,y(t),y'(t))\,dt $$ using the Euler–Lagrange equation $$ F_y - \frac{d}{dt} F_{y'} = 0. $$ The problem is that my expression has a completely different form: $$ \int_Y\left[\int_X p(x,y)\,F(y,\hat{y}(x),\nabla_x\hat y(x),\nabla_x^2\hat y(x))\,dx\right] dy. $$ There is a gradient, a Hessian, and $x$ is a vector!

Is there a general method to minimize functionals like this?
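For what it's worth, the one-dimensional Euler–Lagrange equation above can be checked mechanically. Here is a short sketch (my own example, not from the book) using SymPy's `euler_equations` helper: for the arc-length functional $F = \sqrt{1 + y'(t)^2}$ the equation should force $y''(t) = 0$, i.e. straight lines.

```python
# Sanity check of the 1-D Euler-Lagrange equation F_y - d/dt F_{y'} = 0,
# using SymPy's euler_equations helper (my own sketch, not from the book).
import sympy as sp
from sympy.calculus.euler import euler_equations

t, a, b = sp.symbols('t a b')
y = sp.Function('y')

# Arc-length functional: F = sqrt(1 + y'(t)^2)
F = sp.sqrt(1 + y(t).diff(t)**2)
(eq,) = euler_equations(F, y(t), t)
print(eq)  # the resulting equation reduces to y''(t) = 0

# Any straight line y = a*t + b should satisfy it identically.
residual = eq.lhs.subs(y(t), a*t + b).doit()
assert sp.simplify(residual) == 0
```

So the stationary curves are exactly the straight lines, as expected for shortest paths in the plane.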

Note: I found this problem in a book about deep learning (see page 216).

Best answer

We can minimize the functional by computing its Gâteaux derivative (a generalization of the directional derivative) along an arbitrary perturbation $h(x)$ and setting it to zero. Note that the derivative must vanish for *every* choice of $h(x)$.

Writing the functional with the coefficient $\nu$ from the book, $$ \DeclareMathOperator{\Tr}{Tr} \theta(\hat y) = \mathbb{E}_{p(x,y)}\big[(\hat y - y)^2+\nu(\hat y-y)\Tr(\nabla_x^2\hat y) + \nu||\nabla_x\hat y||^2\big], $$ the Gâteaux derivative in the direction $h$ is $$ d_h\theta(\hat y) = \left.\frac{d}{d\alpha}\theta(\hat y + \alpha h)\right|_{\alpha=0} = $$ $$ \left.\frac{d}{d\alpha}\mathbb{E}\big[(\hat y+\alpha h-y)^2 + \nu(\hat y + \alpha h-y)\Tr(\nabla_x^2\hat y + \alpha\nabla_x^2 h) + \nu||\nabla_x\hat y + \alpha\nabla_x h||^2\big]\right|_{\alpha=0}. $$ Differentiating and setting $\alpha=0$, every term carrying a factor of $\nu$ can be lumped into $O(\nu)$: $$ d_h\theta(\hat y) = \mathbb{E}\big[2(\hat y-y)h + O(\nu)\big] = 0 \iff \mathbb{E}\big[(\hat y - y)h + O(\nu)\big] = 0. $$ Let's rewrite the expectation more explicitly (integrating the $\nu$-terms by parts so that $h(x)$ factors out; their precise form does not matter at order $\nu$): $$ \int_X\int_Y p(x,y)\big((\hat y(x)-y)h(x) + O(\nu)\big)\,dy\, dx = $$ $$ \int_X h(x)\left[\int_Y p(x,y)\big(\hat y(x)-y + O(\nu)\big)\,dy\right]dx = 0. $$ Since $h$ is arbitrary, the bracketed inner integral must vanish for (almost) every $x$ — this is the fundamental lemma of the calculus of variations: $$ \int_Y p(x,y)\big(\hat y(x)-y + O(\nu)\big)\,dy = 0. $$ Factoring $p(x,y) = p(x)\,p(y|x)$ and rearranging, $$ p(x)\big(\hat y(x) + O(\nu)\big)\int_Y p(y|x)\,dy = p(x)\int_Y p(y|x)\,y\, dy, $$ and since $\int_Y p(y|x)\,dy = 1$, $$ \hat y(x) = \mathbb{E}_{p(y|x)}[y] + O(\nu), $$ which is the result in the book.
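A quick numerical sanity check of the limiting case (my own sketch, not from the book): for $\nu = 0$ the objective reduces to $\mathbb{E}[(\hat y(x)-y)^2]$, whose minimizer is the conditional mean $\mathbb{E}_{p(y|x)}[y]$. With simulated data $y = x^2 + \varepsilon$, the predictor $\hat y(x) = x^2$ should beat any perturbed predictor:

```python
# Monte Carlo check that the conditional mean minimizes E[(yhat(x) - y)^2]
# (the nu = 0 case of the derivation above; my own sketch).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 100_000)
y = x**2 + rng.normal(0.0, 0.5, x.size)    # E[y | x] = x^2, noise variance 0.25

def loss(yhat):
    """Monte Carlo estimate of E[(yhat - y)^2]."""
    return np.mean((yhat - y)**2)

best = loss(x**2)                          # loss at the conditional mean
for eps in (0.1, -0.2, 0.5):
    assert loss(x**2 + eps) > best         # shifted predictors do worse
    assert loss(x**2 + eps * x) > best     # tilted predictors do worse
print(f"loss at conditional mean: {best:.3f}")  # close to the noise variance
```

The loss at the conditional mean approaches the noise variance ($0.25$ here), and every perturbation strictly increases it, consistent with $\hat y(x) = \mathbb{E}_{p(y|x)}[y] + O(\nu)$.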