Derivation of Search Gradients for Natural Evolution Strategies


My question refers to chapter 2 of the Natural Evolution Strategies paper. I want to understand the derivation of the search gradients.

Let z be a solution vector sampled from a probability density π(z | θ), where θ denotes the parameters of the density and f(z) denotes the fitness of a sample z. The idea is to maximise the expected fitness of a sampled solution.

We can write the expected fitness under the search distribution as

J(θ) = E_θ[f(z)] = ∫ f(z) π(z | θ) dz

If we multiply the integrand by one, written as a probability density divided by itself, and use the identity ∇_θ log π(z | θ) = ∇_θ π(z | θ) / π(z | θ), we can express the search gradient as follows:

∇_θ J(θ) = ∫ f(z) ∇_θ π(z | θ) dz
         = ∫ f(z) ∇_θ π(z | θ) * (π(z | θ) / π(z | θ)) dz
         = ∫ [f(z) ∇_θ log π(z | θ)] π(z | θ) dz
         = E_θ[f(z) ∇_θ log π(z | θ)]

From samples z1, ..., zλ we then obtain a Monte Carlo estimate of the search gradient:

∇_θ J(θ) ≈ (1/λ) Σ_{k=1}^{λ} f(zk) ∇_θ log π(zk | θ)
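For concreteness, here is a minimal sketch of this estimator (my own toy example, not from the paper), using a one-dimensional Gaussian search distribution N(μ, σ²), for which ∇_μ log π(z | μ, σ) = (z − μ)/σ²:

```python
import numpy as np

def fitness(z):
    # toy fitness, maximised at z = 3 (my own choice for illustration)
    return -(z - 3.0) ** 2

def search_gradient_mu(mu, sigma, lam=100_000, seed=0):
    """Monte Carlo estimate of d/dmu E[f(z)] for z ~ N(mu, sigma^2)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(mu, sigma, size=lam)        # samples z_1 ... z_lambda from pi(z | theta)
    grad_log_pi = (z - mu) / sigma ** 2        # gradient of log N(z | mu, sigma^2) w.r.t. mu
    return np.mean(fitness(z) * grad_log_pi)   # (1/lambda) * sum of f(z_k) * grad log pi(z_k)

print(search_gradient_mu(mu=0.0, sigma=1.0))
```

With this f the exact gradient is d/dμ E[f] = -2(μ - 3), so at μ = 0 the estimate should come out near 6.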

Here I am failing to understand two things:

1) The step of the derivation where the integral is replaced by the expectation E. From school I remember that an integral can be approximated by a sum over small steps:

∫ f(x) dx ≈ f(x1)*dx + f(x2)*dx + ... + f(xN)*dx

With a sufficiently small step this gives an accurate value of the integral. So for sin(x) we could calculate the integral as

sin(x1)*dx + sin(x2)*dx + ... + sin(xN)*dx

Similarly we could rewrite the last line of the derivation, the one containing the integral, as

[f(z1)*∇log p(z1)*p(z1)]*dz + [f(z2)*∇log p(z2)*p(z2)]*dz + ... + [f(zN)*∇log p(zN)*p(zN)]*dz

(writing p(z) as shorthand for π(z | θ)).

On the other hand, the expected value E[f(z) * ∇log π(z | θ)], which is said to be equal to the integral expression, is calculated as

E[f(z) * ∇log π(z | θ)] = [f(z1)*∇log p(z1)]*p(z1) + [f(z2)*∇log p(z2)]*p(z2) + ...

Thus, the expected value differs from the integral by the summation step (dz). Could somebody explain what I am missing?
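To make the comparison concrete, here is a small numerical sketch of both quantities (my own toy setup, not from the paper: π is N(0, 1), f(z) = -(z - 3)², and the gradient is taken with respect to the mean μ, so ∇log π(z) = z - μ with μ = 0):

```python
import numpy as np

mu, sigma, dz = 0.0, 1.0, 1e-3

def pdf(z):
    # density of N(mu, sigma^2), playing the role of pi(z | theta)
    return np.exp(-(z - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def f(z):
    return -(z - 3.0) ** 2        # toy fitness

def grad_log_pi(z):
    return (z - mu) / sigma ** 2  # gradient of log pi w.r.t. mu

# Riemann sum of the integral: every term carries the step dz
z = np.arange(-8.0, 8.0, dz)
riemann = np.sum(f(z) * grad_log_pi(z) * pdf(z) * dz)

# Sample average of the same expectation, drawing z from pi itself
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, 500_000)
sample_avg = np.mean(f(samples) * grad_log_pi(samples))

print(riemann, sample_avg)
```

When I run this the two numbers agree, so presumably the dz I am worried about is already absorbed into the definition of the expectation over a continuous density, but I would like this confirmed.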

2) The second thing which is unclear to me is the approximation of the search gradient (the second question mark in the picture). We calculate the expected value by summing up [f(zk) * ∇log π(zk,θ)] and divide it by the number of samples λ. In my understanding it makes sense only if p(zk) = 1/λ for all k samples. That means z must be uniformly distributed, however z is sampled from π(z,θ) which may not necessarily be a uniform distribution.