My question refers to chapter 2 of the Natural Evolution Strategies paper. I want to understand the derivation of the search gradients.
Given z, a solution vector sampled from a probability distribution function π(z,θ). θ denotes the parameters of density π(z | θ) and f(z) denotes the fitness function for samples z. The idea is to maximise the expected value of the fitness score of a sampled solution.
We can write the expected fitness under the search distribution as
If we apply a probabilistic identity trick, where we multiply our expressions by one (formed by the division of a probability density by itself), we can express the search gradients the following way:
Then obtain the estimate of the search gradient from samples z1...zλ:
Here I am failing to understand two things:
1) The step of gradient derivation where we replace the integral with the expectation E (the first question mark). From school I remember that integral can be found by summation:

With a fairly small step it gives accurate value of the integral. So for sin(x) we could calculate the integral as
sin(x1)*dx + sin(x2)*dx + ... sin(xN)*dx + ...
Similarly we could rewrite the last expression with the integral (underscored yellow) as
[f(z1)*∇log(p(z1))*p(z1)]*dz + [f(z2)*∇log(p(z2))*p(z2)]*dz + ... [f(zN)*∇log(p(zN))*p(zN)]*dz + ...
On the other hand, the expected value E[f(z) * ∇log π(z,θ)], which is said to be equal to the formula with the integral, is calculated as
E[f(z) * ∇log π(z,θ)] = [f(z1)*∇log p(z1)]*p(z1) + [f(z2)*∇log p(z2)]*p(z2) + ...
Thus, the expected value differs from the integral by the summation step (dz). Could somebody explain what I am missing?
2) The second thing which is unclear to me is the approximation of the search gradient (the second question mark in the picture). We calculate the expected value by summing up [f(zk) * ∇log π(zk,θ)] and divide it by the number of samples λ. In my understanding it makes sense only if p(zk) = 1/λ for all k samples. That means z must be uniformly distributed, however z is sampled from π(z,θ) which may not necessarily be a uniform distribution.


