Assume we have data $(x_1, \dots, x_n) \sim P(x)$ and a parametric family of densities $p(x\mid\theta)$, and we estimate $\hat \theta$ somehow. Can we estimate the entropy of $P(x)$ as $\hat H(P) = -\frac{1}{n}\sum_i \log p(x_i\mid\hat \theta)$, given that $H(X) = -\mathbb E_{x\sim P}\log P(x)$ and the sample mean is the natural estimator of an expectation? There will of course be bias from the choice of family $p(x\mid\theta)$ and variance from finite $n$, but my questions are:
Is this even a valid approach to estimating entropy: fitting a density model to the data and then evaluating that model on the same data, taking the average negative log-likelihood as the entropy estimate?
Since $\hat \theta$ in MLE is chosen precisely to minimize $\hat H$, does the result converge to an upper bound on the true entropy $H$? Ultimately I want to use this to estimate $D_{KL}(P \mid\mid Q)$ between samples $(x_1, \dots, x_n) \sim P(x)$ and $(y_1, \dots, y_n) \sim Q(y)$ using a good parametric density model $p(x\mid\theta)$.
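As a quick numerical sanity check of the plug-in estimator described above (a toy sketch where the family contains the truth: $P$ is Gaussian and $p(x\mid\theta)$ is the Gaussian family; all variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from P = N(0, sigma^2); the true entropy has the closed form
# H(P) = 0.5 * log(2 * pi * e * sigma^2).
sigma = 2.0
x = rng.normal(0.0, sigma, size=100_000)

# MLE fit of the Gaussian family p(x | theta), theta = (mu, s).
mu_hat, s_hat = x.mean(), x.std()

# Plug-in estimate: average negative log-likelihood on the *same* data.
H_hat = 0.5 * np.log(2 * np.pi * s_hat**2) \
    + ((x - mu_hat) ** 2).mean() / (2 * s_hat**2)

H_true = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(H_hat, H_true)  # close to each other for this well-specified toy case
```

When the family is misspecified, the same estimator converges to $H(P) + D_{KL}(P \mid\mid p_{\theta^*})$ instead, i.e. it overshoots by the model's approximation error.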
I found a related discussion here stating that the MLE loss converges to the sum of the entropy of the true distribution and the KL divergence between the true distribution and our estimate:
$$-\frac{1}{n}\sum_{i} \log p(x_i|\theta) = \mathcal L(\theta, X) \to_{n} D_{KL}(P(x) \mid\mid p(x\mid\theta)) + H(P(x))$$
as the number of samples grows. But I don't see how to apply this to the cross-entropy $H(P, Q) = - \mathbb E_{x\sim P}\log Q(x)$, which is also needed to compute $D_{KL}$. Does the analogous statement hold:
$$ -\frac{1}{n}\sum_{x_i \sim P} \log q(x_i\mid\theta) \to_n D_{KL}(Q(x) \mid\mid q(x\mid\theta)) + H(P(x), Q(x))\,?$$
No, actually, it does not: the samples come from $P$ while $q$ is fitted to $Q$, so the left-hand side converges to the cross-entropy $H(P, q_\theta)$, which does not decompose into that sum in general. I have no idea how to tackle this then.
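Concretely, the $D_{KL}$ estimator I have in mind would fit separate models to the two samples and average the log-density ratio over the $P$-sample. A toy sketch, again assuming Gaussian families so the answer is known in closed form (helper names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)

# P = N(0, 1), Q = N(1, 1): closed-form D_KL(P || Q) = (mu_P - mu_Q)^2 / 2 = 0.5.
xp = rng.normal(0.0, 1.0, size=200_000)  # sample from P
xq = rng.normal(1.0, 1.0, size=200_000)  # sample from Q

def fit_gauss(x):
    """Gaussian MLE: returns (mean, std)."""
    return x.mean(), x.std()

def log_density(x, mu, s):
    """Gaussian log-density log N(x; mu, s^2)."""
    return -0.5 * np.log(2 * np.pi * s**2) - (x - mu) ** 2 / (2 * s**2)

mu_p, s_p = fit_gauss(xp)  # p(x | theta_P) fitted on the P-sample
mu_q, s_q = fit_gauss(xq)  # q(x | theta_Q) fitted on the Q-sample

# D_KL(P || Q) ~ E_{x~P}[log p(x) - log q(x)], estimated on the P-sample.
kl_hat = (log_density(xp, mu_p, s_p) - log_density(xp, mu_q, s_q)).mean()
print(kl_hat)  # close to 0.5 for this toy pair
```

The bias of this estimate is governed by how well both fitted models approximate their targets; with misspecified families the model-approximation errors of $p$ and $q$ need not cancel.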