I've taken pretty standard statistics courses, but this semester I decided to take a Bayesian analysis course and I am completely out of my league in terms of the mathematical difficulty. I have this question on an exercise:
Assume $y_1, \ldots, y_n \mid \theta$ are independent and $y_i \sim \operatorname{Poisson}(\theta x_i)$, for $i = 1, \ldots, n$.
Here $x_i$ is the (known) exposure.
- Assume a conjugate $\operatorname{Gamma}(a, b)$ prior
- Show the posterior
$$\theta \mid y_1, \ldots, y_n \sim \operatorname{Gamma}\left(a + \sum_{i=1}^n y_i,\; b + \sum_{i=1}^n x_i\right)$$
If someone could give me more clarity on what the problem is asking of me (in layman's terms), point me in a direction for solving this and similar "show that" questions, and/or suggest material to boost my mathematical understanding, I would be grateful. To be clear, I am not asking for the "answer" to this question, but rather for help understanding what the question itself is asking.
Let's consider a simpler example to illustrate the principles.
Suppose there is a single Poisson variable, $$Y \mid \theta, x \sim \operatorname{Poisson}(\theta x), \\ \Pr[Y = y \mid \theta, x] = e^{-\theta x} \frac{(\theta x)^y}{y!}, \quad y = 0, 1, 2, \ldots,$$ where $\theta$ is the parameter of interest, and $x$ is some known constant. So for instance, if we know $x = 2$, and we observe $Y = 100$, this would suggest that $\theta$ would be "large," at least compared to a scenario where we observe $Y = 1$. The reason is because the Poisson mean is $$\operatorname{E}[Y \mid \theta, x] = \theta x = 2\theta.$$ So we would surmise that if we observe $Y = 100$, $\theta$ is much more likely to be around $50$; but if $Y = 1$, it is more likely that $\theta$ is around $0.5$. So even if we only see one observation from this Poisson distribution, it gives us some information about what $\theta$ could be.
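To make this intuition concrete, here is a small sanity check in Python (using only the standard library; the helper function `poisson_logpmf` is just an illustration, not part of the exercise). It compares the log-probability of observing $Y = 100$ under $\theta = 50$ versus $\theta = 0.5$ when $x = 2$:

```python
from math import lgamma, log

def poisson_logpmf(y, rate):
    # log of P[Y = y] for Y ~ Poisson(rate); lgamma(y + 1) = log(y!)
    return y * log(rate) - rate - lgamma(y + 1)

x = 2.0
# Having observed Y = 100, a value of theta near 50 explains the data far
# better than a value near 0.5:
ll_large = poisson_logpmf(100, 50.0 * x)  # rate = theta * x = 100
ll_small = poisson_logpmf(100, 0.5 * x)   # rate = theta * x = 1
print(ll_large, ll_small)
```

The first log-probability is vastly larger than the second, which is just the observation above restated as a likelihood comparison.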
Now, the central theme in Bayesian inference is that the data we observe are fixed, and the parameters are random variables. The subtlety here is that before the data are observed, they are presumed to follow some kind of parametric distribution. Once the data are realized, they are no longer random but fixed. Subsequent observations update our beliefs about the distribution of the parameters of interest. In order to model this behavior, we suppose that $\theta$ follows some prior distribution, and the act of observing the data allows us to update this distribution; the updated distribution is called the posterior distribution.
So for this question, we presume that $\theta$ follows a gamma distribution with hyperparameters $a, b$: $$\theta \sim \operatorname{Gamma}(a,b), \\ f(\theta \mid a,b) = \frac{b^a \theta^{a-1} e^{-b\theta}}{\Gamma(a)}, \quad \theta > 0.$$ We are free to choose $a, b$ in accordance with our prior beliefs about how $\theta$ is distributed. For example, the prior mean for $\theta$ is $\operatorname{E}[\theta] = a/b$, and the variance is $\operatorname{Var}[\theta] = a/b^2$. So if we are relatively uncertain about $\theta$, we might want to choose a large $a$ and a small $b$.
Now, by Bayes' rule, we have $$f(\theta \mid Y = y, x, a, b) = \frac{\Pr[Y = y \mid \theta, x]f(\theta \mid a, b)}{\Pr[Y = y]}.$$ But since the denominator is not a function of $\theta$, we can also write this as $$f(\theta \mid Y = y, x, a, b) \propto \Pr[Y = y \mid \theta, x]f(\theta \mid a, b).$$ That is to say, the posterior density of $\theta$ given the observation $Y = y$ is proportional to the probability of observing $Y = y$ given $\theta$, times the prior density of $\theta$. This makes sense because the RHS is a likelihood of $\theta$ given the data. Concretely, $$f(\theta \mid Y = y, x, a, b) \propto e^{-\theta x} \frac{(\theta x)^y}{y!} \cdot \frac{b^a \theta^{a-1} e^{-b \theta}}{\Gamma(a)}.$$ The key to understanding this expression is to regard every variable except for $\theta$ as a constant, and trying to see what probability distribution $\theta$ looks like. Specifically, we can ignore any factors that are constants with respect to $\theta$: $$f(\theta \mid Y = y, x, a, b) \propto e^{-\theta x} \theta^y \theta^{a-1} e^{-b \theta} = e^{-(b+x)\theta} \theta^{a+y-1}.$$ This looks like a gamma distribution with shape $a+y$, and rate $b+x$, since such a density would be $$\frac{(b+x)^{a+y} \theta^{a+y-1} e^{-(b+x)\theta}}{\Gamma(a+y)}.$$ The extra factors $(b+x)^{a+y}$ and $\Gamma(a+y)$ are constants with respect to $\theta$. So the posterior distribution of $\theta$ given a single observation $Y = y$ is also gamma distributed, but the posterior hyperparameters are $$a^* = a+y, \quad b^* = b+x.$$
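If you want to convince yourself numerically that the algebra is right, here is a quick check in Python (standard library only; the values of $a, b, x, y$ are just illustrative). Since the posterior is only known up to a constant, we verify that the *ratio* of the unnormalized posterior at two points matches the ratio of $\operatorname{Gamma}(a+y, b+x)$ densities:

```python
from math import lgamma, log

def log_gamma_pdf(theta, a, b):
    # log density of a Gamma(shape=a, rate=b) distribution at theta
    return a * log(b) + (a - 1) * log(theta) - b * theta - lgamma(a)

def log_poisson_pmf(y, rate):
    return y * log(rate) - rate - lgamma(y + 1)

a, b, x, y = 10.0, 0.25, 2.0, 18

def log_unnormalized_posterior(theta):
    # log(likelihood * prior); the normalizing constant is the same for every theta
    return log_poisson_pmf(y, theta * x) + log_gamma_pdf(theta, a, b)

# If the posterior really is Gamma(a + y, b + x), the log-density difference
# between any two points must agree with the unnormalized version:
t1, t2 = 5.0, 15.0
diff_unnorm = log_unnormalized_posterior(t1) - log_unnormalized_posterior(t2)
diff_gamma = log_gamma_pdf(t1, a + y, b + x) - log_gamma_pdf(t2, a + y, b + x)
print(abs(diff_unnorm - diff_gamma))  # effectively zero
```

Working on the log scale avoids overflow in the factorial and gamma-function terms for larger counts.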
For example, suppose $x = 2$ as above. Our prior belief about $\theta$ is uncertain, so say we choose $a = 10$, $b = 0.25$. If we observe $Y = 18$, then our posterior hyperparameters become $$a^* = 10 + 18 = 28, \quad b^* = 0.25 + 2 = 2.25.$$
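The arithmetic of the update is simple enough to check directly. A sketch with the numbers from this example:

```python
# Prior Gamma(a, b); single Poisson(theta * x) observation y (values from the example)
a, b = 10.0, 0.25
x, y = 2.0, 18

a_star, b_star = a + y, b + x     # conjugate update
prior_mean = a / b                # 40.0
posterior_mean = a_star / b_star  # about 12.44

print(a_star, b_star, prior_mean, posterior_mean)
```

Note how the posterior mean ($\approx 12.44$) sits between the prior mean ($40$) and the value suggested by the data alone ($y/x = 9$): the observation has pulled our belief about $\theta$ sharply toward the data.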
So now that we have gone through the simple case, how would we generalize to the case where we have multiple data points? The key is to note that the sum of the observations $$\sum_{i=1}^n Y_i \sim \operatorname{Poisson}\left(\theta \sum_{i=1}^n x_i \right).$$ In other words, the sum of independent Poisson random variables, with rates $\theta x_i$, is Poisson with rate equal to the sum of the rates. Then we can treat the sample as equivalent to a single observation from a Poisson distribution with modified rate, and perform our posterior density calculation on this observation. Applying the single-observation result with $y = \sum_{i=1}^n y_i$ and $x = \sum_{i=1}^n x_i$ gives the posterior hyperparameters $$a^* = a + \sum_{i=1}^n y_i, \quad b^* = b + \sum_{i=1}^n x_i,$$ which is exactly the posterior stated in the question.
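Putting the pieces together, the full update with several observations can be sketched as follows (the data here are hypothetical, and `gamma_poisson_update` is just an illustrative helper name):

```python
def gamma_poisson_update(a, b, ys, xs):
    # Gamma(a, b) prior with independent Poisson(theta * x_i) observations y_i
    # -> Gamma(a + sum(ys), b + sum(xs)) posterior
    return a + sum(ys), b + sum(xs)

# Hypothetical data: three exposures and their observed counts
ys = [3, 7, 5]
xs = [1.0, 2.0, 1.5]
a_star, b_star = gamma_poisson_update(10.0, 0.25, ys, xs)
print(a_star, b_star)  # 25.0 4.75

# Updating one observation at a time gives the same answer, which is
# another way to see why only the sums matter:
a_seq, b_seq = 10.0, 0.25
for y_i, x_i in zip(ys, xs):
    a_seq, b_seq = a_seq + y_i, b_seq + x_i
print(a_seq, b_seq)  # 25.0 4.75
```

The sequential version also illustrates a nice property of conjugate families: yesterday's posterior serves as today's prior, and the order of the observations does not matter.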