Say I'm trying to set a bayesian prior for a Bernoulli trial of coin flips.
The equation I'm interested in is the $p(x|I)$ from the Bayesian numerator: $$P(x|\text{data})\propto x^{N_H}(1-x)^{N_T}\,p(x|I)$$ where $I$ is the background information.
NOTE: The Bayes denominator takes the form (for a uniform prior $p(x|I)=1$): $$\int_0^1{x^{N_H}(1-x)^{N_T}\,dx} = \frac{\Gamma(N_H+1)\Gamma(N-N_H+1)}{\Gamma(N+2)}$$ Alternatively, if I choose the conjugate prior $p(x|I)=x^{\alpha}(1-x)^\beta$, the form of the Bayes denominator stays the same: $$\int_0^1{x^{N_H+\alpha}(1-x)^{N_T+\beta}\,dx} = \frac{\Gamma(N_H+\alpha+1)\Gamma(N-N_H+\beta+1)}{\Gamma(N+\alpha+\beta+2)}$$
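As a sanity check on that closed form, the integral can be compared against direct numerical integration. This is just a sketch assuming SciPy is available; any quadrature routine would do:

```python
from math import gamma
from scipy.integrate import quad

N_H, N_T = 15, 20
N = N_H + N_T

# Numerically integrate the Bayes denominator for a uniform prior
numeric, _ = quad(lambda x: x**N_H * (1 - x)**N_T, 0, 1)

# Closed form: Gamma(N_H+1) * Gamma(N-N_H+1) / Gamma(N+2)
closed = gamma(N_H + 1) * gamma(N - N_H + 1) / gamma(N + 2)

print(numeric, closed)  # the two values agree
```

The same check works for the conjugate-prior version by adding $\alpha$ and $\beta$ to the exponents.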
Now, say I'm interested in using this conjugate prior with the following data:
HTHTTHTTTHHTHTHTTTTHHTHTTTHTHTHHTTH $$N=35, N_H=15, N_T=20$$
I feel like the way I choose the $\alpha$ and $\beta$ parameters for $p(x|I)=x^{\alpha}(1-x)^\beta$ is to calculate the mean and stdev for the given data and then somehow "fit" those values to the prior. But I can't seem to figure out the mechanics of doing so.
Hoping someone can put me on the right track.
According to a usual Bayesian analysis, the data ($n=35$ tosses with $x = 15$ Heads and $n-x = 20$ Tails, from a binomial distribution with unknown success probability $\theta$) are expressed in terms of the likelihood $$p(x|\theta) \propto \theta^x(1-\theta)^{n-x} = \theta^{15}(1-\theta)^{20}.$$
The parameters $\alpha$ and $\beta$ of the conjugate prior distribution $p(\theta) \propto \theta^{\alpha -1}(1-\theta)^{\beta - 1}$ are chosen based on prior information. This may come from past experience with coins (some people say actual physical coins seldom have heads probabilities much different from 1/2), faith in the honesty of the person who supplied the coin, and so on. Accordingly, you might pick a relatively non-informative prior: $\alpha = \beta = 1$ (uniform) or $\alpha = \beta = 1/2$ (Jeffreys); see section 2 of this Wikipedia article.
By contrast, if you have a vague hunch the coin might be slightly biased in favor of heads, then you might pick an informative beta prior $\mathsf{Beta}(\alpha = 22, \beta = 16),$ which has mean $\mu = 22/38 \approx 0.58$ and puts roughly 95% of its probability in the interval $(0.42, 0.73),$ as can be confirmed with statistical software.
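For instance, a quick computation along these lines (a sketch assuming SciPy; R's `qbeta` would work the same way):

```python
from scipy.stats import beta

# Informative prior reflecting a mild hunch toward heads: Beta(22, 16)
a, b = 22, 16
prior_mean = beta.mean(a, b)                  # alpha / (alpha + beta) ~ 0.58
lo, hi = beta.ppf([0.025, 0.975], a, b)       # central 95% interval ~ (0.42, 0.73)
print(prior_mean, (lo, hi))
```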
Then you would obtain the posterior density as proportional to the product of the prior and the likelihood. Using the informative prior above, this would give
$$p(\theta|x) \propto p(\theta) \times p(x|\theta) \propto \theta^{22 - 1}(1-\theta)^{16-1} \times \theta^{15}(1-\theta)^{20} = \theta^{37-1}(1-\theta)^{36-1},$$
where we notice that the right hand expression is the kernel (density without normalization constant) of the distribution $\mathsf{Beta}(\alpha=37,\,\beta=36).$ In that case a 95% Bayesian credible (probability) interval for $\theta$ is $(0.39, 0.62).$
If you had used the non-informative uniform prior ($\alpha=\beta=1$), the posterior would be $\mathsf{Beta}(\alpha=16,\,\beta=21)$ and the 95% Bayesian interval estimate of $\theta$ would be $(0.27, 0.59).$
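Both credible intervals are just beta quantiles, so they are easy to verify. A sketch assuming SciPy (note that the flat $\mathsf{Beta}(1,1)$ prior gives the posterior $\mathsf{Beta}(15+1,\,20+1)$):

```python
from scipy.stats import beta

# Posterior under the informative Beta(22, 16) prior: Beta(37, 36)
informative = beta.interval(0.95, 37, 36)   # ~ (0.39, 0.62)

# Posterior under the flat Beta(1, 1) prior: Beta(16, 21)
flat = beta.interval(0.95, 16, 21)          # ~ (0.27, 0.59)

print(informative, flat)
```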
Both Bayesian interval estimates include $\theta = 0.5,$ but the informative prior information has 'melded' with the binomial data to give a slightly different result than with the noninformative or 'flat' prior. The interval estimate from the flat prior is almost entirely due to the data. [The frequentist Agresti-Coull 95% confidence interval based just on the data is $(0.28, 0.59)$.]
In practice, there is usually no precise way to turn a pre-experiment 'hunch' into a prior distribution that reflects that hunch. Useful facts for beta priors are that the mean is $\mu = \frac{\alpha}{\alpha + \beta}$ and the variance is $\sigma^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)},$ and with software such as R, it is easy to find priors that put a 'reasonable' amount of probability into a 'reasonable' interval. In continuing, sequential investigations, one might use the posterior distribution for one phase of experimentation as the prior distribution for the next.
Reasons for using a conjugate beta prior (to go with the binomial likelihood) are that the mathematics is simple, it is not necessary to deal with the 'denominator' of Bayes' Theorem, and the posterior distribution is easily recognized.