I recently went through the exercise of using Bayesian probability to figure out the best estimate of the probability of "heads" given that [itex]H[/itex] tosses yielded heads out of [itex]N[/itex] trials. The derivation was enormously complicated, but the answer was very simple: [itex]p = \frac{H+1}{N+2}[/itex]. In the limit as [itex]N \rightarrow \infty[/itex] and [itex]H \rightarrow \infty[/itex], this approaches the relative frequency, [itex]\frac{H}{N}[/itex], but it is actually better behaved. Before you ever toss the coin, with [itex]N = H = 0[/itex], the Bayesian estimate gives [itex]p = \frac{1}{2}[/itex]. If you get heads on the first toss, this estimate gives [itex]p = \frac{2}{3}[/itex], rather than the relative frequency estimate, [itex]p = 1[/itex].
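(Not part of the original argument, but here is a quick Python sketch of how the two estimates compare as data accumulate. The true bias of 0.7 and the number of tosses are just illustrative choices.)

[code]
import random

random.seed(0)
true_bias = 0.7          # illustrative value for the simulation
heads = 0
for n in range(1, 21):
    heads += random.random() < true_bias   # simulate one toss
    bayes = (heads + 1) / (n + 2)          # Bayesian estimate (H+1)/(N+2)
    freq = heads / n                       # relative frequency H/N
    print(f"N={n:2d}  H={heads:2d}  Bayesian={bayes:.3f}  frequency={freq:.3f}")
[/code]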
I should probably explain what I mean by "the best estimate of the probability". I start off assuming that each coin has a parameter--I'm going to call it [itex]B[/itex], for bias--that characterizes the coin tosses. The model is that:
[itex]P(\text{heads} \mid B) = B[/itex]
So the bias is just the probability of heads on a single toss. But I'm treating it as a parameter of the model. As a parameter, it has a range of possible values, [itex]0 \leq B \leq 1[/itex]. If I have no idea what the value of [itex]B[/itex] is, I can use the least informative prior, which is to assume that [itex]B[/itex] is uniformly distributed on the interval [itex][0,1][/itex].
That's kind of an odd concept--we're talking about the probability of a probability--but let's go on.
So we toss the coin [itex]N[/itex] times and get [itex]H[/itex] heads. Then Bayesian updating tells us the posterior probability distribution for [itex]B[/itex], given that data. The rule is (letting [itex]E(H,N)[/itex] be the fact that I got [itex]H[/itex] heads when I flipped the coin [itex]N[/itex] times):
[itex]P(B | E(H,N)) = \frac{P(E(H,N)| B) P(B)}{P(E(H,N))}[/itex]
where [itex]P(E(H,N) | B)[/itex] is the probability of [itex]E(H,N)[/itex] given [itex]B[/itex], [itex]P(B)[/itex] is the prior probability density of [itex]B[/itex] (which is just 1 for the least informative prior), and [itex]P(E(H,N))[/itex] is the prior probability of [itex]E(H,N)[/itex], not knowing anything about [itex]B[/itex].
These can be computed readily enough:
[itex]P(B) = 1[/itex]
[itex]P(E(H,N) | B) = B^H (1-B)^{N-H} \frac{N!}{H! (N-H)!}[/itex]
[itex]P(E(H,N)) = \int dB P(B) P(E(H,N)|B) = \frac{N!}{H! (N-H)!} \int dB\ B^H (1-B)^{N-H}[/itex]
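(As an aside, you can check this update numerically. Here is a rough Python sketch that discretizes [itex]B[/itex] on a grid, multiplies the uniform prior by the likelihood above, and normalizes; the values [itex]N = 10[/itex], [itex]H = 7[/itex], and the grid size are just illustrative choices.)

[code]
from math import comb

N, H = 10, 7                                   # illustrative data: 7 heads in 10 tosses
steps = 1000
dB = 1 / steps
grid = [(i + 0.5) * dB for i in range(steps)]  # values of B in (0, 1)
prior = [1.0 for _ in grid]                    # uniform prior density
likelihood = [comb(N, H) * b**H * (1 - b)**(N - H) for b in grid]
evidence = sum(p * l for p, l in zip(prior, likelihood)) * dB   # approximates P(E(H,N))
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]
print(evidence)   # should be close to 1/(N+1), per the exact result below
[/code]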
That last integral is hard to do, but it's done here:
https://math.stackexchange.com/questions/86542/prove-binomnk-1-n1-int-01xk1-xn-kdx-for-0-leq-k-le
[itex]\int dB\ B^H (1-B)^{N-H} = \frac{H! (N-H)!}{(N+1)!}[/itex]
That gives: [itex]P(E(H,N)) = \frac{1}{N+1}[/itex]
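(Another quick numerical check, not part of the derivation: the integral identity, and the resulting [itex]P(E(H,N)) = \frac{1}{N+1}[/itex], for a few small cases in Python.)

[code]
from math import comb, factorial

def beta_integral(H, N, steps=100000):
    """Riemann-sum approximation of the integral of B^H (1-B)^(N-H) over [0, 1]."""
    dB = 1 / steps
    return sum(((i + 0.5) * dB)**H * (1 - (i + 0.5) * dB)**(N - H)
               for i in range(steps)) * dB

for N, H in [(1, 0), (5, 2), (10, 7)]:
    exact = factorial(H) * factorial(N - H) / factorial(N + 1)
    # numeric integral, exact value, resulting P(E(H,N)), and 1/(N+1)
    print(N, H, beta_integral(H, N), exact, comb(N, H) * exact, 1 / (N + 1))
[/code]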
So our posterior probability distribution for [itex]B[/itex] is:
[itex]P(B|E(H,N)) = \frac{(N+1)!}{H! (N-H)!} B^H (1-B)^{N-H}[/itex]
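(For what it's worth, this posterior is exactly the Beta distribution with parameters [itex]H+1[/itex] and [itex]N-H+1[/itex]. That identification isn't needed for the argument, but it makes a nice check. Here is a short Python comparison, assuming SciPy is available; [itex]N[/itex] and [itex]H[/itex] are illustrative.)

[code]
from math import factorial
from scipy.stats import beta

N, H = 10, 7      # illustrative data

def posterior(B):
    """The posterior density derived above."""
    return factorial(N + 1) / (factorial(H) * factorial(N - H)) * B**H * (1 - B)**(N - H)

for B in (0.1, 0.5, 0.7, 0.9):
    print(B, posterior(B), beta.pdf(B, H + 1, N - H + 1))   # the two columns should agree
[/code]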
Now, we compute [itex]\langle B \rangle_{E(H,N)}[/itex], which is the expected value of [itex]B[/itex], given [itex]E(H,N)[/itex]. The formula for expectation values is:
[itex]\langle B \rangle_{E(H,N)} = \int dB\ B\ P(B | E(H,N)) = \frac{(N+1)!}{H! (N-H)!} \int dB\ B^{H+1} (1-B)^{N-H}[/itex]
We can write: [itex]\int dB\ B^{H+1} (1-B)^{N-H} = \int dB\ B^{H+1} (1-B)^{(N+1)-(H+1)} = \frac{(H+1)! (N-H)!}{(N+2)!}[/itex], using the same identity as above with [itex]H+1[/itex] in place of [itex]H[/itex] and [itex]N+1[/itex] in place of [itex]N[/itex]. So we can immediately write:
[itex]\langle B \rangle_{E(H,N)} = \frac{(N+1)!}{H! (N-H)!} \frac{(H+1)! (N-H)!}{(N+2)!} = \frac{H+1}{N+2}[/itex]
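(One last sanity check, again not part of the derivation: computing the posterior mean numerically on a grid and comparing it to [itex]\frac{H+1}{N+2}[/itex]. The values of [itex]N[/itex] and [itex]H[/itex] are illustrative.)

[code]
N, H = 10, 7
steps = 100000
dB = 1 / steps
grid = [(i + 0.5) * dB for i in range(steps)]
weights = [b**H * (1 - b)**(N - H) for b in grid]        # unnormalized posterior
mean = sum(b * w for b, w in zip(grid, weights)) / sum(weights)
print(mean, (H + 1) / (N + 2))                           # should agree closely
[/code]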
Like I said, very simple result that is very complicated to derive.