mfb said:
How did you get that number?
We end a sequence with A, that increases the probability that the next sequence starts directly (and gives a deviation larger than the 2% I estimated earlier - which then is even more increased by the fact that we look for ~3sigma outliers). In the extreme case, something like "AAAA" would give a much higher rate because the next sequence has 1/4 probability to start right after the previous one.
I made a 50-state markov chain in excel and I get 0.009123921 for at least 10 occurences.
Other probabilities:
0: 0.017
1: 0.070
2: 0.142
3: 0.192
4: 0.195
5: 0.158
6: 0.107
7: 0.062
8: 0.031
9: 0.014
I viewed occurrences of AGGA as a
delayed renewal process. The time (number of spaces) until the first occurrence is ##T_0##, which has pmf ##f_0 = (3/4) f_1 + (1/4) f_2##, while the times between successive later occurrences has pmf ##f_2##. Here, ##f_1, f_2 ## are the first-passage pmfs from states 1 and 2 to state 5 in the following 5-state Markov chain.
The Markov chain.
State 1 = start, or start over; state 2 = A (first A after starting), state 3 = AG, state 4 = AGG, and state 5 = AGGA (stop).
The 1-step transition probability matrix is
P = \pmatrix{3/4 & 1/4 & 0 & 0 & 0\\ 1/2 & 1/4 & 1/4 & 0 & 0 \\ 1/2 & 1/4 & 0 & 1/4 & 0 \\ 3/4 & 0 & 0 & 0 & 1/4 \\ 0 & 0 & 0 & 0 & 1}
Let ##f_i(n)## denote the probability that ##n## is the first time to reach state 5 from state ##i##. We have ##f_i(1) = p_{i,5}, i = 1,2,3,4## and
f_i(n+1) = \sum_{j < 5} p_{ij} f_j(n), \; i=1, \ldots, 4
This can be implemented numerically by letting ##f(n) = ## column vector ##(f_i(n), i=1,\ldots, 4)##, and by letting ##Q=## the ##4 \times 4## submatrix of ##P## consisting of rows and columns 1--4. We then have
f(n+1) = Q f(n), n = 1,2, \ldots
with ## f(1) = (0,0,0,1/4)^T ##. The pmf of ##T_1## is ##\{ f_1(n), n = 1,2, \ldots \}##, while the pmf of ##T_2## is ##\{ f_2(n), n = 1,2, \ldots \}##. We just compute these up to ##n = 1230##.
To find the probability that 10 occurrences take place in the interval ##t = 1, 2, \ldots, 1027## we start by letting
S_{10} = T_0 + \sum_{j=1}^{9} T'_j,
which is the number of steps needed to reach 10 occurrences of the pattern. Here, ##T_0, T'_1, T'_2, \ldots ## are independent, ##T_0## has df ##f_0## and the ##T'_k## have df ##f_2##. We take ##f_0 = (3/4) f_1 + (1/4) f_2## because if the first letter is not 'A' we start the Markov chain from state 1, while if the first letter is 'A' we start from state 2. The event ##\{ S_{10} \leq 1027 \} ## occurs when the 10th pattern is on or before time 1027 (so the number occurring on or before 1027 is ##\geq 11##). The event ##\{ \text{No. of occurrences } = 10 \} = \{ S_{10} \leq 1027 \} \cap \{ S_{11} > 1027\}##, so has probability
\sum_{j=1}^{1027} P(S_{10} = j) P(T_2 > 1027 - j) = \sum_{j=1}^{1027} f_{10}(j) \left(1-\sum_{k=1}^{1027-j} f_2(k) \right)
I just did all the computations in Maple.