# Probability question

1. Oct 30, 2008

### dabd

Hi,

I have the following probability question:

Given a random bit string S of length m, if I take a copy of a proper suffix of S (a suffix of S that is neither equal to S, nor empty), and align it with the start of the string, how many consecutive matches starting from the beginning of the string do I expect to see on average?

For example:
Given
S[1..9] = 101010110
If I take the suffix starting from position 3: S[3..9] = 1010110
and align it with S:
S[1..9] = 101010110
S[3..9] = 1010110

How many consecutive matches on average starting from position 1 do I expect to see?

My guess is that since for each position the probability that a symbol matches another is 1/2, half of the sequence would be matches, but I am not sure of this since this line of thought doesn't take in account the consecutive ordering of the sequence of matches.

Thanks.

2. Oct 30, 2008

### Office_Shredder

Staff Emeritus
Basically, you're comparing the element in position k with the element in position k+r for some r>0. If the events are independent, whether they are equal has probability 1/2. So the probability distribution of the length of consecutive matches is just geometric, with the added caveat that you can't have more than m-r consecutive matches (since the suffix is of length m).

3. Oct 30, 2008

### dabd

So the expected length of matches is 2-(m-r)?

4. Oct 30, 2008

### Office_Shredder

Staff Emeritus
Obviously not, since you expect the expected length to be at least 1/2. The probability that the length is k is going to be geometric in k (since you need to have k matches then a failure To calculate the expected value, just use the formula

$$E(X) = \sum_{n=1}^{m-r}P(X=k)*k$$

Which in this case is

$$\sum_{n=1}^{m-r}\frac{k}{2^k}$$

I'm not sure what a closed form for that is off the top of my head

5. Nov 2, 2008

### dabd

According to the information here I don't understand how you arrived at that formula for the expected value.
http://en.wikipedia.org/wiki/Geometric_distribution

Shouldn't it be
$$\sum_{k=0}^{m-r}{(1-p)^kp \cdot k}$$

Ok. For the case of binary strings your formula is correct because p = 1/2 but if you consider strings over a finite alfabet
$$|\Sigma|>2$$
then
$$p=\frac{1}{|\Sigma|}$$
and this formula is the one to use.

Last edited: Nov 2, 2008