Probability Q: Average Consecutive Matches in Bit Strings

  • Context: Graduate 
  • Thread starter Thread starter dabd
  • Start date Start date
  • Tags Tags
    Probability
Click For Summary

Discussion Overview

The discussion revolves around a probability question concerning the expected number of consecutive matches when aligning a proper suffix of a random bit string with the start of the string. The participants explore the implications of independence in events and the geometric distribution of match lengths, considering both binary strings and strings over larger alphabets.

Discussion Character

  • Exploratory
  • Mathematical reasoning
  • Debate/contested

Main Points Raised

  • One participant poses a question about the expected number of consecutive matches when aligning a suffix of a bit string with its start, suggesting that the probability of a match at each position is 1/2.
  • Another participant agrees that if the events are independent, the probability of matches follows a geometric distribution, but notes that the maximum number of matches is limited by the length of the suffix.
  • A later reply questions the expected length of matches, asserting that it should be at least 1/2 and provides a formula for calculating the expected value based on the geometric distribution.
  • Further discussion includes a challenge to the formula presented for expected value, suggesting that it should account for the probability of matches over a finite alphabet, where the probability of a match would be 1 divided by the size of the alphabet.

Areas of Agreement / Disagreement

Participants express differing views on the expected length of matches and the appropriate formulas to use, indicating that no consensus has been reached regarding the correct approach or formula for the expected value.

Contextual Notes

Participants mention limitations regarding the assumptions of independence and the implications of using different alphabets, which may affect the probability calculations.

dabd
Messages
25
Reaction score
0
Hi,

I have the following probability question:

Given a random bit string S of length m, if I take a copy of a proper suffix of S (a suffix of S that is neither equal to S, nor empty), and align it with the start of the string, how many consecutive matches starting from the beginning of the string do I expect to see on average?

For example:
Given
S[1..9] = 101010110
If I take the suffix starting from position 3: S[3..9] = 1010110
and align it with S:
S[1..9] = 101010110
S[3..9] = 1010110

How many consecutive matches on average starting from position 1 do I expect to see?

My guess is that since for each position the probability that a symbol matches another is 1/2, half of the sequence would be matches, but I am not sure of this since this line of thought doesn't take in account the consecutive ordering of the sequence of matches.

Can someone please enlighten me?

Thanks.
 
Physics news on Phys.org
Basically, you're comparing the element in position k with the element in position k+r for some r>0. If the events are independent, whether they are equal has probability 1/2. So the probability distribution of the length of consecutive matches is just geometric, with the added caveat that you can't have more than m-r consecutive matches (since the suffix is of length m).
 
Office_Shredder said:
Basically, you're comparing the element in position k with the element in position k+r for some r>0. If the events are independent, whether they are equal has probability 1/2. So the probability distribution of the length of consecutive matches is just geometric, with the added caveat that you can't have more than m-r consecutive matches (since the suffix is of length m).

So the expected length of matches is 2-(m-r)?
 
Obviously not, since you expect the expected length to be at least 1/2. The probability that the length is k is going to be geometric in k (since you need to have k matches then a failure To calculate the expected value, just use the formula

[tex]E(X) = \sum_{n=1}^{m-r}P(X=k)*k[/tex]

Which in this case is

[tex]\sum_{n=1}^{m-r}\frac{k}{2^k}[/tex]

I'm not sure what a closed form for that is off the top of my head
 
Office_Shredder said:
Obviously not, since you expect the expected length to be at least 1/2. The probability that the length is k is going to be geometric in k (since you need to have k matches then a failure To calculate the expected value, just use the formula

[tex]E(X) = \sum_{n=1}^{m-r}P(X=k)*k[/tex]

Which in this case is

[tex]\sum_{n=1}^{m-r}\frac{k}{2^k}[/tex]

I'm not sure what a closed form for that is off the top of my head

According to the information here I don't understand how you arrived at that formula for the expected value.
http://en.wikipedia.org/wiki/Geometric_distribution

Shouldn't it be
[tex]\sum_{k=0}^{m-r}{(1-p)^kp \cdot k}[/tex]

Ok. For the case of binary strings your formula is correct because p = 1/2 but if you consider strings over a finite alfabet
[tex]|\Sigma|>2[/tex]
then
[tex]p=\frac{1}{|\Sigma|}[/tex]
and this formula is the one to use.
 
Last edited:

Similar threads

  • · Replies 7 ·
Replies
7
Views
3K
  • · Replies 3 ·
Replies
3
Views
890
  • · Replies 3 ·
Replies
3
Views
2K
  • · Replies 1 ·
Replies
1
Views
2K
Replies
1
Views
3K
  • · Replies 36 ·
2
Replies
36
Views
5K
  • · Replies 76 ·
3
Replies
76
Views
7K
  • · Replies 3 ·
Replies
3
Views
4K
  • · Replies 8 ·
Replies
8
Views
3K
  • · Replies 3 ·
Replies
3
Views
2K