# Probability of 0 bit in ASCII text files

1. Oct 6, 2012

### Cylab

text files consists of 0&1 bits. But in ASCII, Most Significant Bit(MSB) is 0, so we have more 0s than 1s. Assume that we draw bits at random. Question: What is probability of drawing 0 in 2nd time, given 1st drawing was 0?

My Analysis:
From bayes formula, Pr(0) = 1/8*1 + 7/8*1/2 = 9/16, so Pr(1)=7/16 for first drawing of a bit.
Then, it should be like: Pr(0&0) = Pr(0) * Pr(0|0), but I cant figure out the Pr(0|0)...

2. Oct 6, 2012

### Number Nine

Are the draws independent? Are we drawing with replacement?

3. Oct 6, 2012

### haruspex

Even if drawing with replacement, the result of the first draw adds information regarding the file. While it may be ok to assume that non-leading bits are in general equally likely 0 or 1, it will not be true of a given file. Consider a file consisting of a single character. Initially, we know the leading bit is zero and the others equally likely 0 or 1. If first draw returns a 0, what is the probability that it was the leading zero? If it was not, how does that affect the probability of drawing a zero again if first was replaced? What if not replaced?

4. Oct 6, 2012

### Cylab

(sorry) it is without replacement. (this is part of research) calculation should be based on the description, that is, randomly distributed bits are given, with equal probability of 0 and 1, plus Pr(MSB=0)=1/8(each byte has a MSB=0). So the probability of 0 as whole is 9/16. Question is what is probability of drawing 0 in 2nd time, given 1st time was 0?

5. Oct 7, 2012

### haruspex

OK. If the file is large then clearly the replacement or otherwise has negligible effect, so the result of the first draw tells you nothing about the second.
For a more modest file, N bytes say, can you calculate the probability that the first bit drawn was a MSB?

6. Oct 7, 2012

### Cylab

To consult (please correct it if it is wrong, I am doubtful for the second analysis):
The probability of being MSB for first bit drawn;
1) Pr(MSB) = 1/8 (since only 1 MSB in each byte, and characters are stored as byte).
The probability of being 0 for first bit drawn;
2) Pr(0) = 1/8*1 + 7/8*1/2 (due to Bayes formula, if drawn bit is MSB, it is 0 with 100%, if it is NOT MSB, then it is 0 with 50%).

7. Oct 7, 2012

### haruspex

No, you're not understanding my question. Given that the first bit drawn was a 0, what is the prob that it was an MSB?

8. Oct 7, 2012

### Cylab

So there are more questions come, I wish I will have answers at least one of them..Please share your thoughts.

9. Oct 7, 2012

### haruspex

At first, there are 8 equally likely possibilities, drawing bit 0 (MSB)through to bit 7.
That splits into 15: a double shot at a 0 from bit 0, then 7 equal for a 0 bit from bits 1 to 7, and 7 equal for a 1 bit from bits 1 to 7. Since it turned out to be a 0, you rule out those last 7. That means the prob that it was an MSB is 2/9.
If there are N bytes altogether, there are N MSBs. 2/9 of the time you removed one of these in the first selection, leaving only N-1 MSBs. The other 7/9 times you removed a non-msb.
Can you finish it from there?

10. Oct 7, 2012

### Cylab

Sorry, I am afraid I cant!! Will you please shed more light.
Reading the description several times, still i find myself unable to follow the logic.
(btw, just in case, with initial requirement, when we draw bits, we don`t know where each byte starts, since these bits(Pr(0)=Pr(1)=1/2, plus Pr(MSB=0)=1/8) are randomly distributed.

11. Oct 7, 2012

### Cylab

By the way, any idea about the Pr(0)=? of these ASCII bits??
Sorry taking so much of your time.

12. Oct 7, 2012

### haruspex

With prob 2/9, after getting a 0 bit there are N-1 MSBs and 7N non-MSBs remaining. So prob of drawing a 0 next would be ((N-1)+7N/2)/(8N-1). With prob 7/9 there are N MSBs and 7N-1 non-MSBs remaining, in which case prob of drawing a 0 next would be (N+(7N-1)/2)/(8N-1). Total prob = (2*(N-1)+7N + 7N+7(7N-1)/2)/(9(8N-1)). I leave it to you to simplify that.

13. Oct 8, 2012

### Cylab

Will you please clarify how prob 2/9 is calculated?
From my understanding is that if there are 15 bits(sequence of bit strings), then
there are 1 or 2 MSB in the strings. So little confused about 2/9.

14. Oct 8, 2012

### haruspex

In the first draw, you had 2/16 chance of a 0 from an MSB, 7/16 chance of a 0 from a non-MSB, and 7/16 chance of a 1 from a non-MSB. Since you got a 0, you can eliminate the third of these. That leaves a probability 'weight' of 2 for an MSB compared with 7 for a non-MSB, so the odds that it was in fact an MSB are 2/9.
Note, I'm assuming a whole number of bytes in the file.

15. Oct 8, 2012

### Cylab

What is the "the third of these"?
What is the "'weight' of 2"?
what is "odds"?

16. Oct 8, 2012

### haruspex

The "1 from a non-MSB" case.
By 'weight' I just mean a share of the remaining probability space.
Initially there are 8 equally likely possibilities corresponding to the 8 positions within a byte.
The MSB always gives a 0, while the other 7 can each give a 0 or a 1. So we can divide the total probability 15 ways, but they are not equal now. A 0 from MSB gets 2 units while each of the other 14 get only 1 unit each. This gave you your 9/16 for a 0 in the first draw.
Once you know you got a 0, that eliminates 7 of the total weight of 16, leaving only a total weight of 9. An MSB in the first draw made 2 of those 9, so the probability that it was an MSB is 2/9.
[/QUOTE]
what is "odds"?[/QUOTE]
That's just another word for probability.

17. Oct 8, 2012

### rbj

if this file is text, like English-language text, then some letters are more common than other letters. there are statistics that have results about which letters and characters are more common. i dunno where to find such statistics, but i remember that the letter "e" is the most common. this unequal probability is what is behind the choice of symbols in Morse code. the letter "e" is a single "dit" or "dot" because it is the most common. some similar reason exists with assigning Huffman codes to particular symbols.

18. Oct 8, 2012

### haruspex

OP has stated that all bits except MSB are to be assumed equally likely, and independently, 0 or 1.

19. Oct 9, 2012

### Cylab

Still confused, say, why getting a 0 eliminates 7 of the total weight of 16 etc?.(saying it in simple: we have Pr(MSB=0)=1/8, plus Pr(0)=Pr(1)=1/2, then draw a bit, (as whole) Pr(0)=?, Pr(MSB=0)=?)
Anyway, it seems to follow the rule, where i equals number of 0 drawn sequenced/continual, say i=0,1,2.
1/8 ((i+1)/2^i +(7-i)/2^(i+1))=(i+9)/2^(i+4) .
Agree?

20. Oct 9, 2012

### Cylab

what do you mean by unit now? is it soooooooo hard to explain in plain words?