Probability of 0 bit in ASCII text files

Cylab · Nov 25, 2012

haruspex said:

How do you get that? Should be 3/8*1/2*1/2 + 5/8*1/2*1/2*1/2 = 11/64

No, 13/256

You are right. Thanks.
Please let me confirm the "1/2" factors.
1st.
Analysis: The left term (before the +) contains N-1 factors of 1/2, i.e. 2^-(N-1), and the right term contains N factors, i.e. 2^-N.
Right? But where does the number of 1/2 factors come from? Is it because N is the number of successive zero bits taken from ASCII?
2nd.
If N>8, then we should assume it is impossible. In other words, the answer should be 0. Am I correct?

haruspex · Nov 25, 2012

Cylab said:

Where does the number of 1/2 factors come from? Is it because N is the number of successive zero bits taken from ASCII?

Yes. If N ≤ 8, one of the N might be a leading bit, but only one.
With probability N/8, there is a leading bit. That bit will be 0. The other N-1 may be 0 or 1, equally likely. So prob that all N are 0 is 2^-(N-1).
With prob 1-N/8, there is no leading bit. All N may be 0 or 1, equally likely. So prob that all N are 0 is 2^-N.
Adding up: (N/8)*2^-(N-1) + (1-N/8)*2^-N = 2^-N(1+N/8).
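
The closed form above can be checked by exact enumeration. The sketch below assumes the model used in this thread: every byte has its leading bit (MSB) fixed at 0, the other seven bits are independent and equally likely to be 0 or 1, and the N-bit window starts at a uniformly random position within a byte. The function names are illustrative only.

```python
from fractions import Fraction
from itertools import product

def prob_all_zero_formula(n):
    """Closed form 2^-N (1 + N/8) from the post above, for 1 <= n <= 8."""
    return Fraction(1, 2 ** n) * (1 + Fraction(n, 8))

def prob_all_zero_enumerated(n):
    """Exact probability, enumerating start offsets and free-bit patterns."""
    total = Fraction(0)
    for offset in range(8):  # offset 0 means the window starts on an MSB
        # Window positions that are ordinary data bits (not an MSB).
        free = [i for i in range(n) if (offset + i) % 8 != 0]
        # MSBs are always 0, so the window is all zero iff every free bit is 0.
        zero_patterns = sum(
            1 for bits in product((0, 1), repeat=len(free)) if not any(bits)
        )
        total += Fraction(1, 8) * Fraction(zero_patterns, 2 ** len(free))
    return total

for n in range(1, 9):
    assert prob_all_zero_formula(n) == prob_all_zero_enumerated(n)

print(prob_all_zero_formula(3))  # 11/64
```

For N=3 this reproduces the 11/64 quoted at the top of the thread.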

If N>8, then we should assume it is impossible. In other words, the answer should be 0.

Not at all. You are now guaranteed one leading 0, and the question becomes whether there's 1 or 2. In general, if we write N = 8A+B, B < 8, what do you think the formula would be?

Cylab · Nov 25, 2012

haruspex said:

Adding up: (N/8)*2^-(N-1) + (1-N/8)*2^-N = 2^-N(1+N/8).

Thanks. The explanations are really helpful.
So in other words,
"We can assume that there are seven bits (or N<8) and that there is at most one MSB among them (that is, either one MSB or none). Thus we can compute the probability for the case of zero MSBs, plus the probability for the case of one MSB. Right?"

haruspex said:

Not at all. You are now guaranteed one leading 0, and the question becomes whether there's 1 or 2. In general, if we write N = 8A+B, B < 8, what do you think the formula would be?

Analysis:
1st. By the same token, if we let N be 15 bits, then it must contain at least one MSB and at most two MSBs. We add the probability of having one MSB and that of having two MSBs.
2nd. Further, when N is 37 bits, it must contain at least four and at most five MSBs. And so on ...
So what does it turn out to be? Just replacing N with 8A+B does not seem sufficient for the computation, unless I understand what 'A' and 'B' stand for.
Could you explain a bit further, please?

haruspex · Nov 25, 2012

Cylab said:

1st. By the same token, if we let N be 15 bits, then it must contain at least one MSB and at most two MSBs. We add the probability of having one MSB and that of having two MSBs.

How about you complete that and post the formula you get?

2nd. Further, when N is 37 bits, it must contain at least four and at most five MSBs. And so on ...
So what does it turn out to be? Just replacing N with 8A+B does not seem sufficient for the computation, unless I understand what 'A' and 'B' stand for.

A and B are integers, 0 ≤ B < 8, N=8A+B. That completely defines A and B. A = integer part of N/8; B = N modulo 8.

Cylab · Nov 25, 2012

haruspex said:

How about you complete that and post the formula you get?

A and B are integers, 0 ≤ B < 8, N=8A+B. That completely defines A and B. A = integer part of N/8; B = N modulo 8.

OK! But the analysis concentrates on the MSB only (say, N=7), so the formula may be like:
P(MSB)=P(MSB |H0)P(H0) + P(MSB |H1)P(H1)
(Let Hi be the event that there are i MSBs among the N bits, for i = 0, 1, 2, 3, ...)
where P(MSB |H0) is the conditional probability of an MSB among the N bits given H0, which equals 0 (no MSB), and P(MSB |H1) is the conditional probability of an MSB among the N bits given H1, which equals 1/7 (one MSB).

But it is a little different from the N=8A+B point of view, right?
I could not figure out the computation.

haruspex · Nov 25, 2012

Cylab said:

the analysis concentrates on the MSB only (say, N=7), so the formula may be like:
P(MSB)=P(MSB |H0)P(H0) + P(MSB |H1)P(H1)
(Let Hi be the event that there are i MSBs among the N bits, for i = 0, 1, 2, 3, ...)
where P(MSB |H0) is the conditional probability of an MSB among the N bits given H0, which equals 0 (no MSB), and P(MSB |H1) is the conditional probability of an MSB among the N bits given H1, which equals 1/7 (one MSB).

That's all true, but I'm not sure where it gets you. What are you trying to solve here?
Are we done with the N≤8 case?

Cylab · Nov 26, 2012

haruspex said:

That's all true, but I'm not sure where it gets you. What are you trying to solve here?
Are we done with the N≤8 case?

How about investigating the N > 8 case?
I just can't quite figure out how N=8A+B works.

haruspex · Nov 26, 2012

Cylab said:

How about investigating the N > 8 case?
I just can't quite figure out how N=8A+B works.

N=8A+B, B<8, means there are A whole bytes and B odd bits, so either A or A+1 MSBs included.
See what gaps you can fill in here:
Prob that this includes A+1 MSB's = ...?; if A+1 MSBs, prob that all N bits are 0 is ...?
Prob that this includes A MSB's = ...?; if A MSBs, prob that all N bits are 0 is ...?
Adding this up, prob that all N bits are 0 is ...?

Cylab · Nov 27, 2012

haruspex said:

N=8A+B, B<8, means there are A whole bytes and B odd bits, so either A or A+1 MSBs included.
See what gaps you can fill in here:
Prob that this includes A+1 MSB's = ...?; if A+1 MSBs, prob that all N bits are 0 is ...?
Prob that this includes A MSB's = ...?; if A MSBs, prob that all N bits are 0 is ...?
Adding this up, prob that all N bits are 0 is ...?

haruspex said:

if A+1 MSBs, prob that all N bits are 0 is ...?

That is not easy for me, sir.
Analysis: Since the N bits are drawn consecutively from ASCII, there is only 1 character (out of 2^256) whose bits are all 0. So only 1 MSB. Thus, the Prob =1/2^256.
The others seem to follow the same idea, or have I misunderstood your point?

Question:
1) Pr[0] in ASCII (assuming each character appears equally often) = 1/8+1/2=5/8. Is that OK?
2) If 7 successive bits are drawn at random from ASCII bits (i.e. no bias in the character distribution), what is Pr[0] in the 7 bits?
If 4 successive bits are drawn (same conditions as above), what is Pr[0] in the 4 bits?
So, say, N successive bits are drawn (same conditions); what is Pr[0] in the N bits?
Analysis:
Do you think it is the same case? Remember you explained that N/8*2^-N + (1-N/8)*2^-(N+1). Does the formula apply to case 2)?

Please shed some light on this.

haruspex · Nov 27, 2012

Cylab said:

Analysis: Since the N bits are drawn consecutively from ASCII, there is only 1 character (out of 2^256) whose bits are all 0. So only 1 MSB. Thus, the Prob =1/2^256.

The N bits might start in the middle of one byte, span several whole bytes, and finish part way through the last. For each whole byte, prob of all zeroes is 1/128 (since MSB always zero).
If the N bits include A MSBs then how many non-MSBs do they include?
What is the prob that the A MSBs are all 0?
What is the prob that the non-MSBs are all 0?
So what is the prob that all N bits are 0?
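
One possible way to fill in the blanks for general N, offered as a sketch rather than as the answer given in the thread: write N = 8A+B with 0 ≤ B < 8. The A whole bytes always contribute A MSBs, and the B leftover bits include one more MSB with probability B/8. Under the same model as before (MSB always 0, other bits fair and independent, uniformly random start position) this appears to give 2^-(N-A) * (1 + B/8). The function names below are illustrative.

```python
from fractions import Fraction

def prob_all_zero_general(n):
    """Candidate formula for n >= 1 using n = 8a + b, 0 <= b < 8."""
    a, b = divmod(n, 8)
    p_extra_msb = Fraction(b, 8)  # chance the window holds a + 1 MSBs
    # With a + 1 MSBs there are n - a - 1 free bits; with a MSBs, n - a.
    with_extra = Fraction(1, 2 ** (n - a - 1)) if b > 0 else Fraction(0)
    without_extra = Fraction(1, 2 ** (n - a))
    return p_extra_msb * with_extra + (1 - p_extra_msb) * without_extra

def prob_all_zero_by_offsets(n):
    """Independent check: average over the 8 possible start offsets."""
    total = Fraction(0)
    for offset in range(8):
        free = sum(1 for i in range(n) if (offset + i) % 8 != 0)
        total += Fraction(1, 8) * Fraction(1, 2 ** free)
    return total

for n in range(1, 41):
    a, b = divmod(n, 8)
    closed_form = Fraction(1, 2 ** (n - a)) * (1 + Fraction(b, 8))
    assert prob_all_zero_general(n) == prob_all_zero_by_offsets(n) == closed_form

print(prob_all_zero_general(15))  # 15/131072
```

For N ≤ 7 (A=0, B=N) this reduces to the earlier 2^-N(1+N/8), and for N=8 both forms give 1/128.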

Question:
1) Pr[0] in ASCII (assuming each character appears equally often) = 1/8+1/2=5/8. Is that OK?

No. 1/8 that bit is MSB, so 7/8 that it is non-MSB. P[0] = 1/8 + 7/8*1/2 = 9/16. We went through that much earlier in the thread.
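
The 9/16 figure can be confirmed by counting bits directly. This small sketch assumes all 128 seven-bit ASCII codes are equally likely and that each is stored in an 8-bit byte with a leading 0.

```python
from fractions import Fraction

zero_bits = 0
total_bits = 0
for code in range(128):         # every 7-bit ASCII code, equally likely (assumption)
    byte = format(code, "08b")  # 8-character bit string; the leading bit is always "0"
    zero_bits += byte.count("0")
    total_bits += len(byte)

p_zero = Fraction(zero_bits, total_bits)
print(p_zero)  # 9/16
```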

2) If 7 successive bits are drawn at random from ASCII bits (i.e. no bias in the character distribution), what is Pr[0] in the 7 bits?
If 4 successive bits are drawn (same conditions as above), what is Pr[0] in the 4 bits?
So, say, N successive bits are drawn (same conditions); what is Pr[0] in the N bits?
Analysis:
Do you think it is the same case? Remember you explained that N/8*2^-N + (1-N/8)*2^-(N+1). Does the formula apply to case 2)?

I believe you are misquoting the formula. In post #27 I explained that the prob of N consecutive 0 bits (N≤8) is 2^-N(1+N/8).

Cylab · Dec 1, 2012

haruspex said:

I believe you are misquoting the formula. In post #27 I explained that the prob of N consecutive 0 bits (N≤8) is 2^-N(1+N/8).

That is right, and it was a good explanation.
Now, say, two groups of N consecutive bits are taken (_N1=7 and _N2=4).
Are the probabilities different in the following cases when comparing the same number of 0s?
1st case: (in _N1=7) Pr[0], Pr[00],..,Pr[0000] =?
2nd case: (in _N2=4) Pr[0], Pr[00],..,Pr[0000] =?

haruspex · Dec 1, 2012

Cylab said:

1st case: (in _N1=7) Pr[0], Pr[00],..,Pr[0000] =?
2nd case: (in _N2=4) Pr[0], Pr[00],..,Pr[0000] =?

I don't understand your question. What does Pr[0] mean in the context of N₁=7? Is it the probability that the next bit is zero given the preceding 7 were?

Cylab · Dec 1, 2012

haruspex said:

I don't understand your question. What does Pr[0] mean in the context of N₁=7? Is it the probability that the next bit is zero given the preceding 7 were?

Sorry!
I meant the prob of 0 within _N1 = 7 bits drawn consecutively from ASCII.
In other words, say, we now have a group of bits consisting of many _N1 blocks, each of which is 7 consecutive bits drawn from ASCII. What are Prob[0] and Prob[00] in the group, respectively?

2nd case: another group of bits under the same conditions, where _N2=4. What are Prob[0] and Prob[00] within the second group, respectively?

Is (_N1)Prob[0] = (_N2)Prob[0] right, or should they be unequal?

haruspex · Dec 2, 2012

Cylab said:

Is (_N1)Prob[0] = (_N2)Prob[0] right,

Of course. The bits don't know how many others were chosen.

Cylab · Dec 2, 2012

haruspex said:

Of course. The bits don't know how many others were chosen.

So you are saying the following are correct?

(_N1=7)Prob[0] = (_N2=4)Prob[0]
(_N1=7)Prob[00] = (_N2=4)Prob[00]
(_N1=7)Prob[000] = (_N2=4)Prob[000]
...

haruspex · Dec 2, 2012

To be completely clear:
If you choose N consecutive bits, the probability that the first R of those bits are all zero (R <= N), depends only on R. It cannot depend on N.
Further, if you choose N consecutive bits, then choose R consecutive bits from those N, the probability that the first R of those bits are all zero depends only on R.
(This seems so obvious that I worry that I have not understood the question.)
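
A quick exact check of this claim, as a sketch under the thread's model (MSB always 0, other bits fair and independent, window start uniformly random within a byte). The helper name is illustrative. It enumerates every N-bit window and confirms that the probability of the first R bits being zero is the same for N=7 and N=4.

```python
from fractions import Fraction
from itertools import product

def prob_first_r_zero(n, r):
    """P(first r bits of a random n-bit window are all 0), by enumeration."""
    total = Fraction(0)
    for offset in range(8):  # offset 0 means the window starts on an MSB
        free = [i for i in range(n) if (offset + i) % 8 != 0]
        hits = 0
        for values in product((0, 1), repeat=len(free)):
            window = [0] * n  # MSB positions stay 0
            for pos, bit in zip(free, values):
                window[pos] = bit
            if not any(window[:r]):
                hits += 1
        total += Fraction(1, 8) * Fraction(hits, 2 ** len(free))
    return total

for r in (1, 2, 3, 4):
    assert prob_first_r_zero(7, r) == prob_first_r_zero(4, r)

print(prob_first_r_zero(7, 2), prob_first_r_zero(4, 2))  # 5/16 5/16
```

The value 5/16 for R=2 agrees with 2^-R(1+R/8) from earlier in the thread.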

Cylab · Dec 2, 2012

haruspex said:

To be completely clear:
If you choose N consecutive bits, the probability that the first R of those bits are all zero (R <= N), depends only on R. It cannot depend on N.
Further, if you choose N consecutive bits, then choose R consecutive bits from those N, the probability that the first R of those bits are all zero depends only on R.
(This seems so obvious that I worry that I have not understood the question.)

Case: N₁=7 & N₂=4. Assume N is taken from X bits of ASCII.
Take R=2 bits from N₁ & N₂ respectively; what is the prob that they are two 0 bits?

1st. (N₁ case) : {_(9X/16)C₂ * _(7X/16)C₅ } / _xC₇.
2nd. (N₂ case) : {_(9X/16)C₂ * _(7X/16)C₂ } / _xC₄.

It seems to depend on N too.

haruspex · Dec 2, 2012

Cylab said:

1st. (N₁ case) : {_(9X/16)C₂ * _(7X/16)C₅ } / _xC₇.

I have no idea what that notation means.

Cylab · Dec 2, 2012

haruspex said:

I have no idea what that notation means.

Hypergeometric Distribution.

X: number of ASCII bits , from which N is taken.
Pr[0] = 9/16.

_xC₇ : The number of combinations of x , taken 7 at a time.

haruspex · Dec 2, 2012

Cylab said:

Hypergeometric Distribution.

X: number of ASCII bits , from which N is taken.
Pr[0] = 9/16.

_xC₇ : The number of combinations of x , taken 7 at a time.

So how do I read _(9X/16)C₂? If I plug in X=7, that gives _(63/16)C₂, which is meaningless.

Cylab · Dec 2, 2012

haruspex said:

So how do I read _(9X/16)C₂? If I plug in X=7, that gives _(63/16)C₂, which is meaningless.

C: combinations
Pr[0] = 9/16.
X: number of ASCII bits , from which N is taken.
case: N1=7 & N2=4 . Assume N is taken from X bits, which is ASCII.
Other definitions should be clear

haruspex · Dec 2, 2012

Cylab said:

C: combinations
Pr[0] = 9/16.
X: number of ASCII bits , from which N is taken.
case: N1=7 & N2=4 . Assume N is taken from X bits, which is ASCII.
Other definitions should be clear

You wrote _(9X/16)C₂, and you have still offered no reasonable explanation for that notation. Did you mean (9/16)_XC₂?

Cylab · Dec 3, 2012

haruspex said:

You wrote _(9X/16)C₂, and you have still offered no reasonable explanation for that notation. Did you mean (9/16)_XC₂?

X: The number of bits in ASCII.
9X/16: The number of 0 bits in the X that are classified as successes.
7 or 4: The number(s) of bits taken consecutively from X.
2: The number of zeros in the 7 or 4 that are classified as successes.
_(9X/16)C₂ : The number of combinations of 9X/16, taken two 0 bits at a time.

haruspex · Dec 3, 2012

Cylab said:

X: The number of bits in ASCII.
9X/16: The number of 0 bits in the X that are classified as successes.
7 or 4: The number(s) of bits taken consecutively from X.
2: The number of zeros in the 7 or 4 that are classified as successes.
_(9X/16)C₂ : The number of combinations of 9X/16, taken two 0 bits at a time.

Now that you have explained that, thank you, I can see where it is wrong.
For one thing, that analysis treats all bits as independently 0 or 1, regardless of their proximity to each other. Bits a multiple of 8 positions apart will be positively correlated, and at other distances negatively correlated.
More significantly, let's look at what these represent:
1st. (N1 case) : {(9X/16)C2 * (7X/16)C5 } / xC7.
2nd. (N2 case) : {(9X/16)C2 * (7X/16)C2 } / xC4.
The first is the probability of picking 7 bits that are exactly two 0 bits and five 1 bits; the second is the prob of picking 4 bits that are exactly two 0 bits and two 1 bits. No wonder they're different! In the problem I thought we were discussing, P[00] doesn't care what the remaining 2 or 5 bits are.
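
The correlation remark can be made concrete with a short exact calculation. This is a sketch under the thread's model (MSB always 0, other bits fair and independent, first bit at a uniformly random position within a byte); the function name is illustrative. It compares the probability that two bits a given distance apart are both 0 with the value (9/16)^2 that independence would give.

```python
from fractions import Fraction

def prob_both_zero(distance):
    """P(two bits `distance` >= 1 positions apart are both 0)."""
    total = Fraction(0)
    for offset in range(8):  # position of the first bit within its byte
        p = Fraction(1)
        for pos in (offset, offset + distance):
            if pos % 8 != 0:  # a non-MSB is 0 with probability 1/2
                p *= Fraction(1, 2)
        total += Fraction(1, 8) * p
    return total

independent = Fraction(9, 16) ** 2

print(prob_both_zero(8), prob_both_zero(1), independent)  # 11/32 5/16 81/256
```

Since 11/32 > 81/256 > 5/16, bits 8 positions apart are positively correlated and adjacent bits negatively correlated, which is why a hypergeometric model that treats the bits as interchangeable does not fit.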

Cylab · Dec 4, 2012

haruspex said:

Now that you have explained that, thank you, I can see where it is wrong.
For one thing, that analysis treats all bits as independently 0 or 1, regardless of their proximity to each other. Bits a multiple of 8 positions apart will be positively correlated, and at other distances negatively correlated.
More significantly, let's look at what these represent:
1st. (N1 case) : {(9X/16)C2 * (7X/16)C5 } / xC7.
2nd. (N2 case) : {(9X/16)C2 * (7X/16)C2 } / xC4.
The first is the probability of picking 7 bits that are exactly two 0 bits and five 1 bits; the second is the prob of picking 4 bits that are exactly two 0 bits and two 1 bits. No wonder they're different! In the problem I thought we were discussing, P[00] doesn't care what the remaining 2 or 5 bits are.

You are right!
P[00] doesn't care what the remaining 2 or 5 bits are.
Neither does the calculation in the following two cases, which give the prob of P[00] taken from N1 and N2 respectively, regardless of the contents of N1 & N2.
1st. (N1 case) : {(9X/16)C2 * (7X/16)C5 } / xC7.
2nd. (N2 case) : {(9X/16)C2 * (7X/16)C2 } / xC4.

haruspex · Dec 4, 2012

Cylab said:

Neither does the calculation in the following two cases, which give the prob of P[00] taken from N1 and N2 respectively, regardless of the contents of N1 & N2.
1st. (N1 case) : {(9X/16)C2 * (7X/16)C5 } / xC7.
2nd. (N2 case) : {(9X/16)C2 * (7X/16)C2 } / xC4.

Once again, I'm not at all sure what you are saying. Are you insisting that the above formulae are correct for P[00]? I have just explained to you why they are not.

Cylab · Dec 4, 2012

haruspex said:

Once again, I'm not at all sure what you are saying. Are you insisting that the above formulae are correct for P[00]? I have just explained to you why they are not.

Just focusing on your points.

The following link may help clarify the analysis discussed so far.
http://en.wikipedia.org/wiki/Hypergeometric_distribution
