Probability of Duplicates in Randomly Generated Sequences

In summary: The probability of a match after N digest is P(N)=P(N-1)+(1-P(N-1))*{1 \over B-N+2}In summary, the probability of getting a duplicate in a randomly generated sequence of four digits is 0.091.
  • #1
SW VandeCarr
2,199
81
What is the probability of getting a duplicate of a digit in a randomly generated sequence of four digits?

I would approach this intuitively as follows.

(1.0)(0.1)(0.9)(0.9)=0.081
" (0.1)(0.1)(0.9)=0.009
" (0.1)(0.1)(0.1)=0.001
Sum =0.091

There must be a more efficient and elegant way of doing this but I've not been able to find one. Also, is there a more general formula for a set of n distinct objects each with a probability 1/n of being selected?
 
Last edited:
Physics news on Phys.org
  • #2
Hi SW VandeCarr! :smile:
SW VandeCarr said:
What is the probability of getting a duplicate of a digit in a randomly generated sequence of four digits?

hmm :rolleyes: … what is the probability of not getting a duplicate? :wink:
 
  • #3
tiny-tim said:
Hi SW VandeCarr! :smile:


hmm :rolleyes: … what is the probability of not getting a duplicate? :wink:

That's included in my calculation. I believe the calculation is correct. The probability of not getting a duplicate for randomly generated n distinct objects is (n-1/n)^n, but duplicates are included in triples, quadruples,etc.
 
Last edited:
  • #4
SW VandeCarr said:
The probability of not getting a duplicate for randomly generated n distinct objects is (1/n)^n,

sorry, completely wrong :redface:

start again :smile:
 
  • #5
tiny-tim said:
sorry, completely wrong :redface:

start again :smile:

You're too fast. You missed my edit.
 
  • #6
SW VandeCarr said:
You're too fast. You missed my edit.
SW VandeCarr said:
The probability of not getting a duplicate for randomly generated n distinct objects is (n-1/n)^n

oooh … still completely wrong …

and why is there only one variable (n)? :confused:
 
  • #7
tiny-tim said:
oooh … still completely wrong …

and why is there only one variable (n)? :confused:

OK. First, is my calculation correct? If so, is there a general formula for this?

You're correct, there is more than one variable. From a set of N distict objects, a random sequence of n objects is generated. The probability of not getting a duplicate in the sequence of n objects is ((N-1)/N)^n.

Now, let r equal all replications (duplicates, three of a kind, four of a kind, etc) each considered to contain just one duplicate, what is the most efficient way to calculate the probability of a duplicate so defined?
 
  • #8
SW VandeCarr said:
The probability of not getting a duplicate in the sequence of n objects is ((N-1)/N)^n.

No, that's the probability of not getting any 0s
 
  • #9
Well, if the number of digits is greater then the base then the probability is equal to one.
 
  • #10
tiny-tim said:
No, that's the probability of not getting any 0s

I don't understand that answer. The objects in the general case aren't necessarily numbers. Each random selection yields one object which may be selected more than once out of n selections. Perhaps the probability is ((N-1)/N)^n-1. Do you know?

My reasoning regarding the probability of getting only a duplicate in four randomly generated digits is:

The first selection can be any digit and therefore has a probability of one.

The second selection has a probability of 0.1 of being the same digit.

The third and fourth selection have a probability 0.9 of not being the same digit.

Since multiplication is commutative, the actual order doesn't matter.

If we consider the probability of getting at least a duplicate, then we need to add the probabilities of getting three or four of a kind.

I don't think the answer is as simple as 1-(0.9)^3.
 
  • #11
SW VandeCarr said:
The first selection can be any digit and therefore has a probability of one.
The second selection has a probability of 0.1 of being the same digit.
So we're dealing with base 10.

The third and fourth selection have a probability 0.9 of not being the same digit.

Since multiplication is commutative, the actual order doesn't matter.
I Think I get the idea.
If we consider the probability of getting at least a duplicate, then we need to add the probabilities of getting three or four of a kind.

I don't think the answer is as simple as 1-(0.9)^3.

Okay. I agree after two digest:

Probability of a repeat 1/B where B is the base

There are now two digest taken. So when we add another diget:

1/(B)+(1/(B-1))*(1-1/B)

As there is 1/B chance of the third digest matching the fist digit, but if the third digit doesn't match the first that leaves (B-1) numbers in the diget left. So the probability of it matching the second digit is 1/(B-1) multiplied by the probability that it didn't match the first digit.

Let P(N) be the probability of a match after N digest then

[tex]P(N)=P(N-1)+(1-P(N-1))*{1 \over B-N+2}[/tex]

and

[tex]P(2)=1/B[/tex]
 
Last edited:
  • #12
Actually the argument is counter-intuitive if one is thinking in terms of cumulative probability. For cumulative probability 1-(0.9)^3 is correct. That is, once the marker digit is selected, the cumulative probability that a duplicate will be selected in the next three selections is about 0.271. However, cumulative probability is not the same as sequential probability. The probability that the marker will be matched in any particular one of the next three independent selections is still 0.1 in frequentist theory.

The idea of cumulative probability fits with experience, but an it's entirely different concept than the probability of a single independent event. I don't know the answer to this problem, but it is a problem. Randomized scientific trials use threshold probabilities to determine statistical significance but rarely adjust the significance level for the number of similar trials conducted. If you believe in cumulative probability, scientific trials should adjust these levels or be liable for false positive results.
 
Last edited:
  • #13
Which part of my last post do you disagree with and how does this apply to randomized scientific trials. Is there an area of research were these statistics are particularly important? 1-(0.9)^3 is probably an okay approximation if the base is large relative to the number of digits. Since the base is 10 and the number of digest is four, it might be an okay approximation for this problem.

Your formula clearly fails when the number of digits N=10 since it gives a finite probability of no repeats which is impossible with 10 digits.
 
  • #14
SW VandeCarr said:
What is the probability of getting a duplicate of a digit in a randomly generated sequence of four digits?
SW VandeCarr said:
My reasoning regarding the probability of getting only a duplicate in four randomly generated digits is:

uhh? what is the question … getting exactly one duplicate, or getting one or more duplicates? :confused:

i assumed from your original post that it was the latter
 
  • #15
John Creighto said:
Which part of my last post do you disagree with and how does this apply to randomized scientific trials. Is there an area of research were these statistics are particularly important? 1-(0.9)^3 is probably an okay approximation if the base is large relative to the number of digits. Since the base is 10 and the number of digest is four, it might be an okay approximation for this problem.

Your formula clearly fails when the number of digits N=10 since it gives a finite probability of no repeats which is impossible with 10 digits.

The cumulative probability of getting at least one duplicate or at most one duplicate obviously must increase as the number of randomly generated digits increase. For four digits the questions boil down to how many sequences between 0000 and 9999 have at least one duplicate, and how many have at most one duplicate?

For sequential probability, the question is: given an existing digit sequence of n digits with no duplicates, what is the probability that the next digit selected will be a duplicate of one of the previously generated digits? For a generated sequence that already has just one duplicate, the probability that an expanded sequence (by adding more randomly generated digits) will have just one duplicate decreases.
 
  • #16
SW VandeCarr said:
The cumulative probability of getting at least one duplicate or at most one duplicate obviously must increase as the number of randomly generated digits increase. For four digits the questions boil down to how many sequences between 0000 and 9999 have at least one duplicate, and how many have at most one duplicate?
I'm not sure it's a good idea to ask more then one question in the same post. Anyway, you still haven't critiqued the steps in my solution above for the probability of getting at least one duplicate.

For sequential probability, the question is: given an existing digit sequence of n digits with no duplicates, what is the probability that the next digit selected will be a duplicate of one of the previously generated digits?
Again, I thought I solved this problem.

For a generated sequence that already has just one duplicate, the probability that an expanded sequence (by adding more randomly generated digits) will have just one duplicate decreases.

I'll give this problem a shot later. I think though we should do it in a separate thread, and I think the moderate should make the title of this thread more clear.
 
  • #17
John Creighto said:
I'm not sure it's a good idea to ask more then one question in the same post. Anyway, you still haven't critiqued the steps in my solution above for the probability of getting at least one duplicate.

You're using different notation than I am. For digits B=N=10 and n is the length of the randomly generated (rg) string. Your formula seems correct for n=2. For n=10 with no duplicates, its clear that the next rg number is a duplicate with p=1.

Again, I thought I solved this problem.

If your N=n and your B=10 then 1/B-N+2 = 0.5 for N=n=10. I haven't done the full expansion, but the 11th rd digit must be a duplicate (assuming no duplicates in the string n=10) as you have said.

I'll give this problem a shot later. I think though we should do it in a separate thread, and I think the moderate should make the title of this thread more clear.

For cumulative probability there are two related questions. I'm including cumulative probability because tiny tim raised the issue. One is the number of 4 digit sequences that contain at least one identical pair (1213, 3333, 4433, 0051, 0110, ect.) The other is the number of 4 digit sequences that contain at most one identical pair excluding triplets and quadruplets.

The title is "probabilities of duplicates" and I'm not giving a tutorial. I'm asking questions because I don't know the answers.
 
Last edited:
  • #18
tiny-tim said:
Hi SW VandeCarr! :smile:


hmm :rolleyes: … what is the probability of not getting a duplicate? :wink:

Both John Creighto and I have shown you're wrong if the question is at least one duplicate in n trials with a finite base B as for digits 0-9. I suggest the correct formula is quite simple based on the uniform distribution:

P(k(i)=(k(j)) = (n-1)/B where n is the number of trials and k is the kth trial within n and (n-1) less than or equal to B.
(
For for the first trial P=0 and for the eleventh trial P=1 as it should..

For at most one duplicate in n trials, I suggest my first post is correct.

Any comments tiny tim or others?
 
Last edited:

1. What does the term "probabilities of duplicates" mean?

The term "probabilities of duplicates" refers to the likelihood that a set of data or objects contains duplicate values. It is a measure of how likely it is for two or more items to have the same value or characteristics within a given set.

2. How are probabilities of duplicates calculated?

Probabilities of duplicates are typically calculated using mathematical formulas, such as the binomial distribution formula or the hypergeometric distribution formula. These formulas take into account the total number of items in a set and the number of unique values to determine the likelihood of finding duplicates.

3. What factors can affect probabilities of duplicates?

There are several factors that can affect the probabilities of duplicates in a set, including the size of the set, the number of unique values, and the distribution of values. For example, a smaller set with a higher number of unique values will have a lower probability of duplicates compared to a larger set with a lower number of unique values.

4. How can probabilities of duplicates be reduced?

There are a few strategies that can be used to reduce the probabilities of duplicates in a set. One approach is to increase the number of unique values in the set, which decreases the likelihood of duplicates. Another strategy is to use techniques such as data cleaning and normalization to ensure that the data is accurate and consistent, reducing the chances of duplicate values.

5. Why are probabilities of duplicates important?

Understanding probabilities of duplicates is important in many fields, such as statistics, data analysis, and computer science. It can help researchers and analysts make more accurate predictions and conclusions based on data sets, and can also inform decisions related to data management and quality control.

Similar threads

  • Set Theory, Logic, Probability, Statistics
Replies
7
Views
188
  • Set Theory, Logic, Probability, Statistics
Replies
24
Views
2K
  • Programming and Computer Science
Replies
1
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
2
Views
1K
  • Set Theory, Logic, Probability, Statistics
Replies
12
Views
6K
  • Set Theory, Logic, Probability, Statistics
Replies
5
Views
2K
  • Biology and Medical
2
Replies
63
Views
8K
  • Precalculus Mathematics Homework Help
Replies
7
Views
1K
  • Precalculus Mathematics Homework Help
Replies
6
Views
2K
  • Set Theory, Logic, Probability, Statistics
Replies
20
Views
5K
Back
Top