# Homework Help: [University Introductory Statistics] DNA crime scene

1. Mar 6, 2016

### eskimotaro

Hello everyone. I have been given a problem in my Introductory Mathematical Statistics class. Been thinking about this one for a while and I am simply stuck.

1. The problem statement, all variables and given/known data

"There has been found a DNA of type S on a crime scene. We will assume a total population of N = 5000000 that are potential contributors to the lead. Next assume there is a DNA-database consisting of n = 30000 individuals. Also assume that there are M = 50 individuals in the whole population that have a DNA of type S."

There are six sub-questions (a)-(f), and I am stuck on (d)-(f). I will simply explain what questions (a)-(c) are, and then write up questions (d)-(f).

2. The attempt at a solution [part 1]

In (a) we let X = the number of individuals with type S in the database. Here I am to find the probability distribution of X. I think that the sample space must be x = {0, 1, 2, ..., 50}. To calculate the distribution of X I have used MATLAB and the hypergeometric distribution formula. That was no problem.

In (b) I am to use a binomial distribution formula instead to calculate the probability distribution of X; that was also not much of a problem.

For (c) I am just asked to calculate P(X = 1), which was just to take the relevant calculation from (a) or (b). P(X = 1) is approximately 0.22.
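As a rough cross-check of (a)-(c), here is a minimal Python sketch (the thread itself used MATLAB). The helper names `log_comb`, `hypergeom_pmf` and `binom_pmf` are made up for this illustration, and the log-gamma route is just one convenient way to avoid huge binomial coefficients.

```python
import math

def log_comb(a, b):
    """Natural log of the binomial coefficient C(a, b), via log-gamma."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)

def hypergeom_pmf(k, N, M, n):
    """P(X = k): k of the M type-S individuals are among the n database entries."""
    return math.exp(log_comb(M, k) + log_comb(N - M, n - k) - log_comb(N, n))

def binom_pmf(k, n, p):
    """P(X = k) for n independent draws, each of type S with probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

N, n, M = 5_000_000, 30_000, 50  # population, database size, type-S carriers

p_hyper = hypergeom_pmf(1, N, M, n)   # part (a), evaluated at x = 1
p_binom = binom_pmf(1, n, M / N)      # part (b), evaluated at x = 1

print(f"hypergeometric P(X=1) = {p_hyper:.4f}")
print(f"binomial       P(X=1) = {p_binom:.4f}")
```

Both values should round to the 0.22 quoted for (c); they differ slightly in the third decimal because the binomial formula is only an approximation here.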

3. Sub-questions

Here are the sub-questions (d)-(f) which I am stuck on:

"(d) Assume that every individual in the population have the same likelihood of being a contributor. Let A be the event that the contributor is one of the individuals in the database. Calculate P(A).

(e) Find P(X = 1 | A).

Hint: When we know that the contributor is in the database, then there are M - 1 = 49 left who we do not know is in the database or not. Argue that we then are interested in the probability that none of these are in the database.

(f) Find P(A | X = 1). Argue that this corresponds to the probability that the individual with matching DNA profile in the database is the culprit."

4. The attempt at a solution [part 2]

I have just not been able to get past these questions. For (d) I think that P(A) might be 1/30000, because that's simply how I interpret the question.

So I would be forever grateful if anyone could give me tips on how to solve this. Excuse my language if anything is unclear; English is my second language.

2. Mar 7, 2016

### Staff: Mentor

For (d), you can ignore the DNA completely. Your contributor is one random individual out of 5 million. What is the probability that this individual is within the database of 30000?

For (e), you have 4999999 individuals that have to be arranged somehow. The problem is similar to (c).

3. Mar 7, 2016

### eskimotaro

For (d) I guess the answer must be P(A) = 30000/5000000 = 3/500?

Not sure about (e) though. P(X = 1 | A) means the randomly selected person has the DNA, given that he is a contributor? Does it mean the answer is 29999/4999999? The hint sort of confuses me.

4. Mar 7, 2016

### Staff: Mentor

I think so.
No, (A) already includes that the contributor (who has the DNA by definition) is in the database. If that 3/500 event happens, how likely is it that no one else in the database has the same DNA signature?

5. Mar 7, 2016

### eskimotaro

Hmm. So if no one else in the database are to have the same DNA, then they can not be among the 50 that have S. But we have already included one. So there are 49 persons with S that need to be outside the database?

That's 29999 left in the database, and we need to exclude 49? But the rest of the population also need to be arranged. Meaning (29999 - 49) / 4999999?

6. Mar 7, 2016

### Staff: Mentor

Sure.
I don't understand where that term comes from.

If you have a population of 4999999 where 49 people have the specific DNA signature, what is the probability that none of them are in a (randomly chosen) subset of 29999 of the 4999999 people?

7. Mar 7, 2016

### eskimotaro

Gotta admit I'm sort of lost here. So if we have 49 persons with DNA S in a population of 4999999. What's the probability that none of them are among the 29999 left in the database? Should that be (4999999 - 29999) / 4999999?

The hint says that we are somehow interested in the probability of the 49 not being in the database, so I'm thinking that number needs to be incorporated somehow.

8. Mar 7, 2016

### Staff: Mentor

No.
You solved the same problem in (a) and (c) already, just with slightly different numbers.

9. Mar 7, 2016

### eskimotaro

Oh, do you suggest I use the hypergeometric or binomial formula again? So if I use a binomial formula with p = 49/4999999 and n = 29999 I get the following:

$29999*(\frac{49}{4999999})^{1}*\left(1-\frac{49}{4999999}\right)^{29999-1}$

10. Mar 7, 2016

### Staff: Mentor

Sure. But you need the probability that no one is in the sample, instead of 1 (what you calculated).

11. Mar 7, 2016

### eskimotaro

Then it doesn't seem like there's much of a difference between P(X = 1 | A) and P(X = 1). Would you say that's correct? Scratch that, I wrote it before your edit.

Ah, of course. So instead I should look for the probability of there being 0, but with the numbers excluding the 1 criminal?

$\binom{29999}{0}*(\frac{49}{4999999})^{0}*\left(1-\frac{49}{4999999}\right)^{29999}$

Which is 0.7453.
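For reference, the arithmetic behind that number can be written out as follows. This is only a sketch of the binomial approximation, with $p = \frac{49}{4999999}$ as above; for $k = 0$ the binomial coefficient and the factor $p^{0}$ both equal 1, so only the last factor remains:

$\binom{29999}{0}\,p^{0}\,(1-p)^{29999} = (1-p)^{29999}$

$(1-p)^{29999} \approx e^{-29999\,p} \approx e^{-0.294} \approx 0.745$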

(f) I might be able to figure out using Bayes' theorem I think.

Last edited: Mar 7, 2016
12. Mar 7, 2016

### Staff: Mentor

(f) Probably, but there is also a shorter direct approach. You can even do both to cross-check earlier results.

13. Mar 7, 2016

### eskimotaro

I think I have thought too much about this problem lately, I'm not sure if what I did above even is correct. But it does make sense I think.

I'm thinking

$P( A \mid X = 1) = \frac{P(X = 1 \mid A)*P(A)}{P(X = 1 \mid A)*P(A) + P(X = 1 \mid A^{c})*P(A^{c})}$

Then I need to find

$P(X = 1 \mid A^{c})$

And that seems a bit tricky.
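One possible way around $P(X = 1 \mid A^{c})$, sketched here with the hypergeometric expressions and the symbols N, M, n from the problem statement: by the law of total probability the denominator is just $P(X = 1)$, which is already known from (a).

$P(A \mid X = 1) = \frac{P(X = 1 \mid A)\,P(A)}{P(X = 1)}$

$P(A) = \frac{n}{N}, \quad P(X = 1 \mid A) = \frac{\binom{N-M}{n-1}}{\binom{N-1}{n-1}}, \quad P(X = 1) = \frac{M\binom{N-M}{n-1}}{\binom{N}{n}}$

$P(A \mid X = 1) = \frac{n}{N}\cdot\frac{\binom{N}{n}}{M\binom{N-1}{n-1}} = \frac{n}{N}\cdot\frac{N}{n}\cdot\frac{1}{M} = \frac{1}{M}$

The last step uses the identity $\binom{N}{n} = \frac{N}{n}\binom{N-1}{n-1}$.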

Last edited: Mar 7, 2016
14. Mar 7, 2016

### Staff: Mentor

There are 50 people with the right DNA. One of them is in the database. What is the probability that this one is the criminal?

15. Mar 8, 2016

### eskimotaro

Is this for the (e) or (f) question? I'm sorry the language confuses me sometimes.

EDIT:

Should I try to figure out the probability that 49 samples in the database are not type S? Can I use:

$P(S^{c}) = 1 - \frac{50}{5000000}$

somehow?

Last edited: Mar 8, 2016
16. Mar 8, 2016

### Staff: Mentor

For (f). It should be a simple question.

I have 50 apples, 49 of them are green and one of them is red. I give you a random apple. What is the probability that the apple is red?

I don't understand why you combine numbers like that.
Out of the 30000 samples, most samples (at least 29950...) are not of type S.

17. Mar 8, 2016

### eskimotaro

This one is simply $\frac{1}{50}$. But I'm not sure what to do with that. This presupposes that I am already choosing among the ones I know have DNA type S. Or is that exactly the meaning of P(A | X = 1)?

EDIT: P(A | X = 1) is the probability that the criminal is chosen, given that we are already choosing among the individuals with DNA type S. So yes, then P(A | X = 1) must be $\frac{1}{50}$, or am I understanding it wrong? I thought I was supposed to be using Bayes' on (f).

EDIT 2: Question (f) has a note which reads: "Here your answer may differ, depending on if you have used numerical values for P(A), P(X = 1 | A), and P(X = 1), or if you have done the calculation algebraically expressed by N, M and n."

Which makes me think that P(A | X = 1) can not be just $\frac{1}{50}$?
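One plausible reading of that note, offered as a guess rather than as the intended solution: plugging approximate numerical values into Bayes' formula gives something close to, but not exactly, $\frac{1}{50}$, whereas the algebraic route with N, M and n gives exactly $\frac{1}{M}$. The small Python sketch below uses the binomial approximations from earlier in the thread; the variable names are invented for the example.

```python
N, n, M = 5_000_000, 30_000, 50  # population, database size, type-S carriers

# (d): the contributor is equally likely to be any of the N individuals.
p_A = n / N

# (c), binomial approximation: exactly one type-S profile among n draws.
p_X1_binom = n * (M / N) * (1 - M / N) ** (n - 1)

# (e), binomial approximation: none of the other M - 1 carriers
# among the remaining n - 1 database entries.
p_X1_given_A_binom = (1 - (M - 1) / (N - 1)) ** (n - 1)

# (f) from plugged-in numerical values (Bayes' formula).
bayes_numeric = p_X1_given_A_binom * p_A / p_X1_binom

# (f) from the algebraic route, where everything cancels down to 1/M.
bayes_algebraic = 1 / M

print(f"numeric   P(A | X=1) = {bayes_numeric:.5f}")
print(f"algebraic P(A | X=1) = {bayes_algebraic:.5f}")
```

The two printed values agree to about three decimal places, which would be consistent with the note saying the answers "may differ".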

EDIT3: I've been thinking more about (e) now.

P(X = 1 | A) is the probability of there being exactly one person with type S in the database, given that the culprit is in the database. Doesn't that mean that P(X = 1 | A) = P(X = 1)?

Last edited: Mar 8, 2016
18. Mar 8, 2016

### Staff: Mentor

Out of 50 people with the DNA signature, your DNA database has exactly one person (X=1). What is the probability that your criminal (one out of 50 with the DNA signature) is this person?
Yes, it is exactly what you need.

No. The knowledge that the criminal is in the database does influence the distribution.
As a more striking example, compare P(X = 0 | A) and P(X = 0).

19. Mar 8, 2016

### eskimotaro

If there is exactly one person with DNA type S in the database, then the probability of picking that person is $\frac{1}{30000}$. That's the answer then I believe? Is that also what you are hinting at?

$P(A | X = 1) = \frac{1}{30000}$

20. Mar 8, 2016

### Staff: Mentor

That would be the answer to "if you pick a random person out of the database, what would be the probability to pick a specific one". That is not the problem statement.

21. Mar 8, 2016

### eskimotaro

But you're saying that the answer isn't $\frac{1}{50}$ either? Given that there is exactly one person with DNA type S in the database, and knowing that there are 50 individuals with type S, the probability that the individual with matching DNA is the culprit is $\frac{1}{50}$?

EDIT: Or do you say that I have to use the result from (d) $P(A) = \frac{3}{500}$ somehow?

EDIT2:
About that. Given that the contributor is in the database, there could also be further samples of S-type DNA in the database, right? The question is asking for the probability of there being only one sample of S-type DNA in the database (which must be from the contributor since that is given). The result P(X = 1 | A) = P(X = 1) obviously cannot be applied to the case X = 0 given A, because event A is defined as 'the contributor is one of the individuals in the database'. The knowledge that the criminal is in the database influences the probability distribution of X only to the extent that X = 0 is no longer part of the distribution.

Last edited: Mar 9, 2016
22. Mar 9, 2016

### Staff: Mentor

It is 1/50.

In general, yes.

The result P(X = 1 | A) = P(X = 1) is wrong. It is an approximation, but not exact for any X.
Suppose it were exact: We know that P(X = 0 | A) + P(X = 1 | A) + ... + P(X = 50 | A) = 1 (because one of the cases has to be true), but we also know that P(X = 0) + P(X = 1) + ... + P(X = 50) = 1. If one of them is not equal (as we established for X = 0), then something else has to differ as well. And all of them differ.
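To see how the two distributions compare, here is a small Python sketch under the hypergeometric model. It assumes that, given A, the culprit takes one database slot and the other 49 carriers are spread over the remaining 4999999 people, so that P(X = k | A) is the probability that k - 1 of those 49 are in the database. The helper names are hypothetical.

```python
import math

def log_comb(a, b):
    """Natural log of the binomial coefficient C(a, b), via log-gamma."""
    return math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)

def hypergeom_pmf(k, pop, special, draws):
    """P(k of the `special` individuals are among `draws` taken from `pop`)."""
    if k < 0 or k > special or k > draws:
        return 0.0
    return math.exp(
        log_comb(special, k)
        + log_comb(pop - special, draws - k)
        - log_comb(pop, draws)
    )

N, n, M = 5_000_000, 30_000, 50

# Unconditional distribution of X.
p_uncond = [hypergeom_pmf(k, N, M, n) for k in range(M + 1)]

# Conditional on A: one slot is taken by the culprit, so X = k means
# k - 1 of the other M - 1 carriers are in the remaining n - 1 slots.
p_cond = [hypergeom_pmf(k - 1, N - 1, M - 1, n - 1) for k in range(M + 1)]

for k in range(4):
    print(f"k={k}:  P(X=k) = {p_uncond[k]:.4f}   P(X=k | A) = {p_cond[k]:.4f}")
```

The k = 0 row shows the most striking difference: P(X = 0 | A) is exactly 0, while P(X = 0) is the largest single probability in the unconditional distribution.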

23. Mar 9, 2016

### eskimotaro

Alright, I see. So I thought more about (e) based on what you said earlier and the hint. Using a hypergeometric formula, I calculated this:

$\frac{({49}\ C \ {0})*({4999950}\ C \ {29999})}{({4999999}\ C \ {29999})}$

Which $\approx 0.7446$
That's the probability that 0 of the 49 left are in the database.
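The value can be cross-checked numerically. The Python sketch below is only an illustration with made-up variable names: it evaluates the same hypergeometric probability as a product, then again through an equivalent ratio of smaller binomial coefficients (choosing which 49 people carry type S instead of which 29999 people are in the database), and finally compares it with the binomial approximation used earlier in the thread.

```python
import math

N, n, M = 5_000_000, 30_000, 50

# Given A, the culprit occupies one database slot. The other M - 1 = 49
# carriers are spread over the remaining N - 1 people, of whom n - 1 are
# in the database.
others, pop, slots = M - 1, N - 1, n - 1

# Place the 49 carriers one at a time among the people outside the database.
p_exact = 1.0
for i in range(others):
    p_exact *= (pop - slots - i) / (pop - i)

# Equivalent form: all 49 carriers fall among the pop - slots outsiders.
p_comb = math.comb(pop - slots, others) / math.comb(pop, others)

# Binomial approximation: n - 1 independent draws with p = 49 / 4999999.
p_binom = (1 - others / pop) ** slots

print(f"product form    : {p_exact:.4f}")
print(f"binomial coeffs : {p_comb:.4f}")
print(f"binomial approx : {p_binom:.4f}")
```

The first two numbers should agree with each other and with the 0.7446 above; the binomial approximation lands slightly higher, near 0.7453.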

EDIT:

Sorry but I have another question regarding (f). Since A is the event that the contributor is one of the individuals in the database, and given that there is only one individual with DNA type S in the database. Does that not mean that the contributor is indeed in the database, so that:

$P(A\ \mid \ X = 1) = 1$?

Last edited: Mar 9, 2016
24. Mar 9, 2016

### Staff: Mentor

Your text would fit $P(A\ \mid \ X = 1 \land A) = 1$, which is true and trivial.
$P(A\ \mid \ X = 1)$ is not one because you could have someone else with the DNA signature in the database, and your criminal not in the database.
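A quick Monte Carlo experiment can make this concrete. The sketch below rests on modeling assumptions that are not spelled out in the thread: the database is treated as the individuals with IDs 0 to n - 1, the 50 carriers are a uniformly random subset of the population, and the culprit is a uniformly random carrier. The function name `estimate_p_A_given_X1` is hypothetical.

```python
import random

def estimate_p_A_given_X1(N, n, M, trials, seed=0):
    """Estimate P(A | X = 1) by simulation; returns (estimate, number of X = 1 cases)."""
    rng = random.Random(seed)
    x1 = 0          # trials with exactly one carrier in the database
    x1_and_a = 0    # ... in which that carrier is also the culprit
    for _ in range(trials):
        # IDs below n are "in the database"; draw the M carriers at random.
        carriers = rng.sample(range(N), M)
        # The sample comes back in random order, so the first entry is a
        # uniformly random carrier: use it as the culprit.
        culprit = carriers[0]
        in_db = sum(1 for c in carriers if c < n)
        if in_db == 1:
            x1 += 1
            if culprit < n:
                x1_and_a += 1
    estimate = x1_and_a / x1 if x1 else float("nan")
    return estimate, x1

estimate, cases = estimate_p_A_given_X1(5_000_000, 30_000, 50, trials=40_000)
print(f"X = 1 occurred in {cases} trials; estimated P(A | X = 1) = {estimate:.3f}")
```

In most of the X = 1 trials the single match in the database is one of the other 49 carriers, so the estimate comes out far below 1.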

25. Mar 9, 2016

### eskimotaro

Ah, that makes a lot of sense! Guess I misunderstood it.

What do you think about my calculation for (e)? That's the probability that 0 of the 49 are in the database, which is what the hint says we're interested in. Is that what I am looking for?