# [University Introductory Statistics] DNA crime scene

1. Mar 6, 2016

### eskimotaro

Hello everyone. I have been given a problem in my Introductory Mathematical Statistics class. Been thinking about this one for a while and I am simply stuck.

1. The problem statement, all variables and given/known data

"DNA of type S has been found at a crime scene. We will assume a total population of N = 5000000 potential contributors to the lead. Next, assume there is a DNA database consisting of n = 30000 individuals. Also assume that M = 50 individuals in the whole population have DNA of type S."

There are six sub-questions (a)-(f), and I am stuck on (d)-(f). I will simply explain what questions (a)-(c) are, and then write up questions (d)-(f).

2. The attempt at a solution [part 1]

In (a) we let X = the number of individuals with type S in the database. Here I am to find the probability distribution of X. I think that the sample space must be x = {0, 1, 2, ..., 50}. To calculate the distribution of X I used MATLAB and the hypergeometric distribution formula. That was no problem.

In (b) I am to use the binomial distribution formula instead to calculate the probability distribution of X. That was also not much of a problem.

For (c) I am just asked to calculate P(X = 1), which was just to take the relevant calculation from (a) or (b). P(X = 1) is approximately 0.22.
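
As a rough cross-check of (a)-(c), here is a minimal Python sketch (the thread used MATLAB; the helper names `log_comb`, `hypergeom_pmf` and `binom_pmf` are made up for this illustration). It evaluates both formulas through log-gamma so that the large binomial coefficients do not overflow, and it assumes the binomial version uses n = 30000 trials with success probability M/N.

```python
from math import exp, lgamma, log, log1p

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def hypergeom_pmf(k, N, M, n):
    """P(X = k): k type-S people among n drawn without replacement from N, M of which are type S."""
    return exp(log_comb(M, k) + log_comb(N - M, n - k) - log_comb(N, n))

def binom_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    return exp(log_comb(n, k) + k * log(p) + (n - k) * log1p(-p))

N, M, n = 5_000_000, 50, 30_000

p_hyper = hypergeom_pmf(1, N, M, n)   # part (a), evaluated at x = 1
p_binom = binom_pmf(1, n, M / N)      # part (b), evaluated at x = 1
print(round(p_hyper, 4), round(p_binom, 4))
```

Both values round to 0.22, in line with the figure quoted for (c).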

3. Sub-questions

Here are the sub-questions (d)-(f) which I am stuck on:

"(d) Assume that every individual in the population has the same likelihood of being a contributor. Let A be the event that the contributor is one of the individuals in the database. Calculate P(A).

(e) Find P(X = 1 | A).

Hint: When we know that the contributor is in the database, there are M - 1 = 49 left for whom we do not know whether they are in the database or not. Argue that we are then interested in the probability that none of these are in the database.

(f) Find P(A | X = 1). Argue that this corresponds to the probability that the individual with the matching DNA profile in the database is the culprit."

4. The attempt at a solution [part 2]

I have just not been able to get past these questions. For (d) I think that P(A) might be 1/30000, because that's simply how I interpret the question.

So I would be forever grateful if anyone could give me tips on how to solve this. Excuse my language if anything is unclear; English is my second language.

2. Mar 7, 2016

### Staff: Mentor

For (d), you can ignore the DNA completely. Your contributor is one random individual out of 5 million. What is the probability that this individual is within the database of 30000?

For (e), you have 4999999 individuals that have to be arranged somehow. The problem is similar to (c).

3. Mar 7, 2016

### eskimotaro

For (d) I guess the answer must be P(A) = 30000/5000000 = 3/500?

Not sure about (e) though. P(X = 1 | A) means the randomly selected person has the DNA, given that he is a contributor? Does it mean the answer is 29999/4999999? The hint sort of confuses me.

4. Mar 7, 2016

### Staff: Mentor

For (d): I think so.
For (e): no. A already includes that the contributor (who has the DNA by definition) is in the database. If that 3/500 event happens, how likely is it that no one else in the database has the same DNA signature?

5. Mar 7, 2016

### eskimotaro

Hmm. So if no one else in the database is to have the same DNA, then they cannot be among the 50 that have S. But we have already included one. So there are 49 persons with S that need to be outside the database?

That's 29999 left in the database, and we need to exclude 49? But the rest of the population also needs to be arranged. Meaning (29999 - 49) / 4999999?

6. Mar 7, 2016

### Staff: Mentor

Sure, the other 49 need to be outside the database.
I don't understand where the term (29999 - 49) / 4999999 comes from, though.

If you have a population of 4999999 where 49 people have the specific DNA signature, what is the probability that none of them are in a (randomly chosen) subset of 29999 of the 4999999 people?

7. Mar 7, 2016

### eskimotaro

Gotta admit I'm sort of lost here. So if we have 49 persons with DNA S in a population of 4999999. What's the probability that none of them are among the 29999 left in the database? Should that be (4999999 - 29999) / 4999999?

The hint says that we are somehow interested in the probability of the 49 not being in the database, so I'm thinking that number needs to be incorporated somehow.

8. Mar 7, 2016

### Staff: Mentor

No.
You solved the same problem in (a) and (c) already, just with slightly different numbers.

9. Mar 7, 2016

### eskimotaro

Oh, do you suggest I use the hypergeometric or binomial formula again? So if I use a binomial formula with p = 49/4999999 and n = 29999 I get the following:

$29999*(\frac{49}{4999999})^{1}*\left(1-\frac{49}{4999999}\right)^{29999-1}$

10. Mar 7, 2016

### Staff: Mentor

Sure. But you need the probability that none of the 49 is in the sample, not the probability that exactly one is (which is what you calculated).

11. Mar 7, 2016

### eskimotaro

Then it doesn't seem like there's much of a difference between P(X = 1 | A) and P(X = 1). Would you say that's correct? Scratch that, I wrote it before your edit.

Ah, of course. So instead I should look for the probability of there being 0, but with the numbers excluding the 1 criminal?

$\binom{29999}{0}*(\frac{49}{4999999})^{0}*\left(1-\frac{49}{4999999}\right)^{29999}$

Which is 0.7453.
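
A small Python sketch of this step, assuming the database is a uniformly random subset of the population (the variable names are illustrative only). It compares the binomial value above with the exact without-replacement product, in which each of the 49 remaining type-S people in turn has to fall outside the 29999 remaining database slots.

```python
from math import prod

N, M, n = 5_000_000, 50, 30_000

# Condition on A: the contributor already occupies one database slot.
others = N - 1     # 4999999 people besides the contributor
carriers = M - 1   # 49 of them have type S
slots = n - 1      # 29999 database slots left to fill

# Exact (hypergeometric): the i-th carrier must land outside the database,
# given that the previous i carriers already did.
p_exact = prod((others - slots - i) / (others - i) for i in range(carriers))

# Binomial approximation used in the post: 29999 independent draws,
# each avoiding type S with probability 1 - 49/4999999.
p_binom = (1 - carriers / others) ** slots

print(round(p_exact, 4), round(p_binom, 4))
```

The two values agree to about three decimal places, so either formula is a reasonable way to get P(X = 1 | A) here.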

(f) I might be able to figure out using Bayes' theorem I think.

Last edited: Mar 7, 2016

12. Mar 7, 2016

### Staff: Mentor

(f) Probably, but there is also a shorter direct approach. You can even do both to cross-check earlier results.

13. Mar 7, 2016

### eskimotaro

I think I have thought too much about this problem lately; I'm not sure if what I did above is even correct. But it does make sense, I think.

I'm thinking

$P( A \mid X = 1) = \frac{P(X = 1 \mid A)*P(A)}{P(X = 1 \mid A)*P(A) + P(X = 1 \mid A^{c})*P(A^{c})}$

Then I need to find

$P(X = 1 \mid A^{c})$

And that seems a bit tricky.
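
One way to handle this term, sketched under the thread's assumptions (the database is a uniformly random subset of the population, and the contributor is equally likely to be any of the M type-S individuals): if the contributor is not in the database, the n database members are drawn from the other N - 1 people, of whom M - 1 have type S. Under these assumptions X given $A^{c}$ is again hypergeometric:

$P(X = 1 \mid A^{c}) = \frac{\binom{M-1}{1}\binom{N-M}{n-1}}{\binom{N-1}{n}}$

With N = 5000000, M = 50 and n = 30000 this is the same kind of calculation as in (a), now with 49 type-S people among 4999999 and a sample of 30000.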

Last edited: Mar 7, 2016

14. Mar 7, 2016

### Staff: Mentor

There are 50 people with the right DNA. One of them is in the database. What is the probability that this one is the criminal?

15. Mar 8, 2016

### eskimotaro

Is this for the (e) or (f) question? I'm sorry, the language confuses me sometimes.

EDIT:

Should I try to figure out the probability that 49 samples in the database are not type S? Can I use:

$P(S^{c}) = 1 - \frac{50}{5000000}$

somehow?

Last edited: Mar 8, 2016

16. Mar 8, 2016

### Staff: Mentor

That was for (f). It should be a simple question.

I have 50 apples, 49 of them are green and one of them is red. I give you a random apple. What is the probability that the apple is red?

Regarding your edit: I don't understand why you combine numbers like that.
Out of the 30000 samples, most samples (at least 29950...) are not of type S.

17. Mar 8, 2016

### eskimotaro

This one is simply $\frac{1}{50}$. But I'm not sure what to do with that. This presupposes that I am already choosing among the ones I know have DNA type S. Or is that exactly the meaning of P(A | X = 1)?

EDIT: P(A | X = 1) is the probability that the criminal is chosen, given that we are already choosing among the individuals with DNA type S. So yes, then P(A | X = 1) must be $\frac{1}{50}$, or am I understanding it wrong? I thought I was supposed to be using Bayes' on (f).

EDIT 2: Question (f) has a note which reads: "Here your answer may differ, depending on if you have used numerical values for P(A), P(X = 1 | A), and P(X = 1), or if you have done the calculation algebraically expressed by N, M and n."

Which makes me think that P(A | X = 1) can not be just $\frac{1}{50}$?
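
One way to see what the note is getting at, as a hedged Python sketch rather than the course's intended solution: evaluating Bayes' rule with exact hypergeometric probabilities gives one value, while plugging in rounded numbers gives a slightly different one. The helper `hypergeom_pmf` is hypothetical, and it swaps the roles of the database and the type-S group (a standard symmetry of the hypergeometric distribution) purely to keep the binomial coefficients small.

```python
from fractions import Fraction
from math import comb

N, M, n = 5_000_000, 50, 30_000

def hypergeom_pmf(k, pop, marked, draws):
    """Exact P(k marked items in a uniformly random sample of `draws` out of `pop`)."""
    return Fraction(comb(marked, k) * comb(pop - marked, draws - k), comb(pop, draws))

# Symmetry: treat the database members as "marked" and the type-S people as the sample.
p_A = Fraction(n, N)                                   # (d)
p_X1 = hypergeom_pmf(1, N, n, M)                       # (c), exact
p_X1_given_A = hypergeom_pmf(0, N - 1, n - 1, M - 1)   # (e), exact

posterior = p_X1_given_A * p_A / p_X1                  # Bayes' rule for (f)
print(posterior)  # 1/50

# Same rule with the rounded figures quoted in the thread.
rounded = 0.7453 * 0.006 / 0.22
print(round(rounded, 4))
```

Under these assumptions the exact calculation reduces to 1/M, while the rounded inputs land near 0.0203, which would explain why the note expects numerical and algebraic answers to differ a little.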

EDIT3: I've been thinking more about (e) now.

P(X = 1 | A) is the probability of there being exactly one person with type S in the database, given that the culprit is in the database. Doesn't that mean that P(X = 1 | A) = P(X = 1)?

Last edited: Mar 8, 2016

18. Mar 8, 2016

### Staff: Mentor

Out of 50 people with the DNA signature, your DNA database has exactly one person (X=1). What is the probability that your criminal (one out of 50 with the DNA signature) is this person?
Yes, that is exactly the meaning of P(A | X = 1), and it is what you need.

On EDIT3: no. The knowledge that the criminal is in the database does influence the distribution of X.
As a more striking example, compare P(X = 0 | A) and P(X = 0).
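
To make that comparison concrete (a sketch, assuming as before that the database is a uniformly random subset of the population): given A, the contributor is in the database and has type S, so X is at least 1, whereas without that knowledge X = 0 is the most likely outcome:

$P(X = 0 \mid A) = 0, \qquad P(X = 0) = \frac{\binom{N-M}{n}}{\binom{N}{n}} \approx 0.74$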

19. Mar 8, 2016

### eskimotaro

If there is exactly one person with DNA type S in the database, then the probability of picking that person is $\frac{1}{30000}$. That's the answer then, I believe? Is that also what you are hinting at?

$P(A | X = 1) = \frac{1}{30000}$