[University Introductory Statistics] DNA crime scene

eskimotaro · Mar 6, 2016

Hello everyone. I have been given a problem in my Introductory Mathematical Statistics class. Been thinking about this one for a while and I am simply stuck.

1. Homework Statement

"There has been found a DNA of type S on a crime scene. We will assume a total population of N = 5000000 that are potential contributors to the lead. Next assume there is a DNA-database consisting of n = 30000 individuals. Also assume that there are M = 50 individuals in the whole population that have a DNA of type S."

There are six sub-questions (a)-(f), and I am stuck on (d)-(f). I will simply explain what questions (a)-(c) are, and then write up questions (d)-(f).

2. The attempt at a solution [part 1]

In (a) we let X = the number of individuals with type S in the database. Here I am to find the probability distribution of X. I think that the sample space must be {0, 1, 2, ..., 50}. To calculate the distribution of X I have used MATLAB and the hypergeometric distribution formula. That was no problem.

In (b) I am to use the binomial distribution formula instead to calculate the probability distribution of X; that was also not much of a problem.

For (c) I am just asked to calculate P(X = 1), which was just to take the relevant calculation from (a) or (b). P(X = 1) is approximately 0.22.
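For reference, parts (a) to (c) can be reproduced with a short script. This is a minimal sketch in Python rather than the MATLAB code mentioned above; the helper names `log_comb`, `hypergeom_pmf` and `binom_pmf` are made up for illustration, and log-gamma is used only to avoid working with very large binomial coefficients.

```python
from math import exp, lgamma, log

N = 5_000_000  # population size
n = 30_000     # database size
M = 50         # individuals with DNA type S


def log_comb(a: int, b: int) -> float:
    """Natural log of the binomial coefficient C(a, b), via log-gamma."""
    return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)


def hypergeom_pmf(x: int, N: int, M: int, n: int) -> float:
    """P(X = x) when n of N individuals are drawn without replacement
    and M of the N are of type S."""
    return exp(log_comb(M, x) + log_comb(N - M, n - x) - log_comb(N, n))


def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for n independent draws, each of type S with probability p."""
    return exp(log_comb(n, x) + x * log(p) + (n - x) * log(1 - p))


p_hyper = hypergeom_pmf(1, N, M, n)  # part (a), evaluated at x = 1
p_binom = binom_pmf(1, n, M / N)     # part (b), evaluated at x = 1
print(f"hypergeometric P(X = 1) = {p_hyper:.4f}")
print(f"binomial       P(X = 1) = {p_binom:.4f}")
```

Both values should land close to the 0.22 quoted for (c), with the two models differing only slightly because the database is a small fraction of the population.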

3. Sub-questions

Here are the sub-questions (d)-(f) which I am stuck on:

"(d) Assume that every individual in the population have the same likelihood of being a contributor. Let A be the event that the contributor is one of the individuals in the database. Calculate P(A).

(e) Find P(X = 1 | A).

Hint: When we know that the contributor is in the database, there are M - 1 = 49 individuals left for whom we do not know whether they are in the database or not. Argue that we are then interested in the probability that none of these are in the database.

(f) Find P(A | X = 1). Argue that this corresponds to the probability that the individual with the matching DNA profile in the database is the culprit."

4. The attempt at a solution [part 2]

I have just not been able to get past these questions. For (d) I think that P(A) might be 1/30000, because that's simply how I interpret the question.

So I would be forever grateful if anyone could give me tips on how to solve this. Excuse my language if anything is unclear; English is my second language.

mfb · Mar 7, 2016

For (d), you can ignore the DNA completely. Your contributor is one random individual out of 5 million. What is the probability that this individual is within the database of 30000?

For (e), you have 4999999 individuals that have to be arranged somehow. The problem is similar to (c).

eskimotaro · Mar 7, 2016

Thank you for your reply. :)

For (d) I guess the answer must be P(A) = 30000/5000000 = 3/500?

Not sure about (e) though. P(X = 1 | A) means the randomly selected person has the DNA, given that he is a contributor? Does it mean the answer is 29999/4999999? The hint sort of confuses me.

mfb · Mar 7, 2016

eskimotaro said:

For (d) I guess the answer must be P(A) = 30000/5000000 = 3/500?

I think so.
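The arithmetic for (d) can be checked exactly with Python's `fractions` module. This is only a small sketch; the variable name `p_A` is chosen here for illustration.

```python
from fractions import Fraction

N = 5_000_000  # population size
n = 30_000     # database size

# Every individual is equally likely to be the contributor,
# so P(A) is the share of the population that is in the database.
p_A = Fraction(n, N)
print(p_A, float(p_A))  # 3/500 0.006
```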

eskimotaro said:

Not sure about (e) though. P(X = 1 | A) means the randomly selected person has the DNA, given that he is a contributor?

No, (A) already includes that the contributor (who has the DNA by definition) is in the database. If that 3/500 event happens, how likely is it that no one else in the database has the same DNA signature?

eskimotaro · Mar 7, 2016

Hmm. So if no one else in the database is to have the same DNA, then they cannot be among the 50 that have S. But we have already included one. So there are 49 persons with S that need to be outside the database?

That's 29999 left in the database, and we need to exclude 49? But the rest of the population also need to be arranged. Meaning (29999 - 49) / 4999999?

mfb · Mar 7, 2016

eskimotaro said:

So there are 49 persons with S that need to be outside the database?

Sure.

eskimotaro said:

That's 29999 left in the database, and we need to exclude 49? But the rest of the population also need to be arranged. Meaning (29999 - 49) / 4999999?

I don't understand where that term comes from.

If you have a population of 4999999 where 49 people have the specific DNA signature, what is the probability that none of them are in a (randomly chosen) subset of 29999 of the 4999999 people?

eskimotaro · Mar 7, 2016

mfb said:

If you have a population of 4999999 where 49 people have the specific DNA signature, what is the probability that none of them are in a (randomly chosen) subset of 29999 of the 4999999 people?

Gotta admit I'm sort of lost here. So if we have 49 persons with DNA S in a population of 4999999. What's the probability that none of them are among the 29999 left in the database? Should that be (4999999 - 29999) / 4999999?

The hint says that we are somehow interested in the probability of the 49 not being in the database, so I'm thinking that number needs to be incorporated somehow.

Truly appreciate your help here!

mfb · Mar 7, 2016

eskimotaro said:

What's the probability that none of them are among the 29999 left in the database? Should that be (4999999 - 29999) / 4999999?

No.
You solved the same problem in (a) and (c) already, just with slightly different numbers.

eskimotaro · Mar 7, 2016

mfb said:

You solved the same problem in (a) and (c) already, just with slightly different numbers.

Oh, do you suggest I use the hypergeometric or binomial formula again? So if I use a binomial formula with p = 49/4999999 and n = 29999 I get the following:

[itex]29999*(\frac{49}{4999999})^{1}*\left(1-\frac{49}{4999999}\right)^{29999-1}[/itex]

Which equals about 0.22 again.

mfb · Mar 7, 2016

Sure. But you need the probability that no one is in the sample, instead of 1 (what you calculated).
The X=1 comes from your criminal already.

eskimotaro · Mar 7, 2016

~~Then it doesn't seem like there's much of a difference between P(X = 1 | A) and P(X = 1). Would you say that's correct?~~ Scratch that, I wrote it before your edit.

mfb said:

But you need the probability that no one is in the sample, instead of 1 (what you calculated).
The X=1 comes from your criminal already.

Ah, of course. So instead I should look for the probability of there being 0, but with the numbers excluding the 1 criminal?

[itex]\binom{29999}{0}*(\frac{49}{4999999})^{0}*\left(1-\frac{49}{4999999}\right)^{29999}[/itex]

Which is 0.7453.
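As a quick numerical check of the 0.7453 figure, here is a minimal Python sketch of the binomial probability of zero successes. The names `n_rest`, `N_rest` and `M_rest` are illustrative labels for the numbers left after setting the contributor aside. Note that the leading binomial coefficient C(29999, 0) equals 1, so only the last factor contributes.

```python
from math import comb

n_rest = 29_999     # database entries other than the contributor
N_rest = 4_999_999  # population other than the contributor
M_rest = 49         # remaining individuals with type S

p = M_rest / N_rest

# Binomial probability of zero successes: C(n, 0) * p**0 * (1 - p)**n.
# The coefficient C(n, 0) is 1 and p**0 is 1.
p_e_binom = comb(n_rest, 0) * p**0 * (1 - p) ** n_rest
print(f"binomial approximation of P(X = 1 | A) = {p_e_binom:.4f}")
```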

(f) I might be able to figure out using Bayes' theorem I think.

mfb · Mar 7, 2016

(f) Probably, but there is also a shorter direct approach. You can even do both to cross-check earlier results.

eskimotaro · Mar 7, 2016

I think I have thought too much about this problem lately; I'm not sure if what I did above is even correct. But it does make sense, I think.

mfb said:

(f) Probably, but there is also a shorter direct approach. You can even do both to cross-check earlier results.

I'm thinking

[itex]P( A \mid X = 1) = \frac{P(X = 1 \mid A)*P(A)}{P(X = 1 \mid A)*P(A) + P(X = 1 \mid A^{c})*P(A^{c})}[/itex]

Then I need to find

[itex]P(X = 1 \mid A^{c})[/itex]

And that seems a bit tricky.

mfb · Mar 7, 2016

There are 50 people with the right DNA. One of them is in the database. What is the probability that this one is the criminal?

eskimotaro · Mar 8, 2016

mfb said:

There are 50 people with the right DNA. One of them is in the database. What is the probability that this one is the criminal?

Is this for the (e) or (f) question? I'm sorry the language confuses me sometimes.

EDIT:

Should I try to figure out the probability that 49 samples in the database are not type S? Can I use:

[itex]P(S^{c}) = 1 - \frac{50}{5000000}[/itex]

somehow?

mfb · Mar 8, 2016

eskimotaro said:

Is this for the (e) or (f) question? I'm sorry the language confuses me sometimes.

For (f). It should be a simple question.

I have 50 apples, 49 of them are green and one of them is red. I give you a random apple. What is the probability that the apple is red?

Should I try to figure out the probability that 49 samples in the database are not type S? Can I use:

[itex]P(S^{c}) = 1 - \frac{50}{5000000}[/itex]

somehow?

I don't understand why you combine numbers like that.
Out of the 30000 samples, most samples (at least 29950...) are not of type S.

eskimotaro · Mar 8, 2016

mfb said:

I have 50 apples, 49 of them are green and one of them is red. I give you a random apple. What is the probability that the apple is red?

This one is simply [itex]\frac{1}{50}[/itex]. But I'm not sure what to do with that. This presupposes that I am already choosing among the ones I know have DNA type S. Or is that exactly the meaning of P(A | X = 1)?

EDIT: P(A | X = 1) is the probability that the criminal is chosen, given that we are already choosing among the individuals with DNA type S. So yes, then P(A | X = 1) must be [itex]\frac{1}{50}[/itex], or am I understanding it wrong? I thought I was supposed to be using Bayes' on (f).

EDIT 2: Question (f) has a note which reads: "Here your answer may differ, depending on if you have used numerical values for P(A), P(X = 1 | A), and P(X = 1), or if you have done the calculation algebraically expressed by N, M and n."

Which makes me think that P(A | X = 1) can not be just [itex]\frac{1}{50}[/itex]?

EDIT3: I've been thinking more about (e) now.

P(X = 1 | A) is the probability of there being exactly one person with type S in the database, given that the culprit is in the database. Doesn't that mean that P(X = 1 | A) = P(X = 1)?

mfb · Mar 8, 2016

eskimotaro said:

This presupposes that I am already choosing among the ones I know have DNA type S. Or is that exactly the meaning of P(A | X = 1)?

Out of 50 people with the DNA signature, your DNA database has exactly one person (X=1). What is the probability that your criminal (one out of 50 with the DNA signature) is this person?
Yes, it is exactly what you need.

eskimotaro said:

P(X = 1 | A) is the probability of there being exactly one person with type S in the database, given that the culprit is in the database. Doesn't that mean that P(X = 1 | A) = P(X = 1)?

No. The knowledge that the criminal is in the database does influence the distribution.
As a more striking example, compare P(X = 0 | A) and P(X = 0).
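The comparison can be made concrete with a short Python sketch that tabulates P(X = k) next to P(X = k | A) under the hypergeometric model. The function names are illustrative, and the symmetric form of the hypergeometric formula is used here only because it keeps the integers small.

```python
from math import comb

N, n, M = 5_000_000, 30_000, 50


def hypergeom_pmf(x: int, N: int, M: int, n: int) -> float:
    """P(X = x) for a hypergeometric X, using the identity
    C(M, x) C(N-M, n-x) / C(N, n) = C(n, x) C(N-n, M-x) / C(N, M)."""
    if x < 0 or x > min(M, n):
        return 0.0
    return comb(n, x) * comb(N - n, M - x) / comb(N, M)


def pmf_given_A(x: int) -> float:
    """P(X = x | A): the contributor is one type-S entry, and the other
    M - 1 type-S individuals are spread over the remaining N - 1 people,
    n - 1 of whom are in the database."""
    return hypergeom_pmf(x - 1, N - 1, M - 1, n - 1)


for k in range(4):
    print(f"k = {k}:  P(X = k) = {hypergeom_pmf(k, N, M, n):.4f}"
          f"   P(X = k | A) = {pmf_given_A(k):.4f}")
```

The k = 0 row shows the contrast most clearly: P(X = 0) is large, while P(X = 0 | A) is exactly zero because the contributor is already in the database.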

eskimotaro · Mar 8, 2016

mfb said:

Out of 50 people with the DNA signature, your DNA database has exactly one person (X=1). What is the probability that your criminal (one out of 50 with the DNA signature) is this person?
Yes, it is exactly what you need.

If there is exactly one person with DNA type S in the database, then the probability of picking that person is [itex]\frac{1}{30000}[/itex]. That's the answer then I believe? Is that also what you are hinting at?

[itex]P(A | X = 1) = \frac{1}{30000}[/itex]

mfb said:

No. The knowledge that the criminal is in the database does influence the distribution.
As a more striking example, compare P(X = 0 | A) and P(X = 0).

Will have to think more about this one.

mfb · Mar 8, 2016

eskimotaro said:

If there is exactly one person with DNA type S in the database, then the probability of picking that person is [itex]\frac{1}{30000}[/itex]. That's the answer then I believe? Is that also what you are hinting at?

That would be the answer to "if you pick a random person out of the database, what would be the probability to pick a specific one". That is not the problem statement.

eskimotaro · Mar 8, 2016

mfb said:

That would be the answer to "if you pick a random person out of the database, what would be the probability to pick a specific one". That is not the problem statement.

But you're saying that the answer isn't [itex]\frac{1}{50}[/itex] either? Given that there is exactly one person with DNA type S in the database, and knowing that there are 50 individuals with type S, the probability that the individual with matching DNA is the culprit is [itex]\frac{1}{50}[/itex]?

EDIT: Or do you say that I have to use the result from (d) [itex]P(A) = \frac{3}{500}[/itex] somehow?

EDIT2:

mfb said:

No. The knowledge that the criminal is in the database does influence the distribution.
As a more striking example, compare P(X = 0 | A) and P(X = 0).

About that. Given that the contributor is in the database, there could also be further samples of S-type DNA in the database, right? The question is asking for the probability of there being only one sample of S-type DNA in the database (which must be from the contributor since that is given). The result P(X = 1 | A) = P(X = 1) obviously cannot be applied to the case X = 0 given A, the reason being that event A is defined as 'the contributor is one of the individuals in the database'. The knowledge that the criminal is in the database influences the probability distribution of X only to the extent that X = 0 is no longer part of the distribution.

mfb · Mar 9, 2016

eskimotaro said:

But you're saying that the answer isn't [itex]\frac{1}{50}[/itex] either?

It is 1/50.

About that. Given that the contributor is in the database, there could also be further samples of S-type DNA in the database, right?

In general, yes.

The question is asking for the probability of there being only one sample of S-type DNA in the database (which must be from the contributor since that is given). The result P(X = 1 | A) = P(X = 1) obviously cannot be applied to the case X = 0 given A, the reason being that event A is defined as 'the contributor is one of the individuals in the database'. The knowledge that the criminal is in the database influences the probability distribution of X only to the extent that X = 0 is no longer part of the distribution.

That result is wrong. It is an approximation, but not exact for any X.
Imagine it were exact: We know that P(X = 0| A) + P(X = 1| A) + ... + P(X = 50| A) = 1 (because one of the cases has to be true), but we also know that P(X=0) + P(X=1) + ... + P(X=50) = 1. If one of them is not equal (as we established for X=0), then something else has to differ as well. And all of them differ.
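The value 1/50 for (f) can also be cross-checked with Bayes' theorem, P(A | X = 1) = P(X = 1 | A) P(A) / P(X = 1), where the denominator is simply the unconditional P(X = 1) from (c). The sketch below does this with exact fractions under the hypergeometric model; the variable names are illustrative, and the symmetric form of the hypergeometric formula is used to keep the integers manageable.

```python
from fractions import Fraction
from math import comb

N, n, M = 5_000_000, 30_000, 50

# (d): the contributor is equally likely to be any of the N individuals.
p_A = Fraction(n, N)

# (e): none of the other M - 1 type-S individuals is among the
# remaining n - 1 database entries (hypergeometric, zero successes).
p_X1_given_A = Fraction(comb(N - n, M - 1), comb(N - 1, M - 1))

# (c): exactly one of the M type-S individuals is in the database.
p_X1 = Fraction(n * comb(N - n, M - 1), comb(N, M))

posterior = p_X1_given_A * p_A / p_X1
print(posterior)  # exact result under the hypergeometric model

# The same formula with the rounded values quoted in the thread.
rounded = 0.7453 * 0.006 / 0.22
print(round(rounded, 4))
```

With exact fractions the result is exactly 1/M, whereas plugging in rounded or binomial values gives something only close to 0.02. That may be what the note attached to (f) is referring to.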

eskimotaro · Mar 9, 2016

mfb said:

That result is wrong. It is an approximation, but not exact for any X.
Imagine it would be exact: We know that P(X = 0| A) + P(X = 1| A) + ... + P(X = 50| A) = 1 (because one of the cases has to be true), but we also know that P(X=0) + P(X=1) + ... + P(X=50) = 1. If one of them is not equal (as we established for X=0), then something else has to differ as well. And all of them differ.

Alright, I see. So I thought more about (e) based on what you said earlier and the hint. Using a hypergeometric formula, I calculated this:

[itex]\frac{({49}\ C \ {0})*({4999950}\ C \ {29999})}{({4999999}\ C \ {29999})}[/itex]

Which [itex]\approx 0.7446[/itex]
That's the probability that 0 of the 49 left are in the database.
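The hypergeometric value can be reproduced without huge binomial coefficients by writing the ratio as a product of 49 simple fractions. This is a minimal Python sketch; the names `N_rest`, `n_rest` and `M_rest` are illustrative.

```python
from math import comb

N_rest = 4_999_999  # population other than the contributor
n_rest = 29_999     # database entries other than the contributor
M_rest = 49         # remaining individuals with type S

# Place the 49 remaining type-S individuals one at a time: each must land
# among the people outside the database who have not been placed yet.
p_none = 1.0
for i in range(M_rest):
    p_none *= (N_rest - n_rest - i) / (N_rest - i)

# Cross-check with the equivalent closed form C(N-n, 49) / C(N, 49).
p_check = comb(N_rest - n_rest, M_rest) / comb(N_rest, M_rest)

print(f"product form: {p_none:.4f}")
print(f"closed form:  {p_check:.4f}")
```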

EDIT:

Sorry, but I have another question regarding (f). Since A is the event that the contributor is one of the individuals in the database, and given that there is only one individual with DNA type S in the database, does that not mean that the contributor is indeed in the database, so that:

[itex]P(A\ \mid \ X = 1) = 1[/itex]?

mfb · Mar 9, 2016

Your text would fit to ##P(A\ \mid \ X = 1 \land A) = 1## which is true and trivial.
##P(A\ \mid \ X = 1)## is not one because you could have someone else with the DNA signature in the database, and your criminal not in the database.
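A small simulation illustrates the point. The sketch below uses deliberately tiny, hypothetical numbers (N = 50, n = 10, M = 5, not the ones from the problem) so that it runs quickly, and it assumes the culprit is equally likely to be any of the M type-S individuals. Among the trials with exactly one type-S person in the database, the culprit is that person only about 1/M of the time.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N, n, M = 50, 10, 5  # hypothetical small numbers for illustration only
trials = 100_000

x1_cases = 0  # trials with exactly one type-S individual in the database
x1_and_A = 0  # of those, trials where that individual is the culprit

for _ in range(trials):
    database = set(random.sample(range(N), n))  # who is in the database
    culprit = random.randrange(M)               # individuals 0..M-1 have type S
    x = sum(1 for person in range(M) if person in database)
    if x == 1:
        x1_cases += 1
        if culprit in database:
            x1_and_A += 1

estimate = x1_and_A / x1_cases
print(f"estimated P(A | X = 1) = {estimate:.3f}, compared with 1/M = {1 / M:.3f}")
```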

eskimotaro · Mar 9, 2016

mfb said:

Your text would fit to ##P(A\ \mid \ X = 1 \land A) = 1## which is true and trivial.
##P(A\ \mid \ X = 1)## is not one because you could have someone else with the DNA signature in the database, and your criminal not in the database.

Ah, that makes a lot of sense! Guess I misunderstood it.

What do you think about my calculation for (e)? That's the probability that 0 of 49 are not in the database. Which is what the hint says that we're interested in. Is that what I am looking for?

mfb · Mar 9, 2016

Sure.

eskimotaro · Mar 9, 2016

What I mean is, the probability that 0 of 49 people with DNA type S not being in the database should be the same as what I am looking for? The probability that there is exactly one person with DNA type S, given that the contributor is in the database. Since I have already accounted for the 1 person not being outside by calculating what I did above.

mfb · Mar 9, 2016

eskimotaro said:

What I mean is, the probability that 0 of 49 people with DNA type S not being in the database should be the same as what I am looking for?

There is a duplicate negation, but I guess you mean the right thing.

eskimotaro said:

Since I have already accounted for the 1 person not being outside by calculating what I did above.

Right.

eskimotaro · Mar 9, 2016

Fantastic! Then I guess I am done. Thank you so much for your help. Truly appreciate it.

Physics Forums seems like a very interesting place, I think I will stick around and lurk. :)
