How Can DNA and Big Data Be Used to Find You?

In summary, this study found that around 60% of Americans of European descent can be matched to a third cousin or closer relation through DNA testing services like Ancestry.com, 23andMe, or MyHeritage. This percentage is only set to grow in the coming years, as more people give their genetic information over to these companies. While many people are worried that the police could eventually abuse or misuse this technology, the study concludes that there is no real risk yet.
  • #1
https://www.vox.com/science-and-health/2018/10/12/17957268/science-ancestry-dna-privacy

All over the world, some 10 million people have had their DNA analyzed by a direct-to-consumer genetics company like 23andMe, Ancestry.com, or MyHeritage.

...

So many people have now used the services that many of us don’t even need to share our own DNA to be tracked down. Your father — or perhaps a third cousin whom you’ve never even met — could have uploaded their data, which could lead to you. This is how police cracked the cold case of the “Golden State Killer” earlier this year: An old DNA sample from a crime scene matched with the DNA of the killer’s relatives in public databases, which, after some more sleuthing, led to him.

...

The study concludes that around 60 percent of Americans of European descent could be matched to a third cousin or closer relation. And this percentage is only set to grow in the coming years, as more people give their genetic information over to these companies.
 
  • #2
Familial DNA searches have been around since the early 2000s, but with cases like the Golden State Killer and the Grim Sleeper being cracked after decades, the technique is getting noticed by more and more police forces. Could the police eventually abuse or misuse the technology? I have no doubt that it will eventually happen. But for now, I'm glad to see these people caught and removed from society.
 
  • #3
If this bothers you, my question is, exactly what are you planning on doing that it bothers you that the police can find you? Are you planning some crime and are worried that you might not be able to get away with it?
 
  • #4
Yes, I'm sure that will happen. There was a case recently in Texas where the primary forensics lab had shoddy procedures that bordered on being criminal and as a result many cases were thrown out and convicted felons were released.

In the case of DNA, it's so easy to accidentally destroy a sample, contaminate a sample or switch samples.
 
  • #5
phyzguy said:
If this bothers you, my question is, exactly what are you planning on doing that it bothers you that the police can find you? Are you planning some crime and are worried that you might not be able to get away with it?
There's the general worry that innocent people can be caught up and harassed by the police because the police believe they did something wrong.

There was the case of the family who lived on a farm in Middle America which, according to an internet mapping company, was the geographic middle of America. They were routinely harassed and investigated because internet IP searches would finger them as the source of some criminal internet transactions.

https://arstechnica.com/tech-policy...m-for-turning-their-life-into-a-digital-hell/

If it can happen to them then it can happen with DNA too.

The traditional use of DNA in a court case was, first, to have a pool of suspects and, second, to have some DNA from the perpetrator; the intersection of the two would identify the criminal. There was a case once where they had DNA, went to a known DNA database of criminals, and identified ten random felons as suspects, picking one who in the end had absolutely nothing to do with the case. Now, with the extensive DNA databases, that problem is only magnified, and someone will naively do it again with disastrous results.

I could envision a time when someone is identified and the SWAT team shows up; the person, surprised at home, panics and is shot, only for it to emerge later that the SWAT team had the wrong person. We've seen this already in the swatting incident between two gamers, where the second gamer provided a false address and the homeowner there was killed.

https://www.cnn.com/2017/12/30/us/kansas-police-shooting-swatting/index.html

It's a Bayesian problem that both prosecutors and defense attorneys have to carefully consider and present.
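To make the Bayesian point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration only: the function name, the uniform prior over a pool of possible sources, the premise that the true source always matches, and the example numbers. It is not how any court or lab actually presents the evidence.

```python
def posterior_source_probability(pool_size: int, random_match_prob: float) -> float:
    """Probability that a matching person is the true source of the sample.

    Assumptions (illustrative only):
    - exactly one true source exists among `pool_size` equally likely people
    - the true source always matches
    - anyone else matches by coincidence with probability `random_match_prob`
    """
    prior = 1.0 / pool_size
    # Bayes' rule: P(source | match) = P(match | source) P(source) / P(match)
    evidence_if_source = prior * 1.0
    evidence_if_not = (1.0 - prior) * random_match_prob
    return evidence_if_source / (evidence_if_source + evidence_if_not)


# Hypothetical numbers: the same 1-in-a-million match, two very different pools.
small_pool = posterior_source_probability(pool_size=10, random_match_prob=1e-6)
large_pool = posterior_source_probability(pool_size=10_000_000, random_match_prob=1e-6)
print(f"10 suspects from other evidence: {small_pool:.6f}")
print(f"10 million people, no other evidence: {large_pool:.4f}")
```

With a short list of suspects built from other evidence, the match is close to decisive. With a pool of millions and nothing else tying the person to the crime, the same match leaves the posterior under 10%, which is why the prior matters as much as the match statistic.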
 
  • #6
I do wonder how exactly DNA evidence is presented in court and how it's defended against. If the person's DNA was found at multiple crime scenes or multiple locations in a single crime scene, that would be pretty strong evidence. However, if a single hair was found in a car, it could have gotten there by some other means. What if the car was stolen but had been purchased recently and you happened to have taken it for a test drive before it was bought? Odds are that the car would be owned by someone who lives relatively close to you which would increase your odds of being near the scene on any given day. If that hair is the only available DNA evidence, I wonder how the police would treat it? Good luck trying to explain that away.
 
  • #7
But it's worse than that: DNA identification is based on marker presence, meaning they don't compare the whole DNA strand but spot check it.

https://en.wikipedia.org/wiki/DNA_profiling

STR analysis
The system of DNA profiling used today is based on Polymerase chain reaction (PCR) and uses simple sequences[12] or short tandem repeats (STR). This method uses highly polymorphic regions that have short repeated sequences of DNA (the most common is 4 bases repeated, but there are other lengths in use, including 3 and 5 bases). Because unrelated people almost certainly have different numbers of repeat units, STRs can be used to discriminate between unrelated individuals. These STR loci (locations on a chromosome) are targeted with sequence-specific primers and amplified using PCR. The DNA fragments that result are then separated and detected using electrophoresis. There are two common methods of separation and detection, capillary electrophoresis (CE) and gel electrophoresis.

Each STR is polymorphic, but the number of alleles is very small. Typically each STR allele will be shared by around 5 - 20% of individuals. The power of STR analysis comes from looking at multiple STR loci simultaneously. The pattern of alleles can identify an individual quite accurately. Thus STR analysis provides an excellent identification tool. The more STR regions that are tested in an individual the more discriminating the test becomes.

From country to country, different STR-based DNA-profiling systems are in use. In North America, systems that amplify the CODIS 20[13] core loci are almost universal, whereas in the United Kingdom the DNA-17 17 loci system (which is compatible with The National DNA Database) is in use, and Australia uses 18 core markers.[14] Whichever system is used, many of the STR regions used are the same. These DNA-profiling systems are based on multiplex reactions, whereby many STR regions will be tested at the same time.

The true power of STR analysis is in its statistical power of discrimination. Because the 20 loci that are currently used for discrimination in CODIS are independently assorted (having a certain number of repeats at one locus does not change the likelihood of having any number of repeats at any other locus), the product rule for probabilities can be applied. This means that, if someone has the DNA type of ABC, where the three loci were independent, we can say that the probability of having that DNA type is the probability of having type A times the probability of having type B times the probability of having type C. This has resulted in the ability to generate match probabilities of 1 in a quintillion (1x10^18) or more. However, DNA database searches showed much more frequent than expected false DNA profile matches.[15] Moreover, since there are about 12 million monozygotic twins on Earth, the theoretical probability is not accurate.

In practice, the risk of contaminated-matching is much greater than matching a distant relative, such as contamination of a sample from nearby objects, or from left-over cells transferred from a prior test. The risk is greater for matching the most common person in the samples: Everything collected from, or in contact with, a victim is a major source of contamination for any other samples brought into a lab. For that reason, multiple control-samples are typically tested in order to ensure that they stayed clean, when prepared during the same period as the actual test samples. Unexpected matches (or variations) in several control-samples indicates a high probability of contamination for the actual test samples. In a relationship test, the full DNA profiles should differ (except for twins), to prove that a person was not actually matched as being related to their own DNA in another sample.
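The product rule described in the quoted passage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are made up, the allele frequencies are hypothetical values picked from the 5 - 20% range the quote mentions, and genotype frequencies are computed with the standard Hardy-Weinberg formulas (2pq for two different alleles, p^2 for two copies of the same allele).

```python
from math import prod


def heterozygote_frequency(p: float, q: float) -> float:
    """Hardy-Weinberg frequency of carrying two different alleles (2pq)."""
    return 2.0 * p * q


def homozygote_frequency(p: float) -> float:
    """Hardy-Weinberg frequency of carrying two copies of one allele (p^2)."""
    return p * p


def random_match_probability(genotype_freqs) -> float:
    """Product rule: multiply per-locus genotype frequencies.

    Only valid if the loci are independent, as the quoted text assumes.
    """
    return prod(genotype_freqs)


# Hypothetical 20-locus profile: 10 heterozygous loci and 10 homozygous loci.
profile = [heterozygote_frequency(0.10, 0.20)] * 10 + [homozygote_frequency(0.15)] * 10
rmp = random_match_probability(profile)
print(f"random match probability: {rmp:.2e}")

# The same profile with only 8 usable loci, as with a degraded sample.
partial_rmp = random_match_probability(profile[:8])
print(f"with 8 loci only: {partial_rmp:.2e}")
```

The point of the second call is that every locus dropped from the comparison multiplies the match probability back up, so the headline figure only applies to a complete, clean profile.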

Consider this simple game show fail where someone has to identify the missing letters from minimal clues:



versus this one where they get it so wrong

 
  • #8
phyzguy said:
If this bothers you, my question is, exactly what are you planning on doing that it bothers you that the police can find you? Are you planning some crime and are worried that you might not be able to get away with it?
This is the standard defense of nearly any intrusive surveillance technology - "If you don't do anything wrong then you have nothing to worry about." The obvious reply is that the technology can be used for unintended and nefarious purposes, whether by a maverick technician or LEO, or by an abusive government. If human history is any indication, this is a legitimate concern.
 
  • #9
Yes, and it may not be nefarious but simply ignorance or shoddiness on the part of the ones doing the analysis.

Remember the case following the Boston bomber where a home was raided by the police. The husband was searching for backpacks for an upcoming trip and the wife happened to search for pressure cookers on the husband's work computer, and the combination led someone at the husband's company to conclude a bomb plot was imminent. (I can't verify the work computer part but remember hearing it at one point, perhaps to hide the real reason they popped up.)

https://www.theguardian.com/world/2013/aug/01/new-york-police-terrorism-pressure-cooker

https://www.theatlantic.com/nationa...nocking-doors-because-google-searches/312599/
 
  • #10
IMO, this is awesome. Not because of genealogy hobbies, but because of the potential for medical diagnosis, research and treatment. We are way, way overdue in applying big data to medicine.

My main complaint is the exclusivity of the license to use the data. It's not a big issue now since there are a small number of players, but if more come into the mix I'd favor a nationalized database.
jedishrfu said:
There's the general worry that innocent people can be caught up and harassed by the police because the police believe they did something wrong.

There was the case of the family who lived on a farm in Middle America which, according to an internet mapping company, was the geographic middle of America. They were routinely harassed and investigated because internet IP searches would finger them as the source of some criminal internet transactions.

https://arstechnica.com/tech-policy...m-for-turning-their-life-into-a-digital-hell/

If it can happen to them then it can happen with DNA too.
Every identity tool has potential for misuse (accidental or on purpose). My suspicion is that both should go down as the quality of the ID improves.

Living in a free country where deliberate government misuse is not a significant issue, and knowing that I don't break significant laws, I'm much more positively inclined toward this because of its potential for added protection.
 
  • #11
jedishrfu said:
But it's worse than that: DNA identification is based on marker presence, meaning they don't compare the whole DNA strand but spot check it.

https://en.wikipedia.org/wiki/DNA_profiling
Consider this simple game show fail where someone has to identify the missing letters from minimal clues:



versus this one where they get it so wrong

The underlying assumption, as the quote you provided explains, is that the markers are highly variable between individuals within all populations and that they are independent of each other. If that is the case, and the science on this seems to be well established, you use the product rule for probability. All you need are enough markers so that the probability of a match (assuming a 20% chance at each marker that any individual matches someone else) is effectively zero (e.g. 0.2^15, or about one in 30 billion). But that is never enough. You need to have evidence of context that plausibly ties the accused to the crime, e.g. the accused lived in the vicinity of the crimes.

AM
 
  • #12
jedishrfu said:
The traditional use of DNA in a court case was, first, to have a pool of suspects and, second, to have some DNA from the perpetrator; the intersection of the two would identify the criminal. There was a case once where they had DNA, went to a known DNA database of criminals, and identified ten random felons as suspects, picking one who in the end had absolutely nothing to do with the case. Now, with the extensive DNA databases, that problem is only magnified, and someone will naively do it again with disastrous results.
jedishrfu said:
But it's worse than that: DNA identification is based on marker presence, meaning they don't compare the whole DNA strand but spot check it.

https://en.wikipedia.org/wiki/DNA_profiling
I don't understand what you are saying. It sounds like you are saying you can get ten "random" matches from a DNA test. What does "random" mean in this context? I don't see how that word can apply. If you mean 10 legitimate matches, has a DNA test that bad ever been used? Do you have a reference to the case you are remembering? According to your wiki link, one common test has a risk of coincidental match of 1 in 100 billion (barring twins), so it would be near impossible to have more than one match (barring identical twins).
 
  • #13
jedishrfu said:
The study concludes that around 60 percent of Americans of European descent could be matched to a third cousin or closer relation.

versus

The DNA profiles of nearly four in 10 black men in the UK are on the police's national database - compared with fewer than one in 10 white men, according to figures compiled by the Guardian.
https://www.theguardian.com/world/2006/jan/05/race.ukcrime

Of course, it is difficult to compare the present study on people in the US versus the 2006 data from the UK, but there are plenty of reasons to believe that US police DNA databases would show similar disparities (http://www.councilforresponsiblegenetics.org/pageDocuments/BBIQ0EKC20.pdf).
 
  • #14
jedishrfu said:
But it's worse than that: DNA identification is based on marker presence, meaning they don't compare the whole DNA strand but spot check it.

https://en.wikipedia.org/wiki/DNA_profiling

Individual human genomes are 99.4% similar. Of course, you will need to spot check known areas that show wide variability across the population rather than read the whole entire genome. Furthermore, STR profiling loci are specifically chosen because they are not associated with major markers of health. Allowing the government to read and store your entire genome would give them tons of information about your health and ancestry, precisely the wrong thing to do if you are worried about privacy and misuse of the information.

russ_watters said:
I don't understand what you are saying. It sounds like you are saying you can get ten "random" matches from a DNA test. What does "random" mean in this context? I don't see how that word can apply. If you mean 10 legitimate matches, has a DNA test that bad ever been used? Do you have a reference to the case you are remembering? According to your wiki link, one common test has a risk of coincidental match of 1 in 100 billion (barring twins), so it would be near impossible to have more than one match (barring identical twins).

DNA from crime scenes is often not of the greatest quality, so not all profiling loci are interpretable, which raises the probability of a false positive above the ideal 1 in 10^9. The problem is exacerbated in familial matching, where investigators are looking only for a partial match (e.g. https://geneticliteracyproject.org/...ons-using-partial-dna-matches-raise-concerns/). A lot depends on how well the results are interpreted by investigators, judges and juries. A 1 in 1 million match might seem strong and convincing to a random jury member, but what if the database they queried to find the 1 in 1 million match had millions of people?
 
  • #15
Ygggdrasil said:
A 1 in 1 million match might seem strong and convincing to a random jury member, but what if the database they queried to find the 1 in 1 million match had millions of people?
Granted, but traditional means can easily cull that list by many orders of magnitude. For example, if a crime is committed in my neighbor's house and DNA flags everyone related to me by 4 steps or less (assuming that's how it works), the police wouldn't have any trouble eliminating the other 2-dozen suspects, none except my parents living within 50 miles of the crime scene.

The bigger issue is the one that @Borg mentioned: since I visit my neighbors often, I'm sure to have left a substantial amount of physical evidence there, regardless of whether I've committed a crime there or not. This complication of police investigation, of course, has nothing whatsoever to do with DNA or any other individual type of evidence. It is a logic issue applying equally to all types.
 
  • #16
sandy stone said:
This is the standard defense of nearly any intrusive surveillance technology - "If you don't do anything wrong then you have nothing to worry about." The obvious reply is that the technology can be used for unintended and nefarious purposes, whether by a maverick technician or LEO, or by an abusive government. If human history is any indication, this is a legitimate concern.
So wouldn't it be logical to weigh the potential for benefit against the potential for harm? Do you not wear a seat belt in your car because of the potential that it could trap you in a fire or flood?
 
  • #17
My post is mainly about how police might misuse the data. The police aren't scientists and they come up with procedures that were taught to them at some time but never upgraded as the technology improves. Being a non-scientist means they may interpret percentages wrong or make assumptions about what evidence says and thus chase after the wrong person.

In Texas, we have the Todd Willingham case, where he was convicted based primarily on flawed fire forensics and on his attitude, as described by a witness, as his home went up in flames with his kids inside. In Australia, there was the case of the mother who claimed her baby was taken by dingoes while on a camping trip; flawed analysis of her behavior convicted her until, some years later, the child's clothing was found in a dingo burrow. A small flaw in one piece of evidence can magnify into something that juries will see as clear guilt.

In the early days of using fewer markers (I think it was 10), one police dept, frustrated that they couldn't find the suspect, used a DNA database of criminals and flagged ten felons, and from that they tried to tie them to the crime.

The guideline now is to have a pool of suspects based on other evidence and then if the crime scene DNA matches one of them then they become the prime suspect.

https://www.sciencenews.org/article/why-police-using-genetic-geneaology-solve-crimes-poses-problems

There are just so many ways that you in exercising your freedom can appear to be doing something wrong or illegal if a police officer is investigating you or your family because of a mistaken belief. That's all I am saying and technology can help them cover their tracks unless your defense attorney is smart enough to see the hole in their logic.

I know 99% of you reading this will scoff at it until it happens to you. The more information that's on file about you, the more at risk you are from innumerable threats.
 
  • #18
jedishrfu said:
My post is mainly about how police might misuse the data.
That at least is clear. But in order to be relevant, such a risk must be new. If a cop who used to misuse fingerprints is now misusing DNA evidence, nothing has actually changed. Not incorporating the technology doesn't prevent an increase in abuse, it only prevents an increase in successful investigations.

And even after that, it must outweigh other benefits.

And again, it doesn't appear to me that you are considering the possibility that this could decrease the potential for abuse.
The police aren't scientists and they come up with procedures that were taught to them at some time but never upgraded as the technology improves.
This, of course, is a logical self-contradiction, since use of the new technology is a new procedure.
Being a non-scientist means they may interpret percentages wrong or make assumptions about what evidence says and thus chase after the wrong person.
That is just a description of the concept of "investigation". Investigators spend almost all of their time chasing the wrong person. It's a logical necessity.
In the early days of using fewer markers (I think it was 10), one police dept, frustrated that they couldn't find the suspect, used a DNA database of criminals and flagged ten felons, and from that they tried to tie them to the crime.

The guideline now is to have a pool of suspects based on other evidence and then if the crime scene DNA matches one of them then they become the prime suspect.
If this is the case you referred to before, it doesn't directly answer my request about it, however I can say:

They've updated their procedures to improve accuracy and reduce risk of misuse/abuse. Do you agree that's a positive thing?
I know 99% of you reading this will scoff at it until it happens to you. The more information that's on file about you, the more at risk you are from innumerable threats.
I am honestly trying not to, but what you are giving here is making it difficult. However, if you're saying that it happened to you, then that changes everything. If I speculate that the risk to me is 1 in a million and the odds that it helps me are 1 in 1000, then it's a no-brainer that it is a positive thing. But if it already happened to you, then for you the odds feel like 100% and the risk calculus doesn't work anymore.
 
  • #19
This has never happened to me, although there was a local case where a student was arrested for a child-related crime. The child described the assailant in terms that fit this guy and said that he wore a certain type of pajama pants (SpongeBob, I think). He was charged and offered a plea deal, which he took on advice of counsel because he couldn't refute a child's testimony.

The backstory was that his parents had moved to a new town and left him to stay at a friend's house. The friend's mother ran a daycare center out of the house, so that placed him at the scene. He had a pair of the pants described because the friend's mother had given him a pair as a gift. Based on that, the possibility of a long sentence, and being 18, he pled guilty to get a lesser sentence.

A couple of years later, his friend was arrested on another charge and photos of this kid and others were on his phone and the police and the newly elected DA realized that the old outgoing DA had made a big mistake in this case and so the wheels turning ever so slowly got him released, the verdict overturned.

The child witness, now older, indicated that he would get these two guys confused, and it turns out the friend also had a pair of the same kind of pants which the child had described. There was some suspicion that the friend's mother knew all along and that this is why she gave him the same pants.

However, the damage is done, his reputation is stained and remains as stuff that will reappear on internet searches, potentially and tacitly limiting his career options as the doubt will persist until it fades away if ever.

https://www.kvue.com/article/news/local/williamson-county/timeline-greg-kelley-case/269-443000777

So while there was no DNA in this case something clearly went wrong with the investigation. The saving grace was the new DA was dedicated enough to reopen the case and start again once he found the error.
 
  • #20
My apologies, but I think I've deviated from the original theme of this thread, and so I think we should close it, awaiting any final responses.

Okay so now I will close the thread.

Thank you all for contributing. Thanks @russ_watters for helping to clarify my comments.
 

What is DNA and how is it related to Big Data?

DNA, or deoxyribonucleic acid, is a molecule that contains the genetic instructions for the development and function of all living organisms. Big Data refers to large and complex sets of data that can be analyzed to reveal patterns, trends, and associations. In the context of DNA, Big Data can be used to analyze and interpret vast amounts of genetic data to understand the relationship between genes, traits, and diseases.

How can DNA and Big Data be used to find a person?

By analyzing a person's DNA, researchers can identify unique genetic markers that can be used to create a genetic profile. This profile can then be compared to large databases of genetic data, such as those used by ancestry and genealogy companies, to find potential relatives. Additionally, law enforcement agencies can also use DNA and Big Data to identify suspects in criminal investigations.
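As a rough illustration of why a person can be found through relatives without ever submitting their own DNA, here is a toy model in Python. It is a sketch under loud assumptions, not the method used in the study discussed in this thread: the function name is made up, each relative is assumed to be in the database independently with the same probability, and the counts of relatives and coverage values are hypothetical.

```python
def prob_relative_in_database(num_relatives: int, coverage: float) -> float:
    """P(at least one of your relatives is in the database).

    Toy model: each of `num_relatives` detectable relatives is in the
    database independently with probability `coverage`.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return 1.0 - (1.0 - coverage) ** num_relatives


# Hypothetical inputs: 200 detectable relatives, varying database coverage.
for coverage in (0.001, 0.005, 0.02):
    p = prob_relative_in_database(num_relatives=200, coverage=coverage)
    print(f"coverage {coverage:.1%}: P(at least one relative) = {p:.2f}")
```

Even in this simplified model, a database covering a small fraction of the population reaches a large share of people, because each person has many distant relatives and only one of them needs to have uploaded a profile.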

Is there a risk to personal privacy with the use of DNA and Big Data?

Yes, there is a risk to personal privacy when it comes to the use of DNA and Big Data. The analysis of genetic data can reveal sensitive information about a person's health, ancestry, and even predisposition to certain diseases. This information can potentially be misused or shared without the individual's consent. Therefore, it is important to have strict regulations and ethical guidelines in place to protect personal privacy.

What are the benefits of using DNA and Big Data for research?

The use of DNA and Big Data in research has many potential benefits. It allows scientists to analyze large and diverse datasets, leading to a better understanding of genetic diseases, personalized medicine, and population health. It also has the potential to accelerate research and drug development, and improve disease prevention and treatment.

What are some challenges with using DNA and Big Data in research?

One of the main challenges with using DNA and Big Data in research is the sheer volume and complexity of the data. It requires advanced computational and analytical tools to process and make sense of the data. Additionally, there are ethical considerations and concerns about data privacy that need to be addressed. There is also a potential for bias and discrimination, as certain groups may be underrepresented in genetic databases. Thus, it is crucial to address these challenges to ensure the responsible and ethical use of DNA and Big Data in research.
