# Closest Matching Chemical Fingerprint - what analysis?

1. Apr 16, 2014

### geetar_king

Correlation of data sets, chemical composition

I have roughly 80 test results from different samples. Each result set is a list of concentrations of various chemical compounds and proteins obtained through GC-MS (gas chromatography-mass spectrometry).

Each data set contains over 50 of these compound concentrations.

Of these test results, 5 are from samples that are known to originate from the same source.

Looking at the variance between these 5 samples, I can see that some of the compounds show similar concentrations and others do not, likely because of degradation due to exposure to different conditions.

I am trying to determine which of the other 75 samples (not in the 5 known same-source set) most closely resembles the 5.

Can someone recommend a method to correlate the samples or otherwise determine which is the best match?

Thanks

Last edited: Apr 16, 2014
2. Apr 17, 2014

### Stephen Tashi

To have a mathematical question, you have to be precise about what it means for one sample to resemble another.

On the one hand, you might have in mind that the 5 samples come from some, say, geographic location such as a freshwater swamp, and you want to know which of the other samples also come from freshwater swamps. In that case "resemble" means "come from similar geographic conditions". So you are asking how to infer a resemblance that is not explicitly part of the data itself.

On the other hand, you might not care where a sample comes from. Perhaps you just want to treat each sample as a vector of numbers and ask which vectors are close to each other in an abstract 50-dimensional space. You could try a "fuzzy clustering" algorithm for that.
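To make the "vectors in an abstract space" idea concrete, here is a minimal sketch of fuzzy c-means, one common fuzzy clustering algorithm. It is an illustration only: the function name `fuzzy_c_means`, the fuzzifier `m=2.0`, the number of clusters, and the toy 2-dimensional points are all assumptions, not anything taken from the actual data.

```python
import math
import random


def fuzzy_c_means(points, n_clusters=2, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means. Returns (centers, memberships).

    memberships[i][k] is the degree (0..1) to which point i belongs to
    cluster k; each row sums to 1. m > 1 is the fuzzifier (assumed 2.0).
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships, each row normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(n_iter):
        # Cluster centers: means of the points weighted by membership**m.
        centers = []
        for k in range(n_clusters):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / total
                 for d in range(dim)]
            )
        # Memberships: proportional to distance**(-2 / (m - 1)).
        for i in range(n):
            dists = [max(math.dist(points[i], c), 1e-12) for c in centers]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(inv)
            u[i] = [v / total for v in inv]
    return centers, u


# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
centers, memberships = fuzzy_c_means(points, n_clusters=2)
for p, row in zip(points, memberships):
    print(p, [round(v, 3) for v in row])
```

Unlike a hard clustering, each sample gets a degree of membership in every cluster, so a sample that sits between groups shows up as such instead of being forced into one of them.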

If you are trying to do statistical inference, you need an explicit model for how random variation enters your sample data. Statistical analysis requires a probability model. The "bare facts" of data do not provide enough information in themselves. It's tempting to say "I'm going to be purely objective, I won't make any assumptions." If you do that, you won't come to any statistical conclusions either.

3. Apr 17, 2014

### geetar_king

Thanks, I will look at fuzzy clustering.

I do not really care which samples come from a particular source. I also don't really know which compounds and proteins should remain unchanged in the sample over time or after exposure to different conditions; otherwise I would exclude some of the compounds.

I'll look at fuzzy clustering. Otherwise, I'll try to determine which compounds are most similar across the 5 known samples, then look at that set of concentrations in the other 75 samples.
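The fallback plan described here (find the compounds that agree across the 5 known samples, then compare only those compounds in the other 75) could be sketched roughly as follows. Everything specific is an assumption for illustration: the coefficient of variation as the measure of "most similar", the 0.2 cutoff, the relative-difference score, the function names, and the tiny made-up data.

```python
from statistics import mean, stdev


def stable_compounds(known, max_cv=0.2):
    """Indices of compounds whose coefficient of variation (std / mean)
    across the known same-source samples is at most max_cv."""
    keep = []
    for j in range(len(known[0])):
        column = [sample[j] for sample in known]
        m = mean(column)
        if m > 0 and stdev(column) / m <= max_cv:
            keep.append(j)
    return keep


def rank_candidates(known, candidates, max_cv=0.2):
    """Sort candidates by distance to the mean known profile, using only
    the stable compounds. Returns a list of (distance, candidate_index)."""
    keep = stable_compounds(known, max_cv)
    reference = [mean(sample[j] for sample in known) for j in keep]
    scores = []
    for idx, sample in enumerate(candidates):
        # Relative differences, so compounds with large absolute
        # concentrations do not dominate the score.
        d = sum(((sample[j] - r) / r) ** 2
                for j, r in zip(keep, reference)) ** 0.5
        scores.append((d, idx))
    return sorted(scores)


# Made-up data: 3 known samples and 3 candidates, 3 compounds each.
# The middle compound varies a lot among the known samples (degraded?).
known = [[10.0, 5.0, 1.0],
         [10.5, 9.0, 1.1],
         [9.8, 2.0, 0.9]]
candidates = [[20.0, 5.0, 3.0],
              [10.2, 50.0, 1.0],
              [5.0, 5.0, 1.0]]

print(stable_compounds(known))
print(rank_candidates(known, candidates))
```

With only 5 known samples, the estimate of each compound's variability is rough, so the cutoff should be treated as a tuning knob and not as a meaningful threshold.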

4. Apr 17, 2014

### Stephen Tashi

Perhaps the idea of "Mahalanobis distance" would be useful. If you estimate the standard deviation of a given type of concentration, then it can be used to rescale the data before you use clustering. For example, for one type of measurement a difference of 5 ppm might be a "big" difference and for another type it might be a "small" difference. If you can rescale the data so that "big" and "small" have a common meaning for all types of concentration, then clustering would work better.
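A minimal sketch of the rescaling idea, under stated assumptions: each compound is divided by its standard deviation estimated over all samples, and candidates are then ranked by ordinary Euclidean distance to the centroid of the known samples. This is the diagonal (uncorrelated) special case of the Mahalanobis distance. The full version needs an invertible covariance matrix, which 5 known samples cannot supply for 50 compounds. The function names and toy numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def standardize(samples):
    """Rescale every compound to zero mean and unit standard deviation,
    with both estimated over all samples (an assumption)."""
    n = len(samples[0])
    cols = [[s[j] for s in samples] for j in range(n)]
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [
        [(s[j] - mus[j]) / sds[j] if sds[j] > 0 else 0.0 for j in range(n)]
        for s in samples
    ]


def rank_by_scaled_distance(samples, known_idx):
    """Distance of each non-known sample to the centroid of the known
    samples, in standardized units. Returns sorted (distance, index)."""
    z = standardize(samples)
    n = len(z[0])
    centroid = [mean(z[i][j] for i in known_idx) for j in range(n)]
    ranked = [
        (math.dist(row, centroid), i)
        for i, row in enumerate(z)
        if i not in known_idx
    ]
    return sorted(ranked)


# Toy data: compound 0 spans tens of units, compound 1 about one unit.
samples = [
    [100.0, 2.0],   # index 0, known
    [102.0, 2.1],   # index 1, known
    [101.0, 2.9],   # index 2
    [110.0, 2.05],  # index 3
    [60.0, 1.0],    # index 4
    [140.0, 3.0],   # index 5
]
ranked = rank_by_scaled_distance(samples, known_idx={0, 1})
print(ranked)
```

On this toy data the standardized ranking puts sample 3 first, whereas raw Euclidean distance would favor sample 2, because a difference of 9 in the first compound is small relative to its spread while a difference of 0.85 in the second compound is large relative to its spread.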