Closest Matching Chemical Fingerprint: what analysis?

  • Thread starter: geetar_king
  • Tags: Analysis, Chemical
AI Thread Summary
The discussion centers on analyzing 80 test results from various samples, focusing on identifying which of the 75 unknown samples most closely resembles five known samples from the same source. Participants suggest using fuzzy clustering to treat the samples as vectors in a multi-dimensional space, allowing for the identification of similarities without needing to know their geographic origins. The importance of establishing a statistical model for random variation in the data is emphasized, as it is crucial for drawing valid conclusions. Additionally, the concept of Mahalanobis distance is proposed for rescaling data, enabling better clustering by accounting for variability in different compound concentrations. Overall, the analysis aims to correlate chemical compositions effectively to find the closest matches among the samples.
geetar_king
Correlation of data sets, chemical composition

I have roughly 80 test results from different samples. Each result set is a list of concentrations of various chemical compounds and proteins obtained through GC-MS (gas chromatography-mass spectrometry).

There are over 50 of these compound concentrations for each of these data sets.

Of these test results, 5 are from samples that are known to originate from the same source.

Looking at the variance between these 5 samples, I can see that some of the compounds show similar concentrations and others do not, likely because of degradation due to exposure to different conditions.

I am trying to determine which of the other 75 samples (not in the 5 known same-source set) most closely resembles the 5.

Can someone recommend a method to correlate or determine which has the best match?

Thanks
 
geetar_king said:
I am trying to determine which of the other 75 samples (not in the 5 known same-source set) most closely resembles the 5.

To have a mathematical question, you have to be precise about what it means for one sample to resemble another.

On the one hand, you might have in mind that the 5 samples come from some particular kind of geographic location, say a freshwater swamp, and you want to know which of the other samples also come from freshwater swamps. In that case "resemble" means "come from similar geographic conditions". So you are asking how to infer a resemblance that is not explicitly part of the data itself.

On the other hand, you might not care where a sample comes from. Perhaps you just want to treat each sample as a vector of numbers and ask which vectors are close to each other in an abstract 50-dimensional space. You could try a "fuzzy clustering" algorithm for that.
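To make the "vectors in an abstract space" view concrete, here is a minimal sketch in Python. It is not fuzzy clustering itself; it is the simpler stand-in of ranking each unknown sample by its plain Euclidean distance to the mean of the known samples. The function names and the toy numbers (3 compounds instead of 50, 3 unknowns instead of 75) are made up for illustration.

```python
import math

def centroid(samples):
    """Mean concentration vector of a list of equal-length samples."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def euclidean(a, b):
    """Ordinary straight-line distance between two concentration vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_distance(known, unknown):
    """Return (name, distance) pairs sorted from closest to farthest."""
    center = centroid(known)
    scored = [(name, euclidean(vec, center)) for name, vec in unknown.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical toy data, not taken from the thread.
known = [[10.0, 2.0, 0.5], [11.0, 2.2, 0.4], [9.0, 1.8, 0.6]]
unknown = {
    "A": [10.5, 2.1, 0.5],
    "B": [30.0, 0.1, 3.0],
    "C": [12.0, 2.5, 0.9],
}

ranking = rank_by_distance(known, unknown)
for name, dist in ranking:
    print(name, round(dist, 3))
```

Note that unscaled Euclidean distance lets compounds with large numeric concentrations dominate the result, so it is only a starting point.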

If you are trying to do statistical inference, you need an explicit model for how random variation enters your sample data. Statistical analysis requires a probability model. The "bare facts" of data do not provide enough information in themselves. It's tempting to say "I'm going to be purely objective, I won't make any assumptions." If you do that, you won't come to any statistical conclusions either.
 
Thanks, I will look at fuzzy clustering.

I do not care really which samples come from a particular source. I also don't really know what compounds and proteins should remain unchanged in the sample over time or after exposure to different conditions, otherwise I would exclude some of the compounds.

I'll look at fuzzy clustering. Otherwise, I'll try to determine which compounds are most similar in the 5 known samples, then look at that set of concentrations in the other 75 samples.
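One possible way to sketch the "find the compounds that agree across the 5 known samples" step is to rank compounds by their coefficient of variation (standard deviation divided by mean) and keep the most stable ones. This is only an assumption about how "most similar" might be measured; the compound names, the `keep` parameter, and the numbers below are hypothetical. With only 5 samples the variability estimates are themselves quite noisy.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumes a non-zero mean)."""
    return statistics.stdev(values) / statistics.mean(values)

def stable_compounds(known, names, keep):
    """Names of the `keep` compounds with the smallest relative spread
    across the known same-source samples."""
    cvs = {
        name: coefficient_of_variation(col)
        for name, col in zip(names, zip(*known))
    }
    return sorted(cvs, key=cvs.get)[:keep]

# Hypothetical toy data: 5 known samples, 3 compounds each.
names = ["cpd_1", "cpd_2", "cpd_3"]
known = [
    [10.0, 5.0, 0.20],
    [10.2, 1.0, 0.21],
    [9.9, 9.0, 0.19],
    [10.1, 3.0, 0.20],
    [9.8, 7.0, 0.20],
]

selected = stable_compounds(known, names, keep=2)
print(selected)
```

The selected compound list can then be used to restrict the comparison against the other samples to those columns only.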
 
geetar_king said:
I'll try to determine which compounds are most similar in the 5 known samples, then look at that set of concentrations in the other 75 samples.

Perhaps the idea of "Mahalanobis distance" would be useful. If you estimate the standard deviation of a given type of concentration, you can use it to rescale the data before you apply clustering. For example, for one type of measurement a difference of 5 ppm might be a "big" difference, and for another type it might be a "small" difference. If you can rescale the data so that "big" and "small" have a common meaning for all types of concentration, then clustering would work better.
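A rough sketch of that rescaling idea follows. It assumes each compound is scaled by its own standard deviation estimated from the 5 known samples, which amounts to a Mahalanobis distance with a diagonal covariance matrix. A full covariance matrix estimated from only 5 samples over 50 compounds would be singular, so it could not be inverted directly. The function names and numbers are hypothetical.

```python
import math
import statistics

def column_stats(known):
    """Per-compound mean and sample standard deviation of the known samples."""
    columns = list(zip(*known))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.stdev(c) for c in columns]
    return means, stdevs

def scaled_distance(vec, means, stdevs):
    """Distance from the known-sample mean, with each compound measured in
    units of its own standard deviation (assumes no standard deviation is zero)."""
    return math.sqrt(
        sum(((x - m) / s) ** 2 for x, m, s in zip(vec, means, stdevs))
    )

# Hypothetical toy data: compound 1 varies by several ppm among the known
# samples, compound 2 by only about a tenth of a ppm.
known = [
    [100.0, 1.0],
    [110.0, 1.1],
    [90.0, 0.9],
    [105.0, 1.0],
    [95.0, 1.0],
]
unknown = {"A": [105.0, 1.5], "B": [112.0, 1.05]}

means, stdevs = column_stats(known)
scores = {name: scaled_distance(vec, means, stdevs) for name, vec in unknown.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

In this toy data, sample A is closer to the known mean in raw ppm, but after rescaling B is the better match, because A's 0.5 ppm deviation in the second compound is large relative to how little that compound varies among the known samples.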
 