Closest Matching Chemical Fingerprint - what analysis?

  • Context: Graduate 
  • Thread starter: geetar_king
  • Tags: Analysis, Chemical

Discussion Overview

The discussion revolves around methods for analyzing chemical composition data to identify which samples most closely resemble a set of known samples. The context includes exploratory reasoning regarding data correlation, statistical analysis, and clustering techniques in the field of chemical analysis.

Discussion Character

  • Exploratory
  • Technical explanation
  • Mathematical reasoning

Main Points Raised

  • One participant presents a dataset of 80 samples with concentrations of various chemical compounds and proteins, seeking methods to correlate these samples with 5 known samples from the same source.
  • Another participant suggests that the definition of "resemble" could vary based on whether geographic origin or abstract numerical similarity is considered, proposing fuzzy clustering as a potential method.
  • A participant expresses uncertainty about which compounds remain unchanged over time or under different conditions, indicating a need for further analysis of the 5 known samples before comparing them to the other 75.
  • Another participant introduces the concept of using Mahalanobis distance for rescaling data based on standard deviations of concentration types to improve clustering results.

Areas of Agreement / Disagreement

Participants do not reach a consensus on a specific method but explore various approaches, including fuzzy clustering and Mahalanobis distance, indicating multiple competing views on how to analyze the data.

Contextual Notes

Participants acknowledge limitations in their understanding of which compounds are stable over time and the need for a clear statistical model to draw conclusions from the data.

geetar_king
Correlation of data sets, chemical composition

I have roughly 80 test results from different samples. Each result set is a list of concentrations of various chemical compounds and proteins obtained through GC-MS (gas chromatography-mass spectrometry).

Each of these data sets contains over 50 compound concentrations.

Of these test results, 5 are from samples that are known to originate from the same source.

From looking at the variance between these samples, I can see that some of the compounds show similar concentrations and others do not, likely because of degradation due to exposure to different conditions.

I am trying to determine which of the other 75 samples (not in the 5 known same-source set) most closely resembles the 5.

Can someone recommend a method to correlate or determine which has the best match?

Thanks
 
geetar_king said:
I am trying to determine which of the other 75 samples (not in the 5 known same-source set) most closely resembles the 5.

To have a mathematical question, you have to be precise about what it means for one sample to resemble another.

On the one hand, you might have in mind that the 5 samples come from some, say, geographic location such as a freshwater swamp, and you want to know which of the other samples also come from freshwater swamps. In that case "resemble" means "come from similar geographic conditions". So you are asking how to infer a resemblance that is not explicitly part of the data itself.

On the other hand, you might not care where a sample comes from. Perhaps you just want to treat each sample as a vector of numbers and ask which vectors are close to each other in an abstract 50-dimensional space. You could try a "fuzzy clustering" algorithm for that.
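
As a rough illustration of the vector view, here is a minimal fuzzy c-means sketch in Python with NumPy. Fuzzy c-means is one common fuzzy clustering algorithm; the thread does not say which one was meant, so that choice, the function name `fuzzy_c_means`, the fuzzifier `m = 2`, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters=2, m=2.0, n_iter=200, tol=1e-8, seed=0):
    """Minimal fuzzy c-means (hypothetical helper, not from the thread).

    X: array of shape (n_samples, n_features).
    Returns (centers, U), where U[i, k] is the degree to which
    sample i belongs to cluster k; each row of U sums to 1.
    """
    rng = np.random.default_rng(seed)
    # Start from random memberships, normalized per sample.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Cluster centers are membership-weighted means of the samples.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every sample to every center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)  # guard against division by zero
        # Standard membership update: closer centers get larger weights.
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

# Synthetic stand-in for the real data: two groups of 20 samples,
# 5 "concentrations" each (made-up numbers, not measurements).
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=(20, 5))
group_b = rng.normal(loc=10.0, scale=1.0, size=(20, 5))
X = np.vstack([group_a, group_b])

centers, U = fuzzy_c_means(X, n_clusters=2)
labels = U.argmax(axis=1)  # hard label = cluster with highest membership
print(labels)
```

Unlike a hard clustering, each row of `U` says how strongly a sample belongs to each cluster, so a sample that sits between groups shows up with split memberships rather than being forced into one group. With real concentrations on very different scales, the columns would probably need rescaling first.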

If you are trying to do statistical inference, you need an explicit model for how random variation enters your sample data. Statistical analysis requires a probability model. The "bare facts" of data do not provide enough information in themselves. It's tempting to say "I'm going to be purely objective, I won't make any assumptions." If you do that, you won't come to any statistical conclusions either.
 
Thanks, I will look at fuzzy clustering.

I do not really care which samples come from a particular source. I also don't really know which compounds and proteins should remain unchanged in the sample over time or after exposure to different conditions; otherwise I would exclude some of the compounds.

I'll look at fuzzy clustering. Otherwise, I'll try to determine which compounds are most similar across the 5 known samples, then look at that set of concentrations in the other 75 samples.
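
One way this two-step plan might look in Python with NumPy is sketched below. It assumes "most similar" means a low coefficient of variation (standard deviation divided by mean) across the known samples, and it ranks the remaining samples by their mean relative deviation on the selected compounds only. The threshold `max_cv`, the function names, and all numbers are hypothetical.

```python
import numpy as np

def stable_compounds(known, max_cv=0.1):
    """Indices of compounds whose coefficient of variation (std / mean)
    across the known samples is at most max_cv (an arbitrary threshold)."""
    mean = known.mean(axis=0)
    std = known.std(axis=0, ddof=1)
    # Compounds with a zero mean get an infinite CV and are never selected.
    cv = np.full(mean.shape, np.inf)
    nonzero = mean != 0
    cv[nonzero] = std[nonzero] / np.abs(mean[nonzero])
    return np.flatnonzero(cv <= max_cv)

def rank_candidates(known, candidates, cols):
    """Order candidates by mean relative deviation from the known-sample
    mean, using only the selected compounds. Best match comes first."""
    ref = known[:, cols].mean(axis=0)
    rel_dev = np.abs(candidates[:, cols] - ref) / np.abs(ref)
    score = rel_dev.mean(axis=1)
    return np.argsort(score), score

# Made-up data: 5 known samples x 4 compounds.
# Compounds 0 and 2 are consistent; compounds 1 and 3 vary widely.
known = np.array([
    [10.0, 5.0, 2.0, 50.0],
    [10.2, 1.0, 2.1, 20.0],
    [9.8,  9.0, 1.9, 80.0],
    [10.1, 3.0, 2.0, 35.0],
    [9.9,  7.0, 2.0, 65.0],
])
# Made-up candidates standing in for the other 75 samples.
candidates = np.array([
    [10.0, 8.0, 2.0, 90.0],  # matches on the consistent compounds
    [15.0, 5.0, 2.0, 50.0],  # off on compound 0
    [10.5, 5.0, 3.0, 50.0],  # off on compound 2
])

cols = stable_compounds(known)
order, score = rank_candidates(known, candidates, cols)
print("consistent compounds:", cols)
print("candidates, best match first:", order)
```

With only 5 known samples the estimated spread of each compound is itself quite noisy, so the selected set should be treated as a starting point rather than a firm result.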
 
geetar_king said:
I'll try to determine which compounds are most similar in the 5 known samples, then look at that set of concentrations in the other 75 samples.

Perhaps the idea of "Mahalanobis distance" would be useful. If you estimate the standard deviation of a given type of concentration, then it can be used to rescale the data before you use clustering. For example, for one type of measurement a difference of 5 ppm might be a "big" difference and for another type it might be a "small" difference. If you can rescale the data so that "big" and "small" have a common meaning for all types of concentration, then clustering would work better.
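
A minimal sketch of the rescaling idea in Python with NumPy, assuming the standard deviation of each concentration is estimated across all samples and that distance is measured to the mean of the known samples. Strictly, dividing by per-compound standard deviations gives the diagonal-covariance special case of Mahalanobis distance (a standardized Euclidean distance). The full version uses the inverse covariance matrix, which cannot be estimated from the 5 known samples alone when there are 50+ compounds, because that sample covariance matrix is singular. The function name and all numbers are made up for illustration.

```python
import numpy as np

def standardized_distances(known, others):
    """Distance from each row of `others` to the mean of `known`,
    after dividing every compound by its standard deviation
    (estimated here across all samples, which is an assumption)."""
    all_samples = np.vstack([known, others])
    scale = all_samples.std(axis=0, ddof=1)
    scale[scale == 0] = 1.0  # leave constant columns unscaled
    centroid = known.mean(axis=0)
    z = (others - centroid) / scale
    return np.sqrt((z ** 2).sum(axis=1))

# Made-up data with two compounds on very different scales.
known = np.array([
    [100.0, 1.0],
    [110.0, 1.1],
    [90.0,  0.9],
])
others = np.array([
    [130.0, 1.00],  # A: 30 units off in the widely varying compound
    [101.0, 2.00],  # B: 1 unit off in the tightly varying compound
    [60.0,  1.05],  # C
    [140.0, 0.95],  # D
    [80.0,  1.00],  # E
])

# Unscaled Euclidean distance, for comparison.
raw = np.linalg.norm(others - known.mean(axis=0), axis=1)
dist = standardized_distances(known, others)

print("unscaled:", np.round(raw, 2))
print("rescaled:", np.round(dist, 2))
```

In this toy data the unscaled distance puts B closest to the known samples, because its difference of 1 unit looks tiny next to differences of tens of units in the first compound. After rescaling, B becomes the farthest sample, since a 1-unit change is large for the second compound.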
 
