Closest Matching Chemical Fingerprint - what analysis?

geetar_king · Apr 16, 2014

Correlation of data sets, chemical composition

I have roughly 80 test results from different samples. Each result set is a list of concentrations of various chemical compounds and proteins obtained through GC-MS (gas chromatography-mass spectrometry).

Each data set contains over 50 compound concentrations.

Of these test results, 5 are from samples that are known to originate from the same source.

Looking at the variance between these 5 samples, I can see that some of the compounds show similar concentrations and others do not, likely because of degradation due to exposure to different conditions.

I am trying to determine which of the other 75 samples (not in the 5 known same-source set) most closely resembles the 5.

Can someone recommend a method to correlate or determine which has the best match?

Thanks

Stephen Tashi · Apr 17, 2014

geetar_king said:

I am trying to determine which of the other 75 samples (not in the 5 known same-source set) most closely resembles the 5.

To have a mathematical question, you have to be precise about what it means for one sample to resemble another.

On the one hand, you might have in mind that the 5 samples come from some, say, geographic location such as a freshwater swamp and you want to know which of the other samples also come from freshwater swamps. In that case "resemble" means "come from similar geographic conditions". So you are asking how to infer a resemblance that is not explicitly part of the data itself.

On the other hand, you might not care where a sample comes from. Perhaps you just want to treat each sample as a vector of numbers and ask which vectors are close to each other in an abstract 50-dimensional space. You could try a "fuzzy clustering" algorithm for that.
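To make the "vectors in an abstract space" idea concrete, here is a minimal sketch that ranks candidate samples by plain Euclidean distance to the mean of the known samples. This is a nearest-centroid ranking, not a fuzzy clustering algorithm, and the function names and the three-compound data are made up for illustration.

```python
import math

def centroid(samples):
    """Mean concentration of each compound across the given samples."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def euclidean(a, b):
    """Straight-line distance between two concentration vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_distance(known, candidates):
    """Return (name, distance) pairs sorted from closest to farthest."""
    center = centroid(known)
    scored = [(name, euclidean(vec, center)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical data: 3 compounds per sample instead of 50+.
known = [[1.0, 10.0, 5.0], [1.2, 11.0, 5.0], [0.8, 9.0, 5.0]]
candidates = {
    "A": [1.0, 10.5, 5.0],
    "B": [3.0, 2.0, 9.0],
    "C": [1.1, 12.0, 4.0],
}

ranking = rank_by_distance(known, candidates)
for name, dist in ranking:
    print(name, round(dist, 3))
```

Note that unscaled Euclidean distance lets compounds with large numeric ranges dominate the result, so it is only a starting point.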

If you are trying to do statistical inference, you need an explicit model for how random variation enters your sample data. Statistical analysis requires a probability model. The "bare facts" of data do not provide enough information in themselves. It's tempting to say "I'm going to be purely objective, I won't make any assumptions." If you do that, you won't come to any statistical conclusions either.

geetar_king · Apr 17, 2014

Thanks, I will look at fuzzy clustering.

I do not really care which samples come from a particular source. I also don't really know which compounds and proteins should remain unchanged in the sample over time or after exposure to different conditions; otherwise I would exclude some of the compounds.

I'll look at fuzzy clustering. Otherwise, I'll try to determine which compounds are most similar in the 5 known samples, then look at that set of concentrations in the other 75 samples.
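One way to sketch the "which compounds are most similar in the 5 known samples" step is to keep only the compounds whose coefficient of variation (standard deviation divided by mean) across the known samples is small. The cutoff `max_cv=0.1`, the function names, and the data below are assumptions for illustration, and with only 5 samples the estimates will be noisy.

```python
import statistics

def stable_compounds(known, max_cv=0.1):
    """Indices of compounds whose coefficient of variation (stdev / mean)
    across the known samples is at most max_cv (an arbitrary cutoff)."""
    keep = []
    for i, col in enumerate(zip(*known)):
        mean = statistics.mean(col)
        # Skip compounds with a zero mean, where the ratio is undefined.
        if mean > 0 and statistics.stdev(col) / mean <= max_cv:
            keep.append(i)
    return keep

def restrict(sample, indices):
    """Keep only the selected compounds of one sample."""
    return [sample[i] for i in indices]

# Hypothetical data: 5 known samples, 3 compounds each.
# The middle compound varies a lot, as if it had degraded unevenly.
known = [
    [10.0, 2.0, 50.0],
    [10.5, 8.0, 49.0],
    [9.5, 0.5, 51.0],
    [10.2, 5.0, 50.5],
    [9.8, 1.0, 49.5],
]

idx = stable_compounds(known)
print(idx)
print(restrict([9.9, 3.0, 50.2], idx))
```

The reduced vectors from `restrict` can then be compared between the known samples and each of the other 75.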

Stephen Tashi · Apr 17, 2014

geetar_king said:

I'll try to determine which compounds are most similar in the 5 known samples, then look at that set of concentrations in the other 75 samples.

Perhaps the idea of "Mahalanobis distance" would be useful. If you estimate the standard deviation of each type of concentration, then it can be used to rescale the data before you use clustering. For example, for one type of measurement a difference of 5 ppm might be a "big" difference and for another type it might be a "small" difference. If you can rescale the data so "big" and "small" have a common meaning for all types of concentration, then clustering would work better.
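The rescaling idea can be sketched as a standardized Euclidean distance, which is the Mahalanobis distance with the correlations between compounds ignored. A full Mahalanobis distance needs an invertible covariance matrix, and one estimated from only 5 samples in 50+ dimensions would be singular, so this simplified form is used here. Taking the standard deviations from the known samples, and the names and numbers below, are assumptions for illustration.

```python
import math
import statistics

def scaled_distance(sample, center, scales):
    """Euclidean distance after dividing each compound's difference
    by that compound's scale (assumed to be nonzero)."""
    return math.sqrt(
        sum(((x - c) / s) ** 2 for x, c, s in zip(sample, center, scales))
    )

# Hypothetical data: compound 0 naturally varies by about 10 ppm,
# compound 1 by about 0.1 ppm.
known = [[100.0, 1.0], [110.0, 1.1], [90.0, 0.9]]
center = [statistics.mean(col) for col in zip(*known)]
scales = [statistics.stdev(col) for col in zip(*known)]

off_by_1_in_tight = [100.0, 2.0]  # 1 ppm off in the tightly controlled compound
off_by_5_in_loose = [105.0, 1.0]  # 5 ppm off in the loosely varying compound

d_tight = scaled_distance(off_by_1_in_tight, center, scales)
d_loose = scaled_distance(off_by_5_in_loose, center, scales)
print(round(d_tight, 3), round(d_loose, 3))
```

In raw ppm the second sample looks farther away (5 versus 1), but after rescaling the first one is much farther, which is the "big" versus "small" distinction described above.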
