
I Principal component analysis (PCA) coefficients

  1. Nov 27, 2018 #1
    I am trying to use PCA to classify various spectra. I measured several samples to get an estimate of the population standard deviation (here I've shown only 7 measurements):


    I combined all these data into a matrix where each measurement corresponded to a column. I then used the pca(...) function in Matlab to find the component coefficients. In this case, Matlab returned 6 components (not a significant dimension reduction since I had 7 measurements).

    I plotted the first four sets of the component coefficients:


    I am not sure how to interpret these curves. The blue curve is the first order coefficients and it resembles the overall shape of the measurement curves. The red curve is the second order coefficient set and it seems to accurately model the shape of the peak at 550 nm (but I don't understand the rest of the curve). The higher order coefficient sets were much noisier.

    So, what do these curves represent exactly? Is it possible that each curve is influenced more by the presence of certain components (e.g. molecules of the substance that created the spectrum)?
  3. Nov 27, 2018 #2


    Science Advisor
    Gold Member
    2018 Award

    A major complication of PCA is that the components can be difficult (or impossible) to interpret in engineering terms. If you can interpret the most significant one, you are doing well. But it sounds like you have not really interpreted the most significant one (the blue line); you have only noticed that it mimics the general shape of the data. That is what the first principal component is supposed to do, so it should not be a surprise. That observation is of more value if the shape of the blue line does a better job of fitting a theory of that subject. Then you can look for other subject-dependent explanations of the other lines and arrive at a more detailed theory. That is all up to the subject matter expert and is not a statistical question. Perhaps some experts in spectra can give you some ideas if you supply more detail of where the data came from.
    Last edited: Dec 4, 2018
  4. Nov 27, 2018 #3

    Stephen Tashi

    Science Advisor

    I don't know what terminology Matlab uses, but I would say you plotted the first four "principal components."

    What, if anything, they mean physically is a matter of physics. As a mathematical model it means that the reflectance curves of substances in the population of substances that are tested can be "encoded" by labeling each substance with a set of coefficients. The decoding of this notation is that the coefficients c1, c2, c3, ..., c7 are understood to represent the reflectance curve f = c1 v1 + c2 v2 + ... + c7 v7, where v1, v2, ..., v7 are a given set of functions of wavelength (i.e. these functions are the principal components). Furthermore, the functions v1, v2, ..., v7 are chosen and ordered so that (over the population of substances) we have, in a manner of speaking, an efficient system of labeling with respect to losing some of the higher indexed labels. For example, if we only know that the label of a substance begins c1, c2, c3, then our system does the best possible job of approximating f as c1 v1 + c2 v2 + c3 v3, where "best possible job" considers how the system performs over the whole population of substances tested.

    You can think of PCA as a method of data compression.

    To plot "the component coefficients" of 7 substances, you can plot 7 curves. Each curve will have 7 points on it of the form (k, ck). Wavelength values won't be represented on this graph.
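    The encoding and decoding described above can be sketched numerically. What follows is a minimal NumPy illustration under stated assumptions, not the Matlab pca(...) call used in this thread: the synthetic spectra, the array names, and the helper `pca_encode` are all made up for the example. PCA implementations normally subtract the mean curve first, so in this sketch the decoded curve is the mean plus the weighted sum of components.

```python
import numpy as np

def pca_encode(spectra):
    """Return (mean, components, coeffs) for spectra of shape (k, n).

    Rows are measurements, columns are wavelengths.
    components[i] is the i-th principal component v_i (length n).
    coeffs[j, i] is the coefficient c_{j,i} of measurement j on v_i.
    """
    mean = spectra.mean(axis=0)
    centered = spectra - mean
    # Economy SVD: the rows of vt are orthonormal principal components.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    coeffs = u * s  # equivalent to centered @ vt.T
    return mean, vt, coeffs

# Hypothetical data: 7 smooth "spectra" sampled at 200 wavelengths.
rng = np.random.default_rng(0)
wavelengths = np.linspace(400.0, 900.0, 200)
bump = np.exp(-((wavelengths - 550.0) / 30.0) ** 2)
ramp = 1.0 / (1.0 + np.exp(-(wavelengths - 700.0) / 15.0))
spectra = np.array([
    (0.1 + 0.02 * rng.standard_normal()) * bump
    + (0.5 + 0.05 * rng.standard_normal()) * ramp
    + 0.002 * rng.standard_normal(wavelengths.size)
    for _ in range(7)
])

mean, components, coeffs = pca_encode(spectra)

# Decode with all components (exact) and with only the first three (approximate).
full = mean + coeffs @ components
approx3 = mean + coeffs[:, :3] @ components[:3]
err_full = np.abs(full - spectra).max()
err3 = np.abs(approx3 - spectra).max()
print(coeffs.shape, err_full, err3)
```

    Each row of `coeffs` is the "label" of one measurement; plotting a row against the index 1, 2, 3, ... gives the kind of coefficient plot described above, with no wavelength axis.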

    Whether this relates to physics probably hinges on whether the phenomena can be modeled by a superposition of functions. A poetic description of PCA (and ICA and many other statistical techniques) is that they model a result as a "combined effect" of variables. However, the terminology "combined effect" does not emphasize that PCA and similar techniques model the "combined effect" as an arithmetic sum. An arithmetic sum of variables is a very special way to model a combined effect. If there is a physical situation where "adding" things to a system involves all sorts of interactions among the things, then it will be surprising (but not impossible) if the combined effect of some property of the system is the arithmetic sum of certain properties of the things added.
  5. Nov 28, 2018 #4
    Hi Stephen Tashi,

    My data were spectrophotometer traces of leaves of a given plant species. I want to know if two different species can be distinguished from each other based on their reflectance. But most green plants have a very similar spectrum, for instance, for two different species I found:


    Do you think PCA can be useful in pinpointing key differences in a given species that will help identify it from another species?

    In this case, what are considered the "variables"? Are they the chemical constituents, or the wavelengths over which the measurements were taken? As I understand it, the original variables are supposed to be interrelated. PCA transforms them into a new set of variables, the principal components, which are then uncorrelated.

    Also, the spectrum of a leaf is mainly a linear superposition of the spectrum of chlorophyll, water, and dry matter.

    That's right. Matlab's pca function returned a 1350x7 matrix (1350 being the number of wavelengths), and I plotted the first 4 of the 7 columns. So, I believe this corresponds to the principal components.
  6. Nov 28, 2018 #5


    Science Advisor
    Gold Member
    2018 Award

    IMHO, PCA may not be the best approach. There are advances in neural networks and artificial intelligence that are probably better. In fact, there are now facial recognition and identification methods that are pretty good. That seems like the type of methodology that would apply to your problem. I do not have expertise in that area, so that is about all I can say.
  7. Nov 28, 2018 #6

    Stephen Tashi

    Science Advisor

    I don't know. To investigate this graphically, you need to look at graphs of the coefficients of the principal components and NOT graphs of the principal component functions themselves. The same functions (principal components) are used in making the description of each species, so graphing those functions doesn't tell you anything about how well the descriptions can distinguish species.

    If you graph the coefficients of the principal components of a species, the x-axis won't be 200 nm, 300 nm, ... etc. The x-axis will just be 1st, 2nd, 3rd, ... etc. It's best to think about the coefficients of a given species as a point in k-dimensional space where k is the number of coefficients we will use in classifying species. The geometric question is whether these points are easily separated from each other by some algorithm or whether they are all bunched up together.

    Your PCA treats each wavelength as a separate variable. It doesn't explicitly model any function that relates the variables.

    For example, suppose we have a 20-item rating system for people. The items are things like : sense-of-humor, religious fervor, honesty, shyness,..etc. Applying PCA to these measurements doesn't explicitly model any function that relates (for example) the rating for honesty to the rating for religious fervor. PCA does, in a manner of speaking, account for any linear relation between those two qualities that emerges just from statistics of the population.

    For example, there is nothing in the PCA that enforces the idea: Reflectance is a smooth function of wavelength, so the reflectance of a given sample at 350 nm will be close to its reflectance at 300 nm.
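    One way to see that PCA has no notion of smoothness in wavelength is to scramble the order of the wavelengths and repeat the analysis. The sketch below uses hypothetical synthetic curves and a made-up helper `pca_scores`. It checks that the singular values and the per-sample coefficients are unchanged (up to sign) by the scrambling, even though the scrambled curves are no longer smooth.

```python
import numpy as np

def pca_scores(data):
    """Singular values and per-sample scores of the mean-centered data."""
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return s, u * s

rng = np.random.default_rng(1)
# Hypothetical data: 6 samples, 50 "wavelengths", smooth along the wavelength axis.
grid = np.linspace(0.0, 1.0, 50)
data = np.array([
    a * np.sin(2 * np.pi * grid) + b * grid ** 2
    for a, b in rng.uniform(0.5, 1.5, size=(6, 2))
])
data += 0.01 * rng.standard_normal(data.shape)

# Scramble the wavelength order: each curve is no longer smooth.
perm = rng.permutation(grid.size)
shuffled = data[:, perm]

s_orig, scores_orig = pca_scores(data)
s_shuf, scores_shuf = pca_scores(shuffled)

# Compare only components with non-negligible variance; signs are arbitrary.
keep = s_orig > 1e-8 * s_orig[0]
same_spectrum = np.allclose(s_orig, s_shuf)
same_scores = np.allclose(np.abs(scores_orig[:, keep]),
                          np.abs(scores_shuf[:, keep]))
print(same_spectrum, same_scores)
```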

    Most ideas of "correlation" have to do with a probability model. I think your statement is correct if the probability model is "pick a leaf that was tested at random from the population of those tested".

    There is a different type of analysis called Independent Component Analysis (ICA), whose goal is to find a way to generate a joint distribution as the distribution of the sum of independent random variables.
  8. Nov 30, 2018 #7
    Hi Stephen Tashi,

    Could you please explain this a bit more? If I understand it correctly, what you are suggesting is to make a 1-D graph (the value of the coefficients against their order). Or are you suggesting something like "biplots"?

    Also, to make sure I understood this correctly:

    If ##p## is the number of observations and there are ##n## variables/wavelengths (##n \gg p##), in PCA each of the reflectance observations will be represented by linear functions:


    It therefore reduces the dimension from ##n\times p## to ##p\times p##. This means that up to ##p## PCs could be found. So, the way we reduce the dimensionality of the data further is to choose some ##m\ll p## for the number of PCs to keep. In other words, we discard some of the higher order PCs by assuming that most of the variation in the population is accounted for by ##m## PCs, and the higher order ones mostly model noise. Is this correct?

    Is it then possible to argue that the higher order principal components would not be very useful for distinguishing between two species?

    Noted. PCA might not be the best way to deal with this problem. But I am looking for the simplest way to analyze the spectra, and I think we should still be able to use the more classical techniques to establish if there are discernible differences between the two species.
  9. Nov 30, 2018 #8

    Stephen Tashi

    Science Advisor

    I'm not familiar with biplots or "1-D graphs", but yes, I am suggesting that you make some sort of graphical representation that shows the coefficients of each species - or at least the coefficients of the first few principal components. (perhaps "spider graphs"?).

    I don't understand what the "'" signifies in your notation. The main ideas can be understood without writing summations over the ##n## wavelengths.

    If we have measurements at each of ##n## wavelengths then each principal component can be regarded as one n-dimensional vector.

    Suppose we have ##k## individual leaves that are measured at each of those wavelengths. The result of a test of one of those leaves is also represented as one n-dimensional vector.

    Let ##w_1,w_2,...w_k## be k vectors, each of which is a result of a test.
    Let ##v_1, v_2,...v_k## be (as yet unspecified) n-dimensional vectors.

    It is obvious that we can express each ##w_j## in a trivial manner by setting:
    ##v_j = w_j## and writing
    ##w_j = \sum_{i=1}^k c_{j,i} v_i ## with ##c_{j,j} = 1## and the other ##c_{j,*}##'s = 0. This just says ##w_j## is equal to itself.
    Also, in a trivial sense, it is easy to distinguish the ##w_*##'s using their coefficients. For example the coefficients for ##w_1## are (1,0,0,...0), the coefficients for ##w_2## are ##(0,1,0,0,....)## etc.

    If we apply PCA, we get a different set of vectors (functions) ##v_1, v_2,...v_k## that are no longer identical to the ##w_*##'s. We find different coefficients for each ##w_j## such that ##w_j = \sum_{i=1}^k c_{j,i} v_i##. Distinguishing the ##w##'s by their coefficients is no longer trivial. For example, ##w_1## might have coefficients like (0.79, -0.32, 4.61,....).

    We get "dimension reduction" if we can approximate each ##w_j## by using only the first few principal components. For example, suppose 3 < k and that the approximations ##w_j \approx \sum_{i=1}^3 c_{j,i} v_i ## are each good for ##j = 1,2,..k##.

    Dimension reduction is only useful for the purposes of classification if we can find a method to classify the data given by the smaller set of coefficients. For example, for the case of 3 coefficients, the result of each test can be represented by a point in 3-dimensional space. There are various techniques for trying to classify such data. "Cluster analysis" is often useful.

    Dimension reduction, by itself, does not automatically solve the classification problem. Dimension reduction only reduces the complexity of the data that we attempt to classify.
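    As a rough sketch of the two-stage idea (reduce to a few coefficients, then try to group the resulting points), the code below projects hypothetical two-species spectra onto the first 3 principal components and runs a plain k-means on those 3-dimensional points. The synthetic spectra, the helper names, and the choice of k-means as the clustering method are assumptions for illustration; whether real leaf spectra separate this cleanly is exactly the open question.

```python
import numpy as np

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm with a deterministic farthest-point start."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(2)
wavelengths = np.linspace(400.0, 900.0, 100)
bump = np.exp(-((wavelengths - 550.0) / 30.0) ** 2)
ramp = 1.0 / (1.0 + np.exp(-(wavelengths - 700.0) / 15.0))

def simulate(bump_height, count):
    """Hypothetical spectra: the species differ only in the height of one bump."""
    heights = bump_height + 0.005 * rng.standard_normal(count)
    ramps = 0.5 + 0.005 * rng.standard_normal(count)
    noise = 0.002 * rng.standard_normal((count, wavelengths.size))
    return heights[:, None] * bump + ramps[:, None] * ramp + noise

spectra = np.vstack([simulate(0.10, 15), simulate(0.20, 15)])
truth = np.repeat([0, 1], 15)

# Coefficients c_{j,1..3} of each spectrum on the first three components.
centered = spectra - spectra.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
coeffs3 = (u * s)[:, :3]

labels, centers = kmeans(coeffs3, k=2)
# Cluster numbering is arbitrary, so accept either labeling.
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
print(agreement)
```

    The clustering step never sees the species labels; they are used only afterwards to check whether the discovered groups match the known ones.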

    I have avoided writing sums over the index of the ##n## wavelengths by talking in terms of vectors. Of course, if you want to write out the component-by-component meaning of ##w_j \approx \sum_{i=1}^3 c_{j,i} v_i##, it would be (for the ##s##-th component, ##s = 1,2,..n##)
    ##w_{(j,s)} \approx \sum_{i=1}^3 c_{j,i} v_{(i,s)}##.
  10. Dec 3, 2018 #9
    Thank you very much for the explanation.

    Is it not more computationally efficient to perform cluster analysis directly on the original data?

    Suppose that we are trying to classify based on a feature such as the Euclidean distance between the observations in the original k-dimensional space. Using the first few PCs simply provides an approximation to the original distance that we want to use for classification. So, doesn't the extra calculation involved in finding the PCs outweigh any savings we get from using ##m \ll k## instead of ##k## variables?

    Another question that I have, is whether I should be looking into "cluster analysis" or "discriminant analysis"?

    In my situation, I have a large number of leaf measurements (##w_1, ... , w_k##) for which the group membership of each observation is already known. The aim is to use this data as a kind of "training set" so that future measurements can be automatically classified.

    Also, would PCA be useful in a discriminant analysis (i.e. if we only use the first few PCs in the derivation of the discriminant rule)?

    The " ' " denotes transpose. The formula I used is from the textbook "Principal Component Analysis" by Jolliffe, but I think your formulation is clearer.
  11. Dec 3, 2018 #10

    Stephen Tashi

    Science Advisor

    Are you comparing it to using all the principal components? If you use all the principal components then, yes, you are effectively doing cluster analysis on the original data with the added burden of expressing the data in PCA. However, using only a few of the principal components need not give you the same results as a cluster analysis on the original data, which may be good if the higher order principal components are due to "noise" of some sort (errors in measurement, variability of different specimens of the same species).
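    The statement about using all the principal components can be checked directly: the change of coordinates is a rotation, so Euclidean distances between samples are preserved exactly, while keeping only a few components can only shrink them. The sketch below uses hypothetical random data and a made-up helper `pairwise`.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 8 samples measured on 40 variables with unequal spread.
data = rng.standard_normal((8, 40)) @ np.diag(np.linspace(3.0, 0.1, 40))
centered = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u * s  # coefficients of each sample on each principal component

def pairwise(x):
    """Matrix of Euclidean distances between the rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

d_original = pairwise(data)
d_all_pcs = pairwise(scores)          # every component kept
d_two_pcs = pairwise(scores[:, :2])   # only the first two kept

print(np.allclose(d_original, d_all_pcs))
print(np.all(d_two_pcs <= d_original + 1e-12))
```

    Any distance-based clustering therefore gives the same answer on the full set of coefficients as on the raw data; differences can only come from discarding components.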

    I don't know much about discriminant analysis. To me, typical neural nets are a form of non-linear discriminant analysis. They perform a non-linear mapping of the data points to other points in space and then a response node defines a plane that separates the data. By doing compositions of these non-linear mappings, you effectively define non-linear boundaries around volumes in the original data.

    My advice is to first characterize the variability of each species. For each species, find a good probability model for the response curve of a randomly selected leaf of that species. If you are doing research in botany, this gets you closer to the science of leaves. Of course, if you are doing a project for a course in data analysis, the evaluators may be more interested in seeing statistical techniques than botanical science.

    I'm not a botanist, but if I were collecting spectral response data from a fragment of a leaf, I'd measure it and then move the specimen a little to see how much the measurement changed. I'd flip the specimen over and measure it from the other side. Those types of measurements characterize the variability of response in a single leaf. Then one can investigate variability among different leaves on the same plant and plant-to-plant differences in the same species.
    Last edited: Dec 3, 2018
  12. Dec 5, 2018 #11


    Science Advisor
    Gold Member

    Can't you build your PCA around this?
  13. Dec 6, 2018 #12
    Thanks a lot for the explanation.

    Do you mean that I should collect additional measurements? At present, I have about 30 leaf measurements for each species, and that gives the standard deviation bounds shown in post #4. I know that I would need well over 100 measurements to have a statistically meaningful standard deviation estimate. But realistically I can't collect such a large data set (I wonder if there is a rule of thumb for when you could stop collecting further measurements).

    Also, would you say that the regions with complete overlap (e.g. 750–900 nm in my post #4) are useless for classifications?

    My project is in applied physics and it does relate to botany. A part of the project involves using remote sensing to delineate certain plant species from the crops. Based on my data, I don't know yet if it is feasible to positively classify a given measurement. We will be imaging the plants from above, so all of my measurements are from the upper (adaxial) surface of the leaf lamina. For all specimens, I measured the same location which was the meristem (a small young leaf in the center of the plant, or the growing point).

    Suppose we have 3 species: species A, species B, and the crops. I am looking for a way to measure the degree to which species A is more discernible from the crops than species B.

    If I retain a subset of the principal components (the first few high variance PCs) and exclude higher order ones that are likely contaminated by noise, do you think that could be plotted to give a good graphical representation of the degree of dissimilarity between two species? I mean, we could plot the first few PCs for each species separately (e.g. on spider graphs) and then compare the plots. Could this be used as a way to show how good, or otherwise, the separation between the groups is?

    What do you mean exactly? I am trying to use PCA on both data sets...
  14. Dec 7, 2018 #13

    Stephen Tashi

    Science Advisor

    Neither "standard deviation" nor "degree" of discernibility has a specific meaning until we define specific random variables and attempt a specific procedure.

    Yes I think you should attempt to do this, but I can't know in advance whether it will be a successful procedure.

    Concerning your previous comment:
    You could pursue @WWGD 's suggestion in the following manner.
    Let the spectral response curves of chlorophyll, water, and dry matter be, respectively, ##C(\lambda), W(\lambda), D(\lambda)##.
    Let ##F(\lambda)## be the spectral response curve of a given leaf.
    Assume ##F(\lambda) = x\ C(\lambda) + y\ W(\lambda) + z\ D(\lambda)## where ##x,y,z## are constants.

    Solving for ##x,y,z## requires solving an overdetermined system of ##N## linear equations where ##N## is the number of wavelengths at which a response measurement is taken. For each wavelength ##\lambda_i## we have the linear equation
    ##F(\lambda_i) = x\ C(\lambda_i)+ y\ W(\lambda_i) + z\ D(\lambda_i)##.

    An overdetermined system of linear equations can be solved "in the least squares sense". So for each leaf you can get values of ##x,y,z##. If the relative contribution of chlorophyll, water, and dry matter is a distinguishing feature of a species, you could classify the species by their ##x,y,z## values. It's a matter of biology whether different leaves have distinguishing proportions of chlorophyll, water and dry matter. You might be able to measure those proportions in a laboratory to check your estimates of ##x,y,z##.
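    A least-squares solve of this kind can be sketched with NumPy's `np.linalg.lstsq`. The three curves below are hypothetical stand-ins (simple Gaussians and a straight line), not real chlorophyll, water, or dry-matter spectra, and the leaf curve is simulated from known coefficients so that the recovered values can be compared with them.

```python
import numpy as np

wavelengths = np.linspace(400.0, 900.0, 250)

# Hypothetical stand-ins for the chlorophyll, water and dry-matter curves.
C = np.exp(-((wavelengths - 670.0) / 40.0) ** 2)
W = np.exp(-((wavelengths - 880.0) / 60.0) ** 2)
D = 1.0 - (wavelengths - 400.0) / 1000.0

# Simulated leaf curve F with known x, y, z plus a little measurement noise.
rng = np.random.default_rng(4)
true_xyz = np.array([0.6, 0.3, 0.1])
A = np.column_stack([C, W, D])  # N x 3 matrix, one row per wavelength
F = A @ true_xyz + 1e-4 * rng.standard_normal(wavelengths.size)

# Least-squares solution of the overdetermined system A [x, y, z]^T = F.
xyz, residual, rank, _ = np.linalg.lstsq(A, F, rcond=None)
print(xyz, rank)
```

    With real data the returned rank is worth checking: a rank below 3 would mean the three reference curves are not linearly independent on the measured wavelengths, and the coefficients would not be uniquely determined.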
  15. Dec 7, 2018 #14


    Science Advisor
    Gold Member
    2018 Award

    I think this is wrong. The principal components might be the best representation of the common aspects of both types. You are looking for the differences, not the common aspects. And the noise might be associated with the principal components more than the less significant components and contaminate them more. IMHO, you should start by doing a multiple linear stepwise regression of the type based on the other variables and see what the result is and how statistically significant it is as a predictor of type. The stepwise regression process should identify the combination of variables that best distinguishes between types.
    Last edited: Dec 7, 2018
  16. Dec 14, 2018 #15
    Hi Stephen Tashi,

    I managed to plot the observations for two species with respect to their first two PCs. For one species I had 25 measurements (shown in blue), and for the other species I had 10 specimen measurements (red). Assuming that my computations were correct, here is what I got:




    If I draw a convex hull around each species, the red species will be completely encompassed by the blue. Does this mean that we cannot differentiate the two species from each other?

    When I tried to plot PC3 against PC1, the results were the same. In fact, according to Matlab, PC1, PC2, and PC3, respectively account for 99.8%, 0.0829%, and 0.0255% of the variation. I am not sure how to interpret that. But the percentage of total variation is a measure of how good the 2D representation is, so I think a higher dimension plot would not be very helpful.
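    For reference, the percentage of variation attributed to each component can be computed from the singular values of the mean-centered data. The sketch below uses hypothetical curves that differ mostly by one overall scale factor, which is one situation (not necessarily the one in these measurements) where the first component takes nearly all of the variance. The helper `explained_percent` and the 99 % threshold are assumptions for the example.

```python
import numpy as np

def explained_percent(data):
    """Percentage of total variance carried by each principal component."""
    centered = data - data.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    return 100.0 * var / var.sum()

rng = np.random.default_rng(5)
# Hypothetical data: 25 curves that differ mostly by one overall scale factor.
shape = np.linspace(0.2, 1.0, 300)
scale = 1.0 + 0.2 * rng.standard_normal(25)
data = scale[:, None] * shape[None, :] + 0.002 * rng.standard_normal((25, 300))

pct = explained_percent(data)
cumulative = np.cumsum(pct)
# Smallest number of components whose cumulative share reaches 99 %.
m = int(np.searchsorted(cumulative, 99.0) + 1)
print(pct[:3], m)
```

    A large share for the first component says that a 2D plot captures most of the spread of the data; it does not by itself say whether that spread is between species or within them.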

    Also, why do I get an extreme outlier for each set (as shown in the first picture above)? My spectrophotometric traces were all fairly close to each other — for instance, for the blue species I had:


    Unfortunately, I can't measure the proportions of the biochemical constituents individually. I can only measure the total signal from intact leaves (here is my setup).

    So, basically, we can assume that ##C(\lambda), W(\lambda),D(\lambda)## are the first 3 principal components?

    In the spectral region that I am looking at, the spectrum is largely due to chlorophyll. Water and dry matter play a comparatively small role. If PC1 accounts for 99.8%, then it could likely be related to chlorophyll. But "chlorophyll" is itself composed of several different types of pigments whose concentrations vary from plant to plant. I guess ##C(\lambda)## cannot be broken down further into those individual components that make up the chlorophyll?

    Thanks a lot for the suggestion. I will definitely be looking into that. But I think we should be able to judge the degree of similarity by looking at the areas of a 2D plot covered by various subsets of observations. These kinds of diagrams could be a very simple way of delimiting the species.
  17. Dec 17, 2018 #16
    I had made a mistake. Here is what my results should look like:


    $$\begin{array}{ccc}
    & \text{PC} & \text{Percentage}\\
    \hline & 1 & 91.1\\
    \text{Species 1} & 2 & 7.9\\
    & 3 & 0.7\\
    \hline & 1 & 96.2\\
    \text{Species 2} & 2 & 2.7\\
    & 3 & 0.6
    \end{array}$$

    From these results, what can we say about the discernibility of the two species?

    Also, I didn't understand how the PCs relate to each of the three functions in the linear sums:

    ##x C(\lambda)+y W(\lambda)+z D(\lambda).##

    Any explanation would be appreciated.
  18. Dec 17, 2018 #17


    Science Advisor
    Gold Member
    2018 Award

    You can see the problem with using PCs here. PC#1 is the best single representation of BOTH species (91% and 96%), so it is not clear how to use it to distinguish between species. This is the wrong approach. You need an approach that is specifically designed to distinguish between the species. That is what a linear regression of species based on the other variables would do. I think that your desire to use PCs is misleading you here.
    You are not looking for similarities, either within species or overall. You are looking for differences so that you can distinguish between the species. That is a very different problem.
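    The distinction between "largest variance" and "best at separating species" can be illustrated with a constructed example. In the sketch below, which uses entirely hypothetical data, both species share a large common variation and differ only by a small localized feature. The first component then captures the shared variation, and the class difference shows up in a later, low-variance component. The helper `separation` (gap between class means in units of the pooled standard deviation) is an assumption chosen for simplicity.

```python
import numpy as np

rng = np.random.default_rng(6)
n_per_class, n_var = 30, 60
grid = np.linspace(0.0, 1.0, n_var)

# Hypothetical structure: a large shared variation and a small class-specific feature.
shared = np.sin(np.pi * grid)                  # varies strongly in both species
feature = np.exp(-((grid - 0.3) / 0.03) ** 2)  # present only in the second species

def simulate(offset):
    amp = 1.0 + 0.3 * rng.standard_normal(n_per_class)
    noise = 0.005 * rng.standard_normal((n_per_class, n_var))
    return amp[:, None] * shared + offset * feature + noise

data = np.vstack([simulate(0.0), simulate(0.1)])
labels = np.repeat([0, 1], n_per_class)

centered = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u * s
pc1_share = 100.0 * s[0] ** 2 / np.sum(s ** 2)

def separation(x):
    """Gap between class means in units of the pooled standard deviation."""
    a, b = x[labels == 0], x[labels == 1]
    pooled = np.sqrt(0.5 * (a.var(ddof=1) + b.var(ddof=1)))
    return abs(a.mean() - b.mean()) / pooled

sep_pc1 = separation(scores[:, 0])
sep_pc2 = separation(scores[:, 1])
print(pc1_share, sep_pc1, sep_pc2)
```

    In this construction the second component separates the classes far better than the first, even though it carries a tiny share of the variance. Real data need not behave this way, but it shows why variance explained is not a measure of discriminating power.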
    Last edited: Dec 17, 2018
  19. Dec 17, 2018 #18

    Stephen Tashi

    Science Advisor

    No, you can't assume that. Fundamental mathematical things sometimes correspond to fundamental physical things, but this doesn't always happen.

    There are various ways to represent a response function as a linear combination of other functions. For example, if you were in the field of signal processing, you would naturally express a response function as a Fourier series. You can also express functions in various ways using various sets of "orthogonal polynomials".

    The functions ##C(\lambda), W(\lambda), D(\lambda)## are probably not orthogonal functions. For example, suppose you want to express a response function ##R(\lambda)## as a sum of orthonormal functions and you want to know the coefficient for the 3rd orthonormal function ##f_3(\lambda)## in that set. You can find the coefficient by computing ##c_3 = \sum_{\lambda \in \Lambda} R(\lambda)f_3(\lambda)##.

    By contrast, the problem of expressing ##R## as a linear combination of ##C(\lambda),W(\lambda),D(\lambda)## may have more than 1 possible answer for the coefficients or no answer at all. Then you need to add more conditions to the problem (such as best least square fit) to specify a unique result. However, since ##C,W,D## have some physical interpretation, using those functions might (or might not!) provide insight in the physics.
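    The contrast between the two cases can be sketched numerically. In the example below, sampled monomials play the role of a non-orthogonal set (a hypothetical stand-in for ##C, W, D##), and a QR factorization is used to build an orthonormal set spanning the same space. For the orthonormal set each coefficient is a single inner product; for the non-orthogonal set the inner products alone are not the coefficients and a least-squares solve is used instead.

```python
import numpy as np

grid = np.linspace(-1.0, 1.0, 101)

# Non-orthogonal set: sampled monomials 1, x, x^2, x^3 as columns.
monomials = np.column_stack([grid ** p for p in range(4)])
# Orthonormal set spanning the same space, obtained by QR factorization.
basis, _ = np.linalg.qr(monomials)  # columns f_1..f_4
f3 = basis[:, 2]

# A response function that lies in the span of both sets.
R = 2.0 - 0.5 * grid + 3.0 * grid ** 2

# Orthonormal case: c_3 = sum over the grid of R * f_3.
c3 = np.sum(R * f3)
c_all = basis.T @ R
rebuilt = basis @ c_all

# Non-orthogonal case: inner products alone do not give the coefficients,
# so solve the least-squares problem instead.
naive = monomials.T @ R
fitted, *_ = np.linalg.lstsq(monomials, R, rcond=None)
print(c3, fitted)
```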

    As @FactChecker indicates, your problem is not one of "cluster detection". Instead, you already know the clusters you want - namely you want a cluster at each individual plant species. Your problem is to find a representation of the data plus a method of cluster detection that reproduces the desired clusters when applied to that representation.

    The best practical way to investigate this problem depends on your skills with computer software. For example, you might be a sophisticated programmer (or have the use of one) or you might only be comfortable with using particular software programs.
  20. Jan 6, 2019 #19
    Hi @Stephen Tashi,

    I have a follow-up question. I have found a paper which shows that the concentrations of the components ##C\left(\lambda\right),\ W\left(\lambda\right)## and ##D\left(\lambda\right)## are not statistically independent and co-vary. Does this fact mean that PCA will be unable to resolve the principal components along the lines of the actual physical components? :confused: (principal components are supposed to be uncorrelated)

    To clarify: the paper is really talking about the correlations between the concentrations (e.g. μg/cm²) of the components. I think this corresponds to the constants ##x,\ y##, and ##z## in your post #13, which are the coefficients of the PCA. It doesn't mean that the actual functions ##C,\ W, \ D## are themselves correlated (indeed the reflectances of the three substances are entirely different).
  21. Jan 6, 2019 #20

    Stephen Tashi

    Science Advisor

    I don't know what you mean by "this". It's also unclear what "correlation" means until we have specified what random variables are involved. For example, how do you define the concentration of chlorophyll or the concentration of water as random variables? One thought is that we pick a plant at random from the population of plants and measure the concentrations of the substances in the plant.

    With that interpretation, the ##x,y,z## of post 13 can be considered as random variables. These random variables are not coefficients of the principal components because ##C,W,D## are (probably) not the principal components.

    What would it mean to say two functions are "correlated" or "uncorrelated"? We can speak of functions (and vectors) as being "linearly independent". The "independence" of 3 functions is a geometric idea, not a statistical idea. To say ##C(\lambda),W(\lambda),D(\lambda)## are linearly independent means you can't express one function as a linear combination of the other two functions. "Independence" of functions is a different concept than "uncorrelated-ness" of random variables.

    For example, the functions ##f(\lambda) = \lambda^2 + 2\lambda, \ g(\lambda) = 4\lambda, \ h(\lambda) = 3\lambda^2 + \lambda## are not a linearly independent set because ##h = 3f - (5/4) g##

    A stronger property than "independence" is the property of being "orthogonal". Two 3-dimensional vectors can be orthogonal and, similarly, two functions can be orthogonal when considered as many-dimensional vectors.
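    Both ideas can be checked numerically by sampling the functions on a grid and treating them as vectors. The sketch below uses the ##f, g, h## example from this post; the grid of 11 points is an arbitrary choice. The rank of the matrix of sampled functions detects linear dependence, a least-squares solve recovers the combination, and an inner product tests orthogonality.

```python
import numpy as np

lam = np.linspace(0.0, 5.0, 11)
f = lam ** 2 + 2 * lam
g = 4 * lam
h = 3 * lam ** 2 + lam

# Three sampled functions as columns; rank 2 means one is a combination of the others.
M = np.column_stack([f, g, h])
rank = np.linalg.matrix_rank(M)

# Recover the combination h = a*f + b*g by least squares.
(a, b), *_ = np.linalg.lstsq(np.column_stack([f, g]), h, rcond=None)

# Orthogonality is a separate, stronger property: f and g are not orthogonal here.
inner_fg = float(f @ g)
print(rank, a, b, inner_fg)
```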