Regression Analysis: Does it Make Sense?


Discussion Overview

The discussion revolves around the appropriateness of conducting a multiple regression analysis to estimate protein content in milk based on moisture and fat percentages. Participants explore the implications of using a regression model in a context where the response and explanatory variables are interdependent and should sum to a constant total.

Discussion Character

  • Debate/contested
  • Technical explanation
  • Mathematical reasoning

Main Points Raised

  • One participant questions the validity of the regression model, suggesting that since the response variable (protein) and explanatory variables (moisture and fat) should sum to a constant, a linear regression may not be appropriate.
  • Another participant proposes that the regression could yield a result indicating that β = 100, γ = δ = -1, but expresses concern about the accuracy due to the small differences involved in measuring protein content.
  • Clarifications are sought regarding the definitions and units of moisture and fat, with some participants emphasizing the importance of understanding these variables in the context of the regression.
  • Concerns are raised about the potential for systematic errors in measurements affecting the regression results, particularly given the small percentage of protein in milk.
  • One participant suggests that a better approach might be to assess the accuracy of the measurements and create a confidence interval for the protein content instead of relying solely on regression analysis.
  • A hypothesis test is proposed by a participant, questioning whether it might be a more appropriate method to analyze the relationship between the variables.
  • Discussion includes the role of error terms in regression analysis, with some participants noting that the error term is often misunderstood and suggesting that it should be included in the model.
  • Another participant argues that the regression model should simply express protein as a function of the total minus moisture and fat, indicating that a linear regression may not be necessary.

Areas of Agreement / Disagreement

Participants express differing views on the appropriateness of using regression analysis in this context. While some raise valid concerns about the model's assumptions and potential inaccuracies, others suggest that it could still provide useful insights despite these issues. The discussion remains unresolved regarding the best approach to estimate protein content.

Contextual Notes

Participants note that the relationship between moisture, fat, and protein is deterministic in the context of total composition, which complicates the use of regression. There are also concerns about the accuracy of measurements and the implications of using statistical models without a thorough understanding of their limitations.

autobot.d
Does this even make sense?

I have been told to do a multiple regression analysis. The response variable and the explanatory variables should add up to ~100 percent of the total product. Example:

Milk = water + fat + protein ~= 100% (all are in terms of percentages)

The regression I was asked to do is

protein = β + moisture*γ + fat*δ

The aim is to calculate protein without having to measure it (it has been measured on some test samples).

This to me makes absolutely no sense...I just can't put my finger on exactly why this is wrong (or if it even is)
 
If I've understood correctly, the regression result is obvious: β = 100, γ = δ = -1.
Now, one could in principle deduce the protein content this way, but I would expect the result to be highly inaccurate since it will involve a small difference between large numbers.
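
A minimal sketch of the first point, assuming the three percentages close exactly to 100 and are measured without error. The data are synthetic and the ranges are placeholders rather than real milk figures; `numpy.linalg.lstsq` does the least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-free compositions (ranges are illustrative assumptions).
water = rng.uniform(86.0, 89.0, size=50)
fat = rng.uniform(3.0, 5.0, size=50)
protein = 100.0 - water - fat  # exact closure, as assumed in the post

# Design matrix for protein = beta + gamma*water + delta*fat
X = np.column_stack([np.ones_like(water), water, fat])
coef, *_ = np.linalg.lstsq(X, protein, rcond=None)
beta, gamma, delta = coef
print(beta, gamma, delta)  # approximately 100, -1, -1
```

With exact data the fit has nothing to estimate; anything other than (100, -1, -1) in a real fit would come from measurement error or from components not included in the sum.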
 
Hey autobot.d and welcome to the forums.

The thing though is that you should clarify what the variables moisture and fat are, and what the units are. I'm guessing that they are percentages, but the reason I am asking has to do not only with the units, but also with their relationship to the response variable protein.

In the water, fat, protein combination there is no room for error: if you supply two you will always get the third in an absolute deterministic, zero degrees of freedom manner.

But usually in a regression, you are trying to fit a model that minimizes the residual (error) over the whole data set being considered, and if there are errors then it means that you are not doing the same thing that you are doing with water/fat/protein because 2 of the 3 will give complete information about the system, whereas a typical regression will not.

With respect to haruspex's post, it is wrong in general and here is why:

The first reason is that if you are dealing with incomplete information or being forced to estimate, then there will be an error component in the model.

Because of this incomplete information, things won't be conserved up to 100%.

The other thing, which is more obvious, is that the protein/fat/moisture model may not be conserved. It is conserved in the water/protein/fat model but that does not imply it's conserved in the other model. This alone signifies why it is erroneous, but it is not the only reason. Remember that moisture has not been explicitly related to the water/fat/protein model: if it was, you wouldn't need to do a regression at all.

Because of these issues, you will be trying to fit a model to predict protein from fat and moisture (usually under uncertainty), and the regression algorithm will figure out the β, γ, δ coefficients that fit best.

The idea of regression is to fit a model so that given predictor variables, you can get a response variable. However you need to be really cautious about using these models because it's not as simple as just calculating the model and using it without thinking: if your professor/teacher hasn't mentioned this then it is irresponsible on his part.

Haruspex, have you ever taken a statistics course?
 
chiro said:
In the water, fat, protein combination there is no room for error: if you supply two you will always get the third in an absolute deterministic, zero degrees of freedom manner.
chiro, that was exactly my point. If I/we have understood the statement of the problem correctly then the 'right' answer will be the one I stated. Any difference between that and the result of the regression analysis will represent the errors in the measurements.
I did not mean to suggest that an actual regression analysis based on data samples would have exactly that result.
Perhaps the point of the exercise is to find systematic errors in the measurements so that they can be corrected. It still seems rather fraught since protein is only 3-3.5% of whole milk. It doesn't take much percentage error in the water content to make inroads into that.
One thing I may have misinterpreted is water v. moisture. I've taken these to refer to the same thing.
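
To put a rough number on the "small difference between large numbers" concern, here is a short worked calculation. The 87% water share is an assumed round figure, and the protein value is the midpoint of the 3-3.5% range quoted above.

```python
# Illustrative figures: the water share is an assumption; protein is the
# midpoint of the 3-3.5% range quoted in the post.
water_pct = 87.0
protein_pct = 3.25

rel_error_water = 0.01  # a 1% relative error in the water reading
abs_error = water_pct * rel_error_water     # error in percentage points
share_of_protein = abs_error / protein_pct  # as a fraction of the protein value

print(f"{abs_error:.2f} points = {share_of_protein:.0%} of the protein content")
# prints "0.87 points = 27% of the protein content"
```

Under these assumptions, a 1% relative error in water alone shifts the protein estimate by roughly a quarter of its value.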
 
haruspex said:
chiro, that was exactly my point. If I/we have understood the statement of the problem correctly then the 'right' answer will be the one I stated. Any difference between that and the result of the regression analysis will represent the errors in the measurements.
I did not mean to suggest that an actual regression analysis based on data samples would have exactly that result.
Perhaps the point of the exercise is to find systematic errors in the measurements so that they can be corrected. It still seems rather fraught since protein is only 3-3.5% of whole milk. It doesn't take much percentage error in the water content to make inroads into that.
One thing I may have misinterpreted is water v. moisture. I've taken these to refer to the same thing.

Thanks for clearing that up.

The thing is that a lot of people who haven't studied statistics, or at least practiced it in some depth, tend to apply these techniques without understanding them.

It's not good when you get another science or engineering student that applies a regression without knowing issues surrounding its use.

Also the main issue I had with your post is the water vs moisture: They might be the same thing but we have to be careful. Moisture may be related to, or be derived from water but it may not be equivalent. If this guy is going to base a paper or a project on this kind of advice, it's better to be a little anal to make sure. I personally would hate being a scientist that has to quadruple check things consistently, but it would be a waste for them to go to all that effort and then misuse a bit of statistics to screw up the whole thing.

The other thing is that the post implies he is trying to predict protein given other predictors, and this kind of thing is a timebomb waiting to happen for scientists who don't understand in depth when models are OK and when they are misleading.

It's not saying they are somehow stupid, just that statistics is not the focus for them: the focus is science and that's understandable.
 
chiro said:
If this guy is going to base a paper or a project on this kind of advice, it's better to be a little anal to make sure.

Being anal might help in the peer-reviewing process but I'd say it is a bit backward :biggrin:
 
Sorry about that, water = moisture, typo. They are all percentages, and I had the same problem: it does not make sense to make a linear equation out of something that should all add up to 100 percent. The problem is that protein is not measured but the other things are, so I am thinking a better way to find the protein content is to see how accurate the measurements are and then make a confidence interval for what the protein could be.

I really appreciate the responses and sorry for not being clear in my original post. Thanks!
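
The confidence-interval idea could be sketched as follows, assuming the total is known, the two measurement errors are independent and roughly normal, and the instrument standard deviations are available. The function name `protein_interval` and all the numbers are hypothetical.

```python
import math

def protein_interval(water, fat, sd_water, sd_fat, total=100.0, z=1.96):
    """Approximate 95% interval for protein = total - water - fat,
    assuming independent, normally distributed measurement errors."""
    estimate = total - water - fat
    sd = math.sqrt(sd_water**2 + sd_fat**2)  # independent errors add in variance
    return estimate - z * sd, estimate + z * sd

# Hypothetical readings and instrument standard deviations (percentage points).
low, high = protein_interval(water=92.5, fat=4.0, sd_water=0.3, sd_fat=0.1)
print(f"protein between {low:.2f} and {high:.2f}")
```

The interval width depends only on the measurement errors, so with protein at a few percent, even modest instrument noise produces a wide interval relative to the estimate.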
 
Does this make sense? Since the equation is

Protein + Water + Fat ~= 96% of total milk weight (the total is the average from data)

Hypothesis test:

H0 := (Water + Fat) - Protein = 45.1 (average difference obtained from data)
H1 := (Water + Fat) - Protein != 45.1

Does this seem like a more appropriate test? It does to me so any criticism would be most appreciated. Thanks!
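
One thing worth checking before running such a test: if the null value is the average of the same data the test is applied to, a one-sample t statistic is zero by construction. A small sketch with made-up differences (the values in `diffs` are hypothetical, not taken from the thread's data):

```python
import math
import statistics

# Hypothetical values of (water + fat) - protein for a handful of samples.
diffs = [44.6, 45.3, 45.9, 44.8, 45.2, 44.8]
mu0 = statistics.mean(diffs)  # null value taken from the same data

n = len(diffs)
t_stat = (statistics.mean(diffs) - mu0) / (statistics.stdev(diffs) / math.sqrt(n))
print(t_stat)  # 0.0: the test cannot reject, whatever the data look like
```

The test only becomes informative if the null value comes from somewhere else, for example a specification or an independent calibration set.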
 
  • #10
Well in this case the error term would come from the measuring device which in our case can safely be assumed to be ε~NID(0,σ)

so the equation would look like

protein = β + water*γ + fat*δ + ε

Again, to me it still seems like a regression on this might be inappropriate since

protein + water + fat + ε ~= 96%

b/c then the equation should just be

protein = 96% - water - fat + ε
just constants and an error term without a linear term. Thanks for the help. I think one little push and I will understand this. Thanks.
(Also, I am a mathematician, just more on the pure side, now getting a master's in applied math, if that helps...)
 
  • #11
autobot.d said:
Well in this case the error term would come from the measuring device which in our case can safely be assumed to be ε~NID(0,σ)

so the equation would look like

protein = β + water*γ + fat*δ + ε [1]

Again, to me it still seems like a regression on this might be inappropriate since

protein + water + fat + ε ~= 96%

b/c then the equation should just be

protein = 96% - water - fat + ε [2]
just constants and an error term without a linear term.
Perhaps it's a question of systematic errors versus random ones. E.g. the analysis might reveal that when the water content is high the fat content tends to be overestimated. If so, [1] would yield a better estimate of protein than would [2].
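
A rough simulation of this scenario, assuming a fat reading that drifts upward with water content. Every number (the 96 total, the ranges, the 0.2 drift, the noise levels) is an illustrative assumption, and the reference protein values are treated as exact for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Synthetic "true" compositions; all values are illustrative assumptions.
water_true = rng.uniform(86.0, 89.0, size=n)
fat_true = rng.uniform(3.0, 5.0, size=n)
protein = 96.0 - water_true - fat_true  # reference protein, treated as exact

# Measured values: random noise plus a fat reading that drifts up with water.
water = water_true + rng.normal(0.0, 0.05, size=n)
fat = fat_true + 0.2 * (water_true - 87.5) + rng.normal(0.0, 0.05, size=n)

train, test = slice(0, 200), slice(200, n)

# Model [1]: estimate beta, gamma, delta on the calibration half.
X = np.column_stack([np.ones(n), water, fat])
coef, *_ = np.linalg.lstsq(X[train], protein[train], rcond=None)
pred_fitted = X[test] @ coef

# Model [2]: fixed coefficients 96, -1, -1.
pred_fixed = 96.0 - water[test] - fat[test]

def rmse(pred):
    """Root-mean-square error against the held-out reference protein."""
    return float(np.sqrt(np.mean((pred - protein[test]) ** 2)))

rmse_fitted, rmse_fixed = rmse(pred_fitted), rmse(pred_fixed)
print(rmse_fitted, rmse_fixed)
```

In this setup the fitted coefficients absorb the drift, so [1] predicts the held-out samples better than [2]. With purely random, unbiased errors the advantage of [1] would largely disappear.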
 
  • #12
Thank you so much haruspex for sticking with me. I think you are on to something, and I will spend tonight studying the difference and maybe have a better-formed question tomorrow. Thanks again everybody!
 
