Regression Analysis: Does it Make Sense?

In summary, the conversation is discussing the use of regression analysis to calculate protein content in milk without directly measuring it. The approach suggested involves using the relationship between water, fat, and protein in milk, but the regression model may not accurately account for errors and other factors. There is also a discussion about the difference between water and moisture and the importance of understanding statistical techniques before applying them in a project.
  • #1
autobot.d
Does this even make sense?

I have been told to do a multiple regression analysis. The response variable and the explanatory variables should add up to ~100 percent of the total product. Example:

Milk = water + fat + protein ~= 100% (all are in terms of percentages)

The regression I was asked to do is

protein = β + moisture*γ + fat*δ

So we can calculate protein without having to measure it (done on some test samples)

This to me makes absolutely no sense...I just can't put my finger on exactly why this is wrong (or if it even is)
 
  • #2
If I've understood correctly, the regression result is obvious: β = 100, γ = δ = -1.
Now, one could in principle deduce the protein content this way, but I would expect the result to be highly inaccurate since it will involve a small difference between large numbers.
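
To make the "small difference between large numbers" concern concrete, here is a minimal sketch of the error propagation. It assumes independent measurement errors and uses made-up values; the function name `protein_by_difference` and every number are illustrative assumptions, not data from this thread.

```python
import math

def protein_by_difference(total, water, fat, sd_water, sd_fat):
    """Estimate protein as total - water - fat, with a propagated standard deviation."""
    estimate = total - water - fat
    # For independent errors the variances add, even though the values subtract.
    sd = math.sqrt(sd_water**2 + sd_fat**2)
    return estimate, sd

# Hypothetical measurements in percent of total weight (assumed, not real data).
est, sd = protein_by_difference(total=100.0, water=92.5, fat=4.0,
                                sd_water=0.5, sd_fat=0.1)

print(f"protein = {est:.2f} +/- {sd:.2f} percentage points")
print(f"relative error: water {0.5 / 92.5:.1%}, protein {sd / est:.1%}")
```

Under these assumed values, a water reading that is good to well under one percent of its own size still leaves the protein estimate uncertain by a much larger fraction of its size.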
 
  • #3
Hey autobot.d and welcome to the forums.

The thing though is that you should clarify what the variables moisture and fat are, and what the units are. I'm guessing that they are percentages, but the reason I am asking this has to do not only with the units, but also with the relationship with the response variable protein.

In the water, fat, protein combination there is no room for error: if you supply two you will always get the third in an absolute deterministic, zero degrees of freedom manner.

But usually in a regression, you are trying to fit a model that minimizes the residual (error) over the whole data set being considered, and if there are errors then it means that you are not doing the same thing that you are doing with water/fat/protein because 2 of the 3 will give complete information about the system, whereas a typical regression will not.

With respect to haruspex's post, it is wrong in general and here is why:

The first reason is that if you are dealing with incomplete information or being forced to estimate, then there will be an error component in the model.

Because of this incomplete information, things won't be conserved up to 100%.

The other thing, which is more obvious, is that the protein/fat/moisture model may not be conserved. It is conserved in the water/protein/fat model but that does not imply it's conserved in the other model. This alone signifies why it is erroneous, but it is not the only reason. Remember that moisture has not been made explicit to relate to the water/fat/protein model: if it was, you wouldn't need to do a regression at all.

Because of these issues, you will be trying to fit a model to predict protein from fat and moisture (usually under uncertainty), and the regression algorithm will figure out the β, γ, δ coefficients that are the best ones.
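
As a rough sketch of what "figuring out the coefficients" looks like, the snippet below fits protein = β + water*γ + fat*δ by ordinary least squares with NumPy. The data are synthetic and every number is an assumption made for illustration; `fit_protein_model` is a hypothetical helper name, and moisture is treated as water here.

```python
import numpy as np

def fit_protein_model(protein, water, fat):
    """Least-squares fit of protein = beta + gamma*water + delta*fat."""
    X = np.column_stack([np.ones_like(water), water, fat])
    coef, *_ = np.linalg.lstsq(X, protein, rcond=None)
    return coef  # array holding (beta, gamma, delta)

rng = np.random.default_rng(1)
n = 200

# Synthetic compositions that close exactly to 100% (an assumption).
water = rng.uniform(86.0, 90.0, n)
fat = rng.uniform(3.0, 5.0, n)
protein = 100.0 - water - fat

# Case 1: predictors measured without error.
coef_exact = fit_protein_model(protein, water, fat)

# Case 2: the water reading carries random measurement error.
water_noisy = water + rng.normal(0.0, 0.5, n)
coef_noisy = fit_protein_model(protein, water_noisy, fat)

print("exact predictors:", np.round(coef_exact, 3))
print("noisy water     :", np.round(coef_noisy, 3))
```

With error-free predictors the fit returns the deterministic identity. Once the water reading is noisy, the fitted water coefficient tends to shrink toward zero (the usual attenuation effect of measurement error in a predictor), so the estimated coefficients no longer match the identity even though the underlying composition still closes.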

The idea of regression is to fit a model so that given predictor variables, you can get a response variable. However you need to be really cautious about using these models because it's not as simple as just calculating the model and using it without thinking: if your professor/teacher hasn't mentioned this then it is irresponsible on his part.

Haruspex, have you ever taken a statistics course?
 
  • #4
chiro said:
In the water, fat, protein combination there is no room for error: if you supply two you will always get the third in an absolute deterministic, zero degrees of freedom manner.
chiro, that was exactly my point. If I/we have understood the statement of the problem correctly then the 'right' answer will be the one I stated. Any difference between that and the result of the regression analysis will represent the errors in the measurements.
I did not mean to suggest that an actual regression analysis based on data samples would have exactly that result.
Perhaps the point of the exercise is to find systematic errors in the measurements so that they can be corrected. It still seems rather fraught since protein is only 3-3.5% of whole milk. It doesn't take much percentage error in the water content to make inroads into that.
One thing I may have misinterpreted is water v. moisture. I've taken these to refer to the same thing.
 
  • #5
haruspex said:
chiro, that was exactly my point. If I/we have understood the statement of the problem correctly then the 'right' answer will be the one I stated. Any difference between that and the result of the regression analysis will represent the errors in the measurements.
I did not mean to suggest that an actual regression analysis based on data samples would have exactly that result.
Perhaps the point of the exercise is to find systematic errors in the measurements so that they can be corrected. It still seems rather fraught since protein is only 3-3.5% of whole milk. It doesn't take much percentage error in the water content to make inroads into that.
One thing I may have misinterpreted is water v. moisture. I've taken these to refer to the same thing.

Thanks for clearing that up.

The thing is that a lot of people who haven't studied statistics, or at least practiced it in some depth, tend to apply these techniques without understanding them.

It's not good when you get another science or engineering student that applies a regression without knowing issues surrounding its use.

Also the main issue I had with your post is the water vs moisture: They might be the same thing but we have to be careful. Moisture may be related to, or be derived from water but it may not be equivalent. If this guy is going to base a paper or a project on this kind of advice, it's better to be a little anal to make sure. I personally would hate being a scientist that has to quadruple check things consistently, but it would be a waste for them to go to all that effort and then misuse a bit of statistics to screw up the whole thing.

The other thing is that the post implies he is trying to predict protein given other predictors, and this kind of thing is a timebomb waiting to happen for scientists who don't know, in depth, when models are OK and when they are misleading.

It's not saying they are somehow stupid, just that statistics is not the focus for them: the focus is science and that's understandable.
 
  • #6
chiro said:
If this guy is going to base a paper or a project on this kind of advice, it's better to be a little anal to make sure.

Being anal might help in the peer-reviewing process but I'd say it is a bit backward :biggrin:
 
  • #7
Sorry about that, water = moisture; that was a typo. They are all percentages, and I had the same problem: it does not make sense to make a linear equation out of something that should all add up to 100 percent. The problem is that protein is not measured but the other things are, so I am thinking a better way to find the protein content is to see how accurate the measurements are and then make a confidence interval for what the protein could be.

I really appreciate the responses and sorry for not being clear in my original post. Thanks!
 
  • #8
Does this make sense? Since the equation is

Protein + Water + Fat ~= 96% of total milk weight (the total is the average from data)

Hypothesis test:

H0 := (Water + Fat) - Protein = 45.1 (average difference obtained from data)
H1 := (Water + Fat) - Protein != 45.1

Does this seem like a more appropriate test? It does to me so any criticism would be most appreciated. Thanks!
 
  • #9
  • #10
Well in this case the error term would come from the measuring device which in our case can safely be assumed to be ε~NID(0,σ)

so the equation would look like

protein = β + water*γ + fat*δ + ε

Again, to me it still seems like a regression on this might be inappropriate since

protein + water + fat + ε ~= 96%

b/c then the equation should just be

protein = 96% - water - fat + ε
just constants and an error term without a linear term. Thanks for the help. I think one little push and I will understand this. Thanks.
(Also, I am a mathematician just more pure mathematics now getting masters in applied math if that helps...)
 
  • #11
autobot.d said:
Well in this case the error term would come from the measuring device which in our case can safely be assumed to be ε~NID(0,σ)

so the equation would look like

protein = β + water*γ + fat*δ + ε [1]

Again, to me it still seems like a regression on this might be inappropriate since

protein + water + fat + ε ~= 96%

b/c then the equation should just be

protein = 96% - water - fat + ε [2]
just constants and an error term without a linear term.
Perhaps it's a question of systematic errors versus random ones. E.g. the analysis might reveal that when the water content is high the fat content tends to be overestimated. If so, [1] would yield a better estimate of protein than would [2].
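
A small simulation can illustrate this point. The sketch below invents a systematic error in which the fat reading drifts upward as the water content rises, then compares equation [1] (coefficients fitted on samples where protein was measured) with equation [2] (plain subtraction) on fresh samples. The bias strength `k`, the noise levels, and the composition ranges are all assumptions chosen for illustration.

```python
import numpy as np

TOTAL = 96.0       # assumed closure: protein + water + fat, in percent
WATER_REF = 88.7   # assumed typical water content, in percent

def simulate(n, rng, k=0.5, sd_water=0.1, sd_fat=0.05):
    """Return true protein plus measured water and fat (all synthetic)."""
    protein = rng.normal(3.3, 0.3, n)
    fat = rng.normal(4.0, 0.5, n)
    water = TOTAL - protein - fat
    water_m = water + rng.normal(0.0, sd_water, n)
    # Systematic error: fat is over-read when water is high.
    fat_m = fat + k * (water - WATER_REF) + rng.normal(0.0, sd_fat, n)
    return protein, water_m, fat_m

def rmse(truth, pred):
    return float(np.sqrt(np.mean((truth - pred) ** 2)))

rng = np.random.default_rng(0)
p_train, w_train, f_train = simulate(500, rng)
p_test, w_test, f_test = simulate(500, rng)

# Equation [1]: estimate beta, gamma, delta from the calibration samples.
X_train = np.column_stack([np.ones_like(w_train), w_train, f_train])
(beta, gamma, delta), *_ = np.linalg.lstsq(X_train, p_train, rcond=None)
pred_regression = beta + gamma * w_test + delta * f_test

# Equation [2]: fixed coefficients, plain subtraction.
pred_difference = TOTAL - w_test - f_test

rmse_regression = rmse(p_test, pred_regression)
rmse_difference = rmse(p_test, pred_difference)
print(f"RMSE [1] regression : {rmse_regression:.3f}")
print(f"RMSE [2] subtraction: {rmse_difference:.3f}")
```

In this setup the fitted coefficients absorb the water-dependent bias in the fat reading, which the fixed coefficients of [2] cannot do. If the measurement errors were purely random and unbiased, the advantage of [1] would largely disappear.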
 
  • #12
Thank you so much haruspex for sticking with me. I think you are on to something and I will spend tonight studying the difference and maybe have a more well formed question tomorrow. Thanks again everybody!
 

1. What is regression analysis?

Regression analysis is a statistical method used to analyze the relationship between a dependent variable and one or more independent variables. It is used to predict the values of the dependent variable based on the values of the independent variables.

2. How is regression analysis used?

Regression analysis is used in many fields, including economics, psychology, and biology. It is commonly used to make predictions and identify patterns in data. It can also be used to test hypotheses about the relationship between variables.

3. What are the assumptions of regression analysis?

There are several assumptions that must be met for regression analysis to be valid. These include linearity, independence, normality, and homoscedasticity. Linearity assumes that the relationship between the variables is linear, while independence assumes that the observations are not influenced by each other. Normality assumes that the errors (residuals) follow a normal distribution, and homoscedasticity assumes that the variance of the errors is constant across all values of the independent variables.

4. How do you interpret the results of regression analysis?

The results of regression analysis typically include a regression equation, which can be used to predict the value of the dependent variable based on the values of the independent variables. The coefficient of determination, or R-squared, is also commonly reported and represents the percentage of variation in the dependent variable that can be explained by the independent variables. Additionally, the p-values of the coefficients can be used to determine the significance of the relationship between the variables.

5. What are the limitations of regression analysis?

Regression analysis has several limitations that should be considered when interpreting the results. These include the assumptions that must be met for the analysis to be valid, as well as potential issues with multicollinearity (when independent variables are highly correlated with each other) and outliers. Additionally, regression analysis on its own identifies associations between variables; it does not establish causation.
