# Regression analysis?

1. Jun 28, 2012

### autobot.d

Does this even make sense?

I've been told to do a multiple regression analysis. The response variable and the explanatory variables should add up to ~100 percent of the total product. Example:

Milk = water + fat + protein ~= 100% (all are in terms of percentages)

The regression I was asked to do is

protein = β + moisture*γ + fat*δ

So we can calculate protein without having to measure it (done on some test samples)

This to me makes absolutely no sense...I just can't put my finger on exactly why this is wrong (or if it even is)

2. Jun 29, 2012

### haruspex

If I've understood correctly, the regression result is obvious: β = 100, γ = δ = -1.
Now, one could in principle deduce the protein content this way, but I would expect the result to be highly inaccurate since it will involve a small difference between large numbers.
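
To put a rough number on the "small difference between large numbers" concern, here is a minimal sketch. The sample composition (92.7% water, 4.0% fat) and the 1% measurement error are made-up illustration values, not data from this thread, and the function names are hypothetical.

```python
def protein_by_difference(water_pct, fat_pct, total_pct=100.0):
    """Protein inferred as what is left after water and fat (all in percent)."""
    return total_pct - water_pct - fat_pct


def protein_relative_error(water_pct, fat_pct, water_rel_err, total_pct=100.0):
    """Relative error in inferred protein caused by a relative error in water."""
    true_protein = protein_by_difference(water_pct, fat_pct, total_pct)
    measured_water = water_pct * (1.0 + water_rel_err)
    inferred = protein_by_difference(measured_water, fat_pct, total_pct)
    return (inferred - true_protein) / true_protein


# Hypothetical sample: 92.7% water, 4.0% fat, hence 3.3% protein.
# A 1% relative overestimate of water shifts the inferred protein by about -28%.
rel_err = protein_relative_error(92.7, 4.0, 0.01)
print(f"relative error in protein: {rel_err:.1%}")
```

The absolute error in water carries over unchanged to protein, but protein is so much smaller that the same absolute error is a far larger fraction of it.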

3. Jun 29, 2012

### chiro

Hey autobot.d and welcome to the forums.

You should clarify what the variables moisture and fat are, and what their units are. I'm guessing that they are percentages, but I'm asking not only because of the units but also because of how they relate to the response variable protein.

In the water, fat, protein combination there is no room for error: if you supply two you will always get the third in an absolutely deterministic, zero-degrees-of-freedom manner.

But usually in a regression, you are trying to fit a model that minimizes the residual (error) over the whole data set being considered. If there are errors, then you are not doing the same thing as with water/fat/protein, where 2 of the 3 give complete information about the system; a typical regression does not.

With respect to haruspex's post, it is wrong in general and here is why:

The first reason is that if you are dealing with incomplete information or being forced to estimate, then there will be an error component in the model.

Because of this incomplete information, things won't be conserved up to 100%.

The other thing, which is more obvious, is that the protein/fat/moisture model may not be conserved. It is conserved in the water/protein/fat model, but that does not imply it's conserved in the other model. This alone signifies why it is erroneous, but it is not the only reason. Remember that moisture has not been explicitly related to the water/fat/protein model: if it were, you wouldn't need to do a regression at all.

Because of these issues, you will be trying to fit a model to predict protein from fat and moisture (usually under uncertainty), and the regression algorithm will figure out the β, γ, δ coefficients that are the best ones.

The idea of regression is to fit a model so that given predictor variables, you can get a response variable. However you need to be really cautious about using these models because it's not as simple as just calculating the model and using it without thinking: if your professor/teacher hasn't mentioned this then it is irresponsible on his part.

Haruspex, have you ever taken a statistics course?

4. Jun 29, 2012

### haruspex

chiro, that was exactly my point. If I/we have understood the statement of the problem correctly then the 'right' answer will be the one I stated. Any difference between that and the result of the regression analysis will represent the errors in the measurements.
I did not mean to suggest that an actual regression analysis based on data samples would have exactly that result.
Perhaps the point of the exercise is to find systematic errors in the measurements so that they can be corrected. It still seems rather fraught since protein is only 3-3.5% of whole milk. It doesn't take much percentage error in the water content to make inroads into that.
One thing I may have misinterpreted is water v. moisture. I've taken these to refer to the same thing.
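
As a sanity check on the claim that exact data would give β = 100, γ = δ = -1, here is a minimal least-squares sketch in plain Python. The sample values and the small protein "measurement errors" are hypothetical, and `fit_protein_regression` is an illustrative helper, not a library function.

```python
def fit_protein_regression(water, fat, protein):
    """Least-squares fit of protein = beta + gamma*water + delta*fat.

    Predictors are centered first, which reduces the normal equations
    to a 2x2 system that is solved here with Cramer's rule.
    """
    n = len(protein)
    mw, mf, mp = sum(water) / n, sum(fat) / n, sum(protein) / n
    dw = [w - mw for w in water]
    df = [f - mf for f in fat]
    dp = [p - mp for p in protein]
    sww = sum(a * a for a in dw)
    sff = sum(a * a for a in df)
    swf = sum(a * b for a, b in zip(dw, df))
    swp = sum(a * b for a, b in zip(dw, dp))
    sfp = sum(a * b for a, b in zip(df, dp))
    det = sww * sff - swf * swf
    if det == 0:
        raise ValueError("water and fat are collinear; the fit is not unique")
    gamma = (swp * sff - sfp * swf) / det
    delta = (sfp * sww - swp * swf) / det
    beta = mp - gamma * mw - delta * mf
    return beta, gamma, delta


# Hypothetical samples (percent). Protein is set so the three sum to exactly 100.
water = [86.0, 87.0, 88.0, 89.0, 90.0]
fat = [4.5, 3.0, 4.0, 3.5, 2.5]
protein_exact = [100.0 - w - f for w, f in zip(water, fat)]

beta, gamma, delta = fit_protein_regression(water, fat, protein_exact)
print(beta, gamma, delta)  # approximately 100, -1, -1

# Same samples with made-up errors added to the protein readings:
# the fitted coefficients now move away from (100, -1, -1).
errors = [0.10, -0.05, 0.00, 0.05, -0.10]
protein_noisy = [p + e for p, e in zip(protein_exact, errors)]
print(fit_protein_regression(water, fat, protein_noisy))
```

With exact closure the fit only recovers the identity; any departure from (100, -1, -1) reflects error in the measurements, which is the point made above.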

5. Jun 29, 2012

### chiro

Thanks for clearing that up.

The thing is that a lot of people who haven't studied statistics, or at least practiced it in some depth, tend to apply these techniques without understanding them.

It's not good when you get another science or engineering student that applies a regression without knowing issues surrounding its use.

Also the main issue I had with your post is the water vs moisture: They might be the same thing but we have to be careful. Moisture may be related to, or be derived from water but it may not be equivalent. If this guy is going to base a paper or a project on this kind of advice, it's better to be a little anal to make sure. I personally would hate being a scientist that has to quadruple check things consistently, but it would be a waste for them to go to all that effort and then misuse a bit of statistics to screw up the whole thing.

The other thing is that the post does imply that he is trying to predict protein given other predictors, and this kind of thing is a time bomb waiting to go off for scientists who don't know, in depth, when a model is OK and when it is misleading.

It's not saying they are somehow stupid, just that statistics is not the focus for them: the focus is science and that's understandable.

6. Jun 29, 2012

### viraltux

Being anal might help in the peer-reviewing process but I'd say it is a bit backward

7. Jun 29, 2012

### autobot.d

Sorry about that, water = moisture, typo. They are all percentages, and I had the same problem: it does not make sense to make a linear equation out of something that should all add up to 100 percent. The problem is that protein is not measured but the other things are, so I am thinking a better way to find the protein content is to see how accurate the measurements are and then make a confidence interval for what the protein could be.

I really appreciate the responses and sorry for not being clear in my original post. Thanks!

8. Jun 29, 2012

### autobot.d

Does this make sense? Since the equation is

Protein + Water + Fat ~= 96% of total milk weight (the total is the average from data)

Hypothesis test:

H0 := (Water + Fat) - Protein = 45.1 (average difference obtained from data)
H1 := (Water + Fat) - Protein != 45.1

Does this seem like a more appropriate test? It does to me so any criticism would be most appreciated. Thanks!

9. Jun 29, 2012

### SW VandeCarr

10. Jun 29, 2012

### autobot.d

Well in this case the error term would come from the measuring device which in our case can safely be assumed to be ε~NID(0,σ)

so the equation would look like

protein = β + water*γ + fat*δ + ε

Again, to me it still seems like a regression on this might be inappropriate since

protein + water + fat + ε ~= 96%

b/c then the equation should just be

protein = 96% - water - fat + ε
just constants and an error term without a linear term. Thanks for the help. I think one little push and I will understand this. Thanks.
(Also, I am a mathematician, just more on the pure mathematics side, now getting a master's in applied math, if that helps...)

11. Jun 29, 2012

### haruspex

Perhaps it's a question of systematic errors versus random ones. E.g. the analysis might reveal that when the water content is high the fat content tends to be overestimated. If so, [1] would yield a better estimate of protein than would [2].
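
A worked version of that example, under assumptions that are purely illustrative: suppose the 96% closure from post 10 holds exactly for the true values, water $w$ is measured correctly, and the fat reading $\hat f$ is biased linearly with water content. The constants $c$ and $w_0$ are hypothetical.

```latex
\begin{align*}
\hat f &= f + c\,(w - w_0)
  && \text{(assumed form of the systematic error)} \\
p &= 96 - w - f
  && \text{(closure on the true values)} \\
  &= 96 - w - \bigl(\hat f - c\,(w - w_0)\bigr) \\
  &= (96 - c\,w_0) + (c - 1)\,w - \hat f .
\end{align*}
```

So in terms of the measured quantities the relationship is still linear, but with $\beta = 96 - c\,w_0$, $\gamma = c - 1$, $\delta = -1$ instead of $(96, -1, -1)$. A fitted regression can absorb this kind of bias, whereas plain subtraction of the readings would be off by $c\,(w - w_0)$.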

12. Jun 29, 2012

### autobot.d

Thank you so much haruspex for sticking with me. I think you are on to something and I will spend tonight studying the difference and maybe have a more well formed question tomorrow. Thanks again everybody!