Always positive function with regression

mignu · Dec 28, 2009

I'm attempting to solve a multiple regression.
My problem is that I want the resultant function to be always positive.

I need a regression of the machining cutting force for several values of cutting parameters (cutting speed, depth of cut, ...).
The cutting force always has to be positive, but with my limited set of parameters (I don't have ALL the possible speeds or all the possible depths) the resultant function for the force isn't always positive.

For example: in my experiments the tool diameter varies from 15 to 70. If I use my regression to calculate the cutting force for a diameter of 10, say, I get a negative value. That's unacceptable; the cutting force has to be positive.

I need to constrain the regression so that, with the coefficients found, the function is always positive for ANY value of the cutting parameters (or at least for positive values, because negative diameters, depths of cut, etc. don't exist). Some coefficients have to be positive and others negative, but the final function has to be positive for all positive values of the cutting parameters.

How can I do this? I use Matlab but any program would be great if it solves my problem.

EnumaElish · Dec 30, 2009

Putting aside your empirical question for a moment: what does theory tell you should happen to the cutting force as the diameter goes to zero? Should it also go to zero, go to some other value, or become infinite?

statdad · Dec 30, 2009

"For example: in my experiments I have the tool diameter varying from 15 to 70. If I use my regression and try to calculate the cutting force, for example, with diameter 10 I get a negative value."

You also neglected a basic idea of regression: technically, your regression equation is valid only for the collected x-values: without data at x = 10, you have no basis for using the equation.

EnumaElish · Dec 31, 2009

Certainly the prediction confidence interval expands very rapidly as one moves beyond the limits of observed data.
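
For the simple one-predictor case, the standard textbook form of the prediction interval makes this widening explicit. The symbols are the usual ones and are not taken from the thread: n is the sample size, \bar{x} the sample mean of x, s the residual standard error, and S_{xx} = \sum_i (x_i - \bar{x})^2.

$$
\hat{y}_0 \;\pm\; t_{n-2,\,1-\alpha/2}\; s\,\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}
$$

The term (x_0 - \bar{x})^2 / S_{xx} grows quadratically as x_0 moves away from the centre of the data, so the interval is narrowest at \bar{x} and keeps widening outside the observed range.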

statdad · Dec 31, 2009

EnumaElish said:

Certainly the prediction confidence interval expands very rapidly as one moves beyond the limits of observed data.

It can be a little more than that. Back in the "old days" of photography, when film was used, the response curves for exposure were important to know. The curves were very nearly linear in the middle (for typical exposure times), but for extremely long or extremely short exposures there was a huge departure from the linear pattern: plateaus to the right and left, basically. Exposure estimates based on the linear portion, but aimed at the extremes, were doomed to fail; you needed data from those plateaus to know the "correct" exposure values.

The behavior of a regression model is impossible to determine outside the range of collected data. The software will let you extrapolate, but that doesn't mean the results are meaningful.

EnumaElish · Dec 31, 2009

Your post nicely highlights that there is more than one issue related to the current problem.

If "theory" (or common sense) tells me that the true model is non-linear, then I know that linear regression is at best a local approximation, which is your point.

But suppose that I had a theory that told me exactly what the true (linear or nonlinear) model is, for example, y = b1/x + b2 x + b3 x^2 (or: [insert complicated, nonlinear formula]). Then I could fit _this_ model to data, yet even then I need to worry about predicting outside of the range, because the prediction interval inflates rather rapidly.
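
The example model y = b1/x + b2 x + b3 x^2 is nonlinear in x but linear in the coefficients, so ordinary least squares can fit it once 1/x, x and x^2 are used as columns of the design matrix. A minimal sketch, assuming NumPy and using made-up coefficients and noise-free synthetic data (none of these numbers come from the thread):

```python
import numpy as np

# Hypothetical "true" coefficients (b1, b2, b3), chosen only for illustration.
b_true = np.array([4.0, 0.5, 0.1])

# Synthetic, noise-free observations on a limited x range.
x = np.linspace(1.0, 10.0, 25)
X = np.column_stack([1.0 / x, x, x**2])  # design matrix: columns 1/x, x, x^2
y = X @ b_true

# Least-squares estimate of (b1, b2, b3).
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_hat)
```

With real, noisy data the estimates would differ from the true values, and predictions far outside the sampled x range would still carry wide intervals.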

statdad · Dec 31, 2009

Ah, if I understand this point, I would say there is no need for estimation after the determination of the parameters. I make this comment viewing your hypothetical model as deterministic. If it isn't meant to be, I apologize for being thick tonight.

EnumaElish · Jan 1, 2010

That was not the point I was trying to make, but you needn't have apologized. My post was far from clear. Here's a hypothetical example: y measures likelihood of death within 5 years, x is the cholesterol level. And z is defined as z = -Log(1/y -1), or the "logit value." Since z = -Log(1/y -1), I can also write y = 1/(1 + Exp(z)).

y , x , z
0.990872864 , 9 , 4.687334231
0.943313766 , 8 , 2.811867565
0.644459353 , 7 , 0.594772174
0.785788908 , 7 , 1.299726253
0.719638278 , 3 , 0.94266806
0.988224114 , 9 , 4.429855602
0.54455151 , 5 , 0.178679914
0.977860268 , 7 , 3.78799294
0.766413964 , 4 , 1.188171972
0.987292192 , 8 , 4.352749407
0.672823862 , 4 , 0.720984899
0.836917534 , 4 , 1.635469543
0.988351439 , 7 , 4.440855722
0.982624412 , 7 , 4.035160735
0.867312958 , 5 , 1.877406589
0.849853063 , 5 , 1.733449073
0.839829058 , 6 , 1.656956736
0.96094483 , 7 , 3.202941748
0.695135331 , 5 , 0.824238576
0.779657846 , 6 , 1.263673583

Suppose that the true model is logit: z = a0 + a1 x + u. Alternatively, I can be "naive" and estimate a linear probability model, as y = b0 + b1 x + v. My estimation results, rounded to two decimals, are:

Logit: z = -1.81 + 0.67x (F stat = 23.97)
Linear: y = 0.52 + 0.05x (F stat = 13.74)

Now suppose I'd like to predict the likelihood of death Y(x) for two out-of-sample x values, x = 11, and x = 0. The logit model gives me predicted probabilities Y(11) = 0.9960 and Y(0) = 0.1403 respectively. These are reasonable probability values. Using the linear model, however, I obtain Y(11) = 1.0939 and Y(0) = 0.5206, and I calculate the 2-standard-deviation prediction interval around Y(0) as (0.2645, 0.7766) in the linear model. To summarize the prediction results:

Logit: z = -1.81 + 0.67x
0.99598333 , 11 , 5.513277315
0.140328661 , 0 , -1.812562899

Linear: y = 0.52 + 0.05x
1.093857079 , 11
0.520578354 , 0
Prediction interval around 0.520578354 = (0.264523738, 0.776632969)
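
The point estimates above can be reproduced from the posted table. The sketch below assumes NumPy and assumes that the "logit" fit is an ordinary least-squares regression of z on x; the post does not say how it was estimated, but this is consistent with the quoted numbers. The prediction interval is not recomputed here.

```python
import numpy as np

# x (cholesterol level) and y (likelihood of death) copied from the table in the post.
x = np.array([9, 8, 7, 7, 3, 9, 5, 7, 4, 8, 4, 4, 7, 7, 5, 5, 6, 7, 5, 6], dtype=float)
y = np.array([
    0.990872864, 0.943313766, 0.644459353, 0.785788908, 0.719638278,
    0.988224114, 0.54455151, 0.977860268, 0.766413964, 0.987292192,
    0.672823862, 0.836917534, 0.988351439, 0.982624412, 0.867312958,
    0.849853063, 0.839829058, 0.96094483, 0.695135331, 0.779657846,
])

# Logit transform as defined in the post: z = -log(1/y - 1).
z = -np.log(1.0 / y - 1.0)

# np.polyfit with degree 1 returns (slope, intercept).
a1, a0 = np.polyfit(x, z, 1)  # assumed OLS fit on the logit scale: z = a0 + a1*x
b1, b0 = np.polyfit(x, y, 1)  # linear probability model: y = b0 + b1*x

def inv_logit(value):
    """Inverse of the transform above: y = 1 / (1 + exp(-z)), always in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-value))

x_new = np.array([11.0, 0.0])  # the two out-of-sample points
logit_pred = inv_logit(a0 + a1 * x_new)
linear_pred = b0 + b1 * x_new

print(f"Logit:  z = {a0:.2f} + {a1:.2f}x, Y(11) = {logit_pred[0]:.4f}, Y(0) = {logit_pred[1]:.4f}")
print(f"Linear: y = {b0:.2f} + {b1:.2f}x, Y(11) = {linear_pred[0]:.4f}, Y(0) = {linear_pred[1]:.4f}")
```

Because the logit predictions pass through inv_logit, they stay strictly between 0 and 1 for any x, whereas the linear model has no such bound.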

So there are at least two problems with out-of-sample prediction using the linear model. First, it can produce a "probability" value > 1. (This is similar to the problem in the OP.) But second, even when it doesn't, the prediction interval can become quite large.

The solution for the first problem is to estimate the "true model" (in this case, the logit model). There isn't a solution for the second problem, other than exercising caution when predicting out of the sample.

Even when the linear model predicts a probability value within the [0,1] interval, that doesn't mean it is the right (or approximately right) predicted value. That's apparent from the fact that the predicted value for x = 0 from the logit model, 0.1403, lies outside the prediction interval (0.2645, 0.7766) for x = 0 in the linear model. I think this was statdad's point: one should be extra careful about extending a "simplistic" model beyond the sample because it is (at best) a local approximation. (The remedy is to estimate the true model.)
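
Carried over to the original cutting-force question, the same remedy might look like the sketch below: fit on a transformed scale whose inverse can only return positive values. It assumes NumPy and, purely for illustration, a power-law form F = C * d^k fitted as log F = log C + k log d. The thread does not establish that this is the true form for cutting force, and the data are synthetic (a made-up law F = 2.0 * d^1.5 evaluated on diameters in the 15 to 70 range).

```python
import numpy as np

# Hypothetical, noise-free data: diameters within the 15-70 range from the question,
# with a made-up power law standing in for measured cutting forces.
d = np.array([15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0])
F = 2.0 * d**1.5

# Ordinary linear fit F = b0 + b1*d (np.polyfit returns slope, then intercept).
b1, b0 = np.polyfit(d, F, 1)

# Log-log fit: log F = log C + k*log d, so the back-transformed F is C * d**k.
k, log_C = np.polyfit(np.log(d), np.log(F), 1)

def force_linear(diam):
    """Prediction from the straight-line fit; can go negative outside the data."""
    return b0 + b1 * diam

def force_power(diam):
    """Prediction from the log-log fit; exp(...) * diam**k is positive for diam > 0."""
    return np.exp(log_C) * diam**k

print(force_linear(10.0))  # negative for this synthetic data
print(force_power(10.0))   # positive
```

With several cutting parameters the same construction extends to log F = c0 + c1 log(speed) + c2 log(depth) + ..., where the coefficients may have either sign while the predicted force remains positive. Positivity alone does not make an out-of-range prediction accurate.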

statdad · Jan 3, 2010

"(The remedy is to estimate the true model.)"

Agreed, provided you know the form of the true model and experience or solid theory shows that it extends beyond the range of your collected data. Even then, as you have pointed out, the confidence bands can be so wide as to be useless in practice.

EnumaElish · Jan 5, 2010

The expression "y = 1/(1 + Exp(z))" in my previous post should have been "y = 1/(1 + Exp(-z))."
