
A question about regressions.

  1. Feb 5, 2014 #1
    I notice that the examples always seem to talk about linear regression, as in y = mx + c.

    As I understand it, you write the sum of squared residuals as a function of the parameters, and then find the parameter values that minimise that sum (the SSE).

    Would this mean that you could essentially set up an Excel spreadsheet that lets you fit almost ANY expression, using Solver or VBA? Even if you can't solve it algebraically, you can get close enough numerically.
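
    For instance, a minimal sketch of that idea outside Excel, in Python, with a made-up exponential model and data, and SciPy's minimiser standing in for Solver:

    Code (Python):
    # Numerically minimise the sum of squared residuals for an
    # arbitrary model -- the same job Solver does in a spreadsheet.
    # The model form and the data are invented for illustration.
    import numpy as np
    from scipy.optimize import minimize

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 8.2, 19.5, 55.0])

    def model(params, x):
        a, b = params
        return a * np.exp(b * x)       # any expression could go here

    def sse(params):
        residuals = y - model(params, x)
        return np.sum(residuals ** 2)  # sum of squared residuals

    result = minimize(sse, x0=[1.0, 1.0])  # numerical search over a, b
    print(result.x)                        # best-fit parameters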

    Why do they focus on the situation that seems least likely to be useful? If the data already looked like a straight line, why would you need such a complicated topic?
     
  2. Feb 5, 2014 #2

    Stephen Tashi

    Science Advisor

    Your question is unclear. I think you are asking whether one can fit almost any sort of non-linear curve to data by finding constants that minimize the sum of the squared errors between the curve and the data.

    Yes, you can fit non-linear curves to data that way. (You can't necessarily do this by solving simple equations, but it can be done numerically.) Whether you use linear or non-linear curves is a subjective matter unless you know some theoretical model for the data that directs your attention to curves with a particular shape.
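
    For example, here is a minimal sketch of such a numerical fit in Python; the saturating model and the data are invented purely for illustration:

    Code (Python):
    # Non-linear least squares done numerically with scipy.optimize.curve_fit.
    # The saturating curve below is just one arbitrary non-linear shape.
    import numpy as np
    from scipy.optimize import curve_fit

    def saturating(x, vmax, k):
        return vmax * x / (k + x)      # levels off as x grows

    x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
    y = np.array([0.9, 1.5, 2.2, 2.8, 3.1])

    popt, pcov = curve_fit(saturating, x, y, p0=[3.0, 1.0])
    print(popt)   # fitted vmax and k that minimise the squared error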

    Sometimes the sum of squared errors is not a good measure of the utility of a fit. For example, if you are estimating the volume of a water tank needed to serve a city with a certain population, the dis-utility of underestimating the volume is different from the dis-utility of overestimating it by the same amount.
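
    A minimal sketch of that point, assuming an invented demand series and an arbitrary 5-to-1 penalty for shortfalls:

    Code (Python):
    # Replace squared error with an asymmetric loss, so that
    # undersizing the tank costs more than oversizing it.
    import numpy as np
    from scipy.optimize import minimize_scalar

    demand = np.array([90.0, 110.0, 105.0, 130.0, 95.0])  # made-up data

    def loss(volume):
        err = volume - demand
        # shortfalls (err < 0) are penalised 5x more than excess capacity
        return np.sum(np.where(err < 0, -5.0 * err, err))

    best = minimize_scalar(loss, bounds=(50, 200), method="bounded")
    print(best.x)   # lands above the plain mean, because shortfalls cost more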
     
  3. Feb 5, 2014 #3

    FactChecker

    Science Advisor
    Gold Member

    Here is one of the important uses of linear regression: when data look "sort of" linear, it is natural to ask whether the apparent linear trend is just random luck or whether it really explains something about the data. To answer that, you want the best linear function you can put through the data. Then you can compare how much of the scatter of the data is explained by the linear part versus how much unexplained scatter remains. Statistical tests can then tell you whether the apparent trend can reasonably be put down to random luck.
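
    A minimal sketch of that comparison in Python; the data are invented, and scipy.stats.linregress reports both the explained fraction and a p-value:

    Code (Python):
    # Fit a line and test whether the trend could plausibly be luck.
    import numpy as np
    from scipy.stats import linregress

    x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    y = np.array([2.3, 2.9, 3.8, 4.1, 5.2, 5.8, 6.9, 7.4])

    fit = linregress(x, y)
    print(fit.slope, fit.intercept)
    print(fit.rvalue ** 2)  # fraction of the scatter explained by the line
    print(fit.pvalue)       # small => the trend is unlikely to be random luck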
     
  4. Feb 5, 2014 #4
    Yes, I was thinking of solving it numerically, iterating with some small step size.

    Yes, my terminology will be off, sorry about that. I'm not an academic.

    The reason is that over the interval a to b the gradient is steep, but from c to d it becomes shallow, so when you try to fit a single straight line you end up with a compromise gradient, and that introduces error that doesn't need to be there.

    Does this in effect mean we're violating an assumption of regression analysis because of heteroscedasticity, and if you numerically fit a non-linear curve, does that get around the problem?
     
  5. Feb 5, 2014 #5

    AlephZero

    Science Advisor
    Homework Helper

    Often you have some reason for choosing a particular "shape" of regression model before you do any curve fitting (and sometimes even before you collect any data, because the "shape" of the curve will influence the best choice of data points to test whether it is a good model or a poor one).

    Otherwise, you can fall into the trap of just going on a fishing expedition to see if something in a collection of data happens to fit a straight line. That doesn't "prove" anything, because correlation is not the same as causation. The word "explains" in FactChecker's post is a dangerous game to play. That way, you can easily "prove" nonsense like "owning a washing machine causes diabetes" (compare the death rate from diabetes to the percentage of families owning washing machines over the last 100 years), or whatever other piece of crackpottery appeals to you.

    IMO the reason that simple examples "always seem to" talk about linear regression is probably that linear regression leads to the simplest version of the math, and many apparently nonlinear curve-fitting problems can be mathematically transformed into linear ones.

    If you just want to fit a "smooth curve" through some data points, you don't have to use regression at all - for example you can fit a spline curve.
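
    For instance, a minimal spline sketch in Python; the data and the smoothing factor are arbitrary choices:

    Code (Python):
    # Fit a smooth curve through noisy points with no regression model at all,
    # using a smoothing spline.
    import numpy as np
    from scipy.interpolate import UnivariateSpline

    x = np.linspace(0, 10, 11)
    y = np.sin(x) + 0.1 * np.random.default_rng(0).standard_normal(11)

    spline = UnivariateSpline(x, y, s=0.5)  # s controls how smooth the curve is
    print(spline(2.5))                      # the curve can be evaluated anywhere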
     
  6. Feb 5, 2014 #6

    Awesome, thanks.

    I just went y = p*exp(rt), where t is time, r is a variable of interest, and p is a constant.

    Plugged it into Excel, then calculated the MAPE and used Solver to minimise it.
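
    (For reference, the same fit sketched outside Excel in Python, with invented data and SciPy playing the role of Solver:)

    Code (Python):
    # Minimise MAPE for y = p*exp(r*t) numerically.
    import numpy as np
    from scipy.optimize import minimize

    t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([10.0, 13.1, 18.2, 24.5, 33.0])

    def mape(params):
        p, r = params
        pred = p * np.exp(r * t)
        return np.mean(np.abs((y - pred) / y)) * 100  # mean absolute % error

    # Nelder-Mead copes with the kinks the absolute value puts in the loss
    result = minimize(mape, x0=[10.0, 0.1], method="Nelder-Mead")
    print(result.x)   # fitted p and r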

    If I wanted to include 3 or 4 variables to do a multiple regression, could I simply go:

    y = p*( exp(r1) + exp(r2) + exp(r3) )

    In theory, if this is minimising the error, it should be a better fit than just using the average?

    Thanks so much for your help so far. What do you think is the advantage of regression, and what techniques are better?
     
  7. Feb 5, 2014 #7

    FactChecker

    Science Advisor
    Gold Member

    There is a deep statistical foundation behind linear regression that is based on the errors being normally distributed. So there is a lot to gain from sticking with that type of model. One advantage is that it can tell you when adding more complexity to the model is statistically justified. You can get different shapes for the model and still retain the advantages of linear regression.

    Suppose :
    1) the general trend of data is exponential, and
    2) the random scatter of data seems to grow proportional to the trend line (when the values get bigger, the scatter gets bigger), and
    3) the y data values are all positive.

    Then take the logarithm of all the y values and perform a linear regression to get a model
    log(y) = a + b*x + c*x^2 + ...
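
    A minimal sketch of that recipe in Python; the data are invented, and the three conditions above are assumed to hold:

    Code (Python):
    # Transform-then-regress: take log(y), then fit a polynomial by
    # ordinary (linear) least squares.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.7, 7.5, 19.9, 55.0, 148.0])  # roughly exponential, all positive

    coeffs = np.polyfit(x, np.log(y), deg=2)  # fits a + b*x + c*x^2 to log(y)
    print(coeffs)                             # highest power is reported first
    y_hat = np.exp(np.polyval(coeffs, x))     # back-transform to the original scale
    print(y_hat)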
     