Linear regression and probability distribution

In summary, the thread asks how to estimate the uncertainty in the slope and intercept of a simple linear regression fit to a small dataset. The replies point to confidence intervals built from the coefficient covariance matrix, Bayesian credible intervals, and Python tools such as statsmodels (whose OLS example in post #7 reports an R-squared of 1.000, an F-statistic of 5.402e+06, AIC 269.5, BIC 277.4, and a nonrobust covariance type) and numpy.polyfit with cov=True.
  • #1
maka89
I have some data that I want to do simple linear regression on.
However, I don't have a lot of data points, and I would like to know the uncertainty in the parameters, i.e. the slope and the intercept of the linear regression model.

I know it should be possible to get a probability distribution of the parameters using Bayesian linear regression, but I don't know quite how to use it and what assumptions have to be made. Can someone help me find reasonable error estimates for my model?
 
  • #2
What's wrong with a confidence interval on the parameters?

If you wish to use a Bayesian approach, then it's no different from how one normally approaches these problems: assume a prior, compute a posterior, and form a credible interval.
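As a concrete illustration of that recipe, here is a minimal sketch with made-up toy data, assuming a flat prior and Gaussian noise (under those assumptions the posterior of the coefficients is approximately normal, centered on the least-squares estimate):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): y = 1 + 2x + noise, only 20 points
x = rng.uniform(0, 10, 20)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 20)

# Design matrix with an intercept column
X = np.column_stack((np.ones_like(x), x))

# Least-squares estimate and residual variance
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2 = resid @ resid / (len(y) - 2)

# Under a flat prior and Gaussian noise, the posterior of the
# coefficients is approximately normal with this covariance:
cov = sigma2 * np.linalg.inv(X.T @ X)

# Draw posterior samples and form a 95% credible interval per coefficient
samples = rng.multivariate_normal(beta_hat, cov, size=10_000)
lo, hi = np.percentile(samples, [2.5, 97.5], axis=0)
print("intercept CI:", lo[0], hi[0])
print("slope CI:", lo[1], hi[1])
```

With an informative prior you would combine the prior precision with X.T @ X instead, but even the flat-prior case gives usable credible intervals for a small sample.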
 
  • #3
Like MarneMath said, most programs will include the margin of error and the distribution of both the intercept term and the slope coefficient when you do a regression.
 
  • #4
Sounds like what I am after. Any Python library that has this ability?
 
  • #5
statsmodels should have it, but it's been a while since I've used it. I'll also say it's relatively trivial to figure it out on your own and write your own function to do it. I'd also imagine sklearn has this feature (if it doesn't, then wtf).
 
  • #6
Sklearn does not have it :( statsmodels does, but the API seems kind of lame.

Think numpy.polyfit with cov=True will do the trick :cool:
 
  • #7
I'm not sure that'll actually do the trick, since all that will do is return the standard error.

Code:
import numpy as np
import statsmodels.api as sm

# Simulate data: y = 1 + 0.1*x + 10*x**2 + Gaussian noise
nsample = 100
x = np.linspace(0, 10, nsample)
X = np.column_stack((x, x**2))
beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)

X = sm.add_constant(X)  # prepend an intercept column
y = np.dot(X, beta) + e

mod = sm.OLS(y, X)
res = mod.fit()

print(res.summary())
print(res.conf_int(0.01))  # 99% confidence intervals per coefficient

Returns the following

Code:
                           OLS Regression Results                           
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.402e+06
Date:                Mon, 03 Jul 2017   Prob (F-statistic):          1.70e-245
Time:                        09:34:54   Log-Likelihood:                -131.77
No. Observations:                 100   AIC:                             269.5
Df Residuals:                      97   BIC:                             277.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.9008      0.270      3.338      0.001       0.365       1.436
x1             0.0988      0.125      0.792      0.430      -0.149       0.346
x2             9.9997      0.012    828.551      0.000       9.976      10.024
==============================================================================
Omnibus:                        0.170   Durbin-Watson:                   1.765
Prob(Omnibus):                  0.918   Jarque-Bera (JB):                0.274
Skew:                          -0.092   Prob(JB):                        0.872
Kurtosis:                       2.821   Cond. No.                         144.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[[  0.19176405   1.60988065]
 [ -0.22893401   0.42648449]
 [  9.96797785  10.03139905]]

Which, in my opinion, is much more helpful than just the standard errors.
 
  • #8
Ah, okay. But am I right that the two coefficients in my model have a joint multivariate normal distribution? So to get a 95% confidence interval, I suppose I have to integrate out the other coefficient that I am not looking at (and not naively just add/subtract 1.96 standard deviations)?
 
  • #9
The estimates of the coefficients will actually follow a t-distribution with n - 2 degrees of freedom. This is, of course, assuming that the assumptions of linear regression have been met.
 
  • #10
Thanks! Bit rusty on the ol' central limit theorem... So, since I have two coefficients and numpy.polyfit returns a 2x2 covariance matrix, I guess I am looking at a multivariate t-distribution with n - 2 degrees of freedom and the covariance matrix returned by the function?
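For the marginal intervals, at least, the t critical value with n - 2 degrees of freedom can be applied directly to the standard errors from that covariance matrix. A small sketch with toy data (scipy assumed available for the t quantile):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Toy data (hypothetical): y = 4 + 1.5x + noise, small sample
n = 15
x = np.linspace(0, 5, n)
y = 4.0 + 1.5 * x + rng.normal(0, 0.5, n)

coefs, cov = np.polyfit(x, y, deg=1, cov=True)  # [slope, intercept]
se = np.sqrt(np.diag(cov))

# 95% marginal interval per coefficient using the t critical value
# with n - 2 degrees of freedom (two estimated parameters)
tcrit = stats.t.ppf(0.975, df=n - 2)
for name, c, s in zip(["slope", "intercept"], coefs, se):
    print(f"{name}: {c:.3f} +/- {tcrit * s:.3f}")
```

Note this gives per-coefficient intervals; a joint confidence region for both coefficients at once would use the full covariance matrix, which is what the multivariate question above is getting at.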
 
  • #11
You're literally just getting the covariance matrix. There's nothing special about that: the function is computing the covariance matrix the standard way, with nothing about distributions. That's why I keep pointing you towards a confidence interval. The way you figure out a confidence interval for any estimate is essentially what you do with the parameters being estimated here. I'm not entirely sure what you don't get about the process.
 
  • #12
I need to be able to sample from the distribution of the regression coefficients, eventually, so I'm trying to understand which pdf to use.
 
  • #13
maka89 said:
Sounds like what I am after. Any Python library that has this ability?

Excel itself has it, standard Excel. And besides, Excel has free add-ins for a lot of other stuff, like PCA, factor analysis, examples, etc. I know a lot of people look down on it, but I think that is mostly snobbery.
 
  • #14
Excel is nice, but I'm looking to use this as a means to an end in a different project, which is python based :)
 
  • #15
maka89 said:
Excel is nice, but I'm looking to use this as a means to an end in a different project, which is python based :)
Can't you call Excel from Python or maybe export the Excel output analysis?
 
  • #16
Excel is not free, and the data is user-specified, so the script should be able to do the number crunching itself.
I found that the statsmodels package works well for getting confidence intervals, but I still have some trouble understanding how it works and how I can sample from the distribution of the regression parameters ^^
 
  • #17
maka89 said:
Excel is not free, and the data is user-specified, so the script should be able to do the number crunching itself.
I found that the statsmodels package works well for getting confidence intervals, but I still have some trouble understanding how it works and how I can sample from the distribution of the regression parameters ^^

I don't think it samples from the parameters; the parameters have distributions that are known a priori (derived theoretically), and the specific values for the confidence intervals are arrived at through the data you enter. I think Excel Online is available for free, though I don't mean to push it.
 
  • #18
maka89 said:
Excel is nice, but I'm looking to use this as a means to an end in a different project, which is python based :)
Python and other languages with scripting capabilities (I like Perl) allow you to write top-level scripts that can generate programs for other languages (Excel, R, MATLAB, etc.), run the program, parse the output for results, and continue to the next step. A large project might have a lot of top-level Python scripts that invoke several other programs or tools that are best for specific subtasks.
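A minimal sketch of that pattern, using a throwaway Python worker script as a stand-in for whatever external tool (R, MATLAB, etc.) the pipeline would actually invoke; all names here are made up for illustration:

```python
import os
import subprocess
import sys
import tempfile

# Step 1: generate a small worker program (here plain Python,
# standing in for a generated R/MATLAB script)
worker_source = "print(sum(range(10)))\n"
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(worker_source)
    worker_path = f.name

# Step 2: run the worker and capture its textual output
proc = subprocess.run([sys.executable, worker_path],
                      capture_output=True, text=True)

# Step 3: parse the output for the result, then continue the pipeline
result = int(proc.stdout.strip())
print("worker returned:", result)

os.unlink(worker_path)  # clean up the generated script
```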
 
  • #19
maka89 said:
Excel is not free, and the data is user-specified, so the script should be able to do the number crunching itself.
I found that the statsmodels package works well for getting confidence intervals, but I still have some trouble understanding how it works and how I can sample from the distribution of the regression parameters ^^
The sampling distribution of the regression coefficients converges asymptotically to a normal distribution. If I may ask, why do you want to sample the parameters?
 
  • #20
MarneMath said:
The sampling distribution of the regression coefficients converges asymptotically to a normal distribution. If I may ask, why do you want to sample the parameters?

I got the confidence intervals to work with statsmodels. I still need to figure out how to get prediction intervals as well, though.

The reason I want to be able to sample is that the regression coefficients obey some (potentially complex) physics equation with several other parameters (that also have uncertainties). I want to see how these uncertainties propagate; i.e., if a and b have some uncertainty, what are the possible values of c, which is also in the equation? One easy way is to Monte Carlo sample a and b and create a histogram over the possible c values.
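That propagation step can be sketched as follows. The toy data and the downstream relation c = a**2 + b are made up for illustration; the key point is sampling (a, b) jointly from the fitted covariance rather than independently:

```python
import numpy as np

rng = np.random.default_rng(3)

# Fit a line to toy data (hypothetical stand-in for the real physics data)
x = np.linspace(0, 10, 25)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 25)
coefs, cov = np.polyfit(x, y, deg=1, cov=True)  # [slope a, intercept b]

# Sample (a, b) jointly from the fitted normal approximation;
# the off-diagonal term matters because slope and intercept are correlated
samples = rng.multivariate_normal(coefs, cov, size=50_000)
a, b = samples[:, 0], samples[:, 1]

# Propagate through a made-up downstream relation c = a**2 + b
# and summarize the resulting distribution
c = a**2 + b
print("c mean:", c.mean())
print("c 95% interval:", np.percentile(c, [2.5, 97.5]))
```

A histogram of c (e.g. np.histogram or matplotlib) then shows the propagated uncertainty directly.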
 

1. What is linear regression and how is it used in science?

Linear regression is a statistical method used to model the relationship between two or more variables. It is often used in science to predict or explain the relationship between a dependent variable (the outcome) and one or more independent variables (the predictors).

2. What is the difference between simple and multiple linear regression?

Simple linear regression involves modeling the relationship between two variables, while multiple linear regression involves modeling the relationship between a dependent variable and two or more independent variables. In multiple linear regression, the effect of each independent variable on the dependent variable is examined while controlling for the other variables.

3. How is probability distribution related to linear regression?

Probability distribution is a way of displaying the likelihood of different outcomes in a statistical analysis. In linear regression, probability distribution can be used to determine the probability of obtaining a certain relationship between variables, as well as to calculate confidence intervals for the estimated parameters.

4. How do you determine the best fit line in linear regression?

The best fit line in linear regression is determined by minimizing the sum of the squared differences between the observed data points and the predicted values from the regression line. This is known as the method of least squares and is used to find the line that best represents the relationship between the variables.
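In the two-variable case this minimization has a well-known closed form; a quick sketch with made-up numbers:

```python
import numpy as np

# Toy dataset (hypothetical), roughly y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Least-squares solution: slope = Sxy / Sxx, intercept from the means
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)  # slope near 2, intercept near 0
```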

5. Can linear regression be used to make predictions?

Yes, linear regression can be used to make predictions about the relationship between variables. By using the regression equation, we can input values for the independent variables and predict the corresponding value for the dependent variable. However, it is important to note that the accuracy of these predictions is dependent on the quality of the data and the assumptions of the linear regression model.
