Multilinear Regression: Interpreting "Insignificant" p-values

In summary, the conversation discusses the practice of dropping insignificant variables from regression models and various methods for model selection. It is recommended to partition the data into separate groups for model selection, coefficient estimation, and model testing, to avoid "double dipping." Specific criteria are suggested for deciding which terms to keep or drop in a regression model. Finally, the analogy of a sample test handed out before an exam illustrates why models must be tested on data separate from what they were trained on.
  • #1
WWGD
Science Advisor
Gold Member
Hi all, I hope this is not too simple; please feel free to point me to a reference if it is. I want to know how to handle having "insignificant" coefficients in my regression:

I just did a multilinear regression returning ## y=a_1x_1+...+a_kx_k ##

In the resulting analysis, two of the coefficients, say ##a_1, a_2##, came out insignificant (at the 95% level). Still, when I remove them and fit a new regression ## y=a_3x_3+a_4x_4+...+a_k x_k ##, the adjusted R^2 drops only very slightly. Is it standard practice to drop "insignificant" variables?
 
  • #2
R^2 will always drop with a simpler model; the more complicated model lets you fit to the noise.

When comparing two models you will usually use an ANOVA or an information criterion such as the Bayesian information criterion (BIC) or the Akaike information criterion (AIC).
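As a rough sketch of how these criteria trade fit against complexity (pure Python; the Gaussian-likelihood forms below drop constant terms, and the sample size and residual sums of squares are made up for illustration, not taken from the thread):

```python
import math

def aic(n, rss, k):
    # Gaussian-likelihood form of the Akaike information criterion,
    # with constants dropped (fine when comparing fits to the same data).
    return n * math.log(rss / n) + 2 * k

def bic(n, rss, k):
    # Bayesian information criterion; the log(n) penalty punishes
    # extra parameters more harshly than AIC once n exceeds about 7.
    return n * math.log(rss / n) + k * math.log(n)

# Hypothetical fits: the bigger model buys a slightly smaller
# residual sum of squares at the cost of two extra parameters.
n = 100
print(aic(n, 50.0, 3), aic(n, 49.0, 5))  # lower is better: small model wins
print(bic(n, 50.0, 3), bic(n, 49.0, 5))
```

With these made-up numbers both criteria prefer the smaller model: the tiny improvement in fit does not pay for the extra parameters.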
 
  • #3
WWGD said:
Hi all, I hope this is not too simple; please feel free to point me to a reference if it is. I want to know how to handle having "insignificant" coefficients in my regression:

I just did a multilinear regression returning ## y=a_1x_1+...+a_kx_k ##

In the resulting analysis, two of the coefficients, say ##a_1, a_2##, came out insignificant (at the 95% level). Still, when I remove them and fit a new regression ## y=a_3x_3+a_4x_4+...+a_k x_k ##, the adjusted R^2 drops only very slightly. Is it standard practice to drop "insignificant" variables?

This is called "model selection". It is indeed common practice to drop insignificant variables; this is done in stepwise, forward, or backward model selection. Just take care that you don't keep an interaction term while dropping one of its main-effect terms; this is called the heredity principle.

When dropping an insignificant variable, you should usually see that the other variables become more significant.

There are many ways to compare models, such as adjusted ##R^2##, Mallows's ##C_p##, AIC, AICc, and BIC; even plain ##R^2## can be used for this with some care.

(Also: do check for multicollinearity using the variance inflation factors (VIF), to see whether the insignificance is caused by that.)
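For the two-predictor case, the VIF check reduces to a squared correlation, which makes for a small self-contained illustration (pure Python; the data are invented so that one predictor is nearly a multiple of the other):

```python
import math

def pearson_r(xs, ys):
    # Sample Pearson correlation, written out by hand.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def vif_two_predictors(x1, x2):
    # With exactly two predictors, the R^2 from regressing one on the
    # other is just the squared correlation, so VIF = 1 / (1 - r^2).
    # A common rule of thumb flags VIF above roughly 5-10.
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# x2 is nearly 2*x1, so the two predictors carry almost the same
# information and the VIF comes out very large.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
print(vif_two_predictors(x1, x2))
```

A predictor with a VIF that large has a badly inflated standard error, which is exactly the mechanism by which multicollinearity can make a genuinely relevant variable look "insignificant".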
 
  • #4
Hey WWGD.

In addition to micromass' comment, I'd suggest understanding the nature of things like adjusted R^2 (or other adjusted values) in addition to a normal R^2 (or other similar test-statistic).

The adjustment to R^2 is made because R^2 can only increase as you add more variables/information. It's good to understand this when you interpret the test statistics during model selection.

Also - regression offers procedural step-up and step-down selection methods; these are used with a variety of test statistics to find models that optimize some information criterion while minimizing the number of variables.

PCA (Principal Component Analysis) is another tool that helps with model fitting when there are many, possibly correlated, predictor variables.
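The penalty described above can be seen directly from the usual adjusted-##R^2## formula (pure Python; the ##R^2## values, sample size, and predictor counts below are hypothetical):

```python
def adjusted_r2(r2, n, p):
    # n = number of observations,
    # p = number of predictors (not counting the intercept).
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical: adding a fourth, nearly useless predictor nudges raw
# R^2 from 0.800 up to 0.802, but the adjustment penalizes the extra
# parameter and the adjusted value goes DOWN.
print(adjusted_r2(0.800, 30, 3))
print(adjusted_r2(0.802, 30, 4))
```

So raw ##R^2## always rewards the bigger model, while adjusted ##R^2## only rewards it when the improvement in fit outweighs the lost degree of freedom.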
 
  • #5
If you do use any form of data-driven model selection then it is important to partition your data into two random groups. Do your model selection on one group and your model testing on the other group. You need to avoid "double dipping" or using the same data for selecting the model and testing the model.
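A minimal sketch of such a random two-way partition (pure Python; the fixed seed and the 50/50 split fraction are arbitrary illustrative choices):

```python
import random

def split_two_groups(data, seed=0, frac=0.5):
    # Shuffle a copy with a fixed seed so the split is reproducible,
    # then cut it into a model-selection set and a held-out test set.
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * frac)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(20))            # stand-ins for 20 observations
select_set, test_set = split_two_groups(rows)
print(len(select_set), len(test_set))   # 10 10
```

All model-selection decisions are then made looking only at `select_set`, and `test_set` is touched exactly once, at the end, to assess the chosen model.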
 
  • #6
Dale said:
If you do use a data-driven model selection then it is important to partition your data into two random groups. Do your model selection on one group and your model testing on the other group. You need to avoid "double dipping" or using the same data for selecting the model and testing the model.

It is recommended to actually use three groups if the size of your data set permits it: one group for model selection, one group for coefficient estimation, and one group for model testing. If your data set is very small, there are still things you can do, like leave-one-out model testing.
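Leave-one-out testing can be sketched for the simplest case, a one-predictor line fit (pure Python; the toy data lie exactly on a line, so every held-out point is predicted perfectly and the leave-one-out error is essentially zero):

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x, closed form.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def loo_mse(xs, ys):
    # Leave-one-out: refit without point i, predict point i,
    # then average the squared prediction errors over all i.
    errs = []
    for i in range(len(xs)):
        xr, yr = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        a, b = fit_line(xr, yr)
        errs.append((ys[i] - (a + b * xs[i])) ** 2)
    return sum(errs) / len(errs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]   # exactly y = 1 + 2x
print(loo_mse(xs, ys))            # essentially zero
```

The point of the procedure is that each point's prediction error is measured by a model that never saw that point, so even a tiny data set yields an honest out-of-sample error estimate.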
 
  • #7
micromass said:
This is called "model selection". It is indeed common practice to drop insignificant variables; this is done in stepwise, forward, or backward model selection. Just take care that you don't keep an interaction term while dropping one of its main-effect terms; this is called the heredity principle.

When dropping an insignificant variable, you should usually see that the other variables become more significant.

There are many ways to compare models, such as adjusted ##R^2##, Mallows's ##C_p##, AIC, AICc, and BIC; even plain ##R^2## can be used for this with some care.

(Also: do check for multicollinearity using the variance inflation factors (VIF), to see whether the insignificance is caused by that.)

Thank you, I am aware of the general area and methods; I was looking for theorems and specific criteria for deciding which terms to keep or drop.
 
  • #8
Thanks all. I ended up using the best subsets method.
 
  • #9
This is an old thread; I came across it in a search. I am currently taking an online class on analytics modeling from edX and Georgia Tech. Regarding separating training and test data, the TA used an example this week that I think sums it up.

Paraphrasing:
The class has a midterm exam coming up soon. Before the exam, we hand out a sample test so students get an idea of what to expect. If they show up for the actual exam and it has the exact same questions as the sample test, the students will perform much better than what they have actually learned would justify.
Models work the same way. If you test them on the exact same data they were trained on, they will appear much better than they will behave on future real-world data.
 

What is multilinear regression?

Multilinear regression (more commonly called multiple linear regression) is a statistical method used to model the relationship between a dependent variable and multiple independent variables. It is an extension of simple linear regression and allows for the analysis of more complex relationships between variables.
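A minimal sketch of fitting such a model by solving the normal equations ##(X^TX)\beta = X^Ty## directly (pure Python with a hand-rolled 3x3 solver; real work would use a statistics package, and the toy data below lie exactly on a plane so the coefficients are recovered exactly):

```python
def solve3(A, b):
    # Gaussian elimination with partial pivoting for a 3x3 system.
    A = [row[:] for row in A]
    b = b[:]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):   # back substitution
        x[r] = (b[r] - sum(A[r][c] * x[c] for c in range(r + 1, 3))) / A[r][r]
    return x

def fit_two_predictors(x1, x2, y):
    # Solve the normal equations (X^T X) beta = X^T y
    # for the model y = b0 + b1*x1 + b2*x2.
    cols = [[1.0] * len(y), x1, x2]
    XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    Xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    return solve3(XtX, Xty)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y  = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]  # exact plane
print(fit_two_predictors(x1, x2, y))  # ≈ [1.0, 2.0, 3.0]
```

With noisy real data the recovered coefficients would only approximate the underlying ones, and their standard errors are what the p-values discussed below are computed from.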

What is a "p-value" in multilinear regression?

A "p-value" is a measure of the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is true. In multilinear regression, p-values are used to determine the significance of the relationship between the independent variables and the dependent variable.

What does it mean when a p-value is "insignificant" in multilinear regression?

When a p-value is "insignificant", it means that the estimated relationship between the independent variable and the dependent variable is not statistically significant. This could be due to chance, a small sample, or factors not accounted for in the model. It does not necessarily mean that there is no relationship between the variables, but rather that the evidence for one is not strong enough to be considered significant.

How should insignificant p-values be interpreted in multilinear regression?

Insignificant p-values should be interpreted with caution. It is important to consider other factors such as the sample size, the strength of the relationship, and the validity of the model. It is also important to conduct further analysis to determine if there are other variables that may be influencing the relationship.

What are some potential limitations of using multilinear regression and interpreting insignificant p-values?

One potential limitation is that multilinear regression assumes a linear relationship between variables, which may not always be the case. Additionally, the model may not account for all possible variables that could affect the relationship. Interpreting insignificant p-values may also lead to overlooking important relationships between variables and drawing incorrect conclusions from the data.
