Multiple Regression: Steps for Variable Selection

In summary, variable selection in regression analysis uses a procedure such as forward selection, backward elimination, stepwise regression, or best subsets to choose a subset of the candidate variables. At each step the software tests whether a variable contributes significantly given the variables already in the model, and it reports output to help interpret the results. This process helps to screen out variables whose apparent relationship with the dependent variable could be due to chance.
  • #1
alias
Hey everyone,

Does anyone know the steps that Minitab or any other stats software follows to eliminate variables when doing forward selection, backward elimination, standard (general) stepwise regression, or best subsets regression on a set of data? I'm not sure whether the correlation coefficients are tested for significance or whether it's just trial and error. I have to do this by hand and I don't know the steps. I think that:
1) Start with the independent variable that has the highest correlation with the dependent variable.
2) Test the slope of that independent variable for significance (I use the t-stat or p-value with a 0.05 significance level).
3) Take the variable that has the second highest correlation with the dependent variable, test, and so on.
Please tell me if I'm not making sense!
This is the only way I can think of, but it does not match my output from Minitab.
Could someone tell me where I'm going wrong? It would be much appreciated. Thanks.
 
  • #2


Hi there,

The steps for variable selection in regression analysis can vary slightly depending on the software being used, but the general process is as follows:

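1. Define the set of candidate independent variables. Backward elimination starts from the full model containing all of them, forward selection starts from a model with only the intercept, and stepwise regression combines the two by allowing both additions and removals.
2. Use the chosen selection method (forward selection, backward elimination, or stepwise regression) to add or remove one variable at a time.
3. In each step, the software tests each candidate variable given the variables already in the model, then adds the most significant one or removes the least significant one.
4. This process continues until no variable meets the pre-determined entry or removal criterion (such as a significance level or an F-value threshold). Some packages instead compare models with a criterion such as AIC or BIC, and best subsets regression compares all subsets of a given size using measures such as R^2, adjusted R^2, or Mallows' Cp.
5. The software will also provide output such as regression coefficients, significance levels, and goodness-of-fit measures to help interpret the results.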

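In terms of how the software tests for significance, it typically uses t-tests or partial F-tests to determine whether the regression coefficient of each variable, given the other variables in the model, is significantly different from zero. This is not the same as testing each variable's simple correlation with the dependent variable. After the first variable has entered, the next one is chosen by how much it improves the fit beyond what is already in the model, so ranking the variables by their correlation with the dependent variable can give a different order than the software's output. The tests help to screen out variables whose apparent relationship with the dependent variable could be due to chance.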
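To make the difference between ranking by correlation and testing given the current model concrete, here is a minimal sketch of forward selection with a partial F-test, written with NumPy. The function names `rss` and `forward_select`, the synthetic data, and the F-to-enter threshold of 4.0 are all assumptions chosen for illustration; they are not taken from Minitab or any other package, whose defaults and stopping rules may differ.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on an intercept plus columns `cols`."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_select(X, y, f_to_enter=4.0):
    """Add one variable at a time, picking the largest partial F statistic.

    f_to_enter is an illustrative threshold, not a package default.
    """
    n, p = X.shape
    selected = []
    current_rss = rss(X, y, selected)  # intercept-only model
    while len(selected) < p:
        best_j, best_f, best_rss = None, 0.0, None
        for j in range(p):
            if j in selected:
                continue
            new_rss = rss(X, y, selected + [j])
            df_resid = n - (len(selected) + 1) - 1  # n - predictors - intercept
            # Partial F: drop in RSS from adding j, relative to the new residual variance.
            f_stat = (current_rss - new_rss) / (new_rss / df_resid)
            if f_stat > best_f:
                best_j, best_f, best_rss = j, f_stat, new_rss
        if best_j is None or best_f < f_to_enter:
            break  # no remaining variable passes the entry test
        selected.append(best_j)
        current_rss = best_rss
    return selected

# Synthetic data (assumed for illustration): x2 is a noisy copy of x0, so it
# correlates strongly with y but adds little once x0 is in the model.
rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = x0 + 0.6 * rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
y = 3 * x0 + x1 + rng.normal(size=n)

corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
order_by_corr = [int(j) for j in np.argsort(corr)[::-1]]
selected = forward_select(X, y)
print("ranked by |correlation| with y:", order_by_corr)
print("forward selection order:", selected)
```

In this setup x2 ranks second by simple correlation, yet forward selection takes x1 second, because x2 explains almost nothing that x0 has not already explained. Whether x2 enters at all depends on the threshold and on the random sample.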

I hope this helps clarify the process for you. If you have any further questions, please don't hesitate to ask. Best of luck with your analysis!
 

1. What is multiple regression and why is it used?

Multiple regression is a statistical method used to analyze the relationship between a dependent variable and multiple independent variables. It is used to understand how one or more independent variables impact the dependent variable, and to predict the value of the dependent variable based on the values of the independent variables.
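As a small illustration of fitting and predicting, the sketch below fits a model with two independent variables by ordinary least squares using NumPy. The data are made up and constructed to follow y = 1 + 2*x1 - 3*x2 exactly, so the fit should recover those coefficients up to rounding; real data would include noise.

```python
import numpy as np

# Made-up data: two independent variables, five observations.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
# Dependent variable built from a known relationship (no noise, for clarity).
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Design matrix: a column of ones for the intercept, then the predictors.
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # [intercept, b1, b2]

# Predict the dependent variable for a new observation with x1 = 2, x2 = 1.
new_point = np.array([1.0, 2.0, 1.0])  # leading 1.0 multiplies the intercept
prediction = float(new_point @ coef)
print("coefficients:", coef)
print("prediction:", prediction)
```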

2. What are the steps for selecting variables in multiple regression?

The steps for variable selection in multiple regression are:

  1. Identify the research question and variables of interest.
  2. Collect and clean the data.
  3. Perform exploratory data analysis to understand the relationships between variables.
  4. Select potential independent variables based on theoretical knowledge and exploratory analysis.
  5. Use statistical methods, such as forward or backward selection, to select the best combination of variables that explain the variation in the dependent variable.
  6. Evaluate the chosen model for accuracy and reliability.

3. What is the difference between forward and backward selection in multiple regression?

Forward selection is a method for selecting variables in multiple regression where variables are added one at a time, starting with the most significant variable, until no remaining variable is significant when added to the variables already in the model. Backward selection (also called backward elimination), on the other hand, starts with all potential variables and removes them one at a time, starting with the least significant variable, until no more variables can be removed without significantly affecting the model's performance.
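Backward elimination can be sketched in the same spirit. The code below is a minimal illustration that assumes a partial F-to-remove threshold of 4.0 and synthetic data; the names `rss` and `backward_eliminate` are hypothetical helpers, and actual software may use p-values or other criteria instead.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on an intercept plus columns `cols`."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def backward_eliminate(X, y, f_to_remove=4.0):
    """Start from all variables and drop the weakest one until all pass the test.

    f_to_remove is an illustrative threshold, not a package default.
    """
    n, p = X.shape
    kept = list(range(p))
    while kept:
        full_rss = rss(X, y, kept)
        df_resid = n - len(kept) - 1
        # Partial F for each variable: increase in RSS if it alone is dropped.
        f_stats = {
            j: (rss(X, y, [k for k in kept if k != j]) - full_rss)
               / (full_rss / df_resid)
            for j in kept
        }
        weakest = min(f_stats, key=f_stats.get)
        if f_stats[weakest] >= f_to_remove:
            break  # every remaining variable is significant
        kept.remove(weakest)
    return kept

# Synthetic data (assumed): only the first two columns affect y.
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

kept = backward_eliminate(X, y)
print("variables kept:", kept)
```

Forward and backward procedures do not always end at the same subset, which is one reason stepwise regression allows both additions and removals.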

4. What are some techniques for evaluating the accuracy of a multiple regression model?

There are several techniques for evaluating the accuracy of a multiple regression model, including:

  • Calculating the coefficient of determination (R^2) to determine the proportion of variation in the dependent variable explained by the model.
  • Performing a hypothesis test on the model's coefficients to determine if they are significantly different from zero.
  • Conducting residual analysis to check for patterns and outliers in the model's errors.
  • Using cross-validation to assess the model's predictive ability on new data.
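A few of these checks can be computed directly. The sketch below, assuming synthetic data and hypothetical helper names (`fit_ols`, `predict`, `r_squared`, `adjusted_r_squared`, `cv_rmse`), computes R^2, adjusted R^2 (which penalizes extra variables), and a 5-fold cross-validated root mean squared error with NumPy.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ..., bp]."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict(beta, X):
    return beta[0] + X @ beta[1:]

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
    return float(1 - ss_res / ss_tot)

def adjusted_r_squared(r2, n, p):
    """Adjust R^2 for the number of predictors p and sample size n."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def cv_rmse(X, y, k=5, seed=0):
    """k-fold cross-validated RMSE: fit on k-1 folds, score on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    sq_errors = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for m, f in enumerate(folds) if m != i])
        beta = fit_ols(X[train_idx], y[train_idx])
        sq_errors.append((y[test_idx] - predict(beta, X[test_idx])) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq_errors))))

# Synthetic data (assumed): the third column is unrelated to y.
rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(size=n)

beta = fit_ols(X, y)
r2 = r_squared(y, predict(beta, X))
adj_r2 = adjusted_r_squared(r2, n, X.shape[1])
rmse = cv_rmse(X, y)
print("R^2:", r2, "adjusted R^2:", adj_r2, "5-fold CV RMSE:", rmse)
```

R^2 computed on the fitting data never decreases when a variable is added, so adjusted R^2 or the cross-validated error is usually the more useful number when comparing models of different sizes.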

5. What are some common pitfalls to avoid when performing variable selection in multiple regression?

Some common pitfalls to avoid when performing variable selection in multiple regression include:

  • Selecting variables based on statistical significance alone without considering their practical significance or theoretical relevance.
  • Including highly correlated variables, which can lead to multicollinearity and inaccurate results.
  • Overfitting the model by including too many variables, which can lead to poor performance on new data.
  • Not considering the assumptions of multiple regression, such as linearity, normality of the residuals, and homoscedasticity.
  • Ignoring potential interactions between variables, which may be important for accurately predicting the dependent variable.
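One common way to check for the multicollinearity mentioned above is the variance inflation factor (VIF), which is 1 / (1 - R_j^2), where R_j^2 comes from regressing variable j on the other independent variables. The sketch below is a minimal NumPy version on synthetic data; the function name `vif` is a hypothetical helper, and the cutoff above which a VIF counts as a problem is a judgment call (values of 5 or 10 are often quoted as rules of thumb).

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X."""
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on an intercept plus all other columns.
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(float(1 / (1 - r2)))
    return out

# Synthetic data (assumed): x2 is nearly a copy of x0, x1 is independent.
rng = np.random.default_rng(3)
n = 200
x0 = rng.normal(size=n)
x1 = rng.normal(size=n)
x2 = x0 + 0.1 * rng.normal(size=n)
X = np.column_stack([x0, x1, x2])

vifs = vif(X)
print("VIFs:", vifs)
```

Here x0 and x2 should show very large VIFs while x1 stays close to 1, which signals that x0 and x2 carry nearly the same information.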
