Hello, I have created a multiple linear regression fit (using least squares) for a project. The regression has two independent variables, rainfall and time, and fits these to groundwater level. The regression was calculated automatically in Excel. I have been asked to report on the P-value and F-statistic for this regression (both generated automatically by the program). I have read some good explanations of the P-value, but cannot find any simple explanation of the F-test or F-statistic. Can anyone provide or recommend a simple explanation?
The F value is the value of the F statistic for a joint test of all the independent-variable coefficients ("the betas"): the null hypothesis is that they are all zero, and the alternative is that at least one of them is different from zero. With a single independent variable, the F test reduces to testing whether that variable's coefficient (the beta) is different from zero. This is the same hypothesis tested by the t Stat for the beta. Try the following: drop one of your independent variables; then verify that F = (t Stat)^2 and |t Stat| = SQRT(F).
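The check suggested above (with one predictor, F equals the square of the t Stat) can also be sketched outside Excel. The following is only an illustration in plain Python: the data values are made up for the example, and the variable names are arbitrary.

```python
import math

# Made-up illustrative data (NOT from the original project):
# one predictor x and one response y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.2, 7.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sst = sum((yi - y_bar) ** 2 for yi in y)  # total sum of squares

# Least-squares slope and intercept.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual sum of squares and mean squared error (n - 2 degrees of freedom).
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)

# t statistic for the slope, and the overall F statistic (1 predictor).
t_stat = slope / math.sqrt(mse / sxx)
f_stat = ((sst - sse) / 1) / mse

print("t^2 =", t_stat ** 2)
print("F   =", f_stat)
```

The two printed numbers should agree up to floating-point rounding, which is the relationship F = (t Stat)^2 described in the post.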
Remember that in regression you're investigating whether the independent variables provide any useful information for you to use in the prediction of the mean value of Y (this is a simplified comment, but we're talking about linear regression so it works). The basic hypotheses for the F-test are [tex] \begin{align*} H_0 \colon & \beta_1 = \beta_2 = 0 \\ H_a \colon & \text{At least one coefficient is not zero} \end{align*} [/tex] If the null hypothesis is true you're left with the result that the best way to estimate the mean value of Y is with the ordinary sample mean. If the alternative hypothesis is true then you can say your data indicate the mean value of Y is not constant but varies in a way consistent with your model. In short: the F-test provides a way to decide which of two models (constant mean vs. variable mean) better describes the variable Y.
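To make the "constant mean vs. variable mean" comparison concrete, here is a rough sketch in plain Python. It uses the standard form of the overall F statistic, F = ((SST - SSE)/k) / (SSE/(n - k - 1)), where SST is the squared error of the constant-mean model, SSE is the squared error of the regression, k is the number of predictors and n the number of observations. The rainfall, time and level numbers are invented purely for illustration and the helper `centered` is a hypothetical name.

```python
# Made-up illustrative data (NOT the original poster's measurements).
rainfall = [12.0, 30.0, 8.0, 45.0, 22.0, 15.0, 38.0, 5.0, 27.0, 33.0]
time = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
level = [10.4, 11.5, 9.8, 12.0, 10.5, 9.9, 11.1, 8.9, 10.2, 10.4]


def centered(values):
    """Return the values with their mean subtracted (hypothetical helper)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


x1, x2, yc = centered(rainfall), centered(time), centered(level)

# Sums of squares and cross-products of the centered variables.
s11 = sum(a * a for a in x1)
s22 = sum(b * b for b in x2)
s12 = sum(a * b for a, b in zip(x1, x2))
s1y = sum(a * v for a, v in zip(x1, yc))
s2y = sum(b * v for b, v in zip(x2, yc))

# Solve the 2x2 normal equations for the two slopes (Cramer's rule).
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

# Constant-mean model: every prediction is the sample mean of Y.
sst = sum(v * v for v in yc)

# Regression model: residuals after removing both predictors.
residuals = [v - b1 * a - b2 * b for v, a, b in zip(yc, x1, x2)]
sse = sum(r * r for r in residuals)

n, k = len(level), 2
f_stat = ((sst - sse) / k) / (sse / (n - k - 1))
print("F =", f_stat, "with", k, "and", n - k - 1, "degrees of freedom")
```

A large F means the regression removes much more squared error than you would expect from adding two useless predictors, which is evidence against the constant-mean model.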
I used to teach statistics to chemists (wannabe physicists that couldn't handle the math <G>) and found that a particular thought experiment exploring HOW the F-distribution might be generated was useful. It goes like this: Suppose you have a very large container of ball bearings. You extract 3 bearings ("randomly") and measure their average weight. Now you extract 5 bearings and measure the average weight of those five. You calculate the ratio of the two averages. You continue this process of calculating 5 & 3 average ball bearing weight ratios and build a histogram. After you do this an "infinite" number of times you have a facsimile of the F(5,3) distribution (sort of; strictly, the F distribution describes ratios of averaged squared deviations rather than of raw average weights, which is where the squared residuals below come in). Now the key idea: I pull three ball bearings from my pocket and ask you, "What is the probability that those three ball bearings came from the big container?" The way you answer my question is to grab 5 bearings (at "random") from the big container and average their weight. That average weight is compared to the average weight of the three bearings I pulled from my pocket. You then locate this "experimental" ratio on the histogram you so laboriously constructed. Since the histogram is a picture of a probability distribution, you can read off how plausible it is that the three ball bearings were pulled from the "big container". In other words, what are the chances that the "big container" could produce a 5 - 3 ratio at least as extreme as the one you measured using the three from my pocket? Substitute residuals for ball bearings. The "big container" contains "random" error. Why square the residuals before consulting an F-distribution? Because it gets rid of negative numbers, which could otherwise cancel and produce zero values upon averaging. Why not use 4th powers of residuals instead of squares? You could, but you would have to construct the distribution yourself; the distribution for squared values is already made for you. I hope that helps a little. It is not rigorous but I do not think you are looking for rigor.
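The histogram-building thought experiment can be tried on a computer. What follows is only a sketch under stated assumptions: the "big container" is assumed to hold standard normal random errors, each draw is squared before averaging (as with squared residuals), and the function names are made up for this example. Under those assumptions the ratio of a 5-draw mean square to a 3-draw mean square follows the F(5,3) distribution.

```python
import random


def mean_square(n, rng):
    """Average of n squared draws of standard normal 'random error'."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n


def simulate_f_ratios(n_num, n_den, trials, seed=0):
    """Repeat the 'draw, average, take the ratio' step many times."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [mean_square(n_num, rng) / mean_square(n_den, rng)
            for _ in range(trials)]


def tail_fraction(ratios, observed):
    """Fraction of simulated ratios at least as large as the observed one."""
    return sum(r >= observed for r in ratios) / len(ratios)


# Build the "histogram" (here just a list of simulated ratios).
ratios = simulate_f_ratios(5, 3, trials=20000, seed=0)

# 9.01 is approximately the tabulated 5% critical value of F(5,3),
# so the simulated tail fraction should come out close to 0.05.
tail = tail_fraction(ratios, 9.01)
print("fraction of ratios >= 9.01:", tail)
```

Replacing 9.01 with the ratio you actually measured gives a simulated version of the P-value: the chance that pure random error would produce a ratio at least that large.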