A Collinearity Confusion

  1. May 25, 2017 #1
    Say I want to investigate how the crime rate depends on the density of police in a city. The sampling unit is the city.

    A problem I saw claims that we can model it by
    ##\log(\mu_{i})\sim PoliceDensity_{i}+\log(PopulationSize_{i})##

    Where ##\mu_{i}## is the mean count of crime in the ith city.

    The last term is the offset term.
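
    As a rough sketch of how a model like this might be fitted in R: the data below are simulated, and the variable names, the unit for police density and the "true" coefficients (-5 and -0.3) are all assumptions made up for illustration, not part of the original problem.

```r
# Simulated city-level data; every number here is invented for illustration.
set.seed(1)
n_cities <- 200
population <- round(exp(rnorm(n_cities, mean = 11, sd = 1)))  # city sizes
police_density <- runif(n_cities, min = 1, max = 4)           # assumed unit: officers per 1000 residents

# Assumed true model: log(mu) = -5 - 0.3 * density + log(population)
mu <- exp(-5 - 0.3 * police_density) * population
crimes <- rpois(n_cities, lambda = mu)

# offset() fixes the coefficient of log(population) at 1, so the estimated
# coefficients describe the log of the crime rate per resident.
fit <- glm(crimes ~ police_density + offset(log(population)),
           family = poisson)
print(coef(fit))
```

    Because the offset's coefficient is fixed at 1 rather than estimated, only the intercept and the density coefficient appear in `coef(fit)`.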

    First question,

    Does this assume that there is no collinearity when the model is put together? I mean, density presumably is (number of police officers / population size).

    So that means the offset term and the density term are functions of each other. Is that OK? If I want to avoid collinearity, would I just have to be sure that the density isn't related to the population size in a linear way? For example, if I have ##x## and ##x^2## as predictors in a regression, then that should be OK, right? As long as it's not ##x## and ##x##.
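
    The ##x## versus ##x^2## case can be checked directly. A minimal sketch on simulated data (the response `y`, its coefficients and the rescaled copy `x_copy` are hypothetical, chosen only to illustrate the point):

```r
set.seed(2)
x <- 1 + rnorm(500)
y <- 2 + 0.5 * x + 0.25 * x^2 + rnorm(500)   # made-up response

# x and x^2 are correlated but not perfectly collinear,
# so both coefficients can be estimated.
fit_poly <- lm(y ~ x + I(x^2))
print(coef(fit_poly))

# A rescaled copy of x is an exact linear function of x, so lm() cannot
# separate the two terms and reports NA for the redundant one.
x_copy <- 3 * x
fit_dup <- lm(y ~ x + x_copy)
print(coef(fit_dup))

print(cor(x, x^2))   # high, but below 1
```

    Correlated predictors such as ##x## and ##x^2## can still inflate standard errors, but only an exact linear dependence makes a coefficient impossible to estimate.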

    Second question,
    OK, now I want to update the model to include the proportion of impoverished citizens in a city. I'm supposed to figure out the new model:

    ##\log(\mu_{i})\sim PoliceDensity_{i}+PovertyRate_{i}+\log(PopulationSize_{i})##

    where ##PovertyRate_{i}## is the proportion of residents living in poverty in the ##i##th city.

    Now I'm guessing that this is the wrong way to go about seeing whether poverty rate is associated with crime because, intuitively, the rate would depend on city size: large urban cities have more crime in general. Assuming that, would having both the poverty rate and the offset, log(Population Size), give redundant information? But then again, it might not, as it's not redundant in a linear sense.

    On the other hand, ##PovertyRate_{i}## and ##PoliceDensity_{i}## might be linearly redundant, as high-crime areas would have more police per area.

    Or it could just be confounding, without being so strong as to be redundant. Poverty rate could be associated with both the crime rate and police density, making it a potential confounder.
    The problem is, it could be that ##PovertyRate \to CrimeRate \to PoliceDensity##, putting the response on the causal pathway to density. Would this be something to worry about?
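
    To illustrate the confounding scenario only (not the reverse-causation one), here is a hedged simulation sketch. Everything in it is an assumption: poverty is made to drive both crime and police density, and police density is given no effect on crime at all.

```r
set.seed(3)
n_cities <- 500
population <- round(exp(rnorm(n_cities, mean = 11, sd = 1)))
poverty_rate <- runif(n_cities, min = 0.05, max = 0.35)

# Assumption: poorer cities tend to have higher police density.
police_density <- 1 + 5 * poverty_rate + rnorm(n_cities, sd = 0.3)

# Assumed truth: poverty raises the crime rate; police density has no effect.
mu <- exp(-6 + 4 * poverty_rate) * population
crimes <- rpois(n_cities, lambda = mu)

# Omitting the confounder: density picks up the effect of poverty.
fit_without <- glm(crimes ~ police_density + offset(log(population)),
                   family = poisson)
# Adjusting for the confounder: the density coefficient shrinks toward 0.
fit_with <- glm(crimes ~ police_density + poverty_rate + offset(log(population)),
                family = poisson)

print(coef(fit_without)["police_density"])
print(coef(fit_with)["police_density"])
```

    In this made-up setup, adjusting for poverty removes the spurious density effect. If crime itself drives police density, as in the pathway above, adding covariates would not fix the problem in the same way.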
     
    Last edited: May 25, 2017
  3. May 25, 2017 #2

    andrewkirk

    Science Advisor
    Homework Helper
    Gold Member

    There is not necessarily any problem with one regressor being expressible as a function of others. Any quantity can be expressed as a function of any other quantity, for instance ##a## and ##b## are related by the equation ##a=b+c## where ##c## is a variable that has value ##a-b##. What is important is whether the regressors are correlated.

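    A non-linear relationship does not mean there is no correlation. For instance, the following R model
    Code (Text):

    # 1000 standard normal draws shifted to have mean 1
    x <- 1 + rnorm(1000)
    # regress x^2 on x; the left-hand side of the formula is evaluated as ordinary arithmetic
    summary(lm(x^2 ~ x))
     
    gives a significant positive correlation: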
    Code (Text):

    Call:
    lm(formula = x^2 ~ x)

    Residuals:
        Min      1Q  Median      3Q     Max
    -1.0236 -0.9215 -0.5854  0.4238  8.4289

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
    (Intercept)  0.09604    0.06160   1.559    0.119  
    x            1.92625    0.04318  44.609   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 1.38 on 998 degrees of freedom
    Multiple R-squared:  0.666,   Adjusted R-squared:  0.6657
    F-statistic:  1990 on 1 and 998 DF,  p-value: < 2.2e-16
     
     
  4. May 26, 2017 #3

    So is it because x squared and x are similar in the sense that they both trend upward? But it is weird how small the p-value is. With ##mx+b##, no matter what ##m## and ##b## are, we won't be able to approximate x squared.

    Oh, I think I see. The p-value is going to be small no matter what because x^2 is directly related to x. So this means that statistical significance (i.e. associations) won't be affected if we misspecify the functional form.

    It's only when there is no way to build a relation, not even approximately, that the p-value is not significant.

    Is this correct?
     
    Last edited: May 26, 2017
  5. May 26, 2017 #4

    andrewkirk

    Science Advisor
    Homework Helper
    Gold Member

    It's more about whether there is a clear trend in the data. If there is a clear trend that the dependent variable increases (decreases) as the regressor does, then there will be a small p-value on a positive (negative) coefficient. It is not necessary for that trend to be linear, and the trend does not imply that we can use a linear model to get a good predictor - which we can't in the x-squared case.

    Conversely, the lack of a linear trend doesn't mean we can't get a good predictor. If I change the mean of the random regressors above from 1 to 0, we get an insignificant slope coefficient, despite having a perfect nonlinear relationship:
    Code (Text):

    # same regression, but with the regressor centered at 0 instead of 1
    x <- 0 + rnorm(1000)
    summary(lm(x^2 ~ x))
     
    results:
    Code (Text):

    Call:
    lm(formula = x^2 ~ x)

    Residuals:
        Min      1Q  Median      3Q     Max
    -0.9110 -0.8254 -0.4780  0.2661 10.7322

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
    (Intercept)  0.91011    0.04222  21.555   <2e-16 ***
    x           -0.05905    0.04421  -1.336    0.182  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 1.334 on 998 degrees of freedom
     
    That's because for about half of the regressor values (the ones less than 0) the result variable is decreasing and for the other half it is increasing.
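
    The estimates in both runs are close to what the population moments predict. A short derivation, assuming ##x \sim N(m, 1)## as in the two `rnorm` calls and using ##E[x^2] = m^2 + 1##, ##E[x^3] = m^3 + 3m##:

```latex
\begin{align*}
\operatorname{Cov}(x, x^2) &= E[x^3] - E[x]\,E[x^2] = (m^3 + 3m) - m(m^2 + 1) = 2m, \\
\text{slope} &= \frac{\operatorname{Cov}(x, x^2)}{\operatorname{Var}(x)} = \frac{2m}{1} = 2m, \\
\text{intercept} &= E[x^2] - \text{slope}\cdot E[x] = (m^2 + 1) - 2m^2 = 1 - m^2 .
\end{align*}
```

    With ##m = 1## this gives slope 2 and intercept 0 (the first run estimated 1.926 and 0.096); with ##m = 0## it gives slope 0 and intercept 1 (the second run estimated -0.059 and 0.910).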
     
  6. May 26, 2017 #5
    Ah, that makes sense. So when x is centered at one, it is just more likely to take positive values, and the p-values, which are calculated from the x's and y's, would indicate that. The data frame, even though you didn't make one, would hold the numbers and their squares. So it is a local calculation?
    So inferences are only valid if we don't stray too far from the centered values?


    Also, on a similar topic: if I just want to see if, say, weight is associated with height, I could just regress height on weight (and potential confounders, of course) and look at the p-value without caring about functional form, other than shifting it so that increasing matches with increasing, etc., since I just want to test associations. True?

    But if I want to predict height, then I do care about functional form of all the predictors because I want to get the best fit possible. Is that logic valid?
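
    One way to probe the association-versus-prediction distinction is to reuse the x-squared toy example from earlier in the thread. This is only a sketch on simulated data; the helper `rmse` and the object names are hypothetical.

```r
set.seed(4)
x <- 1 + rnorm(1000)
y <- x^2                          # exact quadratic relationship, as in the thread

fit_linear <- lm(y ~ x)           # misspecified: straight line only
fit_quad <- lm(y ~ x + I(x^2))    # correct functional form

# The misspecified fit still detects the association (tiny p-value for x) ...
p_linear <- summary(fit_linear)$coefficients["x", "Pr(>|t|)"]

# ... but its in-sample prediction error is far larger than the quadratic fit's.
rmse <- function(fit) sqrt(mean(residuals(fit)^2))
print(c(p_linear = p_linear,
        rmse_linear = rmse(fit_linear),
        rmse_quad = rmse(fit_quad)))
```

    The earlier mean-0 run shows the other side: with the regressor centered at 0, a purely linear term misses the association entirely, so ignoring functional form is not always safe even for testing.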
     
    Last edited: May 26, 2017