Should I be concerned about collinearity in my crime rate prediction model?

In summary: the key point is not whether one regressor can be expressed as a function of the others, but whether the regressors are actually correlated with one another and how they relate to the dependent variable. As long as there is a clear monotone trend, the p-value on the linear term will typically be significant even if the functional form is misspecified.
  • #1
FallenApple
Say I want to investigate the rate of crime as a function of the density of police inside a city. The sampling unit is the city.

According to a problem I saw, we can model it by
##log(\mu_{i})\sim PoliceDensity_{i}+log(PopulationSize_{i})##

Where ##\mu_{i}## is the mean count of crime in the ith city.

The last term is the offset term.
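For concreteness, I imagine this would be fit in R along the lines of the sketch below. The data frame city_data and its columns crimes, police_density, and population are made up just to show where the offset goes.
Code:
# Hypothetical sketch: Poisson regression of crime counts with a
# log-population offset, one row per city in city_data.
fit <- glm(crimes ~ police_density + offset(log(population)),
           family = poisson, data = city_data)
summary(fit)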

First question:

Does this assume that there isn't collinearity when the model is put together? I mean, density is presumably (number of police officers / population size).

So that means that the offset term and the density term are a function of each other. Is that ok? If I want to avoid collinearity, would I just have to make sure that the density isn't related to the population size in a linear way? For example, if I have ##x## and ##x^2## as predictors in a regression, then that should be ok, right? As long as it's not ##x## and ##x##.

Second question:
Now I want to update the model to include the proportion of impoverished citizens in a city. The new model, which I am supposed to figure out, is

##log(\mu_{i})\sim PoliceDensity_{i}+PovertyRate_{i}+log(PopulationSize_{i})##

where ##PovertyRate_{i}## is the proportion of impoverished citizens in the ith city.
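Continuing the hypothetical sketch from above (again, the column names are made up), the updated fit would just add the new term:
Code:
# Same hypothetical city_data, now with a poverty_rate column.
fit2 <- glm(crimes ~ police_density + poverty_rate + offset(log(population)),
            family = poisson, data = city_data)
summary(fit2)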

Now I'm guessing that this is the wrong way to go about seeing whether poverty rate is associated with crime, because intuitively the rate would depend on city size: large urban cities have more crime in general. Assuming that, would having both the poverty rate and the offset log(PopulationSize) give redundant information? But then again, it might not, since it isn't redundant in a linear sense.

On the other hand, ##PovertyRate_{i}## and ##PoliceDensity_{i}## might be linearly redundant, as high-crime areas would have more police per capita.

Or it could just be confounding rather than redundancy. Poverty rate could be associated with the rate of crime and also associated with police density, making it a potential confounder.
The problem is, it could be that ##PovertyRate \rightarrow CrimeRate \rightarrow PoliceDensity##, putting the response inside the causal pathway to police density. Would this be something to worry about?
 
  • #2
FallenApple said:
Does this assume that there isn't collinearity when the model is put together? I mean, density is presumably (number of police officers / population size).

So that means that the offset term and the density term are a function of each other. Is that ok? If I want to avoid collinearity, would I just have to make sure that the density isn't related to the population size in a linear way? For example, if I have ##x## and ##x^2## as predictors in a regression, then that should be ok, right? As long as it's not ##x## and ##x##.
There is not necessarily any problem with one regressor being expressible as a function of others. Any quantity can be expressed as a function of any other quantity, for instance ##a## and ##b## are related by the equation ##a=b+c## where ##c## is a variable that has value ##a-b##. What is important is whether the regressors are correlated.

A non-linear relationship does not mean there is no correlation. For instance, the following R model
Code:
x <- 1 + rnorm(1000)   # 1000 draws from a normal with mean 1
summary(lm(x^2 ~ x))   # regress x^2 on x
gives a significant positive correlation
Code:
Call:
lm(formula = x^2 ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0236 -0.9215 -0.5854  0.4238  8.4289

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.09604    0.06160   1.559    0.119  
x            1.92625    0.04318  44.609   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.38 on 998 degrees of freedom
Multiple R-squared:  0.666,   Adjusted R-squared:  0.6657
F-statistic:  1990 on 1 and 998 DF,  p-value: < 2.2e-16
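(As a quick sanity check on that estimate, not part of the output above: for ##x \sim N(\mu,\sigma^2)##, ##cov(x,x^2)=E[x^3]-E[x]E[x^2]=2\mu\sigma^2##, so the population slope of this regression is ##2\mu\sigma^2/\sigma^2 = 2\mu##. With ##\mu=1## and ##\sigma=1## that is 2, which matches the estimate of about 1.93.)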
 
  • #3
andrewkirk said:
There is not necessarily any problem with one regressor being expressible as a function of others. Any quantity can be expressed as a function of any other quantity, for instance ##a## and ##b## are related by the equation ##a=b+c## where ##c## is a variable that has value ##a-b##. What is important is whether the regressors are correlated.

A non-linear relationship does not mean there is no correlation.
So is it because x squared and x are similar in the sense that they both trend upward? But it is weird how small the p-value is. If I have ##mx+b##, no matter what m and b are, we won't be able to approximate x squared.

Oh, I think I see. The p-value is going to be small no matter what because x^2 is directly related to x. So this means that statistical significance (i.e., association) won't be affected if we misspecify the functional form.

It's only when there is no way to build a relation, not even approximately, that the p-value is not significant.

Is this correct?
 
  • #4
It's more about whether there is a clear trend in the data. If there is a clear trend that the dependent variable increases (decreases) as the regressor does, then there will be a highly significant p-value on a positive (negative) coefficient. It is not necessary for that trend to be linear, and the trend does not imply that we can use a linear model to get a good predictor - which we can't in the x-squared case.

Conversely, the lack of a linear trend doesn't mean we can't get a good predictor. If I change the mean of the random regressors above from 1 to 0, we get an insignificant slope coefficient, despite having a perfect nonlinear relationship:
Code:
x <- 0 + rnorm(1000)   # same as before, but now with mean 0
summary(lm(x^2 ~ x))
results:
Code:
Call:
lm(formula = x^2 ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9110 -0.8254 -0.4780  0.2661 10.7322

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.91011    0.04222  21.555   <2e-16 ***
x           -0.05905    0.04421  -1.336    0.182   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.334 on 998 degrees of freedom
That's because for about half of the regressor values (the ones less than 0) the result variable is decreasing and for the other half it is increasing.
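(Tying this back to the earlier sanity check: with ##x \sim N(\mu,\sigma^2)## the population slope of ##x^2## on ##x## is ##2\mu##, so moving the mean from 1 to 0 makes the true slope exactly zero, leaving only noise in the linear term.)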
 
  • #5
andrewkirk said:
It's more about whether there is a clear trend in the data. If there is a clear trend that the dependent variable increases (decreases) as the regressor does, then there will be a strong p value on a positive (negative) coefficient. It is not necessary for that trend to be linear, and the trend does not imply that we can use a linear model to get a good predictor - which we can't in the x-squared case.

Conversely, the lack of a linear trend doesn't mean we can't get a good predictor. If I change the mean of the random regressors above from 1 to 0, we get an insignificant slope coefficient, despite having a perfect nonlinear relationship. That's because for about half of the regressor values (the ones less than 0) the result variable is decreasing and for the other half it is increasing.

Ah, that makes sense. So when x is centered at one, it is just more likely to have positive values, and the p-values, which are calculated from the x's and y's, would reflect that. The data frame, even though you didn't make one, would have the numbers and their squares. So is it a local calculation?
So inferences are only valid if we don't stray too far from the centered values? Also, on a similar topic: if I just want to see whether, say, weight is associated with height, I could just regress height on weight (and potential confounders, of course) and look at the p-value without caring about the functional form, other than shifting it so that increasing matches with increasing, etc., since I just want to test associations. True?

But if I want to predict height, then I do care about the functional form of all the predictors, because I want the best fit possible. Is that logic valid?
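To make the question concrete, here is the kind of situation I have in mind (a made-up sketch, not from the replies above): a monotone but nonlinear relationship, where the linear term is highly significant for testing association, yet the right functional form predicts much better.
Code:
# Made-up sketch: monotone but nonlinear relationship.
set.seed(1)
x <- rnorm(1000)
y <- x^3 + rnorm(1000)
summary(lm(y ~ x))           # slope on x is highly significant (association)
summary(lm(y ~ x + I(x^3)))  # the right functional form fits much better (prediction)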
 

1. What is collinearity?

Collinearity is a statistical term that refers to the high correlation between two or more predictor variables in a regression model. It indicates that the predictor variables are highly related to each other, which can cause issues in the model's accuracy and interpretation.

2. Why should I be concerned about collinearity in my crime rate prediction model?

Collinearity can lead to inflated standard errors and unstable coefficients, making it difficult to identify the true effects of each predictor variable on the outcome variable. This can result in inaccurate predictions and misleading conclusions about the relationship between the variables.

3. How can I detect collinearity in my crime rate prediction model?

There are several methods for detecting collinearity, such as calculating the correlation coefficients between predictor variables, computing variance inflation factors (VIF), or examining the tolerance values (the reciprocal of the VIF). These methods can help identify which variables have a high level of collinearity and may need to be addressed in the model.
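The thread itself doesn't show code for this, but a minimal base-R sketch of variance inflation factors (with made-up data) could look like the following:
Code:
# Sketch: VIF for each column of a numeric data frame of predictors,
# obtained by regressing that column on all of the others.
vif_by_hand <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(X), v), response = v),
                     data = X))$r.squared
    1 / (1 - r2)
  })
}

# Made-up example: x2 is nearly a copy of x1, x3 is unrelated.
set.seed(1)
d <- data.frame(x1 = rnorm(100), x3 = rnorm(100))
d$x2 <- d$x1 + rnorm(100, sd = 0.1)
vif_by_hand(d)  # x1 and x2 get large VIFs, x3 stays near 1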

4. Can collinearity be corrected in a crime rate prediction model?

While it is not possible to completely eliminate collinearity, there are steps that can be taken to reduce its impact on the model. This includes removing highly correlated variables, transforming variables, or using regularization techniques such as ridge regression or LASSO to penalize the effect of collinear variables.
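As a rough illustration of the regularization route (this assumes the glmnet package, which is not mentioned above), ridge regression keeps both correlated predictors but shrinks their coefficients:
Code:
# Sketch only: ridge regression on two nearly collinear predictors.
library(glmnet)
set.seed(2)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)    # nearly collinear with x1
x3 <- rnorm(100)
X <- cbind(x1, x2, x3)
y <- 2 * x1 + rnorm(100)
fit <- cv.glmnet(X, y, alpha = 0)  # alpha = 0 selects the ridge penalty
coef(fit, s = "lambda.min")        # shrunken coefficients for x1 and x2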

5. Are there any consequences of not addressing collinearity in my crime rate prediction model?

Ignoring collinearity in a regression model can lead to biased and unstable coefficients, making it difficult to accurately interpret the relationship between the predictor variables and the outcome variable. This can also result in unreliable predictions and potentially misleading conclusions about the factors influencing crime rates.
