Should I be concerned about collinearity in my crime rate prediction model?

  • Context: Graduate
  • Thread starter: FallenApple
  • Tags: Confusion
Discussion Overview

The discussion revolves around concerns regarding collinearity in a crime rate prediction model that incorporates police density and population size. Participants explore the implications of including additional variables, such as poverty rate, and the potential for redundancy or confounding effects in the model.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants question whether the model assumes no collinearity, noting that police density is derived from the number of police officers and population size, which could create a functional relationship between the offset and density terms.
  • There is a suggestion that avoiding collinearity may require ensuring that police density is not linearly related to population size, with examples provided regarding the use of polynomial terms in regression.
  • Participants discuss the potential redundancy of including both poverty rate and population size in the model, with some arguing that they may provide overlapping information regarding crime rates.
  • Concerns are raised about the possibility of confounding variables, where poverty rate could influence both crime rates and police density, complicating the interpretation of the model.
  • Some participants emphasize that a non-linear relationship does not eliminate correlation, using statistical examples to illustrate how significant relationships can exist even when the functional form is mis-specified.
  • There is a discussion about the importance of identifying clear trends in the data, noting that a lack of linearity does not preclude the ability to make accurate predictions.

Areas of Agreement / Disagreement

Participants express differing views on the implications of collinearity and redundancy in the model. While some agree on the potential issues with including certain variables, others present counterarguments, leading to an unresolved discussion regarding the best approach to model specification.

Contextual Notes

Limitations include the potential for unaddressed assumptions about the relationships between variables, the dependence on specific definitions of terms like police density and poverty rate, and unresolved questions about the functional forms used in the model.

FallenApple
Messages
564
Reaction score
61
Say I want to investigate the rate of crime by the density of police inside a city. The sampling unit is the city.

So according to a problem I saw, we can model it by
##\log(\mu_{i})\sim PoliceDensity_{i}+\log(PopulationSize_{i})##

where ##\mu_{i}## is the mean crime count in the ith city.

The last term is the offset term.

First question,

does this assume that there isn't collinearity when this model is put together? I mean, density presumably is (number of police officers / population size).

So the offset term and the density term are functions of each other. Is that OK? If I want to avoid collinearity, would I just have to be sure that the density isn't related to the population size in a linear way? For example, if I have ##x## and ##x^2## as predictors in a regression, that should be fine, right? As long as it's not ##x## and ##x##.
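A quick numerical check of the x-versus-x² point (a Python sketch; the seed, sample size, and variable names are my own choices, not from the problem): a design matrix containing ##x## and ##x^2## stays full rank, so the coefficients are estimable even though the two columns can be strongly correlated; repeating ##x## (or any exact multiple of it) is what actually breaks the fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1 + rng.standard_normal(1000)  # centered at 1, like the later R example

# Correlation between x and x^2: strong but not perfect,
# so the design matrix [1, x, x^2] is still full rank.
r = np.corrcoef(x, x**2)[0, 1]
X = np.column_stack([np.ones_like(x), x, x**2])
print(round(r, 2), np.linalg.matrix_rank(X))  # high r, rank 3

# Perfect collinearity: adding 2*x alongside x makes the matrix
# rank-deficient, and the OLS coefficients are no longer identified.
X_bad = np.column_stack([np.ones_like(x), x, 2 * x])
print(np.linalg.matrix_rank(X_bad))  # rank 2
```

So the functional dependence by itself is not the problem; an exact linear dependence among the columns is.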

Second question:
OK, now I want to update the model to include the proportion of impoverished citizens in each city. So the new model, which I'm supposed to figure out, is

##\log(\mu_{i})\sim PoliceDensity_{i}+PovertyRate_{i}+\log(PopulationSize_{i})##

where ##PovertyRate_{i}## is the proportion of the ith city's population living in poverty.

Now I'm guessing that this is the wrong way to see whether poverty rate is associated with crime, because intuitively the rate would depend on city size. Large urban cities have more crime in general. So, assuming that, wouldn't having both the poverty rate and the offset ##\log(PopulationSize)## give redundant information? But then again, it might not, as it's not redundant in a linear sense.

On the other hand, ##PovertyRate_{i}## and ##PoliceDensity_{i}## might be linearly redundant, as high-crime areas would have more police per area.

Or it could just be confounding, such that it isn't so much redundant: poverty rate could be associated with both the crime rate and police density, making it a potential confounder.
The problem is, it could be that ##PovertyRate \rightarrow CrimeRate \rightarrow PoliceDensity##, putting the response inside the causal pathway to density. Would this be something to worry about?
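To make the causal-pathway worry concrete, here is a toy Python simulation (every coefficient and noise level is invented purely for illustration): if poverty drives crime and police density responds to crime, poverty rate and police density come out strongly correlated even though there is no direct link between them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical causal chain: poverty -> crime -> police density,
# with no direct poverty -> police link.
poverty = rng.uniform(0.05, 0.40, n)            # poverty rate per city
crime = 50 * poverty + rng.normal(0, 2, n)      # crime rate driven by poverty
police = 0.5 * crime + rng.normal(0, 1, n)      # police density responds to crime

# Poverty and police density are correlated purely through the pathway.
print(round(np.corrcoef(poverty, police)[0, 1], 2))
```

Whether that correlation makes police density a confounder or a mediator depends on the assumed causal direction, which the regression output alone can't settle.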
 
FallenApple said:
does this assume that there isn't collinearity when this model is put together? I mean, Density presumably is (Num of police officers/Population size)

So that means that the offset term and the density term are a function of each other. Is that ok? If I want to avoid collinearity, then would I just have to be sure that the density isn't related to the population size in a linear way? For example, if I have ##x## and ##x^2## in as predictors in a regression, then that should be ok right? As long as its not ##x## and ##x##.
There is not necessarily any problem with one regressor being expressible as a function of others. Any quantity can be expressed as a function of any other quantity, for instance ##a## and ##b## are related by the equation ##a=b+c## where ##c## is a variable that has value ##a-b##. What is important is whether the regressors are correlated.

A non-linear relationship does not mean there is no correlation. For instance the following R model
Code:
x<-1+rnorm(1000)
summary(lm(x^2~x))
gives a significant positive correlation
Code:
Call:
lm(formula = x^2 ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0236 -0.9215 -0.5854  0.4238  8.4289

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.09604    0.06160   1.559    0.119  
x            1.92625    0.04318  44.609   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.38 on 998 degrees of freedom
Multiple R-squared:  0.666,   Adjusted R-squared:  0.6657
F-statistic:  1990 on 1 and 998 DF,  p-value: < 2.2e-16
 
andrewkirk said:
There is not necessarily any problem with one regressor being expressible as a function of others. Any quantity can be expressed as a function of any other quantity, for instance ##a## and ##b## are related by the equation ##a=b+c## where ##c## is a variable that has value ##a-b##. What is important is whether the regressors are correlated.

So is it because ##x^2## and ##x## are similar in the sense that they both trend upward? But it is strange how small the p-value is. If I have ##mx+b##, no matter what ##m## and ##b## are, we won't be able to approximate ##x^2##.

Oh, I think I see. The p-value is going to be small no matter what, because ##x^2## is directly related to ##x##. So this means that statistical significance (i.e., association) won't be affected if we misspecify the functional form.

It's only when there is no way to build a relation, not even approximately, that the p-value is not significant.

Is this correct?
 
It's more about whether there is a clear trend in the data. If there is a clear trend that the dependent variable increases (decreases) as the regressor does, then there will be a highly significant positive (negative) coefficient. It is not necessary for that trend to be linear, and the trend does not imply that we can use a linear model to get a good predictor - which we can't in the x-squared case.

Conversely, the lack of a linear trend doesn't mean we can't get a good predictor. If I change the mean of the random regressors above from 1 to 0, we get an insignificant slope coefficient, despite having a perfect nonlinear relationship:
Code:
x<- 0+rnorm(1000)
summary(lm(x^2~x))
results:
Code:
Call:
lm(formula = x^2 ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9110 -0.8254 -0.4780  0.2661 10.7322

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.91011    0.04222  21.555   <2e-16 ***
x           -0.05905    0.04421  -1.336    0.182   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.334 on 998 degrees of freedom
That's because for about half of the regressor values (the ones less than 0) the result variable is decreasing and for the other half it is increasing.
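That symmetry argument can be checked directly with a quick sketch (Python here rather than R, but the same experiment as the two snippets above): the correlation between x and x² is strong when x is centered at 1 and essentially zero when x is centered at 0.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(100_000)

# Same experiment as the R runs: correlation of x with x^2
# when x has mean 1 versus mean 0.
r_mean1 = np.corrcoef(1 + z, (1 + z) ** 2)[0, 1]
r_mean0 = np.corrcoef(z, z ** 2)[0, 1]
print(round(r_mean1, 2), round(r_mean0, 2))  # roughly 0.8 and 0.0
```

At mean 0 the increasing and decreasing halves of the parabola cancel, so the linear trend, and with it the slope's significance, disappears.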
 
andrewkirk said:
It's more about whether there is a clear trend in the data. If there is a clear trend that the dependent variable increases (decreases) as the regressor does, then there will be a strong p value on a positive (negative) coefficient. It is not necessary for that trend to be linear, and the trend does not imply that we can use a linear model to get a good predictor - which we can't in the x-squared case.


Ah, that makes sense. So when x is centered at 1, it is just more likely to take positive values, and the p-values, which are calculated from the x's and y's, reflect that. The data frame, even though you didn't explicitly make one, would contain the numbers and their squares. So it is a local calculation?

So inferences are only valid if we don't stray too far from the centered values? Also, on a similar topic: if I just want to see whether, say, weight is associated with height, I could just regress height on weight (and potential confounders, of course) and look at the p-value without caring about the functional form, other than making sure increasing matches with increasing, since I just want to test associations. True?

But if I want to predict height, then I do care about the functional form of all the predictors, because I want the best fit possible. Is that logic valid?
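One way to see the association-versus-prediction distinction numerically (a Python sketch of my own, with a made-up monotone relationship): a mis-specified linear fit still detects the trend, but the correct functional form fits far better.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(500)
y = np.exp(x) + rng.normal(0, 0.1, 500)  # monotone but nonlinear

def r_squared(pred, y):
    # Ordinary least squares R^2 via the normal equations.
    X = np.column_stack([np.ones(len(pred)), pred])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

# A linear fit in x detects the association (clear positive trend) ...
r2_linear = r_squared(x, y)
# ... but the correct functional form predicts far better.
r2_correct = r_squared(np.exp(x), y)
print(round(r2_linear, 2), round(r2_correct, 2))
```

So a significant slope is evidence of association, while good prediction additionally needs the functional form to be roughly right.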
 
