Should I be concerned about collinearity in my crime rate prediction model?

FallenApple
Say I want to investigate the rate of crime as a function of the density of police inside a city. The sampling unit is the city.

So according to a problem I saw, we can model it by
##log(\mu_{i})\sim PoliceDensity_{i}+log(PopulationSize_{i})##

Where ##\mu_{i}## is the mean count of crime in the ith city.

The last term is the offset term.
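For concreteness, here is a minimal R sketch of how a model like this could be fit with glm; the simulated data frame `cities` and its column names are assumptions for illustration only, not part of the original problem.
Code:
# Sketch only: simulate a toy data set and fit the Poisson model with an offset.
# The data frame `cities` and its column names are made up for illustration.
set.seed(1)
n <- 50
cities <- data.frame(
  PopulationSize = round(runif(n, 5e4, 2e6)),   # city population
  PoliceDensity  = runif(n, 0.001, 0.005)       # officers per resident (made up)
)
cities$Crimes <- rpois(n, lambda = cities$PopulationSize *
                         exp(-4 + 20 * cities$PoliceDensity))

fit <- glm(Crimes ~ PoliceDensity + offset(log(PopulationSize)),
           family = poisson, data = cities)
summary(fit)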

First question,

does this assume that there isn't collinearity when this model is put together? I mean, Density presumably is (Num of police officers/Population size)

So that means that the offset term and the density term are a function of each other. Is that ok? If I want to avoid collinearity, then would I just have to be sure that the density isn't related to the population size in a linear way? For example, if I have ##x## and ##x^2## as predictors in a regression, then that should be ok right? As long as it's not ##x## and ##x##.

Second question,
Ok, now I want to update the model to include the proportion of impoverished citizens in a city. So the new model, which I'm supposed to figure out, is

##log(\mu_{i})\sim PoliceDensity_{i}+PovertyRate_{i}+log(PopulationSize_{i})##

where ##PovertyRate_{i}## is the proportion of the ith city's population living in poverty.

Now I'm guessing that that is the wrong way to go about seeing whether poverty rate is associated with crime, because intuitively the crime rate would depend on city size. Large urban cities have more crime in general. So assuming that, by having both the poverty rate and the offset, would log(Population Size) be giving redundant information? But then again, it might not, as it's not redundant in a linear sense.

On the other hand, ##PovertyRate_{i}## and ##PoliceDensity_{i}## might be linearly redundant, as high-crime areas would have more police per area.

Or it could just be confounding, such that it isn't so much a matter of redundancy. Poverty rate could be associated with the rate of crimes and also associated with police density, making it a potential confounder.
The problem is, it could be that ##PovertyRate \rightarrow CrimeRate \rightarrow PoliceDensity##, which would put the response on the causal pathway to police density. Would this be something to worry about?
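
One way to check whether these worries show up as actual collinearity in the data is sketched below; it reuses the hypothetical `cities` data frame from the earlier sketch, adds a made-up `PovertyRate` column, and uses `vif` from the `car` package (all of these names are assumptions for illustration).
Code:
# Sketch only: diagnose collinearity among the candidate regressors.
# PovertyRate values are made up purely for illustration.
cities$PovertyRate <- runif(nrow(cities), 0.05, 0.30)

X <- with(cities, cbind(PoliceDensity, PovertyRate, logPop = log(PopulationSize)))
cor(X)                      # pairwise correlations among the regressors

fit2 <- glm(Crimes ~ PoliceDensity + PovertyRate + offset(log(PopulationSize)),
            family = poisson, data = cities)
car::vif(fit2)              # variance inflation factors (needs the car package)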
 
FallenApple said:
does this assume that there isn't collinearity when this model is put together? I mean, Density presumably is (Num of police officers/Population size)

So that means that the offset term and the density term are a function of each other. Is that ok? If I want to avoid collinearity, then would I just have to be sure that the density isn't related to the population size in a linear way? For example, if I have ##x## and ##x^2## as predictors in a regression, then that should be ok right? As long as it's not ##x## and ##x##.
There is not necessarily any problem with one regressor being expressible as a function of others. Any quantity can be expressed as a function of any other quantity, for instance ##a## and ##b## are related by the equation ##a=b+c## where ##c## is a variable that has value ##a-b##. What is important is whether the regressors are correlated.

A non-linear relationship does not mean there is no correlation. For instance the following R model
Code:
x <- 1 + rnorm(1000)   # 1000 draws from a normal distribution with mean 1
summary(lm(x^2 ~ x))   # regress x^2 on x
gives a significant positive correlation
Code:
Call:
lm(formula = x^2 ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0236 -0.9215 -0.5854  0.4238  8.4289

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.09604    0.06160   1.559    0.119  
x            1.92625    0.04318  44.609   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.38 on 998 degrees of freedom
Multiple R-squared:  0.666,   Adjusted R-squared:  0.6657
F-statistic:  1990 on 1 and 998 DF,  p-value: < 2.2e-16
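To state the same point directly in terms of correlation (an added illustration, not part of the original output), the sample correlation can be computed:
Code:
cor(x, x^2)   # roughly 0.8 when x is centred at 1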
 
andrewkirk said:
There is not necessarily any problem with one regressor being expressible as a function of others. Any quantity can be expressed as a function of any other quantity, for instance ##a## and ##b## are related by the equation ##a=b+c## where ##c## is a variable that has value ##a-b##. What is important is whether the regressors are correlated.

So is it because ##x^2## and ##x## are similar in the sense that they both trend upward? But it is weird how small the p-value is. If I have ##mx+b##, no matter what ##m## and ##b## are, we won't be able to approximate ##x^2##.

Oh, I think I see. The p-value is going to be small no matter what, because ##x^2## is directly related to ##x##. So this means that statistical significance (i.e. association) won't be affected if we misspecify the functional form.

It's only when there is no way to build a relation, not even approximately, that the p-value is not significant.

Is this correct?
 
It's more about whether there is a clear trend in the data. If there is a clear trend that the dependent variable increases (decreases) as the regressor does, then there will be a strongly significant p-value on a positive (negative) coefficient. It is not necessary for that trend to be linear, and the trend does not imply that we can use a linear model to get a good predictor - which we can't in the x-squared case.

Conversely, the lack of a linear trend doesn't mean we can't get a good predictor. If I change the mean of the random regressors above from 1 to 0, we get an insignificant slope coefficient, despite having a perfect nonlinear relationship:
Code:
x <- 0 + rnorm(1000)   # 1000 draws, now centred at 0
summary(lm(x^2 ~ x))   # regress x^2 on x
results:
Code:
Call:
lm(formula = x^2 ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9110 -0.8254 -0.4780  0.2661 10.7322

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.91011    0.04222  21.555   <2e-16 ***
x           -0.05905    0.04421  -1.336    0.182   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.334 on 998 degrees of freedom
That's because for about half of the regressor values (the ones less than 0) the result variable is decreasing and for the other half it is increasing.
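
The correlation view makes this concrete as well (an added illustration): with ##x## centred at 0, ##x## and ##x^2## are nearly uncorrelated in the sample even though one is a deterministic function of the other.
Code:
cor(x, x^2)   # close to 0 when x is centred at 0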
 
andrewkirk said:
It's more about whether there is a clear trend in the data. If there is a clear trend that the dependent variable increases (decreases) as the regressor does, then there will be a strong p value on a positive (negative) coefficient. It is not necessary for that trend to be linear, and the trend does not imply that we can use a linear model to get a good predictor - which we can't in the x-squared case.


Ah, that makes sense. So when ##x## is centered at one, it is just more likely to take positive values, and the p-values, which are calculated from the x's and y's, would reflect that. The data frame, even though you didn't make one explicitly, would contain the numbers and their squares. So it is a local calculation?
So inferences are only valid if we don't stray too far away from the centered values? Also, on a similar topic: if I just want to see whether, say, weight is associated with height, I could just regress height on weight (and potential confounders, of course) and look at the p-value without caring about functional form, other than making sure increasing matches with increasing, etc., since I just want to test associations. True?

But if I want to predict height, then I do care about the functional form of all the predictors, because I want to get the best fit possible. Is that logic valid?
 
Namaste & G'day Postulate: A strongly-knit team wins on average over a less knit one Fundamentals: - Two teams face off with 4 players each - A polo team consists of players that each have assigned to them a measure of their ability (called a "Handicap" - 10 is highest, -2 lowest) I attempted to measure close-knitness of a team in terms of standard deviation (SD) of handicaps of the players. Failure: It turns out that, more often than, a team with a higher SD wins. In my language, that...
Hi all, I've been a roulette player for more than 10 years (although I took time off here and there) and it's only now that I'm trying to understand the physics of the game. Basically my strategy in roulette is to divide the wheel roughly into two halves (let's call them A and B). My theory is that in roulette there will invariably be variance. In other words, if A comes up 5 times in a row, B will be due to come up soon. However I have been proven wrong many times, and I have seen some...

Similar threads

Back
Top