Should I be concerned about collinearity in my crime rate prediction model?

FallenApple
Say I want to investigate the rate of crime as a function of the density of police inside a city. The sampling unit is the city.

So according to a problem I saw, we can model it by
##log(\mu_{i})\sim PoliceDensity_{i}+log(PopulationSize_{i})##

Where ##\mu_{i}## is the mean count of crime in the ith city.

The last term is the offset term.
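For concreteness, here is a minimal R sketch of how a model like this could be fit with glm; the simulated data frame `cities` and its column names are assumptions for illustration only, not part of the original problem.
Code:
# Sketch only: simulate a toy data set and fit the Poisson model with an offset.
# The data frame `cities` and its column names are made up for illustration.
set.seed(1)
n <- 50
cities <- data.frame(
  PopulationSize = round(runif(n, 5e4, 2e6)),   # city population
  PoliceDensity  = runif(n, 0.001, 0.005)       # officers per resident (made up)
)
cities$Crimes <- rpois(n, lambda = cities$PopulationSize *
                         exp(-4 + 20 * cities$PoliceDensity))

fit <- glm(Crimes ~ PoliceDensity + offset(log(PopulationSize)),
           family = poisson, data = cities)
summary(fit)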

First question,

does this assume that there isn't collinearity when this model is put together? I mean, Density presumably is (Num of police officers/Population size)

So that means that the offset term and the density term are a function of each other. Is that ok? If I want to avoid collinearity, then would I just have to be sure that the density isn't related to the population size in a linear way? For example, if I have ##x## and ##x^2## as predictors in a regression, then that should be ok right? As long as it's not ##x## and ##x##.

Second question,
Ok, now I want to update the model to include the proportion of impoverished citizens in a city. So the new model, which I'm supposed to figure out, is

##log(\mu_{i})\sim PoliceDensity_{i}+PovertyRate_{i}+log(PopulationSize_{i})##

where ##PovertyRate_{i}## is the proportion of the ith city's population living in poverty.

Now I'm guessing that that is the wrong way to go about seeing whether poverty rate is associated with crime, because intuitively the crime rate would depend on city size. Large urban cities have more crime in general. So assuming that, by having both the poverty rate and the offset, would log(Population Size) be giving redundant information? But then again, it might not, as it's not redundant in a linear sense.

On the other hand, ##PovertyRate_{i}## and ##PoliceDensity_{i}## might be linearly redundant, as high-crime areas would have more police per area.

Or it could just be confounding, such that it isn't so much a matter of redundancy. Poverty rate could be associated with the rate of crimes and also associated with police density, making it a potential confounder.
The problem is, it could be that ##PovertyRate \rightarrow CrimeRate \rightarrow PoliceDensity##, which would put the response on the causal pathway to police density. Would this be something to worry about?
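
One way to check whether these worries show up as actual collinearity in the data is sketched below; it reuses the hypothetical `cities` data frame from the earlier sketch, adds a made-up `PovertyRate` column, and uses `vif` from the `car` package (all of these names are assumptions for illustration).
Code:
# Sketch only: diagnose collinearity among the candidate regressors.
# PovertyRate values are made up purely for illustration.
cities$PovertyRate <- runif(nrow(cities), 0.05, 0.30)

X <- with(cities, cbind(PoliceDensity, PovertyRate, logPop = log(PopulationSize)))
cor(X)                      # pairwise correlations among the regressors

fit2 <- glm(Crimes ~ PoliceDensity + PovertyRate + offset(log(PopulationSize)),
            family = poisson, data = cities)
car::vif(fit2)              # variance inflation factors (needs the car package)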
 
FallenApple said:
does this assume that there isn't collinearity when this model is put together? I mean, Density presumably is (Num of police officers/Population size)

So that means that the offset term and the density term are a function of each other. Is that ok? If I want to avoid collinearity, then would I just have to be sure that the density isn't related to the population size in a linear way? For example, if I have ##x## and ##x^2## as predictors in a regression, then that should be ok right? As long as it's not ##x## and ##x##.
There is not necessarily any problem with one regressor being expressible as a function of others. Any quantity can be expressed as a function of any other quantity, for instance ##a## and ##b## are related by the equation ##a=b+c## where ##c## is a variable that has value ##a-b##. What is important is whether the regressors are correlated.

A non-linear relationship does not mean there is no correlation. For instance the following R model
Code:
x <- 1 + rnorm(1000)   # 1000 draws from a normal distribution with mean 1
summary(lm(x^2 ~ x))   # regress x^2 on x
gives a significant positive correlation
Code:
Call:
lm(formula = x^2 ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0236 -0.9215 -0.5854  0.4238  8.4289

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.09604    0.06160   1.559    0.119  
x            1.92625    0.04318  44.609   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.38 on 998 degrees of freedom
Multiple R-squared:  0.666,   Adjusted R-squared:  0.6657
F-statistic:  1990 on 1 and 998 DF,  p-value: < 2.2e-16
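To state the same point directly in terms of correlation (an added illustration, not part of the original output), the sample correlation can be computed:
Code:
cor(x, x^2)   # roughly 0.8 when x is centred at 1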
 
andrewkirk said:
There is not necessarily any problem with one regressor being expressible as a function of others. Any quantity can be expressed as a function of any other quantity, for instance ##a## and ##b## are related by the equation ##a=b+c## where ##c## is a variable that has value ##a-b##. What is important is whether the regressors are correlated.

So is it because ##x^2## and ##x## are similar in the sense that they both trend upward? But it is weird how small the p-value is. If I have ##mx+b##, no matter what ##m## and ##b## are, we won't be able to approximate ##x^2##.

Oh, I think I see. The p-value is going to be small no matter what, because ##x^2## is directly related to ##x##. So this means that statistical significance (i.e. association) won't be affected if we misspecify the functional form.

It's only when there is no way to build a relation, not even approximately, that the p-value is not significant.

Is this correct?
 
It's more about whether there is a clear trend in the data. If there is a clear trend that the dependent variable increases (decreases) as the regressor does, then there will be a strongly significant p-value on a positive (negative) coefficient. It is not necessary for that trend to be linear, and the trend does not imply that we can use a linear model to get a good predictor - which we can't in the x-squared case.

Conversely, the lack of a linear trend doesn't mean we can't get a good predictor. If I change the mean of the random regressors above from 1 to 0, we get an insignificant slope coefficient, despite having a perfect nonlinear relationship:
Code:
x <- 0 + rnorm(1000)   # 1000 draws, now centred at 0
summary(lm(x^2 ~ x))   # regress x^2 on x
results:
Code:
Call:
lm(formula = x^2 ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-0.9110 -0.8254 -0.4780  0.2661 10.7322

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.91011    0.04222  21.555   <2e-16 ***
x           -0.05905    0.04421  -1.336    0.182   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.334 on 998 degrees of freedom
That's because for about half of the regressor values (the ones less than 0) the result variable is decreasing and for the other half it is increasing.
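
The correlation view makes this concrete as well (an added illustration): with ##x## centred at 0, ##x## and ##x^2## are nearly uncorrelated in the sample even though one is a deterministic function of the other.
Code:
cor(x, x^2)   # close to 0 when x is centred at 0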
 
andrewkirk said:
It's more about whether there is a clear trend in the data. If there is a clear trend that the dependent variable increases (decreases) as the regressor does, then there will be a strong p value on a positive (negative) coefficient. It is not necessary for that trend to be linear, and the trend does not imply that we can use a linear model to get a good predictor - which we can't in the x-squared case.


Ah, that makes sense. So when ##x## is centered at one, it is just more likely to take positive values, and the p-values, which are calculated from the x's and y's, would reflect that. The data frame, even though you didn't make one explicitly, would contain the numbers and their squares. So it is a local calculation?
So inferences are only valid if we don't stray too far away from the centered values? Also, on a similar topic: if I just want to see whether, say, weight is associated with height, I could just regress height on weight (and potential confounders, of course) and look at the p-value without caring about functional form, other than making sure increasing matches with increasing, etc., since I just want to test associations. True?

But if I want to predict height, then I do care about the functional form of all the predictors, because I want to get the best fit possible. Is that logic valid?
 
Namaste & G'day Postulate: A strongly-knit team wins on average over a less knit one Fundamentals: - Two teams face off with 4 players each - A polo team consists of players that each have assigned to them a measure of their ability (called a "Handicap" - 10 is highest, -2 lowest) I attempted to measure close-knitness of a team in terms of standard deviation (SD) of handicaps of the players. Failure: It turns out that, more often than, a team with a higher SD wins. In my language, that...
Hi all, I've been a roulette player for more than 10 years (although I took time off here and there) and it's only now that I'm trying to understand the physics of the game. Basically my strategy in roulette is to divide the wheel roughly into two halves (let's call them A and B). My theory is that in roulette there will invariably be variance. In other words, if A comes up 5 times in a row, B will be due to come up soon. However I have been proven wrong many times, and I have seen some...

Similar threads

Back
Top