Confidence of complete spatial randomness

1. May 8, 2014

stephenx_86

Hi,

I've got a distribution of points in two dimensions and would like to demonstrate whether these are randomly distributed. The points have been measured using single-particle tracking, so likely have some degree of error in their positions. What I'd like to show is whether, as time progresses (I have 10 distributions at different time points), the distribution remains random.

So far I've been using the nearest neighbour index (NNI) and seem to have this working correctly (I've compared the results I get to those obtained using CrimeStat III; http://www.icpsr.umich.edu/CrimeStat/). Similarly, I've calculated the Z-score and also got the same results as with CrimeStat III.

What I'd like to be able to do is specify some confidence or precision in each measurement of NNI. However, my problem is that for a relatively small variation in the nearest neighbour index (NNI ranges from 0.985 to 1.03) I get a large variation in Z-score (Z ranges from 0.31 to 2.58). There is also no apparent trend (i.e. NNI and Z-score getting larger over time). To test this, I generated a similarly sized sample of random points in MATLAB and see the same massive fluctuations in Z. I'm surprised by these large values of Z, since the NNI values are very close to the expected value of 1 for complete spatial randomness.

I was wondering if there is a better way to demonstrate if I have complete spatial randomness? For example, can I take the precision of measurements into account, since it seems my large number of measurements (n ~2500) is influencing Z, while the accuracy of each measure is ignored.

From what I've seen online, people generally assume complete spatial randomness if Z < 1.65 (or a larger threshold, depending on how confident they need to be), since this corresponds to a confidence of 90%. However, when I'm trying to show that there IS, rather than ISN'T, randomness, do I need Z < 0.13, so that I have only a 10% confidence of a non-random process (i.e. 90% of randomness)?

What I'd like is a value I can plot along with the measures of NNI, such as "x% likelihood of complete spatial randomness".

Sorry if this is written confusingly; I'm still trying to get my head round the problem! Please let me know if more information is required.

Any help would be greatly appreciated
Stephen

2. May 9, 2014

Stephen Tashi

You're apparently using jargon from the CrimeStat computer program. A general mathematical audience isn't going to be familiar with the program. (For example, you talk about "the Z-score" without defining what random variable has this Z-score.) It also isn't clear whether you are using the word "confidence" in the way it is technically defined in statistics or whether you are using "confidence" in layman's terms, as a synonym for "probability".

One thought is to quote passages from the CrimeStat documentation that explain what random variables are involved. Another thought is to explain the goal of your analysis - what are you trying to accomplish?

3. May 9, 2014

stephenx_86

Hi,

My apologies; I've not done any in depth statistics before and wasn't aware of what are common terms. I'll try and explain the problem, but please let me know if I need to clarify terms further.

What I've got is a distribution of points in two dimensions; it's expected these will show a random distribution (complete spatial randomness), but I need a way to show if this is indeed the case.

So far I've been measuring the distance to the nearest neighbour for each position in the data and taking the average (dmeas). This is compared to the expected average distance for the same number of points randomly distributed over the same area (dexp), giving the nearest neighbour index (NNI):

NNI = dmeas/dexp

The NNI value should be equal to 1 for complete spatial randomness, less than 1 for clustering, and greater than 1 for increasing uniformity of distribution.

The problem I have is that I either need to quantify how clustered (if at all) my distribution is, or state the probability that my distribution was the result of a random process.

As I understand it, I can calculate what they refer to as the Z-score, which "indicates whether the measured average nearest neighbor distance was significantly different from the mean random distance" [1]. I've been calculating the Z-score as:

Z = (dmeas-dexp)/SEexp

Where SEexp is the standard error of the mean nearest-neighbour distance for the expected (random) distribution, which can be approximated as:

SEexp ≈ 0.26136/√(n²/A)

Where n is the number of points in the sample and A is the area of the sample.
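For concreteness, the whole calculation can be sketched in a few lines of Python (the thread uses MATLAB, so this is just an illustrative translation; the `clark_evans` name is made up, and the nearest-neighbour search uses scipy's cKDTree):

```python
import numpy as np
from scipy.spatial import cKDTree

def clark_evans(points, area):
    """Nearest-neighbour index (NNI) and Z-score for a 2D point pattern,
    using the formulas above (no edge correction)."""
    n = len(points)
    # k=2 because each point's first query result is the point itself
    dists, _ = cKDTree(points).query(points, k=2)
    d_meas = dists[:, 1].mean()              # mean distance to nearest neighbour
    d_exp = 0.5 / np.sqrt(n / area)          # expected mean distance under randomness
    se_exp = 0.26136 / np.sqrt(n**2 / area)  # standard error of the expected mean
    return d_meas / d_exp, (d_meas - d_exp) / se_exp

# Example: 2500 uniformly random points in a 35 um x 35 um field
rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 35.0, size=(2500, 2))
nni, z = clark_evans(pts, area=35.0 * 35.0)
```

With n this large, SEexp is tiny (≈ 0.26136 × 35/2500 ≈ 0.004 μm here), which is why small deviations of NNI from 1 translate into sizeable Z-scores.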

From what I've read, this Z-score is also known as the "standard score" [2] and corresponds to "the number of standard deviations an observed value is above the mean". According to another source [3] I can use this Z-score to calculate a P-value, which "is the probability of obtaining a test statistic result at least as extreme as the one that was actually observed" [4]. Such a value seems like it would be a good way to quantify whether the distribution is likely clustered or randomly distributed.

These P-values seem to be used more for identifying the probability that the observed distribution was NOT randomly distributed (i.e. people want P-values less than, say, 0.1, corresponding to a probability less than 10% that the distribution was random). What I'm trying to do is effectively the opposite (at least that's how it seems to me); I want to be able to say that it's "likely the points are randomly distributed". I just don't know at what P-value threshold I can state this. Would I need a P-value of at least 0.9, so I could say there's less than a 10% probability the distribution was created by a non-random process (i.e. a 90% probability it was random)?
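For what it's worth, the conversion from a Z-score to a two-sided P-value is just the standard normal CDF; a minimal Python sketch (the `p_value` helper is illustrative, not from any particular package):

```python
from math import erf, sqrt

def p_value(z):
    """Two-sided P-value for a standard-normal Z-score."""
    # Standard normal CDF via the error function:
    # Phi(z) = (1 + erf(z / sqrt(2))) / 2
    phi = (1.0 + erf(abs(z) / sqrt(2.0))) / 2.0
    return 2.0 * (1.0 - phi)

print(p_value(2.58))  # ~0.0099: would usually be read as significantly non-random
print(p_value(0.31))  # ~0.76: no evidence against randomness
```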

When I calculate the NNI everything seems good and I get values pretty close to 1 (they range from 0.985 to 1.03); however, I get pretty large Z-scores (-1.19 to 2.58) for these values (and accordingly, small P-values). I assume this must be due to the dependence of the Z-score on the number of measurements and since I have a large number of measurements (n = 2500) this means even a slight deviation from NNI = 1 looks statistically pretty unlikely. However, this measure doesn't seem to take into account the accuracy of the measured spatial positions.

Please can you tell me if there is a more suitable way to quantify the probability that my measured NNI is the result of a random distribution.

Thanks,
Stephen

References:
[1] https://www.icpsr.umich.edu/CrimeStat/files/CrimeStatChapter.5.pdf
[2] http://en.wikipedia.org/wiki/Standard_score
[3] http://help.arcgis.com/en/arcgisdes...z_score_What_is_a_p_value/005p00000006000000/
[4] http://en.wikipedia.org/wiki/P-value

4. May 9, 2014

FactChecker

I hesitate to suggest a completely different approach, but I would look at the coordinates of the points as time series. There are tools that can look for any cross-correlation between the series or any auto-correlation. If there is no statistically significant pattern in the coordinate series, then I would say the points are completely random. (P.S. There is more clustering in randomness than people expect.)
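A minimal sketch of that idea in Python, assuming the points are ordered by detection time (the `lag1_autocorr` helper and the thresholds are illustrative, not part of any particular package):

```python
import numpy as np

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.sum(x[:-1] * x[1:]) / np.sum(x * x))

# Coordinate sequences of n detections in a 35 x 35 field (simulated here)
rng = np.random.default_rng(0)
n = 2500
xs = rng.uniform(0.0, 35.0, n)
ys = rng.uniform(0.0, 35.0, n)

# Under independence the lag-1 autocorrelation is roughly N(0, 1/n), so
# values outside +/- 1.96/sqrt(n) would be suspicious at the 5% level.
bound = 1.96 / np.sqrt(n)
r_x, r_y = lag1_autocorr(xs), lag1_autocorr(ys)
```

The same check can be repeated at larger lags, and on the cross-correlation between the x and y series.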

5. May 9, 2014

Stephen Tashi

First some important semantics: the type of statistical testing being discussed ("frequentist" statistical tests) does not quantify the probability that a distribution of points is random, nor does it quantify the probability that the distribution of points is non-random. The calculation of a p-value is done on the assumption that the distribution of points is random. So the calculations quantify the probability of the observed data given that the distribution is random. What a common-sense person would like is a value for the probability that the distribution is random given the observed data, but what the test calculates is the probability of the observed data given that the distribution is random. It's the distinction between "probability of A given B" and "probability of B given A".

Hypothesis testing of this sort is not a mathematical proof. It is a subjective procedure that many people accept as providing evidence. If you are writing something up for publication then you need to browse published papers to see what types of statistics the editors of publications in your field accept. If you really want to know "the probability that the distribution of points is random given the observed data", you need to use a Bayesian statistical approach (which will require additional assumptions or facts about the problem).

What is the accuracy of the measurements? I'd think that crimes are often reported by an address, so two crimes at the same apartment building or street corner might be considered to occur at the same place. Does this happen in the data that you have?

6. May 10, 2014

stephenx_86

Thank you both for your responses.

The data is indeed intended (hopefully) for publication, so if an alternative approach entirely is required I'll be happy to read into it. This Bayesian statistical approach sounds promising.

Although I've been using CrimeStat to analyse the distribution, the data actually originates from single-molecule fluorescence experiments, where I've been localising individual fluorophores using Gaussian profile fitting. Either way, I end up with a distribution of points in 2D. The accuracy of these points should be at worst around 25 nm, and the mean nearest neighbour distance I believe (I don't have the data to hand) is approximately 250 nm.

I'll have a further look round existing literature to see what I can find.

Thanks again,
Stephen

7. May 11, 2014

Stephen Tashi

Perhaps using a "well-known" statistical package makes a paper more impressive. This PDF discusses using packages from "R": http://www.google.com/url?sa=t&rct=...d5ZzFamcJkBIf67O8n7VO0w&bvm=bv.66330100,d.aWw

One distinction made clear in that link is the difference between a "random" distribution of points vs a "uniform" distribution of points. Does the physics of your problem imply a "random" distribution?

Those results from a MATLAB simulation sound strange. I think we should discuss the specific formulas that you are using.

8. May 13, 2014

stephenx_86

The distribution we're observing is generated by individual fluorophores binding to a microscope slide. Since the size of the fluorophores (diameter ~25 nm) is much less than the average distance to the nearest neighbour (~2 μm), we're assuming it to be negligible (i.e. no spatial restriction). We've no reason to believe there should be a uniform distribution of points rather than a random one.

I tried to keep the MATLAB simulation as simple as possible. I've been using the "rand" function to position points across the sample area. To keep it comparable to the measured data, I used the same number of points as was measured and the same sample area (35 μm × 35 μm).
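For reference, here is a Python version of that kind of simulation (not the poster's actual MATLAB script), drawing many random patterns and looking at the spread of the resulting Z-scores; note that with no edge correction the measured nearest-neighbour distances are biased slightly upward, which pushes Z a little positive:

```python
import numpy as np
from scipy.spatial import cKDTree

def z_score(points, area):
    """Clark-Evans Z-score for a 2D point pattern (no edge correction)."""
    n = len(points)
    dists, _ = cKDTree(points).query(points, k=2)  # k=2 skips the point itself
    d_meas = dists[:, 1].mean()
    d_exp = 0.5 / np.sqrt(n / area)
    se_exp = 0.26136 / np.sqrt(n**2 / area)
    return (d_meas - d_exp) / se_exp

rng = np.random.default_rng(0)
side, n, trials = 35.0, 2500, 200
zs = np.array([z_score(rng.uniform(0.0, side, (n, 2)), side**2)
               for _ in range(trials)])
# zs should have a spread of roughly 1, so individual trials landing
# anywhere between about -2 and +3 is expected behaviour, not a bug
```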

The paper does seem very useful though and will be a more impressive reference when we write up. I'm currently reading through it to see if I can improve my approach.

Stephen

9. May 14, 2014

Stephen Tashi

This is my simplistic reasoning about the MATLAB results. If a random variable is approximately normally distributed then a correct simulation of it, repeated many times, will probably show a normal distribution of the "Z-score" of the random variable. So if you are getting a non-normal distribution of Z-scores then something is wrong with either the theory or the implementation.

What does the distribution of the Z-scores look like?

10. May 19, 2014

stephenx_86

The distribution of Z-scores does indeed follow a normal distribution; one that appears to follow the distribution predicted by the Z-score calculation (probability as a function of deviation from the expected value).

I've spent the last few days reading further into the relevant papers, and it's all starting to make sense. Most examples use a modified version of the nearest neighbour calculation. Rather than just taking the distance to the closest neighbour, they count the number of positions within circles of increasing radius from each position (Ripley's K-function [1, 2]). The summed value that comes from this is compared to the expected value. The benefit of this seems to be that you get information at multiple distances from each position, so you can analyse short- and long-range clustering. The examples also provide information on edge correction and calculation of confidence intervals; the latter simply uses somewhere between 100 and 1000 simulations to build a probability distribution.
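A bare-bones illustration of that approach in Python (a naive Ripley's K with no edge correction, plus a pointwise Monte-Carlo envelope; `ripley_k` and all of the parameter values are illustrative):

```python
import numpy as np
from scipy.spatial import distance_matrix

def ripley_k(points, radii, area):
    """Naive Ripley's K: for each radius r, the mean number of other points
    within r of a point, divided by the intensity n/area (no edge correction)."""
    n = len(points)
    d = distance_matrix(points, points)
    np.fill_diagonal(d, np.inf)  # don't count a point as its own neighbour
    counts = (d[:, :, None] <= radii).sum(axis=(0, 1))
    return area * counts / (n * n)

rng = np.random.default_rng(0)
side, n = 35.0, 500
radii = np.linspace(0.5, 5.0, 10)
obs = ripley_k(rng.uniform(0.0, side, (n, 2)), radii, side**2)

# Envelope from 99 random simulations: under complete spatial randomness
# K(r) ~ pi * r^2, and an observed K escaping the pointwise quantiles at
# some radius suggests clustering (above) or dispersion (below) at that scale.
sims = np.array([ripley_k(rng.uniform(0.0, side, (n, 2)), radii, side**2)
                 for _ in range(99)])
lo, hi = np.percentile(sims, [2.5, 97.5], axis=0)
```

Real analyses would add an edge correction (e.g. Ripley's isotropic correction) before comparing against the envelope, as the papers cited below describe.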

Thank you again for your advice with this, it has been very helpful in order to clarify my thoughts on the problem.

References:
[1] Prior, I. A., et al. The Journal of Cell Biology 160 (2003) 165–70
[2] http://en.wikipedia.org/wiki/Spatial_descriptive_statistics#Ripley.27s_K_and_L_functions

Last edited: May 19, 2014