Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

'Distance' between MANY points ?

  1. May 27, 2009 #1
    'Distance' between MANY points ??

    It's easy to calculate the Euclidean distance between two points A and B.
    Between 3 points we can do the Eucliean distance between AB, AC and BC, and then take the average of the three to find the "distance" (I don't know what it's called for more than 2 points!).

    But I have 8900 points, and to go about finding "how similar" the points are, in this manner, would take (8900 choose 2) = 39600550 calculations of the euclidean distance (not practical!)

    Is there a good way to find the 'similarity' between a large number of points ??

    One way I considered was, finding the median (centre of mass) of the object whose boundary was formed by connecting the 8900 points, and then determining what the Euclidean distance is between this 'median' and the FARTHEST point (out of the 8900) from that median.

    This is not good if my 8900 points form a shape like a tight ball with one small sharp spike sticking out.

    Another method would maybe be finding the Euclidean distance between the median and all 8900 other points, and averaging this (8900 calculations is much less than 39600550).

    But how is this problem usually addressed ?? Can anyone please tell me what they know, or refer me to some literature ?? Thanks!!!
     
  2. jcsd
  3. May 27, 2009 #2
    Re: 'Distance' between MANY points ??

    How many dimensions do your points occupy?
     
  4. May 27, 2009 #3
    Re: 'Distance' between MANY points ??

    Thank you for your response. The points are 2x2 matrices over the non-negative integers.
     
    Last edited: May 27, 2009
  5. May 27, 2009 #4

    Hurkyl

    User Avatar
    Staff Emeritus
    Science Advisor
    Gold Member

    Re: 'Distance' between MANY points ??

    Well, the biggest problem is to figure out just what you really want to know. If you are able to figure that out, then that might give us some ideas about what calculation to do.




    Anyways, just to throw random ideas out there, I believe it's actually pretty easy to compute the average of the squared distance between pairs of points, as well as the average squared distance from their mean -- the method is to write this calculation as a sum of dot products, and then expand and reorganize te terms.
     
  6. May 27, 2009 #5
    Re: 'Distance' between MANY points ??

    My calcualtions are still in their infancy right now, so I'm just trying things out. All 8900 matrices should (in theory) be exactly the same, but of course, they have small differences. I want to quantitatively describe how 'close together' all these matrices are. This way when I look at the 8900 matrices produced using a different theory, I can do the same calculation on those 8900 matrices. And the thoery that produces matrices that are 'more similar' or 'closer to being all the same' is the better theory. But to do this type of comparison, I need a robust way to measure 'how similar' 8900 points are to each other!

    I don't under stand what you mean by sum of dot products here :S
    To find the Euclidean distance between vectors V and U, i can do sqrt [(U-V) dot (U-V)].
    But what do you mean by SUM of dot products ??
     
  7. May 27, 2009 #6
    Re: 'Distance' between MANY points ??

    I'm sorry, I don't understand what you mean by the above. Do you want to find the average distance between each pair of points?

    Do you have any background in programming? I'd think a computer program to do that shouldn't take more than about 5 minutes. (That is to run, not to write)
     
  8. May 27, 2009 #7
    Re: 'Distance' between MANY points ??

    What I mean is that they are 2x2 matrices whose entries are non-negative integers. Let's just say, for now, that my distance between two matrices A and B is the 2x2 matrix where each entry is |Aij-Bij|. I can turn this into a scalar later on but this will do for now.

    There are 39 million pairs of matrices. Trusting that you've worked out the computational cost to around 5 minutes of runtime for this calculation, this would be far too slow. I'm doing this for scores of different 8900-tuples of matrices. And then, after the 2x2 case, I will be doing 4x4, 8x8, 16x16, ... 1024x1024 !!! If it would take 5 minutes to calcualate the average distance between 39 million pairs of 2x2 matrices, I can't afford to wait days to run this program on 1024x1024 matrices!

    Also, I think there's some redundancy in calculating the distance between each pair. But I might be wrong.
     
  9. May 27, 2009 #8
    Re: 'Distance' between MANY points ??

    I said 5 minutes because I just finished writing a program to calculate the distance between each point and every other point for 1000 points defined by latitude and longitude. Your calculations will at least be with integers. Mine were with Lat. & Lon converted to radians and took about 2.5 seconds including reading in the data and converting to radians.

    My calculation involved 500,000 pairs ( 1000^2 / 2) just as yours is 8900^2 / 2 = 39 million.

    I understand the 2x2 matrices but why do you then need to do 4x4, 8x8, etc.?

    How close do you think you'd be if you calculated the average x distance and the average y distance, and calculate the Euclidean distance between the two?
     
  10. May 27, 2009 #9
    Re: 'Distance' between MANY points ??

    Without getting too far off topic .. the 2x2 matrices would represent 1D CGR (chaos game representation) , 4x4 = 2D CGR, 2^k x 2^k = k-D CGR. It really doesn't matter though, the point I was trying to make was that calculating pairwise distances for 39 million pairs might be too computationally demanding for me. Also, my intuition says calc'ing all pairwise distances has some redundancy, although my ideas about that aren't fully crystallized.

    Do you mean, I calculate the average A_{1,1} , the average A{1,2} , the averge A{2,1} and the average A_{2,2} and then summed up the squares of these averages and then square-rooted the result ??
     
  11. May 27, 2009 #10
    Re: 'Distance' between MANY points ??

    Never mind, it doesn't work.
     
  12. Jun 15, 2009 #11
    Re: 'Distance' between MANY points ??

    There are many ways, which vary in their sensitivity to individual points (robustness) and ability to represent the clusters. The choice you make depends on what you expect the distribution of each cluster to look like. Some of the most common are

    1) Distance between means -- simple to calculate, works ok if both distributions are roughly the same scale and tightly grouped
    2) Mahalanobis distance - good if distributions can be modeled well by hyper-ellipsoid, or come from multivariate normal distribution. This is generally the best.
    3) UPGMA - inefficient, and discrimination ability is usually not improved compared to simpler methods
    4) Single link, average link, and complete link - single link is useful if the distributions are highly irregular, but it breaks down for large numbers of points and is inherently sensitive to outliers
     
  13. Jun 16, 2009 #12

    HallsofIvy

    User Avatar
    Staff Emeritus
    Science Advisor

    Re: 'Distance' between MANY points ??

    I haven't read through all the posts in this thread but I just wanted to point out that when we talk about the "distance" between two "extended objects" (not just two points) we mean the minimum distance between two points in the two objects, NOT the average distance as suggested in the first post.
     
  14. Jun 16, 2009 #13
    Re: 'Distance' between MANY points ??

    I believe what you want is the sample variance, which is the sum of squared distances from the mean, divided by n-1 where n is the number of points.
     
  15. Jun 22, 2009 #14
    Re: 'Distance' between MANY points ??

    Hi juliette sekx
    I am not sure what you are trying to do. But if you are dealing with a cloud of points and seeking some relative distance property among these points, then you may to do a Delaunay triangulation (see also convex hull). Very efficient algorithms exist that can handle many million points in few seconds. I hope this will help and feel free to ask for references may you need any.
     
Know someone interested in this topic? Share this thread via Reddit, Google+, Twitter, or Facebook