
I Many To Many Distances

  1. Aug 26, 2016 #1

    MarneMath

    User Avatar
    Education Advisor

    Here's my problem: I have approximately 500 million pairs of GPS coordinates. Each pair of points is unique, and (lat1, long1, lat2, long2) is considered the same as (lat2, long2, lat1, long1). Right now, I'm building a rather large table that'll contain the drive time calculations between each pair of coordinates. I'm doing it in a rather naive way, using an OpenStreetMap API that calculates the distances: I'm essentially querying the distance for each pair separately. Now, it occurs to me that there should exist a way to build a matrix that reuses previously calculated distances in order to reduce the query times for long distances.
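Since the pairs are symmetric, one easy saving before any routing work is to collapse (A, B) and (B, A) into a single lookup key. A minimal sketch in Python; the function names and the flat 4-tuple input format are my assumptions, not part of any particular API:

```python
def canonical_pair(lat1, lon1, lat2, lon2):
    """Return a canonical key so (A, B) and (B, A) map to the same entry."""
    a, b = (lat1, lon1), (lat2, lon2)
    return (a, b) if a <= b else (b, a)

def dedupe_pairs(pairs):
    """Collapse symmetric duplicates; `pairs` yields (lat1, lon1, lat2, lon2)."""
    seen = set()
    for lat1, lon1, lat2, lon2 in pairs:
        key = canonical_pair(lat1, lon1, lat2, lon2)
        if key not in seen:
            seen.add(key)
            yield key
```

With 500 million pairs this halves the worst case before a single routing call is made.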

    For example, if the route from point A to B has been calculated and we move on to points A and C, and A-B appears in the route from A to C, then we shouldn't have to search all the routes from A to C. We should be able to automatically select the route A-B and start searching from B.
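One way to exploit this (assuming the routes come from a shortest-path search such as Dijkstra, where every prefix of a shortest path is itself a shortest path) is to cache the distance from the origin to every intermediate node each time a route is computed. A hedged sketch; the `route` format of (node, cumulative_distance) tuples is an assumption for illustration:

```python
def cache_subpath_distances(cache, route):
    """`route` is a list of (node, cumulative_distance) pairs along a
    shortest path. Record the origin-to-node distance for every node on
    it, keyed by the unordered node pair."""
    origin = route[0][0]
    for node, dist in route[1:]:
        key = (origin, node) if origin <= node else (node, origin)
        # Keep the first value seen for this pair; later routes agree anyway
        # if all routes are true shortest paths.
        cache.setdefault(key, dist)
    return cache
```

So one A-to-C query fills in A-to-B "for free" whenever B lies on that route.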

    I was wondering if anyone knows of an algorithm that does this. I vaguely recall something similar many years ago but I haven't found anything via Dr. Google.

    Thanks!
     
  3. Aug 26, 2016 #2

    Dale

    Staff: Mentor

    How do you know if AB is on the route for AC?
     
  4. Aug 26, 2016 #3

    MarneMath

    User Avatar
    Education Advisor

    The entire map has a contraction hierarchy layer and is locally indexed.
     
  5. Aug 27, 2016 #4

    SteamKing

    User Avatar
    Staff Emeritus
    Science Advisor
    Homework Helper

    Well, I guess you don't mind flogging the old computer with these calculations!

    What you are doing sounds like a variant of the Traveling Salesman Problem. Given an arbitrarily long list of cities and the distances between pairs of cities (by whatever route), the TSP solution finds a single ordered list of these cities which has the least total distance to cover to visit all of them in turn.

    https://en.wikipedia.org/wiki/Travelling_salesman_problem

    Even Dr. Google has something on this.

    It's not clear what is contained in this massive DB of 500 million points, but you should be able to whittle down the point data to something more manageable; otherwise your grandkids might be waiting for the final answer to pop out of the optimization process.
     
  6. Aug 27, 2016 #5

    MarneMath

    User Avatar
    Education Advisor

    I may not have been clear about the exact nature of the problem. The routing part of the problem has already been implemented by a well-known algorithm. The query time for a single route from LA to NY isn't the issue at play. The problem is the repetitiveness of the computation and my inefficient method. So, let me clear this up. As of this morning, I have already calculated the distances for all 500 million pairs. It took 8 days; however, this is rather inefficient if I ever receive a different set of 500 million pairs. What I would like to do is utilize the nodes from previously found paths to speed up the calculation.

    I think I've basically come up with a solution, but it isn't elegant. I'm using Apache Spark to cache the nodes, so when the program identifies a node in the cache it stops searching different paths and resumes the search from the endpoint of the previously calculated node.
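At the whole-pair level, the same cache-first pattern can wrap the routing call itself, so a repeated (or reversed) pair never hits the API twice. A minimal sketch, independent of Spark; `route_api` is a stand-in for whatever routing backend is in use:

```python
def drive_time(cache, route_api, a, b):
    """Look up the cached drive time for the unordered pair (a, b);
    fall back to the routing API and memoize the result."""
    key = (a, b) if a <= b else (b, a)
    if key not in cache:
        cache[key] = route_api(*key)
    return cache[key]
```

The cache dict here could just as well be a Spark broadcast variable or an external key-value store; the lookup logic is the same.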

    edit: There's also other badness that comes with my implementation. Since a shuffle step is guaranteed and the data is not read in order, I can't guarantee that the closest points are calculated first globally, but locally I can. It's rather annoying, though.
     
    Last edited: Aug 27, 2016
  7. Aug 27, 2016 #6

    Stephen Tashi

    User Avatar
    Science Advisor

    Since you mention "searching", it appears that your goal is to determine the shortest driving time between each pair of points. Is that correct? If so, I don't understand the statement "if A B appears in the route for A C". Is this assuming A B appears in "the" route known to be the quickest route between A and C? Or are you saying that if A B appears in a given route between A and C, then we can incorporate what we know about the quickest route between A and B in searching for ways to improve that given route? - namely, we don't try to improve the given route by changing the part that goes between A and B.

    If you are saying the latter, I don't think it is always a safe assumption. For example, the quickest route between A and C might bypass B. If you know that any route between A and C must go through B then I think it's a correct assumption.
     
  8. Aug 27, 2016 #7

    MarneMath

    User Avatar
    Education Advisor

    Hey Stephen, you're right, that isn't a good assumption. I think the way around this would be to bring a priority queue into the mix. However, programmatically this makes the problem quadratic in nature. I think adding an adjacency list to the function will reduce the time complexity. I'll bang my head on this tomorrow.
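The priority queue plus adjacency list combination described above is essentially Dijkstra's algorithm, and the earlier caching idea can be grafted onto it safely: if a settled node's exact distance to the target is already cached, that gives an upper bound without expanding the node further. A hedged sketch, assuming `adj` maps each node to a list of (neighbor, weight) edges and `cache` holds exact shortest distances from earlier runs:

```python
import heapq

def dijkstra_with_cache(adj, cache, source, target):
    """Dijkstra over adjacency dict `adj`. When a settled node's distance
    to `target` is already in `cache`, use it as a shortcut instead of
    expanding that node's neighbors."""
    dist = {source: 0.0}
    pq = [(0.0, source)]
    best = float("inf")
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        if u == target:
            return min(best, d)
        if (u, target) in cache:
            # Exact distance u -> target is known; no path through u can
            # beat d + cache[(u, target)], so skip expanding u.
            best = min(best, d + cache[(u, target)])
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return best
```

The pruning stays correct only because the cached values are exact shortest distances; with approximate values it would degrade to a heuristic.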
     
  9. Aug 27, 2016 #8

    SteamKing

    User Avatar
    Staff Emeritus
    Science Advisor
    Homework Helper

    That's the Law of the Short Cut. :wink:

    It's still not clear where all of these 500 million points come from.

    Are these points all located on existing roads? Are some scattered randomly across the countryside, where no transportation infrastructure exists?
     
  10. Aug 28, 2016 #9

    MarneMath

    User Avatar
    Education Advisor

    They should exist on roads. Thankfully, it doesn't particularly matter: if a graph cannot be generated, my code moves the GPS coordinates to a secondary table to be reevaluated by a data analyst.
     