Calculating Many-to-Many Distances from GPS Coordinates

AI Thread Summary
The discussion centers on optimizing the calculation of distances between approximately 500 million pairs of GPS coordinates, where each pair is unique and order does not matter. The current method utilizes an OpenStreetMap API to calculate distances naively, leading to inefficiencies, especially when recalculating for different sets of points. A proposed solution involves caching previously calculated nodes using Apache Spark to expedite future calculations by avoiding redundant searches. There is also a concern about the assumption that routes between points can be reused without verifying if they are indeed part of the shortest path. The conversation highlights the complexity of the problem and suggests exploring algorithms like the Traveling Salesman Problem and priority queueing to improve efficiency.
MarneMath
Here's my problem: I have approximately 500 million pairs of GPS coordinates. Each pair of points is unique, and (lat1, long1, lat2, long2) is considered the same as (lat2, long2, lat1, long1). Right now, I'm building a rather large table that'll contain the drive-time calculations between each pair of coordinates. I'm doing it in a rather naive way, using an OpenStreetMap API that calculates the distances, so I'm essentially calculating the distance for each pair independently. Now, it occurs to me that there should be a way to build a matrix that uses previously calculated distances to reduce the query time for long-distance pairs.

For example, if the route between points A and B has been calculated and we move on to points A and C, and A-B appears in the route from A to C, then we shouldn't have to search all the routes from A to C. We should be able to automatically select route A-B and start searching from there.
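One way to make that reuse idea rigorous (a sketch, not the poster's actual implementation) relies on the optimal-substructure property: every subpath of a shortest path is itself a shortest path. So a single routed query can seed a cache with exact distances for all node pairs along the returned route. The route/cumulative-distance shape below is a hypothetical API:

```python
def cache_subpaths(route_nodes, cum_dist, cache):
    """Record the distance between every pair of nodes on a shortest
    route.  Valid because any subpath of a shortest path is itself
    shortest (optimal substructure)."""
    n = len(route_nodes)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = route_nodes[i], route_nodes[j]
            key = (a, b) if a <= b else (b, a)  # unordered pair, as in the post
            cache[key] = cum_dist[j] - cum_dist[i]
    return cache

# Example: a routed query returned A -> B -> C with cumulative
# distances 0, 4, 9; one call seeds three pairwise distances.
cache = {}
cache_subpaths(["A", "B", "C"], [0.0, 4.0, 9.0], cache)
# cache: {("A","B"): 4.0, ("A","C"): 9.0, ("B","C"): 5.0}
```

Note the seeding step is O(n²) in the route length, so for very long routes one might cache only a sampled subset of nodes.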

I was wondering if anyone knows of an algorithm that does this. I vaguely recall something similar many years ago but I haven't found anything via Dr. Google.

Thanks!
 
MarneMath said:
if A-B appears in the route from A to C
How do you know if AB is on the route for AC?
 
The entire map has a contraction hierarchy layer and is locally indexed.
 
MarneMath said:
Here's my problem: I have approximately 500 million pairs of GPS coordinates. Each pair of points is unique, and (lat1, long1, lat2, long2) is considered the same as (lat2, long2, lat1, long1). Right now, I'm building a rather large table that'll contain the drive-time calculations between each pair of coordinates. I'm doing it in a rather naive way, using an OpenStreetMap API that calculates the distances, so I'm essentially calculating the distance for each pair independently. Now, it occurs to me that there should be a way to build a matrix that uses previously calculated distances to reduce the query time for long-distance pairs.

For example, if the route between points A and B has been calculated and we move on to points A and C, and A-B appears in the route from A to C, then we shouldn't have to search all the routes from A to C. We should be able to automatically select route A-B and start searching from there.

I was wondering if anyone knows of an algorithm that does this. I vaguely recall something similar many years ago but I haven't found anything via Dr. Google.

Thanks!
Well, I guess you don't mind flogging the old computer with these calculations!

What you are doing sounds like a variant of the Traveling Salesman Problem. Given an arbitrarily long list of cities and the distances between pairs of cities (by whatever route), the TSP solution finds a single ordered list of these cities which has the least total distance to cover to visit all of them in turn.

https://en.wikipedia.org/wiki/Travelling_salesman_problem

Even Dr. Google has something on this.
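For small instances, an exact TSP answer can be sketched with the Held-Karp dynamic program, which runs in O(n² 2ⁿ) time. This is only a toy illustration of the technique mentioned above; the 4-city matrix is invented:

```python
from itertools import combinations

def held_karp(dist):
    """Exact shortest TSP tour length via Held-Karp DP, O(n^2 * 2^n).
    Only feasible for small n (roughly n <= 20)."""
    n = len(dist)
    # best[(S, j)]: shortest path from city 0 through subset S, ending at j
    best = {(1 << j, j): dist[0][j] for j in range(1, n)}
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            S = 0
            for j in subset:
                S |= 1 << j
            for j in subset:
                prev = S & ~(1 << j)
                best[(S, j)] = min(best[(prev, k)] + dist[k][j]
                                   for k in subset if k != j)
    full = (1 << n) - 2  # every city except 0
    return min(best[(full, j)] + dist[j][0] for j in range(1, n))

# Toy symmetric 4-city matrix; the optimal tour 0-1-3-2-0 has length 80.
d = [[0, 10, 15, 20],
     [10, 0, 35, 25],
     [15, 35, 0, 30],
     [20, 25, 30, 0]]
```

Running `held_karp(d)` on this matrix returns 80, and the exponential blow-up in `n` is exactly why the "whittle down the point data" advice matters.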

It's not clear what is contained in this massive DB of 500 million points, but you should be able to whittle the point data down to something more manageable; otherwise your grandkids might be waiting for the final answer to pop out of the optimization process.
 
I may not have been clear about the exact nature of the problem. The routing itself is already handled by a well-known algorithm; the query time for a single route from LA to NY isn't the issue. The problem is the repetitiveness of the work and my inefficient method. So, let me clear this up: as of this morning, I have already calculated the distances for all 500 million pairs. It took 8 days; however, this is rather inefficient if I ever receive a different set of 500 million pairs. What I would like to do is reuse the nodes from previously found paths to speed up future calculations.

I think I've basically come up with a solution, but it isn't elegant. I'm using Apache Spark to cache the nodes, so when the program identifies a node in the cache it stops searching alternative paths and begins a new search from the end point of the previously calculated segment.
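A hedged sketch of that short-circuit idea in plain Python rather than Spark: run Dijkstra toward the target, and whenever a settled node has a cached exact distance to the target, record the combined length as a candidate answer and stop once no frontier entry can beat it. The graph shape is hypothetical, and the cache is assumed to hold *exact* shortest distances (Stephen's point below is precisely about what goes wrong otherwise):

```python
import heapq

def dijkstra_with_cache(adj, source, target, cache):
    """Dijkstra that short-circuits through cached exact shortest
    distances to `target`.  adj: {node: [(neighbour, weight), ...]};
    cache: {unordered node pair: exact shortest distance}."""
    def key(a, b):
        return (a, b) if a <= b else (b, a)

    best = float("inf")       # best complete source->target length so far
    dist = {source: 0.0}
    pq = [(0.0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue          # stale queue entry
        if d >= best:
            break             # no remaining entry can improve the answer
        if u == target:
            best = min(best, d)
            continue
        hit = cache.get(key(u, target))
        if hit is not None:   # known remainder: record and stop expanding u
            best = min(best, d + hit)
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return best
```

Pruning the expansion at a cache hit is safe because any route through that node costs at least its settled distance plus the exact cached remainder, which has already been counted.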

edit: There's also other badness that comes with my implementation. Since a shuffle step is guaranteed and the data is not read in order, I can't guarantee that the closest points are calculated first globally, only locally. It's rather annoying.
 
MarneMath said:
For example, if the route between points A and B has been calculated and we move on to points A and C, and A-B appears in the route from A to C, then we shouldn't have to search all the routes from A to C. We should be able to automatically select route A-B and start searching from there.

Since you mention "searching", it appears that your goal is to determine the shortest driving time between each pair of points. Is that correct? If so, I don't understand the statement "if A-B appears in the route for A-C". Is this assuming A-B appears in "the" route known to be the quickest route between A and C? Or are you saying that if A-B appears in a given route between A and C, then we can incorporate what we know about the quickest route between A and B in searching for ways to improve that given route? Namely, we don't try to improve the given route by changing the part that goes between A and B.

If you are saying the latter, I don't think it is always a safe assumption. For example, the quickest route between A and C might bypass B. If you know that any route between A and C must go through B then I think it's a correct assumption.
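Stephen's caveat is easy to demonstrate on a toy graph (the weights are invented for illustration): the quickest A-to-C route can bypass B entirely, so blindly reusing a cached A-B segment overshoots.

```python
import heapq

def shortest(adj, s, t):
    """Plain Dijkstra; returns the shortest distance from s to t."""
    dist = {s: 0.0}
    pq = [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == t:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return float("inf")

# Invented weights: the path through B is not optimal.
adj = {
    "A": [("B", 1.0), ("D", 2.0)],
    "B": [("A", 1.0), ("C", 5.0)],
    "D": [("A", 2.0), ("C", 1.0)],
    "C": [("B", 5.0), ("D", 1.0)],
}
# The path A-B-C costs 6.0, but the true shortest A->C is A-D-C at 3.0,
# so a cached A-B segment must not be assumed to lie on the best route.
```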
 
Hey Stephen, you're right, that isn't a safe assumption. I think the way around this would be to bring priority queuing into the mix. However, programmatically this makes the problem quadratic in nature. I think adding an adjacency list to the function would reduce the time complexity. I'll bang my head on this tomorrow.
 
Stephen Tashi said:
For example, the quickest route between A and C might bypass B.
That's the Law of the Short Cut. :wink:

It's still not clear where all of these 500 million points come from.

Are these points all located on existing roads? Are some scattered randomly across the countryside, where no transportation infrastructure exists?
 
They should exist on roads. Thankfully, it doesn't particularly matter: if a graph cannot be generated, my code moves the GPS coordinates to a secondary table to be reevaluated by a data analyst.
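The fallback described here could look roughly like the sketch below; `route_distance` is a purely hypothetical stand-in for the real routing call, not the poster's actual code:

```python
def route_distance(pair, graph):
    """Hypothetical stand-in for the real routing call: returns a
    distance, or None when no road graph can be built for the pair."""
    a, b = pair
    d = graph.get((a, b))
    return d if d is not None else graph.get((b, a))

def process_pairs(pairs, graph):
    """Route every pair; unroutable pairs go to a review list
    (the 'secondary table' for the data analyst)."""
    results, needs_review = {}, []
    for pair in pairs:
        d = route_distance(pair, graph)
        if d is None:
            needs_review.append(pair)
        else:
            results[pair] = d
    return results, needs_review

graph = {("A", "B"): 4.0}   # toy routable pair
results, review = process_pairs([("A", "B"), ("X", "Y")], graph)
# results == {("A", "B"): 4.0}; review == [("X", "Y")]
```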
 