# Many-to-Many Distances

1. Aug 26, 2016

### MarneMath

Here's my problem: I have approximately 500 million pairs of GPS coordinates. Each pair is unique, and (lat1, long1, lat2, long2) is considered the same as (lat2, long2, lat1, long1). Right now, I'm building a rather large table that will contain the drive-time calculations between each pair of coordinates. I'm doing it in a rather naive way, using an OpenStreetMap API that calculates the distances: I'm essentially calculating the distance for each pair separately. Now, it occurs to me that there should be a way to build a matrix that reuses previously calculated distances in order to reduce the query times for long distances.

For example, if the route between A and B has already been calculated and we move on to the pair (A, C), and the route from A to B appears in the route from A to C, then we shouldn't have to search all the routes from A to C from scratch. We should be able to automatically select the route from A to B and start searching from there.
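One related observation: a single shortest-path search from a source already yields shortest distances to every node it settles, so caching whole one-to-many results (under a canonical key, since (A, B) is the same pair as (B, A)) can answer many later queries for free. Below is a minimal Python sketch of that idea under stated assumptions; the dict-of-lists graph format and the names `dijkstra_all`, `pair_distance`, and `cache` are illustrative, not part of the OP's actual OpenStreetMap pipeline:

```python
import heapq

def dijkstra_all(graph, src):
    """Shortest distance from src to every reachable node.

    graph: dict mapping node -> list of (neighbor, weight) pairs.
    One run answers *all* pairs (src, x), so caching its result lets
    later queries that share a source skip the search entirely.
    """
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

cache = {}  # src -> {node: distance}, filled lazily

def pair_distance(graph, a, b):
    # Distances are symmetric here, so store under a canonical key:
    # (a, b) and (b, a) share one cache entry.
    a, b = min(a, b), max(a, b)
    if a not in cache:
        cache[a] = dijkstra_all(graph, a)
    return cache[a].get(b, float("inf"))

# Tiny demo graph (undirected, weights are drive times).
graph = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("A", 1.0), ("C", 1.0)],
    "C": [("A", 4.0), ("B", 1.0)],
}
```

Querying (A, C) and then (C, A) hits the same cache entry, and the second query does no search at all.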

I was wondering if anyone knows of an algorithm that does this. I vaguely recall something similar many years ago but I haven't found anything via Dr. Google.

Thanks!

2. Aug 26, 2016

### Staff: Mentor

How do you know if AB is on the route for AC?

3. Aug 26, 2016

### MarneMath

The entire map has a contraction hierarchy layer and is locally indexed.

4. Aug 27, 2016

### SteamKing

Staff Emeritus
Well, I guess you don't mind flogging the old computer with these calculations!

What you are doing sounds like a variant of the Traveling Salesman Problem. Given an arbitrarily long list of cities and the distances between pairs of cities (by whatever route), the TSP solution finds a single ordered list of these cities which has the least total distance to cover to visit all of them in turn.

https://en.wikipedia.org/wiki/Travelling_salesman_problem

Even Dr. Google has something on this.

It's not clear what is contained in this massive DB of 500 million points, but you should be able to whittle the point data down to something more manageable; otherwise your grandkids might be waiting for the final answer to pop out of the optimization process.

5. Aug 27, 2016

### MarneMath

I may not have been clear about the exact nature of the problem. The routing part has already been implemented with a well-known algorithm; the query time for a single route from LA to NY isn't the issue at play. The problem is the repetitiveness of the work and my inefficient method. So, let me clear this up. As of this morning, I have already calculated the distances for all 500 million pairs. It took 8 days; however, this is rather inefficient if I ever receive a different set of 500 million points. What I would like to do is use the nodes created from previously found paths to speed up the calculation.

I think I basically came up with a solution, but it isn't elegant. I'm using Apache Spark to cache the nodes, so when the program identifies a node in the cache it stops searching alternative paths and starts a new search from the end point of the previously calculated node.

edit: There's also other badness that comes with my implementation. Since a shuffle step is guaranteed and the data is not read in order, I can't guarantee that the closest points are calculated first globally, but I can locally. It's rather annoying, though.
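The resume-from-a-cached-node idea can be sketched in plain Python, leaving Spark aside: when the search settles a node whose shortest distance to the destination is already cached, it treats the cached remainder as a candidate answer instead of expanding further. This is only a sketch under assumed names (`dijkstra_with_cache`, the dict-of-lists graph format), not the OP's Spark implementation:

```python
import heapq

def dijkstra_with_cache(graph, src, dst, cache):
    """Point-to-point search that reuses cached distances to dst.

    cache maps node -> known shortest distance from that node to dst
    (e.g. from earlier queries). Settling a cached node closes off
    that branch: the remainder of the route is already known.
    """
    best = float("inf")
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d >= best or d > dist.get(u, float("inf")):
            continue  # pruned or stale entry
        if u == dst:
            best = min(best, d)
            continue
        if u in cache:  # rest of the route is already known
            best = min(best, d + cache[u])
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return best
```

With the cache empty it degenerates to an ordinary point-to-point search, so both calls below agree.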

Last edited: Aug 27, 2016
6. Aug 27, 2016

### Stephen Tashi

Since you mention "searching", it appears that your goal is to determine the shortest driving time between each pair of points. Is that correct? If so, I don't understand the statement "if A B appears in the route for A C". Is this assuming A B appears in "the" route known to be the quickest route between A and C? Or are you saying that if A B appears in a given route between A and C, then we can incorporate what we know about the quickest route between A and B when searching for ways to improve that given route? Namely, we don't try to improve the given route by changing the part that goes between A and B.

If you are saying the latter, I don't think it is always a safe assumption. For example, the quickest route between A and C might bypass B entirely. If you know that any route between A and C must go through B, then I think it's a correct assumption.
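The caveat can be made concrete with a tiny triangle graph: the A-to-B edge is cheap and the direct A-to-C edge is cheaper than going through B, so the quickest A-to-C route skips B even though a quickest A-to-B route exists. A small illustrative sketch (the names and graph format are assumptions for the example, not anyone's actual code):

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra that returns (distance, node list) for src -> dst."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    # Walk predecessors back from dst to recover the route.
    path, node = [], dst
    while node != src:
        path.append(node)
        node = prev[node]
    path.append(src)
    return dist[dst], path[::-1]

# Triangle where the quickest A -> C route skips B entirely.
g = {
    "A": [("B", 1.0), ("C", 2.0)],
    "B": [("A", 1.0), ("C", 10.0)],
    "C": [("A", 2.0), ("B", 10.0)],
}
```

Here prepending the known A-to-B route to a search for C would yield a cost of 11, far worse than the true optimum of 2.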

7. Aug 27, 2016

### MarneMath

Hey Stephen, you're right, that isn't a safe assumption. I think the way around this would be to bring a priority queue into the mix. However, programmatically this makes the problem quadratic in nature. I think adding an adjacency list to the function will reduce the time complexity. I'll bang my head on this tomorrow.
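For what it's worth, the adjacency-list idea can be sketched quickly: built once from the edge pairs, it lets a priority-queue search touch each edge a constant number of times (roughly O(E log V) per query) instead of scanning all node pairs. A minimal sketch with an assumed `(u, v, weight)` input format:

```python
from collections import defaultdict

def build_adjacency(edges):
    """Turn (u, v, weight) triples into an undirected adjacency list.

    Each edge is stored in both directions, since a pair (u, v) is
    considered the same as (v, u).
    """
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    return adj
```

A priority-queue (Dijkstra-style) search then only ever iterates `adj[u]` for the node it is expanding, never the full pair table.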

8. Aug 27, 2016

### SteamKing

Staff Emeritus
That's the Law of the Short Cut.

It's still not clear where all of these 500 million points come from.

Are these points all located on existing roads? Are some scattered randomly across the countryside, where no transportation infrastructure exists?

9. Aug 28, 2016

### MarneMath

They should exist on roads. Thankfully, it doesn't particularly matter: if a graph cannot be generated, my code moves the GPS coordinates to a secondary table to be reevaluated by a data analyst.