Approximately finding nearest neighbor vectors

  • #1
feynomite
Hello all. This problem is mostly math-based, but it's being applied in a programming algorithm.

I'm working on a problem with a massive number of normalized vectors in a massive number of dimensions, and I need to quickly compute the nearest neighbors of any one vector. More precisely, I need to find the closest 20 or so vectors, in as close to the correct order as possible. I've come up with two approximate solutions, but I don't know how to work out their margin of error, or really what they mean geometrically. They just happen to give decent results.

The vectors occupy a total space of about 15,000 dimensions, but (thankfully) any given vector is:
1) normalized
2) has only positive, non-zero values
3) has non-zero values in at most 10 dimensions


To find the nearest neighbor, I'm looking for the vector with the shortest distance to the search vector. To do this, I compute the distance for every vector that has at least one dimension in common with the search vector. All other vectors are orthogonal to it and therefore exactly sqrt(2) away (since everything is unit length). I then sort by the distances.
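One way to gather those candidates is an inverted index from each dimension to the vectors that touch it. A rough Python sketch; the dict-based sparse representation (dimension index -> component value) and the function names are just for illustration, not part of my actual code:

[code]
from collections import defaultdict

def build_index(vectors):
    """Map each dimension to the ids of the vectors that use it."""
    index = defaultdict(set)
    for vid, vec in enumerate(vectors):
        for dim in vec:          # keys are the non-zero dimensions
            index[dim].add(vid)
    return index

def candidates(index, query):
    """Ids of vectors sharing at least one dimension with the query.
    Every other vector is orthogonal and sits exactly sqrt(2) away."""
    ids = set()
    for dim in query:
        ids |= index.get(dim, set())
    return ids
[/code]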

Simple geometry tells me the distance between vectors a and b is:

Let S(v) be the set of dimensions in which vector v has non-zero values:
[tex]Distance(a, b) = \sqrt{\sum_{n \in S(a) \cup S(b)} (a_n - b_n)^2}[/tex]

Unfortunately this takes a large number of calculations: for each of the up to 19 dimensions in the union (I'm ignoring the trivial case where a and b are orthogonal), I have to take the difference between the components, square it, and add it. That's a lot of work.
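In code, the exact distance is something like this (same illustrative dict representation as above):

[code]
from math import sqrt

def distance(a, b):
    """Exact distance between two sparse unit vectors; iterates over
    S(a) union S(b), which has at most 19 dimensions here."""
    return sqrt(sum((a.get(n, 0.0) - b.get(n, 0.0)) ** 2
                    for n in a.keys() | b.keys()))
[/code]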

My approximation is as follows: I sum the minima of the common dimensional components and treat the result as "commonality":
[tex]ApproxCommonality(a, b) = \sum_{n \in S(a) \cap S(b)} \min(a_n, b_n)[/tex]

This is much faster because [tex]S(a) \cap S(b)[/tex] has at most 10 elements, and the work in each iteration is only the min() function. I take these results and sort from highest to lowest. The problem is, I don't know how far off they are from the correct results. How well does this approximation work?
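For reference, a sketch of this approximation (the dict representation and the function name are again just illustrative):

[code]
def approx_commonality(a, b):
    """Sum of component-wise minima over the shared dimensions
    S(a) intersect S(b); higher is treated as 'closer'."""
    return sum(min(a[n], b[n]) for n in a.keys() & b.keys())
[/code]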

Yet another approximation I've found is simply the dot product of a and b:
[tex]ApproxCommonality2(a, b) = \sum_{n \in S(a) \cap S(b)} a_n b_n[/tex]

This is very fast as well. But is the order of all vectors sorted by their dot product with a (highest to lowest) the same as the order of all vectors sorted by distance to a (lowest to highest)? If not, what's the margin of error?
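A sketch of this one too, under the same illustrative representation:

[code]
def approx_commonality2(a, b):
    """Sparse dot product; only the shared dimensions contribute."""
    return sum(a[n] * b[n] for n in a.keys() & b.keys())
[/code]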

I'd appreciate your input on this, and other suggested algorithms would be great as well.
 
  • #2
Why don't you take the dot product to measure distance? The larger the angle between two vectors, the farther apart they are.
I don't know how fast this is, but it seems worth a thought.
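In fact, for unit vectors the dot product determines the distance exactly:

[tex]\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2 \, a \cdot b = 2 - 2 \, a \cdot b[/tex]

so sorting by dot product from highest to lowest gives exactly the same order as sorting by distance from lowest to highest, with no margin of error.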
 
  • #3
Success. Thanks!
 

Related to Approximately finding nearest neighbor vectors

1. What is approximate nearest neighbor search?

Approximate nearest neighbor search is a method used in machine learning and data mining to find the closest matches to a given vector or data point in a large dataset. It uses specialized algorithms and data structures to efficiently retrieve the data points most similar to a query point.

2. How does approximate nearest neighbor search differ from exact nearest neighbor search?

Exact nearest neighbor search finds the single closest match to a given query point, while approximate nearest neighbor search accepts any match within a certain degree of similarity. This allows much faster searching in large datasets, at the cost of some potential inaccuracy in the results.

3. What are some commonly used algorithms for approximate nearest neighbor search?

Commonly used algorithms include k-d trees, locality-sensitive hashing (LSH), and random projection trees. Each has its own advantages and limitations, and the choice of algorithm often depends on the specific dataset and search criteria.
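For a flavor of how one of these works, here is a toy Python sketch of random-hyperplane LSH (SimHash); all names and parameters are illustrative, and a real implementation would use vectorized, sparse-aware math:

[code]
import random

def make_simhash(n_dims, n_bits, seed=0):
    """Build a SimHash function: vectors separated by a small angle
    are likely to receive the same bit signature."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)]
              for _ in range(n_bits)]
    def signature(v):              # v is a dense list of length n_dims
        bits = 0
        for plane in planes:
            proj = sum(p * x for p, x in zip(plane, v))
            bits = (bits << 1) | (1 if proj >= 0 else 0)
        return bits
    return signature
[/code]

Vectors whose signatures match (or nearly match) the query's form the candidate set, which is typically far smaller than the full dataset.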

4. What are the applications of approximate nearest neighbor search?

Approximate nearest neighbor search has a wide range of applications, including recommendation systems, image and video search, natural language processing, and anomaly detection. It is also common in clustering and classification tasks, where finding similar data points is crucial for accurate results.

5. How do researchers evaluate the performance of approximate nearest neighbor algorithms?

These algorithms are typically evaluated with metrics such as precision, recall, and F1 score, which measure the accuracy and completeness of the search results against an exact nearest neighbor search. Speed, memory usage, and scalability also factor into the evaluation.
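For nearest neighbor search specifically, a common summary metric is recall@k: the fraction of the true k nearest neighbors that the approximate method also returns. A minimal sketch:

[code]
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the approximate
    search also placed in its own top k."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k
[/code]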
