Approximately finding nearest neighbor vectors

SUMMARY

The discussion focuses on optimizing the process of finding the nearest neighbors among a large set of normalized vectors in high-dimensional space, specifically 15,000 dimensions. Two approximate algorithms are proposed: one using a commonality measure based on minimum values of shared dimensions, and another utilizing the dot product of vectors. The author seeks to understand the accuracy of these approximations and their margin of error compared to traditional distance calculations. The need for efficient algorithms in high-dimensional vector spaces is emphasized due to the computational intensity of exact methods.

PREREQUISITES
  • Understanding of vector mathematics and geometry
  • Familiarity with high-dimensional data structures
  • Knowledge of algorithms for nearest neighbor search
  • Experience with performance optimization techniques in programming
NEXT STEPS
  • Research "Approximate Nearest Neighbor (ANN) algorithms" for efficient searching
  • Explore "Locality-Sensitive Hashing (LSH)" for high-dimensional data
  • Study "k-d trees" and their application in nearest neighbor searches
  • Investigate "Cosine Similarity" and its computational efficiency for vector comparisons
USEFUL FOR

Data scientists, machine learning engineers, and software developers working with high-dimensional vector data who need to optimize nearest neighbor search algorithms.

feynomite
Hello all. This problem is mostly math based, but is being applied to a programming algorithm.

I'm working on a problem with a massive number of normalized vectors in a massive number of dimensions, and I need to quickly compute the nearest neighbors of any one vector. More precisely, I need to find the closest 20 or so vectors, in as close to the correct order as possible. I was able to come up with two approximate solutions, but I don't know how to figure out their margin of error, or really what they mean. They just happen to give decent results.

The vectors occupy a total space of about 15,000 dimensions, but (thankfully) any given vector is:
1) normalized
2) has positive non-zero values only
3) exists in only up to 10 dimensions


To find the nearest neighbor, I understand that I'm looking for the vector which has the shortest distance to the search vector. To do this, I compute the distance for all vectors which have at least 1 dimension in common with the search vector. All other vectors are orthogonal and therefore at a distance of sqrt(2). I then sort by the distances.

Simple geometry tells me the distance between vectors a and b is:

Let S(v) be the set of dimensions in which vector v has non-zero values.
Distance(a, b) = \sqrt{\sum_{n \in S(a) \cup S(b)} (a_n - b_n)^2 }

Unfortunately this takes a large number of calculations: for each of the up to 19 dimensions in the union (I'm ignoring the trivial case where a and b are orthogonal), I have to take the difference between the components, square it, and add it. That's a lot of work.
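For concreteness, here is a minimal sketch of that exact distance, assuming the sparse vectors are stored as {dimension: value} dicts (the storage format is my assumption, not stated in the thread):

```python
import math

def exact_distance(a, b):
    """Euclidean distance between two sparse vectors given as
    {dimension: value} dicts (an assumed representation)."""
    dims = set(a) | set(b)  # union of non-zero dimensions, at most ~19 here
    return math.sqrt(sum((a.get(n, 0.0) - b.get(n, 0.0)) ** 2 for n in dims))
```

Each call touches every dimension in the union, which is the cost being complained about above.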

My approximation is as follows... I just sum the minimums of each common dimensional component, and treat it as "commonality":
ApproxCommonality(a, b) = \sum_{n \in S(a) \cap S(b)} \min(a_n, b_n)

This is much faster because |S(a) \cap S(b)| is at most 10, and the work in each iteration is only a min(). I take these results and sort from highest to lowest. Problem is, I do not know how far off they are from the correct results. How well does this approximation work?
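Under the same assumed {dimension: value} dict representation, the commonality measure is a few lines:

```python
def approx_commonality(a, b):
    """Sum of min(a_n, b_n) over shared dimensions; higher means "closer".
    Sparse {dimension: value} dicts assumed, as in the exact version."""
    shared = set(a) & set(b)  # at most 10 dimensions by the problem's constraints
    return sum(min(a[n], b[n]) for n in shared)
```

Orthogonal vectors score 0, and a vector compared with itself scores the sum of its components.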

Yet another approximation I've found is simply the dot product between a and b:
ApproxCommonality2(a, b) = \sum_{n \in S(a) \cap S(b)} a_n b_n

This is very fast as well. But is the order of "a" dotted with all vectors (sorted from highest value to lowest) the same as the order of all vectors sorted by distance (lowest to highest)? If not, what's the margin of error?
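A sketch of the dot-product score, again assuming the {dimension: value} dict layout:

```python
def approx_commonality2(a, b):
    """Dot product of two sparse {dimension: value} dicts,
    summed over shared dimensions only."""
    if len(b) < len(a):
        a, b = b, a  # iterate over the smaller dict; dot product is symmetric
    return sum(v * b[n] for n, v in a.items() if n in b)
```

As with the min-based score, only the (at most 10) shared dimensions do any work.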

I appreciate your input on this.. and other suggested algorithms would be great as well.
 
Why don't you use the dot product to measure distance? The larger the angle between two vectors, the farther apart they are.
I don't know how fast this is, though, but it seems worth a thought.
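In fact, for normalized vectors the dot product is not an approximation at all: ||a - b||^2 = 2 - 2(a·b), so sorting by descending dot product reproduces the ascending-distance order exactly. A quick numerical check with made-up sparse unit vectors (the dimension counts here are arbitrary, chosen small so overlaps are common):

```python
import math
import random

random.seed(0)

def normalize(v):
    s = math.sqrt(sum(x * x for x in v.values()))
    return {n: x / s for n, x in v.items()}

def dist(a, b):
    dims = set(a) | set(b)
    return math.sqrt(sum((a.get(n, 0.0) - b.get(n, 0.0)) ** 2 for n in dims))

def dot(a, b):
    return sum(x * b.get(n, 0.0) for n, x in a.items())

# Made-up data: sparse positive unit vectors in a 50-dimensional space
# (the thread's real space has ~15,000 dimensions; 50 keeps overlaps frequent).
vecs = [normalize({random.randrange(50): random.random() for _ in range(8)})
        for _ in range(100)]
q = vecs[0]

# The identity ||q - v||^2 == 2 - 2*(q . v) holds for every unit vector v.
for v in vecs[1:]:
    assert abs(dist(q, v) ** 2 - (2.0 - 2.0 * dot(q, v))) < 1e-9

# Among non-orthogonal candidates, ranking by descending dot product
# matches ranking by ascending distance.
cands = [i for i in range(1, len(vecs)) if dot(q, vecs[i]) > 0.0]
by_dist = sorted(cands, key=lambda i: dist(q, vecs[i]))
by_dot = sorted(cands, key=lambda i: -dot(q, vecs[i]))
assert by_dist == by_dot
```

So the only question left for ApproxCommonality2 is speed, not accuracy; the min-based score remains a genuine approximation.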
 
Success. Thanks!
 
