Which reduced vector should be used in cosine similarity?

In summary: in latent semantic analysis there are two conventions for cosine similarity between reduced document vectors: compare the raw reduced coordinates ##q^T_r##, or scale them by the singular values ##S## and compare ##q^T_r S##. The scaled form ##q^T_r S## is the more appropriate choice, because it accounts for the importance of each latent dimension in distinguishing documents and is consistent with the original formulation of latent semantic analysis.
  • #1
Adel Makram
In latent semantic analysis, the truncated (rank-##r##) singular value decomposition (SVD) of a term-document matrix ##A_{mn}## is
$$A \approx U_r S_r V^T_r$$
In many references, including Wikipedia, the reduced document column vector in ##r##-space is scaled by the singular values before being compared with other vectors by cosine similarity. This yields ##q^T_r S##, where ##q^T_r## is a column of the ##V^T_r## matrix and ##S## is the diagonal matrix of the corresponding singular values.
But in other references, only ##q^T_r## is used for cosine similarity. Which of the two is more appropriate, and why?
 
  • #2


The more appropriate method is to use ##q^T_r S## for cosine similarity. The singular values measure how much each latent dimension contributes to reconstructing the term-document matrix, so scaling the reduced document vector by ##S## gives more weight to the dimensions that matter most in distinguishing one document from another. This makes comparisons between document vectors more faithful to the geometry of the original term space, and it is consistent with the original formulation of latent semantic analysis: in the rank-##r## approximation ##A \approx U_r S_r V^T_r##, a document's coordinates in the latent basis are ##S_r q_r##, not ##q_r## alone.
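To make the two conventions concrete, here is a minimal NumPy sketch. The term-document matrix and the rank ##r = 2## are made up for illustration; the only point is the difference between comparing the raw reduced coordinates and comparing them after scaling by the singular values.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 5-term x 4-document count matrix (hypothetical data).
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 1],
    [1, 0, 1, 2],
], dtype=float)

# Full SVD, then truncate to rank r.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 2
S_r, Vt_r = np.diag(s[:r]), Vt[:r, :]

# Document j's reduced coordinates: column j of V^T_r.
q1, q2 = Vt_r[:, 0], Vt_r[:, 1]

# Convention 1: cosine on the raw reduced coordinates q_r.
sim_raw = cosine(q1, q2)

# Convention 2: cosine after weighting each latent dimension by its
# singular value, i.e. comparing the vectors S_r q_r.
sim_scaled = cosine(S_r @ q1, S_r @ q2)

print(f"raw: {sim_raw:.4f}, scaled by S: {sim_scaled:.4f}")
```

The two numbers generally differ, because multiplying by ##S_r## stretches the latent axes unevenly and therefore changes the angle between the document vectors.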
 

What is cosine similarity?

Cosine similarity is a measure of similarity between two vectors in a multi-dimensional space. It is the cosine of the angle between the vectors and takes values between -1 and 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.
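For illustration, a direct implementation of the definition (generic NumPy, not specific to LSA):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = a.b / (|a| |b|), always in [-1, 1]
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))   #  1.0: same direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   #  0.0: orthogonal
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0: opposite directions
```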

Why is cosine similarity commonly used in data science?

Cosine similarity is commonly used in data science because it is simple and efficient to compute and because it depends only on the direction of the vectors, not their magnitude. This makes it well suited to comparing documents of very different lengths, where raw term counts would otherwise dominate the comparison.

What is the difference between using a normalized and unnormalized vector in cosine similarity?

Normalizing each vector to unit length does not change cosine similarity, since the measure depends only on direction; normalization is just a convenient preprocessing step that reduces the similarity to a plain dot product. What does change the result is rescaling individual dimensions, as multiplying by the singular values ##S## does: that reweights the latent axes and therefore changes the angles between vectors.
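A quick sanity check of this point, with made-up vectors and a hypothetical diagonal ##S##: normalizing each whole vector leaves the cosine unchanged, while reweighting dimensions does not.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([3.0, 1.0])
b = np.array([1.0, 2.0])

# Per-vector normalization: cosine is unchanged (direction only).
a_hat, b_hat = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.isclose(cosine(a, b), cosine(a_hat, b_hat)))  # True

# Per-dimension scaling (as multiplying by S does): the angle changes.
S = np.diag([5.0, 1.0])  # hypothetical singular values
print(cosine(a, b), cosine(S @ a, S @ b))  # two different values
```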

How do I choose which reduced vector to use in cosine similarity?

The right choice depends on the data and the task. In the LSA setting above, ##q^T_r## and ##q^T_r S## can produce noticeably different similarity rankings, so it is worth evaluating both on a held-out retrieval or classification task and keeping the one that gives the more accurate results for your data set.

What are some potential limitations of using cosine similarity?

One limitation is that cosine similarity compares only the directions of vectors in whatever space they are represented in; if the representation fails to capture context or meaning, the similarity scores will not reflect true semantic similarity. It can also be less informative on very sparse, high-dimensional data, where most pairs of vectors are nearly orthogonal.
