Which reduced vector should be used in cosine similarity?

In summary: in latent semantic analysis there are two conventions for cosine similarity between reduced document vectors: compare the raw reduced coordinates ##q^T_r##, or scale them by the singular values ##S## and compare ##q^T_r S##. The scaled form ##q^T_r S## is the more appropriate choice, because it accounts for the importance of each latent dimension in distinguishing documents and is consistent with the original formulation of latent semantic analysis.
  • #1
Adel Makram
In latent semantic analysis, the truncated (rank-##r##) singular value decomposition (SVD) of a term-document matrix ##A_{mn}## is
$$A \approx U_r S_r V^T_r$$
In many references, including Wikipedia, the reduced document column vector in ##r##-space is scaled by the singular values before being compared with other vectors by cosine similarity. This yields ##q^T_r S##, where ##q^T_r## is a column of the ##V^T_r## matrix and ##S## is the diagonal matrix of the corresponding singular values.
But in other references, only ##q^T_r## is used for cosine similarity. Which of the two is more appropriate, and why?
 
  • #2


The more appropriate method is to use ##q^T_r S## for cosine similarity. The singular values measure how much each latent dimension contributes to reconstructing the term-document matrix, so scaling the reduced document vector by ##S## gives more weight to the dimensions that matter most in distinguishing one document from another. This makes comparisons between document vectors more faithful to the geometry of the original term space, and it is consistent with the original formulation of latent semantic analysis: in the rank-##r## approximation ##A \approx U_r S_r V^T_r##, a document's coordinates in the latent basis are ##S_r q_r##, not ##q_r## alone.
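To make the two conventions concrete, here is a minimal NumPy sketch. The term-document matrix and the rank ##r = 2## are made up for illustration; the only point is the difference between comparing the raw reduced coordinates and comparing them after scaling by the singular values.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 5-term x 4-document count matrix (hypothetical data).
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 1],
    [1, 0, 1, 2],
], dtype=float)

# Full SVD, then truncate to rank r.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 2
S_r, Vt_r = np.diag(s[:r]), Vt[:r, :]

# Document j's reduced coordinates: column j of V^T_r.
q1, q2 = Vt_r[:, 0], Vt_r[:, 1]

# Convention 1: cosine on the raw reduced coordinates q_r.
sim_raw = cosine(q1, q2)

# Convention 2: cosine after weighting each latent dimension by its
# singular value, i.e. comparing the vectors S_r q_r.
sim_scaled = cosine(S_r @ q1, S_r @ q2)

print(f"raw: {sim_raw:.4f}, scaled by S: {sim_scaled:.4f}")
```

The two numbers generally differ, because multiplying by ##S_r## stretches the latent axes unevenly and therefore changes the angle between the document vectors.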
 

What is cosine similarity?

Cosine similarity is a measure of similarity between two vectors in a multi-dimensional space. It is the cosine of the angle between the vectors and takes values between -1 and 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.
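For illustration, a direct implementation of the definition (generic NumPy, not specific to LSA):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = a.b / (|a| |b|), always in [-1, 1]
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))   #  1.0: same direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   #  0.0: orthogonal
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0: opposite directions
```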

Why is cosine similarity commonly used in data science?

Cosine similarity is commonly used in data science because it is simple and efficient to compute and because it depends only on the direction of the vectors, not their magnitude. This makes it well suited to comparing documents of very different lengths, where raw term counts would otherwise dominate the comparison.

What is the difference between using a normalized and unnormalized vector in cosine similarity?

Normalizing each vector to unit length does not change cosine similarity, since the measure depends only on direction; normalization is just a convenient preprocessing step that reduces the similarity to a plain dot product. What does change the result is rescaling individual dimensions, as multiplying by the singular values ##S## does: that reweights the latent axes and therefore changes the angles between vectors.
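A quick sanity check of this point, with made-up vectors and a hypothetical diagonal ##S##: normalizing each whole vector leaves the cosine unchanged, while reweighting dimensions does not.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([3.0, 1.0])
b = np.array([1.0, 2.0])

# Per-vector normalization: cosine is unchanged (direction only).
a_hat, b_hat = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.isclose(cosine(a, b), cosine(a_hat, b_hat)))  # True

# Per-dimension scaling (as multiplying by S does): the angle changes.
S = np.diag([5.0, 1.0])  # hypothetical singular values
print(cosine(a, b), cosine(S @ a, S @ b))  # two different values
```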

How do I choose which reduced vector to use in cosine similarity?

The right choice depends on the data and the task. In the LSA setting above, ##q^T_r## and ##q^T_r S## can produce noticeably different similarity rankings, so it is worth evaluating both on a held-out retrieval or classification task and keeping the one that gives the more accurate results for your data set.

What are some potential limitations of using cosine similarity?

One limitation is that cosine similarity compares only the directions of vectors in whatever space they are represented in; if the representation fails to capture context or meaning, the similarity scores will not reflect true semantic similarity. It can also be less informative on very sparse, high-dimensional data, where most pairs of vectors are nearly orthogonal.
