SVD Low-Rank Approximation Algorithm

In summary, the conversation discusses the use of the SVD for low-rank approximation, its mathematical properties, and efficient ways to compute it for large matrices. The recommendation is to use existing linear algebra software such as LAPACK or ARPACK, and not to write code for the Lanczos method without doing extensive research first. The possibility of using Apache Mahout for machine learning and dimensionality reduction is also discussed, as is the use of GPUs for faster matrix calculations.
  • #1
TheOldHag
I'm looking for a concise description of an algorithm for low-rank approximation via SVD. I've seen a number of articles referring to the Lanczos method and finding eigenvalues, but nothing going all the way to determining all the matrices involved in the low-rank SVD of a given matrix.

Any pointers are appreciated.
 
  • #2
You can probably split that question into two parts.

1. What are the mathematical (linear algebra) properties of the SVD: how it can be used for least squares fitting and pseudo-inverse matrices, what it means to ignore (or set to zero) some of the singular values, and so on.

2. How to calculate the SVD efficiently, especially for large matrices.

A (very short and cryptic) answer to (1) is that the SVD of a matrix ##A## is closely related to the eigenvalues and eigenvectors of the matrices ##A^TA## and ##AA^T##. Books and course notes on numerical methods and/or linear algebra should have the details.
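A quick numerical sanity check of that relationship, sketched in Python with NumPy purely as an illustration (the language and library choice are mine): the singular values of ##A## are the square roots of the eigenvalues of ##A^TA##.

```python
import numpy as np

# Check: singular values of A == square roots of eigenvalues of A^T A,
# for any rectangular A. (Small random matrix, purely illustrative.)
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

sing_vals = np.sort(np.linalg.svd(A, compute_uv=False))   # sorted ascending
eig_vals = np.linalg.eigvalsh(A.T @ A)                     # ascending by default

print(sing_vals)
print(np.sqrt(np.clip(eig_vals, 0.0, None)))               # matches to rounding error
```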

For (2), you probably don't really want to know. If you want to calculate SVDs, use the routines from a library like LAPACK, or ARPACK for very large problems.
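Both routes are available from Python through SciPy, for example; the following is only a sketch of how the library calls look. scipy.linalg.svd is LAPACK-backed, and scipy.sparse.linalg.svds computes a few singular triplets of a sparse matrix iteratively (ARPACK by default).

```python
import numpy as np
from scipy.linalg import svd                   # dense SVD, LAPACK under the hood
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds           # truncated SVD, ARPACK by default

# Dense case: full SVD of a modest matrix.
A = np.random.rand(500, 300)
U, s, Vt = svd(A, full_matrices=False)

# Sparse case: only the 5 largest singular triplets of a larger sparse matrix.
# (Sizes and density are illustrative, not taken from this thread.)
S = sparse_random(20_000, 10_000, density=1e-4, format="csr", random_state=0)
U5, s5, Vt5 = svds(S, k=5)
print(sorted(s5, reverse=True))
```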

IIRC the "Numerical Recipes" book has a reasonable description of (1), and some (state of the art in the 1960s) computer code for (2). Old versions of "NR" are free, from here. http://www.nr.com/oldverswitcher.html

Lanczos methods are more or less the current state of the art for very large eigenproblems, but the difference between the basic math (which is simple enough) and computer code that actually works lies in some subtle details about dealing with rounding errors, etc. The math behind Lanczos dates back to the 1940s or 50s, but it took until the 1980s before anybody figured out how to make it work reliably as a numerical method. (In fact Wilkinson's classic book of the 1960s, "The Algebraic Eigenvalue Problem", "proved" that it was nice in theory but useless in practice!) You most definitely don't want to write your own code for the Lanczos method, unless you do a LOT of research first.
 
  • #3
Thanks for the note. Is that to say ARPACK is for large sparse matrices? I'm looking at something on the order of 1,000,000 rows and columns for text mining. The columns would be document term vectors, and since most documents will only have a small number of terms relative to the total term vocabulary, this would be a pretty sparse matrix. Would ARPACK be better for this, or would LAPACK be good enough?
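For scale, here is a back-of-the-envelope on the storage; the figure of roughly 200 distinct terms per document is only an assumption for illustration.

```python
# Back-of-the-envelope storage estimate (pure Python, illustrative numbers).
n_docs, n_terms, terms_per_doc = 1_000_000, 1_000_000, 200
nonzeros = n_docs * terms_per_doc                    # ~2e8 stored entries

# CSR-style sparse storage: one 8-byte value and one 4-byte column index
# per nonzero, plus one row pointer per row.
bytes_sparse = nonzeros * (8 + 4) + (n_docs + 1) * 4
bytes_dense = n_docs * n_terms * 8                   # float64, fully dense

print(f"sparse: ~{bytes_sparse / 1e9:.1f} GB   dense: ~{bytes_dense / 1e12:.0f} TB")
```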
 
  • #4
When you say "...1,000,000 rows and columns...", do you mean 1,000,000 rows AND 1,000,000 columns? In other words, a matrix of 10^12 elements (as a dense matrix)?

Do you have any idea how long it's going to take to calculate an SVD for that matrix?

AlephZero is right on the money when he says "You most definitely don't want to write your own code for the Lanczos method, unless you do a LOT of research first."

You need to use already existing linear algebra software, written by people with years of experience doing this. Writing code to handle sparse matrices would be even more difficult.

Just to give you an example, Mathematica in its current incarnation has highly optimized code for this, including special routines for sparse matrices.

I created a 10000x10000 tridiagonal matrix with random floating point numbers on the 3 diagonals and had Mathematica calculate the first 5 singular values, treating the matrix as a dense matrix. It took 336 seconds on a fast Intel quad core machine with a solid state hard drive.

Converting the dense matrix to a sparse matrix, Mathematica took 0.203 seconds to calculate the first 5 singular values.

If the time scaled up linearly with the number of elements (which it doesn't), calculating the first 5 singular values of your 1,000,000x1,000,000 dense matrix would take 3,360,000 seconds, or roughly 39 days.
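A rough analogue of that experiment can be sketched in Python with SciPy (only a sketch; the timings will of course differ from the Mathematica numbers quoted above).

```python
import time
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import svds

# Random tridiagonal 10000x10000 matrix, analogous to the experiment above.
n = 10_000
rng = np.random.default_rng(1)
bands = [rng.random(n - 1), rng.random(n), rng.random(n - 1)]
A_sparse = diags(bands, offsets=[-1, 0, 1], format="csr")

t0 = time.perf_counter()
svds(A_sparse, k=5, return_singular_vectors=False)    # sparse, top 5 only
print("sparse:", time.perf_counter() - t0, "s")

t0 = time.perf_counter()
np.linalg.svd(A_sparse.toarray(), compute_uv=False)   # dense full SVD (~800 MB, may take minutes)
print("dense: ", time.perf_counter() - t0, "s")
```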

Just the storage for your proposed matrix, as a dense matrix, would take 1000 gigabytes if you only allocated 1 byte per element of the matrix.

Your only hope is to create your matrix as a sparse matrix, and use something like Mathematica or Matlab or some similar package. You definitely don't have enough time to write your own software from scratch. I think even starting with something like LAPACK (actually you would probably want EISPACK rather than LAPACK; even those are old) and interfacing it (with plenty of debugging) would probably consume more time than you have for your entire project.
 
  • #5
Yes, ... apparently my 2 classes in Linear Algebra 10 years ago aren't quite enough. I did an introductory applied class and then went through 'Done Right' for a second time just recently. I'm going to try my hand at Golub over the next 3 months.

In the meantime, I think you are correct regarding doing it yourself. Brick wall comes to mind. Apparently Apache Mahout provides algorithms for machine learning which includes some dimensional reduction stuff. I'm a bit skeptical regarding the performance but they also provide a stochastic implementation of SVD and rank reduction. This looks more up my alley since Mahout is in the general ecosystem of Lucene and text indexing.

Any comments are appreciated.

What do you think of these?

https://cwiki.apache.org/confluence/display/mahout/dimensional+Reduction

https://cwiki.apache.org/confluence/display/MAHOUT/Stochastic+Singular+Value+Decomposition
 
  • #6
Your first reference mentions dense matrices with 10^7 rows and columns having 100 trillion non-zero entries! Are people really working with such large matrices? I guess so; he says "Each of these examples deal with cases of matrices which tend to be tremendously large (often millions to tens of millions to hundreds of millions of rows or more, by sometimes a comparable number of columns), but also rather sparse."

But, do you have access to a supercomputer for your project, or are you going to do the number crunching on a personal computer? If you will use a desktop sized computer, simply creating and storing these large matrices and intermediate results of your calculations will be problematic.

Using sparse matrix techniques will be necessary, I think. A fairly recent improvement in matrix arithmetic speed is the use of the graphics processor unit (GPU) for matrix calculations. Look up "CUDA" in connection with Nvidia graphics cards. Mathematica can use CUDA, and other packages may as well.

You might try communicating with some of the workers in the field whose references you have located and ask them about the practical problems.
 
  • #7
Thanks for the comments. I'm a bit skeptical of Mahout's claims regarding the size of the matrices that need to be worked with. There are document collections in the millions, with vocabularies of around a million terms; however, the term-document matrix for a collection would be extremely sparse and irregular. I'm guessing that the fact that it calculates a low-rank approximation might help: the recommended rank for text clustering is somewhere between 200 and 400, so the resulting factor matrices would only need that many columns (or rows) to be stored, with the rest discarded in the SVD computation. The other dimension would still be large, though.
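As a sketch of what that rank-200-to-400 truncation looks like with off-the-shelf tools (SciPy here rather than Mahout, with the matrix sizes scaled down so it actually runs on a desktop; all numbers are illustrative):

```python
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Scaled-down stand-in for a term-document matrix (rows = terms, cols = docs).
n_terms, n_docs, k = 50_000, 20_000, 300
A = sparse_random(n_terms, n_docs, density=5e-4, format="csr", random_state=0)

# Truncated SVD: only the k largest singular triplets are computed and stored.
U, s, Vt = svds(A, k=k)           # U: n_terms x k,  s: k,  Vt: k x n_docs

# The factors cost O((n_terms + n_docs) * k) storage, not n_terms * n_docs.
print(U.shape, s.shape, Vt.shape)
```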

As for storage, most text indexes are comparable in size to the collection itself, which is big.

In response to your question, these matrices are popping up in the context of text indexing. Linear algebra is at the mathematical core of text indexing.
 
  • #8
The Electrician said:
(actually you would probably want EISPACK rather than LAPACK; even those are old)

EISPACK was a brilliant library based on 1960s-vintage numerical analysis, and running on 1960s or 1970s computer hardware there was nothing to match it. But if your matrices are bigger than about 500 rows and columns, forget about it.

LAPACK is newer, and much more solid, and the LAPACK project is now getting into 21st century high performance computing (it started in the 20th).

TheOldHag said:

My experience over the last 20 or 30 years says, the devil is always in the detail. There wasn't anything that raised my eyebrows much (either in joy or disbelief) but without spending a month playing with it, I don't think that counts for much either way.
 
  • #9
AlephZero said:
My experience over the last 20 or 30 years says, the devil is always in the detail. There wasn't anything that raised my eyebrows much (either in joy or disbelief) but without spending a month playing with it, I don't think that counts for much either way.

I agree with your sentiment. But do you think LAPACK would be able to handle these massively sized matrices? Mahout advertises it, but I have some skepticism until I dig in. Their documentation is horrible and jumbled compared to LAPACK's, from what I have seen. Every other sentence in large parts of their documentation is pushing you to install a new open source project before you can proceed. Very disconnected.
 

1. What is the SVD Low-Rank Approximation Algorithm?

The SVD Low-Rank Approximation Algorithm is a mathematical method used to approximate a matrix by reducing its rank. It utilizes the Singular Value Decomposition (SVD) technique to decompose a matrix into singular values and corresponding vectors, and then approximates the original matrix by keeping only the most significant singular values and their corresponding vectors.
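In symbols, keeping only the ##k## largest singular values gives
$$A \approx A_k = U_k \Sigma_k V_k^T = \sum_{i=1}^{k} \sigma_i\, u_i v_i^T,$$
where ##\sigma_1 \ge \sigma_2 \ge \dots## are the singular values and ##u_i##, ##v_i## the corresponding left and right singular vectors. By the Eckart–Young theorem, ##A_k## is the closest rank-##k## matrix to ##A## in both the spectral and Frobenius norms.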

2. When is the SVD Low-Rank Approximation Algorithm used?

The SVD Low-Rank Approximation Algorithm is commonly used in data compression, noise reduction, and dimensionality reduction. It can also be used in machine learning tasks such as recommender systems, image and signal processing, and natural language processing.

3. How does the SVD Low-Rank Approximation Algorithm work?

The algorithm works by first decomposing the original matrix A into three factors, A = UΣV^T, where U and V are orthogonal matrices and Σ is a diagonal matrix with the singular values of A on its diagonal. These singular values measure the importance of each direction in the original matrix. The algorithm then approximates the original matrix by keeping only the most significant singular values and the corresponding columns of U and V.
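A minimal sketch of that truncation step (dense NumPy, written for clarity rather than for scale):

```python
import numpy as np

def low_rank_approx(A, k):
    """Best rank-k approximation of A, obtained by truncating its SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.rand(8, 5)
A2 = low_rank_approx(A, 2)
print(np.linalg.matrix_rank(A2))      # 2
print(np.linalg.norm(A - A2))         # equals sqrt of the sum of the discarded sigma_i^2
```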

4. What are the benefits of using the SVD Low-Rank Approximation Algorithm?

One of the main benefits of this algorithm is that it can reduce the size of a matrix without losing too much information. This makes it useful for compressing large datasets or reducing the dimensionality of high-dimensional data. It can also help to remove noise from data and improve the performance of machine learning models by reducing the complexity of the data.

5. What are the limitations of the SVD Low-Rank Approximation Algorithm?

Although the SVD itself can be applied to any rectangular matrix, computing it exactly requires significant computational resources, which makes the full decomposition impractical for very large datasets; truncated or iterative methods are needed instead. The quality of a low-rank approximation also depends on how quickly the singular values decay: if they decay slowly, keeping only a few of them discards a significant amount of information.
