Large weighted least squares system

Discussion Overview

The discussion centers around the challenges of efficiently computing a weighted least squares solution for a large dataset with millions of observations. Participants explore methods for constructing the normal equations matrix iteratively while considering the limitations of memory and computational resources.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes the need to compute the matrix A^T W' A from A^T W A while iteratively re-weighting observations, expressing concern about the time required to build the normal equations matrix.
  • Another participant suggests abandoning the normal equations approach in favor of modern least squares algorithms like QR or SVD, emphasizing the need for out-of-core solvers when matrices are too large to fit in memory.
  • A participant with a specific matrix size (8 million rows and 2000 columns) questions how to apply QR or SVD decompositions given their memory constraints and requests references for these methods.
  • One response recommends using a mapreduce approach on a cloud cluster to perform QR decomposition, providing a link to a paper on out-of-core methods for large matrices.
  • Concerns are raised about the potential for round-off errors when solving large normal equations, suggesting that numerical techniques suitable for smaller datasets may not be effective for larger ones.

Areas of Agreement / Disagreement

Participants express differing views on the best approach to take, with some advocating for modern algorithms and others emphasizing the challenges of applying these methods given memory limitations. The discussion remains unresolved regarding the optimal strategy for handling the large weighted least squares system.

Contextual Notes

Participants note limitations related to memory capacity, the feasibility of various numerical methods, and the potential impact of round-off errors in large systems. There is an acknowledgment that the techniques suitable for smaller datasets may not be directly applicable to larger ones.

vibe3
I have a large weighted least squares system with millions of observations. I can't store the least squares matrix in memory, so I only store the normal equations matrix:

A^T W A

where A is dense and n-by-p and W = diag(w_1,...,w_n). It takes a very long time to build this normal equations matrix due to the large number of observations. However, I also want to iteratively re-weight the observations to reduce the effect of outliers, and so at the next iteration, the weight matrix will be the product of the original weights W with some new weights V = diag(v_1,...,v_n). Therefore I need to construct the matrix:

A^T W' A

where W' = diag(w_1 v_1, w_2 v_2, ..., w_n v_n).

Does anyone know of a slick (fast) way to determine the matrix A^T W' A given the matrix A^T W A and the new weights V? I am hoping I don't need to start over from scratch to build A^T W' A for each iteration, since it takes so long.

I did search the literature and didn't find much on this topic.
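For reference, a minimal numpy sketch of how one pass over the data looks when A is streamed from disk in row blocks (my own illustration, not from the thread; the function name and block iterator are hypothetical). It uses the identity A^T W' A = sum over blocks of A_b^T diag(w_b v_b) A_b, which is why, without A or some cached per-block products, each re-weighting costs another full pass:

```python
import numpy as np

def normal_matrix(block_iter, w, v):
    """Accumulate A^T W' A = sum_i w_i v_i a_i a_i^T over row blocks.

    block_iter yields (row_start, A_block); the full A never lives in memory.
    """
    M = None
    for start, A_b in block_iter:
        n_b = A_b.shape[0]
        if M is None:
            p = A_b.shape[1]
            M = np.zeros((p, p))
        wv = w[start:start + n_b] * v[start:start + n_b]
        # scale each row a_i by w_i * v_i, then contract: A_b^T diag(wv) A_b
        M += A_b.T @ (wv[:, None] * A_b)
    return M

# tiny demo: an in-memory A split into two "disk" blocks
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))
w = rng.random(10)
v = rng.random(10)
M = normal_matrix(iter([(0, A[:5]), (5, A[5:])]), w, v)
assert np.allclose(M, A.T @ np.diag(w * v) @ A)
```

The p-by-p accumulator stays small (2000-by-2000 here), so the cost is dominated by reading and multiplying the row blocks.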
 
Yeah, forget forming the normal equations altogether. You should use a more modern LS algorithm which incorporates either the QR or the SVD algorithm.

If the matrices are too large to store in memory at one time, then use an out-of-core solver.
 
SteamKing said:
Yeah, forget forming the normal equations altogether. You should use a more modern LS algorithm which incorporates either the QR or the SVD algorithm.

If the matrices are too large to store in memory at one time, then use an out-of-core solver.

My matrix A has 8 million rows (observations) and 2000 columns (parameters). The normal equations matrix A^T W A is therefore 2000-by-2000, which can easily be handled by LAPACK to solve for the unknown parameters.

I can't even store the full matrix A in memory to attempt to do a QR or SVD on it. How do you propose I use these decompositions? Do you have a reference?
 
If you can't store it in memory, then don't. If you're familiar with MapReduce, rent an Amazon cluster and use a Hadoop system to run a QR decomposition in MapReduce land. It'll generally compute faster, and memory won't be an issue.

Also relating to SteamKing's comment. Here's a paper on how to use out of core.

http://www.tau.ac.il/~stoledo/Bib/Pubs/pp01-qr.pdf

It's specifically aimed at cases where a matrix cannot be stored in memory, only on disk.

(Although personally, for me, a MapReduce on a cluster is easier to handle.)
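The core idea behind these out-of-core and MapReduce QR schemes is the tall-skinny QR (TSQR) trick: QR-factor each row block independently, stack the small R factors, and QR the stack; the final R satisfies R^T R = A^T A without the full A ever being in memory. A toy numpy sketch (my own illustration with hypothetical names; the paper's actual algorithm is more elaborate):

```python
import numpy as np

def tsqr_R(blocks):
    """Tall-skinny QR: factor each row block, then QR the stacked R factors.

    Returns an upper-triangular R (unique up to row signs) with R^T R = A^T A.
    In an out-of-core setting each block would be read from disk in turn.
    """
    Rs = [np.linalg.qr(B, mode="r") for B in blocks]
    return np.linalg.qr(np.vstack(Rs), mode="r")

# demo: compare against the R factor of the full matrix
rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 4))
R = tsqr_R([A[i:i + 250] for i in range(0, 1000, 250)])
assert np.allclose(R.T @ R, A.T @ A)
```

For the weighted problem, one would scale each row block by the square roots of its weights (i.e., factor W^{1/2} A) before the per-block QR.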
 
vibe3 said:
My matrix A has 8 million rows (observations) and 2000 columns (parameters). The normal equations matrix A^T W A is therefore 2000-by-2000, which can easily be handled by LAPACK to solve for the unknown parameters.

It's not clear that solving that many normal equations will produce a useful result. Round off error in solving that large a system of normal equations could produce a garbage regression.

I can't even store the full matrix A in memory to attempt to do a QR or SVD on it. How do you propose I use these decompositions? Do you have a reference?
Not at the moment. With such a large problem to solve, QR or SVD may not be practical. You may have to resort to a solution by iteration.

The point is, the numerical techniques which may work well for finding a regression line thru a handful of data points may not be ideally suited when you have millions of data points and thousands of parameters.
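The round-off concern has a precise form: forming the normal equations squares the condition number, cond(A^T A) = cond(A)^2, so roughly half the significant digits can be lost before the solve even starts. A tiny numpy experiment (my own illustration) makes this concrete:

```python
import numpy as np

# Build A with singular values chosen so that cond(A) = 1e6 exactly,
# then observe that cond(A^T A) is about 1e12, i.e. cond(A) squared.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((100, 3)))  # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))    # orthogonal
A = U @ np.diag([1.0, 1e-3, 1e-6]) @ V.T            # cond(A) = 1e6 by construction
kA = np.linalg.cond(A)
kAtA = np.linalg.cond(A.T @ A)
assert 0.1 < kAtA / kA**2 < 10                      # condition number squared
```

This is the standard argument for QR- or SVD-based least squares, which work with A directly and keep the conditioning at cond(A) rather than cond(A)^2.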
 
