Large weighted least squares system

In summary, using an out-of-core solver may be a better option for large weighted least squares systems with many parameters.
  • #1
vibe3
I have a large weighted least squares system with millions of observations. I can't store the least squares matrix in memory, so I only store the normal equations matrix:

[tex]
A^T W A
[/tex]

where [itex]A[/itex] is dense and n-by-p and [itex]W = diag(w_1,...,w_n)[/itex]. It takes a very long time to build this normal equations matrix due to the large number of observations. However, I also want to iteratively re-weight the observations to reduce the effect of outliers, and so at the next iteration, the weight matrix will be the product of the original weights [itex]W[/itex] with some new weights [itex]V = diag(v_1,...,v_n)[/itex]. Therefore I need to construct the matrix:

[tex]
A^T W' A
[/tex]

where [itex]W' = diag(w_1 v_1, w_2 v_2, ..., w_n v_n)[/itex].

Does anyone know of a slick (fast) way to determine the matrix [itex]A^T W' A[/itex] given the matrix [itex]A^T W A[/itex] and the new weights [itex]V[/itex]? I am hoping I don't need to start over from scratch to build [itex]A^T W' A[/itex] for each iteration, since it takes so long.

I did search the literature and didn't find much on this topic.
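One way to see the difficulty: [itex]A^T W' A = \sum_i w_i v_i a_i a_i^T[/itex], where [itex]a_i^T[/itex] is the i-th row of [itex]A[/itex]. The compressed matrix [itex]A^T W A[/itex] has already summed away the individual rows, so in general the reweighted matrix cannot be recovered from it and [itex]V[/itex] alone; one option is to re-accumulate it in a single blocked pass over [itex]A[/itex] on disk. A minimal numpy sketch of that accumulation (the block iterator standing in for whatever on-disk format holds [itex]A[/itex] is an assumption):

```python
import numpy as np

def accumulate_normal_matrix(block_iter, w, v):
    """Accumulate A^T W' A in one pass over row blocks of A,
    where W' = diag(w_i * v_i).

    block_iter yields (rows, row_slice) pairs; w and v are the
    original and new weight vectors of length n.  The block source
    (disk file, database, ...) is left abstract here.
    """
    ata = None
    for rows, sl in block_iter:
        if ata is None:
            p = rows.shape[1]
            ata = np.zeros((p, p))
        wv = w[sl] * v[sl]                    # combined weights for this block
        ata += rows.T @ (wv[:, None] * rows)  # block's contribution to A^T W' A
    return ata
```

Only one p-by-p accumulator and one block of rows are ever in memory, so the pass costs the same as building the original normal matrix; the savings, if any, come from caching per-block quantities rather than from algebra on [itex]A^T W A[/itex] itself.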
 
  • #2
Yeah, forget forming the normal equations altogether. You should use a more modern LS algorithm which incorporates either the QR or the SVD algorithm.

If the matrices are too large to store in memory at one time, then use an out-of-core solver.
 
  • #3
SteamKing said:
Yeah, forget forming the normal equations altogether. You should use a more modern LS algorithm which incorporates either the QR or the SVD algorithm.

If the matrices are too large to store in memory at one time, then use an out-of-core solver.

My matrix [itex]A[/itex] has 8 million rows (observations) and 2000 columns (parameters). The normal equations matrix [itex]A^T W A[/itex] is therefore 2000-by-2000, which can easily be handled by LAPACK to solve for the unknown parameters.

I can't even store the full matrix A in memory to attempt to do a QR or SVD on it. How do you propose I use these decompositions? Do you have a reference?
 
  • #4
If you can't store it in memory, then don't. If you're familiar with MapReduce, rent an Amazon cluster and use a Hadoop system to run a QR decomposition in MapReduce land. It'll generally compute faster, and memory won't be an issue.

Also, relating to SteamKing's comment, here's a paper on out-of-core QR factorization.

http://www.tau.ac.il/~stoledo/Bib/Pubs/pp01-qr.pdf

It's specifically aimed at cases where a matrix cannot be stored in memory, only on disk.

(Although personally for me, a mapreduce on a cluster is easier to handle)
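The blocked QR idea behind both the out-of-core and the MapReduce approaches can be illustrated with the "tall-skinny QR" (TSQR) trick: QR-factor each row block independently, stack the small R factors, and QR the stack. This numpy sketch keeps only one block in memory at a time; a real out-of-core or cluster implementation would stream the blocks instead of holding them in a list:

```python
import numpy as np

def tsqr_r(blocks):
    """Tall-skinny QR sketch: compute the R factor of a tall matrix
    given as row blocks, without ever forming the full matrix.

    Each per-block R is p-by-p, so the final stacked QR is tiny
    compared with the original n-by-p problem.
    """
    rs = [np.linalg.qr(b, mode="r") for b in blocks]  # one small R per block
    return np.linalg.qr(np.vstack(rs), mode="r")      # QR of the stacked R's
```

Since the QR of [itex]A[/itex] satisfies [itex]R^T R = A^T A[/itex], this delivers the same information as the normal matrix but without squaring the condition number; weights can be folded in by scaling each row by [itex]\sqrt{w_i}[/itex] before factoring.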
 
  • #5
vibe3 said:
My matrix [itex]A[/itex] has 8 millions rows (observations) and 2000 columns (parameters). The normal equations matrix [itex]A^T W A[/itex] is therefore 2000-by-2000, which can easily be handled by LAPACK to solve for the unknown parameters.

It's not clear that forming and solving the normal equations for a problem this size will produce a useful result. Forming [itex]A^T W A[/itex] roughly squares the condition number of the problem, so round-off error could produce a garbage regression.

I can't even store the full matrix A in memory to attempt to do a QR or SVD on it. How do you propose I use these decompositions? Do you have a reference?
Not at the moment. With such a large problem to solve, a dense QR or SVD may not be practical. You may have to resort to an iterative solution.

The point is, numerical techniques that work well for fitting a regression line through a handful of data points may not be well suited when you have millions of data points and thousands of parameters.
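The iterative route mentioned above can be sketched with LSQR, which only needs matrix-vector products with [itex]A[/itex] and [itex]A^T[/itex], so the matrix can stay on disk and be streamed through the two callbacks. A hedged scipy sketch (the `matvec`/`rmatvec` callbacks standing in for the out-of-core data access are assumptions):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lsqr

def weighted_lsqr(matvec, rmatvec, shape, b, sqrt_w):
    """Solve min_x || W^(1/2) (A x - b) || with LSQR, where A is only
    available through products A @ x (matvec) and A.T @ y (rmatvec).

    sqrt_w holds the square roots of the weights; scaling rows by
    sqrt(w_i) turns the weighted problem into an ordinary one.
    """
    n, p = shape
    op = LinearOperator(
        (n, p),
        matvec=lambda x: sqrt_w * matvec(x),    # W^(1/2) A x
        rmatvec=lambda y: rmatvec(sqrt_w * y),  # A^T W^(1/2) y
    )
    return lsqr(op, sqrt_w * b)[0]
```

Reweighting between robust iterations then only means changing `sqrt_w`; no factorization or normal matrix has to be rebuilt, though each LSQR iteration does cost one pass over the data for each of the two products.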
 

1. What is a large weighted least squares system?

Weighted least squares is a fitting method that finds the best fit to a set of data points by minimizing the weighted sum of squared differences between the observed values and the model's predictions. A "large" system is one with enough observations and parameters that memory and run time become the dominant concerns. It is often used in scientific research and data analysis to determine the relationship between variables and make predictions.

2. How does a large weighted least squares system work?

In weighted least squares, each data point is given a weight reflecting its reliability. The method then finds the line or curve that minimizes the weighted sum of squared differences between the observed and predicted values, so the more reliable points exert more influence on the fit. This generally yields a more accurate and precise fit than treating all points equally.
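The standard computational trick, as a small worked sketch: scaling each row of the design matrix and the observations by [itex]\sqrt{w_i}[/itex] reduces weighted least squares to an ordinary least squares solve (function name and data here are illustrative only):

```python
import numpy as np

def wls_fit(X, y, w):
    """Weighted least squares via the sqrt-weight trick: scaling row i
    of X and entry i of y by sqrt(w_i) makes the ordinary LS objective
    equal to the weighted one."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
    return beta
```

This is algebraically identical to solving the normal equations [itex]X^T W X \beta = X^T W y[/itex], but numerically better behaved because the condition number is not squared.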

3. What are the advantages of using a large weighted least squares system?

One of the main advantages of using a large weighted least squares system is that it takes into account the reliability of the data points, resulting in a more accurate fit for the data. It also allows for the inclusion of outliers in the data, as they can be given lower weights and not heavily influence the overall fit. Additionally, it is a widely used and well-understood method, making it easy to implement and interpret the results.

4. When should a large weighted least squares system be used?

Weighted least squares should be used when fitting a model to data whose observations have unequal reliability (for example, unequal error variances), so that the fit should trust some measurements more than others. It is commonly used in fields such as statistics, economics, and engineering to analyze data and make predictions.

5. Are there any limitations to using a large weighted least squares system?

One limitation is that the weights (the relative reliabilities of the observations) must be known or estimated; badly chosen weights can distort the fit. The standard inference results also assume the errors are approximately normally distributed, which may not always hold. Finally, extreme outliers can still significantly influence the results unless the weights are chosen robustly, for example by iterative reweighting.
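The robust reweighting just mentioned, and the iterative reweighting from the original question, can be sketched as iteratively reweighted least squares (IRLS) with Huber weights, one standard choice; this is an illustrative numpy sketch, not a reference to any particular library routine, and the tuning constant k = 1.345 is the conventional Huber value:

```python
import numpy as np

def irls_huber(X, y, k=1.345, iters=10):
    """IRLS sketch with Huber weights: refit repeatedly, downweighting
    points whose residual exceeds k times a robust scale estimate."""
    w = np.ones(len(y))
    for _ in range(iters):
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
        r = y - X @ beta
        # robust scale via the median absolute deviation (MAD)
        scale = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12
        u = np.abs(r / scale)
        w = np.where(u <= k, 1.0, k / u)  # Huber weight function
    return beta, w
```

Each iteration is an ordinary weighted solve, so for out-of-core data it maps directly onto whichever large-scale solver (blocked normal equations, out-of-core QR, or LSQR) is being used for the inner fit.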
