Calculating a covariance matrix with missing data

This thread discusses estimating a covariance matrix whose elements are defined as [itex] E(X_i X_j) - E(X_i) E(X_j) [/itex] when the random variables [itex] X_i [/itex] and [itex] X_j [/itex] have different sample sizes. The original poster asks whether each pair of samples can be trimmed to a common length and requests literature on the topic. A reply points toward methods for estimating covariance from samples with missing data, notes that choosing a method requires assumptions about the underlying distribution, and links to a relevant discussion.
  • #1
TravisJay
Consider a covariance matrix A such that each element [itex] a_{i,j} = E(X_i X_j) - E(X_i) E(X_j) [/itex], where [itex] X_i, X_j [/itex] are random variables.

Consider the case where the variables have different sample sizes. Say that [itex] X_i [/itex] has the observations [itex] x_{i,1}, \dots, x_{i,N} [/itex] and [itex] X_j [/itex] has the observations [itex] x_{j,1}, \dots, x_{j,n} [/itex], where the observations are paired up to index n and N > n.

In this case, for each covariance [itex] a_{i,j} [/itex], is it acceptable to trim the samples of [itex] X_i [/itex] and [itex] X_j [/itex] to n and continue the calculation? (I'm not sure "trim" is the correct terminology, but it seems to meet my needs.)

If it is acceptable to trim, is it necessary to trim to the smallest n across all of the random variables X, or can I just trim to the smaller sample of each pair?

I'd appreciate it if anyone can point me in the direction of some literature that explains this in detail. I've been struggling to find something that is specific to this case.
 
  • #2
See whether non-parametric statistical analysis might be what you need.
 
  • #3
TravisJay said:
Consider a covariance matrix A such that each element [itex] a_{i,j} = E(X_i X_j) - E(X_i) E(X_j) [/itex], where [itex] X_i, X_j [/itex] are random variables.
Such a matrix is computed from the joint distribution of [itex] X_i, X_j [/itex]. It isn't computed from sample data.

Consider the case that each variable X has a different sample size.

Apparently, what you want to do is estimate the covariance matrix.

You should look up methods of estimating covariance from samples that have missing data.
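
For reference, one common estimator of this kind is the pairwise (available-case) sample covariance. The notation below is an assumption made for illustration and is not defined in the thread: [itex] S_{ij} [/itex] is the set of indices at which both [itex] X_i [/itex] and [itex] X_j [/itex] were observed, and [itex] n_{ij} [/itex] is its size.

```latex
\hat{a}_{i,j} = \frac{1}{n_{ij} - 1} \sum_{k \in S_{ij}}
    \left( x_{i,k} - \bar{x}_i^{(ij)} \right) \left( x_{j,k} - \bar{x}_j^{(ij)} \right),
\qquad
\bar{x}_i^{(ij)} = \frac{1}{n_{ij}} \sum_{k \in S_{ij}} x_{i,k}
```

In the setup of the original question, [itex] S_{ij} = \{1, \dots, n\} [/itex], so this corresponds to trimming each pair to the smaller of its two samples. Whether that is a good estimator depends on why the other observations are missing.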


I'd appreciate it if anyone can point me in the direction of some literature that explains this in detail. I've been struggling to find something that is specific to this case.

You haven't given enough information to define the case. There is no general "best" method for doing this unless you make some assumptions - for example, assumptions about what family of distributions generated the data.

http://icml.cc/discuss/2012/313.html
 

Related to Calculating a covariance matrix with missing data

1. What is a covariance matrix?

A covariance matrix is a square matrix that summarizes the relationships between multiple variables. It contains the variances of each variable along the diagonal and the covariances between each pair of variables in the off-diagonal elements.

2. Why is missing data a problem in calculating a covariance matrix?

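Missing data can lead to biased estimates of the covariance matrix when the chance that a value is missing depends on the data, for example on the value itself or on another variable. The observed cases are then no longer representative of the full sample, which can result in incorrect conclusions about the relationships between variables.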
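
A minimal sketch of how this bias can arise, using made-up data. The rule that y goes unrecorded whenever x exceeds 5 is a hypothetical missingness mechanism chosen for illustration, and `sample_cov` is a helper written here, not a library function.

```python
def sample_cov(xs, ys):
    """Unbiased sample covariance of two equal-length, fully observed lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Made-up data: y = 2x for x = 1..10.
x = [float(v) for v in range(1, 11)]
y_full = [2.0 * v for v in x]

# Hypothetical mechanism: y is never recorded when x > 5,
# so only the pairs with x <= 5 are available.
observed = [(a, b) for a, b in zip(x, y_full) if a <= 5]
x_obs = [a for a, _ in observed]
y_obs = [b for _, b in observed]

cov_full = sample_cov(x, y_full)      # what complete data would give
cov_observed = sample_cov(x_obs, y_obs)  # what the available pairs give

print(cov_full, cov_observed)
```

The available pairs cover only the lower half of the range of x, so the covariance computed from them is much smaller than the one computed from the complete data.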

3. How do you handle missing data when calculating a covariance matrix?

There are several methods for handling missing data when calculating a covariance matrix, including listwise deletion (discarding every case that has any missing value), pairwise deletion (using, for each pair of variables, all cases in which both are observed), and imputation (substituting estimated values for the missing data).
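
The two deletion strategies can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: missing values are encoded as `math.nan`, the data are invented, and the function names `sample_cov`, `pairwise_cov_matrix`, and `listwise_cov_matrix` are hypothetical helpers defined here.

```python
import math


def sample_cov(xs, ys):
    """Unbiased sample covariance of two equal-length, fully observed lists."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pairwise_cov_matrix(columns):
    """Each entry uses every row in which both of its two columns are observed."""
    k = len(columns)
    cov = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            pairs = [(a, b) for a, b in zip(columns[i], columns[j])
                     if not (math.isnan(a) or math.isnan(b))]
            xs = [a for a, _ in pairs]
            ys = [b for _, b in pairs]
            cov[i][j] = cov[j][i] = sample_cov(xs, ys)
    return cov


def listwise_cov_matrix(columns):
    """Every entry uses only the rows in which all columns are observed."""
    complete_rows = [row for row in zip(*columns)
                     if not any(math.isnan(v) for v in row)]
    trimmed = [list(col) for col in zip(*complete_rows)]
    return pairwise_cov_matrix(trimmed)


# Invented data: three variables with 5, 4, and 3 observations, padded with NaN.
nan = math.nan
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 6.0, 8.0, nan]
x3 = [1.0, 0.0, 1.0, nan, nan]

pairwise = pairwise_cov_matrix([x1, x2, x3])
listwise = listwise_cov_matrix([x1, x2, x3])

print(pairwise[0][1], listwise[0][1])
```

Pairwise deletion trims each pair to its own common length, so the x1/x2 entry uses four rows; listwise deletion trims everything to the three rows that are complete for all variables. In pandas, `DataFrame.cov()` excludes missing values pair by pair, and calling `dropna()` first gives the listwise result.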

4. What are some assumptions for calculating a covariance matrix with missing data?

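Common assumptions concern why the data are missing. Under missing completely at random (MCAR), missingness is unrelated to any of the data, and deletion methods generally rely on this. Under missing at random (MAR), missingness may depend on observed values but not on the missing values themselves; likelihood-based estimation and multiple imputation are typically justified under this weaker assumption, usually together with an assumed family of distributions for the data.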

5. Are there any limitations to using a covariance matrix with missing data?

Yes. Limitations include potential bias in the estimates, the possibility that a matrix estimated by pairwise deletion is not positive semidefinite, difficulty in comparing results with studies that handled missing data differently, and the need to consider carefully the assumptions about, and the potential impact of, the missing data.
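
One limitation specific to trimming each pair separately is that the entries come from different subsets of rows, so the assembled matrix need not be a valid covariance matrix. A minimal sketch with data constructed for the purpose (missing values encoded as `math.nan`; `pairwise_cov` is a hypothetical helper defined here):

```python
import math

NAN = math.nan


def pairwise_cov(xs, ys):
    """Sample covariance over the rows where both values are observed."""
    pairs = [(a, b) for a, b in zip(xs, ys)
             if not (math.isnan(a) or math.isnan(b))]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in pairs) / (n - 1)


# Constructed data: each pair of variables overlaps on a different block of rows.
x1 = [1.0, 2.0, 3.0, NAN, NAN, NAN, 1.0, 2.0, 3.0]
x2 = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, NAN, NAN, NAN]
x3 = [NAN, NAN, NAN, 1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
columns = [x1, x2, x3]

cov = [[pairwise_cov(a, b) for b in columns] for a in columns]

# "Variance" of the combination X1 - X2 + X3 implied by the estimated matrix.
# For a genuine covariance matrix this quadratic form can never be negative.
w = [1.0, -1.0, 1.0]
implied_variance = sum(w[i] * cov[i][j] * w[j]
                       for i in range(3) for j in range(3))

print(cov)
print(implied_variance)
```

Here the estimated matrix implies a negative variance for X1 - X2 + X3, so it is not positive semidefinite. Trimming all variables to the rows they share (listwise deletion) avoids this, at the cost of discarding data.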
