Cross correlation in two data sets of the same sample population

Discussion Overview

The discussion revolves around estimating the shares of political inclinations for specific countries using two datasets from annual surveys. One dataset contains political affiliations (Conservative, Liberal, Socialist, Other) over a short time series, while the other provides the number of respondents by country over the same period. Participants explore various statistical methods to analyze the data and address potential issues with correlation and regression analysis.

Discussion Character

  • Exploratory
  • Technical explanation
  • Mathematical reasoning
  • Debate/contested

Main Points Raised

  • One participant proposes using a correlation matrix to identify relationships between political inclinations and country data, but expresses concern about the potential for spurious correlations due to trends in the data.
  • Another participant suggests that linear regression might be useful, but notes that significant results are rare, often only for intercepts rather than the coefficients of interest.
  • There is a discussion about the limitations of canonical correlation and non-linear regression in this context, with a participant questioning whether dividing values by total respondents would help eliminate nonsensical correlations.
  • A suggestion is made to use multinomial logit regression to estimate conditional probabilities related to political affiliation and country, contingent on having enough degrees of freedom.
  • Cluster analysis is also mentioned as a potential method to explore the data further.

Areas of Agreement / Disagreement

Participants express various viewpoints on the methods to analyze the datasets, with no consensus on the best approach. Some methods are suggested as potentially useful, while others are critiqued for their limitations.

Contextual Notes

Participants highlight the need for sufficient degrees of freedom in statistical models and express uncertainty about the effectiveness of certain methods in addressing the issues presented by the datasets.

Who May Find This Useful

This discussion may be of interest to those involved in statistical analysis, particularly in the context of political science, survey data interpretation, and regression modeling.

pink.noise
I have the following problem:

Suppose there is a survey of persons with two properties each (a and b). Property a can take a relatively small number of values (say, the political inclination: Conservative, Liberal, Socialist, or Other); property b can take a large number of values (say, the country the person is living in).

Now there are two datasets from annual surveys. The first one contains the numbers of Conservative, Liberal, Socialist and Other respondents between 1996 and 2010 (thus a rather short time series). The second one contains the number of respondents by country from 1996 through 2010. The shares of the political camps by country are unknown.

Is there any possible way to estimate the shares of political inclinations for specific countries?

I thought that, making the heroic assumptions that
1) the number of respondents in general and for any possible combination of a and b is large enough and that
2) the shares of political inclinations do not change over time and differ only by country (thus the data from 1996 through 2010 could be treated as independent observations),
a simple correlation matrix could be used. (Idea 1)
Further, a linear regression taking the time series from the first dataset (political inclinations) as independent variables and the number of individuals from a specific country as the dependent variable might be useful as well. (Idea 2)

Using Idea 1 (correlation matrix), I find a number of high correlations, deemed significant by their t-value. (To use the t-value for significance testing of correlation coefficients is right, isn't it?) They might however still be accidental nonsense correlations caused by the same but unrelated strong trend in both time series. (Meaning: if all the time series are growing over time because the sample of the survey is increased from year to year, everything should be highly correlated without representing any useful information at all, right?)
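A minimal sketch of the concern raised here, using made-up numbers rather than the survey data. It builds two hypothetical series that share only a growth trend, computes the usual t-statistic for a correlation coefficient, t = r·sqrt((n-2)/(1-r²)) with n-2 degrees of freedom, and then repeats the calculation on year-to-year differences. The trend size, noise level, and the helper name `corr_t` are assumptions for illustration. The t-test itself presumes independent observations, which trending time series do not satisfy.

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1996, 2011)            # 15 annual observations
trend = 1000 + 50 * (years - 1996)       # assumed growth in sample size

# Two hypothetical series that grow together but are otherwise unrelated.
x = trend + rng.normal(0, 20, size=years.size)
y = 2 * trend + rng.normal(0, 20, size=years.size)

def corr_t(a, b):
    """Pearson r and its t-statistic (n - 2 degrees of freedom)."""
    r = np.corrcoef(a, b)[0, 1]
    n = len(a)
    t = r * np.sqrt((n - 2) / (1 - r**2))
    return r, t

# Levels: the shared trend dominates, so r is close to 1.
r_levels, t_levels = corr_t(x, y)

# First differences: the trend becomes a constant and drops out of r.
r_diff, t_diff = corr_t(np.diff(x), np.diff(y))

print(f"levels:      r = {r_levels:.3f}, t = {t_levels:.1f}")
print(f"differences: r = {r_diff:.3f}, t = {t_diff:.1f}")
```

In this toy setup the correlation in levels is large even though the series are unrelated apart from the trend, while the correlation of the differences only reflects the independent noise.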

Using Idea 2 (linear regression) I cannot find anything significant (p-value rarely going lower than 10% and if so then usually for the axis intercept not for the more interesting coefficients.)
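For comparison, here is a sketch of a related regression that reverses the roles of the variables: the camp totals are regressed on the country counts without an intercept, so that each coefficient can be read as a share of a camp within a country. This is not Idea 2 as described above but a swapped-in variant (often called ecological regression), and it relies on the same heroic assumption of constant shares. All numbers, including `true_shares`, are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T, C, P = 15, 3, 4    # years, countries, camps (all hypothetical)

# Hypothetical respondents per country and year.
country_counts = rng.integers(1000, 5000, size=(T, C)).astype(float)

# Hypothetical constant shares P(camp | country); each row sums to 1.
true_shares = np.array([
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.35, 0.30, 0.10],
    [0.10, 0.20, 0.50, 0.20],
])

# Aggregate camp totals implied by those shares (noise-free toy data).
camp_counts = country_counts @ true_shares

# Least squares without intercept: one column of coefficients per camp.
est_shares, _, rank, _ = np.linalg.lstsq(country_counts, camp_counts, rcond=None)

print(np.round(est_shares, 3))
```

With noise-free data and more years than countries the shares are recovered exactly. With real data the country series tend to move together, so the estimates can be very imprecise and may even fall outside the range 0 to 1.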

Canonical Correlation would probably suffer from the same problems as Idea 1 (correlation matrix), right?

Non-linear regression would not be helpful either, since we are dealing with numbers of persons and shares of a population, which should be linear, right?

Dividing any value of the data sets by the total number of respondents in the specific year - would this eliminate the nonsense correlations from the correlation matrix (Idea 1)? I'm not sure, but probably not completely ...
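A small sketch of what normalising by the yearly total does, using invented counts for just two camps. Dividing by the total removes the common growth in sample size, but the resulting shares must sum to one, which forces a dependence of its own: with two categories the shares are correlated at exactly -1 whatever the data. The counts below are assumptions for illustration only.

```python
import numpy as np

# Hypothetical counts for two camps over five years (illustrative only).
counts = np.array([
    [300.0, 700.0],
    [330.0, 720.0],
    [390.0, 730.0],
    [400.0, 800.0],
    [470.0, 810.0],
])

# Divide each year's counts by that year's total number of respondents.
totals = counts.sum(axis=1, keepdims=True)
shares = counts / totals

r_counts = np.corrcoef(counts[:, 0], counts[:, 1])[0, 1]
r_shares = np.corrcoef(shares[:, 0], shares[:, 1])[0, 1]

print(f"correlation of counts: {r_counts:.3f}")
print(f"correlation of shares: {r_shares:.3f}")
```

So the normalisation trades one artefact (shared growth) for another (the sum-to-one constraint), which fits the suspicion that it does not remove the problem completely.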

Is there any more elaborate method to get some information out of this dataset?

Any help is appreciated. Thanks.

PS: Since this is a rather specific question, it would possibly have been better placed in a statistics forum such as talkstats.com. However, it is not possible for me to register there - either they have disabled registration completely or they don't accept email addresses outside the US - either of which is of course rather stupid if they want to keep the forum going.
 
Does the following fairly represent your dataset?

Year, Camp, Resps, Cntry1, Cntry2, Cntry3, Cntry4, Cntry5
1996, 0, 10
1996, 1, 20
1996, Total, 30, 6, 6, 6, 6, 6

2010, 0, 15
2010, 1, 10
2010, Total, 25, 10, 1, 3, 1, 10
 
Thanks EnumaElish & PatrickI.

@EnumaElish

the dataset is more like this:

               Period1    Period2    Period3   ...
Country1         11882      12835    ...
Country2          4360       4573    ...
Country3          2963       2778    ...
Country4         16997      16125    ...
Country5          4197       4301    ...
Country6       ...
...
CountryTotal   ...
PCamp1         3233130    3228623    ...
PCamp2          254100     256788    ...
PCamp3         ...
...
PCampTotal     3736217    3725002    ...


@PatrickI: will mail you; thanks for taking the effort.
 
PatrickI said:
Hi there,

If you have the data in a SPSS data file, please, send them to me (<< e-mail address deleted by berkeman >>) and I will look into it. Do not forget to describe your variables. Perhaps I can help you.

Please do not post your e-mail address in the open forums -- there are issues with spam. Please use the PM system to make side contacts. Thanks.
 
Through a multinomial logit regression, you can estimate both:

1. conditional probability of respondent belonging to country C, given his/her political affiliation,

2. conditional probability of respondent's political affiliation being P, given his/her country,

provided you have sufficiently many degrees of freedom (time periods, T) to identify the model parameters (coefficients, K). In general, you need T > K.
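A rough sketch of the T > K check, assuming one common parameterisation of the multinomial logit: one outcome category serves as the reference, and every other category gets an intercept plus one coefficient per predictor. The function name `mnl_param_count` and the choice of five predictors are hypothetical; how the predictors would actually be built from the country data is not specified in the thread.

```python
def mnl_param_count(n_outcomes: int, n_predictors: int) -> int:
    """Free coefficients in a multinomial logit with one reference outcome.

    Each of the (n_outcomes - 1) non-reference categories has an intercept
    plus one coefficient per predictor.
    """
    return (n_outcomes - 1) * (n_predictors + 1)

T = 15                                             # annual surveys, 1996 through 2010
K = mnl_param_count(n_outcomes=4, n_predictors=5)  # 4 camps, 5 assumed predictors
identified = T > K

print(f"T = {T}, K = {K}, T > K: {identified}")
```

Under these assumptions K is already 18 with only five predictors, so 15 annual observations would not be enough, and the gap widens quickly as more countries are added.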
 
Thanks for the advice, EnumaElish, this is a good suggestion.

I do not (yet -- maybe if I just wait some more time ;) ) have T>K but I will try anyways, maybe I will get something interesting.

PatrickI suggested starting with cluster analysis, which is certainly a good idea as well.

Thanks to you all.
 
