Cross correlation in two data sets of the same sample population

  1. Aug 4, 2010 #1
    I have the following problem:

    Suppose there is a survey of persons with two properties each (a and b). Property a can take a relatively small number of values (say, political inclination: Conservative, Liberal, Socialist, or Other), while property b can take a large number of values (say, the country the person lives in).

    Now there are two datasets from annual surveys. The first one contains the numbers of Conservative, Liberal, Socialist and Other respondents between 1996 and 2010 (thus a rather short time series). The second one contains the number of respondents by country from 1996 through 2010. The shares of the political camps by country are unknown.

    Is there any possible way to estimate the shares of political inclinations for specific countries?

    I thought that, under the heroic assumptions that
    1) the number of respondents in general and for any possible combination of a and b is large enough and that
    2) the shares of political inclinations do not change over time and differ only by country (thus the data from 1996 through 2010 could be treated as independent observations),
    a simple correlation matrix could be used. (Idea 1)
    Further, a linear regression that takes the time series from the first dataset (political inclinations) as independent variables and the number of individuals from a specific country as the dependent variable might be useful as well. (Idea 2)

    Using Idea 1 (correlation matrix), I find a number of high correlations, deemed significant by their t-value. (To use the t-value for significance testing of correlation coefficients is right, isn't it?) They might however still be accidental nonsense correlations caused by the same but unrelated strong trend in both time series. (Meaning: if all the time series are growing over time because the sample of the survey is increased from year to year, everything should be highly correlated without representing any useful information at all, right?)
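On the two parenthetical questions above, a minimal sketch may help. For a Pearson correlation r computed from n paired observations, the usual test statistic is t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. That test assumes independent observations, which trending annual series do not satisfy. The series below are synthetic and made up purely for illustration (they are not the survey data): they share a linear trend and nothing else. First-differencing is shown as one common way to check whether a correlation survives once the trend is removed.

```python
import numpy as np

def corr_t_stat(x, y):
    """Return (r, t): the Pearson correlation and its t statistic (n - 2 df)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt((n - 2) / (1.0 - r**2))
    return r, t

# Hypothetical data: 15 annual values that share a trend but have independent noise.
rng = np.random.default_rng(0)
years = np.arange(15)
trend = 1000.0 + 100.0 * years
series_a = trend + rng.normal(0.0, 30.0, size=years.size)
series_b = 2.0 * trend + rng.normal(0.0, 30.0, size=years.size)

# Correlation of the raw levels is dominated by the common trend.
r_levels, t_levels = corr_t_stat(series_a, series_b)

# Year-to-year changes remove the linear trend; what is left is only noise here.
r_diff, t_diff = corr_t_stat(np.diff(series_a), np.diff(series_b))

print(f"levels:            r = {r_levels:.3f}, t = {t_levels:.1f}")
print(f"first differences: r = {r_diff:.3f}, t = {t_diff:.1f}")
```

If a correlation is large in levels but disappears in first differences, the common trend is a plausible explanation for it.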

    Using Idea 2 (linear regression) I cannot find anything significant (p-value rarely going lower than 10% and if so then usually for the axis intercept not for the more interesting coefficients.)

    Canonical Correlation would probably suffer from the same problems as Idea 1 (correlation matrix), right?

    Non-linear regression would not be helpful either, since we are dealing with numbers of persons and shares of a population, which should be linear, right?

    Dividing any value of the data sets by the total number of respondents in the specific year - would this eliminate the nonsense correlations from the correlation matrix (Idea 1)? I'm not sure, but probably not completely ...
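The normalization asked about here can be sketched as follows. The counts are invented for illustration. Dividing by the yearly total removes growth in the sample size, but shares of one total sum to 1 in every year, so at least some of them must be negatively correlated by construction. That is one reason the normalization probably does not remove every artifact.

```python
import numpy as np

# Hypothetical counts: rows are categories (e.g. camps), columns are survey years.
counts = np.array([
    [300.0, 330.0, 390.0, 420.0],
    [150.0, 180.0, 195.0, 240.0],
    [ 50.0,  90.0,  65.0, 140.0],
])

def to_shares(counts):
    """Divide every column (year) by its total so each column sums to 1."""
    totals = counts.sum(axis=0, keepdims=True)
    return counts / totals

shares = to_shares(counts)

# Correlation matrices between categories, before and after normalization.
corr_counts = np.corrcoef(counts)
corr_shares = np.corrcoef(shares)

print(np.round(corr_counts, 2))
print(np.round(corr_shares, 2))
```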

    Is there any more elaborate method to get some information out of this dataset?

    Any help is appreciated. Thanks.

    PS: Since this is a rather specific question, it would possibly have been better placed in a statistics forum such as talkstats.com. However, it is not possible for me to register there: either they have disabled registration completely or they don't accept email addresses from outside the US, both of which are of course rather stupid if they want to keep the forum going.
  3. Aug 4, 2010 #2


    Science Advisor
    Homework Helper

    Does the following fairly represent your dataset?

    Year, Camp, Resps, Cntry1, Cntry2, Cntry3, Cntry4, Cntry5
    1996, 0, 10
    1996, 1, 20
    1996, Total, 30, 6, 6, 6, 6, 6

    2010, 0, 15
    2010, 1, 10
    2010, Total, 25, 10, 1, 3, 1, 10
  4. Aug 5, 2010 #3
    Thanks EnumaElish & PatrickI.


    The dataset looks more like this:

                    Period1    Period2    Period3 ...
    Country1          11882      12835    ...
    Country2           4360       4573    ...
    Country3           2963       2778    ...
    Country4          16997      16125    ...
    Country5           4197       4301    ...
    Country6            ...
    CountryTotal        ...
    PCamp1          3233130    3228623    ...
    PCamp2           254100     256788    ...
    PCamp3              ...
    PCampTotal      3736217    3725002    ...

    @PatrickI: will mail you; thanks for taking the effort.
  5. Aug 5, 2010 #4



    Staff: Mentor

    Please do not post your e-mail address in the open forums -- there are issues with spam. Please use the PM system to make side contacts. Thanks.
  6. Aug 5, 2010 #5


    Science Advisor
    Homework Helper

    Through a multinomial logit regression, you can estimate both:

    1. conditional probability of respondent belonging to country C, given his/her political affiliation,

    2. conditional probability of respondent's political affiliation being P, given his/her country,

    provided you have sufficiently many degrees of freedom (time periods, T) to identify the model parameters (coefficients, K). In general, you need T > K.
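To make the identification condition concrete, here is a rough sketch of a simpler stand-in: a linear ecological regression. It is not the multinomial logit model itself. It assumes each country's camp shares are constant over time, so that the camp totals in a period equal the country totals times those shares. The function name `estimate_shares`, the dimensions, and the data are all made up for illustration. With C countries there are C unknown shares per camp, so at least C periods are needed before the shares can be pinned down.

```python
import numpy as np

def estimate_shares(country_counts, camp_counts):
    """Least-squares estimate of shares[c, k] = P(camp k | country c).

    country_counts: (T, C) respondents per period and country
    camp_counts:    (T, K) respondents per period and camp
    Assumes camp_counts ~= country_counts @ shares with constant shares.
    Returns (shares, identified); identified is False when the country
    matrix lacks full column rank, e.g. when T < C.
    """
    country_counts = np.asarray(country_counts, dtype=float)
    camp_counts = np.asarray(camp_counts, dtype=float)
    n_countries = country_counts.shape[1]
    shares, _, rank, _ = np.linalg.lstsq(country_counts, camp_counts, rcond=None)
    return shares, bool(rank == n_countries)

# Hypothetical, noise-free data: 15 periods, 4 countries, 3 camps.
rng = np.random.default_rng(1)
T, C, K = 15, 4, 3
true_shares = rng.dirichlet(np.ones(K), size=C)   # shape (C, K), rows sum to 1
country_counts = rng.integers(1000, 20000, size=(T, C)).astype(float)
camp_counts = country_counts @ true_shares

# Enough periods: the shares are recovered.
est, identified = estimate_shares(country_counts, camp_counts)

# Only 3 periods for 4 countries: the shares are not identified.
_, identified_short = estimate_shares(country_counts[:3], camp_counts[:3])

print(identified, identified_short)
print(np.round(est, 3))
```

With real, noisy counts an unconstrained least-squares fit can return shares below 0 or above 1, so this is only a way to see the counting argument, not a recommended estimator.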
  7. Aug 10, 2010 #6
    Thanks for the advice, EnumaElish, this is a good suggestion.

    I do not have T > K yet (maybe if I just wait some more time ;) ), but I will try anyway; maybe I will get something interesting.

    PatrickI suggested starting with cluster analysis, which is certainly a good idea as well.

    Thanks to you all.