# How to Analyze Correlation Between Variables

1. Jul 30, 2014

### daneault23

1. The problem statement, all variables and given/known data
I'm doing a research project in which a survey on STD contraction rates was given to respondents from 3 different education groups (high school vs. university, etc.). One question on the survey asks how many STDs they have contracted over their lifetime; the possible answers are 0, 1, 2, 3, or 4 or more. Respondents also had to answer questions about their socioeconomic status, parental education level, personal education level, and ethnicity. I am trying to find a correlation between STD contraction rate and those other variables using SPSS, but am unsure how to go about this. Do I use MANOVA, or could I use the Chi-Square Test of Independence? I'm not trying to find differences between the 3 groups; I'm trying to find correlations across all of the data as a whole. Otherwise I believe I could use one-way ANOVA. It's been a while since I've used SPSS.

2. Relevant equations

3. The attempt at a solution

I would enter STI contraction rate as a nominal (categorical) variable, since the number of STDs contracted doesn't necessarily tell you a ranking of that person. However, SES and education levels do have rankings, so I am unsure how to proceed with the analysis. I don't want to compare just two variables; I want to compare them all to see if there's any correlation.

2. Aug 1, 2014

### thelema418

First, I'm confused why you are entering STI rate as a nominal variable instead of a scale. If they report diagnosis of 3 STI's, they have been diagnosed 2 more than a person who was diagnosed once. Perhaps you are implying something about not knowing how long they had the STI's or the fact that people with an STI may not know of their STI. Can you explain why you are making this decision? I just might not see it right.

To use SPSS, go to Analyze > Correlate > Bivariate.

You can enter your data here. Note that the Pearson correlation is for scale to scale correlations. If you have ordinal data, it needs to be examined with the Spearman statistic.
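To make the Pearson vs. Spearman distinction concrete outside of SPSS, here is a minimal pure-Python sketch. It computes Spearman's rho as the Pearson correlation of the ranks, with tied values sharing their average rank. The function names and the data are made up for illustration; this is not SPSS's implementation.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n; tied values get the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Made-up ordinal responses, for illustration only.
partners = [0, 1, 1, 2, 3, 3, 4]
stis = [0, 0, 1, 0, 1, 2, 3]
print("Pearson r:   ", round(pearson_r(partners, stis), 3))
print("Spearman rho:", round(spearman_rho(partners, stis), 3))
```

Because rho only looks at ranks, a relationship that is monotonic but curved still gives rho = 1, while Pearson's r falls below 1.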

Some of your variables are nominal, and this is also bothering me. You might be able to do correlations for each nominal group using the Split File command in SPSS, which is under the Data menu. You would want to split so that the output is grouped by the nominal variable.

Are you trying to answer a research question? Usually it is easier to figure out the test from the question.

3. Aug 1, 2014

### thelema418

Also, if this is a research project, I'm concerned about the variables SES and parental education level. SES is often a composite score built from the parental education level: in nursing sciences the mother's education level is often used for this purpose. In some studies SES is purely the mother's education level.

If the SES composite is built this way, the correlation may show up as significant, but really the variables are the same.

4. Aug 1, 2014

### daneault23

thelema418, thanks for the response. I'll try to answer your concerns. First, the question about the number of STIs had response options of 0, 1, 2, 3, or 4 or more. The "or more" part is what's throwing me off. They could have had 6 or 10, but I would have no idea.

Also, there's no SES variable. I've read over the demographic survey, and the relevant questions ask the person how many jobs they have, how many hours they work, how much money they earn, their highest level of education, and their own marital status, and then their parents' marital status, mother's highest education, father's highest education, and parents' combined yearly income.

I've also met with the person who did the surveying and talked to her about what she really wanted to know from this study, and she has a lot of questions. She wants to see if there's any correlation or relationship between the scores on the knowledge test and level of criminal history, # of sexual partners, # of STIs in lifetime, personal education level, and personal perception of risk of getting an STI, and also between knowledge scores and perception of community risk of getting an STI.

I guess the researcher's goal is to find out what these kids do not know about STI's and how they get them and develop a curriculum to teach them in the areas where they have knowledge deficits.

5. Aug 1, 2014

### thelema418

Okay, then it can be handled as ordinal data. You should also look at the distribution of this data; my guess is that it is non-normal, which would be another violation of the assumptions for Pearson's r correlation. Spearman's rho is a nonparametric statistic. It assumes the relationship is monotonic, either increasing or decreasing. You might be able to use a transform so that a parametric test is appropriate if you really needed to. Depending on the test, you could dummy code it as nominal data.

What is your research question? If you are trying to say something predictive, then you probably want to use a linear regression. What you can do depends on working with the assumptions of the test and the data you have. You can do tricks like dummy coding to get around some issues, but you need to be mindful of sample size.
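As a rough illustration of the dummy coding trick, here is a small Python sketch. The helper name `dummy_code` and the category labels are made up; the idea is just that a nominal variable with k categories becomes k - 1 indicator (0/1) columns, with one category left out as the reference.

```python
def dummy_code(values, reference):
    """Turn a nominal variable into 0/1 indicator columns.

    The reference category gets all zeros, so k categories
    produce k - 1 indicator columns.
    """
    levels = sorted(set(values) - {reference})
    return [{f"is_{level}": int(v == level) for level in levels} for v in values]

# Made-up category labels, for illustration only.
history = ["none", "minor", "major", "none"]
for original, coded in zip(history, dummy_code(history, reference="none")):
    print(original, coded)
```

Each extra category adds another predictor to the regression, which is one reason sample size matters here.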

If you have several ordinal variables that theoretically and ethically can be called measurements of STI risk, you might want to use the reliability tests to find the internal consistency of those items. (Remember to reverse code any negatively worded questions.) If you have a high Cronbach's alpha, you can create a scale score for those items using addition, averaging, or dimension reduction. This should be an interval or ratio scale measurement, but check that it is normally distributed. If you do this, it is important that the items you choose are theoretically and ethically appropriate. Then you might be able to make predictions of risk based on age, etc.
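For reference, Cronbach's alpha for k items is k/(k-1) times (1 minus the sum of the item variances divided by the variance of the summed score). A minimal Python sketch of that formula follows, along with a reverse-coding helper. The function names, the 1 to 5 response range, and the item data are assumptions made up for illustration.

```python
from statistics import variance

def reverse_code(score, low, high):
    """Flip a negatively worded item, e.g. 1..5 becomes 5..1."""
    return low + high - score

def cronbach_alpha(items):
    """items: list of item columns, each a list of respondent scores."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]  # summed score per respondent
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Made-up responses on a 1..5 scale; the third item is negatively worded.
item1 = [1, 2, 4, 5, 3]
item2 = [2, 2, 5, 4, 3]
item3_raw = [5, 4, 1, 2, 3]
item3 = [reverse_code(s, 1, 5) for s in item3_raw]

print("alpha:", round(cronbach_alpha([item1, item2, item3]), 3))
print("scale score (sum):", [sum(row) for row in zip(item1, item2, item3)])
```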

6. Aug 1, 2014

### daneault23

I've just finished entering all of the data. The most STIs anyone has had is 3, so I can make STI a scale variable as you mentioned before. Also, I asked the researcher to really narrow down what she wants to know, and she wants to know the correlations of test scores (and also STI rates) with the aforementioned variables, like level of criminal history, # of sex partners, education, perception of risk, etc. I guess that would be a predictive kind of testing.

In my data, the only ordinal variables I have related to what she wants to know are income, education, mother's education level, parents' education level, parents' income, tobacco use, alcohol use, and STI risk level perception. Age and STIs are both scale; the other variables are nominal.

7. Aug 1, 2014

I suppose personal income could be entered as a scale variable. It has options of 0, below $10,000, 10-20K, 20-30K, etc., up to 70-80K and above 80K. Parents' income has the same options but also has an option for UNKNOWN. For tobacco and alcohol use the options are not evenly spaced, so I would have to keep them as ordinal.

8. Aug 1, 2014

### thelema418

If there is a limitation in the range of values, it may be better to use the ordinal. How you handle "Unknown" values depends on how many there are. If you exclude missing values, you have to report this number in your findings. Honestly, there is a lot of screening that needs to take place before you can do a test or correlation. E.g., if 70% of the income levels are "Unknown," can you even include the variable in your analysis?

To handle missing values, you can declare discrete missing values on the variable in the Variable View screen.

9. Aug 5, 2014

### daneault23

For the Parental Income variable, about 30% have checked Unknown, so I'm not sure yet whether it's meaningful for analysis. I have also added a test performance results variable and see that, out of 35 possible points, there's 1 person that scored a 0. Out of my sample of 87 respondents, this throws off the results a bit. I am not sure whether to disregard that person's score as an outlier or not. The questionnaire was in True, False, and Do Not Know form, and getting the question wrong OR checking Do Not Know results in an incorrect answer. That particular person checked Do Not Know for every single question, begging the question if they even looked at the questions. I cannot know for certain though; possibly they really did not know any of the answers.

Also, when doing the Spearman's rho, do both of the variables being analyzed have to be of the same type?

This is what I have:

- Test Scores - Scale
- Criminal History - Nominal (options are not really in any specific order)
- # of Sexual Partners - Ordinal
- STI's in Lifetime - Scale
- Education - Ordinal
- Personal Perception of STI Risk - Ordinal (0 minimal average high risk)
- Community Perception of STI Risk - Ordinal
- Ethnicity - Nominal

10. Aug 7, 2014

### thelema418

You mean "which raises the question." The phrase "begging the question" refers to a logical fallacy. In common speech, no one will care; in academic prose, you must be careful with that phrase.

Spearman's rho has two assumptions. The first assumption is that the data is continuous or discrete. The second is that the data is monotonic. In SPSS the easiest way to check that the data is monotonic is by building scatterplots using the Graphing feature.

For missing data, you should also make cross-tabs to check the distribution across your nominal groups. The crosstabs are under Analyze in SPSS. Sometimes excluding missing data removes an entire group of interest from your study. Also, you might want to consider \$0 as missing data too (Does \$0 mean the participant's parents are dead?)

From a methodological standpoint, you should check that the measures are reliable and valid by reviewing literature from the National Institute of Health about indicators of SES, etc. Theory really matters with this too. This is data from a point in time, so why doesn't the income of parents in previous years matter? At what age or under what conditions should we stop looking at parents income? Some researchers might question, why does the income of the parents matter as opposed to the income of the community (zip code)?
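The cross-tab check on missing data can be sketched in a few lines of Python, as a stand-in for what the SPSS crosstabs output would show. The field names (`group`, `parent_income`), the group labels, and the respondents are all made up for illustration.

```python
from collections import Counter

def missing_by_group(rows, group_key, value_key, missing_codes):
    """Count missing vs. present values of value_key within each group."""
    counts = Counter()
    for row in rows:
        status = "missing" if row[value_key] in missing_codes else "present"
        counts[(row[group_key], status)] += 1
    return counts

def share_missing(counts, group):
    """Fraction of a group's responses that are missing (None if empty)."""
    missing = counts[(group, "missing")]
    total = missing + counts[(group, "present")]
    return missing / total if total else None

# Made-up respondents, for illustration only.
respondents = [
    {"group": "high school", "parent_income": "Unknown"},
    {"group": "high school", "parent_income": "Unknown"},
    {"group": "high school", "parent_income": "10-20K"},
    {"group": "university", "parent_income": "20-30K"},
    {"group": "university", "parent_income": "Unknown"},
    {"group": "university", "parent_income": "70-80K"},
    {"group": "university", "parent_income": "above 80K"},
]

table = missing_by_group(respondents, "group", "parent_income", {"Unknown"})
for group in ("high school", "university"):
    print(group, table[(group, "missing")], table[(group, "present")],
          share_missing(table, group))
```

If one group is mostly "missing", dropping those cases would effectively drop that group from the analysis, which is the risk described above.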

With the questionnaire, you also have to ask about validity and reliability. Personally, I would not treat "I don't know" as incorrect. The researcher might have a reason for doing that on the basis of theory or previous studies. I would probably score it like a Likert scale, with "I don't know" as 0.5. If a question is skipped, I would enter it as a missing value. The problem I see with marking "I don't know" as "incorrect" is that the statement might be ambiguous to the participant; they may not know what a word means. If you plan on adding the values, you could divide 1 by the number of questions, then make this the value for "I don't know." If you do this, the value on the left of the decimal is the number correct, and the value on the right tells you something about how many "I don't knows" you have (except in the case where someone answers everything as "I don't know").
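Here is a small Python sketch of that last scoring idea, taking the "I don't know" value to be 1 divided by the number of questions (that is my reading of the scheme). The names `score_test` and `decode`, the `"DK"` code, and the answer key are made up for illustration. Exact fractions are used so the two parts can be separated without rounding problems.

```python
from fractions import Fraction

def score_test(answers, key):
    """Correct = 1, incorrect = 0, "DK" (don't know) = 1/n for n questions."""
    n = len(key)
    total = Fraction(0)
    for given, correct in zip(answers, key):
        if given == "DK":
            total += Fraction(1, n)
        elif given == correct:
            total += 1
    return total

def decode(total, n):
    """Split a score into (number correct, number of don't-knows)."""
    correct = total.numerator // total.denominator  # whole-number part
    dont_know = (total - correct) * n               # fractional part times n
    return correct, int(dont_know)

# Made-up 4-question key and responses, for illustration only.
key = ["T", "F", "T", "F"]
mixed = score_test(["T", "DK", "F", "DK"], key)
all_dk = score_test(["DK"] * 4, key)

print(float(mixed), decode(mixed, 4))    # 1 correct, 2 don't-knows
print(float(all_dk), decode(all_dk, 4))  # the exceptional all-"DK" case
```

The second line of output shows the exception: a respondent who answers "I don't know" to everything ends up with a score of exactly 1, which looks the same as one correct answer and no "I don't knows."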