# Multiple variable correlation - drowning in stats formulas, need help

1. Oct 18, 2011

### ralph86

Hello everyone,

My boss has asked me to find if there is any correlation between various sample parameters in a water quality study. We took around 150 samples, and for each one measured around 50 parameters e.g. iron content, free residual chlorine, level of coliforms, pH, etc etc.

I always avoided stats through school, and having spent a couple of hours reading up about my options for doing this analysis on the internet, I remember why.

So, please, in simple terms, can someone explain to me:
What function or statistical analysis technique(s) I could/should use for this analysis?
What (ideally free) software is available to do it?

THANK YOU!
Ralph

2. Oct 19, 2011

### Richard_R

Simple linear regression would be the easiest approach, assuming it is suitable for the data you have. Excel can do this well enough for most purposes. The value you would be looking for is the r^2 (r squared). With simple linear regression this tells you how 'strong' the relationship is between the dependent and independent variables: it is the fraction of the variation in the dependent variable that the fitted line explains, from 0 (none) to 1 (all of it).
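As a rough illustration of what that r^2 value is, here is a minimal sketch in plain Python (standard library only). The function name `simple_linear_fit` and the numbers in the example are made up for illustration and are not taken from the study.

```python
def simple_linear_fit(x, y):
    """Least-squares fit of y = slope * x + intercept, plus r^2.

    x is the independent variable, y the dependent one.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("a constant variable has no defined r^2")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical readings: suspended solids (x) against dissolved oxygen (y)
solids = [5.0, 10.0, 15.0, 20.0, 25.0]
oxygen = [9.1, 8.4, 7.9, 7.0, 6.6]
slope, intercept, r_squared = simple_linear_fit(solids, oxygen)
print(f"slope={slope:.3f} intercept={intercept:.3f} r^2={r_squared:.3f}")
```

An r^2 close to 1 means the points lie close to a straight line; a value near 0 means a straight line explains very little, which does not rule out a non-linear relationship.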

I think you need to explain a bit more what your variables are. I assume you have a dependent variable which is some sort of water quality measure (DO, BOD etc?), then a batch of independent variables such as iron content, suspended solids and so on, and you want to be able to predict the water quality using the independent variables.

This kind of analysis can be quite easy, if you're lucky. For example, an ideal situation for you would be that water quality is largely determined by, say, the suspended solids content of the water, so you only need one variable to predict water quality reasonably accurately. However, real-life situations are rarely that convenient and you may find it difficult to create a simple statistical model from your data.

3. Oct 19, 2011

### ralph86

Hi Richard,

Just to be sure I understand, a dependent variable is one that is purely affected by another one, which you are also measuring?

If this is the case, the situation is a bit more complex: one indicator, e.g. dissolved oxygen, may be the determining (independent?) variable for some indicators, I guess the microbiological ones, but another, e.g. a certain heavy metal, may be an indicator of industrial pollution, and hence correlate with the presence of various industrial contaminants, e.g. endocrine disruptors, soaps, whatever.

What I would like to do is, in some kind of reasonably automated way, find the correlation between every possible pair of parameters, and report it (by comparing R^2 I suppose).
Can you recommend a program or software for doing this?
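One way such an automated pairwise scan could look, as a minimal sketch in plain Python (standard library only). It assumes each parameter is stored as a list of readings with no missing values; the names `pearson_r` and `rank_pairs` and the three short columns are hypothetical. With 50 parameters there are 50 × 49 / 2 = 1225 pairs to compare.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant column has no defined correlation")
    return sxy / sqrt(sxx * syy)


def rank_pairs(columns):
    """Return (name_a, name_b, r, r^2) for every pair, strongest first."""
    results = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson_r(col_a, col_b)
        results.append((name_a, name_b, r, r * r))
    results.sort(key=lambda row: row[3], reverse=True)
    return results


# Made-up readings for three parameters across five samples
samples = {
    "iron": [0.1, 0.2, 0.3, 0.4, 0.5],
    "chlorine": [1.0, 0.8, 0.6, 0.4, 0.2],
    "pH": [7.0, 7.2, 6.9, 7.1, 7.0],
}
ranked = rank_pairs(samples)
for name_a, name_b, r, r2 in ranked:
    print(f"{name_a} vs {name_b}: r={r:+.3f}, r^2={r2:.3f}")
```

Sorting by r^2 puts strong negative relationships next to strong positive ones; the sign of r shows the direction. If the data is already in a spreadsheet, the free pandas library's `DataFrame.corr()` produces the same kind of pairwise Pearson matrix in one call.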

Probably there are interactions between combinations, e.g. coliforms will need a certain pH, be more disposed to the rainy season, cannot survive in high concentrations of lead, etc., but I guess I don't have enough data to look at high numbers of parameters at the same time with only 150 samples, although if you can think of a way I'd love to hear it.

Ralph

4. Oct 19, 2011

### Richard_R

Yes, there will probably be correlations between a large number of your parameters (when the independent variables in a regression are correlated with each other, this is known as multicollinearity).

The best statistics package that I've personally used is SPSS - might be expensive though (I use the one my university pays for so no idea how much a license costs).

To be honest though, given the data you describe and the tasks you need to perform, it sounds as though the statistics you need will be rather complicated and will require a good understanding of how stats works, how to apply the right methods, how to interpret the results, etc.

If this is something you absolutely have to do, your best option might be to subcontract it out to an actual statistician.

5. Oct 19, 2011

### ralph86

Ok Richard, thanks for your help. I was hoping for an easy way of doing this, and that sounds pretty easy. I will ask around if anyone in the department has a good command of stats :)
Ralph