Multiple variable correlation - drowning in stats formulas

ralph86
Hello everyone,

My boss has asked me to find out whether there is any correlation between various sample parameters in a water quality study. We took around 150 samples, and for each one measured around 50 parameters, e.g. iron content, free residual chlorine, level of coliforms, pH, etc.

I always avoided stats through school, and having spent a couple of hours reading up about my options for doing this analysis on the internet, I remember why.

So, please, in simple terms, can someone explain to me:
What function or statistical analysis technique(s) I could/should use for this analysis
What (ideally free) software is available to do it

THANK YOU!
Ralph
 
Simple linear regression would be the easiest approach, assuming it is suitable for the data you have. Excel can do this well enough for most purposes. The value you would be looking for is the r^2 (r squared). With simple linear regression this tells you how 'strong' the relationship is between the dependent and independent variables: it is the fraction of the variation in the dependent variable that the fitted straight line accounts for, from 0 (none) to 1 (all of it).
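
As a rough illustration of the r^2 value described above, here is a minimal sketch in plain Python (free, no extra packages). The function name `simple_linear_regression` and the numbers are made up for the example, not taken from the water quality study, and the code assumes both lists are the same length with no missing values.

```python
def simple_linear_regression(x, y):
    """Fit y = slope * x + intercept by least squares.

    Returns (slope, intercept, r_squared). Hypothetical helper for
    illustration; assumes x is not constant and y is not constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and cross-deviations from the means
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a single predictor, r^2 is the squared Pearson correlation
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up example values: suspended solids vs. a water quality score
suspended_solids = [1.0, 2.0, 3.0, 4.0, 5.0]
quality_score = [9.1, 8.2, 6.8, 6.1, 4.9]

slope, intercept, r2 = simple_linear_regression(suspended_solids, quality_score)
print(f"slope={slope:.3f} intercept={intercept:.3f} r^2={r2:.3f}")
```

An r^2 close to 1 means a straight line describes the relationship well. A low r^2 only says that a straight line fits poorly, so it is worth plotting the data as well.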

I think you need to explain a bit more what your variables are. I assume you have a dependent variable which is some sort of water quality measure (DO, BOD etc?), then a batch of independent variables such as iron content, suspended solids and so on, and you want to be able to predict the water quality using the independent variables.

This kind of analysis can be quite easy, if you're lucky. For example an ideal situation for you would be that water quality is largely determined by, say, the suspended solids content of the water, so you only need one variable to predict water quality reasonably accurately. However most real-life situations are rarely that convenient and you may find it difficult to create a simple statistical model from your data.
 
Hi Richard,

Thanks for your reply.

Just to be sure I understand, a dependent variable is one that is purely affected by another one, which you are also measuring?

If this is the case, the situation is a bit more complex - one indicator, e.g. dissolved oxygen, may be the determining (independent?) variable for some indicators, I guess the microbiological ones, but another, e.g. a certain heavy metal, may be an indicator for industrial pollution, and hence correlate with the presence of various industrial contaminants e.g. endocrine disruptors, soaps, whatever.

What I would like to do is, in some kind of reasonably automated way, find the correlation between every possible pair of parameters, and report it (by comparing R^2 I suppose).
Can you recommend a program or software for doing this?
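
One possible way to automate the pairwise comparison asked about above is sketched below in plain Python. The names `pearson_r` and `pairwise_correlations` and the three short columns are hypothetical, invented for the example; the real data would have around 50 columns of 150 values each. The sketch assumes every column has the same length, no missing values, and is not constant.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Raises ZeroDivisionError if either column is constant
    return sxy / sqrt(sxx * syy)


def pairwise_correlations(columns):
    """Correlate every pair of columns.

    columns: dict mapping parameter name -> list of measurements.
    Returns a list of (name_a, name_b, r, r_squared), strongest first.
    """
    results = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson_r(columns[name_a], columns[name_b])
        results.append((name_a, name_b, r, r * r))
    results.sort(key=lambda row: row[3], reverse=True)
    return results


# Made-up values standing in for a few of the measured parameters
samples = {
    "iron": [0.10, 0.30, 0.20, 0.50, 0.40],
    "ph": [7.2, 6.8, 7.0, 6.4, 6.6],
    "chlorine": [0.5, 0.1, 0.4, 0.3, 0.2],
}

for name_a, name_b, r, r2 in pairwise_correlations(samples):
    print(f"{name_a:10s} {name_b:10s} r={r:+.2f} r^2={r2:.2f}")
```

With 50 parameters there are 50 * 49 / 2 = 1225 pairs, so sorting by r^2 and reading from the top is more practical than scanning a full table. If pandas is installed, `DataFrame.corr()` produces the same kind of correlation matrix in one call.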

Probably there are interactions between combinations, e.g. coliforms will need a certain pH, be more disposed to the rainy season, cannot survive in high concentrations of lead, etc., but I guess I don't have enough data to look at high numbers of parameters at the same time with only 150 samples, although if you can think of a way I'd love to hear it.

Thanks for your help
Ralph
 
Yes, there will probably be correlations between a large number of your parameters (when the predictor variables in a regression are correlated with each other, this is known as multicollinearity).

The best statistics package that I've personally used is SPSS - might be expensive though (I use the one my university pays for so no idea how much a license costs).

To be honest though, given the data you describe and the tasks you need to perform, it sounds as though the statistics you need will be rather complicated and will require a good understanding of how stats works, how to apply the right methods, how to interpret the results etc.

If this is something you absolutely have to do, you might be best to subcontract it to an actual statistician.
 
Ok Richard, thanks for your help. I was hoping for an easy way of doing this, and that sounds pretty easy. I will ask around if anyone in the department is strong in stats :)
Ralph
 