Multiple variable correlation - drowning in stats formulas

AI Thread Summary
The discussion centers on analyzing correlations between various water quality parameters from 150 samples, with the user seeking simple statistical methods and software recommendations. Simple linear regression is suggested as a starting point, with Excel capable of performing this analysis and providing the r² value to assess the strength of relationships. The user expresses the complexity of their data, noting that some parameters may influence others, complicating the analysis. Recommendations for software include SPSS, though its cost may be a concern, and the possibility of hiring a statistician is mentioned for more complicated analyses. The conversation highlights the challenges of multicollinearity and the need for a solid understanding of statistical methods to interpret results effectively.
ralph86
Hello everyone,

My boss has asked me to find if there is any correlation between various sample parameters in a water quality study. We took around 150 samples, and for each one measured around 50 parameters e.g. iron content, free residual chlorine, level of coliforms, pH, etc etc.

I always avoided stats through school, and having spent a couple of hours reading up about my options for doing this analysis on the internet, I remember why.

So, please, in simple terms, can someone explain to me:
What function or statistical analysis technique(s) I could/should use for this analysis
What (ideally free) software is available to do it

THANK YOU!
Ralph
 
Simple linear regression would be the easiest approach, assuming it is suitable for the data you have. Excel can do this well enough for most purposes. The value to look for is r^2 (r squared). With simple linear regression, it tells you how 'strong' the linear relationship is between the dependent and independent variables.
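To make the r^2 idea concrete, here is a minimal sketch of a least-squares fit in plain Python. The function name `linear_fit` and the sample numbers are made up for illustration; they are not from the study.

```python
def linear_fit(x, y):
    """Least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, r_squared). Assumes x and y have the same
    length and that neither is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-predictor fit, r^2 is the squared Pearson correlation
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical values for illustration only
suspended_solids = [1.0, 2.0, 3.0, 4.0, 5.0]
dissolved_oxygen = [9.1, 8.2, 6.9, 6.1, 5.0]
slope, intercept, r2 = linear_fit(suspended_solids, dissolved_oxygen)
print(f"slope={slope:.3f} intercept={intercept:.3f} r^2={r2:.3f}")
```

An r^2 near 1 means a straight line explains most of the variation in the dependent variable; an r^2 near 0 means it explains very little. It says nothing about whether one variable causes the other.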

I think you need to explain a bit more what your variables are. I assume you have a dependent variable which is some sort of water quality measure (DO, BOD etc?), then a batch of independent variables such as iron content, suspended solids and so on, and you want to be able to predict the water quality using the independent variables.

This kind of analysis can be quite easy, if you're lucky. For example an ideal situation for you would be that water quality is largely determined by, say, the suspended solids content of the water, so you only need one variable to predict water quality reasonably accurately. However most real-life situations are rarely that convenient and you may find it difficult to create a simple statistical model from your data.
 
Hi Richard,

Thanks for your reply.

Just to be sure I understand, a dependent variable is one that is purely affected by another one, which you are also measuring?

If this is the case, the situation is a bit more complex - one indicator, e.g. dissolved oxygen, may be the determining (independent?) variable for some indicators, I guess the microbiological ones, but another, e.g. a certain heavy metal, may be an indicator of industrial pollution, and hence correlate with the presence of various industrial contaminants, e.g. endocrine disruptors, soaps, whatever.

What I would like to do is, in some kind of reasonably automated way, find the correlation between every possible pair of parameters, and report it (by comparing R^2 I suppose).
Can you recommend a program or software for doing this?
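One possible way to automate the pairwise comparison, sketched in plain Python under the assumption that the data is already loaded as one list of numbers per parameter. The names `pearson_r` and `rank_pairs` and the tiny data set are hypothetical, for illustration only.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient, or None if either list is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def rank_pairs(columns):
    """Correlate every pair of parameters.

    columns: dict mapping parameter name -> list of values (equal lengths).
    Returns a list of (name_a, name_b, r, r_squared), strongest first.
    Pairs involving a constant parameter are skipped.
    """
    results = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson_r(columns[name_a], columns[name_b])
        if r is not None:
            results.append((name_a, name_b, r, r * r))
    results.sort(key=lambda row: row[3], reverse=True)
    return results


# Hypothetical values for illustration only
data = {
    "iron": [0.1, 0.2, 0.3, 0.4],
    "lead": [0.2, 0.4, 0.6, 0.8],
    "pH": [7.1, 6.9, 7.1, 6.9],
}
for name_a, name_b, r, r2 in rank_pairs(data):
    print(f"{name_a} vs {name_b}: r={r:.3f} r^2={r2:.3f}")
```

With 50 parameters there are 50 * 49 / 2 = 1225 pairs, so some pairs will look correlated purely by chance. The ranked list is best treated as a set of candidates to examine, not as a set of confirmed relationships.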

Probably there are interactions between combinations, e.g. coliforms will need a certain pH, be more disposed to the rainy season, cannot survive in high concentrations of lead, etc., but I guess I don't have enough data to look at high numbers of parameters at the same time with only 150 samples, although if you can think of a way I'd love to hear it.

Thanks for your help
Ralph
 
Yes, there will probably be correlations between a large number of your parameters (when the correlated parameters are used together as predictors in a regression, this is known as multicollinearity).
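A common way to put a number on multicollinearity is the variance inflation factor (VIF). The sketch below covers only the simplest case of exactly two predictors, where the VIF reduces to 1 / (1 - r^2) with r the correlation between the two predictors. The function name and the sample values are hypothetical.

```python
def vif_two_predictors(x1, x2):
    """Variance inflation factor for a regression with exactly two predictors.

    Uses VIF = 1 / (1 - r^2), where r is the Pearson correlation between
    the two predictors. Returns infinity if they are perfectly correlated.
    Assumes equal-length lists and that neither predictor is constant.
    """
    n = len(x1)
    mean_1 = sum(x1) / n
    mean_2 = sum(x2) / n
    s11 = sum((a - mean_1) ** 2 for a in x1)
    s22 = sum((b - mean_2) ** 2 for b in x2)
    s12 = sum((a - mean_1) * (b - mean_2) for a, b in zip(x1, x2))
    r_squared = (s12 * s12) / (s11 * s22)
    if r_squared >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - r_squared)


# Hypothetical values for illustration only
iron = [0.10, 0.22, 0.29, 0.41, 0.50]
manganese = [0.05, 0.09, 0.16, 0.20, 0.26]
print(f"VIF = {vif_two_predictors(iron, manganese):.1f}")
```

A VIF of 1 means the two predictors are uncorrelated. The larger the VIF, the harder it is to separate the individual effect of each predictor. With more than two predictors the calculation needs a multiple regression of each predictor on all the others, which is where a statistics package becomes worthwhile.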

The best statistics package that I've personally used is SPSS - might be expensive though (I use the one my university pays for so no idea how much a license costs).

To be honest though, given the data you describe and the tasks you need to perform, it sounds as though the statistics you need will be rather complicated and will require a good understanding of how stats works, how to apply the right methods, and how to interpret the results.

If this is something you absolutely have to do, you might be best off subcontracting it to an actual statistician.
 
Ok Richard, thanks for your help. I was hoping for an easy way of doing this, and that sounds pretty easy. I will ask around to see if anyone in the department is strong in stats :)
Ralph
 