Multiple variable correlation - drowning in stats formulas


Discussion Overview

The discussion revolves around the analysis of correlations between various parameters in a water quality study involving 150 samples and approximately 50 measured parameters. Participants explore statistical techniques and software options suitable for this analysis.

Discussion Character

  • Technical explanation
  • Exploratory
  • Homework-related

Main Points Raised

  • Ralph seeks guidance on statistical analysis techniques and software for finding correlations between water quality parameters.
  • Some participants suggest simple linear regression as a potential method, noting the importance of understanding dependent and independent variables.
  • Ralph expresses the complexity of the relationships, indicating that some parameters may influence others, complicating the analysis.
  • There is mention of multicollinearity, which could affect the correlation analysis due to the interdependence of multiple parameters.
  • SPSS is recommended as a statistics package, though its cost may be a concern for Ralph.
  • One participant suggests that the analysis may require a deeper understanding of statistics and possibly the assistance of a professional statistician.

Areas of Agreement / Disagreement

Participants do not reach a consensus on the best approach or software for the analysis, and multiple viewpoints regarding the complexity of the task and the appropriate methods remain present.

Contextual Notes

Participants acknowledge limitations such as the potential for multicollinearity and the complexity of interactions between parameters, which may not be fully addressed with the available data.

ralph86
Hello everyone,

My boss has asked me to find if there is any correlation between various sample parameters in a water quality study. We took around 150 samples, and for each one measured around 50 parameters e.g. iron content, free residual chlorine, level of coliforms, pH, etc etc.

I always avoided stats through school, and having spent a couple of hours reading up about my options for doing this analysis on the internet, I remember why.

So, please, in simple terms, can someone explain to me:
  • What function or statistical analysis technique(s) I could/should use for this analysis
  • What (ideally free) software is available to do it

THANK YOU!
Ralph
 
Simple linear regression would be the easiest approach, assuming it is suitable for the data you have. Excel can do this well enough for most purposes. The value you would be looking for is the r^2 (r squared). With simple linear regression this tells you how 'strong' the relationship is between the dependent and independent variables.

I think you need to explain a bit more what your variables are. I assume you have a dependent variable which is some sort of water quality measure (DO, BOD etc?), then a batch of independent variables such as iron content, suspended solids and so on, and you want to be able to predict the water quality using the independent variables.

This kind of analysis can be quite easy, if you're lucky. For example, an ideal situation for you would be that water quality is largely determined by, say, the suspended solids content of the water, so you only need one variable to predict water quality reasonably accurately. However, real-life situations are rarely that convenient and you may find it difficult to create a simple statistical model from your data.
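
As a rough illustration of the r^2 value described above, here is a minimal sketch in plain Python of a least-squares fit for one predictor. The function name `linear_fit` and the numbers are made up for illustration; they are not from the study.

```python
def linear_fit(x, y):
    """Least-squares line y = intercept + slope * x, plus r squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a single predictor, r^2 is the squared Pearson correlation
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Made-up illustrative values, not data from the study
suspended_solids = [5.0, 10.0, 15.0, 20.0, 25.0]
dissolved_oxygen = [9.1, 8.2, 7.4, 6.1, 5.3]

slope, intercept, r2 = linear_fit(suspended_solids, dissolved_oxygen)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, r^2 = {r2:.3f}")
```

An r^2 close to 1 means the straight line explains most of the variation in the dependent variable; a value close to 0 means it explains very little.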
 
Hi Richard,

Thanks for your reply.

Just to be sure I understand, a dependent variable is one that is purely affected by another one, which you are also measuring?

If this is the case, the situation is a bit more complex - one indicator, e.g. dissolved oxygen, may be the determining (independent?) variable for some indicators, I guess the microbiological ones, but another, e.g. a certain heavy metal, may be an indicator for industrial pollution, and hence correlate with the presence of various industrial contaminants e.g. endocrine disruptors, soaps, whatever.

What I would like to do is, in some kind of reasonably automated way, find the correlation between every possible pair of parameters, and report it (by comparing R^2 I suppose).
Can you recommend a program or software for doing this?
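
One way the "every possible pair" idea could be automated is sketched below in plain Python, which is free. The helper names `pearson_r` and `rank_pairs` and the sample values are hypothetical, and the sketch assumes every parameter has a value for every sample (no missing measurements).

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return None  # a constant column has no defined correlation
    return sxy / sqrt(sxx * syy)


def rank_pairs(columns):
    """Return (name_a, name_b, r, r_squared) for every pair, strongest first."""
    results = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson_r(columns[name_a], columns[name_b])
        if r is not None:
            results.append((name_a, name_b, r, r * r))
    return sorted(results, key=lambda row: row[3], reverse=True)


# Made-up illustrative values, not data from the study
samples = {
    "iron": [0.1, 0.3, 0.2, 0.5, 0.4],
    "chlorine": [0.9, 0.6, 0.7, 0.2, 0.4],
    "pH": [7.1, 6.9, 7.4, 7.0, 7.2],
}
for name_a, name_b, r, r2 in rank_pairs(samples):
    print(f"{name_a:>8} vs {name_b:<8} r = {r:+.2f}  r^2 = {r2:.2f}")
```

With 50 parameters this produces 50 × 49 / 2 = 1225 pairs, so some fairly large r^2 values are likely to turn up by chance alone, and Pearson's r only picks up straight-line relationships. If the pandas library is available, `DataFrame.corr()` computes the same pairwise matrix in one call.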

Probably there are interactions between combinations, e.g. coliforms will need a certain pH, be more disposed to the rainy season, cannot survive in high concentrations of lead, etc., but I guess I don't have enough data to look at high numbers of parameters at the same time with only 150 samples, although if you can think of a way I'd love to hear it.

Thanks for your help
Ralph
 
Yes there will probably be correlations between a large number of your parameters (known as multicollinearity).
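
To give a feel for why this matters in a regression, one common measure is the variance inflation factor (VIF). The sketch below covers only the special case of a model with exactly two predictors, where the VIF reduces to 1 / (1 - r^2), with r the correlation between the two predictors. The function name `vif_two_predictors` is made up for illustration.

```python
def vif_two_predictors(r):
    """VIF for a regression with exactly two predictors whose correlation is r.

    Requires abs(r) < 1; at r = +/-1 the predictors are perfectly collinear
    and the VIF is undefined (division by zero).
    """
    return 1.0 / (1.0 - r * r)


# The inflation grows quickly as the two predictors become more correlated
for r in (0.0, 0.5, 0.9, 0.99):
    print(f"r = {r:4.2f}  ->  VIF = {vif_two_predictors(r):6.2f}")
```

A VIF of 1 means the predictors are uncorrelated; the larger it gets, the less precisely a regression can separate the effect of one predictor from the other.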

The best statistics package that I've personally used is SPSS - might be expensive though (I use the one my university pays for so no idea how much a license costs).

To be honest though, given the data you describe and the tasks you need to perform, it sounds as though the statistics you need will be rather complicated and will require a good understanding of how stats works, how to apply the right methods, how to interpret the results etc.

If this is something you absolutely have to do, you might be best off sub-contracting it out to an actual statistician.
 
Ok Richard, thanks for your help. I was hoping for an easy way of doing this, and that sounds pretty easy. I will ask around if anyone in the department has a good command of stats :)
Ralph
 
