How to deal with averaging before calculating statistics

  • Context: Undergrad
  • Thread starter: Monique
  • Tags: Statistics

Discussion Overview

The discussion revolves around the appropriate statistical methods for analyzing a data set comparing two conditions (A and B) measured over multiple time periods. Participants explore the implications of averaging data before conducting statistical tests, particularly focusing on t-tests and ANOVA, and the potential issues arising from dependent versus independent replicates.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant presents a scenario involving measurements of the sun's height across different conditions and time periods, questioning the validity of averaging data before statistical testing.
  • Another participant suggests using ANOVA instead of averaging, implying it may be more appropriate for comparing means across multiple groups.
  • A question is raised regarding the null hypothesis of the test, with a suggestion that it might be about the equality of means for conditions A and B.
  • Hotelling's T² test is proposed by a participant as a suitable method for comparing two random vectors with multivariate normal distributions.
  • One participant summarizes the use of ANOVA for testing hypotheses about multiple random variables and mentions alternative tests like Welch's t-test, depending on data assumptions.
  • The original poster reflects on the feedback received and mentions using a t-test after averaging the data, while also expressing interest in learning more about Hotelling's T² test.

Areas of Agreement / Disagreement

Participants express differing opinions on the appropriateness of averaging data versus using ANOVA or Hotelling's T² test. There is no consensus on the best approach, and the discussion remains unresolved regarding the implications of averaging on statistical significance.

Contextual Notes

Participants note the importance of understanding the assumptions underlying the statistical tests being considered, but specific assumptions related to the data are not clearly defined in the discussion.

Monique
I have the following data set: condition A and condition B, with 4 replicates recorded over 3 time periods.

Hypothetically you can think of it as measuring the height of the sun in the sky in winter (A) compared to the summer (B), in 4 nearby villages (independent observations) over 3 days (the assumption is that the height is stable over the consecutive days).

Since I want to know the true behavior of each replicate, I average the data of the three time periods (to take out any irrelevant fluctuations). This is the key, but it is also a problem (I think).

Now I calculate the average and the SD of condition A and condition B (based on the 3-day average of each replicate) and want to do a statistical test on the height at t=12:00.

My question: what is the best way to deal with the data? If I perform a t-test on the averages, then each condition has 4 independent replicates, but underneath each of those there are in fact 3 dependent replicates.

Am I over-inflating the significance by doing that and how can it be corrected? Any thoughts are welcome :smile:

Example of three days' worth of data (~~~), averaged into one curve (~):
A1 ~~~ = ~
A2 ~~~ = ~
A3 ~~~ = ~
A4 ~~~ = ~

B1 ~~~ = ~
B2 ~~~ = ~
B3 ~~~ = ~
B4 ~~~ = ~
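
To make the concern above concrete, here is a minimal sketch in Python (standard library only). The village readings are invented for illustration and are not data from this thread, and `pooled_t` is a hypothetical helper. It compares the t statistic obtained from the averaged replicates (4 per condition) with the one obtained by wrongly treating all 12 readings per condition as independent.

```python
from statistics import mean, variance
from math import sqrt

# Hypothetical data: 4 villages (replicates) x 3 days, one reading per day at 12:00.
# These numbers are made up for illustration only.
cond_a = [[20.1, 20.4, 19.8], [21.0, 21.3, 20.9], [19.5, 19.9, 19.6], [20.6, 20.2, 20.5]]
cond_b = [[22.0, 22.3, 21.8], [21.2, 21.6, 21.1], [22.9, 22.5, 22.8], [21.9, 22.2, 21.7]]

def pooled_t(x, y):
    """Student's two-sample t statistic (equal variances) and its degrees of freedom."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))
    return t, nx + ny - 2

# Approach 1: collapse the 3 days to one mean per village, then test with n = 4 per group.
means_a = [mean(village) for village in cond_a]
means_b = [mean(village) for village in cond_b]
t_avg, df_avg = pooled_t(means_a, means_b)

# Approach 2 (pseudoreplication): treat all 12 readings per condition as independent.
flat_a = [x for village in cond_a for x in village]
flat_b = [x for village in cond_b for x in village]
t_flat, df_flat = pooled_t(flat_a, flat_b)

print(f"averaged first:      t = {t_avg:.2f}, df = {df_avg}")
print(f"all readings pooled: t = {t_flat:.2f}, df = {df_flat}")
```

With these made-up numbers the pooled version gives a larger |t| on 22 degrees of freedom instead of 6, which is the kind of inflation the question is about. Averaging first keeps the village as the unit of analysis, so the test sees only the 4 independent replicates per condition.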
 
I think it's OK, but instead of averaging, couldn't you use ANOVA?
 
What is the null hypothesis of the test? Is it that the means of variables A and B are equal?

Edit:
If yes, I think that averaging is OK. Also, I think you can use Hotelling's T² test for equality of the means of two random vectors.
 
Hey Monique.

If you are trying to compare means for multiple groups (which is what it sounds like), then I would recommend using an ANOVA: this is what the technique was designed for.

Also, are there any assumptions about your data that you know hold, or that you are unsure about?
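
A side note on the ANOVA suggestion: with only two groups, a one-way ANOVA and the equal-variance two-sample t-test are the same test, since F equals t². The following is a minimal sketch of that identity; the per-replicate averages are invented and the helper names are hypothetical.

```python
from statistics import mean, variance
from math import sqrt

# Hypothetical per-replicate averages (one value per village); made-up numbers.
group_a = [20.1, 21.1, 19.7, 20.4]
group_b = [22.0, 21.3, 22.7, 21.9]

def one_way_anova_f(*groups):
    """F statistic of a one-way ANOVA: between-group over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def student_t(x, y):
    """Student's two-sample t statistic assuming equal variances."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))

f_stat = one_way_anova_f(group_a, group_b)
t_stat = student_t(group_a, group_b)
print(f"F = {f_stat:.4f}, t^2 = {t_stat ** 2:.4f}")
```

So for two conditions the choice between ANOVA and a t-test does not change the result; what matters is which values are treated as the independent observations.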
 
I would summarize it a bit...

1) If one needs to test the hypothesis that several random variables (two or more) with (nearly) normal distributions have the same mean value, one uses ANOVA (or alternatively Welch's t-test or other tests, depending on the assumptions).
2) If one needs to test the hypothesis that two random vectors with multivariate normal distributions have the same mean vector, Hotelling's T² test is well suited.

But it depends on what the test is about, which is not clear from the OP.
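
For point 2, here is a rough sketch of the two-sample Hotelling's T² statistic in plain Python, treating each replicate's three daily readings as one 3-dimensional observation. The data are invented, all function names are hypothetical, and the sketch assumes multivariate normality with a common covariance matrix in both conditions. It uses T² = (n1·n2/(n1+n2)) · d' S⁻¹ d, where d is the difference of mean vectors and S the pooled covariance, and the standard conversion F = (n1+n2−p−1) / (p·(n1+n2−2)) · T².

```python
from statistics import mean

# Hypothetical data: rows are replicates (villages), columns are the 3 days.
cond_a = [[20.1, 20.4, 19.8], [21.0, 21.3, 20.9], [19.5, 19.9, 19.6], [20.6, 20.2, 20.5]]
cond_b = [[22.0, 22.3, 21.8], [21.2, 21.6, 21.1], [22.9, 22.5, 22.8], [21.9, 22.2, 21.7]]

def mean_vector(rows):
    """Column-wise means."""
    return [mean(col) for col in zip(*rows)]

def sscp(rows):
    """Sums of squares and cross-products about the column means."""
    mu = mean_vector(rows)
    p = len(mu)
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) for j in range(p)]
            for i in range(p)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def hotelling_t2(x, y):
    """Two-sample Hotelling's T², its F transform, and the F degrees of freedom."""
    n1, n2, p = len(x), len(y), len(x[0])
    diff = [a - b for a, b in zip(mean_vector(x), mean_vector(y))]
    sx, sy = sscp(x), sscp(y)
    pooled = [[(sx[i][j] + sy[i][j]) / (n1 + n2 - 2) for j in range(p)] for i in range(p)]
    z = solve(pooled, diff)  # pooled^{-1} applied to the mean difference
    t2 = (n1 * n2) / (n1 + n2) * sum(d * zi for d, zi in zip(diff, z))
    f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f, (p, n1 + n2 - p - 1)

t2, f_stat, df = hotelling_t2(cond_a, cond_b)
print(f"T^2 = {t2:.2f}, F = {f_stat:.2f}, df = {df}")
```

The F value would be compared against an F distribution with the printed degrees of freedom. With only 4 replicates per condition and 3 days, the pooled covariance is estimated from very little data, so this is meant as an illustration of the mechanics rather than a recommendation for this data set.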
 
I haven't yet said thanks for the replies, but I did take the comments into account in my evaluation, so: thanks! By formulating the question I had already come up with the answer without immediately realizing it.

I didn't use ANOVA, since the data needs to be plotted in a graph and there are only 2 conditions. For each time point I averaged the repeated measurements (~~~) of each replicate, calculated the area under the curve over the time frame for each independent measurement (1-4), and used a t-test to compare the two experimental populations (A, B).

I hadn't heard of Hotelling's T² before, so I'll educate myself some more on that.
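
The procedure described here can be sketched roughly as follows. This is only an illustration under assumptions: the curves are generated from a made-up formula, the sampling times are assumed, and every function name is hypothetical. Each replicate's daily curves are averaged point by point, the area under the averaged curve is computed with the trapezoidal rule, and the 4 areas per condition go into an equal-variance t statistic.

```python
from statistics import mean, variance
from math import sqrt

hours = [10, 11, 12, 13, 14]  # assumed sampling times

def fake_curve(peak, offset):
    """Hypothetical height curve: a parabola peaking at 12:00, shifted by an offset."""
    return [peak + offset - 0.5 * (h - 12) ** 2 for h in hours]

def make_condition(peak, replicate_offsets, day_offsets=(-0.1, 0.0, 0.1)):
    """Build made-up data: for each replicate, one curve per day."""
    return [[fake_curve(peak, r + d) for d in day_offsets] for r in replicate_offsets]

cond_a = make_condition(20.0, [-0.4, 0.1, 0.5, -0.2])
cond_b = make_condition(22.0, [0.3, -0.5, 0.2, 0.0])

def average_curve(day_curves):
    """Pointwise mean of one replicate's daily curves."""
    return [mean(values) for values in zip(*day_curves)]

def trapezoid_auc(t, y):
    """Area under the curve by the trapezoidal rule."""
    return sum((t[i + 1] - t[i]) * (y[i] + y[i + 1]) / 2 for i in range(len(t) - 1))

def student_t(x, y):
    """Student's two-sample t statistic (equal variances) and degrees of freedom."""
    nx, ny = len(x), len(y)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny)), nx + ny - 2

# One area per independent replicate, so each condition contributes 4 values.
auc_a = [trapezoid_auc(hours, average_curve(rep)) for rep in cond_a]
auc_b = [trapezoid_auc(hours, average_curve(rep)) for rep in cond_b]
t_stat, df = student_t(auc_a, auc_b)
print(f"t = {t_stat:.2f}, df = {df}")
```

Because every replicate is reduced to a single number before the test, the degrees of freedom reflect the 4 independent replicates per condition and not the 12 underlying curves.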
 
