How to deal with averaging before calculating statistics

  1. Apr 12, 2012 #1



    I have the following data set: condition A and condition B, with 4 replicates recorded over 3 time periods.

    Hypothetically you can think of it as measuring the height of the sun in the sky in winter (A) compared to the summer (B), in 4 nearby villages (independent observations) over 3 days (the assumption is that the height is stable over the consecutive days).

    Since I want to know the true behavior of each replicate, I average the data of the three time periods (to take out any irrelevant fluctuations). This is the key, but it is also a problem (I think).

    Now I calculate the average and the SD of condition A and condition B (based on the 3-day average of each replicate) and want to do a statistical test on the height at t=12:00.

    My question: what is the best way to deal with the data? If I perform a t-test on the averages, each condition has 4 independent replicates, but underneath each of those averages there are 3 dependent measurements.

    Am I inflating the significance by doing that, and if so, how can it be corrected? Any thoughts are welcome :smile:

    Example of three days' worth of data, averaged into one
    A1 ~~~ = ~
    A2 ~~~ = ~
    A3 ~~~ = ~
    A4 ~~~ = ~

    B1 ~~~ = ~
    B2 ~~~ = ~
    B3 ~~~ = ~
    B4 ~~~ = ~
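
A minimal sketch of the setup described in this post, assuming Python with NumPy and SciPy. All numbers are made up for illustration and are not the poster's data. It collapses the three dependent days into one mean per replicate and runs a two-sample t-test on those 4 values per condition. For contrast, it also shows the test that treats all 12 readings per condition as independent, which is the step that would overstate the sample size.

```python
import numpy as np
from scipy import stats

# Hypothetical data: rows = 4 independent replicates, columns = 3 consecutive days.
# Every value here is invented for illustration only.
a = np.array([
    [20.1, 20.4, 19.8],
    [21.0, 21.3, 20.9],
    [19.5, 19.9, 19.6],
    [20.7, 20.5, 20.9],
])
b = np.array([
    [23.2, 23.0, 23.5],
    [24.1, 23.8, 24.0],
    [22.6, 22.9, 22.7],
    [23.9, 24.2, 23.8],
])

# Collapse the dependent days into one value per independent replicate.
a_mean = a.mean(axis=1)
b_mean = b.mean(axis=1)

# t-test on replicate means: n = 4 per condition, 4 + 4 - 2 = 6 degrees of freedom.
t_avg, p_avg = stats.ttest_ind(a_mean, b_mean)

# For contrast only: pooling all 12 readings per condition as if independent
# claims 22 degrees of freedom that the design does not have.
t_pooled, p_pooled = stats.ttest_ind(a.ravel(), b.ravel())

print(f"replicate means: t = {t_avg:.2f}, p = {p_avg:.4g}")
print(f"pooled readings: t = {t_pooled:.2f}, p = {p_pooled:.4g}")
```

With these invented numbers the pooled test reports the larger |t| and the smaller p-value. Averaging first keeps the sample size at the number of independent replicates, so the t-test on the averages does not count the repeated days as extra observations.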
  3. Apr 12, 2012 #2



    I think it's OK, but instead of averaging, couldn't you use ANOVA?
  4. Apr 12, 2012 #3
    What is the null hypothesis of the test? Is it that the means of variables A and B are equal?

    If yes, I think that averaging is OK. Also, I think you can use Hotelling's T² test for the equality of the mean vectors of two random vectors.
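
As a rough sketch of the Hotelling's T² suggestion, assuming Python with NumPy and SciPy: each replicate contributes one vector of three daily readings, and the two-sample test compares the mean vectors of A and B under an assumed common covariance matrix. The function name `hotelling_t2` and the data are hypothetical, written for this illustration; SciPy does not provide this test directly, so the statistic is computed by hand and converted to an F statistic.

```python
import numpy as np
from scipy import stats

def hotelling_t2(x, y):
    """Two-sample Hotelling's T^2 test, assuming a common covariance matrix.

    x, y: arrays of shape (n_samples, p). Returns (T2, F, p_value).
    Requires n1 + n2 - 2 >= p so the pooled covariance is invertible.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n1, p = x.shape
    n2 = y.shape[0]
    diff = x.mean(axis=0) - y.mean(axis=0)
    # Pooled sample covariance (atleast_2d handles the p = 1 case).
    s_x = np.atleast_2d(np.cov(x, rowvar=False))
    s_y = np.atleast_2d(np.cov(y, rowvar=False))
    s_pooled = ((n1 - 1) * s_x + (n2 - 1) * s_y) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * (diff @ np.linalg.solve(s_pooled, diff))
    # Under the null, this scaled T^2 follows F(p, n1 + n2 - p - 1).
    dof2 = n1 + n2 - p - 1
    f_stat = dof2 / (p * (n1 + n2 - 2)) * t2
    p_value = stats.f.sf(f_stat, p, dof2)
    return float(t2), float(f_stat), float(p_value)

# Hypothetical data: 4 replicates per condition, 3 daily readings each (made up).
x_a = np.array([[20.1, 20.4, 19.8], [21.0, 21.3, 20.9],
                [19.5, 19.9, 19.6], [20.7, 20.5, 20.9]])
x_b = np.array([[23.2, 23.0, 23.5], [24.1, 23.8, 24.0],
                [22.6, 22.9, 22.7], [23.9, 24.2, 23.8]])

t2, f_stat, p_value = hotelling_t2(x_a, x_b)
print(f"T2 = {t2:.2f}, F = {f_stat:.2f}, p = {p_value:.4g}")
```

With only 4 replicates per condition and 3 dimensions, the F statistic here has just 4 denominator degrees of freedom, so the test has little power in a design this small.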
    Last edited: Apr 12, 2012
  5. Apr 12, 2012 #4



    Hey Monique.

    If you are trying to compare means across multiple groups (which is what it sounds like), I would recommend an ANOVA: that is what the technique was designed for.

    Also, are there any assumptions for your data that you either know or don't know?
  6. Apr 13, 2012 #5
    I would summarize it a bit...

    1) If one needs to test the hypothesis that several random variables (two or more) with a (nearly) normal distribution have the same mean value, one uses ANOVA (or, for two groups, Welch's t-test or other tests, depending on the assumptions).
    2) If one needs to test the hypothesis that two random vectors with a multivariate normal distribution have the same mean vector, Hotelling's T² test is well suited.

    But it depends on what the test is about, which is not clear from the OP.
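
To illustrate point 1 for the two-condition case, here is a small sketch assuming Python with SciPy, with made-up replicate means. With exactly two groups, one-way ANOVA and the pooled-variance (Student) t-test are the same test, since F equals t squared. Welch's t-test drops the equal-variance assumption.

```python
import numpy as np
from scipy import stats

# Hypothetical per-replicate values (one number per independent replicate).
# These are invented for illustration only.
a = np.array([20.1, 21.1, 19.7, 20.7])
b = np.array([23.2, 24.0, 22.7, 24.0])

# One-way ANOVA across the two conditions.
f_stat, p_anova = stats.f_oneway(a, b)

# Student's t-test (assumes equal variances in both groups).
t_student, p_student = stats.ttest_ind(a, b)

# Welch's t-test (does not assume equal variances).
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)

print(f"ANOVA:   F = {f_stat:.3f}, p = {p_anova:.4g}")
print(f"Student: t^2 = {t_student**2:.3f}, p = {p_student:.4g}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.4g}")
```

ANOVA only adds something over the t-test once there are three or more groups, or extra factors such as day in a repeated-measures design.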
  7. May 29, 2012 #6



    I hadn't thanked you yet for the replies, but I did take the comments into account in my evaluation, so: thanks! By formulating the question I had already come up with the answer without immediately realizing it.

    I didn't use ANOVA, since the data needs to be plotted in a graph and there are only 2 conditions. For each time point I averaged the replicate measurements (~~~), calculated the area under the curve over time frames for each independent measurement (1-4), and used a t-test to compare the two experimental populations (A, B).
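
The procedure described above might look roughly like the following sketch, assuming Python with NumPy and SciPy. The time grid, the curves, and the helper `auc_trapezoid` are hypothetical stand-ins, and the trapezoidal rule is an assumption, since the post does not say how the area was computed. Each independent replicate is reduced to a single area-under-the-curve value, and the t-test then compares 4 values per condition.

```python
import numpy as np
from scipy import stats

def auc_trapezoid(t, y):
    """Area under y(t) by the trapezoidal rule, computed along the last axis."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((y[..., 1:] + y[..., :-1]) / 2.0 * np.diff(t), axis=-1)

# Hypothetical time points (hours) within the chosen time frame.
t = np.array([10.0, 11.0, 12.0, 13.0, 14.0])

# Hypothetical day-averaged curves: one row per independent replicate (1-4).
# Every value here is invented for illustration only.
a_curves = np.array([
    [15.0, 18.5, 20.1, 18.7, 15.2],
    [15.8, 19.4, 21.1, 19.5, 16.0],
    [14.6, 18.0, 19.7, 18.1, 14.9],
    [15.5, 19.0, 20.7, 19.2, 15.6],
])
b_curves = np.array([
    [18.0, 21.5, 23.2, 21.6, 18.3],
    [18.9, 22.3, 24.0, 22.4, 19.0],
    [17.6, 21.0, 22.7, 21.2, 17.8],
    [18.8, 22.2, 24.0, 22.3, 18.9],
])

# One summary number per independent replicate.
auc_a = auc_trapezoid(t, a_curves)
auc_b = auc_trapezoid(t, b_curves)

# Two-sample t-test on the 4 + 4 summary values.
t_stat, p_value = stats.ttest_ind(auc_a, auc_b)
print(f"AUC A = {auc_a}, AUC B = {auc_b}")
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```

Reducing each replicate to one summary value before testing keeps the sample size equal to the number of independent replicates.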

    I hadn't heard of Hotelling's T² before, so I'll educate myself some more on that.