Comparing two datasets: methods and statistics

  • Thread starter FrankDrebon
  • Tags
    Statistics
In summary, the thread discusses how to report variability when comparing two sets of measurements. The original poster asks whether the change in a property between two states of a system should be calculated as the "change in the averages" or the "average of the changes", and which sample size to use when computing a standard error for confidence intervals. Replies suggest a Student's t-test for unequal sample sizes and unequal variances, point out that individual values in the two sets should be paired only if there is a real connection between them, and raise the possibility of modelling the data as Lognormal.
  • #1
FrankDrebon
Hi all,

Just looking for some opinions on how to approach reporting the variability in some data I have acquired. I know Rutherford is quoted as saying “if your experiment needs statistics, you ought to have done a better experiment”, but unfortunately in biophysics we’re always at the mercy of the variability of living “things”!

Basically, I have a set of measurements ‘A’ and a set of measurements ‘B’. Both measure the same property of a system, but the state of the system is slightly different in the two sets. What I want to calculate is the change in this property between the two states of the system.

I could do this by taking the mean of set A and the mean of set B and dividing one by the other. However, this gives me the “change in the averages”. What I (think I) want is the “average of the changes”, so I divide each B value by each A value and take the average of those comparisons.

As an example, suppose I measured this property over and over with the system in state A and got the results 7, 8, 8, 7, 5. Then I measured it in state B and got 10, 7, 9, 9, 8. The average value in state A is 7.0, the average value in state B is 8.6. State B obviously has a larger "property" than state A.

To calculate the average change between the sets, I’d divide each result in ‘B’ by each result in ‘A’ (25 comparisons) and take the average, in this case 1.265. Simply dividing 8.6 by 7 gives 1.229.
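The two numbers quoted above can be reproduced with a short Python sketch. The variable names are illustrative only, and the last calculation rests on an observation not made in the post: averaging over all 25 unpaired ratios is the same as multiplying the mean of B by the mean of 1/A.

```python
from itertools import product
from statistics import mean

a = [7, 8, 8, 7, 5]    # state A measurements from the post
b = [10, 7, 9, 9, 8]   # state B measurements from the post

# "Change in the averages": a single ratio of the two sample means.
ratio_of_means = mean(b) / mean(a)

# "Average of the changes": mean of all 25 unpaired ratios b_j / a_i.
pairwise = [y / x for x, y in product(a, b)]
mean_of_ratios = mean(pairwise)

# The double average factorises into mean(B) * mean(1/A), so the
# 25 ratios are built entirely from the 10 original measurements.
factorised = mean(b) * mean(1 / x for x in a)

print(round(ratio_of_means, 3))   # 1.229
print(round(mean_of_ratios, 3))   # 1.265
print(abs(factorised - mean_of_ratios) < 1e-12)
```

Because mean(1/A) is never smaller than 1/mean(A) for positive values, the average of all pairwise ratios will always come out at or above the ratio of the means, which is why 1.265 exceeds 1.229 here.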

I then wish to provide a standard error to calculate confidence intervals. However, this requires the use of the sample size N. There are 10 measurements in two lots of 5 samples, and 25 comparisons. I can't decide which to use as the sample size! Thoughts?

I appreciate that there is probably no definite answer here, but your opinions would be appreciated. Also, if you think my “averages of the changes” method is stupid then please say so. Perhaps I could just calculate the “change in the averages” and calculate an error based on the standard deviations of the individual data sets? The two sides of my brain have been arguing which is the best way to analyse this data for weeks, and they can’t come to a conclusion...!
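One way to put an error on the “change in the averages” from the two standard deviations is first-order error propagation for a ratio. The sketch below is only that: it assumes the two sets are independent and that each mean rests on N = 5 measurements, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

a = [7, 8, 8, 7, 5]    # state A measurements from the post
b = [10, 7, 9, 9, 8]   # state B measurements from the post

r = mean(b) / mean(a)  # ratio of the two sample means

# Squared relative standard error of each mean: s^2 / (n * mean^2).
rel_var_a = stdev(a) ** 2 / (len(a) * mean(a) ** 2)
rel_var_b = stdev(b) ** 2 / (len(b) * mean(b) ** 2)

# First-order propagation for a ratio of independent quantities:
# relative variances add.
se_r = r * sqrt(rel_var_a + rel_var_b)

print(round(r, 3), round(se_r, 3))
```

With only five values per set the approximation is rough, so any confidence interval built from `se_r` should be read as indicative rather than exact.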
 
  • #3
There is no justification in comparing individual values in A to corresponding values in B if there is no particular connection between them; for instance, if the order of the values in A and B doesn't matter.

That is to say: Is there some connection between, say, the first value in A and the first value in B, that does not exist between the first value in A and the second value in B? If not, you need to treat all pairs (value in A, value in B) equally.
 
  • #4
Good point, the average ratio is similar to a matched-pairs test, which may or may not be justified. On a related note, is there any reason to prefer the ratio A/B to the difference A-B (either pairwise, or between the two averages)? For example, if you believe that each of A and B is Lognormal, then Log(A/B) = Log A - Log B would be Normal, and you could test whether its mean is equal to zero. Is that the case with your data?
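The log-scale comparison could be sketched as below, using the numbers from the first post and the B/A direction used there. Treating the two sets as unpaired with unequal variances (a Welch-type statistic) is an assumption, as are the variable names; the p-value is left out because the standard library has no t distribution, so `t` would need to be looked up against `df` degrees of freedom.

```python
from math import log, sqrt
from statistics import mean, variance

a = [7, 8, 8, 7, 5]    # state A measurements from the first post
b = [10, 7, 9, 9, 8]   # state B measurements from the first post

log_a = [log(x) for x in a]
log_b = [log(x) for x in b]

# Squared standard error of each log-mean.
va = variance(log_a) / len(log_a)
vb = variance(log_b) / len(log_b)

# Estimate of mean(log B) - mean(log A), i.e. the log of the
# ratio of geometric means.
diff = mean(log_b) - mean(log_a)

# Welch t statistic for "the mean log-ratio is zero".
t = diff / sqrt(va + vb)

# Welch-Satterthwaite approximate degrees of freedom.
df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))

print(t, df)
```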
 
  • #5



I understand the challenges of dealing with variability in data, especially when working with living systems. It is important to approach the analysis of your data with a clear and objective mindset. In this case, it seems like you are trying to compare two datasets (A and B) and determine the change in a specific property between the two states of the system.

Your approach of calculating the average of the changes between the two sets (dividing each value in B by each value in A and taking the average) is one option, but the standard error and confidence intervals need care. The 25 comparisons are not independent observations: they are all built from the same 10 measurements, so using N = 25 as the sample size would understate the uncertainty. It is safer to base the error estimate on the five measurements in each set.

Additionally, it may be helpful to also calculate the standard deviation of each dataset and use that in your analysis. This can provide a measure of the variability within each dataset and can help to determine if the observed change in the property is statistically significant.

Ultimately, the best approach will depend on the specific goals of your study and the nature of your data. It may be helpful to consult with a statistician or seek out additional literature on similar studies to see how they have approached similar analyses. Overall, it is important to carefully consider the methods and statistics used in your analysis to ensure that your results are accurate and meaningful.
 

1. What are the different methods used to compare two datasets?

There are several methods used to compare two datasets, including statistical tests such as t-tests and ANOVA, correlation analysis, and visual techniques such as scatter plots and box plots.

2. How do I choose the most appropriate method for comparing my datasets?

The choice of method depends on the type of data you have and the research question you are trying to answer. For example, to compare the means of two groups of continuous measurements, a t-test may be appropriate; correlation analysis applies when the measurements are paired and you want to know how strongly they vary together; and for two categorical variables, a chi-square test may be more suitable.

3. What is the difference between parametric and non-parametric methods?

Parametric methods assume that the data follow a specific distribution family (often the normal distribution), while non-parametric methods make much weaker assumptions about the shape of the distribution. Parametric methods based on means are typically more sensitive to outliers, but when their assumptions hold they usually need fewer observations than non-parametric methods to detect the same effect.

4. How do I interpret the results of a statistical test for comparing two datasets?

A statistical test yields a p-value: the probability of obtaining results at least as extreme as those observed if there were no real difference between the two datasets. A p-value below 0.05 is conventionally treated as statistically significant, meaning such results would be unlikely to arise by chance alone. It is not the probability that there is no difference.
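To make the definition concrete, here is a minimal sketch of an exact permutation test on the A and B measurements quoted in the first post. The choice of test and the variable names are illustrative assumptions, not something proposed in the thread. The p-value is simply the fraction of all relabellings of the pooled data whose difference in means is at least as extreme as the observed one.

```python
from itertools import combinations
from statistics import mean

a = [7, 8, 8, 7, 5]    # state A measurements from the first post
b = [10, 7, 9, 9, 8]   # state B measurements from the first post

pooled = a + b
observed = mean(b) - mean(a)

n_perm = 0
n_extreme = 0
indices = range(len(pooled))

# Enumerate every way of labelling 5 of the 10 pooled values as "B".
for chosen in combinations(indices, len(b)):
    group_b = [pooled[i] for i in chosen]
    group_a = [pooled[i] for i in indices if i not in chosen]
    diff = mean(group_b) - mean(group_a)
    n_perm += 1
    # Small tolerance so floating-point ties count as "at least as extreme".
    if abs(diff) >= abs(observed) - 1e-12:
        n_extreme += 1

# Two-sided p-value under the hypothesis that the labels do not matter.
p_value = n_extreme / n_perm
print(n_perm, p_value)
```

With five values per group there are only 252 relabellings, so the smallest p-value this test can report is 1/252.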

5. Can visual techniques alone be used to compare two datasets?

No, visual techniques should be used in conjunction with statistical methods to compare two datasets. While visual techniques can provide a quick overview of the data, statistical tests are necessary to determine if any differences between the two datasets are statistically significant.
