How to compare two data sets with statistics?

Discussion Overview

The discussion revolves around comparing two data sets, specifically a measured solar spectrum and a reference spectrum (ASTMG173), using statistical methods to assess their accuracy and fit. Participants explore the appropriateness of different statistical tests for this comparison, including the coefficient of determination (r²) and chi-squared tests.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants question the meaning of "what the spectrum should be," suggesting that there may be no clear distinction between the measured data and the model predictions.
  • There is a proposal to use the coefficient of determination (r²) to compare the measured spectrum with the reference spectrum, with specific formulas suggested for calculation.
  • Others argue that the roles of the data sets might be reversed in the proposed r² calculation.
  • Some participants suggest considering other goodness of fit tests, such as chi-squared, and inquire about the rationale behind choosing the coefficient of determination.
  • One participant expresses a lack of familiarity with alternative tests and seeks guidance on which statistical method would be most appropriate for their situation.
  • A later reply indicates that using the coefficient of determination is not immediately ruled out, but emphasizes the nature of the comparison being made.
  • Participants reference external resources on inverse problems and goodness of fit rationales, indicating a broader context for the discussion.

Areas of Agreement / Disagreement

Participants express differing views on the appropriateness of using the coefficient of determination versus other statistical tests, such as chi-squared. The discussion remains unresolved regarding the best method for comparing the two data sets.

Contextual Notes

Participants highlight the need for a clear definition of the problem to determine the appropriate statistical approach. There is an acknowledgment of the complexity involved in comparing measured data against a reference standard.

elegysix
I have two questions:

I have a set of data, a measured spectrum. When I model the spectrum with a function, I calculate [itex]r^{2} = 1 - \frac{\sum(y - y_{model})^{2}}{\sum(y - y_{avg})^{2}}[/itex].

Q1) However, I now have reference data, which is what the spectrum should be. Is it right to use the same calculation for [itex]r^{2}[/itex], but with [itex]y_{reference}[/itex] in place of [itex]y_{model}[/itex]?

Q2) The model function I was fitting to the data is
[itex]S_{\lambda} = \frac{2\pi h c^{2}}{\lambda^{5}\left(e^{hc/\lambda kT} - 1\right)}[/itex]
Is it correct to calculate goodness of fit in that way for such a distribution?
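
A minimal sketch of the calculation described in this post, assuming SI units and NumPy. The function names `planck_exitance` and `r_squared`, the 5800 K temperature, and the synthetic "measured" data are illustrative assumptions, not values from the thread.

```python
import numpy as np

# Physical constants (SI units)
H = 6.62607015e-34   # Planck constant, J s
C = 2.99792458e8     # speed of light, m/s
K_B = 1.380649e-23   # Boltzmann constant, J/K


def planck_exitance(wavelength_m, temperature_k):
    """Blackbody spectral exitance S_lambda = 2*pi*h*c^2 / (lambda^5 (e^x - 1))."""
    lam = np.asarray(wavelength_m, dtype=float)
    exponent = H * C / (lam * K_B * temperature_k)
    # expm1(x) computes e^x - 1 accurately, also for small x
    return 2.0 * np.pi * H * C**2 / (lam**5 * np.expm1(exponent))


def r_squared(y, y_model):
    """Coefficient of determination: 1 - SS_res / SS_tot about the mean of y."""
    y = np.asarray(y, dtype=float)
    y_model = np.asarray(y_model, dtype=float)
    ss_res = np.sum((y - y_model) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


# Hypothetical example: synthetic data scattered 2% around a 5800 K blackbody
wavelengths = np.linspace(300e-9, 2500e-9, 200)
rng = np.random.default_rng(0)
y_true = planck_exitance(wavelengths, 5800.0)
y_measured = y_true * (1.0 + 0.02 * rng.standard_normal(wavelengths.size))
print(r_squared(y_measured, y_true))
```

Note that this r² weights every wavelength equally and ignores measurement uncertainty, so it describes overall agreement in shape rather than statistical consistency.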


Here is a plot of my two data sets

[Attached image: unnamed.jpg]


thanks!
 
Q1> What does "what the spectrum should be" mean?
There is what the spectrum is and what the model predicts - surely it "should be" whatever it actually is.

Q2> To decide what to do you need, first, to define the problem.
What is it you are trying to find out?

If you want to see if the model is a good fit to the data, then a goodness-of-fit test is probably warranted.
Make sure that the approach you use answers the questions you are asking.

What I am reading above is that you have not asked a clear enough question to know how to proceed.

Suspect you may need these:
http://home.comcast.net/~szemengtan/
... "Inverse Problems" towards the bottom of the page.

Those data plots are seriously cool btw.
 
Thanks, sorry for being unclear.
Forget that I mentioned a "model"

"what the spectrum should be" is the ASTMG173.
We captured the solar spectrum and want to compare it with a reference spectrum (the ASTMG173) to show that our measurements are accurate.

the question is - how can I properly use statistics to say how well these two data sets match?

Is it appropriate to use this calculation: [itex]r^{2} = 1 - \frac{\sum(y_{r} - y_{s})^{2} }{\sum(y_{r} - \bar{y_{r}})^{2} }[/itex]

where [itex]y_{r}[/itex] is the reference y data, and [itex]y_{s}[/itex] is our measured y data, and [itex]\bar{y_{r}}[/itex] is the mean of the reference y data.
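
A rough sketch of the formula above in Python with NumPy. It assumes the two spectra are sampled on different wavelength grids, so the measured data are linearly interpolated onto the reference grid first; that resampling step, the function name `r_squared_vs_reference`, and the synthetic arrays are assumptions for illustration and are not ASTM G173 data.

```python
import numpy as np


def r_squared_vs_reference(wl_ref, y_ref, wl_meas, y_meas):
    """r^2 = 1 - sum((y_r - y_s)^2) / sum((y_r - mean(y_r))^2).

    wl_meas must be increasing (required by np.interp). The measured
    spectrum y_s is interpolated onto the reference wavelengths.
    """
    wl_ref = np.asarray(wl_ref, dtype=float)
    y_ref = np.asarray(y_ref, dtype=float)
    # Keep only reference points inside the measured wavelength range,
    # so that np.interp never extrapolates
    inside = (wl_ref >= np.min(wl_meas)) & (wl_ref <= np.max(wl_meas))
    y_r = y_ref[inside]
    y_s = np.interp(wl_ref[inside], wl_meas, y_meas)
    ss_res = np.sum((y_r - y_s) ** 2)
    ss_tot = np.sum((y_r - np.mean(y_r)) ** 2)
    return 1.0 - ss_res / ss_tot


# Hypothetical smooth spectra on two different grids (wavelengths in nm)
wl_ref = np.linspace(300.0, 2500.0, 441)
y_ref = np.exp(-((wl_ref - 600.0) / 500.0) ** 2)
wl_meas = np.linspace(350.0, 2400.0, 300)
y_meas = 0.98 * np.exp(-((wl_meas - 600.0) / 500.0) ** 2)
print(r_squared_vs_reference(wl_ref, y_ref, wl_meas, y_meas))
```

Because nothing is fitted here, this quantity is not bounded below by zero: a measured spectrum that is far from the reference gives a negative value.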

thanks
 
So you are testing the measuring method, to show that it is sound?

You want to use the coefficient of determination test?
I think you have the roles of the data-sets reversed.

There are other goodness of fit tests - e.g. chi-squared - what led you to choose this one?
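
Why the roles matter can be sketched numerically: the denominator of r² is taken about the mean of whichever data set plays the "data" role, so swapping the two sets changes the result. The numbers below are made up for illustration, and the sketch does not decide which assignment is the right one.

```python
import numpy as np


def r_squared(y_data, y_pred):
    """1 - SS_res / SS_tot, with SS_tot taken about the mean of y_data."""
    y_data = np.asarray(y_data, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_data - y_pred) ** 2)
    ss_tot = np.sum((y_data - np.mean(y_data)) ** 2)
    return 1.0 - ss_res / ss_tot


# Hypothetical numbers, not real spectra
y_reference = np.array([1.0, 2.0, 3.0, 4.0])
y_measured = np.array([1.5, 2.5, 3.0, 5.0])

# Same residuals in both cases, different normalisation
r2_reference_as_data = r_squared(y_reference, y_measured)
r2_measured_as_data = r_squared(y_measured, y_reference)
print(r2_reference_as_data, r2_measured_as_data)
```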
 
Simon Bridge said:
So you are testing the measuring method, to show that it is sound?
yes.

Simon Bridge said:
You want to use the coefficient of determination test?
Not necessarily. I want to use whatever test is appropriate for this.


Simon Bridge said:
There are other goodness of fit tests - e.g. chi-squared - what led you to choose this one?
I am not familiar with the others, that is why I made this thread. Which test should I use? what would you use?

thanks
 
I see ... I cannot see anything immediately ruling out a CoD test.
I would use Chi-squared... but that's me.

Really you are comparing two data-sets and asking if they are close enough to come from the same forward function rather than checking a data set against a theoretical model of a forward function.
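
One way a chi-squared comparison of two data sets might look, as a sketch only. It assumes both spectra are already sampled at the same wavelengths and that a per-point standard uncertainty is known for the measurement (and optionally for the reference); neither assumption is stated in the thread, and the function name `chi_squared` and the synthetic data are hypothetical.

```python
import numpy as np


def chi_squared(y_meas, y_ref, sigma_meas, sigma_ref=0.0):
    """Chi-squared between two data sets sampled at the same points.

    Per-point variances add; sigma_ref=0 treats the reference as exact.
    Returns (chi2, reduced_chi2). No parameters are fitted, so the
    number of degrees of freedom is taken as the number of points N.
    """
    y_meas = np.asarray(y_meas, dtype=float)
    y_ref = np.asarray(y_ref, dtype=float)
    variance = (np.asarray(sigma_meas, dtype=float) ** 2
                + np.asarray(sigma_ref, dtype=float) ** 2)
    chi2 = np.sum((y_meas - y_ref) ** 2 / variance)
    return chi2, chi2 / y_meas.size


# Hypothetical example: measurement = reference + Gaussian noise of known size
rng = np.random.default_rng(1)
y_ref = np.linspace(1.0, 2.0, 500)
sigma = 0.05
y_meas = y_ref + sigma * rng.standard_normal(y_ref.size)
chi2, reduced = chi_squared(y_meas, y_ref, sigma)
print(chi2, reduced)
```

A reduced chi-squared near 1 suggests the differences are about as large as the stated uncertainties predict; values much larger than 1 suggest a real discrepancy or underestimated uncertainties.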

The inverse problems papers I linked you to (post #2) give a lot of detail on different rationales for goodness of fit in different circumstances.
 
