Statistics: How to assess the resemblance of two curves?

In summary, the conversation discusses the use of a simple and common method, the sum of squared errors, to measure the goodness of fit between two series on a graph. It is noted that this method produces a value that is dependent on the magnitude of the points on the curve, and alternative methods, such as using a weighted sum, are suggested for giving more consideration to certain values. The coefficient of determination, also known as R-squared, is mentioned as a commonly used measure of fit in simple regressions. The concept of "spread" in the observed data is also mentioned, and the use of the correl function in a spreadsheet is suggested as a way to manually calculate the goodness of fit.
  • #1
24forChromium
There are two series on a graph: series A is the prediction of a value over time, and series B is a curve of observed values over time. How can one quantify how closely series A resembles series B?
 
  • #3
andrewkirk said:
The simplest and most usual method of measuring goodness of fit is the sum of squared errors.

From what I have seen, this technique produces a value that depends on the magnitude of the points on the curve. For example, at time = 10 s, if the measured value is 10 and the predicted value is 9, the squared residual for that time is 1; but if the measured value is 1000 and the predicted value is 998, the squared residual is 4. How can I transform this into a form in which the squared residual, or whatever value is reported, takes the magnitude of the measured variable into account?
 
  • #4
24forChromium said:
From what I have seen, this technique produces a value that depends on the magnitude of the points on the curve. For example, at time = 10 s, if the measured value is 10 and the predicted value is 9, the squared residual for that time is 1; but if the measured value is 1000 and the predicted value is 998, the squared residual is 4. How can I transform this into a form in which the squared residual, or whatever value is reported, takes the magnitude of the measured variable into account?
https://en.wikipedia.org/wiki/Coefficient_of_determination
 
  • #5
24forChromium said:
How can I transform this into a form in which the squared residual, or whatever value is reported, takes the magnitude of the measured variable into account?
The coefficient of determination ('R-squared') that Micromass linked uses the sum of squared errors (SSE) together with a measure of the spread of the observed values to calculate the measure of fit, so it gives some of the consideration you are seeking. R-squared is the most commonly used measure of fit in simple regressions.
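
As a rough sketch of how SSE and the spread of the observations combine, the usual definition is R-squared = 1 - SSE/SST, where SST is the sum of squared deviations of the observed values from their own mean. The function name `r_squared` and the data values below are made up for illustration only.

```python
def r_squared(observed, predicted):
    """Coefficient of determination computed as 1 - SSE / SST."""
    if len(observed) != len(predicted):
        raise ValueError("series must have the same length")
    mean_obs = sum(observed) / len(observed)
    # SSE: sum of squared differences between observation and prediction.
    sse = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    # SST: spread of the observed values around their own mean.
    sst = sum((y - mean_obs) ** 2 for y in observed)
    if sst == 0:
        raise ValueError("observed values are all equal; R-squared is undefined")
    return 1.0 - sse / sst


# Hypothetical data, not taken from the thread.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print(r_squared(observed, predicted))
```

Because SSE is divided by SST, rescaling both series by the same factor leaves the result unchanged. Note that with this definition the value can be negative when the predictions fit worse than simply using the mean of the observations.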

If for some case-specific reason you wanted to give even stronger consideration to the values being predicted you could replace the SSE, which is an equally-weighted sum, by a weighted sum that gave more weight to the squares that you wanted to have more influence. For example you might replace the SSE by ##\sum_{k=1}^n (y_k-\hat{y}_k)^2|y_k|## if you wanted to put more emphasis on observations of larger values. You'd need to also change your method for calculating R-squared though, to reflect the different weighting scheme.
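
A minimal sketch of that weighted sum, using the numbers from the earlier question (10 vs. 9 and 1000 vs. 998). The function names are hypothetical, and weighting by |y_k| is just the one choice written above, not the only possible scheme.

```python
def sse(observed, predicted):
    """Ordinary (equally weighted) sum of squared errors."""
    return sum((y - p) ** 2 for y, p in zip(observed, predicted))


def weighted_sse(observed, predicted):
    """Sum of squared errors, each term weighted by |observed value|."""
    return sum(abs(y) * (y - p) ** 2 for y, p in zip(observed, predicted))


observed = [10.0, 1000.0]
predicted = [9.0, 998.0]

print(sse(observed, predicted))           # 1 + 4 = 5.0
print(weighted_sse(observed, predicted))  # 10*1 + 1000*4 = 4010.0
```

With this weighting the residual at the large observation dominates even more strongly than in the plain SSE, which is the intended effect of putting more emphasis on larger values.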
 
  • #6
andrewkirk said:
The coefficient of determination ('R-squared') that Micromass linked uses the sum of squared errors (SSE) together with a measure of the spread of the observed values to calculate the measure of fit, so it gives some of the consideration you are seeking. R-squared is the most commonly used measure of fit in simple regressions.

If for some case-specific reason you wanted to give even stronger consideration to the values being predicted you could replace the SSE, which is an equally-weighted sum, by a weighted sum that gave more weight to the squares that you wanted to have more influence. For example you might replace the SSE by ##\sum_{k=1}^n (y_k-\hat{y}_k)^2|y_k|## if you wanted to put more emphasis on observations of larger values. You'd need to also change your method for calculating R-squared though, to reflect the different weighting scheme.

Okay, I will be honest with you: I didn't understand very much of what you were saying, because you brought up a lot of concepts that I have never heard of. Please don't take this as me blaming you for not explaining well; you have already given me a method that I had never thought of.

The things I don't understand in the first paragraph include:
-Is the coefficient of determination (R-squared) the same as the sum of the squares of the individual differences between the two curves, like what you told me?
-What is "Micromass" / "Micromass linked"? Is it a certain computer program?
-Is the sum of squared errors (SSE) the same as the sum of squared differences?
-"A measure of spread": is this some general expression for a property of the measured data, such as its average magnitude?
-I suppose the "measure of fit" just means the reported value for the "goodness of fit" generated by software?
-Simple regression: such as a linear relationship between the dependent and independent variables?

With the magnitude of my ignorance, I did not believe that the weighting described in the second paragraph would be of much help to me, so I pretty much just skimmed over it. Sorry if that comes across as disrespectful.

In conclusion, what I understand from your message is that some software can automatically report on the resemblance of two curves, taking their properties into account. The trouble is that not only is my understanding of technology rather basic, but I am also required to give a clear explanation of the meaning of the "goodness of fit" that I report. I would appreciate it if you could show me a way to calculate the goodness of fit (manually, dare I say?), maybe something like:
(Sum of squared differences) / (Average values of prediction)*(Average of actual data)
Of course, that was just a wild guess that even I am skeptical about, but I hope it demonstrates my intention.
 
  • #7
Put the predicted values and the observed values in two columns of a spreadsheet and use the correl function.

Say the predicted values are A1 to A10 and the observed values are B1 to B10. Compute the correlation with =correl(A1:A10,B1:B10).

This value (r) is the correlation between the predicted and observed values. It ranges from -1 to 1. For a simple linear regression, the coefficient of determination discussed above (r squared) is this value (r) squared.
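
For anyone without a spreadsheet, the same quantity can be sketched by hand. The code below computes the Pearson correlation coefficient, which is what the correl function returns; the function name `pearson_r` and the data values are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator of r).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations of each series (denominator of r).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data standing in for columns A (predicted) and B (observed).
predicted = [12.0, 18.0, 33.0, 39.0]
observed = [10.0, 20.0, 30.0, 40.0]

r = pearson_r(predicted, observed)
print(r, r ** 2)
```

One caveat worth keeping in mind: r measures how well the two series follow a straight-line relationship, not whether they are equal. A prediction that is always exactly twice the observation still gives r = 1.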
 
  • #8
Dr. Courtney said:
Put the predicted values and the observed values in two columns of a spreadsheet and use the correl function.

Say the predicted values are A1 to A10 and the observed values are B1 to B10. Compute the correlation with =correl(A1:A10,B1:B10).

The coefficient of determination discussed above (r squared) is this value (r) squared. This value (r) is the correlation between the predicted and observed values. It ranges from -1 to 1.
I would like to make sure that this tells me how closely A1-A10 resemble B1-B10, and that the r squared is in some way "independent" of the magnitude of the values of A and B.
Also, I have heard that by multiplying r^2 by 100%, one can claim that that percentage of the variation in the observed dependent variable is explained by the model. Would this be true when the inputs are not the independent and dependent variables but rather the theoretical and empirical values?
 
  • #9
24forChromium said:
I would like to make sure that this tells me how closely A1-A10 resemble B1-B10, and that the r squared is in some way "independent" of the magnitude of the values of A and B.

Yes. For example, you could change the units of A and B without changing the value of the resulting r squared.
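
A small sketch of that unit change, with made-up numbers: the same two series are expressed first in metres and then in centimetres. The helper names are hypothetical. The sum of squared errors grows by a factor of 100 squared, while r squared stays the same up to floating-point rounding.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def sse(observed, predicted):
    """Equally weighted sum of squared errors."""
    return sum((y - p) ** 2 for y, p in zip(observed, predicted))


# Hypothetical series in metres.
observed_m = [10.0, 20.0, 30.0, 40.0]
predicted_m = [12.0, 18.0, 33.0, 39.0]

# The same series in centimetres.
observed_cm = [100.0 * v for v in observed_m]
predicted_cm = [100.0 * v for v in predicted_m]

r2_m = pearson_r(predicted_m, observed_m) ** 2
r2_cm = pearson_r(predicted_cm, observed_cm) ** 2

print(sse(observed_m, predicted_m), sse(observed_cm, predicted_cm))
print(r2_m, r2_cm)
```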

24forChromium said:
Also, I have heard that by multiplying r^2 by 100%, one can claim that that percentage of the variation in the observed dependent variable is explained by the model. Would this be true when the inputs are not the independent and dependent variables but rather the theoretical and empirical values?

This is a common interpretation, but there are a lot of subtleties relating to the uncertainties, whether errors are random, and whether measurement errors are normally distributed.

You can report a correlation coefficient (r) and/or a coefficient of determination (r squared) simply as quantification of how much two series resemble each other (your original question) without trying to make the deeper interpretation.
 

1. What is the purpose of assessing the resemblance of two curves in statistics?

The purpose of assessing the resemblance of two curves in statistics is to determine how similar or dissimilar they are. This can help us identify patterns and relationships between variables, make predictions, and draw conclusions about the data.

2. What are some common methods for assessing the resemblance of two curves?

Some common methods for assessing the resemblance of two curves include visual inspection, correlation analysis, and regression analysis. These methods involve comparing the shape, direction, and strength of the two curves.

3. How do you interpret the results of a correlation analysis?

A correlation analysis produces a correlation coefficient, a numerical value between -1 and 1. A positive coefficient indicates a positive relationship between the two curves, meaning they tend to increase or decrease together. A negative coefficient indicates a negative relationship, meaning one curve tends to increase while the other decreases. A coefficient close to 0 indicates little or no linear relationship between the curves, although a nonlinear relationship may still exist.

4. What is the significance of the p-value in regression analysis for assessing the resemblance of two curves?

The p-value in regression analysis is used to judge the statistical significance of the relationship between the two curves. It is the probability of seeing a relationship at least as strong as the observed one if there were in fact no relationship. A p-value below a chosen threshold, commonly 0.05, is conventionally called statistically significant. A p-value above the threshold means the data do not provide sufficient evidence of a relationship; it does not show that no relationship exists.

5. How does the choice of data smoothing technique affect the assessment of resemblance between two curves?

Data smoothing techniques, such as moving averages or polynomial fitting, can affect the assessment of resemblance between two curves by altering the shape and direction of the curves. It is important to carefully choose the appropriate smoothing technique for the data to accurately assess the resemblance between the curves.
