Normalized SSE for a set of data?

pergradus:

Hi, suppose I have a set of data points, and each data point has a certain uncertainty associated with it.

Suppose also that I have a function which models the data. What I'd like to know is: how does one quantitatively measure how good the fit of the model is to the data, in such a way that one can compare the model across different sets of data?

For example, taking the SSE defined as

$$\mathrm{SSE} = \sum_i \left(y_i - f(x_i, \beta)\right)^2$$

where ##\beta## is a set of parameters, one can measure the difference between the model and the data. However, this does not take into account the number of data points or the degree of uncertainty. If I have a very large number of data points, a small difference between the data and the model will result in a very large SSE, even if the model is a very good fit. Conversely, even if the model is poor, a small set of data points may produce a small SSE, so there is no way to compare between data sets for the same model.

One must also consider the magnitude of the data points when comparing sets. For example, if my ##y_i## values range from 100 to 500 for one set, a small fractional change between the model and the data may still produce a huge SSE, while a huge fractional change between the model and data ranging from 0.01 to 0.5 will result in a small SSE.
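A quick numerical sketch of both effects (synthetic data with a hypothetical 5% Gaussian noise level, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sse(y, f):
    """Raw sum of squared errors."""
    return np.sum((y - f) ** 2)

# Identical relative fit quality (5% noise) but different N and magnitude:
for n, scale in [(10, 1.0), (1000, 1.0), (10, 100.0)]:
    x = np.linspace(0.0, 1.0, n)
    f = scale * (1.0 + x)                     # model values
    y = f + rng.normal(0.0, 0.05 * scale, n)  # synthetic "data"
    print(f"N = {n:4d}, scale = {scale:5.1f}, SSE = {sse(y, f):10.4f}")
# The raw SSE grows with N and with the square of the data's magnitude,
# even though the quality of the fit is the same in all three cases.
```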

So, what I'd like is a way to compare the goodness of a fit for a wide variety of data sets that takes into account the error in the data, the number of data points, and the magnitude of the dependent variables in the data - can someone explain how to do this and what such a quantity is called?
 


pergradus said:
So, what I'd like is a way to compare the goodness of a fit for a wide variety of data sets that takes into account the error in the data, the number of data points, and the magnitude of the dependent variables in the data - can someone explain how to do this and what such a quantity is called?

You could try the mean of the sum of the squares of the percentage errors, but unless you can define what makes a good or bad comparison precisely, there is no mathematical answer to your question.
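
A minimal sketch of that statistic, assuming all ##y_i## are nonzero:

```python
import numpy as np

def mean_squared_percentage_error(y, f):
    """Mean of the squared percentage errors 100*(y_i - f_i)/y_i.
    Dimensionless, so it is insensitive to the overall magnitude of y;
    it is undefined wherever y_i is zero."""
    pct = 100.0 * (y - f) / y
    return np.mean(pct ** 2)
```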
 


Stephen Tashi said:
You could try the mean of the sum of the squares of the percentage errors, but unless you can define what makes a good or bad comparison precisely, there is no mathematical answer to your question.

I was thinking along the same lines... something like this:

$$\frac{1}{N}\sum_i \frac{(y_i - f(x_i, \beta)) + e_i}{y_i}$$

where ##N## is the total number of data points, and ##e_i## is the error associated with the ##i##th data point. This would give the average maximum possible difference between the curve and the data.
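
For concreteness, a literal transcription of this proposed quantity (the function name is illustrative only):

```python
import numpy as np

def proposed_statistic(y, f, e):
    """The quantity proposed above, transcribed as written:
    (1/N) * sum(((y_i - f_i) + e_i) / y_i).
    Note the summand is signed, so positive and negative
    deviations can cancel in the sum."""
    return np.mean(((y - f) + e) / y)
```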

Do you see any flaws in this reasoning?
 


pergradus said:
Do you see any flaws in this reasoning?

It isn't reasoning. It's the method of "I don't know exactly what I want, but I'll recognize it when I see it". Everybody who deals with practical problems resorts to this method sometimes. If you want to deduce an answer by mathematical reasoning, you have to define precisely what it is that you are trying to do. If you are just casually fooling around with some data, I think the formula is worth a try. If you have a very serious purpose in your work, you should try to define your goal in precise language and see where reasoning takes you.
 
I haven't seen any reference to the term "mean squared error" (MSE) in this thread. This is simply the SSE divided by the number of degrees of freedom, and it is essentially what this thread is addressing. It's a widely used and well-regarded statistic and is useful in evaluating the bias of estimators, among other things. I'm not aware of any particular issues associated with it. Am I missing something in this discussion?

EDIT: Assessing the number of degrees of freedom can be difficult due to correlations among variables. MANOVA analysis is usually required in the multivariate setting. In the univariate setting there is no problem provided the individual data points (observations) are independent.

http://userwww.sfsu.edu/~efc/classes/biol710/manova/manovanewest.htm
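
A minimal sketch, assuming the univariate case with independent observations described above, so the degrees of freedom are ##N - p## for ##p## fitted parameters:

```python
import numpy as np

def mean_squared_error(y, f, n_params):
    """MSE: the SSE divided by the degrees of freedom.
    Assumes independent observations, so dof = N - n_params,
    where n_params is the number of fitted model parameters."""
    dof = len(y) - n_params
    return np.sum((y - f) ** 2) / dof
```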
 