Normalized SSE for a set of data?

  • Context: Graduate
  • Thread starter: pergradus
  • Tags: Data Set
SUMMARY

This discussion focuses on the challenge of quantitatively measuring the goodness of fit for a model against a set of data points, particularly when considering uncertainty, the number of data points, and the magnitude of the dependent variables. The traditional Sum of Squared Errors (SSE) is critiqued for its inability to provide a fair comparison across different datasets. The conversation highlights the Mean Squared Error (MSE) as a more appropriate metric, as it normalizes the SSE by the number of degrees of freedom, allowing for better evaluation of model performance. Additionally, the importance of defining precise goals in statistical analysis is emphasized to avoid ambiguity in results.

PREREQUISITES
  • Understanding of statistical concepts such as Sum of Squared Errors (SSE) and Mean Squared Error (MSE).
  • Familiarity with model fitting techniques and parameter estimation.
  • Knowledge of error analysis and its implications in data modeling.
  • Basic understanding of degrees of freedom in statistical contexts.
NEXT STEPS
  • Research the application of Mean Squared Error (MSE) in model evaluation.
  • Explore methods for assessing degrees of freedom in multivariate analysis, including MANOVA.
  • Learn about the implications of data point uncertainty on model fitting.
  • Investigate alternative metrics for model comparison, such as Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC).
USEFUL FOR

Data scientists, statisticians, and researchers involved in model fitting and evaluation, particularly those dealing with uncertainty and varying data magnitudes.

pergradus

Hi, suppose I have a set of data points, and each data point has a certain uncertainty associated with it.

Suppose also that I have a function which models the data. What I'd like to know is: how does one quantitatively measure how good the fit of the model is to the data, in such a way that one can compare the model against different sets of data?

For example, taking the SSE defined as:

$$\mathrm{SSE} = \sum_{i=1}^{N} \bigl(y_i - f(x_i, \beta)\bigr)^2$$

where \beta is a set of parameters, one can measure the difference between the model and the data. However, this does not take into account the number of data points or the degree of uncertainty. If I have a very large number of data points, a small difference between the data and model will result in a very large SSE, even if the model is a very good fit. Conversely, even if the model is poor, a small set of data points may produce a small SSE. Thus there is no way to compare between data sets for the same model.

Also, one must consider the magnitude of the data points when comparing sets. For example, if my y_i values range from 100 to 500 for one set, a small fractional difference between the model and the data may still produce a huge SSE, while a huge fractional difference between the model and data ranging from 0.01 to 0.5 will result in a small SSE.
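To make the two complaints concrete, here is a minimal sketch in Python. The data are made up for illustration (every observation sits exactly 1% above the model value), and the helper name `sse` is hypothetical. It only shows how the raw SSE moves with the number of points and with the scale of the values while the fractional misfit stays fixed.

```python
def sse(y, y_model):
    """Sum of squared errors between observed values and model predictions."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_model))


# Hypothetical data: every observation is 1% above the model value.
model_small = [100.0, 200.0, 300.0]
data_small = [1.01 * f for f in model_small]

# Same 1% misfit, but ten times as many points.
model_large = model_small * 10
data_large = [1.01 * f for f in model_large]

# Same 1% misfit, but values 1000 times smaller.
model_tiny = [f / 1000.0 for f in model_small]
data_tiny = [1.01 * f for f in model_tiny]

print(sse(data_small, model_small))  # 3 points, values in the hundreds
print(sse(data_large, model_large))  # 30 points, same relative misfit
print(sse(data_tiny, model_tiny))    # 3 points, values below 1
```

The three results come out at roughly 14, 140 and 1.4e-5, even though the relative quality of the fit is identical in all three cases.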

So, what I'd like is a way to compare the goodness of a fit for a wide variety of data sets that takes into account the error in the data, the number of data points, and the magnitude of the dependent variables in the data - can someone explain how to do this and what such a quantity is called?
 


pergradus said:
So, what I'd like is a way to compare the goodness of a fit for a wide variety of data sets that takes into account the error in the data, the number of data points, and the magnitude of the dependent variables in the data - can someone explain how to do this and what such a quantity is called?

You could try the mean of the sum of the squares of the percentage errors, but unless you can define what makes a good or bad comparison precisely, there is no mathematical answer to your question.
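One possible reading of this suggestion is the mean of the squared percentage errors. The sketch below assumes that reading, assumes no observed value is zero, and uses a hypothetical function name and made-up data; it is not the only way to interpret the suggestion.

```python
def mean_squared_percentage_error(y, y_model):
    """Mean of the squared percentage errors, (100 * (y_i - f_i) / y_i)^2.

    Assumes y and y_model have the same non-zero length and that no
    observed value y_i is zero.
    """
    if len(y) != len(y_model) or not y:
        raise ValueError("y and y_model must be non-empty and equally long")
    squared = [(100.0 * (yi - fi) / yi) ** 2 for yi, fi in zip(y, y_model)]
    return sum(squared) / len(squared)


# Hypothetical data: the model sits 1% below every observation.
y_big = [100.0, 200.0, 400.0]
f_big = [99.0, 198.0, 396.0]

# The same data scaled down by 1000, and repeated ten times.
y_tiny = [v / 1000.0 for v in y_big] * 10
f_tiny = [v / 1000.0 for v in f_big] * 10

print(mean_squared_percentage_error(y_big, f_big))
print(mean_squared_percentage_error(y_tiny, f_tiny))
```

Both calls give a value of about 1 (that is, 1% squared), so this quantity does not change with the number of points or the magnitude of the data. It still ignores the per-point uncertainties.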
 


Stephen Tashi said:
You could try the mean of the sum of the squares of the percentage errors, but unless you can define what makes a good or bad comparison precisely, there is no mathematical answer to your question.

I was thinking along the same lines... something like this:

$$\frac{1}{N}\sum_{i=1}^{N}\frac{(y_i - f(x_i, \beta)) + e_i}{y_i}$$

where N is the total number of data points, and e_i is the error associated with the i-th data point. This would give the average maximum possible fractional difference between the curve and the data.

Do you see any flaws in this reasoning?
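One property worth checking numerically is that the residuals enter the formula with their sign. The sketch below implements the expression as written, plus a variant with absolute values. The variant is an assumption added for comparison and is not from the thread; both function names and the data are hypothetical.

```python
def avg_fractional_difference(y, y_model, errors):
    """(1/N) * sum(((y_i - f_i) + e_i) / y_i), exactly as written above."""
    terms = [((yi - fi) + ei) / yi for yi, fi, ei in zip(y, y_model, errors)]
    return sum(terms) / len(terms)


def avg_abs_fractional_difference(y, y_model, errors):
    """Variant (an assumption, not from the thread): use |y_i - f_i| and |y_i|
    so that misses of opposite sign cannot cancel."""
    terms = [(abs(yi - fi) + ei) / abs(yi)
             for yi, fi, ei in zip(y, y_model, errors)]
    return sum(terms) / len(terms)


# Hypothetical data: the model misses by +10% on one point and -10% on the other.
y = [100.0, 100.0]
f = [110.0, 90.0]
e = [0.0, 0.0]

print(avg_fractional_difference(y, f, e))      # signed residuals cancel
print(avg_abs_fractional_difference(y, f, e))  # absolute residuals do not
```

With these numbers the signed version returns 0 although the model is off by 10% at every point, while the absolute-value version returns about 0.1.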
 


pergradus said:
Do you see any flaws in this reasoning?

It isn't reasoning. It's the method of "I don't know exactly what I want, but I'll recognize it when I see it". Everybody who deals with practical problems resorts to this method sometimes. If you want to deduce an answer by mathematical reasoning, you have to define precisely what it is that you are trying to do. If you are just casually fooling around with some data, I think the formula is worth a try. If you have some very serious purpose in your work, you should try to define your goal in precise language and see where reasoning will take you.
 
I haven't seen any reference to the term "mean squared error" (MSE) in this thread. This is simply the SSE divided by the number of degrees of freedom, and it is essentially what this thread is addressing. It's a widely used and well-regarded statistic and is useful in evaluating the bias of estimators, among other things. I'm not aware of any particular issues associated with it. Am I missing something in this discussion?

EDIT: Assessing the number of degrees of freedom can be difficult due to correlations among variables. MANOVA is usually required in the multivariate setting. In the univariate setting there is no problem, provided the individual data points (observations) are independent.

http://userwww.sfsu.edu/~efc/classes/biol710/manova/manovanewest.htm
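As a rough illustration of the MSE described above, the sketch below divides the SSE by the degrees of freedom. It assumes the univariate case with independent observations and takes the degrees of freedom to be N minus the number of fitted parameters, which is the usual convention but is not spelled out in the thread. The function name, the argument `n_params`, and the data are hypothetical.

```python
def mse(y, y_model, n_params):
    """SSE divided by the degrees of freedom.

    Assumption: observations are independent and the degrees of freedom
    are N - n_params, where n_params is the number of fitted parameters.
    """
    dof = len(y) - n_params
    if dof <= 0:
        raise ValueError("need more data points than fitted parameters")
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_model))
    return sse / dof


# Hypothetical data: residuals of 1, 2 and 3 from a one-parameter model.
y = [101.0, 202.0, 303.0]
f = [100.0, 200.0, 300.0]

print(mse(y, f, n_params=1))            # 3 points, 2 degrees of freedom
print(mse(y * 10, f * 10, n_params=1))  # 30 points, 29 degrees of freedom
```

Repeating the data ten times multiplies the SSE by ten, but the MSE stays of the same order (7 versus about 4.8 here), because the larger SSE is divided by a correspondingly larger number of degrees of freedom. The MSE still carries the squared units of y, so it does not by itself address the magnitude of the data.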
 