# Normalized SSE for a set of data?

1. Jun 16, 2011

Hi, suppose I have a set of data points, and each data point has a certain uncertainty associated with it.

Suppose also that I have a function which models the data. What I'd like to know is how one quantitatively measures how good the fit of the model is to the data, in such a way that one can compare the model against different sets of data.

For example, taking the SSE defined as:

$\sum(y_i - f(x_i , \beta))^2$

where $\beta$ is a set of parameters, one can measure the difference between the model and the data. However, this does not take into account the number of data points or the degree of uncertainty. If I have a very large number of data points, a small difference between the data and the model at each point will still add up to a very large SSE, even if the model is a very good fit. Conversely, even if the model is poor, a small set of data points may produce a small SSE, so there is no way to compare between data sets for the same model.

Also, one must consider the magnitude of the data points when comparing sets. For example, if my $y_i$ values range from 100 to 500 for one set, a small fractional difference between the model and the data may still produce a huge SSE, while a huge fractional difference between the model and data ranging from 0.01 to 0.5 will result in a small SSE.
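The two effects described above can be illustrated with a short Python sketch. The data values and the "model" (which simply overestimates every point by 1%) are hypothetical and chosen only to make the arithmetic easy to follow; the helper names are not from any library.

```python
def sse(y, y_model):
    """Sum of squared errors between observed and modelled values."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_model))


def one_percent_high(y):
    """Hypothetical model output: every prediction is 1% above the data."""
    return [1.01 * yi for yi in y]


# Hypothetical data sets with the same fractional misfit (1%) everywhere.
small_set = [100.0, 200.0, 300.0, 400.0, 500.0]
large_set = small_set * 20                      # 20 times as many points
tiny_values = [v / 1000.0 for v in small_set]   # same shape, values 0.1 to 0.5

sse_small = sse(small_set, one_percent_high(small_set))
sse_large = sse(large_set, one_percent_high(large_set))
sse_tiny = sse(tiny_values, one_percent_high(tiny_values))

print(sse_small, sse_large, sse_tiny)
```

Although the fit is equally good in a fractional sense in all three cases, the SSE grows in proportion to the number of points and to the square of the data scale, which is why the raw SSE cannot be compared across data sets.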

So, what I'd like is a way to compare the goodness of a fit for a wide variety of data sets that takes into account the error in the data, the number of data points, and the magnitude of the dependent variables in the data - can someone explain how to do this and what such a quantity is called?

2. Jun 19, 2011

### Stephen Tashi

You could try the mean of the squares of the percentage errors, but unless you can define precisely what makes a comparison good or bad, there is no mathematical answer to your question.
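As a rough sketch of that suggestion, the mean of the squared percentage errors could be computed as below. The function name and the sample data are hypothetical, and the sketch assumes every observed value is nonzero, since a percentage error is undefined otherwise.

```python
def mean_squared_percentage_error(y, y_model):
    """Mean of the squared percentage errors; every observed y must be nonzero."""
    if any(yi == 0 for yi in y):
        raise ValueError("percentage error is undefined for an observed value of zero")
    squared = [(100.0 * (yi - fi) / yi) ** 2 for yi, fi in zip(y, y_model)]
    return sum(squared) / len(squared)


# Hypothetical data on two very different scales, each overestimated by 1%.
big = [100.0, 200.0, 300.0, 400.0, 500.0]
small = [0.1, 0.2, 0.3, 0.4, 0.5]

mspe_big = mean_squared_percentage_error(big, [1.01 * v for v in big])
mspe_small = mean_squared_percentage_error(small, [1.01 * v for v in small])

print(mspe_big, mspe_small)
```

Because each error is divided by the observed value and the sum is divided by the number of points, the two data sets give essentially the same result despite differing in scale by a factor of 1000. The measure does not use the per-point uncertainties.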

3. Jun 20, 2011

I was thinking along the same lines... something like this:

$\frac{1}{N}\sum_{i=1}^{N}\frac{|y_i - f(x_i, \beta)| + e_i}{|y_i|}$

where $N$ is the total number of data points and $e_i$ is the error associated with the $i$th data point. This would give the average maximum possible relative difference between the curve and the data.
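A minimal Python sketch of this quantity follows. It takes the absolute value of each residual and of each $y_i$ so that positive and negative residuals cannot cancel; that is an assumption about the intent ("maximum possible difference"). The function name and the numbers are hypothetical.

```python
def mean_max_relative_deviation(y, y_model, errors):
    """Average over points of (|residual| + uncertainty) / |observed value|."""
    if not (len(y) == len(y_model) == len(errors)):
        raise ValueError("y, y_model and errors must have the same length")
    terms = [
        (abs(yi - fi) + ei) / abs(yi)   # assumes yi != 0 and ei >= 0
        for yi, fi, ei in zip(y, y_model, errors)
    ]
    return sum(terms) / len(terms)


# Hypothetical example with two data points.
y_obs = [100.0, 200.0]
y_fit = [110.0, 190.0]
e_obs = [5.0, 10.0]

score = mean_max_relative_deviation(y_obs, y_fit, e_obs)
print(score)
```

Note that in this form a larger uncertainty $e_i$ always makes the score worse, even when the model passes well inside the error bar of every point.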

Do you see any flaws in this reasoning?

4. Jun 20, 2011

### Stephen Tashi

It isn't reasoning. It's the method of "I don't know exactly what I want, but I'll recognize it when I see it". Everybody who deals with practical problems resorts to this method sometimes. If you want to deduce an answer by mathematical reasoning, you have to define precisely what it is that you are trying to do. If you are just casually fooling around with some data, I think the formula is worth a try. If you have some very serious purpose in your work, you should try to define your goal in precise language and see where reasoning will take you.

5. Jun 21, 2011

### SW VandeCarr

I haven't seen any reference to the term "mean squared error" (MSE) in this thread. This is simply the SSE divided by the number of degrees of freedom, and it is essentially what this thread is addressing. It's a widely used and well-regarded statistic and is useful in evaluating the bias of estimators, among other things. I'm not aware of any particular issues associated with it. Am I missing something in this discussion?

EDIT: Assessing the number of degrees of freedom can be difficult due to correlations among variables. MANOVA is usually required in the multivariate setting. In the univariate setting there is no problem provided the individual data points (observations) are independent.
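For the univariate case with independent observations, the MSE can be sketched in Python as below. The sketch assumes the degrees of freedom are the number of data points minus the number of fitted parameters, which is the usual convention for regression residuals but is not stated in this thread; the function name and the data are hypothetical.

```python
def mean_squared_error(y, y_model, n_params):
    """SSE divided by the degrees of freedom, taken here as N - n_params."""
    dof = len(y) - n_params   # assumption: independent observations
    if dof <= 0:
        raise ValueError("need more data points than fitted parameters")
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_model))
    return sse / dof


# Hypothetical example: four points and a model with two fitted parameters.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_fit = [1.5, 1.5, 3.5, 3.5]

mse = mean_squared_error(y_obs, y_fit, n_params=2)
print(mse)
```

Dividing by the degrees of freedom removes the dependence on the number of points, but the result still carries the squared units of $y$, so it does not by itself address the differences in data scale or in per-point uncertainty raised in the first post.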