Normalized SSE for a set of data?

pergradus · Jun 16, 2011

Hi, suppose I have a set of data points, and each data point has a certain uncertainty associated with it.

Suppose also that I have a function which models the data. What I'd like to know is how one quantitatively measures how good the fit of the model is to the data, in such a way that one can compare the model against different sets of data.

For example, taking the SSE defined as:

\sum_i (y_i - f(x_i, \beta))^2

where \beta is a set of parameters, one can measure the difference between the model and the data. However, this does not take into account the number of data points or the degree of uncertainty. If I have a very large number of data points, a small difference between the data and the model will add up to a very large SSE, even if the model is a very good fit. Conversely, even if the model is poor, a small set of data points may produce a small SSE, so there is no way to compare between data sets for the same model.

Also, one must consider the magnitude of the data points when comparing sets. For example, if my y_i values range from 100...500 for one set, a small fractional change between the model and the data may still produce a huge SSE, while a huge fractional change between the model and data ranging from 0.01...0.5 will result in a small SSE.
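To make the two complaints concrete, here is a minimal Python sketch. The data are made up for illustration: in every case the "model" is simply the data scaled by 1.01, so the fractional misfit is always exactly 1%. The function name `sse` and the arrays are hypothetical, not part of the original question.

```python
import numpy as np

def sse(y, y_model):
    """Sum of squared errors between data y and model values y_model."""
    y = np.asarray(y, dtype=float)
    y_model = np.asarray(y_model, dtype=float)
    return float(np.sum((y - y_model) ** 2))

# Hypothetical data sets; the model is always 1% above the data.
small = np.linspace(0.01, 0.5, 50)     # small-magnitude y values
large = np.linspace(100.0, 500.0, 50)  # large-magnitude y values
many = np.linspace(100.0, 500.0, 500)  # same range, ten times as many points

print("SSE, small magnitudes:", sse(small, 1.01 * small))
print("SSE, large magnitudes:", sse(large, 1.01 * large))
print("SSE, large magnitudes, 10x points:", sse(many, 1.01 * many))
```

The same 1% misfit gives very different SSE values depending on the magnitude of the data and on how many points there are, which is why the raw SSE is hard to compare across data sets.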

So, what I'd like is a way to compare the goodness of a fit for a wide variety of data sets that takes into account the error in the data, the number of data points, and the magnitude of the dependent variables in the data - can someone explain how to do this and what such a quantity is called?

Stephen Tashi · Jun 19, 2011

pergradus said:

So, what I'd like is a way to compare the goodness of a fit for a wide variety of data sets that takes into account the error in the data, the number of data points, and the magnitude of the dependent variables in the data - can someone explain how to do this and what such a quantity is called?

You could try the mean of the sum of the squares of the percentage errors, but unless you can define what makes a good or bad comparison precisely, there is no mathematical answer to your question.
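One possible reading of this suggestion, as a rough Python sketch: divide each residual by the corresponding data value, square it, and average. The function name and the choice to express the error as a fraction rather than multiplying by 100 are assumptions made for illustration.

```python
import numpy as np

def mean_squared_fractional_error(y, y_model):
    """Mean of the squared relative errors ((y - model) / y)^2.

    Assumes no data value is zero; multiply the relative error by 100
    before squaring if a percentage is wanted instead of a fraction.
    """
    y = np.asarray(y, dtype=float)
    y_model = np.asarray(y_model, dtype=float)
    if np.any(y == 0):
        raise ValueError("relative error is undefined where y == 0")
    return float(np.mean(((y - y_model) / y) ** 2))

# Hypothetical data: the model is 1% above the data in both sets.
small = np.linspace(0.01, 0.5, 50)
large = np.linspace(100.0, 500.0, 500)
print(mean_squared_fractional_error(small, 1.01 * small))
print(mean_squared_fractional_error(large, 1.01 * large))
```

Because each residual is scaled by its own data value and the sum is divided by the number of points, the two hypothetical sets give essentially the same result despite differing in magnitude and size. This statistic does not use the measurement uncertainties.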

pergradus · Jun 20, 2011

Stephen Tashi said:

You could try the mean of the sum of the squares of the percentage errors, but unless you can define what makes a good or bad comparison precisely, there is no mathematical answer to your question.

I was thinking along the same lines... something like this:

\frac{1}{N}\sum_i \frac{(y_i - f(x_i, \beta)) + e_i}{y_i}

where N is the total number of data points, and e_i is the error associated with the ith data point. This would give the average maximum possible difference between the curve and the data.
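For anyone who wants to try it on real numbers, here is a literal transcription of that formula as a small Python sketch. The function and argument names are hypothetical. One thing the transcription makes visible: as written the residual y_i - f(x_i, \beta) is signed, so positive and negative terms can partly offset each other in the average.

```python
import numpy as np

def avg_relative_difference(y, y_model, e):
    """Literal form of (1/N) * sum(((y_i - f_i) + e_i) / y_i).

    y       : data values (assumed nonzero)
    y_model : model values f(x_i, beta) at the same points
    e       : uncertainty attached to each data value
    """
    y = np.asarray(y, dtype=float)
    y_model = np.asarray(y_model, dtype=float)
    e = np.asarray(e, dtype=float)
    if np.any(y == 0):
        raise ValueError("the formula divides by y_i, so y must be nonzero")
    # Signed residual plus the uncertainty, scaled by the data value.
    return float(np.mean(((y - y_model) + e) / y))

# Hypothetical two-point example: one residual is positive, one negative.
print(avg_relative_difference([10.0, 20.0], [9.0, 22.0], [1.0, 1.0]))
```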

Do you see any flaws in this reasoning?

Stephen Tashi · Jun 20, 2011

pergradus said:

Do you see any flaws in this reasoning?

It isn't reasoning. It's the method of "I don't know exactly what I want, but I'll recognize it when I see it". Everybody who deals with practical problems resorts to this method sometimes. If you want to deduce an answer by mathematical reasoning, you have to define precisely what it is that you are trying to do. If you are just casually fooling around with some data, I think the formula is worth a try. If you have some very serious purpose in your work, you should try to define your goal in precise language and see where reasoning will take you.

SW VandeCarr · Jun 21, 2011

I haven't seen any reference to the term "mean squared error" (MSE) in this thread. This is simply the SSE divided by the number of degrees of freedom and is essentially what this thread is addressing. It's a widely used and well regarded statistic and is useful in evaluating the bias of estimators, among other things. I'm not aware of any particular issues associated with it. Am I missing something in this discussion?
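As a minimal Python sketch of that definition, assuming independent observations and taking the degrees of freedom to be the number of data points minus the number of fitted parameters (that choice, and the name `mse`, are assumptions for illustration):

```python
import numpy as np

def mse(y, y_model, n_params):
    """SSE divided by the residual degrees of freedom, N - n_params.

    n_params is the number of parameters fitted to the data (the size
    of beta); the observations are assumed to be independent.
    """
    y = np.asarray(y, dtype=float)
    y_model = np.asarray(y_model, dtype=float)
    dof = y.size - n_params
    if dof <= 0:
        raise ValueError("need more data points than fitted parameters")
    return float(np.sum((y - y_model) ** 2) / dof)

# Hypothetical example: four points, a two-parameter model, one bad point.
print(mse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0], n_params=2))
```

Dividing by the degrees of freedom removes the dependence on the number of points, but the result still carries the units and scale of y squared.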

EDIT: Assessing the number of degrees of freedom can be difficult due to correlations among variables. MANOVA analysis is usually required in the multivariate setting. In the univariate setting there is no problem provided the individual data points (observations) are independent.

http://userwww.sfsu.edu/~efc/classes/biol710/manova/manovanewest.htm
