# AICc and parameter likelihood from repeated fits

1. Oct 18, 2012

### digfarenough

I'm looking at using AICc to compare a set of nested models and wondered about the following.

First, my the errors in my n data points around the fit are not i.i.d. Gaussian (the data have an artifact that the models will not able to fit easily and the artifact introduces dependencies and a non-Normal distribution of errors), so I worry about using the sum of the square residuals (RSS) divided by n as the likelihood. I recall reading somewhere that using RSS/n as likelihood in AICc can still produce reasonable results even if the errors are not Gaussian, though I think they were still independent in that case.

An alternative came to mind, though, but I can't get my thinking straight on it.

The fits often hit local minima so I do repeated non-linear least squares fits from random start points. Some of the models will overfit the data, and in these some of the parameters will vary wildly as one unnecessary parameter takes on extreme values that are balanced by extreme values from another unnecessary parameter.

[As a sidenote: I may misunderstanding likelihood in the context of a fit: Likelihood is the probability of an observed outcome given certain parameters, but in a fit the resulting shape is entirely determined by the parameters, so the likelihood would seem to be 1. However, since likelihood is a function of a statistical model, perhaps I should think of the fitting itself as being part of the process. Then my data would become the "parameters" of the statistical fitting process, which produces a distribution over the estimated parameters of the model, playing the part of the "outcome" in the definition of likelihood.]

That probably makes no sense, so: By considering the vectors of estimated parameters from all the repetitions of the fit, I get points sampled from a distribution in parameter space. This is sort of an empirical estimate of likelihood that shows whether the repeated fittings tend to cluster in one part of parameter space or whether they are all spread out in space. The former would suggest the model in question is a good fit and the latter that the model is overfitting. One of these points has the lowest RSS and so is the best fit. This suggests to me I can use the estimated parameter distribution density at the point with the lowest RSS as the likelihood of the fit for the AICc computation.

Am I all mixed up here? Is this a bad idea? I still can't quite figure out how to go from a cloud of points in parameter space to a single value of likelihood to plug into the AICc expression, though. I considered using multivariate kernel density estimation to smooth the discrete distribution out, but I worry I'm making things more and more complicated.

Could it be better to simply throw out models where the best fit had very large confidence intervals and only use those with reasonable intervals in the AICc comparison, using RSS/n as the likelihood?

Any input would be appreciated. I have searched at length online but can't seem to find the appropriate search terms to give me the answers I need, and I am not well versed in this aspect of statistics.

2. Oct 19, 2012

### Stephen Tashi

If this is a real world problem, you'll get better advice by describing the real problem rather than thrashing about in mathematical abstractions. As to the AIC, I have never seen an exposition of why this criteria is optimal in a particular situation. Is it known to be optimal in yours? Is there a reason why you are using it?

3. Oct 19, 2012

### DrDu

If the models are nested, why aren't you looking at the chisquared statistics for deviance differences instead?

4. Oct 19, 2012

### DrDu

The fit yields you the maximum of the likelihood function, not the complete likelilhood function. That's why it is called "maximum likelihood method".

5. Oct 19, 2012

### chiro

Hey digfarenough and welcome to the forums.

I'm going to echo all above responses in that you should tell us the problem in the simplest language (no mathematics, statistics, or probability) and that you should understand how variation (i.e. variance) says how much of a model contributes to the description of the data and also how it does this relative to other models (as Dr. Du has pointed out).

If you are looking at fits, then the most basic idea you will want to look at is the residual sum of squares for each model and the relative figure for this model.

The basic idea behind this is that one model will contribute so much variance with regards to the data that it is trying to fit and another will contribute so much relative to this model.

If a model is not contributing enough variance to capture the variation of the model, then it is a bad fit in comparison to a model that does capture the variation of the model.

This is quantified in the residual sum of squares and in general, ANOVA and other tests use the ratio of these two variances to make a conclusion about whether to accept or reject a particular model (in a general way) so it doesn't have to just be comparing two fits, but also testing whether all means have evidence to be statistically the same and the principle behind both of these shares a common origin and idea.

Recall that the F-distribution is a ratio of chi-squares and a chi-square is just a sum of independent squared normals and what happens when you sum variances? Well you add them together (if they are all independent).

This is the intuition that you can use when looking at how the results for variances and hypothesis testing of variances comes about and why it works under the assumptions.

6. Oct 19, 2012

### digfarenough

Hi everyone, thanks for all the replies!

@Stephen -- The problem is mode estimation in the frequency domain with corrupted noisy data. I have an experimentally measured complex-valued impedance function with a superimposed noisy oscillation and want to decompose as the sum of its modes, ignoring the superimposed oscillation.
I could use Pade approximants to avoid fitting entirely, but the data is really noisy so I'm not sure how comfortable I would be, and the application is in neuroscience where reviewers will likely not be comfortable with that sort of approach. Also, were you asking about how AIC is relevant in this particular case or why anyone would ever use AIC at all? I thought it was the standard when more than two models must be compared, since the F-test can only compare two nested models.

@DrDu -- I guess my ignorance is clear! I'm using a nonlinear least squares fit, not a maximum likelihood fit. You're saying that the results of a nonlinear least squares fit is a maximum likelihood estimator of the parameters? Even when it often hits local minima? It sounds like you assume I am comparing two models, though, but I have more than two models and they are nested in two dimensions (i.e. it isn't how many time constants, but how many time constants and also how many response frequencies), so I need to select from, say, a 5x5 grid of models.

@chiro -- I described more detail in response to Stephen earlier in this post. Is that still too mathematically abstract? I'm really not sure how to describe it without using mathematics, statistics, or probability...
I don't entirely understand your suggestion, though. Are you familiar with AIC and likelihood? I guess you're saying I simply shouldn't use the AIC? It sounds like what you're describing is in the context of hypothesis testing when comparing two models where one is a null hypothesis. I'm not really doing hypothesis testing, I'm doing model selection. There's no null to reject, there's just one model that is most likely.

7. Oct 19, 2012

### Stephen Tashi

Your description isn't too mathematically abstract but it would only be understandable by an engineer in your field. It isn't clear to me what the format of your data is.

The way I visual "a complex valued impedance in the frequency domain" is that your data is ordered pairs of values (f, z) where f is a frequency and z is a complex number. The way to visualize a "superimposed noisly oscillation" is problematical. An "oscillation" implies something regular and "noisy" implies somethig irregular. So perhaps you mean you have a "noise" that is also a set of ordered pairs (f,z) where the z values are drawn at random from some distribution.

It isn't clear what part of your problem is in need of a model what part is assumed to have a definite model that is not in need of any selection. For example, are you trying to develop a model for the noise too?

8. Oct 19, 2012

### digfarenough

Hi Stephen,

Thanks for your interest! Sorry for not being clearer!

You're correct that the data is (f,z) pairs where f is real and z complex. I don't have a closed form expression for the corruption, but in the magnitude of the data |z|, it looks like a decaying sinusoid. It is more or less symmetrical around the real data.

For concreteness, I expect to approximate the data in the form

$$\hat{z}(f) = \sum_{k=0}^{k=n}\frac{w_k}{fi + a_k} + \sum_{k=0}^{k=m}\frac{x_k\omega_k + y_k(fi+b_k)}{(fi+b_k)^2 + \omega_k^2}$$ where n and m define the model, i is the imaginary unit, and w_k, a_k, x_k, y_k, b_k, and omega_k are the parameters to be estimated. I do not intend to model the noise.

To be clearer about the shape of the noise, the corrupted data magnitude would look like:
$$|z(f)| = \left|\sum_{k=0}^{k=n}\frac{w_k}{fi + a_k} + \sum_{k=0}^{k=m}\frac{x_k\omega_k + y_k(fi+b_k)}{(fi+b_k)^2 + \omega_k^2}\right| + exp(-f/\tau)cos(cf)$$
where the last term is a rough approximation of the shape of the corruption, leaving out an additional term of independent noise for each f. (I wrote that in magnitude form because I'm not sure how to express that type of oscillation as a complex number offhand...)

Since the noise is approximately symmetric about the data, it is my hope that a least squares fit will, for the most part, avoid fitting it (of course, as the model gets bigger, more and more of the corruption will be fit by the model, which is a problem).

I perform the fit for pairs of (n,m) over a grid from, say, (0,1) or (1,0) up to (5,5). Each fit can hit local minima, so I have to perform multiple fits with random starting points and I use the fit with the smallest RSS as the best fit. Of these, I want to know which (n,m) pair gives the most likely fit. Normally RSS divided by the number of points can be used as the likelihood in the AICc expression, but the cos(cf) part of the error term means the errors are certainly not independent. I've tried to do this anyhow and get decent, but not great results, so I'm hoping to exploit the results from the repeated fittings for each (n,m) pair to get a better likelihood value for cleaner results.

I hope that makes the problem clearer. Please let me know if any other information would be useful!

9. Oct 20, 2012

### Stephen Tashi

First a general observation- it is natural when doing a real life problem to feel that you must "avoid unwarranted assumptions". However, in order to have enough "givens" to make a well-defined mathematical problem, you usually must make assumptions. Real life problems can be approached mathematically by solving them in various ways using different sets of assumptions and seeing how sensitive your answers are to these assumptions. Real life problems cannot be solved mathematically by picking various algorithms that solve various mathematical problems and applying them without ever stating the mathematical problem you are solving.

It's quite common to see people fiddle with data using various mathematical methods without ever making a precise statement of the problem they are solving. I'm not saying that this isn't a productive thing to do. It may be empirically effective. I'm just saying that if you take that approach you can't expect a mathematical answer to questions about whether you should do this or that. There can be subjective opinions, but if you don't have enough "givens", you simply don't have a case for saying that a certain mathematical algorithm produces a defensible answer.

You are willing to form some opinions about the "noise" in the data, but you don't want to model it. The fact that the residuals are not independent gaussians lead to to ask, in your original post, whether you could somehow treat the parameters of the model as random and use a liklihood function based on statistics of various fitting attempts as input to computing the AIC. If you want to do it, you need to make some assumptions that introduce something probabilistic about the parameters. That might be possible, but it seems much simpler to me to add a representation of the systematic noise component to your model.

I'll attempt an exercise in mind-reading and psychology and you can evaluate my effort.
I see this as a problem where you are somewhat confident of the form of the model - perhaps you have something like a passive filter and you know the theoretical expression for its frequency response. You want to keep the model-fitting algorithm simple - you prefer to do a grid search rather than something sophisticated. Hence you don't want to model the noise. You want a final result that has some theoretical justification - such as "I fit the model using .... which is best according to the Akaike Information Criterion".

My personal preferences in this situation would be to model the systematic noise and use a variety of fitting methods such a simulated annealing and conjugate gradient. Fiddle with things when doing the fitting task, not in the other aspects of the problem.

You should not let mathematical analysis lobotimize your knowledge of physics. I don't know if conservation of energy or current, etc. plays a role in the problem, but you should think about the physics that lead to you the particular form of the model you picked. That may give hints for modeling the systematic noise.

If you don't want to model the noise then I think you should at least do a weighted least squares fit that gives more weight to the data where you think the noise is small.

10. Oct 20, 2012

### digfarenough

Thanks for your thoughts! I'm not sure that I've failed to define the problem, though perhaps I didn't do it clearly here, but you are probably right that it would be better to directly model the corruption, so that the residuals would be much closer to independent. A problem is there is no simple expression for the corruption, but I recall seeing a paper long ago where someone derived an expression that should be pretty close, so I'll have to try to track that one down again and test whether it works.

Your mind-reading is basically correct. The system is approximately LTI, and that is an expression for the impedance of a finite-dimensional linear system.

I did consider the weighting you mention, but unfortunately the corruption is largest at the values that are most important in constraining the fit parameters, so this would probably not work so well.