# Underdetermined vs Overdetermined Systems

1. Jun 30, 2015

### CoSurShe

I'm trying to create a model which is of the form

$$y = (a_0 + a_1 l)\left[b_0 + \sum_{m=1}^{M} b_m \cos(mx - \alpha_m)\right]\left[c_0 + \sum_{n=1}^{N} c_n \cos(nz - \beta_n)\right]$$

In the above model, l, x and z are the independent variables and y is the dependent variable. The a, b and c terms are the unknowns. To solve for these unknowns, I have two separate data sets that I can use. Data set 1 gives an overdetermined system (more observations than unknowns), while data set 2 gives an underdetermined system (fewer observations than unknowns). In such a case, which approach would be better, underdetermined or overdetermined, and why?
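As a side note on the structure of this model (my own sketch, with hypothetical names, not from the post): each bracketed factor such as $b_0 + \sum_m b_m \cos(mx - \alpha_m)$ can be rewritten using the identity $\cos(mx - \alpha_m) = \cos\alpha_m \cos(mx) + \sin\alpha_m \sin(mx)$, so each factor on its own is linear in the transformed unknowns $(b_0,\ b_m\cos\alpha_m,\ b_m\sin\alpha_m)$:

```python
import numpy as np

def cosine_design_matrix(x, M):
    """Columns [1, cos(x), sin(x), ..., cos(Mx), sin(Mx)]: 2M + 1 unknowns
    per factor after absorbing the phase angles into the coefficients."""
    cols = [np.ones_like(x)]
    for m in range(1, M + 1):
        cols.append(np.cos(m * x))
        cols.append(np.sin(m * x))
    return np.column_stack(cols)

x = np.linspace(0.0, 2.0 * np.pi, 50)
B = cosine_design_matrix(x, M=3)  # shape (50, 7): one row per observation
```

The full model is a product of three such factors, so it is not jointly linear in all the unknowns; one common approach (an assumption on my part, not something the poster stated) is to fix all but one factor and solve for the remaining one linearly, alternating between factors.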

2. Jun 30, 2015

### HallsofIvy

Staff Emeritus
Neither is very good! Which is "better" depends on what you want to do and what you mean by "better". The underdetermined system allows an infinite number of "solutions", but you can determine the subset (actually a subspace) of all possible parameter combinations that exactly satisfy the system. The overdetermined system has no exact solution, but you can determine the unique solution that comes closest to satisfying the system in the least-squares sense.
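A small numerical sketch of both situations (my example, not from the thread): `numpy.linalg.lstsq` handles both cases, returning the least-squares solution for the overdetermined system and the minimum-norm exact solution for the underdetermined one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined: 10 observations, 3 unknowns -> generally no exact solution;
# lstsq returns the x minimizing ||A @ x - y||_2.
A_over = rng.normal(size=(10, 3))
y_over = rng.normal(size=10)
x_ls, residuals, rank, _ = np.linalg.lstsq(A_over, y_over, rcond=None)

# Underdetermined: 3 observations, 10 unknowns -> infinitely many exact
# solutions; lstsq returns the one with minimum Euclidean norm.
A_under = rng.normal(size=(3, 10))
y_under = rng.normal(size=3)
x_min, *_ = np.linalg.lstsq(A_under, y_under, rcond=None)
```

In the underdetermined case `A_under @ x_min` reproduces `y_under` exactly (up to rounding), while in the overdetermined case a nonzero residual generally remains.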

3. Jun 30, 2015

### CoSurShe

Thanks for the reply. I need to fit the model described above to one of the two available data sets and use the residual to perform a separate set of analyses. The real concern is that I have too many parameters to fit, and I fear overfitting and the resulting unreliability and inaccuracy of the results. I am aware of regularization procedures and other steps to mitigate overfitting. What I am not certain about is whether a regularized regression technique or a partial least squares method for finding the coefficients of the underdetermined system is better suited for removing the trend described by the model above, compared to using the same model with the other data set, i.e. a regularized overdetermined system.
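For reference, a minimal sketch of the regularized approach mentioned here (ridge/Tikhonov regularization; the function name and penalty value are my own illustrative choices): the penalty term makes the normal equations well-posed even when the system is underdetermined, and damps overfitting when it is not.

```python
import numpy as np

def ridge_fit(A, y, lam=1e-2):
    """Solve (A^T A + lam I) w = A^T y, the ridge-regularized normal equations."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 12))   # underdetermined: 5 observations, 12 unknowns
y = rng.normal(size=5)

w = ridge_fit(A, y)
residual = y - A @ w           # the residual used for later analyses
```

The choice of `lam` matters in practice; it is usually picked by cross-validation rather than fixed a priori.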

4. Jun 30, 2015

### cgk

One thing you could do, in principle, is to use both data sets to fit the parameters. Then you have an even more overdetermined system than with data set 1 alone. Is this a good idea? Hard to say.

In general, what you *want* to achieve is to find a parameterization of your model which needs as few fit-dependent parameters as possible (for example, by fixing parameters or functional forms using asymptotic expansions, known constraints, symmetry, etc.). You could then (a) check whether the model is reasonable by fitting it to a subset of the data and checking whether it reasonably reproduces the other data, and (b) if this works, use all the data you have for least-squares fitting (or maximum-likelihood fitting or whatever you like) of the model's parameters, to extend its range of applicability as far as possible.
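Step (a) can be sketched as a simple holdout check (a toy example of mine, not from the post): fit on part of the data, then compare the fitting error with the error on the held-out part.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear model with noisy observations.
A = rng.normal(size=(40, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = A @ true_w + 0.1 * rng.normal(size=40)

# Fit on the first 30 observations, hold out the last 10.
train, test = slice(0, 30), slice(30, 40)
w_fit, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)

rmse_train = np.sqrt(np.mean((A[train] @ w_fit - y[train]) ** 2))
rmse_test = np.sqrt(np.mean((A[test] @ w_fit - y[test]) ** 2))
# Comparable train and test errors suggest the model generalizes; a much
# larger test error is a warning sign of overfitting.
```

If the check passes, refit with all the data pooled, as suggested in step (b).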

Basically, the more parameters you need to fit, the more susceptible your model becomes to overfitting, and thus to becoming unreliable (and possibly erratic) as soon as you step outside the range of the data used for fitting. If in doubt, I would always consider an under-fitted model with fewer parameters that reasonably reproduces larger data sets as more trustworthy than an over-fitted model that more closely reproduces the data set it was fitted on. Some of the most successful models in all of physics (e.g., http://dx.doi.org/10.1063/1.464913) achieved their success mainly because they had few parameters and thus little room for overfitting, which extended their applicability even beyond the originally envisioned applications.