# Beating linear LS on polynomials

Hi all,

Suppose I have a system which can be described using something like:

$$y(t) = a_1 x(t) + a_2 x^2(t) + \dots + a_p x^p(t)$$

I want to find the coefficients using samples from x(t) and y(t) (pairwise taken at same times) using as fewest samples as possible.

Clearly this is linear in parameters and can be recast in a LS problem.

However, despite the problem being linear, the underlying process is nonlinear in this results in the problem that the linear part "overshades" the non-linear parts (or lower terms higher terms). For example, using few samples, a1 usually gets a good estimate while the error on the nonlinear parts is much higher.

Is it proven that LS will always give the best performance in such scenario or is there hope that the error on the non-linear (or higher order) coefficients can be improved (at the cost of slightly larger error on the linear (or low order) coefficients)?

I think LS should attend the CRLB (and is therefore optimum), right?

On the other hand, I have some ideas why I think that we can do better but I didn't find a concrete way to do it. I'll wait with this to keep the initial posting short.

Thank you,
divB

Last edited:

SteamKing
Staff Emeritus
Homework Helper
You'll have to be more specific on what you mean by 'best performance'.

Regular least squares minimizes the square of the difference between the actual y-measurement and the regressed y-measurement for all the data points. I don't know how you would go about trying to find the regression of individual powers of the independent variable.

mfb
Mentor
The relative uncertainties depend on the x-values of your samples. A larger spread in x-values will give more sensitivity to higher powers of x.

Hi all,

thanks so far.
"Best performance" was maybe the wrong term. What I mean is error to the "true" coefficients.
Suppose I create a "golden model" by taking 1M measurements.
When I use then only 50 measurements then, the difference to a1 would be the best and ap would have nearly nothing to do with the value taking with 1M measurements.
The reason is that the linear (lower) terms are dominant and "absorb" most of the measuremnt power.
The question is if I can improve the error of ap by sacrificing a little bit loss in the a1 term for example. Something like a "weigthing factor".

I tried many combinations of something like "Use 25 samples, then fix a1 and then use 25 new samples (together with the 25 old ones) to estimate a2 and a3 (for p=3, e.g.).

@mfb: Are you sure this is true? This sounds counter intuitive to me. A larger spread in the x-values correspond to a larger sampling period, right? This in turn means lower sampling rate which "cuts off" even more information from higher order contribution (because the bandwidth of "x" will grow proportional to "p").

Is there any pointer for alternative estimation algorithms or different LS flavors which could be benificial here?

Thanks
divB

mfb
Mentor
I tried many combinations of something like "Use 25 samples, then fix a1 and then use 25 new samples (together with the 25 old ones) to estimate a2 and a3 (for p=3, e.g.).
That does not help. LS with your total dataset is the best you can do based on that dataset.

@mfb: Are you sure this is true?
Yes.
This sounds counter intuitive to me. A larger spread in the x-values correspond to a larger sampling period, right?
That depends on x(t).
This in turn means lower sampling rate which "cuts off" even more information from higher order contribution (because the bandwidth of "x" will grow proportional to "p").
You are fitting a polynomial, not Fourier coefficients.

Simple example:
If all coefficients are of the order of 1, the value at x=10-4 is ##y(10^{-4})=a_1 10^{-4} + a_2 10^{-8} + a_3 10^{-12} + ... \approx a_1 10^{-4}## - you have no sensitivity to higher factors.
At x=1, all contributions are similar: ##y(1)=a_1 + a_2 + a_3 + ... +a_p##
At x=104, the largest power dominates: ##y(1)=a_1 10^4 + a_2 10^8 + ... +a_p 10^{4p} \approx a_p 10^{4p}##.

It does not matter which magnitude your coefficients have, larger x will always give more sensitivity to higher powers.

AlephZero
Homework Helper
In terms of the mathematics, it doesn't matter how you write your polynomial function. But if you want to do numerical work the form in your OP is very ill-conditioned. If ##p## is fairly large (say 10) the term ##a_px^p(t)## contributes very little to the function except for when ##|x|## is close to its maximum value.

Numerically, it is better to write the function as a sum of orthogonal polynomials, for example Chebyshev polynomials (see http://mathworld.wolfram.com/ChebyshevPolynomialoftheFirstKind.html)

For a given set of data points, it is possible to construct an "optimal" set of polynomials which are exactly orthogonal at those data points. The least squares equations then become diagonal. That idea is useful if you have many sets of data at the same ##x## locations, because you only need to calculate the orthogonal polynomials once and then use them for each set of data values.

The practical details will be in any good book or website on least squares methods.

That does not help. LS with your total dataset is the best you can do based on that dataset.

I guess this is "general" knowledge but do you have a pointer by any chance? Is it that LS attends CRLB?

Simple example:

Thanks, got it!

Maybe I was oversimplifying it too much: First, it is more than just a polynomial (more like a Wiener system). Second, I do not have direct access to y(t) or x(t) but only to the samples at a specific (high) rate. So I have, lets say, 64k samples, all taken at distance of 9ns and I can only use 100. What I can do of course, is to wisely pick which samples I take. Currently I make the best results if I choose the 100 out of 64k. randomly. However, that's still not perfect and not deterministic.

The coefficients itself are complex which shouldn't make too much difference and their absolute value is within 1-2 orders of magnitude (e.g. 2 to 0.1). The magnitues are usually decreasing from higher to lower order.

Numerically, it is better to write the function as a sum of orthogonal polynomials

Thanks for this pointer. Actually I already tried them. They are definitely improving the condition number and would be very helpful for finite precision. However, with 64 bit floating point, the result is essentially the same for the same amount of measurements.

Maybe I can describe at this point why I think it could work better: Rather than looking at the problem in time domain, this can be done in frequency domain and rather than obtaining the 100 domain samples, we could obtain 100 frequency domain samples (e.g., the DFT of the sample points from above).

Suppose that x has a specific bandwidth B and the time domain measurements are such that the sampling rate is > 2*B*p.

In this case, the linear part (a1 term) only affects the frequency components up to B.
Conversely, the second order part (a2) affects the components within 2B and so on.

So it does not make sense to incorporate the frequency components >B for estimating a1 or >2B for a2 and so on.
However, using the standard LS approach this is done!

Is there any pointer where I could look regarding this?

One more note to the error: The error between x and y is which is minimzed. However, for certain applications it is the non-linear contribution (the spectral content > B) which is important and should be estimated more accurately than the spectral part < B.

I hope you can follow here what I mean.

Thanks!
divB