Beating linear LS on polynomials

  • Context: Graduate 
  • Thread starter Thread starter divB
  • Start date Start date
  • Tags Tags
    Linear Polynomials
Click For Summary

Discussion Overview

The discussion revolves around the challenges of estimating polynomial coefficients in a system described by a nonlinear relationship between input and output variables. Participants explore the effectiveness of least squares (LS) methods in this context, particularly when using a limited number of samples, and consider alternative approaches to improve the estimation of higher-order coefficients.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • Some participants suggest that while LS minimizes the square of the differences between actual and regressed measurements, it may not provide the best performance for estimating higher-order coefficients when using limited samples.
  • There is a proposal that a larger spread in x-values could enhance sensitivity to higher powers of x, although this is contested by others who argue that it may lead to lower sampling rates and reduced information for higher-order contributions.
  • One participant describes an approach of fixing lower-order coefficients and then estimating higher-order coefficients with additional samples, but finds this does not yield better results than using the total dataset.
  • Another participant mentions that the polynomial function's numerical conditioning can be improved by using orthogonal polynomials, such as Chebyshev polynomials, which may lead to better performance in certain scenarios.
  • A later reply introduces the idea of analyzing the problem in the frequency domain rather than the time domain, suggesting that this could allow for more accurate estimation of non-linear contributions by selectively incorporating frequency components relevant to each polynomial term.
  • Concerns are raised about the error minimization process, with some participants noting that the importance of accurately estimating non-linear contributions may vary depending on the application context.

Areas of Agreement / Disagreement

Participants express differing views on the effectiveness of LS methods for estimating polynomial coefficients, particularly in the presence of limited samples. There is no consensus on whether alternative approaches or modifications to LS could yield better results, and the discussion remains unresolved regarding the best strategies for improving coefficient estimation.

Contextual Notes

Participants highlight limitations related to the conditioning of polynomial functions and the potential impact of sample selection on estimation accuracy. The discussion also touches on the complexity of the coefficients and the implications of working with high-dimensional data.

divB
Messages
85
Reaction score
0
Hi all,

Suppose I have a system which can be described using something like:

y(t) = a_1 x(t) + a_2 x^2(t) + \dots + a_p x^p(t)

I want to find the coefficients using samples from x(t) and y(t) (pairwise taken at same times) using as fewest samples as possible.

Clearly this is linear in parameters and can be recast in a LS problem.However, despite the problem being linear, the underlying process is nonlinear in this results in the problem that the linear part "overshades" the non-linear parts (or lower terms higher terms). For example, using few samples, a1 usually gets a good estimate while the error on the nonlinear parts is much higher.

Is it proven that LS will always give the best performance in such scenario or is there hope that the error on the non-linear (or higher order) coefficients can be improved (at the cost of slightly larger error on the linear (or low order) coefficients)?

I think LS should attend the CRLB (and is therefore optimum), right?

On the other hand, I have some ideas why I think that we can do better but I didn't find a concrete way to do it. I'll wait with this to keep the initial posting short.

Thank you,
divB
 
Last edited:
Physics news on Phys.org
You'll have to be more specific on what you mean by 'best performance'.

Regular least squares minimizes the square of the difference between the actual y-measurement and the regressed y-measurement for all the data points. I don't know how you would go about trying to find the regression of individual powers of the independent variable.
 
The relative uncertainties depend on the x-values of your samples. A larger spread in x-values will give more sensitivity to higher powers of x.
 
Hi all,

thanks so far.
"Best performance" was maybe the wrong term. What I mean is error to the "true" coefficients.
Suppose I create a "golden model" by taking 1M measurements.
When I use then only 50 measurements then, the difference to a1 would be the best and ap would have nearly nothing to do with the value taking with 1M measurements.
The reason is that the linear (lower) terms are dominant and "absorb" most of the measuremnt power.
The question is if I can improve the error of ap by sacrificing a little bit loss in the a1 term for example. Something like a "weigthing factor".

I tried many combinations of something like "Use 25 samples, then fix a1 and then use 25 new samples (together with the 25 old ones) to estimate a2 and a3 (for p=3, e.g.).

@mfb: Are you sure this is true? This sounds counter intuitive to me. A larger spread in the x-values correspond to a larger sampling period, right? This in turn means lower sampling rate which "cuts off" even more information from higher order contribution (because the bandwidth of "x" will grow proportional to "p").

Is there any pointer for alternative estimation algorithms or different LS flavors which could be benificial here?

Thanks
divB
 
divB said:
I tried many combinations of something like "Use 25 samples, then fix a1 and then use 25 new samples (together with the 25 old ones) to estimate a2 and a3 (for p=3, e.g.).
That does not help. LS with your total dataset is the best you can do based on that dataset.

@mfb: Are you sure this is true?
Yes.
This sounds counter intuitive to me. A larger spread in the x-values correspond to a larger sampling period, right?
That depends on x(t).
This in turn means lower sampling rate which "cuts off" even more information from higher order contribution (because the bandwidth of "x" will grow proportional to "p").
You are fitting a polynomial, not Fourier coefficients.

Simple example:
If all coefficients are of the order of 1, the value at x=10-4 is ##y(10^{-4})=a_1 10^{-4} + a_2 10^{-8} + a_3 10^{-12} + ... \approx a_1 10^{-4}## - you have no sensitivity to higher factors.
At x=1, all contributions are similar: ##y(1)=a_1 + a_2 + a_3 + ... +a_p##
At x=104, the largest power dominates: ##y(1)=a_1 10^4 + a_2 10^8 + ... +a_p 10^{4p} \approx a_p 10^{4p}##.

It does not matter which magnitude your coefficients have, larger x will always give more sensitivity to higher powers.
 
In terms of the mathematics, it doesn't matter how you write your polynomial function. But if you want to do numerical work the form in your OP is very ill-conditioned. If ##p## is fairly large (say 10) the term ##a_px^p(t)## contributes very little to the function except for when ##|x|## is close to its maximum value.

Numerically, it is better to write the function as a sum of orthogonal polynomials, for example Chebyshev polynomials (see http://mathworld.wolfram.com/ChebyshevPolynomialoftheFirstKind.html)

For a given set of data points, it is possible to construct an "optimal" set of polynomials which are exactly orthogonal at those data points. The least squares equations then become diagonal. That idea is useful if you have many sets of data at the same ##x## locations, because you only need to calculate the orthogonal polynomials once and then use them for each set of data values.

The practical details will be in any good book or website on least squares methods.
 
That does not help. LS with your total dataset is the best you can do based on that dataset.

I guess this is "general" knowledge but do you have a pointer by any chance? Is it that LS attends CRLB?

Simple example:

Thanks, got it!

Maybe I was oversimplifying it too much: First, it is more than just a polynomial (more like a Wiener system). Second, I do not have direct access to y(t) or x(t) but only to the samples at a specific (high) rate. So I have, let's say, 64k samples, all taken at distance of 9ns and I can only use 100. What I can do of course, is to wisely pick which samples I take. Currently I make the best results if I choose the 100 out of 64k. randomly. However, that's still not perfect and not deterministic.

The coefficients itself are complex which shouldn't make too much difference and their absolute value is within 1-2 orders of magnitude (e.g. 2 to 0.1). The magnitues are usually decreasing from higher to lower order.

Numerically, it is better to write the function as a sum of orthogonal polynomials

Thanks for this pointer. Actually I already tried them. They are definitely improving the condition number and would be very helpful for finite precision. However, with 64 bit floating point, the result is essentially the same for the same amount of measurements.




Maybe I can describe at this point why I think it could work better: Rather than looking at the problem in time domain, this can be done in frequency domain and rather than obtaining the 100 domain samples, we could obtain 100 frequency domain samples (e.g., the DFT of the sample points from above).

Suppose that x has a specific bandwidth B and the time domain measurements are such that the sampling rate is > 2*B*p.

In this case, the linear part (a1 term) only affects the frequency components up to B.
Conversely, the second order part (a2) affects the components within 2B and so on.

So it does not make sense to incorporate the frequency components >B for estimating a1 or >2B for a2 and so on.
However, using the standard LS approach this is done!

Is there any pointer where I could look regarding this?


One more note to the error: The error between x and y is which is minimzed. However, for certain applications it is the non-linear contribution (the spectral content > B) which is important and should be estimated more accurately than the spectral part < B.


I hope you can follow here what I mean.


Thanks!
divB
 

Similar threads

  • · Replies 8 ·
Replies
8
Views
3K
  • · Replies 3 ·
Replies
3
Views
2K
  • · Replies 3 ·
Replies
3
Views
2K
  • · Replies 0 ·
Replies
0
Views
3K
  • · Replies 22 ·
Replies
22
Views
4K
  • · Replies 2 ·
Replies
2
Views
3K
  • · Replies 4 ·
Replies
4
Views
3K
  • · Replies 7 ·
Replies
7
Views
3K
  • · Replies 1 ·
Replies
1
Views
4K
  • · Replies 3 ·
Replies
3
Views
1K