# Curve fitting with errors-in-variables

## Main Question or Discussion Point

Hi,
I need some direction on a problem that has been bothering me quite a lot; even some links or a short explanation would help. Thank you in advance!

I have a huge dataset Ω (unknown) of experimental values {$e_i$} that should approximate (with noise both in the value and in the variable) an unknown curve $f$, with the condition that $f$ is non-decreasing.

I take a subset of Ω, called the "training subset", chosen at random so that it should reflect the characteristics of Ω in distribution and in other geometric measures. In the image the subset is Ω' = {$x_1,...,x_{18}$}, just to explain it with an intuitive drawing.

Now, what I would like is to find the "best" curve fitting the data and, if possible, two confidence curves, like the orange and the green ones in the image, expressing some percentile confidence that the true curve lies between them.

I see some different issues:
1. What does "best" curve mean? If we have many data points with overlapping confidence intervals in the variable (as between $x_5$ and $x_{15}$), I take it to mean the most probable value; the more points I add to Ω', the more "precise" I expect the estimated curve to be.
2. Because there are errors in the variable, does it make sense to evaluate $f$ only at a finite number of points with some algorithm?
3. Where points are dense, I expect almost to see the distribution of errors, and the two confidence curves to be closer to the "best" curve; where points are sparse the curve should be highly imprecise (confidence curves far apart). Should these two cases be treated separately?
4. The monotonicity condition on $f$ implies some limitation on the confidence curves even where the points are not dense, as I expect the confidence curves to be monotone too (am I right?). But what does this mean for the construction of the "best" curve fitting the data?
5. The errors in the data can be considered random Gaussian errors, with σ to be determined. While the error distribution in the value can be estimated within a dense interval, how can I estimate the error distribution in the variable?

I would really like to understand how to approach this kind of problem, even with some computational algorithm, but I have difficulty finding the proper terms to search for, or a hint showing me where to start.

Stephen Tashi
First, let's consider the problem of fitting curves that are non-decreasing. One search phrase for this is "monotone approximation".

There are families of functions that are non-decreasing. For example, in the real numbers the square of a number is never negative. So if $g(s)$ is any real-valued function then $h(s) = (g(s))^2$ is non-negative. The integral of a non-negative function is a non-decreasing function of the upper limit of integration. So $f(x) = \int_0^x h(s)\, ds$ is a non-decreasing function of $x$.

So you can define a family of non-decreasing curves by setting $g(s)$ to be some family of functions defined by parameters, such as $g(s) = As + B$. Then you can work out $h(s)$ and do the integration. You'll get a family of cubic polynomials, here $f(x) = \frac{A^2}{3}x^3 + ABx^2 + B^2x$, but it won't be as general as the family of all cubic polynomials.

To fit such a family to data by least squares, you need a software package that lets you specify the family. I think such packages exist, but I'm not familiar with current curve fitting programs.
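As a rough sketch of what such a fit could look like, here is the $g(s) = As + B$ family fitted with SciPy's `curve_fit`. The additive constant `C`, the function name `monotone_cubic`, and the data are assumptions made up for illustration; note that plain least squares treats $x$ as exact, so this does not address the errors in the variable.

```python
import numpy as np
from scipy.optimize import curve_fit

def monotone_cubic(x, A, B, C):
    """f(x) = C + integral_0^x (A*s + B)**2 ds, written in closed form."""
    x = np.asarray(x, dtype=float)
    return C + (A**2 / 3.0) * x**3 + A * B * x**2 + B**2 * x

# Hypothetical data, invented for illustration only.
rng = np.random.default_rng(0)
x_data = np.linspace(0.0, 3.0, 25)
y_data = monotone_cubic(x_data, 0.8, 0.5, 1.0) + rng.normal(0.0, 0.2, x_data.size)

# Ordinary least squares: only y is treated as noisy here.
params, _ = curve_fit(monotone_cubic, x_data, y_data, p0=[1.0, 1.0, 0.0])

# The fitted curve is non-decreasing for any parameter values,
# because its derivative is (A*x + B)**2 >= 0.
x_grid = np.linspace(-2.0, 5.0, 200)
y_grid = monotone_cubic(x_grid, *params)
```

The signs of `A` and `B` are not identifiable, since $(A, B)$ and $(-A, -B)$ give the same curve, so only the fitted function should be interpreted, not the individual parameters.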

If instead of a single function, you want to model the data by splines, there are also methods of creating non-decreasing spline functions. In fact, "non-decreasing spline" might turn out to be a better search phrase than "monotone approximation" because "monotone approximation" can be a very abstract topic.

Thank you Stephen,
one thing that I would like to understand is how to find this "best" curve without a curve family to fit to. Otherwise every choice of family will impose the specifics of the family rather than of the data being fit (I don't need the curve to be smooth; for example, fitting with a non-decreasing polynomial will introduce artifacts, and so on).

Imagine that there were no "errors", and that I just had the points $f(x_i)$ for $x_i \in Ω'$. Then I believe a good approximation is the piecewise-linear function passing through the points $(x_i,f(x_i))$. Why not? To prefer another approximation (splines or others), I would need some more information telling me why I should prefer the spline to the piecewise-linear function, am I right?

Now, if the point x is not a point but a probability distribution over the position, it is as if I am spreading the value $f(x)$ over an interval (in high school it was like $[x_i-\epsilon, x_i+\epsilon]$). Let's assume that the measurement errors in the variable x are random, with a Gaussian distribution. What does this information tell us for a generic x in the interval? For example, in the interval $[x_5,x_{15}]$ I expect that the curve is not like a sawtooth, because the error in x does not allow that kind of resolution (having errors in the variable seems to me a bit like filtering out the high frequencies of a Fourier spectrum).

Maybe the issue of conditions on f (like non-decreasing, but there can be others in similar problems) can be addressed later, once it is understood how to deal with the problem without conditions. For example, in the drawing $f(x_5)>f(x_6)$, but this is normal because we also have errors in the value (in high school it was like $[f(x_i)-\epsilon_f, f(x_i)+\epsilon_f]$); let's again assume Gaussian. How do I combine the condition (non-decreasing f) with the distributions in value and in position?
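For the errors-in-value part alone, one standard way to reconcile observations such as $f(x_5)>f(x_6)$ with a non-decreasing $f$ is isotonic regression, computed by the pool-adjacent-violators algorithm. The sketch below assumes the points are already sorted by $x$, that $x$ is exact, and that "best" means least squares; the function name `isotonic_fit` is made up for illustration.

```python
def isotonic_fit(y, w=None):
    """Least-squares non-decreasing fit to y (ordered by x), via pool-adjacent-violators."""
    n = len(y)
    if w is None:
        w = [1.0] * n
    # Each block stores [weighted mean, total weight, number of points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Merge backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    # Expand the blocks back to one fitted value per observation.
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted

# The out-of-order pair (3, 2) is replaced by its average.
print(isotonic_fit([1, 3, 2, 4]))  # [1.0, 2.5, 2.5, 4.0]
```

The result is a step function defined only at the observed points, so it says nothing by itself about values between them or about errors in $x$.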

Stephen Tashi
> one thing that I would like to understand is how to find this "best" curve without a curve family to fit to. Otherwise every choice of family will impose the specifics of the family rather than of the data being fit

If your data has n distinct values of x and you insist on treating each value of x in isolation, then there is no justification for creating a curve at values that are not in the data. There is no justification for estimating the value y at each x and then joining the points (x,y) with a straight line.

Unless you only want to estimate y with a function at exactly the values of x that are in the data, you must make some assumption that establishes a relation between values of x that are in the data and those that are not. For example, if you use Fourier analysis, you are using a family of functions, namely those that can be represented as Fourier series. Each term in a Fourier series is a periodic function. When you estimate the coefficient of that term, you are letting the data at each value of x affect the estimate at all other values of x.
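A small numerical sketch of that last point, assuming a truncated Fourier series fitted by least squares with NumPy. The helper names, the period, and the data are all made up for illustration.

```python
import numpy as np

def fourier_design(x, n_terms, period):
    """Columns: 1, cos(w1 x), sin(w1 x), ..., cos(wn x), sin(wn x)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    cols = [np.ones_like(x)]
    for k in range(1, n_terms + 1):
        w = 2.0 * np.pi * k / period
        cols.append(np.cos(w * x))
        cols.append(np.sin(w * x))
    return np.column_stack(cols)

def fit_fourier(x, y, n_terms, period):
    """Least-squares coefficients of the truncated Fourier series."""
    A = fourier_design(x, n_terms, period)
    coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return coef

def eval_fourier(coef, x, period):
    n_terms = (len(coef) - 1) // 2
    return fourier_design(x, n_terms, period) @ coef

# Hypothetical data: 20 equally spaced samples of 1 + 2*sin(2*pi*x).
x = np.linspace(0.0, 1.0, 20, endpoint=False)
y = 1.0 + 2.0 * np.sin(2.0 * np.pi * x)
coef = fit_fourier(x, y, n_terms=2, period=1.0)

# Perturb a single observation (the one at x = 0) and refit.
y_perturbed = y.copy()
y_perturbed[0] += 1.0
coef_perturbed = fit_fourier(x, y_perturbed, n_terms=2, period=1.0)

# The estimate at x = 0.5, far from the perturbed point, moves as well.
shift = eval_fourier(coef_perturbed, [0.5], 1.0) - eval_fourier(coef, [0.5], 1.0)
```

Here `shift` is non-zero even though only the observation at $x = 0$ changed, which is the sense in which choosing the family ties together the estimates at different values of x.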

> Imagine that there were no "errors", and that I just had the points $f(x_i)$ for $x_i \in Ω'$. Then I believe a good approximation is the piecewise-linear function passing through the points $(x_i,f(x_i))$. Why not?

A better question is "why?". You can believe what you want, but such a belief is not a mathematical justification.

> To prefer another approximation (splines or others), I would need some more information telling me why I should prefer the spline to the piecewise-linear function, am I right?

Essentially no. You need "more information" even to justify a piecewise linear approximation. The "more information" you need consists of what model you assume for how the data is generated, including the errors in measurement. If you don't have such information, you are asking a question that has insufficient information to give a definite mathematical answer. It's like asking to find the sides and angles of a triangle when given only one side and one angle.

> Now, if the point x is not a point but a probability distribution over the position, it is as if I am spreading the value $f(x)$ over an interval (in high school it was like $[x_i-\epsilon, x_i+\epsilon]$). Let's assume that the measurement errors in the variable x are random, with a Gaussian distribution. What does this information tell us for a generic x in the interval? For example, in the interval $[x_5,x_{15}]$ I expect that the curve is not like a sawtooth, because the error in x does not allow that kind of resolution (having errors in the variable seems to me a bit like filtering out the high frequencies of a Fourier spectrum).

To justify a particular filtering method, you still need assumptions or information about how the data is generated. You seem to be afraid to make specific assumptions about this and you are hoping that "math" can provide a specific answer. "Math" doesn't have an answer unless it has more to go on.
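As an illustration of how one specific assumption leads to one specific filter (the assumption is chosen here for the example, it is not given in the thread): suppose the recorded position is $x + u$ with $u \sim N(0, \sigma_x^2)$ independent of everything else. Averaging a single frequency component of the curve over the position error then gives

$$
E\big[\sin(\omega (x + u))\big] = e^{-\omega^2 \sigma_x^2 / 2}\,\sin(\omega x)
$$

Under that model each frequency $\omega$ is damped by the factor $e^{-\omega^2 \sigma_x^2 / 2}$, so the position error acts like a Gaussian low-pass filter whose strength is set by $\sigma_x$. A different error model would give a different filter.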

Hi Stephen,
you are right, and your words were very useful to me. I spent most of the last few days playing with these concepts and finding very interesting things on the topic.

What I would like in this specific problem is not to limit the family of functions to a subspace of very few dimensions (as in the case of polynomials of degree n), but instead to work with a bigger family (like continuous functions) and to have the assumptions come from other sources (like how $(x_i,f(x_i))$ influences the nearby values $(x,f(x))$, or limits on the derivatives $f'$, $f''$, ...).

I've found something under the name "Bayesian non-parametric models" and tried to play a bit with some ideas, but I still don't have a clear view of the topic at all.
Do you have any good ideas about where I can continue searching, or where to get a good summary of the topic? (Of course, I'm not asking you to do the work, but if you already know the answer and have time...)
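One common example of such a model is Gaussian process regression, where a kernel encodes how $(x_i,f(x_i))$ influences nearby values. The sketch below is a minimal NumPy version under strong assumptions: a zero-mean prior, a squared-exponential kernel, Gaussian noise in the value only, and no monotonicity constraint. The function names, hyperparameter values, and data are made up for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale, amplitude):
    """Squared-exponential covariance between two sets of 1-D points."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return amplitude**2 * np.exp(-0.5 * ((a - b) / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_new, length_scale=1.0, amplitude=1.0, noise_sd=0.1):
    """Posterior mean and standard deviation of f at x_new."""
    K = rbf_kernel(x_train, x_train, length_scale, amplitude)
    K += noise_sd**2 * np.eye(len(x_train))      # Gaussian noise in the value
    K_s = rbf_kernel(x_new, x_train, length_scale, amplitude)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, np.asarray(y_train, dtype=float)))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = amplitude**2 - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Hypothetical data: dense around x = 1, sparse elsewhere.
x_train = np.array([0.0, 0.9, 1.0, 1.1, 1.2, 4.0])
y_train = np.array([0.1, 0.8, 1.0, 0.9, 1.1, 2.0])
x_new = np.array([1.05, 2.6, 10.0])
mean, sd = gp_posterior(x_train, y_train, x_new, length_scale=0.5)

# Approximate 95% band for f, analogous to the two confidence curves.
lower, upper = mean - 1.96 * sd, mean + 1.96 * sd
```

With these settings the band is narrow where the points are dense (near $x = 1.05$) and widens in the gap and far from the data, which matches the behaviour expected of the confidence curves. Handling errors in $x$ or enforcing a non-decreasing $f$ would need extensions beyond this sketch.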

Stephen Tashi