# Is there an optimal distance between measurements for regression

Suppose I am trying to approximate a function which I do not know, but I can measure. Each measurement takes a lot of effort.

Say the function I am approximating is ##y=f(x)## and ##x \in [0,100]##

Suppose I know the expectation and variance of ##f(x)##.
Is there a way to compute the confidence of the regression approximation as a function of sample size?
Is there a maximum number of useful measurements? For example, if I take 20 measurements between ##x=1## and ##x=2##, and 1 measurement at ##x=20##, another measurement at ##x=2## won't really tell me any new information, right? So shouldn't there be a maximum number of measurements, or a minimum distance between measurements, based on the standard deviation of ##f(x)##?

Last edited:

berkeman
Mentor
I'm not sure this will help, but for time-dependent functions, the sampling frequency does have a minimum:

https://en.wikipedia.org/wiki/Nyquist_rate

You need to sample at a rate greater than or equal to the Nyquist rate, or you lose information. Does that help at all?

EDIT / ADD -- Do you have any idea how quickly this function can vary with x? If it is something like a Gaussian distribution, then the Nyquist rate probably does not apply. If it is a quickly varying function of x, it might still apply.
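As a quick numerical illustration of what "sampling below the Nyquist rate loses information" means (the frequencies here are made up just to show the effect): a 3 Hz sine sampled at only 4 Hz produces exactly the same samples as a sign-flipped 1 Hz sine, so the two signals are indistinguishable from the samples alone.

```python
import numpy as np

# A 3 Hz sine sampled at 4 Hz (below the 6 Hz Nyquist rate for this signal)
fs = 4.0                # sampling rate in Hz
n = np.arange(16)       # sample indices
t = n / fs              # sample times

samples_3hz = np.sin(2 * np.pi * 3 * t)      # true 3 Hz signal
samples_alias = -np.sin(2 * np.pi * 1 * t)   # aliased 1 Hz signal (sign-flipped)

# The two sets of samples are identical: sampling too slowly loses information
print(np.allclose(samples_3hz, samples_alias))  # True
```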

Last edited:
WWGD and fahraynk
FactChecker
Gold Member
Suppose I am trying to approximate a function which I do not know, but I can measure. Each measurement takes a lot of effort.

Say the function I am approximating is ##y=f(x)## and ##x \in [0,100]##

Supose I know the expectation and variance of ##f(x)##.
If you already know the expectation and variance, what more do you expect to get from a regression?
It seems like there is no point in doing a regression. On the other hand, if you don't know those, any estimate of the confidence in the regression estimate would depend on the data collected. If each sample is expensive, you should try to perform regressions as you get each new data point and stop when the confidence is good enough. The regression algorithms will calculate virtually anything you need to know.
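The sequential idea above can be sketched in a few lines. This is a toy model, not anyone's real experiment: `measure` stands in for one expensive measurement of a hypothetical linear subject (true slope 2, Gaussian noise), and we stop as soon as the standard error of the fitted slope is small enough.

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_se(x, y):
    """OLS slope estimate and its standard error for y = a*x + b + noise."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    xm, ym = x.mean(), y.mean()
    sxx = np.sum((x - xm) ** 2)
    a = np.sum((x - xm) * (y - ym)) / sxx
    b = ym - a * xm
    resid = y - (a * x + b)
    s2 = np.sum(resid ** 2) / (n - 2)   # residual variance estimate
    return a, np.sqrt(s2 / sxx)

# Pretend each call to `measure` is one expensive experiment
# (toy model: true slope 2, intercept 1, Gaussian noise).
def measure(x):
    return 2.0 * x + 1.0 + rng.normal(0, 5.0)

xs, ys = [], []
for x in np.linspace(0, 100, 200):
    xs.append(x)
    ys.append(measure(x))
    if len(xs) >= 10:
        a, se = slope_se(xs, ys)
        if se < 0.05:                    # stop once the slope is pinned down
            break

print(f"stopped after {len(xs)} points, slope ≈ {a:.2f} ± {se:.2f}")
```

The stopping threshold (here 0.05 on the slope's standard error) is where the "good enough" judgment enters; in practice one would pick it from the accuracy the application actually needs.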

fahraynk
If you already know the expectation and variance, what more do you expect to get from a regression?
It seems like there is no point in doing a regression. On the other hand, if you don't know those, any estimate of the confidence in the regression estimate would depend on the data collected. If each sample is expensive, you should try to perform regressions as you get each new data point and stop when the confidence is good enough. The regression algorithms will calculate virtually anything you need to know.

The initial standard deviation and expectation are basically the best guess without any measurements, and a very large standard deviation. The measurements are to approximate the function with an accurate expectation and a low standard deviation.

I am just trying to figure out how to optimally spread out my measurements, based on an initial expectation and variance (a rough guess which is normally distributed).

I'm not sure this will help, but for time-dependent functions, the sampling frequency does have a minimum:

https://en.wikipedia.org/wiki/Nyquist_rate

You need to sample at a rate greater than or equal to the Nyquist rate, or you lose information. Does that help at all?

EDIT / ADD -- Do you have any idea how quickly this function can vary with x? If it is something like a Gaussian distribution, then the Nyquist rate probably does not apply. If it is a quickly varying function of x, it might still apply.

Hey, thanks, it's not time-dependent. You brought back some memories of studying modulators, though.

FactChecker
Gold Member
I have never studied that question, but I bet that the best placement of sample points for a linear regression of a subject of the form ##Y=ax+b+\mathcal{N}(\mu,\,\sigma^{2})## is to put half of the data points at each end of the range of ##x##. That should give you the best estimate of the slope, but it will give you no way to verify that the subject really is linear.

fahraynk
I have never studied that question, but I bet that the best placement of sample points for a linear regression of a subject of the form ##Y=ax+b+\mathcal{N}(\mu,\,\sigma^{2})## is to put half of the data points at each end of the range of ##x##. That should give you the best estimate of the slope, but it will give you no way to verify that the subject really is linear.
What would be the reasoning behind it? To keep the confidence interval as narrow as possible, we should be minimizing the coefficient of variation of the regression's estimator (the regression coefficient estimator's standard deviation divided by the coefficient estimate), right?

Last edited:
WWGD
Gold Member
Isn't there a general formula for the confidence interval for ##a## in terms of the data pairs? Maybe we can see if the interval contains 0? I guess it depends on the sampling distribution of the slope.

FactChecker
Gold Member
What would be the reasoning behind it? To keep the confidence interval as narrow as possible, we should be minimizing the coefficient of variation of the regression's estimator (the regression coefficient estimator's standard deviation divided by the coefficient estimate), right?
Suppose you only had two sample values. The wider the separation of their x values, the less the random component changes the estimated slope. So one will get the best slope estimate if the separation is maximized. Adding more sample points in pairs leads one to also position them at the extremes.
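This can be made quantitative (a sketch, assuming the simple linear model with noise variance ##\sigma^2##): the variance of the OLS slope estimator is ##\sigma^2/\sum_i (x_i-\bar x)^2##, so it is minimized by spreading the ##x_i## as far from their mean as possible. Comparing an endpoint design against a uniform grid on ##[0,100]##:

```python
import numpy as np

def slope_variance(x, sigma2=1.0):
    """Variance of the OLS slope estimate: sigma^2 / sum((x - mean(x))^2)."""
    x = np.asarray(x, float)
    return sigma2 / np.sum((x - x.mean()) ** 2)

n = 20
uniform = np.linspace(0, 100, n)                             # spread over [0, 100]
endpoints = np.array([0.0] * (n // 2) + [100.0] * (n // 2))  # half at each end

# The endpoint design gives a strictly smaller slope variance
print(slope_variance(uniform), slope_variance(endpoints))
```

For the endpoint design every point sits 50 units from the mean, so the sum is ##20 \times 2500 = 50000##, roughly 2.7 times larger than for the uniform grid, hence the smaller slope variance.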

Tosh5457
Suppose you only had two sample values. The wider the separation of their x values, the less the random component changes the estimated slope. So one will get the best slope estimate if the separation is maximized. Adding more sample points in pairs leads one to also position them at the extremes.

Is there any way to calculate this?

WWGD
Gold Member
Suppose I am trying to approximate a function which I do not know, but I can measure. Each measurement takes a lot of effort.

Say the function I am approximating is ##y=f(x)## and ##x \in [0,100]##

Suppose I know the expectation and variance of ##f(x)##.
Is there a way to compute the confidence of the regression approximation as a function of sample size?
Is there a maximum number of useful measurements? For example, if I take 20 measurements between ##x=1## and ##x=2##, and 1 measurement at ##x=20##, another measurement at ##x=2## won't really tell me any new information, right? So shouldn't there be a maximum number of measurements, or a minimum distance between measurements, based on the standard deviation of ##f(x)##?
I am just curious: do you know the distribution of the response ##y##, and whether there is a significant linear relationship between the variables?

I am just curious: do you know the distribution of the response ##y##, and whether there is a significant linear relationship between the variables?

In general, the higher ##X##, the higher ##Y##, but it's nonlinear.
I read that two ##x## coordinates don't give new information unless they are spaced apart by a distance $$\frac{1}{2\pi \sigma}$$ where ##\sigma## is my predicted standard deviation of ##x##, but I don't want to trust that unless I have an explanation, because intuitively the expectation of ##f(x)## should improve at a point even if I take ##10## measurements at the same point, right?
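On the intuition in the last sentence: repeated measurements at a single ##x## do keep improving the estimate of ##f(x)## at that point, since the standard deviation of the sample mean of ##n## i.i.d. measurements falls as ##\sigma/\sqrt{n}##. A quick check (toy numbers, assuming i.i.d. Gaussian measurement noise):

```python
import numpy as np

rng = np.random.default_rng(42)

sigma = 5.0        # measurement noise standard deviation
true_value = 3.0   # unknown f(x) at the point being measured
trials = 20000

# Standard deviation of the sample mean of n repeated measurements
for n in (1, 10, 100):
    means = true_value + rng.normal(0, sigma, size=(trials, n)).mean(axis=1)
    print(n, means.std(), sigma / np.sqrt(n))   # empirical vs sigma/sqrt(n)
```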

FactChecker
I read that two ##x## coordinates don't give new information unless they are spaced apart by a distance $$\frac{1}{2\pi \sigma}$$ where ##\sigma## is my predicted standard deviation of ##x##, but I don't want to trust that unless I have an explanation, because intuitively the expectation of ##f(x)## should improve at a point even if I take ##10## measurements at the same point, right?