Is there an optimal distance between measurements for regression

Summary:
The discussion revolves around the optimal spacing of measurements for approximating an unknown function, particularly in the context of regression analysis. It highlights the importance of knowing the expectation and variance of the function, questioning the utility of regression if these parameters are already known. Participants explore the concept of maximizing the distance between sample points to improve slope estimates and minimize confidence intervals. The Nyquist rate is referenced as a principle for sampling frequency, although its applicability to non-time-dependent functions is debated. Ultimately, the conversation emphasizes the need for strategic measurement placement to enhance the accuracy of function approximation while considering the costs associated with data collection.
fahraynk
Suppose I am trying to approximate a function which I do not know, but I can measure. Each measurement takes a lot of effort.

Say the function I am approximating is ##y=f(x)## and ##x \in [0,100]##

Suppose I know the expectation and variance of ##f(x)##.
Is there a way to compute the confidence in regression approximation as a function of sample size?
Is there a maximum number of useful measurements? For example, if I take 20 measurements between ##x=1## and ##x=2##, and 1 measurement at ##x=20##, another measurement at ##x=2## won't really tell me much new information, right? So shouldn't there be a maximum number of measurements, or a minimum distance between measurements, based on the standard deviation of ##f(x)##?
 
I'm not sure this will help, but for time-dependent functions, the sampling frequency does have a minimum:

https://en.wikipedia.org/wiki/Nyquist_rate

You need to sample at a rate greater than or equal to the Nyquist rate, or you lose information. Does that help at all?

EDIT / ADD -- Do you have any idea how quickly this function can vary with x? If it is something like a Gaussian distribution, then the Nyquist rate probably does not apply. If it is a quickly varying function of x, it might still apply.
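As a quick illustration of that point (my own sketch; the 5 Hz test signal and sample rates are arbitrary, not tied to the OP's function): sampling a sine below its Nyquist rate makes it show up at a false, lower frequency.

```python
import numpy as np

f_sig = 5.0                        # test signal frequency in Hz (Nyquist rate = 10 Hz)
for f_s in (8.0, 20.0):            # one rate below, one above the Nyquist rate
    t = np.arange(0, 2, 1 / f_s)   # 2 seconds of samples
    x = np.sin(2 * np.pi * f_sig * t)
    # Estimate the dominant frequency of the sampled data via the FFT.
    freqs = np.fft.rfftfreq(len(x), 1 / f_s)
    f_est = freqs[np.argmax(np.abs(np.fft.rfft(x)))]
    print(f"f_s = {f_s:4.0f} Hz -> apparent frequency {f_est:.1f} Hz")
```

At 8 Hz the 5 Hz sine aliases to an apparent 3 Hz; at 20 Hz it is recovered correctly.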
 
fahraynk said:
Suppose I am trying to approximate a function which I do not know, but I can measure. Each measurement takes a lot of effort.

Say the function I am approximating is ##y=f(x)## and ##x \in [0,100]##

Suppose I know the expectation and variance of ##f(x)##.
If you already know the expectation and variance, what more do you expect to get from a regression?
It seems like there is no point in doing a regression. On the other hand, if you don't know those, any estimate of the confidence in the regression estimate would depend on the data collected. If each sample is expensive, you should try to perform regressions as you get each new data point and stop when the confidence is good enough. The regression algorithms will calculate virtually anything you need to know.
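The stop-when-confident strategy can be sketched with ordinary least squares, stopping once the approximate 95% confidence half-width on the slope is small enough. Everything below (the subject function, noise level, candidate locations, and threshold) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def measure(x):
    """Stand-in for one expensive measurement of the unknown subject
    (hypothetical: true line 2x + 1 plus Gaussian noise)."""
    return 2.0 * x + 1.0 + rng.normal(0.0, 5.0)

xs, ys = [], []
target_halfwidth = 0.5          # stop when the 95% CI half-width on the slope is below this
for x in np.linspace(0, 100, 50):       # candidate sample locations
    xs.append(x)
    ys.append(measure(x))
    n = len(xs)
    if n < 6:                   # need a few points before the error estimate is trustworthy
        continue
    X, Y = np.array(xs), np.array(ys)
    slope, intercept = np.polyfit(X, Y, 1)
    resid = Y - (slope * X + intercept)
    s2 = resid @ resid / (n - 2)                          # residual variance estimate
    se_slope = np.sqrt(s2 / ((X - X.mean()) ** 2).sum())  # standard error of the slope
    if 1.96 * se_slope < target_halfwidth:
        print(f"stopped after {n} points: slope = {slope:.3f} +/- {1.96 * se_slope:.3f}")
        break
```

The key quantity is the slope's standard error ##s/\sqrt{\sum_i (x_i - \bar x)^2}##, which any regression package reports directly.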
 
FactChecker said:
If you already know the expectation and variance, what more do you expect to get from a regression?
It seems like there is no point in doing a regression. On the other hand, if you don't know those, any estimate of the confidence in the regression estimate would depend on the data collected. If each sample is expensive, you should try to perform regressions as you get each new data point and stop when the confidence is good enough. The regression algorithms will calculate virtually anything you need to know.

The initial expectation and standard deviation are basically the best guess without any measurements, and that standard deviation is very large. The measurements are meant to approximate the function with an accurate expectation and a low standard deviation.

I am just trying to figure out how to optimally spread out my measurements, based on an initial expectation and variance (a rough guess which is normally distributed).
 
berkeman said:
I'm not sure this will help, but for time-dependent functions, the sampling frequency does have a minimum:

https://en.wikipedia.org/wiki/Nyquist_rate

You need to sample at a rate greater than or equal to the Nyquist rate, or you lose information. Does that help at all?

EDIT / ADD -- Do you have any idea how quickly this function can vary with x? If it is something like a Gaussian distribution, then the Nyquist rate probably does not apply. If it is a quickly varying function of x, it might still apply.

Hey, thanks, it's not time-dependent. You brought back some memories of studying modulators, though.
 
I have never studied that question, but I bet that the best sample placement for a linear regression of a subject of the form ##Y=ax+b+\mathcal{N}(\mu,\,\sigma^{2})## is to put half of the data points at each end of the range of ##x##. That should give you the best estimate of the slope, but it will give you no way to verify that the subject really is linear.
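This bet can be checked directly: for ordinary least squares, the slope variance is ##\sigma^2/\sum_i (x_i-\bar x)^2##, so for a fixed number of points the design that spreads the ##x## values the widest minimizes it. A quick sketch (the range and point count are chosen for illustration):

```python
import numpy as np

# For OLS on y = a*x + b + noise, Var(a_hat) = sigma^2 / sum((x_i - x_bar)^2),
# so maximizing the spread of the x values minimizes the slope variance.
n = 10
uniform = np.linspace(0, 100, n)                               # evenly spaced over [0, 100]
endpoints = np.array([0.0] * (n // 2) + [100.0] * (n // 2))    # half at each end

def spread(x):
    return ((x - x.mean()) ** 2).sum()

# Ratio of the two slope variances (the noise variance cancels out):
print(spread(endpoints) / spread(uniform))   # ≈ 2.45: the endpoint design wins
```

So with ten points, the endpoint design gives a slope variance about 2.45 times smaller than even spacing — at the cost, as noted, of being unable to check linearity.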
 
FactChecker said:
I have never studied that question, but I bet that the best sample placement for a linear regression of a subject of the form ##Y=ax+b+\mathcal{N}(\mu,\,\sigma^{2})## is to put half of the data points at each end of the range of ##x##. That should give you the best estimate of the slope, but it will give you no way to verify that the subject really is linear.
What would be the reasoning behind that? To keep the confidence interval as narrow as possible, we should be minimizing the coefficient of variation of the regression's estimator (the coefficient estimator's standard deviation divided by the coefficient estimate), right?
 
Isn't there a general formula for the confidence interval for ##a## in terms of the data pairs? Maybe we can check whether the interval contains 0? I guess it depends on the sampling distribution of the slope.
 
Tosh5457 said:
What would be the reasoning behind that? To keep the confidence interval as narrow as possible, we should be minimizing the coefficient of variation of the regression's estimator (the coefficient estimator's standard deviation divided by the coefficient estimate), right?
Suppose you only had two sample values. The wider the separation of their x values, the less the random component changes the estimated slope. So one will get the best slope estimate if the separation is maximized. Adding more sample points in pairs leads one to also position them at the extremes.
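The two-point argument can be checked numerically: with ##\hat a = (y_2-y_1)/(x_2-x_1)## and independent noise of variance ##\sigma^2## on each ##y##, the slope estimate has standard deviation ##\sigma\sqrt{2}/(x_2-x_1)##, shrinking directly with the separation. A quick simulation (noise level and separations are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, trials = 1.0, 20_000

# Two noisy samples of a line with true slope 1; the slope estimate is
# (y2 - y1) / (x2 - x1), with standard deviation sigma * sqrt(2) / (x2 - x1).
for sep in (1.0, 10.0, 100.0):
    y1 = 0.0 + rng.normal(0.0, sigma, trials)      # sample at x = 0 (true y = 0)
    y2 = sep + rng.normal(0.0, sigma, trials)      # sample at x = sep (true y = sep)
    slopes = (y2 - y1) / sep
    print(f"separation {sep:6.1f}: slope std {slopes.std():.4f} "
          f"(theory {sigma * np.sqrt(2) / sep:.4f})")
```

Widening the separation by a factor of 100 cuts the scatter of the slope estimate by the same factor.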
 
FactChecker said:
Suppose you only had two sample values. The wider the separation of their x values, the less the random component changes the estimated slope. So one will get the best slope estimate if the separation is maximized. Adding more sample points in pairs leads one to also position them at the extremes.

Is there any way to calculate this?
 
fahraynk said:
Suppose I am trying to approximate a function which I do not know, but I can measure. Each measurement takes a lot of effort.

Say the function I am approximating is ##y=f(x)## and ##x \in [0,100]##

Suppose I know the expectation and variance of ##f(x)##.
Is there a way to compute the confidence in regression approximation as a function of sample size?
Is there a maximum number of useful measurements? For example, if I take 20 measurements between ##x=1## and ##x=2##, and 1 measurement at ##x=20##, another measurement at ##x=2## won't really tell me much new information, right? So shouldn't there be a maximum number of measurements, or a minimum distance between measurements, based on the standard deviation of ##f(x)##?
I am just curious: do you know the distribution of the response ##y##, and whether there is a significant linear relationship between the variables?
 
WWGD said:
I am just curious: do you know the distribution of the response ##y##, and whether there is a significant linear relationship between the variables?

In general, the higher ##X##, the higher ##Y##, but it's nonlinear.
I read that two ##x## coordinates don't give new information unless they are spaced apart by a distance $$\frac{1}{2\pi \sigma}$$ where ##\sigma## is my predicted standard deviation of ##x##. But I don't want to trust that without an explanation, because intuitively the expectation of ##f(x)## at a point should improve even if I take ##10## measurements at that same point, right?
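On the last point, the intuition is right for the value at a single location: averaging ##n## repeated measurements at the same ##x## shrinks the standard error of the estimated ##f(x)## there like ##\sigma/\sqrt{n}##. A small sketch (the noise level and true value are made up; this says nothing about the spacing claim itself):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, true_value = 5.0, 3.0    # hypothetical noise level and true f(x) at one point

# Averaging n repeated measurements at the same x has standard error
# sigma / sqrt(n), so repeats at a single point keep adding information
# about the value there (though nothing about the shape elsewhere).
for n in (1, 10, 100):
    means = rng.normal(true_value, sigma, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:3d}: empirical std {means.std():.3f} (theory {sigma / np.sqrt(n):.3f})")
```

So repeats at one point do improve the estimate there, but only spreading points out constrains the function's shape across the interval.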
 
fahraynk said:
Is there any way to calculate this?
You say your function is nonlinear, so what I said is not applicable and would probably be a very bad thing to do.
 
fahraynk said:
In general, the higher ##X##, the higher ##Y##, but it's nonlinear.
I read that two ##x## coordinates don't give new information unless they are spaced apart by a distance $$\frac{1}{2\pi \sigma}$$ where ##\sigma## is my predicted standard deviation of ##x##. But I don't want to trust that without an explanation, because intuitively the expectation of ##f(x)## at a point should improve even if I take ##10## measurements at that same point, right?
I am sorry, I don't understand. Do you mean the function is constant on intervals of length ##\frac{1}{2 \pi \sigma}##? Is ##\sigma## known to be small, so that the ratio is large, or is ##\sigma## more towards being large? I assume maybe ##f## takes some jumps, which would explain that? As I understand it, we would also need to know the type of regression being used.
 
