Is there an optimal distance between measurements for regression?

In summary, the conversation discusses approximating an unknown but measurable function given its expectation and variance. It asks how to compute the confidence of a regression approximation as a function of sample size, and whether there is a maximum number of useful measurements; the Nyquist rate also comes up. Finally, there is a discussion of the best sample locations for a linear regression, the reasoning behind them, and how to calculate the result.
  • #1
fahraynk
Suppose I am trying to approximate a function which I do not know, but I can measure. Each measurement takes a lot of effort.

Say the function I am approximating is ##y=f(x)## and ##x \in [0,100]##

Suppose I know the expectation and variance of ##f(x)##.
Is there a way to compute the confidence in regression approximation as a function of sample size?
Is there a maximum number of useful measurements? For example, if I take 20 measurements between ##x=1## and ##x=2##, and 1 measurement at ##x=20##, another measurement at ##x=2## won't really tell me any new information, right? So shouldn't there be a maximum number of measurements, or a minimum distance between measurements, based on the standard deviation of ##f(x)##?
 
  • #2
I'm not sure this will help, but for time-dependent functions, the sampling frequency does have a minimum:

https://en.wikipedia.org/wiki/Nyquist_rate

You need to sample at a rate greater than or equal to the Nyquist rate, or you lose information. Does that help at all?

EDIT / ADD -- Do you have any idea how quickly this function can vary with x? If it is something like a Gaussian distribution, then the Nyquist rate probably does not apply. If it is a quickly varying function of x, it might still apply.
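Not from the thread, but the point about losing information below the Nyquist rate can be illustrated with a toy aliasing check (the 9 Hz / 1 Hz pair and the 10 Hz sampling rate are made-up example values):

```python
import math

# Sampling below the Nyquist rate makes two different frequencies
# indistinguishable: a 9 Hz sine sampled at 10 Hz produces exactly
# the same samples as a negated 1 Hz sine.
fs = 10.0                                   # sampling rate in Hz
for n in range(20):
    t = n / fs
    hi = math.sin(2 * math.pi * 9.0 * t)    # 9 Hz, above fs/2 = 5 Hz
    lo = -math.sin(2 * math.pi * 1.0 * t)   # its 1 Hz alias
    assert abs(hi - lo) < 1e-9
print("9 Hz and 1 Hz alias at fs = 10 Hz")
```

From the samples alone there is no way to tell the two signals apart, which is exactly the information loss the Nyquist criterion warns about.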
 
  • #3
fahraynk said:
Suppose I am trying to approximate a function which I do not know, but I can measure. Each measurement takes a lot of effort.

Say the function I am approximating is ##y=f(x)## and ##x \in [0,100]##

Suppose I know the expectation and variance of ##f(x)##.
If you already know the expectation and variance, what more do you expect to get from a regression?
It seems like there is no point in doing a regression. On the other hand, if you don't know those, any estimate of the confidence in the regression estimate would depend on the data collected. If each sample is expensive, you should try to perform regressions as you get each new data point and stop when the confidence is good enough. The regression algorithms will calculate virtually anything you need to know.
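The stop-when-confident idea can be sketched like this (a minimal sketch, not from the thread: the linear subject ##y = 2x + 5## with unit noise, the candidate ##x## values, and the 0.01 stopping threshold are all made-up illustration values):

```python
import math
import random

def fit_with_stderr(xs, ys):
    """OLS slope, intercept, and the slope's standard error."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    intercept = ybar - slope * xbar
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    stderr = math.sqrt(sse / (n - 2) / sxx)   # n - 2 degrees of freedom
    return slope, intercept, stderr

random.seed(0)
measure = lambda x: 2.0 * x + 5.0 + random.gauss(0.0, 1.0)  # hypothetical subject

xs, ys = [], []
for x in [0, 100, 25, 75, 50, 10, 90, 40, 60, 30, 70]:  # one costly point at a time
    xs.append(x)
    ys.append(measure(x))
    if len(xs) < 3:
        continue
    slope, intercept, stderr = fit_with_stderr(xs, ys)
    if stderr < 0.01:   # confidence is good enough: stop sampling
        break

print(round(slope, 2))
```

Each new point triggers a refit, and sampling stops as soon as the slope's standard error is below the chosen tolerance, which is the cheapest way to spend expensive measurements under these assumptions.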
 
  • #4
FactChecker said:
If you already know the expectation and variance, what more do you expect to get from a regression?
It seems like there is no point in doing a regression. On the other hand, if you don't know those, any estimate of the confidence in the regression estimate would depend on the data collected. If each sample is expensive, you should try to perform regressions as you get each new data point and stop when the confidence is good enough. The regression algorithms will calculate virtually anything you need to know.

The initial expectation and standard deviation are basically a best guess made before any measurements, so the standard deviation is very large. The measurements are meant to approximate the function with an accurate expectation and a low standard deviation.

I am just trying to figure out how to optimally spread out my measurements, based on an initial expectation and variance (a rough guess which is normally distributed).
 
  • #5
berkeman said:
I'm not sure this will help, but for time-dependent functions, the sampling frequency does have a minimum:

https://en.wikipedia.org/wiki/Nyquist_rate

You need to sample at a rate greater than or equal to the Nyquist rate, or you lose information. Does that help at all?

EDIT / ADD -- Do you have any idea how quickly this function can vary with x? If it is something like a Gaussian distribution, then the Nyquist rate probably does not apply. If it is a quickly varying function of x, it might still apply.

Hey, thanks, it's not time-dependent. You brought back some memories of studying modulators, though.
 
  • #6
I have never studied that question, but I bet that the best sample location for a linear regression of a subject of the form ##Y=ax+b+\mathcal{N}(\mu,\,\sigma^{2})## is to put half of the data points at each end of the range of x. That should give you the best estimate of the slope, but it will give you no way to verify that the subject really is linear.
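This bet can be checked from the OLS formula ##\operatorname{Var}(\hat a) = \sigma^2 / \sum_i (x_i - \bar x)^2##: whichever design maximizes ##\sum_i (x_i - \bar x)^2## minimizes the slope variance. A minimal sketch (the two candidate designs on ##[0,100]## with 10 points are made up for the comparison):

```python
def slope_variance_factor(xs):
    """Design-dependent factor in Var(slope) = sigma^2 / sum((x - xbar)^2)."""
    xbar = sum(xs) / len(xs)
    return 1.0 / sum((x - xbar) ** 2 for x in xs)

n = 10
half_each_end = [0.0] * (n // 2) + [100.0] * (n // 2)   # half at each end
evenly_spread = [i * 100.0 / (n - 1) for i in range(n)]  # uniform grid

# The endpoint design gives the smaller slope variance.
print(slope_variance_factor(half_each_end) < slope_variance_factor(evenly_spread))  # True
```

The endpoint design wins because every point sits at the maximum possible distance from ##\bar x##, though, as noted, it leaves no way to detect curvature.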
 
  • #7
FactChecker said:
I have never studied that question, but I bet that the best sample location for a linear regression of a subject of the form ##Y=ax+b+\mathcal{N}(\mu,\,\sigma^{2})## is to put half of the data points at each end of the range of x. That should give you the best estimate of the slope, but it will give you no way to verify that the subject really is linear.
What would be the reasoning behind it? To keep the confidence interval as narrow as possible, we should be minimizing the coefficient of variation of the regression's estimator (=regression coefficient estimator's standard deviation/coefficient estimate), right?
 
  • #8
Isn't there a general formula for the confidence interval for a in terms of the data pairs? Maybe we can see if the interval contains 0? I guess it depends on the sampling distribution of the slope.
 
  • #9
Tosh5457 said:
What would be the reasoning behind it? To keep the confidence interval as narrow as possible, we should be minimizing the coefficient of variation of the regression's estimator (=regression coefficient estimator's standard deviation/coefficient estimate), right?
Suppose you only had two sample values. The wider the separation of their x values, the less the random component changes the estimated slope. So one will get the best slope estimate if the separation is maximized. Adding more sample points in pairs leads one to also position them at the extremes.
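A quick Monte Carlo check of the two-point argument (a sketch with made-up values: true slope 2, noise SD 1, and arbitrary separations of 10 and 100):

```python
import random
import statistics

def slope_estimate_sd(separation, trials=2000, sigma=1.0):
    """Empirical SD of the two-point slope estimate; the true slope is 2."""
    rng = random.Random(1)
    slopes = []
    for _ in range(trials):
        y0 = rng.gauss(0.0, sigma)                     # noisy f(0)
        y1 = 2.0 * separation + rng.gauss(0.0, sigma)  # noisy f(separation)
        slopes.append((y1 - y0) / separation)
    return statistics.stdev(slopes)

# Wider separation -> smaller spread in the estimated slope.
print(slope_estimate_sd(100.0) < slope_estimate_sd(10.0))  # True
```

The spread falls like ##\sigma\sqrt{2}/\text{separation}##, matching the intuition that a fixed amount of vertical noise tilts a long baseline less than a short one.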
 
  • #10
FactChecker said:
Suppose you only had two sample values. The wider the separation of their x values, the less the random component changes the estimated slope. So one will get the best slope estimate if the separation is maximized. Adding more sample points in pairs leads one to also position them at the extremes.

Is there any way to calculate this?
 
  • #11
fahraynk said:
Suppose I am trying to approximate a function which I do not know, but I can measure. Each measurement takes a lot of effort.

Say the function I am approximating is ##y=f(x)## and ##x \in [0,100]##

Suppose I know the expectation and variance of ##f(x)##.
Is there a way to compute the confidence in regression approximation as a function of sample size?
Is there a maximum number of useful measurements? For example, if I take 20 measurements between ##x=1## and ##x=2##, and 1 measurement at ##x=20##, another measurement at ##x=2## won't really tell me any new information, right? So shouldn't there be a maximum number of measurements, or a minimum distance between measurements, based on the standard deviation of ##f(x)##?
I am just curious: Do you know the distribution of the response ##y##, and whether there is a significant linear relationship between the variables?
 
  • #12
WWGD said:
I am just curious: Do you know the distribution of the response ##y##, and whether there is a significant linear relationship between the variables?

In general, the higher ##X##, the higher ##Y##, but it's nonlinear.
I read that two ##x## coordinates don't give new information unless they are spaced apart by at least $$\frac{1}{2\pi \sigma}$$ where ##\sigma## is my predicted standard deviation, but I don't want to trust that without an explanation, because intuitively the expectation of ##f(x)## at a point should improve even if I take 10 measurements at the same point, right?
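On the last intuition: repeated measurements at a single ##x## do keep improving the estimate of ##f## there, since the SD of the mean of ##n## repeats falls like ##\sigma/\sqrt{n}##. A toy check (the true value 7 and noise SD 2 are made-up illustration values):

```python
import random
import statistics

def sd_of_mean(n, trials=2000, true_value=7.0, sigma=2.0):
    """Empirical SD of the mean of n repeated noisy measurements at one x."""
    rng = random.Random(2)
    means = []
    for _ in range(trials):
        samples = [true_value + rng.gauss(0.0, sigma) for _ in range(n)]
        means.append(sum(samples) / n)
    return statistics.stdev(means)

# Ten repeats cut the uncertainty by roughly sqrt(10) versus one measurement.
print(sd_of_mean(10) < sd_of_mean(1))  # True
```

So repeats at one point are never useless for estimating ##f## at that point; any minimum-spacing rule would have to come from wanting to learn the shape of ##f## elsewhere, not from diminishing returns at a single ##x##.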
 
  • #13
fahraynk said:
Is there any way to calculate this?
You say your function is nonlinear, so what I said is not applicable and would probably be a very bad thing to do.
 
  • #14
fahraynk said:
In general, the higher ##X##, the higher ##Y##, but it's nonlinear.
I read that two ##x## coordinates don't give new information unless they are spaced apart by at least $$\frac{1}{2\pi \sigma}$$ where ##\sigma## is my predicted standard deviation, but I don't want to trust that without an explanation, because intuitively the expectation of ##f(x)## at a point should improve even if I take 10 measurements at the same point, right?
I am sorry, I don't understand. Do you mean the function is constant on intervals of length ##\frac{1}{2 \pi \sigma}##? Is ##\sigma## known to be small, so the ratio is large, or is it more towards being large? I assume maybe ##f## takes some jumps, which would explain it? As I understand it, we would also need to know the type of regression.
 

1. What is the optimal distance between measurements for regression?

The optimal distance between measurements for regression is dependent on the specific data set and the type of regression being used. There is no one set distance that is considered optimal for all cases.

2. How do I determine the optimal distance between measurements for regression?

The optimal distance is typically determined through trial and error, by testing various distances and comparing the performance of the regression model. It is important to also consider the underlying assumptions of the regression model and ensure that they are being met.

3. Can the optimal distance between measurements change for different types of regression?

Yes, the optimal distance may vary for different types of regression. For example, linear regression may have a different optimal distance than logistic regression or polynomial regression.

4. Is there a rule of thumb for determining the optimal distance between measurements?

There is no set rule of thumb for determining the optimal distance between measurements for regression. However, some researchers suggest using a distance that is approximately 10% of the range of the independent variable.

5. How does the optimal distance between measurements affect the accuracy of the regression model?

The optimal distance between measurements can greatly impact the accuracy of the regression model. If the distance is too small, the model may overfit the data, leading to poor generalization. If the distance is too large, the model may oversimplify the relationship between the variables and result in underfitting. Finding the right balance is crucial for accurate regression modeling.
