Is there an optimal distance between measurements for regression

fahraynk · Jun 15, 2018

Suppose I am trying to approximate a function which I do not know, but I can measure. Each measurement takes a lot of effort.

Say the function I am approximating is ##y=f(x)## and ##x \in [0,100]##

Suppose I know the expectation and variance of ##f(x)##.
Is there a way to compute the confidence in the regression approximation as a function of sample size?
Is there a maximum number of useful measurements? For example, if I take 20 measurements between ##x=1## and ##x=2##, and 1 measurement at ##x=20##, another measurement at ##x=2## won't really give me any new information, right? So shouldn't there be a maximum number of measurements, or a minimum distance between measurements, based on the standard deviation of ##f(x)##?

berkeman · Jun 15, 2018

I'm not sure this will help, but for time-dependent functions, the sampling frequency does have a minimum:

https://en.wikipedia.org/wiki/Nyquist_rate

You need to sample at a rate greater than the Nyquist rate (twice the highest frequency present in the signal), or you lose information. Does that help at all?

EDIT / ADD -- Do you have any idea how quickly this function can vary with x? If it is something like a Gaussian distribution, then the Nyquist rate probably does not apply. If it is a quickly varying function of x, it might still apply.
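
To make the "lose information" point concrete, here is a minimal sketch of aliasing. The signal, its frequency, and the sample spacing are made-up values for illustration, not anything taken from the problem in this thread: a sine with 0.9 cycles per unit of x, sampled once per unit, produces exactly the same samples as a much slower sine.

```python
import numpy as np

# Hypothetical signal: 0.9 cycles per unit of x, so its Nyquist rate is
# 1.8 samples per unit. Sampling once per unit is therefore too slow.
x = np.arange(20)                      # sample locations with spacing 1

fast = np.sin(2 * np.pi * 0.9 * x)     # what we actually measure
slow = -np.sin(2 * np.pi * 0.1 * x)    # a slow sine that fits the same samples

# True: from these samples alone the two functions cannot be told apart.
indistinguishable = np.allclose(fast, slow)
```

So if the function can oscillate quickly in x, samples spaced too far apart can be fit perfectly by the wrong curve.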

FactChecker · Jun 15, 2018

fahraynk said:

Suppose I am trying to approximate a function which I do not know, but I can measure. Each measurement takes a lot of effort.

Say the function I am approximating is ##y=f(x)## and ##x \in [0,100]##

Suppose I know the expectation and variance of ##f(x)##.

If you already know the expectation and variance, what more do you expect to get from a regression?
It seems like there is no point in doing a regression. On the other hand, if you don't know those, any estimate of the confidence in the regression estimate would depend on the data collected. If each sample is expensive, you should try to perform regressions as you get each new data point and stop when the confidence is good enough. The regression algorithms will calculate virtually anything you need to know.
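
A rough sketch of that "refit after each new point and stop when it is good enough" loop, assuming a straight-line model and using the half-width of an approximate 95% interval for the slope as the stopping criterion. The names `measure_until_confident`, `slope_with_halfwidth`, and `noisy_line`, the tolerance, and the use of the normal quantile 1.96 instead of a t quantile are all assumptions made for illustration.

```python
import numpy as np

def slope_with_halfwidth(x, y, z=1.96):
    """OLS slope and the half-width of an approximate 95% interval for it."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    s2 = np.sum(residuals ** 2) / (x.size - 2)   # residual variance estimate
    sxx = np.sum((x - x.mean()) ** 2)            # spread of the x locations
    return slope, z * np.sqrt(s2 / sxx)

def measure_until_confident(measure, candidates, tol, min_points=4):
    """Take measurements one at a time; stop once the slope interval is narrow.

    `candidates` must contain at least two distinct x values among the
    first `min_points` entries, otherwise the fit is undefined.
    """
    xs, ys = [], []
    slope, half = None, np.inf
    for x in candidates:
        xs.append(x)
        ys.append(measure(x))                    # the expensive step
        if len(xs) >= min_points:
            slope, half = slope_with_halfwidth(xs, ys)
            if half <= tol:
                break
    return xs, ys, slope, half

# Hypothetical stand-in for the real measurement: a noisy straight line.
rng = np.random.default_rng(1)

def noisy_line(x):
    return 2.0 * x + 3.0 + rng.normal(0.0, 1.0)

candidates = [0.0, 100.0] * 20
xs, ys, slope, half = measure_until_confident(noisy_line, candidates, tol=0.05)
```

With a nonlinear function the same loop would apply, but the model fit and the confidence measure inside `slope_with_halfwidth` would have to be replaced with ones suited to that model.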

fahraynk · Jun 15, 2018

FactChecker said:

If you already know the expectation and variance, what more do you expect to get from a regression?
It seems like there is no point in doing a regression. On the other hand, if you don't know those, any estimate of the confidence in the regression estimate would depend on the data collected. If each sample is expensive, you should try to perform regressions as you get each new data point and stop when the confidence is good enough. The regression algorithms will calculate virtually anything you need to know.

The initial expectation and standard deviation are basically my best guess before taking any measurements, and that standard deviation is very large. The point of the measurements is to approximate the function with an accurate expectation and a low standard deviation.

I am just trying to figure out how to optimally spread out my measurements, based on an initial expectation and variance (a rough guess which is normally distributed).

fahraynk · Jun 15, 2018

berkeman said:

I'm not sure this will help, but for time-dependent functions, the sampling frequency does have a minimum:

https://en.wikipedia.org/wiki/Nyquist_rate

You need to sample at a rate greater than the Nyquist rate (twice the highest frequency present in the signal), or you lose information. Does that help at all?

EDIT / ADD -- Do you have any idea how quickly this function can vary with x? If it is something like a Gaussian distribution, then the Nyquist rate probably does not apply. If it is a quickly varying function of x, it might still apply.

Hey, thanks, it's not time-dependent. You brought back some memories of studying modulators, though.

FactChecker · Jun 15, 2018

I have never studied that question, but I bet that the best sample location for a linear regression of a subject of the form ##Y=ax+b+\mathcal{N}(\mu,\,\sigma^{2})## is to put half of the data points at each end of the range of x. That should give you the best estimate of the slope, but it will give you no way to verify that the subject really is linear.

Tosh5457 · Jun 16, 2018

FactChecker said:

I have never studied that question, but I bet that the best sample location for a linear regression of a subject of the form ##Y=ax+b+\mathcal{N}(\mu,\,\sigma^{2})## is to put half of the data points at each end of the range of x. That should give you the best estimate of the slope, but it will give you no way to verify that the subject really is linear.

What would be the reasoning behind it? To keep the confidence interval as narrow as possible, we should be minimizing the coefficient of variation of the regression's estimator (=regression coefficient estimator's standard deviation/coefficient estimate), right?

WWGD · Jun 17, 2018

Isn't there a general formula for the confidence interval for a in terms of the data pairs? Maybe we can see if the interval contains 0? I guess it depends on the sampling distribution of the slope.
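
For reference, here is the textbook interval, assuming the simple model ##y_i = a x_i + b + \varepsilon_i## with independent, normally distributed errors of equal variance (an assumption, since the model for this problem has not been pinned down). With ##\hat a## and ##\hat b## the least-squares estimates from ##n## data pairs, the ##100(1-\alpha)\%## confidence interval for the slope is

$$\hat a \;\pm\; t_{n-2,\,1-\alpha/2}\,\frac{s}{\sqrt{\sum_{i=1}^{n}(x_i-\bar x)^2}}, \qquad s^2=\frac{1}{n-2}\sum_{i=1}^{n}\left(y_i-\hat a x_i-\hat b\right)^2 .$$

Under those assumptions, the slope is significantly different from 0 at level ##\alpha## exactly when this interval does not contain 0. The sample locations enter only through ##\sum_i (x_i-\bar x)^2##, so spreading the ##x_i## out narrows the interval.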

FactChecker · Jun 17, 2018

Tosh5457 said:

What would be the reasoning behind it? To keep the confidence interval as narrow as possible, we should be minimizing the coefficient of variation of the regression's estimator (=regression coefficient estimator's standard deviation/coefficient estimate), right?

Suppose you only had two sample values. The wider the separation of their x values, the less the random component changes the estimated slope. So one will get the best slope estimate if the separation is maximized. Adding more sample points in pairs leads one to also position them at the extremes.
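
To put numbers on this: for a straight-line fit with independent noise of standard deviation ##\sigma##, the standard error of the slope estimate is ##\sigma / \sqrt{\sum_i (x_i-\bar x)^2}##, so the placement of the points matters only through that sum. The sketch below compares three ways of placing 10 points on ##[0,100]##. The value of `sigma` and the three designs are illustrative assumptions, and the whole comparison only makes sense if the relationship really is linear.

```python
import numpy as np

def slope_std_error(x, sigma):
    """Standard error of the OLS slope for sample locations x and noise sd sigma."""
    x = np.asarray(x, dtype=float)
    sxx = np.sum((x - x.mean()) ** 2)
    return sigma / np.sqrt(sxx)

n, sigma = 10, 5.0   # hypothetical sample size and noise level

designs = {
    "half at each end": np.array([0.0] * (n // 2) + [100.0] * (n // 2)),
    "evenly spaced": np.linspace(0.0, 100.0, n),
    "clustered near x=1..2": np.linspace(1.0, 2.0, n),
}

# Smaller is better: the same noise disturbs the fitted slope less.
errors = {name: slope_std_error(x, sigma) for name, x in designs.items()}
for name, se in errors.items():
    print(f"{name:>22}: slope standard error = {se:.4f}")
```

The end-point design maximizes ##\sum_i (x_i-\bar x)^2## and therefore gives the smallest slope error, while the clustered design gives by far the largest.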

fahraynk · Jun 17, 2018

FactChecker said:

Suppose you only had two sample values. The wider the separation of their x values, the less the random component changes the estimated slope. So one will get the best slope estimate if the separation is maximized. Adding more sample points in pairs leads one to also position them at the extremes.

Is there any way to calculate this?

WWGD · Jun 17, 2018

fahraynk said:

Suppose I am trying to approximate a function which I do not know, but I can measure. Each measurement takes a lot of effort.

Say the function I am approximating is ##y=f(x)## and ##x \in [0,100]##

Suppose I know the expectation and variance of ##f(x)##.
Is there a way to compute the confidence in the regression approximation as a function of sample size?
Is there a maximum number of useful measurements? For example, if I take 20 measurements between ##x=1## and ##x=2##, and 1 measurement at ##x=20##, another measurement at ##x=2## won't really give me any new information, right? So shouldn't there be a maximum number of measurements, or a minimum distance between measurements, based on the standard deviation of ##f(x)##?

I am just curious: do you know the distribution of the response y, and whether there is a significant linear relationship between the variables?

fahraynk · Jun 17, 2018

WWGD said:

I am just curious: do you know the distribution of the response y, and whether there is a significant linear relationship between the variables?

In general, the higher ##X##, the higher ##Y##, but it's nonlinear.
I read that two ##x## coordinates don't give new information unless they are spaced at least a distance $$\frac{1}{2\pi \sigma}$$ apart, where ##\sigma## is my predicted standard deviation of ##x##. I don't want to trust that without an explanation, though, because intuitively the estimate of the expectation of ##f(x)## at a point should improve even if I take 10 measurements at that same point, right?
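
Here is a quick simulation of that intuition, assuming each measurement at a fixed point is the true value plus independent normal noise with the same standard deviation (the noise model, `sigma`, and `true_value` are made up for illustration). Under that assumption the mean of ##n## repeats has standard deviation ##\sigma/\sqrt{n}##, so repeats at the same ##x## do keep improving the estimate there.

```python
import numpy as np

def spread_of_mean(n_repeats, sigma=2.0, true_value=7.0, trials=20000, seed=0):
    """Empirical standard deviation of the average of n_repeats noisy measurements."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(true_value, sigma, size=(trials, n_repeats))
    return samples.mean(axis=1).std(ddof=1)

repeat_counts = (1, 10, 100)
empirical = {n: spread_of_mean(n) for n in repeat_counts}
theory = {n: 2.0 / np.sqrt(n) for n in repeat_counts}   # sigma / sqrt(n)

for n in repeat_counts:
    print(f"n = {n:3d}: simulated {empirical[n]:.3f}, sigma/sqrt(n) = {theory[n]:.3f}")
```

What repeats at one point cannot do is say anything about the shape of ##f## between points, which is presumably where a spacing rule would come from.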

FactChecker · Jun 17, 2018

fahraynk said:

Is there any way to calculate this?

You say your function is nonlinear, so what I said is not applicable and would probably be a very bad thing to do.

WWGD · Jun 17, 2018

fahraynk said:

In general, the higher ##X##, the higher ##Y##, but it's nonlinear.
I read that two ##x## coordinates don't give new information unless they are spaced at least a distance $$\frac{1}{2\pi \sigma}$$ apart, where ##\sigma## is my predicted standard deviation of ##x##. I don't want to trust that without an explanation, though, because intuitively the estimate of the expectation of ##f(x)## at a point should improve even if I take 10 measurements at that same point, right?

I am sorry, I don't understand. Do you mean the function is constant on intervals of length ##\frac{1}{2 \pi \sigma}##? Is sigma known to be small, so the ratio is large, or is sigma more towards being large? I assume maybe f takes some jumps in order to explain why it ? As I understand it, we would need to know the type of regression.
