Generating data from trendline

1. Feb 6, 2012

PixelDictator

Hello all,

I am trying to take a fitted line, with given standard error in slope and y-intercept, and generate sets of random data points (and corresponding uncertainties) which would give the same line with the same uncertainties.

I'm at a loss for ways to achieve this, and I'm not quite sure that it would be possible without trying to brute-force it with programming, or something equally ugly... Is there any method that would make this happen? We don't have any original data points, just the few numbers about the trendline.

2. Feb 6, 2012

Number Nine

Generate some x values and transform them according to the equation of your line. Then draw "noise" from a normal distribution centred at zero and add it to the result. In MATLAB, you would do something like this...

x = rand(1,100); % 100 x values, uniform on [0,1]
noise = normrnd(0,1,1,100); % Standard normal noise (normrnd needs the Statistics Toolbox)
y = 2*x + 1 + noise; % Example line with slope 2 and intercept 1, plus the noise

3. Feb 7, 2012

Stephen Tashi

Do you mean exactly the same line with exactly the same standard deviation for the errors, so that someone fitting a line to the generated data would get exactly the same slope and intercept?

Or do you mean you want to do what Number Nine suggested, which is to assume your line is the correct deterministic part of the equation for the data and then generate the random errors? In that case, someone fitting a line to the generated data might not get exactly the same line as you began with.

4. Feb 13, 2012

PixelDictator

Stephen,
I'm attempting to do the former. I've set up a program to do what Number Nine suggested, which works pretty well in the meantime, but it would be a lot better if I had a way to recreate the line and uncertainties perfectly.

5. Feb 13, 2012

Stephen Tashi

You can give a set of values whatever mean and standard deviation you want by multiplying it by one constant and adding another. For example, generate a set of values E. Suppose it has mean mu and variance sigma_sq. For constants c and k, create scaled data by setting F = k E + c. The data F has mean = k mu + c and variance = k^2 (sigma_sq). You can solve for the values of k and c that produce the mean and variance that you want.

(In this post I'm treating variances as "sample variances", which are computed with a denominator of n = the number of data points, not with a denominator of n-1, as in the unbiased estimator for population variance.)
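As a rough illustration of that rescaling, here is a minimal MATLAB sketch. The names target_mean and target_var and their values are made up for the example, and var(E, 1) is used so that the variance has a denominator of n, as described above.

```matlab
% Sketch: rescale raw values E so that F = k*E + c has a chosen
% sample mean and sample variance (denominator n).
% target_mean and target_var are illustrative values, not from the thread.
E = randn(1, 100);                 % any starting values will do
target_mean = 0;
target_var  = 0.25;

mu       = mean(E);
sigma_sq = var(E, 1);              % second argument 1 -> denominator n, not n-1

k = sqrt(target_var / sigma_sq);   % from k^2 * sigma_sq = target_var
c = target_mean - k * mu;          % from k*mu + c = target_mean

F = k * E + c;
disp([mean(F), var(F, 1)])         % should match the targets up to rounding error
```

Taking the positive square root for k is a choice; the negative root would satisfy the same two equations.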

You are using the ambiguous word "uncertainties", and I can't be sure what quantity or quantities you mean by that.

One interesting technicality about linear least squares regression is that if you fit a line to (x,y) data viewing x as the independent variable, you may get a different line than if you regard y as the independent variable. If you want "artificial" data so that the procedure for linear least squares regression produces a given line when applied to that data, then you must be careful to specify which variable is treated as independent.
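A small MATLAB sketch of that asymmetry, using polyfit for both fits. The slope, intercept and noise level are arbitrary example values, and the variable names are only for illustration.

```matlab
% Sketch: regressing y on x and x on y generally gives two different lines.
% The slope 2, intercept 1 and noise level 0.5 are illustrative values only.
x = rand(1, 100);
y = 2*x + 1 + 0.5*randn(1, 100);

p_yx = polyfit(x, y, 1);   % y = p_yx(1)*x + p_yx(2), x treated as independent
p_xy = polyfit(y, x, 1);   % x = p_xy(1)*y + p_xy(2), y treated as independent

% Rewrite the second fit in the form y = slope*x + intercept for comparison
slope_from_xy     = 1 / p_xy(1);
intercept_from_xy = -p_xy(2) / p_xy(1);

disp([p_yx(1), p_yx(2)])
disp([slope_from_xy, intercept_from_xy])   % differs unless the points are exactly collinear
```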

Assume x is the independent variable and the artificial data is (x, y) with y = A x + B + F where A and B are constants and the F are artificial "errors" from the trendline. The equations that must be satisfied in order for the linear regression to reproduce A and B when applied to the data are (as I recall):

A = ( cov(x,Ax + B + F))/ var(x)
B = mean of (A x + B + F) - (A)( mean of x).

where the means and variances involved are sample means and variances of the data.
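Expanding those two equations, cov(x, A x + B + F) = A var(x) + cov(x, F) and mean of (A x + B + F) = A (mean of x) + B + mean of F, so they hold exactly when cov(x, F) = 0 and mean of F = 0. One possible way to build such an F, sketched in MATLAB under assumed example values for A, B, n and target_sd (none of which come from the thread), is to take raw noise, subtract its own fitted line in x, and then rescale:

```matlab
% Sketch: build errors F so that least squares on (x, y) returns A and B.
% A, B, n and target_sd are illustrative values, not from the thread.
A = 2;  B = 1;  n = 100;
target_sd = 0.5;                   % desired spread of F (denominator n)

x   = rand(1, n);
raw = randn(1, n);

% Remove from the raw noise its own fitted line in x. The residuals then
% have sample mean 0 and sample covariance 0 with x.
p = polyfit(x, raw, 1);
F = raw - polyval(p, x);

% Rescale with k only (c = 0), which keeps mean(F) = 0 and cov(x, F) = 0.
F = F * (target_sd / std(F, 1));

y = A*x + B + F;
disp(polyfit(x, y, 1))             % approximately [A B], up to rounding error
```

This only pins down the fitted line and the spread of the residuals. If "uncertainties" means the usual standard errors of the slope and intercept, those also depend on n and on the spread of the x values, so the x values would have to be chosen to match as well.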

If I'm clear on what you are trying to do, then we can check whether I got those equations right and solve them for k and c.
