CAF123 said:
The original data is Gaussian distributed.
I don't understand the probability model for the data - or whatever the fitted function represents. If the data were (scalar) Gaussian data, you could fit a two-parameter Gaussian to it and one parameter wouldn't be a function of the other. I also don't understand how minimizing chi-square enters the picture. Are you using the concept of minimizing chi-square to get a different fit than a "least squares" fit? What is the format of the data? ##(x_i)##? ##(x_i, y_i)##? ##(x_i, y_i, z_i)##?
CAF123 said:
Also, I guess more automated fitting programs would compute the covariance matrix by computing the curvature matrix ( = second derivative of the chi-square with respect to the parameters, evaluated at the best fit values). I know Mathematica has this as a usable function but I suppose this is what it is doing in the background.
I'm not sure whether "best fit values" refers to values of the data or values of the parameters. If it refers only to values of the parameters, I don't see how statistical properties of the data are included in Mathematica's calculation.
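To make that curvature-matrix recipe concrete, here is a minimal numerical sketch of its mechanics (not Mathematica's actual internals): the model, the made-up data, the per-point sigmas and the use of numpy/scipy are all assumptions for illustration. The recipe is: minimize ##\chi^2(\overrightarrow{\theta})##, form the curvature matrix ##\alpha_{jk} = \tfrac{1}{2}\,\partial^2 \chi^2 / \partial\theta_j \partial\theta_k## at the best-fit parameter values, and take its inverse as the estimated parameter covariance.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data and model y = a*exp(-b*x) with known per-point sigmas.
# All of this (model, x, y, sigma) is made up for illustration.
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 30)
true = np.array([2.0, 0.7])
sigma = 0.1 * np.ones_like(x)
y = true[0] * np.exp(-true[1] * x) + rng.normal(0, sigma)

def model(theta, x):
    return theta[0] * np.exp(-theta[1] * x)

def chi2(theta):
    return np.sum(((y - model(theta, x)) / sigma) ** 2)

# Best-fit parameter values from minimizing chi-square.
fit = minimize(chi2, x0=[1.0, 1.0])
theta_hat = fit.x

def hessian(f, p, h=1e-4):
    """Central finite-difference Hessian of a scalar function f at point p."""
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pp = p.copy(); pp[i] += h; pp[j] += h
            pm = p.copy(); pm[i] += h; pm[j] -= h
            mp = p.copy(); mp[i] -= h; mp[j] += h
            mm = p.copy(); mm[i] -= h; mm[j] -= h
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * h * h)
    return H

# Curvature matrix = (1/2) * Hessian of chi-square at the best-fit point;
# its inverse is taken as the estimated covariance of the fitted parameters.
alpha = 0.5 * hessian(chi2, theta_hat)
cov_theta = np.linalg.inv(alpha)
print(theta_hat, np.sqrt(np.diag(cov_theta)))
```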
Things that can be "computed" from a specific set of data are not population parameters. They can be estimators of population parameters. So you should distinguish between "sample mean" and "mean", between "covariance matrix" and "sample covariance matrix", etc.
For a given set of data, we have only one value for the best-fitting parameters. So we don't have a "sample covariance matrix" for the parameters. We can use the sample statistics from the data to estimate the population covariance matrix for the parameters.
Technically, any method of doing this qualifies as an estimator - it may not be unbiased, minimum variance, maximum likelihood, etc., but it can still be called an estimator.
Guessing the formula for a good estimator is usually done by expressing (or imagining) the vector of best-fitting parameters to be a function of the sample statistics of the data: ##\overrightarrow{\theta} = \overrightarrow{A}(\overrightarrow{S})##, where ##\overrightarrow{S}## is a vector of sample statistics, such as the sample means, sample covariances, etc. of one realization of the data ##\overrightarrow{x}##. (For example, in least-squares fitting of a line to (x,y) data, the slope and intercept of the line are each a function of the sample means, sample variances and sample covariance of the data.)
The sample statistics ##\overrightarrow{S}## are themselves random variables, since they depend on the random variable ##\overrightarrow{x}##. When one random variable ##\overrightarrow{\theta}## is a function of another ##\overrightarrow{S}##, we can, in principle, compute the parameters and (population) statistics of ##\overrightarrow{\theta}## if we know the distribution of ##\overrightarrow{S}##. The distribution of ##\overrightarrow{S}## is, in principle, computable from the distribution of ##\overrightarrow{x}##. In many cases it can be computed by knowing only some (population) parameters of the distribution for ##\overrightarrow{x}##.
However, having only data, we don't know the distribution of ##\overrightarrow{x}##. In particular, we don't know the population parameters of that distribution. So the above process of deduction can't be carried out. That line of deduction does suggest a procedure for estimation. This would be:
1) Use the sample statistics from the particular data we have as estimators of the population parameters for the distribution of ##\overrightarrow{x}##.
2) Use the estimated distribution of ##\overrightarrow{x}## to estimate the distribution of ##\overrightarrow{S}##.
3) Use the estimated distribution of ##\overrightarrow{S}## to compute an estimated distribution of ##\overrightarrow{\theta}##.

Linear approximations are often used in the above calculations. In a linear approximation we expand a function ##G(\overrightarrow{y})## about some point ##\overrightarrow{y_0}## using coefficients that depend on the partial derivatives of ##G##. In the above process, what points ##\overrightarrow{y_0}## are used?
In step 2), the only point we know is the vector of sample statistics ##\overrightarrow{S_0}## computed from the particular data we have.
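As a sketch of how the linearized propagation in step 3) works in practice, one expands ##\overrightarrow{A}## about the observed sample statistics ##\overrightarrow{S_0}## and uses ##Cov(\overrightarrow{\theta}) \approx J\, Cov(\overrightarrow{S})\, J^T##, where ##J## is the Jacobian of ##\overrightarrow{A}## evaluated at ##\overrightarrow{S_0}## (this is sometimes called the delta method). Everything in the code below is hypothetical: the function ##\overrightarrow{A}## is the line-fit example from above, and the numbers in S0 and cov_S stand in for what steps 1) and 2) would produce.

```python
import numpy as np

def A(S):
    """Map from sample statistics S = (x_bar, y_bar, s_xx, s_xy)
    to the fitted parameters theta = (slope, intercept) of a least-squares line."""
    x_bar, y_bar, s_xx, s_xy = S
    slope = s_xy / s_xx
    return np.array([slope, y_bar - slope * x_bar])

def jacobian(f, S0, h=1e-6):
    """Numerical Jacobian of a vector-valued function f at S0 by central differences."""
    S0 = np.asarray(S0, dtype=float)
    m = len(f(S0))
    J = np.zeros((m, len(S0)))
    for j in range(len(S0)):
        up, dn = S0.copy(), S0.copy()
        up[j] += h
        dn[j] -= h
        J[:, j] = (f(up) - f(dn)) / (2 * h)
    return J

# S0: the sample statistics observed from the one data set we actually have.
# cov_S: an assumed estimate of the covariance of those statistics (from steps 1 and 2).
# Both are made-up numbers, purely to show the mechanics.
S0 = np.array([3.0, 4.02, 2.0, 2.0])
cov_S = np.diag([0.1, 0.1, 0.05, 0.05])

J = jacobian(A, S0)
cov_theta = J @ cov_S @ J.T   # step 3: linearized ("delta method") propagation
theta_0 = A(S0)
print(theta_0, np.sqrt(np.diag(cov_theta)))
```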
The above procedure is plausible, but it is not a proof that such an estimation process is good.
The outcome will be results like ##\mu_{\theta_1} = 25.2##, ##\sigma_{\theta_1} = 4.3##. People tend to interpret such results as giving the probability that a parameter lies in a specific numerical interval. This interpretation is unjustified. It is a Bayesian interpretation of calculations that were done without the proper Bayesian technique. (An actual Bayesian approach requires assuming a prior distribution for ##\overrightarrow{\theta}##.)
Furthermore, the above estimation process computes things like ##\sigma_{\theta_1}## by assuming that the particular data we have is representative enough that the "uncertainties" in the probability distribution for the data are accurately estimated from the uncertainties (e.g. sample standard deviations, sample covariances) in our particular sample. So trying to portray the above process as a logical deduction gets into a circular argument: if we assume the data is representative (with probability 1), then we compute uncertainties in our estimates for the parameters of the distribution of the data, and those uncertainties imply that the probability that our data is representative is less than 1.
However, human nature makes performing the above estimation procedure irresistible. In a particular situation, you can perform simulations to get a practical idea of how well it works.
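For instance, a simulation along these lines (a hypothetical straight-line model with Gaussian noise; all numbers are assumed) generates many data sets from known "true" parameters, fits each one, and compares the actual scatter of the fitted parameters across realizations to the uncertainties the single-data-set procedure reports.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 25)
true_slope, true_intercept, noise_sd = 1.5, 0.5, 1.0   # assumed "truth"

def fit_line(x, y):
    """Least-squares slope and intercept from sample statistics."""
    s_xy = np.mean((x - x.mean()) * (y - y.mean()))
    s_xx = np.mean((x - x.mean()) ** 2)
    slope = s_xy / s_xx
    return slope, y.mean() - slope * x.mean()

# Generate many realizations of the data and fit each one.
fits = []
for _ in range(2000):
    y = true_intercept + true_slope * x + rng.normal(0, noise_sd, size=x.size)
    fits.append(fit_line(x, y))
fits = np.array(fits)

# Actual scatter of the estimators across many realizations of the data;
# compare these to the uncertainties the estimation procedure reports
# from a single data set.
print("sd of fitted slope:    ", fits[:, 0].std())
print("sd of fitted intercept:", fits[:, 1].std())
```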