That's a good question and I can't give you an authoritative answer!
Thinking about this clearly requires using certain terminology precisely. My thoughts on the subject:
When we have a random variable with a normal probability distribution, the standard deviation of that distribution is indeed a parameter that measures the spread of the distribution, and you can calculate (or look up in tables) the probability that a random draw from that distribution falls within plus or minus so many standard deviations of the mean value.
When you have a different distribution you can usually do the same sort of thing, but you must use different calculations and tables. In other words, the tables for the normal distribution are not "universal"; they do not apply to all probability distributions. However, the normal distribution is the most commonly encountered one, and many not-quite-normal distributions can be well approximated by it.
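Just to make that concrete, here is a small sketch in Python (using scipy.stats, with my own toy comparison, nothing taken from your problem) showing that the "within plus or minus one standard deviation" probability depends on which distribution you have:

```python
from scipy import stats

# Two distributions, each described by a mean and a standard deviation, but
# with different shapes: a standard normal and an exponential with mean 1
# (so its standard deviation is also 1).
normal = stats.norm(loc=0.0, scale=1.0)
expon = stats.expon(scale=1.0)

for name, dist in [("normal", normal), ("exponential", expon)]:
    mu, sd = dist.mean(), dist.std()
    # P(mu - sd <= X <= mu + sd), computed from the distribution's own CDF
    p = dist.cdf(mu + sd) - dist.cdf(mu - sd)
    print(f"{name}: P(within 1 sd of the mean) = {p:.4f}")

# Prints roughly 0.6827 for the normal and roughly 0.8647 for the exponential:
# the "within k standard deviations" probabilities are not universal.
```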
When we have a sample of data, this is not a probability distribution. We usually assume the sample came from a probability distribution, and we use some calculation on the sample values to estimate the parameters of the distribution we assumed. So, in your situation, it is the residuals (the errors of the fit) that enter those calculations.
As to terminology: in your problem, the quantity $\sqrt{\sum e_i^2/(n-1)}$ is an "estimator" of the standard deviation of the distribution of errors. Some people define "the standard deviation of the sample" to be the quantity $\sqrt{\sum e_i^2/(n-1)}$ and some people define it to be $\sqrt{\sum e_i^2/n}$. It's a rather arbitrary decision. However, the "estimator of the population standard deviation" is not such an arbitrary choice. It is almost always taken to be $\sqrt{\sum e_i^2/(n-1)}$, since one can prove that this formula has desirable properties.
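If it helps to see the two conventions side by side, a quick sketch (the residuals are made-up numbers):

```python
import numpy as np

# Hypothetical residuals e_1, ..., e_n from some fit (made-up numbers).
e = np.array([0.3, -1.1, 0.8, 0.2, -0.5, 1.4, -0.9])
n = len(e)

sd_div_n = np.sqrt(np.sum(e**2) / n)                 # "divide by n" convention
sd_div_n_minus_1 = np.sqrt(np.sum(e**2) / (n - 1))   # usual estimator of the population sd

print(sd_div_n, sd_div_n_minus_1)
# numpy's own std() has a ddof argument that switches between the same two
# conventions (ddof=0 divides by n, ddof=1 by n-1), though std() also
# subtracts the sample mean before squaring.
```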
In summary, we have at least 3 different sorts of things involved in statistics:
A) Probability distributions and their parameters, such as their standard deviations and means
B) Samples and their parameters, such as their standard deviations and means
C) Estimators, which are formulas or algorithms that estimate the parameters of distributions as functions of the data in samples.
So when you say "standard deviation", it is an ambiguous phrase until you specify whether you mean a parameter of a probability distribution, a parameter of a sample, or an estimate of the parameter of the distribution based on the data in a sample.
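A toy illustration of the three sorts of things, with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# (A) A probability distribution and its parameter: a normal distribution with
#     mean 10 and standard deviation 2.  Here sigma = 2 is a fixed number,
#     not something computed from data.
mu, sigma = 10.0, 2.0

# (B) A sample: just a batch of numbers drawn from that distribution.
sample = rng.normal(mu, sigma, size=50)

# (C) An estimator: a formula applied to the sample values to estimate the
#     distribution's parameter (here, the usual n-1 formula).
sigma_hat = np.std(sample, ddof=1)

print("parameter of the distribution:", sigma)
print("estimate from this particular sample:", sigma_hat)
# A different sample would give a different estimate; the parameter itself
# does not change.
```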
As I understand the method in the paper you linked, it doesn't worry much about the distinction between the value of the estimated standard deviation and the actual standard deviation of the probability distribution. It assumes that the estimated value is close enough to the actual value.
----
Your (and my) big conceptual problem is that in the curve-fitting scenario the data is assumed to be randomly generated (in the sense that it has random deviations from a curve that gives the mean value of the data), but the output of the program talks about the parameters of the curve as if they were somehow random. It does calculations based on the parameters of the curve having a standard deviation, in the sense of the standard deviation of a probability distribution.
I haven't studied the details of what is going on. My impression is that the scenario is as follows. We use the data to estimate the parameters of a probability distribution for the data. We assume this is the true distribution of the data. We imagine that many different samples (each consisting of n data points) are drawn from this distribution and that the curve-fitting process produces a slightly different fitted curve on each sample. So there is a distribution of different parameters for the curves. This way, the parameters of the curves do have random variation.
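Here is a small simulation along those lines (my own toy model, an exponential decay, not anything from the paper): if we pretend we know the true curve and the true noise distribution, each simulated batch of data gives a slightly different fitted curve, and the fitted parameters really do have a distribution we can look at.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    # toy "true" curve: exponential decay
    return a * np.exp(-b * x)

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 30)
true_a, true_b, noise_sd = 2.0, 0.7, 0.1

# Fit the curve to many independently simulated batches of n data points
# and collect the fitted parameters.
fits = []
for _ in range(500):
    y = model(x, true_a, true_b) + rng.normal(0.0, noise_sd, size=x.size)
    popt, _ = curve_fit(model, x, y, p0=[1.0, 1.0])
    fits.append(popt)
fits = np.array(fits)

# The fitted (a, b) scatter around the true values; their spread across
# repeated samples is a genuine "standard deviation of the parameters".
print("mean of fitted (a, b):", fits.mean(axis=0))
print("sd   of fitted (a, b):", fits.std(axis=0, ddof=1))
```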
If a computer program carried out the above calculation exactly, it would:
1) Use the data to estimate a probability distribution for the data
2) Use the estimated probability distribution to compute the distribution of parameters that would occur when we fit curves to random batches of n data points.
3) Use the distributions of the parameters to state confidence intervals for the parameters.
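A brute-force way to carry out steps 1) - 3) literally is a parametric bootstrap, something like the sketch below (the model, data and noise assumptions are all invented for illustration; I am not claiming this is what any particular program does):

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * np.exp(-b * x)

# Made-up "observed" data (in reality, your n data points).
rng = np.random.default_rng(2)
x = np.linspace(0, 5, 30)
y_obs = model(x, 2.0, 0.7) + rng.normal(0.0, 0.1, size=x.size)

# Step 1: use the data to estimate a probability distribution for the data:
# fit the curve, then estimate the noise standard deviation from the residuals.
popt, _ = curve_fit(model, x, y_obs, p0=[1.0, 1.0])
resid = y_obs - model(x, *popt)
n, p = x.size, len(popt)
noise_sd_hat = np.sqrt(np.sum(resid**2) / (n - p))   # divide by n minus number of parameters

# Step 2: treat that estimated distribution as the truth, draw many fresh
# batches of n points from it, and refit the curve to each batch.
boot = []
for _ in range(1000):
    y_sim = model(x, *popt) + rng.normal(0.0, noise_sd_hat, size=n)
    popt_sim, _ = curve_fit(model, x, y_sim, p0=popt)
    boot.append(popt_sim)
boot = np.array(boot)

# Step 3: read off confidence intervals from the distribution of refitted
# parameters (here, simple 2.5% / 97.5% percentiles).
for i, name in enumerate(["a", "b"]):
    lo, hi = np.percentile(boot[:, i], [2.5, 97.5])
    print(f"{name}: estimate {popt[i]:.3f}, approx 95% interval ({lo:.3f}, {hi:.3f})")
```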
As typical nonlinear curve-fitting programs work, I think they make some simplifying assumptions. In spite of the fact that they are "nonlinear" curve fits, I think they do calculations that assume the curve can be approximated by a linear function (a linearization in the parameters). I don't know the technical details of that yet. They also assume that the distributions involved are normal distributions (which may be true "asymptotically" as we use larger and larger sample sizes). This is why I think the confidence intervals for the parameters are "asymptotic linear confidence intervals" rather than ordinary confidence intervals.
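If my guess about the shortcut is right, the reported intervals come not from a simulation like the one above but from that linear/normal approximation. For example, scipy's curve_fit returns a covariance matrix for the parameters based on linearizing the model at the fitted values, and an "asymptotic" interval is just the estimate plus or minus a critical value times the square root of the corresponding diagonal entry (again a sketch, with the same made-up data as before):

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy import stats

def model(x, a, b):
    return a * np.exp(-b * x)

rng = np.random.default_rng(2)
x = np.linspace(0, 5, 30)
y_obs = model(x, 2.0, 0.7) + rng.normal(0.0, 0.1, size=x.size)

popt, pcov = curve_fit(model, x, y_obs, p0=[1.0, 1.0])

# Standard errors from the linearized (asymptotic) covariance matrix.
se = np.sqrt(np.diag(pcov))

# 95% intervals using a t critical value with n minus #parameters degrees of
# freedom (some programs simply use 1.96, i.e. the normal approximation).
dof = x.size - len(popt)
tcrit = stats.t.ppf(0.975, dof)
for name, est, s in zip(["a", "b"], popt, se):
    print(f"{name}: {est:.3f} +/- {tcrit * s:.3f}")
```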
From browsing the web, it appears that the use of nonlinear curve fitting software is becoming more and more common, so the question of what these programs produce is very topical. I intend to study it more.