Numerical computation of the derivative

sysprog · Nov 11, 2021

@Gaussian97, any chance you could post the code in question?

valenumr · Nov 11, 2021

So thinking about this a little more, I think 1± e is pretty well defined computationally. I'm not even sure this would run into problems except for very large values of x. And as long as f(x) isn't very "flat" around x, this is actually very good.

Gaussian97 · Nov 11, 2021

sysprog said:

@Gaussian97, any chance you could post the code in question?

Well, the code in question is more than 2000 lines, but the relevant part is:

Python:

# The i and Q variables are not relevant.
def deriv(f, i, Q, x0, epsilon):
    h = epsilon * x0
    result = (f(i, x0 + h, Q**2) - f(i, x0 - h, Q**2))/(2*h)
    return result

# The function we want to differentiate is:
import lhapdf

pdf = lhapdf.mkPDF(PDFname)
f = pdf.xfxQ2

I don't know if this gives any extra information.

pbuk · Nov 11, 2021

FactChecker said:

I think that, given equal amounts of round-off and truncation errors, this will tend to give a more accurate estimate at the midpoint value than a one-sided calculation would. This is just a gut feeling on my part and my knowledge of the numerical issues is too old to be reliable.

Yes, expanding the Taylor series for each side of the symmetric difference method shows that the ## \frac{h^2}{2!} f''(x) ## term eliminates and you are left with errors ## O(h^3) ## instead of ## O(h^2) ## with a one-sided (Newton) method.

Gut feelings are great, but it's good to remind yourself of the maths every few decades

valenumr · Nov 11, 2021

Gaussian97 said:

Summary:: Question about an algorithm to compute the derivative of a function.

I'm not sure if this is the correct forum to post this question, or should I post it in a math forum. But I was looking at some code when I found a 'strange' implementation to compute the derivative of a function, and I wanted to know if any of you has an idea of why such an implementation is used.

The formula used is
$$f'(x) = \frac{f((1+\varepsilon)x)-f((1-\varepsilon)x)}{2\varepsilon x}$$
Of course, ##\varepsilon## should be a small number. I know that there are many ways to implement the derivative of a function numerically, and obviously, the formula above does indeed converge to the derivative in the limit ##\varepsilon\to 0## in the case of differentiable functions.

My question is if someone has also used this formula instead of the usual ##f'(x) = \frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}## or if someone knows any advantage for using this alternative formula.

Something that may be important is that the formula is used to compute derivatives of functions that are only defined in the interval ##(0,1)##, so maybe I thought this formula has some advantage when ##x \sim 0## or ##x \sim 1##?
For example, this formula has the advantage that even if ##x<\varepsilon## the argument will never be smaller than 0, which probably is one of the reasons for using it.

Does anyone have any information?

So I'm still thinking on this a lot... I'm thinking the (1±e) is a very accurate way to do this. It removes scale, I guess it's the best way I can put it. I actually like this a lot and now that I get it, going to rewrite some code

FactChecker · Nov 11, 2021

pasmith said:

Floating point arithmetic is inherently inaccurate - particularly if you're dealing with very small or very large numbers. You don't want [itex]1/\epsilon[/itex] to overflow, you don't want [itex]x \pm \epsilon [/itex] to be rounded to [itex]x[/itex] and you don't want [itex]|f(x + \epsilon) - f(x - \epsilon)|[/itex] to be rounded to zero.

This is an attempt to avoid those problems.

If x is in (0,1), doesn't multiplying by it aggravate those problems?

valenumr · Nov 11, 2021

FactChecker said:

If x is in (0,1), doesn't multiplying by it aggravate those problems?

I'm still trying to make sense of that fact

FactChecker · Nov 11, 2021

Unless there is some reason to include ##x## as a factor in ##h = x\epsilon##, I would prefer that the characteristics of the numerical estimate be unaffected by a horizontal translation, i.e. the nature of the estimate of ##f'(x+c)## at ##x=x_0## would be the same as the nature of the estimate of ##f'(x)## at ##x=x_0+c##.

pbuk · Nov 11, 2021

It is not that we need to make the step size (which I am going to call ## h ## because ## \epsilon ## is always machine epsilon) small near zero - we need to make the step size as small as we can everywhere: at least small enough that we can be sure that the ## O(h^3) ## term disappears below ## \epsilon ##.

However the smaller we make the step size in relation to ## x ##, the bigger the problem we are going to have with roundoff error: bear in mind that we try to calculate values of ## f(x + h) ## we are actually calculating ## f(x + h + \epsilon(x + h)) ##.

As a result we can see that we can in general minimise the errors introduced by ## \epsilon ## by making ## h ## proportional to ## x ##, and this is what the quoted code does. Exactly what proportion to choose depends on how badly behaved ## f(x) ## is, but given that we are trying to make the ## O(h^3) ## term disappear, something around ## \epsilon^{1 \over 3} ## (adjusted for the range of ## f'''(x) ##) is probably a good place to start.

valenumr · Nov 11, 2021

pbuk said:

It is not that we need to make the step size (which I am going to call ## h ## because ## \varepsilon ## is always machine epsilon) small near zero - we need to make the step size as small as we can everywhere: at least small enough that we can be sure that the ## O(h^3) ## term disappears below ## \varepsilon ##.

However the smaller we make the step size in relation to ## x ##, the bigger the problem we are going to have with roundoff error: bear in mind that we try to calculate values of ## f(x + h) ## we are actually calculating ## f(x + h + \varepsilon(x + h)) ##.

As a result we can see that we can in general minimise the errors introduced by ## \varepsilon ## by making ## h ## proportional to ## x ##, and this is what the quoted code does. Exactly what proportion to choose depends on how badly behaved ## f(x) ## is, but given that we are trying to make the ## O(h^3) ## term disappear, something around ## \varepsilon^{1 \over 3} ## (adjusted for the range of ## f'''(x) ##) is probably a good place to start.

Yeah, for some reason I initially interpreted the function as f(x + e) etc. It makes a lot more sense with f(x * (1 + e))

sysprog · Nov 11, 2021

Gaussian97 said:

I don't know if this gives any extra information.

I think that if it had been in other than an interpreted language, it potentially could have ##-## thanks for the reply . . .

valenumr · Nov 12, 2021

pbuk said:

It is not that we need to make the step size (which I am going to call ## h ## because ## \varepsilon ## is always machine epsilon) small near zero - we need to make the step size as small as we can everywhere: at least small enough that we can be sure that the ## O(h^3) ## term disappears below ## \varepsilon ##.

However the smaller we make the step size in relation to ## x ##, the bigger the problem we are going to have with roundoff error: bear in mind that we try to calculate values of ## f(x + h) ## we are actually calculating ## f(x + h + \varepsilon(x + h)) ##.

As a result we can see that we can in general minimise the errors introduced by ## \varepsilon ## by making ## h ## proportional to ## x ##, and this is what the quoted code does. Exactly what proportion to choose depends on how badly behaved ## f(x) ## is, but given that we are trying to make the ## O(h^3) ## term disappear, something around ## \varepsilon^{1 \over 3} ## (adjusted for the range of ## f'''(x) ##) is probably a good place to start.

It seems pretty clear that this is what the implementation is trying to do, but after thinking about it more, I agree with @FactChecker, for a couple of reasons. I think the implementation in some cases introduces more problems that it solves, though perhaps not applicable in this specific case, but for example, there are problems with negative values of x, and the artificial discontinuity introduced at x=0. Also, if you take something like f(x) = sin(x), answers for f'(x) and f'(x + 2n*pi) will not be exactly the same. Obviously minimizing the epsilon based on the magnitude of x might be more important, but there are other ways to accomplish this without introducing the hx term or the division by x, for example, h should be O(x * 2^-24), so a constant value can be chosen based on the floating point exponent of x that will still guarantee that (x+/-h <> x) without all the weird side effects.

pbuk · Nov 12, 2021

But the implementation is specifically and only for functions over the domain (0, 1) so none of that applies.

valenumr · Nov 12, 2021

pbuk said:

But the implementation is specifically and only for functions over the domain (0, 1) so none of that applies.

Right, I understood that. But also in that case you can just pick h to be 2^-24 (or maybe -23?) For all x, for 32-bit floats.

valenumr · Nov 12, 2021

valenumr said:

Right, I understood that. But also in that case you can just pick h to be 2^-24 (or maybe -23?) For all x, for 32-bit floats.

Well, that's actually wrong if you use x very close to zero, but that can be accounted for.

pbuk · Nov 13, 2021

valenumr said:

Right, I understood that.

That does not appear to be the case.

valenumr said:

But also in that case you can just pick h to be 2^-24 (or maybe -23?) For all x, for 32-bit floats.

We are in Python, there aren't any 32-bit floats.
A 32 bit float has ## \epsilon = 2^{-24}##, what is going to happen to your accuracy when say ## x = 0.9 ##?
If you use a step of ## h = 2^{-24} ## at ## x = 2^{-25} ## the rocket explodes.
I could go on but this is not a good use of anyone's time: if you have something to contribute then show us the math.

valenumr · Nov 13, 2021

valenumr said:

though perhaps not applicable in this specific case

Yes, I did get it, but I don't think this is a good implementation for reasons mentioned.

You are correct that I wasn't thinking about very small values initially, but...

In python, you can use frexp(x) to obtain the exponent of x, and just choose h (e.g, h = 2** [(frexp(x)-53)] for double precision) on that value without all the ugliness involved using h = (1+e)x. It requires some knowledge of the platform specific floating point implementation, but its pretty standard, and if IIRC, python has some methods and constants that will help out.

As an aside, you can abuse c, and just cast abs(x) to a uint with the proper number of bits and add 1 and then reapply the sign (IEEE 754 is lexographically ordered, excepting the sign). Of course in python or c, you will need to deal with over/under flow. It might look messier, but I think it is better.

I would personally avoid the OPs implementation because it is very possible to implement a better algorithm that is probably even more efficient, and just overall consistent over any interval.

But toward your example, choosing h with my approach would yield smaller values for most numbers, rather than running with x, so h on the interval from [0.25, 0.5), for example, would actually be the same, whereas that is not the case for the implementation in question.

valenumr · Nov 13, 2021

Thinking about this more, I like it even less. Again, there are no details on the function being differentiated, but take, for example f(x) = 1 / (x - 0.5) and evaluate at 0.5. it should be undefined, but would give an answer.

mgeorge001 · Nov 13, 2021

I like this formula because in x + Eps, Eps has a dimension, usually ignored, but scaling is often important, something this formula is explicitly giving as Eps is dimensionless in (1 + Eps)x and we measure scale relative to 1, whereas now, x is the only quantity with a dimension. Numerical analysts must take this sort of thing into account all the time to get "adaptive" algorithms. The one point to be aware of in allowing x to set dimension (or absolute scale) in the problem is that if x = 0 falls within the map, this loses scale information, and you will see phenomenon akin to Gibbs phenomenon or Runge's phenomenon. Of course, there are strategems to aviod the trivial setting x = 0 exactly, so strictly speaking, once you are aware, you can avoid program crashes, but Gibbs phenomenon looks pretty unavoidable here. I guess I would use "weak" methods like test functions in Sobolev theory to avoid this. Other than a few minor issues like that, it seems superficially to me to warrant some investigation. In classical applied math and physics people did this sort of thing routinely as in the dielectric constant that appears in classical electrodynamics. Lots of mathematicians are sharp about this sort of thing (e.g. Sobolev, himself) but often there is a gap between the mathematicians and the numerical analyst practitioners. I am only a math teacher, so that's probably worse!

valenumr · Nov 13, 2021

mgeorge001 said:

I like this formula because in x + Eps, Eps has a dimension, usually ignored, but scaling is often important, something this formula is explicitly giving as Eps is dimensionless in (1 + Eps)x and we measure scale relative to 1, whereas now, x is the only quantity with a dimension. Numerical analysts must take this sort of thing into account all the time to get "adaptive" algorithms. The one point to be aware of in allowing x to set dimension (or absolute scale) in the problem is that if x = 0 falls within the map, this loses scale information, and you will see phenomenon akin to Gibbs phenomenon or Runge's phenomenon. Of course, there are strategems to aviod the trivial setting x = 0 exactly, so strictly speaking, once you are aware, you can avoid program crashes, but Gibbs phenomenon looks pretty unavoidable here. I guess I would use "weak" methods like test functions in Sobolev theory to avoid this. Other than a few minor issues like that, it seems superficially to me to warrant some investigation. In classical applied math and physics people did this sort of thing routinely as in the dielectric constant that appears in classical electrodynamics. Lots of mathematicians are sharp about this sort of thing (e.g. Sobolev, himself) but often there is a gap between the mathematicians and the numerical analyst practitioners. I am only a math teacher, so that's probably worse!

To me, it's not impractical at all to implement a method that provides the most accurate possible results within the capabilities of machine representation using the actual limit definition of a derivative.

I keep thinking of examples where this approach fails, and perhaps in specialized circumstances it is works well, but I am not really seeing any utility other than expedient implementation.

mgeorge001 · Nov 13, 2021

valenumr said:

To me, it's not impractical at all to implement a method that provides the most accurate possible results within the capabilities of machine representation using the actual limit definition of a derivative.

I keep thinking of examples where this approach fails, and perhaps in specialized circumstances it is works well, but I am not really seeing any utility other than expedient implementation.

I think it is important to see the proposed formula more in light of classical "macroscopics", i.e. not trying to get at the highly accurate microscopic detail. I personally do not know whether the formula is of much use or not. We see the same "classical" problem of "loss of scale" that can haunt Newton's method for finding zeros. I only meant to point out that the proposal has some appeal in a "big picture" way. It is worth bearing in mind that classical Maxwell theory had to give way to quantum mechanics as ways of addressing microscopic behaviors became more relevant and accessible. But I don't think that implies "lack of utility" as there are instances where we want this sort of picture, and as we well know, Newton's method works pretty nicely in lots of cases despite grievous deficiencies.

rayj · Nov 13, 2021

The OP ( @Gaussian97 ) presented a function for a symmetrical derivative calculation (central difference). This may be useful is some instances.

As this is a computational activity, a key in implementing it is the resultant errors in the steps necessary to carry it out. Compare the original presentation with a simple
[f(x+e) - f(x)]/e
This includes two function evaluations, two sums/differences, and one divide.
The original formula includes two function evaluations, two sums/differences, one divide, and 7 multiplications. The multiplications include terms which are small and the products are then summed. Digital accuracy is compromised.

Another consideration is origination of the data and how the derivatives will be used. If this is experimental data (as most all data is), there will be noise. It is typical for noise to be high frequency such as may be found with radio frequency interference on electrical transducers and on digitally computed results. When such noisy data goes into either of the derivative functions, the high frequency is amplified and may overwhelm the fundamental data.

See R.W. Hamming's book, Digital FIlters, section 6.4 on the design of a filter which only requires products and sums to shape a filter so it does not amplify the high frequency noise. Digital division is very 'expensive'. Converting the derivative to only sums and products reduces this cost, and provides a solution that is not only easy to implement on a general purpose computer but also provides the ladder structure for direct implementation on digital signal processors for rapid, dedicated applications.

This physics forum is a great place to have this discussion as the problem is fundamental to analyses.

rayj · Nov 13, 2021

While Hamming's book is a great resource, if you are in a hurry, online resources may help:

Google search for digital filter differentiation
https://www.mathworks.com/help/signal/ug/take-derivatives-of-a-signal.html
https://www.dsprelated.com/showarticle/35.php
https://eeweb.engineering.nyu.edu/iselesni/nb_lowdiff/nb_lowdiff_slides.pdf
https://www.cns.nyu.edu/pub/lcv/farid03-reprint.pdf
https://www.degruyter.com/document/doi/10.1515/cdbme-2016-0048/html
https://www.analog.com/media/en/training-seminars/design-handbooks/MixedSignal_Sect6.pdf

FactChecker · Nov 13, 2021

rayj said:

Another consideration is origination of the data and how the derivatives will be used. If this is experimental data (as most all data is), there will be noise. It is typical for noise to be high frequency such as may be found with radio frequency interference on electrical transducers and on digitally computed results. When such noisy data goes into either of the derivative functions, the high frequency is amplified and may overwhelm the fundamental data.

This is a case where the two-sided estimate would tend to (slightly) mitigate the noise compared to the one-sided estimate (for equal ##\epsilon## values). It is averaging the function change over a larger change of x.

Swamp Thing · Nov 24, 2021

I thought I'd share this timestamped link into a video that YouTube just randomly recommended. It's a nice numerical differentiation trick that might interest those who have been following this thread.

valenumr · Nov 25, 2021

mgeorge001 said:

I think it is important to see the proposed formula more in light of classical "macroscopics", i.e. not trying to get at the highly accurate microscopic detail. I personally do not know whether the formula is of much use or not. We see the same "classical" problem of "loss of scale" that can haunt Newton's method for finding zeros. I only meant to point out that the proposal has some appeal in a "big picture" way. It is worth bearing in mind that classical Maxwell theory had to give way to quantum mechanics as ways of addressing microscopic behaviors became more relevant and accessible. But I don't think that implies "lack of utility" as there are instances where we want this sort of picture, and as we well know, Newton's method works pretty nicely in lots of cases despite grievous deficiencies.

Well, for sure you can have functions that just don't work well with basic numerical approaches. If it is something high frequency, for example, and you try to compute a derivative that should be at a maxima or minima, it won't work well with a standard approach, but might be better with a +/- dh. But I think that really is a more fundamental problem of scale and numerical accuracy in the simulation. I think it's reasonable (I haven't gotten deep on the math too much) that the extra error term (say h²) in a quadratic gets erased doing it this way, but in a lot of cases, this term will be much smaller than f(x) to the point that it is outside of machine precision. Obviously not always. But if it does become problem I don't think the method of taking the derivative is the issue. It would be better to scale the function appropriately if at all possible to get more accurate results.

pbuk · Nov 25, 2021

valenumr said:

I haven't gotten deep on the math too much

<sigh> then what is the point in commenting? Numerical analysis without the analysis is just guessing.

valenumr said:

the extra error term (say h²) in a quadratic gets erased doing it this way

No, I repeat, the point in a two-sided calculation is that the ## f''(x) ## term ~~gets erased~~ is eliminated. Where ## f''(x) \gg f'(x) ##, which is of course true at an extremum, this leads to better accuracy. However where this is not true the two-sided calculation can be less accurate, because you are doubling the effect of ## \epsilon ##.

mgeorge001 · Nov 25, 2021

valenumr said:

Well, for sure you can have functions that just don't work well with basic numerical approaches. If it is something high frequency, for example, and you try to compute a derivative that should be at a maxima or minima, it won't work well with a standard approach, but might be better with a +/- dh. But I think that really is a more fundamental problem of scale and numerical accuracy in the simulation. I think it's reasonable (I haven't gotten deep on the math too much) that the extra error term (say h²) in a quadratic gets erased doing it this way, but in a lot of cases, this term will be much smaller than f(x) to the point that it is outside of machine precision. Obviously not always. But if it does become problem I don't think the method of taking the derivative is the issue. It would be better to scale the function appropriately if at all possible to get more accurate results.

I think your point about scale is significant. Thanks. You have a mature perspective on this, that's obvious. A lot of this kind of work boils down to assumptions about small/small: That's a tricky 0/0 issue, but with a lot of work, like being able to reliably neglect say an h^2, I think numerical experts can address the difficulties meaningfully. The 0/0 problem vis a vis logs is equivalent to the infinity - infinity problem. Both enter the picture sometimes.

valenumr · Nov 26, 2021

pbuk said:

<sigh> then what is the point in commenting? Numerical analysis without the analysis is just guessing.No, I repeat, the point in a two-sided calculation is that the ## f''(x) ## term ~~gets erased~~ is eliminated. Where ## f''(x) \gg f'(x) ##, which is of course true at an extremum, this leads to better accuracy. However where this is not true the two-sided calculation can be less accurate, because you are doubling the effect of ## \epsilon ##.

Perhaps I misread or misinterpreted that as saying the term is eliminated when ## f''(x) ## is zero, which over an interval is just a straight line, and isn't that interesting. So I meant to say, and didn't express it well, that I'm pretty sure if ## f''(x) ## is some constant, then the answer should be exact (within machine rounding precision).

For example, you can think of the symmetrical derivation function in terms of the mean value theorem. It is essentially computing a secant line between two points on f. MVT says there is an x' between those two points whose tangent line is parallel to the secant. It doesn't necessarily mean that x' == x, but in the case of a parabola, you can show that for any choice of h, x' will always be equal to x, or in other words, the two endpoints of the secant line will always be equidistant from the value x whose tangent line is parallel.

So when I say I didn't go into the math more deeply, I just meant that I didn't make any an attempt to prove the statement: "if ## f''(x) ## is some constant over the evaluation interval, then the answer is exact" for all functions, but I think it is probably true.

Numerical computation of the derivative

Similar threads

Hot Threads

Recent Insights