Calculating distance between origin and subsequent points

In summary, the problem is to find an equation for the expected distance between a starting point and each subsequent point on a 2-dimensional surface, where each step has a variable length with a maximum of Lmax. The attempted solution uses Pythagoras' theorem and a simulation of the process in Excel. The discussion also touches on the distribution of the step-length fraction x and the possible effect of different strand conformations on the expected step length.
  • #1
DaanV
Greetings helpful gentlemen and gentle women!

The following is a problem I ran into during my internship. The background of my problem is biological/chemical in nature, so I’ll try to bother you as little as possible with the details.

Thanks in advance for any help given.

Homework Statement


I have a point p0 on a 2-dimensional surface. The next point (p1) is at distance L1 from p0. The next point p2 is at distance L2 from p1. This continues, up to p28.

I would like to find an equation to calculate the expected distance between the origin and each subsequent point (d1, d2, etc), as a direct function of the number of steps (s) between the origin and that point.

The step length L between each two subsequent points is assumed to be variable, but with a limited maximum value Lmax. L = x*Lmax, where x is a random (or at least poorly understood) variable between 0 and 1.

Homework Equations


My dear friend Pythagoras:
[tex]A^2+B^2=C^2[/tex]
Also:
[tex]\sin\left(\alpha\right)^2+\cos\left(\alpha\right)^2=1[/tex]

The Attempt at a Solution


The distance d1 (between p0 and p1) is obviously trivial: It is L1.

The angle between d1 and the line p1-p2 is called a2.

d2 is then calculated as:
[tex]d_2^2=(L_1+L_2\cos(a_2))^2+(L_2\sin(a_2))^2[/tex]
[tex]d_2^2=L_1^2+L_2^2+2L_1L_2\cos(a_2)[/tex]

d3 is calculated in a very similar manner, with a3 the angle between d2 and the line p2-p3:

[tex]d_3^2=(d_2+L_3\cos(a_3))^2+(L_3\sin(a_3))^2[/tex]
[tex]d_3^2=d_2^2+L_3^2+2d_2L_3\cos(a_3)[/tex]

In general, it can be said that:

[tex]d_s^2=d_{s-1}^2+L_s^2+2\,d_{s-1}L_s\cos(a_s)[/tex]

Now, this equation requires that I always calculate d(s-1) before I can calculate ds. I would like to find a direct equation, if that is at all possible. I am aware that for any single iteration of this process one needs to know all previous steps, but I would say that there is some expected value if the process is repeated often enough.

To establish this, I used Excel to simulate this process 12,000 times and noted the average distance between the origin and ps after s steps. What resulted was a clear trend of average distance versus step number, with very little scatter (due to the large number of iterations). This to me seems like a clear indicator that there is indeed some expected value for ds, but I can't seem to nail down exactly how I could calculate it.
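
(For reference, the 12,000-trial simulation can also be sketched outside Excel. The snippet below is a minimal Python version; the uniform distribution for x on [0, 1] and the uniform step direction are placeholder assumptions of mine, not established properties of the process.)

[code]
import numpy as np

rng = np.random.default_rng(seed=0)

L_MAX = 1.0        # maximum step length (arbitrary units)
N_STEPS = 28       # steps per walk, as in the problem
N_TRIALS = 12_000  # repetitions, matching the Excel simulation

# Placeholder assumption: x uniform on [0, 1], direction uniform on [0, 2*pi).
x = rng.uniform(0.0, 1.0, size=(N_TRIALS, N_STEPS))
angle = rng.uniform(0.0, 2.0 * np.pi, size=(N_TRIALS, N_STEPS))
steps = np.stack([x * L_MAX * np.cos(angle), x * L_MAX * np.sin(angle)], axis=-1)

# Positions p_1 .. p_28 relative to the origin p_0, and the distances d_s.
positions = np.cumsum(steps, axis=1)
d = np.linalg.norm(positions, axis=-1)

mean_d = d.mean(axis=0)            # estimate of E(d_s)
mean_d2 = (d ** 2).mean(axis=0)    # estimate of E(d_s^2)

for s in (1, 2, 5, 10, 28):
    print(f"s={s:2d}  E(d_s)~{mean_d[s - 1]:.3f}  E(d_s^2)~{mean_d2[s - 1]:.3f}")
[/code]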


Again, thanks in advance for any help provided. You would really help me out if you could point me in the right direction.

Regards,

DaanV
 
  • #2
The expected squared distance (or its square root) would be easier to calculate exactly: you can simply add the contributions of the individual steps (and take the root of the final result).
In the limit of many steps (28 is not really that limit, but not so far away; I don't know which precision you need), you get a Gaussian distribution around the origin, where the expected squared distance is easy to calculate. That also makes it possible to calculate the expected absolute distance.

DaanV said:
L = x*Lmax, where x is a random (or at least poorly understood) variable between 0 and 1.
Do you know the distribution of x, or at least something about it? Or do you know all the lengths of a process, if you want to calculate it? In that case, what is left to average over?
 
  • #3
Suppose you are some distance r from the origin after n steps. The next step is a distance uniformly distributed from 0 to L, in any direction (each direction equally likely), right? Can you compute the average distance from the origin after that step?
(Are you familiar with Brownian motion?)
 
  • #4
I thank you both kindly for your help. I also apologize for the slowness of this reply: I got caught up in some other stuff that needed immediate attention.


So far, I have calculated ds for every value of s in [1,28], 12k times for each value.

I notice that calculating ds and then averaging over all values does not yield the same results as calculating ds^2, averaging and then taking the root. This makes sense, since the root of the average is not equal to the average of the roots. This makes me wonder: Which is the correct way to calculate?

When calculating the root of the average, it follows that:
[itex]\overline{d_s^2}=L_{max}^2\,\overline{x^2}\,s[/itex]

or

[itex]\sqrt{\overline{d_s^2}}=L_{max}\sqrt{\overline{x^2}}\,\sqrt{s}[/itex]

(These results were derived by inspecting the values obtained after computing ds^2 manually. I'm not sure how I can mathematically show that this is indeed the case. Help?)

As for the distribution of x.. I'm afraid I don't know a lot about it. As said, I'm trying to describe a rather poorly understood process:
Basically I've got a strand of DNA of a certain length (corresponding to Lmax) that is covalently attached to a surface at one end. Using various chemical and physical manipulations, it is possible to have the strand form a 'bridge', so that its other end also gets attached to the surface, at some distance from the first. This is the distance that I'm trying to describe.

It's possible that this is a normal/Gaussian distribution, with mean equal to 0.5*Lmax.
Perhaps it is more likely though that the 'bridge' has a preference for the least-stressed conformation, which would be when it forms a perfect semi-circle. This would mean that the expected step length L would become Lmax*2/π.
It's also possible that the strand folds back upon itself (forming a loop). Once, or twice, or several times. This would greatly decrease the expected step length L.

In short, there are many distributions possible. Having little experience with distributions, perhaps one of you could point me in the right direction? One thing that is for certain though is that x is a value between 0 and 1 (so the infinite tails of the normal distribution are impossible, right..?).

I hope this post is a bit clear. Any help or insight granted would be very much appreciated.

Kind regards
 
  • #5
Hmm, then I think (p3-p2) and (p2-p1) (as an example) are not independent - as a large difference would mean a large bend in the strand.

DaanV said:
Perhaps it is more likely though that the 'bridge' has a preference for the least-stressed conformation, which would be when it forms a perfect semi-circle. This would mean that the expected step length L would become Lmax*2/π.
I don't think it is possible to determine this without chemistry.
Do you need a theoretical argument for a distribution, or just a description of the distribution?

If you have some experimental distribution, can you show it?
 
  • #6
So there are two parts to the problem:
1. A distribution for the length of each step.
2. Combining the steps to obtain the expected total displacement.
mfb, why do you say the step lengths would not be independent (given the distribution)?
DaanV said:
Perhaps it is more likely though that the 'bridge' has a preference for the least-stressed conformation, which would be when it forms a perfect semi-circle.
That works if the endpoints are attached orthogonally to the surface. Is there a reason to suppose that? If they are attached in any direction with equal likelihood, then the PDF for the angle θ the strand makes with the surface is cos(θ). Assuming a perfect circular arc, that gives a step length L sin(θ)/θ.
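
(A quick numerical check of this arc model, treating it purely as a hypothesis: with the angle PDF cos(θ) on [0, π/2] and chord length Lmax·sin(θ)/θ, the expected fraction x = L/Lmax is a one-line integral. Note that θ = π/2 reproduces the 2/π semicircle case mentioned earlier.)

[code]
import numpy as np
from scipy.integrate import quad

# Hypothetical arc model: attachment angle theta has PDF cos(theta) on [0, pi/2],
# and a circular arc of length L_max gives a chord (step) of L_max * sin(theta)/theta.
pdf = lambda t: np.cos(t)
chord_fraction = lambda t: np.sinc(t / np.pi)   # sin(t)/t, well-defined at t = 0

expected_x, _ = quad(lambda t: pdf(t) * chord_fraction(t), 0.0, np.pi / 2)
print(f"E(L)/L_max under the arc model: {expected_x:.3f}")
[/code]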
 
  • #7
haruspex said:
mfb, why do you say the step lengths would not be independent (given the distribution)?
Hmm, now I am unsure how to translate "steps" to DNA. I thought it would be one piece of the DNA strand, but the Lmax description seems to indicate that the whole piece is one step. Why do we have 28 steps then, each one starting at the end of the previous one?
 
  • #8
mfb said:
Hmm, now I am unsure how to translate "steps" to DNA. I thought it would be one piece of the DNA strand, but the Lmax description seems to indicate that the whole piece is one step. Why do we have 28 steps then, each one starting at the end of the previous one?
Yes, I'm a bit confused about that too. Moreover, it seems likely that there would be physical interference between successive steps. If one step comes in at a steep angle, the next might be likely to go out at a shallow one. And there may well be a positive correlation in the direction taken by consecutive steps.
Glossing over that, suppose the step lengths are iid with mean μ and variance σ², and the orientations are independent and uniform on 0 to 2π. If the distance after n steps is ##s_n## in some direction, and the next step is length d at angle θ to that, ##s_{n+1}^2 = s_n^2 -2ds_n\cos(\theta) + d^2##. Clearly ##E(s_n^2) = n E(d^2) = n(\mu^2+\sigma^2)##. For sufficiently large n, we can show ##E(s_n) = \sqrt{n(\mu^2+\sigma^2)/2}##, so ##\operatorname{var}(s_n) = E(s_n)^2##.
 
  • #9
Note: See further down for an explanation of the process, if you’re interested in the background information.

haruspex said:
Glossing over that, suppose the step lengths are iid with mean μ and variance σ², and the orientations are independent and uniform on 0 to 2π. If the distance after n steps is ##s_n## in some direction, and the next step is length d at angle θ to that, ##s_{n+1}^2 = s_n^2 -2ds_n\cos(\theta) + d^2##.
Agreed. This is exactly the same as what I put in the opening post of this topic; I only added rather than subtracted the cosine term. Overall though this should give the same expected distance, as the cosine is equally likely to be positive or negative, am I right?

haruspex said:
Clearly ##E(s_n^2) = n E(d^2) = n(\mu^2+\sigma^2)##.
I agree on the first part. It's what I realized in my second post. Is there any mathematical proof that I can bring forward to demonstrate this? The expected audience of my internship report is (like myself) not very strong in mathematics.

I don't immediately see the second part of your equation though. Basically you're saying that ##E(d^2) = \mu^2+\sigma^2##. Why is this? Certainly if d = μ ± σ, then ##E(d^2) = \mu^2 \pm 2\mu\sigma + \sigma^2##? Or am I being overly simplistic in my definition of d?
Sorry if I'm being silly and asking very very basic questions here. As you may have noticed, I am not very big into mathematics. ;)
You don't have to take the time to explain all the details in here. Linking me to some place that explains it will do. Thanks in advance.

haruspex said:
For sufficiently large n, we can show ##E(s_n) = \sqrt{n(\mu^2+\sigma^2)/2}##, so ##\operatorname{var}(s_n) = E(s_n)^2##.
Again, my mathematics skills let me down. Where does the factor 1/2 come from in ##E(s_n)##?

Also, what do you mean by ##\operatorname{var}(s_n)##? Is it the expected variance of ##s_n##? Why is it equal to ##n(\mu^2+\sigma^2)/2##?

Again, thanks for all the help given so far. Thanks also for any patience displayed in the future. Thanks for everything. :)


mfb said:
Hmm, now I am unsure how to translate "steps" to DNA. I thought it would be one piece of the DNA strand, but the Lmax description seems to indicate that the whole piece is one step. Why do we have 28 steps then, each one starting at the end of the previous one?

Ok, I understand the confusion. I purposely left out the chemical details of the process so as not to bother you with it. Here comes a brief description of the process though:

A single strand of DNA (the origin) is attached to a surface at the start of the process. This strand has a known piece of sequence (comprised of the DNA nucleotides G, A, T and C) on either end. We will call these ends A and B for now. The piece of DNA in between A and B has an unknown sequence, and can be of variable length.

Now, let's assume that the original strand is attached to the surface with its A side. The entire surface surrounding this strand is filled with sequences complementary to A and B, covalently attached to the surface. (Agreed, this immediately means that the distribution isn't perfect and the step length and direction can't be iid. For now we'll make this assumption though, as the surface is very crowded with these complementary fragments)

Using some chemical and physical processes, the B side of the original strand can be made to hybridize to its complement on the surface, forming what I have previously called a bridge. Next the original strand is copied (while in bridge form), using the B that is attached to the surface as a starting point (‘primer’). This creates a double-stranded bridge between p0 and p1. In the next step, these two strands are denatured, giving two single strands pointing upwards into the solution. One is at p0 (the original strand), the other is at p1 (the copy). This whole process (bridging, copying, denaturing) is called a cycle.

During the next cycle, this process is repeated, so that both strands are copied. The resulting 4 strands (the origin, 2 at one step length away and 1 at a double step length away) are copied again in the next cycle, etc., up to 28 cycles. Under perfect circumstances this gives 2^n strands with a binomial distribution over the number of 'steps' that each strand has made. The resulting group of fragments is called a cluster.

I wish to model this process. I have already incorporated the distribution over the number of steps, including with imperfections in the process (it was found that DNA fragments aren't always copied at 100% efficiency). What's next is to try and find a plausible way to describe the influence of the relative fragment length on the resulting size of the cluster. Obviously, longer fragments lead to bigger clusters. But how much bigger? Does the area covered by the cluster double when the fragment gets twice as big? Does the radius double?

Since clusters don't really have a defined boundary (they simply get less intense near the edges), I figured that the average distance of all fragments to the origin would be a good measure to describe cluster size. Hence my quest to try and calculate the expected distance that a fragment will have covered after s steps.
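
(As a sketch of how those pieces could be combined: the snippet below mixes the perfect-case binomial distribution over step counts with a per-step-count distance measure. Here the root-mean-square distance √k·√E(L²) is used as a stand-in for the average distance of a k-step fragment; a Monte Carlo estimate like the earlier one could be dropped in instead, and the copying-efficiency correction is left out.)

[code]
import numpy as np
from scipy.stats import binom

N_CYCLES = 28
L_RMS = 1.0   # sqrt(E(L^2)), i.e. the RMS step length, in arbitrary units

# Perfect-case weights: fraction of strands sitting k steps from the origin
# after N_CYCLES cycles (the binomial distribution described above).
k = np.arange(N_CYCLES + 1)
weights = binom.pmf(k, N_CYCLES, 0.5)

# Stand-in distance measure for a k-step fragment: RMS distance sqrt(k) * L_RMS.
distance_k = np.sqrt(k) * L_RMS

cluster_size = np.sum(weights * distance_k)
print(f"weighted mean distance: {cluster_size:.3f} (in units of the RMS step length)")
[/code]

Because every term scales linearly with the RMS step length, this particular measure of cluster size scales linearly with the step length under these assumptions: doubling the step length doubles the radius-like measure and roughly quadruples the covered area.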

I'm well aware that I've made some assumptions in here that could prove detrimental to my model: So far I have assumed infinitely many 'bridging spots' available to every fragment during any cycle. I'm afraid it falls beyond the scope of my internship to take the limits into account though.

As for actual data to back up my model: I'm working on it, but it's proving hard to retrieve the data. Suffice to say that the clusters are really tiny: Usually we get about 800k+ clusters per square millimeter. The machine that interprets these clusters images them, analyzes the data through some convoluted algorithm and then deletes the images.

We have intercepted a few images before they were deleted. We can clearly see bigger and smaller clusters, as expected, since we put in both longer and shorter fragments. In theory we can tell which cluster belongs to which type of fragment, as we have the coordinates on the image per index: each fragment type was given its own index ('barcode').

In practice though it's proving hard to fit these coordinates onto the images, as there are several layers of offsets and tilting in the output data that we don't know how to implement. We will need help from someone from the company that makes these machines so that we may learn how to make this fit. Once we know which fragment length belongs to which cluster, we may be able to analyze the cluster size on the image and couple that to the fragment length.


Soo.. That became a bit of a long post. I hope it helps to clear stuff up, rather than confuse people and scare them away. As said, feel free to ask further questions or clarifications on specifics.
 
  • #10
DaanV said:
Agreed. This is exactly the same as what I put in the opening post of this topic. I only added rather than subtracted the second term. Overall though this should give the same expected distance, as the cosine is equally likely to be positive or negative, am I right?
Right.

I agree on the first part. It's what I realized in my second post. Is there any mathematical proof that I can bring forward to demonstrate this? The expected audience of my internship report is (like myself) not very strong in mathematics.
It is the same concept as with μ and σ. When you expand the square you get a cos term; its average is zero, so the expected squared distances of the individual steps simply add.
If you don't want to calculate it, just mention that it can be calculated?
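If it does help for the report, the argument can be written out in two lines (a sketch, using the recursion from post #1 together with the assumption that the step direction is uniform and independent of the previous position, so the cosine term averages to zero):
[tex]E\left(d_s^2\right)=E\left(d_{s-1}^2\right)+E\left(L_s^2\right)+2\,E\left(d_{s-1}L_s\right)\underbrace{E\left(\cos a_s\right)}_{=0}=E\left(d_{s-1}^2\right)+E\left(L_s^2\right)[/tex]
[tex]\Rightarrow\quad E\left(d_s^2\right)=\sum_{i=1}^{s}E\left(L_i^2\right)=s\,E\left(L^2\right)=s\,L_{max}^2\,E\left(x^2\right)[/tex]
The last equality assumes every step draws its length from the same distribution; it recovers the empirical relation from post #4.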

I don't immediately see the second part of your equation though. Basically you're saying that ##E(d^2) = \mu^2+\sigma^2##. Why is this? Certainly if d = μ ± σ, then ##E(d^2) = \mu^2 \pm 2\mu\sigma + \sigma^2##?
Same thing again: write d = μ + (d − μ). The expectation value of the cross term 2μ(d − μ) is zero, because the deviation from the mean averages to zero.
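Spelled out (using that the deviation from the mean averages to zero by definition, while its squared average is by definition the variance):
[tex]E\left(d^2\right)=E\left((\mu+(d-\mu))^2\right)=\mu^2+2\mu\underbrace{E(d-\mu)}_{=0}+\underbrace{E\left((d-\mu)^2\right)}_{=\sigma^2}=\mu^2+\sigma^2[/tex]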

Again, my mathematics skills let me down. Where does the factor 1/2 come from in ##E(s_n)##?
I guess that was a mistake.

Also, what do you mean by ##\operatorname{var}(s_n)##? Is it the expected variance of ##s_n##?
It is just the variance. The variance is an expected value (for the squared distance).


Okay... to describe the cluster size, the expected squared distance looks fine. We can relate the expected squared distance per step to the expected squared distance after n steps (see haruspex), and you have a model for the expected number of strands n steps away from the origin (for n = 0 to 28). That lets you relate the cluster size to μ and σ, but I guess it is sufficient to use (μ² + σ²) as a (strand-dependent) parameter.

With the few images you got, it could be possible to estimate this parameter for your strands.
 
  • #11
Thanks for your quick reply and kind help, mfb. :)

mfb said:
DaanV said:
Again, my mathematics skills let me down. Where does the factor 1/2 come from in ##E(s_n)##?
I guess that was a mistake.
Right. Is it then just a coincidence that ##E(s_n) = \sqrt{E(s_n^2)}##? Normally I'd say that ##E(s_n) = \sqrt{E(s_n)^2}##, right?

mfb said:
DaanV said:
Also, what do you mean by ##\operatorname{var}(s_n)##? Is it the expected variance of ##s_n##?
It is just the variance. The variance is an expected value (for the squared distance).
Okay.. Haruspex says that ##\operatorname{var}(s_n) = E(s_n)^2##.
Let's say for a moment that ##E(s_n) = 5##; then according to this the variance would be 25. Does that make any sense?

mfb said:
Okay... to describe the cluster size, the expected squared distance looks fine. We can relate the expected squared distance per step to the expected squared distance after n steps (see haruspex), and you have a model for the expected number of strands n steps away from the origin (for n = 0 to 28). That lets you relate the cluster size to μ and σ, but I guess it is sufficient to use (μ² + σ²) as a (strand-dependent) parameter.
My thoughts exactly.

mfb said:
With the few images you got, it could be possible to estimate this parameter for your strands.
That is the idea, yes.
As hinted at briefly in my previous post though, it's proving to be quite hard to figure out exactly what cluster in the image belongs to what fragment length.

More detailed explanation (in case you're interested):
What we get is an image of the clusters (as said, generally 800,000 per square millimeter) that lie virtually on top of each other, and indeed more often than not they overlap. When two clusters do overlap, the algorithm uses the difference in DNA sequence in the first few bases to see if it can differentiate between the two. As long as it can distinguish two different clusters, it will be able to produce results. If they overlap too much (resulting in a mixed signal) it will throw away the signal from both.

For the clusters that it can recognize (usually >95%, but if there are too many (or too large!) clusters, this drops extremely fast, to as low as 10%), it creates a separate file in which it records the coordinates of each cluster and the information that it obtains for that cluster. Among this information is the index/barcode of each cluster.

Now the most straightforward approach would be to overlay the coordinates with the image, and then see which barcode ends up at each cluster, and then seeing if longer fragments indeed lead to bigger clusters (and by how much). However, it's the overlaying of the coordinates that is proving difficult, as there are multiple separate files that hold information on 'tilting', 'offsets', 'subtile offsets', etc. We have no clue how (and in what order) to implement these files. We are in the process of contacting the tech support to see if they can help us out.

Anyway, that's a part where I don't think you can help me out. Unless you can come up with a way to find (an indication for) cluster size for each fragment type without having to overlay the coordinates with the images and estimating cluster size from there.
 
  • #12
DaanV said:
haruspex said:
For sufficiently large n, we can show ##E(s_n) = \sqrt{n(\mu^2+\sigma^2)/2}##, so ##\operatorname{var}(s_n) = E(s_n)^2##.
Where does the factor 1/2 come from?
Since you ask:
##s_{n+1} = \sqrt{s_n^2-2d s_n \cos(\theta) + d^2} = s_n\sqrt{1-2d \cos(\theta)/s_n + (d/s_n)^2}##
## ≈ s_n(1 - d \cos(\theta)/s_n + (d/s_n)^2/2 - (2d \cos(\theta)/s_n)^2/8) ## (binomial theorem) ##= s_n - d \cos(\theta) + (d^2/s_n)(1 - \cos^2(\theta))/2= s_n - d \cos(\theta) + (d^2/s_n)\sin^2(\theta)/2 ##
##\int_0^{2\pi}\sin^2(\theta)\,d\theta = \frac 12\int_0^{2\pi}(1-\cos(2\theta))\,d\theta = \pi##, so averaging over θ:
##E(s_{n+1}) = E(s_n) + E(d^2/s_n)/4##
I shall write D as shorthand for ##\sqrt{E(d^2)}##. If ##E(s_n) = AD\sqrt{n}## for some constant A then ##E(s_{n+1}-s_n) = AD(\sqrt{n+1}-\sqrt{n}) \approx AD/(2\sqrt{n})##.
Ok, there's a bit of a fudge here equating ##E(d^2/s_n)## with ##E(d^2)/E(s_n)##, but it leads to A = 1/√2.
So we have ##E(s_n) = D\sqrt{n/2}##, ##E(s_n)^2 = D^2 n/2##, ##\operatorname{var}(s_n) = D^2 n/2##, ##E(s_n^2) = E(s_n)^2 + \operatorname{var}(s_n) = D^2 n##.
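Filling in the step where the 1/√2 appears: equate the two expressions for ##E(s_{n+1})-E(s_n)## (the increment of ##AD\sqrt{n}## on one side, the ##E(d^2/s_n)/4## term on the other, with the fudge ##E(d^2/s_n)\approx E(d^2)/E(s_n)##):
[tex]\frac{AD}{2\sqrt{n}}\approx\frac{E(d^2)}{4\,AD\sqrt{n}}=\frac{D}{4A\sqrt{n}}\quad\Rightarrow\quad A^2=\frac12,\qquad A=\frac1{\sqrt2}[/tex]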
mfb said:
The variance is an expected value (for the squared distance).
You mean, it's the expected value of the square of the difference from the average.
That's enough for this post...
 
  • #13
DaanV said:
and the information that it obtains for this cluster
Does that include the cluster size? :p.

It might be possible to work with relative coordinates, if you can match some cluster structure in the image to a coordinate structure in your files, but that is just a guess.
 
  • #14
@Haruspex: Needless to say, I couldn't follow your calculations. Instead I resorted to Excel, my hero in dark days.

Calculating ##s_n## as well as ##s_n^2## 12k times yields that ##E(s_n^2) = n\,E(d^2)##, as was predicted.

However, ##E(s_n) \neq \sqrt{n\,E(d^2)/2}##.

Going by the equation ##E(s_n^2) = E(s_n)^2 + \operatorname{var}(s_n)## (which did fit with my data), it's possible to say that if ##\operatorname{var}(s_n)## is some fraction A of ##E(s_n^2)##, then ##E(s_n)^2## must be ##(1-A)\,E(s_n^2)##.

By your calculations, you say that A = 0.5 (since you say that ##\operatorname{var}(s_n) = E(s_n)^2##).
However, from my model it appears that A ≈ 0.21. I have calculated it for various values of d, as well as various levels of variation over d (represented by a linear distribution between 0 and some value dmax).

Do you have any explanations for this? Could it be a mistake in your calculations?

@mfb: Ha!
No, sadly not. That would have made life much too easy!
 
  • #15
DaanV said:
@Haruspex: Needless to say, I couldn't follow your calculations. Instead I resorted to Excel, my hero in dark days.

Calculating ##s_n## as well as ##s_n^2## 12k times yields that ##E(s_n^2) = n\,E(d^2)##, as was predicted.

However, ##E(s_n) \neq \sqrt{n\,E(d^2)/2}##.

Going by the equation ##E(s_n^2) = E(s_n)^2 + \operatorname{var}(s_n)## (which did fit with my data), it's possible to say that if ##\operatorname{var}(s_n)## is some fraction A of ##E(s_n^2)##, then ##E(s_n)^2## must be ##(1-A)\,E(s_n^2)##.

By your calculations, you say that A = 0.5 (since you say that ##\operatorname{var}(s_n) = E(s_n)^2##).
However, from my model it appears that A ≈ 0.21. I have calculated it for various values of d, as well as various levels of variation over d (represented by a linear distribution between 0 and some value dmax).

Do you have any explanations for this? Could it be a mistake in your calculations?

@mfb: Ha!
No, sadly not. That would have made life much too easy!

If you want the expected value of
[tex]s_n = \| \vec{X_1} + \vec{X_2} + \cdots + \vec{X_n} \| [/tex] you will not find an exact "analytical" expression, because
[tex] s_n = \sqrt{\left(\vec{X_1} + \vec{X_2} + \cdots + \vec{X_n}\right)\cdot
\left(\vec{X_1} + \vec{X_2} + \cdots + \vec{X_n}\right)}[/tex]
and we cannot take the expectation inside the square root. However, you may be able to find some approximations that are reasonable for, say, large n, etc.

You can find the "root mean square" distance ##\sqrt{ E s_n^2 }## without much difficulty in the case you cite, where each ##\vec{X}## has known probability distribution for magnitude and uniform distribution of angle, but that is much different from the problem of trying to find ##E s_n##. I believe the expressions haruspex gave were for an approximation, not the exact value.
 
  • #16
DaanV said:
However, from my model it appears to be that A≈0.21.
Yes, it would seem that my approximation is not accurate enough. Need to get more of a handle on the distribution of ##s_n##. I suggest that your Monte Carlo approach is the best for now.
 
  • #17
Seems that the distribution for Brownian motion is a 2-D Gaussian. So if r is the distance from the origin after time t, then the PDF for r is ##\lambda r e^{-\lambda r^2/2}##, where the width ##1/\sqrt{\lambda}## grows as √t. That gives ##E(r^2) = \frac 4{\pi} E(r)^2##.
We have ##E(s_n^2) = nD^2## (writing ##D^2 = E(d^2)## as before), so assuming the same ratio, ##E(s_n) = D\sqrt{n\pi}/2 \approx 0.8862\,D\sqrt{n}##, in accordance with observation.
This means we can, if necessary, write down the actual distribution of ##s_n## for large n.
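For completeness, the two moments of that PDF (it is a Rayleigh distribution; the integrals are standard) are
[tex]E(r)=\int_0^\infty r\,\lambda r\,e^{-\lambda r^2/2}\,dr=\sqrt{\frac{\pi}{2\lambda}},\qquad E(r^2)=\int_0^\infty r^2\,\lambda r\,e^{-\lambda r^2/2}\,dr=\frac{2}{\lambda}[/tex]
[tex]\Rightarrow\quad\frac{E(r^2)}{E(r)^2}=\frac{4}{\pi},\qquad\frac{\operatorname{var}(r)}{E(r^2)}=1-\frac{\pi}{4}\approx 0.215[/tex]
The last number is the fraction A, in good agreement with the A ≈ 0.21 found in the Excel simulation in post #14.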
 

1. How do you calculate the distance between the origin and a given point on a coordinate plane?

The distance between the origin (0,0) and a point (x,y) on a coordinate plane can be calculated using the Pythagorean theorem: d = √((x-0)^2 + (y-0)^2). This formula takes the difference between the x-coordinates and the y-coordinates of the two points, squares them, adds them together, and then takes the square root of the sum to find the distance.
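
A minimal illustration in Python (the point (3, 4) is just an example):

[code]
import math

# Distance from the origin (0, 0) to the point (3, 4): sqrt(3**2 + 4**2) = 5.0
x, y = 3.0, 4.0
print(math.hypot(x, y))
[/code]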

2. What is the difference between Euclidean distance and Manhattan distance?

Euclidean distance, also known as straight-line distance, is the shortest distance between two points. It is calculated using the Pythagorean theorem and takes into account both the horizontal and vertical distances between points. Manhattan distance, on the other hand, only considers the horizontal and vertical distances between points and is calculated by adding the absolute differences of the coordinates. In other words, Manhattan distance is the shortest path between two points if you can only move horizontally or vertically.

3. Can the distance between the origin and a point be negative?

No, the distance between two points is always a positive value. The distance formula uses the Pythagorean theorem, which involves squaring the differences between the coordinates. Since squared values are always positive, the final result will also be positive.

4. How can the distance between multiple points be calculated?

The distance between multiple points can be calculated using the same formula as the distance between the origin and a single point, applied to each pair in turn. For example, if you have three points A, B, and C, you would find the distance between A and B, then between B and C, and add those two distances together to get the total path length from A to C via B. Note that this is the length of the path through B, not the straight-line distance from A to C, which would be computed directly from the coordinates of A and C.

5. What are some real-world applications of calculating the distance between points?

Calculating the distance between points is used in various fields, such as navigation, surveying, and GPS tracking. It can also be used in sports to measure the distance covered by athletes or the distance between players on a field. In computer science, distance between points is used in data mining and clustering algorithms to group data points based on their proximity to each other.
