# Variance of 'concatenated' distributions

• query_ious
In summary, the data does not fit a binomial distribution with a constant p. Assuming that p varies between boxes, however, things work out well. This is due to the 'concatenation' (mixing) of two distributions with different p's, which gives a distribution that is broader than a single binomial with the same mean.
query_ious
The question below is motivated by a problem I'm observing in my experimental data.

I have m boxes, where each box is supposed to contain k molecules of mRNA. The measurement process includes labeling all the molecules with a box-specific tag, mixing them, amplifying them to detectable levels and deconvoluting based on tags. As tag-labeling is a lossy process with an estimated efficiency of 5-10%, and as we are essentially counting successes, a binomial model with n=k and 0.05<p<0.1 sounds fitting. Graphing variance/mean vs. mean shows that the data don't fit the straight line expected from a binomial with constant p. If we assume that p is variable, however, things work out well. (See bottom for data.)

The intuition is that we are 'concatenating' distributions. As an example, assume that m/2 boxes are sampled with p=0.05 and m/2 boxes with p=0.1. Then the distribution of all m boxes would be an 'overlap' of 2 binomials, each with the same n but different p. Intuitively I would expect the mean to be bounded between the means of the two distributions (as the mean is a point found between the two other points), while I would expect the variance to be larger than either of the two variances (as variance is correlated with the width of the distribution and the 'overlapping' width is by definition longer than each of its components). Simulations support my argument, but I don't know how to go about formalizing it and I don't know how to use this to assist in modeling the data (especially how to use this to estimate the variability in tagging efficiency).

What do you say?

The graph below shows real data (each point is generated from 24 boxes with equal k) versus the toy model suggested above (100 boxes with p=0.05 and 100 boxes with p=0.1 for different n).

[Edited for clarity + typos]

I don't see the image. Just a blank space.
Edit: Looks like some decoding problem. The image exists, but it does not get displayed correctly here. Downloading and opening it locally works.
How did you get the individual data points? Each one is one experiment with m boxes? How do they differ from each other?

Could p depend on m? That would allow better predictions.

For the simplified model with m/2 at p=0.05 and m/2 at p=0.10, variance should be 1/2 Var(k,p=0.05) + 1/2 Var(k,p=0.10) + (k*0.025)^2, where the last term is the squared deviation between the mean of the total distribution (k*0.075) and the means of the individual sample distributions (k*0.05 and k*0.10 respectively).
Making more complex models is possible in the same way, but you quickly end up fitting curves to get the best model.
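The simplified-model formula can be checked numerically. The following is a minimal sketch, not part of the original reply: the function names, the seed, and the number of simulated boxes are illustrative assumptions, and the binomial draws are built from the standard library by summing k Bernoulli trials.

```python
import random
import statistics


def binomial_variance(k, p):
    """Variance of a binomial with n = k trials and success probability p."""
    return k * p * (1 - p)


def predicted_mixture_variance(k, p_low=0.05, p_high=0.10):
    """Variance of a 50/50 mix of two binomials, per the formula in the post."""
    mean_total = k * (p_low + p_high) / 2
    # Weighted squared offsets of the component means from the overall mean;
    # for p = 0.05 and 0.10 this equals (k * 0.025) ** 2.
    spread = (0.5 * (k * p_low - mean_total) ** 2
              + 0.5 * (k * p_high - mean_total) ** 2)
    return (0.5 * binomial_variance(k, p_low)
            + 0.5 * binomial_variance(k, p_high)
            + spread)


def simulate_boxes(k, boxes_per_p, p_values=(0.05, 0.10), seed=0):
    """Simulate labeled-molecule counts for boxes_per_p boxes at each p."""
    rng = random.Random(seed)
    counts = []
    for p in p_values:
        for _ in range(boxes_per_p):
            # One box: k molecules, each labeled independently with probability p.
            counts.append(sum(rng.random() < p for _ in range(k)))
    return counts


if __name__ == "__main__":
    k = 100
    counts = simulate_boxes(k, boxes_per_p=5000)
    print("simulated mean:    ", statistics.fmean(counts))
    print("simulated variance:", statistics.pvariance(counts))
    print("predicted variance:", predicted_mixture_variance(k))
```

With many boxes the simulated variance should settle close to the predicted value; with only 24 boxes per data point, as in the experiment, the sample variance will scatter considerably around it.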

mfb said:
I don't see the image. Just a blank space.
I see it on my screen... not sure how to fix this.

mfb said:
How did you get the individual data points
The setup is a bit detailed but basically this is one big experiment where a mix of ~100 RNA types is pipetted into 24 boxes such that each box gets an equal amount of mix. Each box is labeled separately and then all boxes are pooled into one big box, amplified and separated algorithmically. Each data point in the graph is generated from a 24-member list consisting of the number of estimated molecules for a certain RNA type in all 24 boxes.

mfb said:
Could p depend on m?
I don't expect it to but it might vary between experimental batches. I tested this and didn't see much of a difference.

mfb said:
For the simplified model with m/2 at p=0.05 and m/2 at p=0.10, variance should be 1/2 Var(k,p=0.05) + 1/2 Var(k,p=0.10) + (k*0.025)^2
Would you mind sharing the derivation? I'm a bit lost...

Are the "amplification process" and the detection afterwards exact? If you start with n marked RNA sequences, you will always measure that number n?

The leftmost numbers would correspond to a mean of <1/16, or just one labeled RNA sequence from a single box out of 24? Then they should have a more discrete spacing, and the data samples should not have a well-defined variance.
And what about the data point at (4,3.6)? A mean of 16, a variance of ~200? Would be interesting to see the actual distribution for those outliers.

query_ious said:
Would you mind sharing the derivation? I'm a bit lost...
In general, if you mix two distributions with known means µ1 and µ2 and variances V1 and V2, and if a fraction p comes from distribution 1 and (1-p) from distribution 2 (here p is the mixing fraction, not the labeling efficiency), the new mean is
$$\mu = p\mu_1 + (1-p) \mu_2$$
And the variance is
$$V = p(V_1+(\mu-\mu_1)^2) + (1-p) (V_2+(\mu-\mu_2)^2)$$
The latter formula is very similar to the parallel axis theorem in mechanics, as you are interested in the second moment in both cases.
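As a worked example (added here as an illustration, not part of the original reply), the toy model can be plugged into these formulas. It assumes a mixing fraction of 1/2 and binomial components with n = k, so that each component has mean $kp_i$ and variance $kp_i(1-p_i)$:

$$\mu_1 = 0.05k,\quad \mu_2 = 0.10k,\quad V_1 = 0.0475k,\quad V_2 = 0.09k$$
$$\mu = \tfrac{1}{2}(0.05k) + \tfrac{1}{2}(0.10k) = 0.075k$$
$$V = \tfrac{1}{2}\left(0.0475k + (0.025k)^2\right) + \tfrac{1}{2}\left(0.09k + (0.025k)^2\right) = 0.06875k + 0.000625k^2$$

Dividing by the mean and substituting $k = \mu/0.075$ gives

$$\frac{V}{\mu} = \frac{0.06875}{0.075} + \frac{0.000625}{0.075^2}\,\mu \approx 0.917 + 0.111\,\mu$$

so under this toy model variance/mean rises linearly with the mean, rather than staying at the constant value $1-p$ that a single binomial would give.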

query_ious said:
I have m boxes, where each box is supposed to contain k molecules of mRNA. The measurement process includes labeling all the molecules with a box-specific tag, mixing them, amplifying them to detectable levels and deconvoluting based on tags.

What does it mean to "deconvolute based on tags"?

I don't understand what you are graphing. The graph apparently shows the mean (of something) versus its variance. What is the random variable being sampled?

What makes the data points different? Are they from batches of repetitions of the above process? Or does each data point represent the sample mean and variance of a count of molecules with a given "label" taken over repetitions of the above process?

## 1. What is the concept of 'concatenated' distributions?

In this thread, 'concatenated' distributions means pooling samples drawn from several different probability distributions into one data set, which is usually called a mixture distribution. Each observation comes from exactly one of the components, for example a binomial with p=0.05 for half the boxes and a binomial with p=0.1 for the other half.

## 2. How is the variance of 'concatenated' distributions calculated?

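The variance of 'concatenated' distributions is not just the weighted average of the individual variances. Each component contributes its own variance plus the squared distance between its mean and the overall mean, and these contributions are weighted by the fraction of samples coming from each component. For two components this is the formula V = p(V1+(µ-µ1)^2) + (1-p)(V2+(µ-µ2)^2) given in the thread.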
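The two-component calculation in the thread (weighted component variances plus weighted squared offsets of the component means) extends to any number of components. A minimal sketch, assuming the mixing weights and the component means and variances are known; the function name `mixture_mean_var` is a hypothetical helper, not something from the thread:

```python
def mixture_mean_var(weights, means, variances):
    """Mean and variance of a mixture with the given component statistics.

    weights   -- relative share of samples from each component
                 (normalized here, so they need not sum to 1)
    means     -- mean of each component
    variances -- variance of each component
    """
    total = sum(weights)
    w = [x / total for x in weights]
    mu = sum(wi * mi for wi, mi in zip(w, means))
    # Each component adds its own variance plus the squared offset of its
    # mean from the overall mean.
    var = sum(wi * (vi + (mi - mu) ** 2)
              for wi, mi, vi in zip(w, means, variances))
    return mu, var


# Toy model from the thread: half the boxes at p=0.05, half at p=0.10, n = k.
k = 100
ps = [0.05, 0.10]
mu, var = mixture_mean_var(
    weights=[1, 1],
    means=[k * p for p in ps],
    variances=[k * p * (1 - p) for p in ps],
)
print(mu, var)  # roughly 7.5 and 13.125 for k = 100
```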

## 3. What is the significance of the variance of 'concatenated' distributions?

The variance describes the spread of the combined distribution. In the setting of this thread, variance in excess of what a single binomial with constant p predicts is a sign that p differs between boxes, and the size of that excess depends on how far apart the component means are.

## 4. Are there any assumptions or limitations when dealing with 'concatenated' distributions?

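Yes. The formula in the thread assumes that each observation comes from exactly one component, that the mixing fractions are known and sum to 1, and that each component has a finite mean and variance. The components do not need to be identical, continuous, or unimodal: the thread's example mixes two discrete binomials with different p, and a mixture of well-separated components can have more than one peak.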

## 5. Can the concept of 'concatenated' distributions be applied to any type of data?

The mean and variance formulas hold for any component distributions with finite variance, whether discrete, continuous, or mixed. Whether a mixture is an appropriate model for a particular data set still has to be checked against the data.
