Thanks a lot
@Stephen Tashi . This is very useful. I'll see if I can get you meaningful answers and will try to be more precise in my terminology.
Stephen Tashi said:
"Statistically valid" needs clarification. A formula that is applied to sample data for the purpose of estimating a population parameter is (naturally) called an "estimator". Phrases such as "valid estimator", "good estimator", or "best estimator" are ambiguous. They can be clarified by specifying: valid with respect to what property? best with respect to what property? etc.
Right. I was hoping for it to be less ambiguous. I was hoping that there was some consensus on how to treat sample variance of a ratio of two independent random variables.
Stephen Tashi said:
If you have independent samples of a random variable (such as I/R) then estimating the population mean by using the sample mean gives you an "unbiased" estimate of the population mean (i.e. the expected value of the estimator is equal to the true value of the parameter estimated). So estimating the population mean from the sample mean of I/R without any separate calculations on I and R is "good", in the sense of giving an unbiased estimate. Similarly, if you use the formula for the unbiased estimate of the variance ( https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation ) you get an unbiased estimate of the population variance of I/R without considering I and R separately. (However, the square root of that estimate is not, in general, an unbiased estimate of the standard deviation of the population.)
What you have seen people doing is not unreasonable, so I wouldn't declare it to be "statistically invalid".
Thank you.
So the approach I've seen used is not, to your mind, inherently or overly biased? That's good to know.
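To make sure I understand the "direct" approach, here is a minimal sketch (plain Python; the means, SDs, and replicate count are made up for illustration, not our real assay figures) of estimating the mean and unbiased variance straight from the I/R replicates:

```python
import random
import statistics

random.seed(1)

# Hypothetical replicates of one pooled plasma sample; the means/SDs
# here are invented for illustration, not our real assay figures.
I = [random.gauss(504.5, 20) for _ in range(40)]  # copies/ml
R = [random.gauss(500.0, 20) for _ in range(40)]  # copies/ml

ratios = [i / r for i, r in zip(I, R)]

mean_ratio = statistics.mean(ratios)     # unbiased estimate of E[I/R]
var_ratio = statistics.variance(ratios)  # unbiased (n-1 denominator)
sd_ratio = statistics.stdev(ratios)      # note: slightly biased for the SD

print(mean_ratio, var_ratio, sd_ratio)
```

(As you note, the square root at the end is where bias creeps back in.)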
Stephen Tashi said:
That said, it may be possible to find better estimators than the ones people have used. For example, an estimator is itself a random variable, so the estimator has its own population mean and variance - meaning the mean and variance of all estimates that, conceptually, can be made from all random samples of the population being measured. By doing a computation that treats I and R as separate variables, it might be possible to find a formula that gives an unbiased estimate of the population variance of I/R and has less variance as an estimator than the estimator based just on the I/R values.
This is beyond the scope of the project, I think. If we have time left we may dive a little into modelling this; it would be great if we could come up with a less biased estimator.
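In case we do get to modelling: a quick sanity check I put together (illustrative parameters, not assay-calibrated) of the first-order delta-method approximation for the variance of a ratio of independent variables, compared against a brute-force Monte Carlo estimate:

```python
import random
import statistics

random.seed(2)

# Illustrative parameters, not calibrated to our assay
mu_i, sd_i = 950.0, 30.0
mu_r, sd_r = 500.0, 25.0

# First-order (delta-method) approximation for independent I and R:
# Var(I/R) ~= (mu_i/mu_r)**2 * ((sd_i/mu_i)**2 + (sd_r/mu_r)**2)
approx = (mu_i / mu_r) ** 2 * ((sd_i / mu_i) ** 2 + (sd_r / mu_r) ** 2)

# Monte Carlo check. (Strictly, Var(I/R) does not exist for normal R,
# as pointed out above; with sd_r << mu_r the simulation never sees
# R near zero, so it behaves like the truncated case.)
n = 200_000
ratios = [random.gauss(mu_i, sd_i) / random.gauss(mu_r, sd_r) for _ in range(n)]
mc = statistics.variance(ratios)

print(approx, mc)  # should agree closely when sd_r << mu_r
```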
Stephen Tashi said:
If you are talking about population variances your gut feeling is correct. It depends on the specific distributions that I and R have. For example, if both I and R are independently normally distributed, the population variance of I/R doesn't exist because there is (theoretically) the possibility that we can get an R value arbitrarily close to zero.
However, we have to keep in mind that computations from the sample data are estimators, so the theoretical relation between the population variances of I and R and the variance of I/R is just a starting point for investigating estimators of those parameters.
As far as I know, there are no helpful results dealing with the ratio of two independent random variables as a generality. Making progress on your problem depends on studying a special case of some sort - two independent random variables with some specific type of distribution. What type of probability distribution is a good model for samples of I and R ?
I'm going to have to waste a few words on the distribution of I and R, probably because I'm unfamiliar with the precise terminology.
So as mentioned before, the values for I and R are entirely dependent. If I input more I, I will also input more R.
However, the idea is to make one large pool of DNA isolate from plasma and test it many times (30-60 replicates of the same sample). We are interested in the sample variance of I and R, not so much the mean values.
I expect the sample variances of I and R to be independent, and the distribution around the population mean to be normal.
Stephen Tashi said:
How do we define a healthy patient? Before getting into the details of I/R, we (at least "I") need to understand what the "bottom line" is for doing the test.
Beginning at the beginning, you can correct my model of the situation:
I'll imagine the blood plasma sample to be a set of cells. In my way of thinking, there could be varying degrees of "sickness" in the population of patients - some patients having a lot of "cancer cells" in their plasma and other sick patients having fewer. Is it (empirically) the case that the number of cells in the blood sample is so large that a sample from a sick patient can be assumed to have at least 1 "cancer cell"? Is the particular type of cancer being tested-for known to produce a certain fraction of "bad" cells in the body ( e.g. 100% ? 5% to 10%?).
Thank you for returning to my comfort zone. :)
Beyond the fact that there are no actual cells in the blood plasma (we lose them by centrifugation), the model is not bad. In fact we are looking at cell-free DNA. The model is that when cells anywhere in your body die by apoptosis, their DNA ends up in the blood circulation. This happens all the time in healthy individuals, leading to a certain background level of healthy DNA in healthy persons and patients alike.
In the case of cancer patients, some of the DNA in the circulation will naturally have come from the cancer cells. For early-stage cancers (non-metastasised, small tumours) this can be as little as 0.1% of the DNA. For late-stage disease (heavy metastasis, large tumours, etc.) this can be up to 80% or 90%. Obviously, in the latter case there will also be greatly elevated total levels of DNA in the bloodstream.
So to exemplify this a little bit:
Assume we have an individual where we have healthy background DNA on a level of 500 copies/ml plasma (this is a realistic ballpark figure).
The patient has a low tumour load, and 0.1% of his/her cell-free DNA comes from cancer cells.
In these cancer cells, the gene of interest (I) is present at 10-fold copy-number-gain when compared to reference R.
The population mean for R is simply 500 copies/ml.
The population mean for I is 500 + 500 * 0.1% * (10-1) = 504.5 copies/ml.
The population mean for I/R is 504.5/500 = 1.009.
Given the experienced technical variance of I and R, I don't realistically expect to be able to measure this difference significantly.
Assume we have another individual with the exact same level of background healthy DNA.
The patient has a moderate tumour load, and 10% of his/her cell-free DNA comes from cancer cells.
In these cancer cells, the gene of interest (I) is present at 10-fold copy-number-gain when compared to reference R.
The population mean for R is simply 500 copies/ml.
The population mean for I is 500 + 500 * 10% * (10-1) = 950 copies/ml.
The population mean for I/R is 950/500 = 1.9.
Given the experienced technical variance of I and R, I expect to be able to measure this difference significantly.
Obviously, we can't distinguish from this result alone whether the patient has 10% abundance and 10-fold copy-number-gain, or higher abundance and lower gain or vice versa.
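The arithmetic in these two examples, written as a tiny helper (nothing new, just the same formula) so that other abundance/gain combinations are easy to check:

```python
def expected_ratio(background, tumour_fraction, gain):
    """Expected I/R given background copies/ml, the fraction of
    cell-free DNA coming from tumour cells, and the copy-number
    gain of the gene of interest in those cells."""
    r = background
    i = background + background * tumour_fraction * (gain - 1)
    return i / r

# Example 1: low tumour load, 0.1% tumour DNA, 10-fold gain
print(expected_ratio(500, 0.001, 10))  # ~1.009

# Example 2: moderate tumour load, 10% tumour DNA, 10-fold gain
print(expected_ratio(500, 0.10, 10))   # ~1.9
```

It also makes the non-identifiability concrete: e.g. 10% abundance at 10-fold gain gives the same expected ratio as 18% abundance at 6-fold gain.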
The problem we are currently trying to investigate is: Where do we put the cut-off on what we can and can't distinguish from healthy persons? In an ideal world with no technical variance we could distinguish 500 from 504 no problem. In reality we can't. So where do we put the cut-off?
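One simple way I picture the cut-off question (my own sketch; the SD of 0.05 for the measured ratio and the 1% false-positive rate are illustrative assumptions): model the healthy ratio as Normal(1, sd), put the cut-off at the upper alpha quantile, and see which true ratios would be detected.

```python
from statistics import NormalDist

# Illustrative technical SD of the measured I/R ratio (not our real figure)
sd_ratio = 0.05
alpha = 0.01  # accepted false-positive rate for healthy samples

healthy = NormalDist(mu=1.0, sigma=sd_ratio)
cutoff = healthy.inv_cdf(1 - alpha)  # call "elevated" above this ratio
print(f"cut-off: {cutoff:.4f}")

# Detection probability for the two example patients above
for true_ratio in (1.009, 1.9):
    power = 1 - NormalDist(mu=true_ratio, sigma=sd_ratio).cdf(cutoff)
    print(f"true ratio {true_ratio}: detection probability {power:.3f}")
```

With these numbers the 1.009 patient is essentially undetectable while the 1.9 patient is always flagged, matching my expectation above; replication (averaging n replicates shrinks the SD by sqrt(n)) is what would move the cut-off down.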
Best regards and thanks again for any help given,
Daan