Thanks a lot
@Stephen Tashi . This is very useful. I'll see if I can get you meaningful answers and will try to be more precise in my terminology.
Stephen Tashi said:
"Statistically valid" needs clarification. A formula that is applied to sample data for the purpose of estimating a population parameter is (naturally) called an "estimator". Phrases such as "valid estimator", "good estimator", or "best estimator" are ambiguous. They can be clarified by specifying: valid with respect to what property? best with respect to what property? etc.
Right. I was hoping for it to be less ambiguous. I was hoping that there was some consensus on how to treat sample variance of a ratio of two independent random variables.
Stephen Tashi said:
If you have independent samples of a random variable (such as I/R) then estimating the population mean by using the sample mean gives you an "unbiased" estimate of the population mean (i.e. the expected value of the estimator is equal to the true value of the parameter estimated). So estimating the population mean from the sample mean of I/R without any separate calculations on I and R is "good", in the sense of giving an unbiased estimate. Similarly, if you use the formula for the unbiased estimate of the variance ( https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation ) you get an unbiased estimate of the population variance of I/R without considering I and R separately. (However, the square root of that estimate is not, in general, an unbiased estimate of the standard deviation of the population.)
What you have seen people doing is not unreasonable, so I wouldn't declare it to be "statistically invalid".
Thank you.
So the approach I've seen used is not, to your mind, inherently or overly biased? That's good to know.
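To make sure I understand the "direct" approach, here is a minimal sketch (plain Python; the means, SDs, and replicate count are made up for illustration, not our real assay figures) of estimating the mean and unbiased variance straight from the I/R replicates:

```python
import random
import statistics

random.seed(1)

# Hypothetical replicates of one pooled plasma sample; the means/SDs
# here are invented for illustration, not our real assay figures.
I = [random.gauss(504.5, 20) for _ in range(40)]  # copies/ml
R = [random.gauss(500.0, 20) for _ in range(40)]  # copies/ml

ratios = [i / r for i, r in zip(I, R)]

mean_ratio = statistics.mean(ratios)     # unbiased estimate of E[I/R]
var_ratio = statistics.variance(ratios)  # unbiased (n-1 denominator)
sd_ratio = statistics.stdev(ratios)      # note: slightly biased for the SD

print(mean_ratio, var_ratio, sd_ratio)
```

(As you note, the square root at the end is where bias creeps back in.)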
Stephen Tashi said:
That said, it may be possible to find better estimators than the ones people have used. For example, an estimator is itself a random variable, so the estimator has its own population mean and variance - meaning the mean and variance of all estimates that, conceptually, can be made from all random samples of the population being measured. By doing a computation that treats I and R as separate variables, it might be possible to find a formula that gives an unbiased estimate of the population variance of I/R and has less variance as an estimator than the estimator based just on the I/R values.
This is beyond the scope of the project, I think. If we have time left we may dive a little into modelling this; it would be great if we could come up with a less biased estimator.
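In case we do get to modelling: a quick sanity check I put together (illustrative parameters, not assay-calibrated) of the first-order delta-method approximation for the variance of a ratio of independent variables, compared against a brute-force Monte Carlo estimate:

```python
import random
import statistics

random.seed(2)

# Illustrative parameters, not calibrated to our assay
mu_i, sd_i = 950.0, 30.0
mu_r, sd_r = 500.0, 25.0

# First-order (delta-method) approximation for independent I and R:
# Var(I/R) ~= (mu_i/mu_r)**2 * ((sd_i/mu_i)**2 + (sd_r/mu_r)**2)
approx = (mu_i / mu_r) ** 2 * ((sd_i / mu_i) ** 2 + (sd_r / mu_r) ** 2)

# Monte Carlo check. (Strictly, Var(I/R) does not exist for normal R,
# as pointed out above; with sd_r << mu_r the simulation never sees
# R near zero, so it behaves like the truncated case.)
n = 200_000
ratios = [random.gauss(mu_i, sd_i) / random.gauss(mu_r, sd_r) for _ in range(n)]
mc = statistics.variance(ratios)

print(approx, mc)  # should agree closely when sd_r << mu_r
```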
Stephen Tashi said:
If you are talking about population variances your gut feeling is correct. It depends on the specific distributions that I and R have. For example, if both I and R are independently normally distributed, the population variance of I/R doesn't exist because there is (theoretically) the possibility that we can get an R value arbitrarily close to zero.
However, we have to keep in mind that computations from the sample data are estimators, so the theoretical relation between the population variances of I and R and the variance of I/R is just a starting point for investigating estimators of those parameters.
As far as I know, there are no helpful results dealing with the ratio of two independent random variables as a generality. Making progress on your problem depends on studying a special case of some sort - two independent random variables with some specific type of distribution. What type of probability distribution is a good model for samples of I and R ?
I'm going to have to waste a few words on the distribution of I and R, probably because I'm unfamiliar with the precise terminology.
So as mentioned before, the values for I and R are entirely dependent. If I input more I, I will also input more R.
However, the idea is to make one large pool of DNA isolate from plasma and test it many times (30-60 replicates of the same sample). We are interested in the sample variance of I and R, not so much the mean values.
I expect the sample variances of I and R to be independent, and the distribution around the population mean to be normal.
Stephen Tashi said:
How do we define a healthy patient? Before getting into the details of I/R, we (at least "I") need to understand what the "bottom line" is for doing the test.
Beginning at the beginning, you can correct my model of the situation:
I'll imagine the blood plasma sample to be a set of cells. In my way of thinking, there could be varying degrees of "sickness" in the population of patients - some patients having a lot of "cancer cells" in their plasma and other sick patients having fewer. Is it (empirically) the case that the number of cells in the blood sample is so large that a sample from a sick patient can be assumed to have at least 1 "cancer cell"? Is the particular type of cancer being tested-for known to produce a certain fraction of "bad" cells in the body ( e.g. 100% ? 5% to 10%?).
Thank you for returning to my comfort zone. :)
Beyond the fact that there are no actual cells in the blood plasma (we lose them by centrifugation), the model is not bad. In fact we are looking at cell-free DNA. The model is that when cells anywhere in your body die by apoptosis, their DNA ends up in the blood circulation. This happens all the time in healthy individuals, leading to a certain background level of healthy DNA in healthy persons and patients alike.
In the case of cancer patients, some of the DNA in the circulation will naturally have come from the cancer cells. For early-stage cancers (non-metastasised, small tumours) this can be as little as 0.1% of the DNA. For late-stage disease (heavy metastasis, large tumours, etc.) this can be up to 80% or 90%. Obviously, in the latter case there will also be greatly elevated total levels of DNA in the bloodstream.
So to exemplify this a little bit:
Assume we have an individual where we have healthy background DNA on a level of 500 copies/ml plasma (this is a realistic ballpark figure).
The patient has a low tumour load, and 0.1% of his/her cell-free DNA comes from cancer cells.
In these cancer cells, the gene of interest (I) is present at 10-fold copy-number-gain when compared to reference R.
The population mean for R is simply 500 copies/ml.
The population mean for I is 500 + 500 * 0.1% * (10-1) = 504.5 copies/ml.
The population mean for I/R is 504.5/500 = 1.009.
Given the experienced technical variance of I and R, I don't realistically expect to be able to measure this difference significantly.
Assume we have another individual with the exact same level of background healthy DNA.
The patient has a moderate tumour load, and 10% of his/her cell-free DNA comes from cancer cells.
In these cancer cells, the gene of interest (I) is present at 10-fold copy-number-gain when compared to reference R.
The population mean for R is simply 500 copies/ml.
The population mean for I is 500 + 500 * 10% * (10-1) = 950 copies/ml.
The population mean for I/R is 950/500 = 1.9.
Given the experienced technical variance of I and R, I expect to be able to measure this difference significantly.
Obviously, we can't distinguish from this result alone whether the patient has 10% abundance and 10-fold copy-number-gain, or higher abundance and lower gain or vice versa.
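The arithmetic in these two examples, written as a tiny helper (nothing new, just the same formula) so that other abundance/gain combinations are easy to check:

```python
def expected_ratio(background, tumour_fraction, gain):
    """Expected I/R given background copies/ml, the fraction of
    cell-free DNA coming from tumour cells, and the copy-number
    gain of the gene of interest in those cells."""
    r = background
    i = background + background * tumour_fraction * (gain - 1)
    return i / r

# Example 1: low tumour load, 0.1% tumour DNA, 10-fold gain
print(expected_ratio(500, 0.001, 10))  # ~1.009

# Example 2: moderate tumour load, 10% tumour DNA, 10-fold gain
print(expected_ratio(500, 0.10, 10))   # ~1.9
```

It also makes the non-identifiability concrete: e.g. 10% abundance at 10-fold gain gives the same expected ratio as 18% abundance at 6-fold gain.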
The problem we are currently trying to investigate is: Where do we put the cut-off on what we can and can't distinguish from healthy persons? In an ideal world with no technical variance we could distinguish 500 from 504 no problem. In reality we can't. So where do we put the cut-off?
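One simple way I picture the cut-off question (my own sketch; the SD of 0.05 for the measured ratio and the 1% false-positive rate are illustrative assumptions): model the healthy ratio as Normal(1, sd), put the cut-off at the upper alpha quantile, and see which true ratios would be detected.

```python
from statistics import NormalDist

# Illustrative technical SD of the measured I/R ratio (not our real figure)
sd_ratio = 0.05
alpha = 0.01  # accepted false-positive rate for healthy samples

healthy = NormalDist(mu=1.0, sigma=sd_ratio)
cutoff = healthy.inv_cdf(1 - alpha)  # call "elevated" above this ratio
print(f"cut-off: {cutoff:.4f}")

# Detection probability for the two example patients above
for true_ratio in (1.009, 1.9):
    power = 1 - NormalDist(mu=true_ratio, sigma=sd_ratio).cdf(cutoff)
    print(f"true ratio {true_ratio}: detection probability {power:.3f}")
```

With these numbers the 1.009 patient is essentially undetectable while the 1.9 patient is always flagged, matching my expectation above; replication (averaging n replicates shrinks the SD by sqrt(n)) is what would move the cut-off down.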
Best regards and thanks again for any help given,
Daan