A Calculation of probabilities assuming correlation between events

Bob Anger · Feb 17, 2025

Hi all,
I need your help with a probability problem for which I couldn’t find an answer. The topic was presented by a finance professor at MIT in a paper published a few years ago.

Here is the problem: a medical experiment aimed at developing a new drug has a 5% probability of success (so a 95% probability of failure). We consider conducting 150 IID experiments.
Using the binomial law, the probability of getting at least 3 successes out of 150 experiments is 98.18%. So far so good.
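
For reference, the independent case can be checked in a few lines. This is a minimal sketch using only the Python standard library; the function name `binomial_tail` is an illustrative choice, not something from the paper.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of P(X <= k - 1)."""
    below = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k))
    return 1.0 - below

# 150 independent experiments, 5% success probability, at least 3 successes
p_at_least_3 = binomial_tail(3, 150, 0.05)
print(f"{p_at_least_3:.4f}")  # 0.9818
```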

But here is where I have trouble: if we assume that failures among the 150 experiments are pairwise correlated at 10%, the probability of getting at least 3 successes declines to 89.3%, as computed by the professor. With a correlation of 40%, the probability drops to 56.1%, and so on.

Can someone please help me understand how to compute these probabilities under correlation, and which formulas to use? Despite a lot of searching, I couldn’t find much help, and the published paper doesn’t give details on the method used.
Many thanks in advance.

Regards,
Bob

jbergman · Feb 19, 2025

Do you have the name of the paper?

Bob Anger · Feb 19, 2025

jbergman said:

Do you have the name of the paper?

Yes, the paper is ‘Can Financial Economics Cure Cancer?’ by Professor Andrew W. Lo of MIT.

BvU · Feb 19, 2025

https://link.springer.com/article/10.1007/s11293-021-09704-7

jbergman · Feb 19, 2025

I searched around and found a paper that discusses this. https://arxiv.org/abs/physics/0605189

They mention some finance models for this which I am guessing are the ones Lo used based on his background in finance.

Bob Anger · Feb 20, 2025

jbergman said:

I searched around and found a paper that discusses this. https://arxiv.org/abs/physics/0605189

They mention some finance models for this which I am guessing are the ones Lo used based on his background in finance.

Thank you very much for your help, will take a close look at it.
Cheers

jbergman · Feb 20, 2025

Bob Anger said:

Thank you very much for your help, will take a close look at it.
Cheers

If you find the answer, I'd be interested in knowing about it. Interesting problem.

Bob Anger · Feb 25, 2025

jbergman said:

If you find the answer, I'd be interested in knowing about it. Interesting problem.

Hi,

Indeed, the topic is interesting.

I finally emailed Professor Lo, and one of his assistants kindly sent me 2 published documents related to another study conducted in 2013-2014 on Alzheimer's disease (AD). I believe the methodology is much the same as the one followed in the cancer paper. The study is about building a megafund that can finance the research & development of a portfolio of 64 drugs for AD. Documents are attached.

Basically, as you can see in the ‘Supplementary Information’ document, Professor Lo used linear algebra and a Monte Carlo simulation to generate the probabilities, considering either a single value or multiple values for the correlation (ρ). In the case of a single correlation value, the expressions derived are mainly useful for developing intuition, not for computation, as equi-correlated outcomes are rare.

The Monte Carlo simulation is based on several assumptions:

1. The construction of a 64x64 covariance matrix Σ: for the case of a single ρ, I believe that the diagonal is composed of 1s and the off-diagonal entries all equal the correlation value ρ

2. The computation of the Cholesky decomposition of this covariance matrix Σ, which is a sort of extension of the square-root operation to matrices

3. The construction of a column vector, denoted ε, containing generated IID standard normal random variables (in Excel, for example, I believe the formula is NORMINV(RAND(),0,1))

4. The multiplication of the Cholesky matrix by ε to generate a new vector Z

5. The computation of a column vector, denoted B, containing 0s and 1s: 0 where Zi is less than αi and 1 where Zi is greater. The values of αi are computed as the inverse standard normal distribution evaluated at (1-pi). In the case of a single probability of success, I understand that all the pi take the same value

6. Calculation of the probability of at least one success (see below)

The above process is repeated a large number of times, so that the generated entries of B follow a Bernoulli distribution. I’m guessing that the probabilities summarized in the table on page 10 are computed as an average over the simulations, but I didn’t find this point spelled out in the paper.
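
The six steps above can be sketched in plain Python for the single-ρ case. This is only an illustration of the procedure as described in this post, not the paper's code: the function names `cholesky` and `prob_at_least`, the seed, and the number of simulations are all assumptions made here, and the estimate is taken as the fraction of simulated draws with at least k successes.

```python
import math
import random
from statistics import NormalDist

def cholesky(sigma):
    """Lower-triangular L such that L times its transpose equals sigma."""
    n = len(sigma)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = sigma[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L

def prob_at_least(k, n, p, rho, n_sims, seed=0):
    """Monte Carlo estimate of P(at least k successes among n correlated trials)."""
    rng = random.Random(seed)
    # Step 1: unit diagonal, rho everywhere else
    sigma = [[1.0 if i == j else rho for j in range(n)] for i in range(n)]
    L = cholesky(sigma)                          # Step 2
    alpha = NormalDist().inv_cdf(1.0 - p)        # threshold used in step 5
    hits = 0
    for _ in range(n_sims):
        eps = [rng.gauss(0.0, 1.0) for _ in range(n)]                  # Step 3
        successes = 0
        for i in range(n):
            z = sum(L[i][j] * eps[j] for j in range(i + 1))            # Step 4
            if z > alpha:                                              # Step 5
                successes += 1
        if successes >= k:
            hits += 1
    return hits / n_sims                         # Step 6

# 64 projects, 5% success probability each, single rho of 10%, at least one hit
estimate = prob_at_least(k=1, n=64, p=0.05, rho=0.10, n_sims=2000)
print(f"estimated P(at least one success) = {estimate:.3f}")
```

With only 2,000 draws the estimate is noisy; raising `n_sims` narrows it at the cost of run time, and a vectorized linear-algebra library would be the usual choice for large runs.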

I’m struggling with the case of multiple correlations across the 64 drugs (any help would be highly appreciated!): the paper explains that a numerical algorithm developed by Qi and Sun was applied to compute the closest positive-definite matrix to the one specified manually (i.e. containing the 2016 pairwise correlations). The purpose of this computation is to avoid negative variances when the original matrix is factored to obtain the Cholesky decomposition. Despite a lot of searching for a simplified explanation of this algorithm (I'm not a mathematician) and of how to apply it to the initial covariance matrix, I wasn’t able to get the closest positive-definite matrix, so I cannot replicate the probability calculation at this point ☹
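
As a stopgap for experimenting, a hand-specified matrix can be repaired with a much simpler technique than the one the paper cites. The sketch below is not the Qi and Sun algorithm and does not return the closest positive-definite matrix: it shrinks the matrix toward the identity, (1 - t) Σ + t I, and uses bisection to find a small t for which a Cholesky factorization succeeds. The function names and the 3x3 example matrix are made up for illustration, and a unit diagonal is assumed.

```python
import math

def is_positive_definite(sigma, eps=1e-12):
    """True if a Cholesky factorization succeeds with strictly positive pivots."""
    n = len(sigma)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = sigma[i][i] - s
                if d <= eps:
                    return False
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return True

def blend(sigma, t):
    """(1 - t) * sigma + t * I: scales off-diagonals down, keeps a unit diagonal."""
    n = len(sigma)
    return [[(1.0 - t) * sigma[i][j] + (t if i == j else 0.0) for j in range(n)]
            for i in range(n)]

def shrink_to_positive_definite(sigma, tol=1e-6):
    """Return (repaired matrix, t) with t close to the smallest workable shrinkage."""
    if is_positive_definite(sigma):
        return [row[:] for row in sigma], 0.0
    lo, hi = 0.0, 1.0          # t = 1 gives the identity, which always works
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_positive_definite(blend(sigma, mid)):
            hi = mid
        else:
            lo = mid
    return blend(sigma, hi), hi

# Hypothetical inconsistent input: these three pairwise values cannot coexist
bad = [[1.0, 0.9, -0.9],
       [0.9, 1.0, 0.9],
       [-0.9, 0.9, 1.0]]
fixed, t = shrink_to_positive_definite(bad)
print(f"shrinkage t = {t:.3f}")
```

For this example the off-diagonal entries end up close to ±0.5. Shrinkage changes every correlation by the same factor, so the result will generally differ from what a nearest-matrix method returns for the same input.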

Would be happy to discuss all these points.

Cheers

jbergman · Feb 26, 2025

It looks like they just simulate random variables using a correlated multivariate normal distribution and then take functions of those random variables to get the Bernoulli random variables.

I am not sure that the final result has the correlations they claim. In other words, the correlation in the latent space may result in different correlations in the final output variables.

Anyway, I don't fundamentally have a problem with it, but I think the way they describe it in the previous paper is slightly misleading.
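
The gap between latent and output correlation can be checked numerically for a pair of outcomes. The sketch below assumes two standard normals with latent correlation ρ (0 ≤ ρ < 1), each thresholded as described earlier in the thread, and writes them with a shared factor M so that the joint success probability becomes a one-dimensional integral. The function name and the quadrature settings are illustrative choices.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def outcome_correlation(p, rho, steps=4000, width=8.0):
    """Correlation between two 0/1 outcomes obtained by thresholding two
    standard normals whose latent correlation is rho (0 <= rho < 1)."""
    alpha = STD_NORMAL.inv_cdf(1.0 - p)
    h = 2.0 * width / steps
    both = 0.0
    for i in range(steps + 1):
        m = -width + i * h                         # value of the shared factor M
        weight = 0.5 if i in (0, steps) else 1.0   # trapezoid rule end weights
        # success probability of one outcome given M = m
        cond = STD_NORMAL.cdf((math.sqrt(rho) * m - alpha) / math.sqrt(1.0 - rho))
        both += weight * STD_NORMAL.pdf(m) * cond * cond * h
    return (both - p * p) / (p * (1.0 - p))

for rho in (0.10, 0.40):
    print(f"latent rho = {rho:.2f} -> outcome correlation = "
          f"{outcome_correlation(0.05, rho):.4f}")
```

For positive ρ the correlation between the 0/1 outcomes comes out smaller than the latent ρ, so a stated "10% correlation" means different things depending on which level it refers to.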

Bob Anger · Feb 26, 2025

I agree, but it seems that randomization is somewhat inevitable for solving this type of problem.

When I first read the paper, I thought that the probability calculation assuming correlations would give an 'exact' value, as the binomial does in the IID case with no correlation, rather than a Monte Carlo estimate.
I ran 100,000 simulations for the scenario with a single correlation ρ of 10% and a probability of success p of 5%. I found a probability of 87% of getting at least one hit (vs. 84% in the paper). I think that with a high enough number of simulations, the value would approach the 84%.
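
A quick way to judge whether more simulations could close that gap is the standard error of a simulated proportion, sqrt(p(1-p)/N). The sketch below plugs in the figures quoted above; it assumes the draws are independent, and the function name is illustrative.

```python
import math

def mc_standard_error(p_hat, n_sims):
    """Standard error of a proportion estimated from n_sims independent draws."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n_sims)

se = mc_standard_error(0.87, 100_000)
gap = 0.87 - 0.84
print(f"standard error = {se:.4f}")
print(f"gap to the paper's figure = {gap / se:.0f} standard errors")
```

With 100,000 draws the standard error is roughly 0.001, so a three-point difference is far larger than sampling noise. That would point to a difference in the setup (for example, how ρ is defined) rather than to too few simulations.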

I searched for other strategies to solve the problem, and it seems that a Markov chain model could be an option for approximating the probability, but I am not familiar with the calculations that would have to be done at this stage. It would be interesting to compare the two results!
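
For the single-ρ case there is also a route that needs no simulation. If ρ is taken to be the correlation of the latent normals (an assumption; the thread does not settle what the paper's ρ refers to) and 0 ≤ ρ < 1, each Zi can be written as sqrt(ρ) M + sqrt(1-ρ) εi with a shared standard normal M. Given M, the outcomes are independent, so the answer is a binomial tail averaged over M. The sketch below does that average with a simple trapezoid rule; function names and grid settings are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(math.comb(n, j) * p**j * (1.0 - p)**(n - j) for j in range(k))

def prob_at_least_equicorrelated(k, n, p, rho, steps=2000, width=8.0):
    """P(at least k successes) when the latent normals share one correlation rho."""
    alpha = STD_NORMAL.inv_cdf(1.0 - p)
    h = 2.0 * width / steps
    total = 0.0
    for i in range(steps + 1):
        m = -width + i * h                         # value of the shared factor M
        weight = 0.5 if i in (0, steps) else 1.0   # trapezoid rule end weights
        # success probability of a single trial given M = m
        p_m = STD_NORMAL.cdf((math.sqrt(rho) * m - alpha) / math.sqrt(1.0 - rho))
        total += weight * STD_NORMAL.pdf(m) * binomial_tail(k, n, p_m) * h
    return total

for rho in (0.0, 0.10, 0.40):
    value = prob_at_least_equicorrelated(3, 150, 0.05, rho)
    print(f"rho = {rho:.2f}: P(at least 3 of 150) = {value:.4f}")
```

At ρ = 0 this reduces to the plain binomial result. Whether the other values match the paper's 89.3% and 56.1% depends on how the paper defines its correlation, so any difference should be read as a question about the model rather than as an error in either calculation.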
