# DNA sequence modeled as 4 faced die

bowlbase

## Homework Statement

I have a DNA sequence generated by L throws of a 4 faced die with probabilities ##\pi_A, \pi_C, \pi_G, \pi_T##. Each probability is unknown. Task: estimate the probability of each side of the die. Hint: use a random variable defined by the sequence that has a binomial distribution then use the likelihood maximization.

## The Attempt at a Solution

So, as always with these problems, my first attempt is wildly incorrect despite making perfect sense to me. My naive approach would be to count the number of each of A, C, G, T within the sequence and divide by the sequence length to get an approximate probability for that letter at any position within the sequence. But this doesn't use either hint.

The other way I would do this is to treat each letter as a separate random variable, giving 4 binomial distributions. Each would give me the probability distribution of the count of a particular letter in the sequence. Then for each binomial distribution I would take the derivative and set it to 0 to solve for the probability.

For instance: ##F(\pi_A)=\binom{L}{N_A}\pi_A^{N_A}(1-\pi_A)^{L-N_A}##
Taking the log of both sides, differentiating with respect to ##\pi_A##, and setting the result to 0:
##0=\frac{N_A}{\pi_A}- \frac{L-N_A}{1-\pi_A}##

I'm about to be late for class, but at first glance this looks like I made a mistake or that my method is wrong here as well. The idea is that I do this for all four letters.

Thanks for the help.
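For reference, the counting approach described in the post above can be sketched in a few lines of Python. The function name `estimate_base_probs` and the example sequence are made up for illustration; neither comes from the thread.

```python
from collections import Counter

def estimate_base_probs(sequence):
    """Relative frequency of each base: (count of base) / (sequence length)."""
    counts = Counter(sequence)
    length = len(sequence)
    # Counter returns 0 for a base that never appears, so every base gets an entry.
    return {base: counts[base] / length for base in "ACGT"}

# Hypothetical 10-letter sequence: 3 A, 2 C, 2 G, 3 T.
probs = estimate_base_probs("ACGTTGCAAT")
print(probs)
```

Because the four counts add up to the sequence length, the four estimates produced this way add up to 1 by construction.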

I liked your first method better, but if you insist on using something like the second method, you should use the appropriate distribution. In this case you have 4 possible outcomes at each "toss", so a binomial would not be appropriate (unless you look at two outcomes only, such as "A" or "not-A", etc.). Much better, I think, would be to use the multinomial distribution (see, e.g., http://en.wikipedia.org/wiki/Multinomial_distribution or http://stattrek.com/probability-distributions/multinomial.aspx ). So, the probability of outcome counts ##n_A,n_C,n_G,n_T## for a given total ##N = n_A + n_C + n_G + n_T## is
$$\text{probability} = f(p_A,p_C,p_G,p_T) \equiv \frac{N!}{n_A! n_C! n_G! n_T!} p_A^{n_A} p_C^{n_C} p_G^{n_G} p_T^{n_T}$$
A maximum-likelihood estimator of the ##p_i## would be given by solving the constrained optimization problem
$$\begin{aligned} &\text{maximize} \;\; f(p_A,p_C,p_G,p_T),\\ &\text{subject to} \;\; p_A + p_C + p_G + p_T = 1 \;\; \text{and} \;\; p_A, p_C, p_G, p_T \geq 0 \end{aligned}$$
This could be tackled by the Lagrange multiplier method (provided that we neglect the "##\geq 0##" constraints). I will let you worry about whether or not you get the same final solution as given by your first, simple, method.
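A sketch of how the Lagrange multiplier calculation might go, under the assumptions that every count ##n_i## is positive and that the "##\geq 0##" constraints are ignored as suggested above. The multiplier ##\lambda## and the function ##\Lambda## are notation introduced here, not taken from the thread. Since the logarithm is increasing, maximizing ##\ln f## gives the same maximizer as maximizing ##f##:

$$\Lambda(p,\lambda) = \ln\frac{N!}{n_A!\, n_C!\, n_G!\, n_T!} + \sum_{i \in \{A,C,G,T\}} n_i \ln p_i \;-\; \lambda\Big(\sum_{i} p_i - 1\Big)$$

$$\frac{\partial \Lambda}{\partial p_i} = \frac{n_i}{p_i} - \lambda = 0 \;\;\Longrightarrow\;\; p_i = \frac{n_i}{\lambda}$$

$$\sum_i p_i = \frac{N}{\lambda} = 1 \;\;\Longrightarrow\;\; \lambda = N, \qquad \hat{p}_i = \frac{n_i}{N}$$

The resulting ##\hat{p}_i## are automatically non-negative, so dropping the "##\geq 0##" constraints does no harm in this sketch.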

Ray Vickson said:
a binomial would not be appropriate (unless you look at two outcomes only, such as "A" or "not-A", etc.)
Sounds like a good method to me. Isn't it a lot simpler than handling all four at once?

Yes, provided that the resulting 4 probs add to 1; I will let the OP worry about whether or not that will happen.

My idea was to handle each letter as either A or Not-A, as you mentioned, since I was explicitly given the hint to use the binomial and likelihood maximization. I don't believe we ever discussed the multinomial distribution, but I can see how it works here.

I think that, since the problem is really just looking for an estimate, it will be okay if the sum is not exactly 1.

I suggest that before deciding this one way or another, you carry out the complete solution for your maximum-likelihood estimate of ##\pi_A## in terms of ##N_A## and ##L##. Then, of course, you would estimate ##\pi_C, \pi_G, \pi_T## using the same formula, but with ##N_A## replaced by ##N_C, N_G, N_T##.

I finally found time to sit and finish this, sorry it took so long. So, I found the maximum with the method I described initially and got the exact same result as if I had just taken the first "simple" method I thought of. It seems my first instinct was correct.

Thanks for the help.
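For readers following along, the remaining algebra can be sketched as follows, starting from the stationarity condition in the first post and using the same symbols. Clearing the denominators:

$$0=\frac{N_A}{\pi_A}- \frac{L-N_A}{1-\pi_A} \;\;\Longrightarrow\;\; N_A(1-\pi_A) = (L-N_A)\,\pi_A \;\;\Longrightarrow\;\; N_A = L\,\pi_A \;\;\Longrightarrow\;\; \hat{\pi}_A = \frac{N_A}{L}$$

The second derivative of the log-likelihood, ##-\frac{N_A}{\pi_A^2} - \frac{L-N_A}{(1-\pi_A)^2}##, is negative for ##0 < \pi_A < 1##, so this stationary point is a maximum. Repeating the calculation for the other three letters and using ##N_A + N_C + N_G + N_T = L##:

$$\hat{\pi}_A + \hat{\pi}_C + \hat{\pi}_G + \hat{\pi}_T = \frac{N_A + N_C + N_G + N_T}{L} = 1$$

so the four separate binomial estimates do add to 1.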

bowlbase said:
I found the maximum with the method I described initially and got the exact same result as if I had just taken the first "simple" method I thought of. It seems my first instinct was correct.
Oh yes, that was always going to be the answer, but I thought the object of the exercise was to derive it from the maximum likelihood method.

Well, the ML method described in class was to take the binomial distribution's derivative, set it to 0 and solve for the probability. So that is the method I used. Is this not correct?

Yes, it's correct. I was responding to this:
bowlbase said:
got the exact same result as if I had just taken the first "simple" method I thought of
My point was that the naive approach you described (counting the number of each of A, C, G, T within the sequence and dividing by the sequence length) was always going to give the right answer. According to the hint, you were to use ML to get the answer, which you have done.

Oh, okay. I was worried for a second that I had done something else wrong. Thanks!

The MLE from the binomial (applied four times to the four different ##p##s) gives the same probabilities as the MLE from the multinomial, applied once to all four ##p##s simultaneously. That is a nice fact, because IF they had given different results that would have been a real source of worry.
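The agreement described above can also be checked numerically. The following is a minimal sketch, not a proof: it finds each binomial maximizer by grid search and then confirms that random points on the probability simplex never give a higher multinomial log-likelihood than the relative frequencies. All function names, the grid size, and the example sequence are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def binomial_mle_grid(n_hits, n_trials, steps=10000):
    """Grid-search maximizer of the binomial log-likelihood ("X" vs "not-X")."""
    def loglik(p):
        return n_hits * math.log(p) + (n_trials - n_hits) * math.log(1 - p)
    grid = [k / steps for k in range(1, steps)]  # skip p = 0 and p = 1
    return max(grid, key=loglik)

def multinomial_loglik(counts, probs):
    """Multinomial log-likelihood, dropping the constant N!/(n_A! ...) term."""
    return sum(counts[b] * math.log(probs[b]) for b in counts)

def random_simplex_point(rng):
    """Random positive probabilities for A, C, G, T that sum to 1."""
    weights = [rng.random() + 1e-9 for _ in "ACGT"]
    total = sum(weights)
    return {b: w / total for b, w in zip("ACGT", weights)}

sequence = "ACGTTGCAAT" * 10  # hypothetical data: 30 A, 20 C, 20 G, 30 T
counts = Counter(sequence)
L = len(sequence)

freq = {b: counts[b] / L for b in "ACGT"}                        # simple counting
grid_est = {b: binomial_mle_grid(counts[b], L) for b in "ACGT"}  # four binomials

rng = random.Random(0)
best = multinomial_loglik(counts, freq)
never_beaten = all(
    multinomial_loglik(counts, random_simplex_point(rng)) <= best
    for _ in range(1000)
)
print(freq, grid_est, never_beaten)
```

The grid search only locates each maximizer to within the grid spacing, so the comparison with `freq` should be made with a tolerance rather than exact equality.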

## 1. What is a DNA sequence modeled as a 4 faced die?

A DNA sequence modeled as a 4 faced die represents the four nucleotide bases (adenine, cytosine, guanine, and thymine) found in DNA as four sides of a die. Each base is assigned to a number on the die (1 for adenine, 2 for cytosine, 3 for guanine, and 4 for thymine).

## 2. How is a DNA sequence modeled as a 4 faced die used in scientific research?

This model is used in computational biology to simulate and analyze DNA sequences. By assigning numerical values to each base, researchers can perform statistical analysis and simulations to better understand the properties and functions of DNA sequences.

## 3. How does the 4 faced die model simulate DNA mutations?

In the 4 faced die model, a point mutation can be represented by replacing the outcome at one position of the sequence with a different face of the die. This simulates a change in the nucleotide base at a specific position in the DNA sequence.

## 4. Can the 4 faced die model be applied to all DNA sequences?

Yes, the 4 faced die model can be applied to any DNA sequence, as it is based on the four nucleotide bases found in the DNA of all living organisms. However, the model may need to be modified for specific purposes, such as replacing thymine with uracil for RNA sequences.

## 5. Are there any limitations to using the 4 faced die model for DNA sequences?

One limitation of this model is that it simplifies the complex structure and functions of DNA. It treats every position as an independent throw of the same die, so it ignores correlations between neighboring bases, and it does not take into account other factors such as epigenetic modifications and interactions with proteins. Therefore, the model should be used in conjunction with other methods for a more comprehensive understanding of DNA sequences.
