Calculating Shannon Entropy of DNA Sequences

AI Thread Summary
The discussion centers on calculating Shannon entropy for DNA sequences, specifically addressing two tasks. For task 1, the entropy is calculated using equal probabilities for the four bases, resulting in a value of 2. In task 2, there is confusion regarding the probabilities for GC and AT content, with a proposed calculation yielding an entropy of 1.94. Participants clarify that the probabilities should sum to 1 and that the bases should be treated individually rather than as pairs. Ultimately, the correct understanding of base representation and probability distribution is emphasized, leading to a clearer approach to the calculations.
GravityX
Homework Statement
Calculate the information content of a DNA base pair
Relevant Equations
##I(A)=-\sum\limits_{x\in A}P_x\log_2(P_x)##
Unfortunately, I have problems with the following task

[Attached image: Bildschirmfoto 2023-01-10 um 16.07.44.png (screenshot of the task)]

For task 1, I proceeded as follows. Since the four bases are equally likely, each has probability ##P=\frac{1}{4}##. I then simply inserted this probability into the formula for the Shannon entropy:

$$I=-\frac{1}{4}\log_2\left(\frac{1}{4}\right)-\frac{1}{4}\log_2\left(\frac{1}{4}\right)-\frac{1}{4}\log_2\left(\frac{1}{4}\right)-\frac{1}{4}\log_2\left(\frac{1}{4}\right)=2$$
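As a quick numerical check of the sum above, here is a minimal Python sketch. The function name `shannon_entropy` and the sum-to-1 validation are illustrative choices, not part of the task statement.

```python
import math

def shannon_entropy(probs):
    """Return the Shannon entropy, in bits, of a discrete distribution."""
    # A valid probability distribution must sum to 1.
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing (p * log2(p) -> 0 as p -> 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Four equally likely bases A, C, G, T.
print(shannon_entropy([0.25] * 4))  # 2.0
```

The validation step rejects any set of values that does not add up to 1, which is a useful sanity check before trusting the result.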

Unfortunately, I am not quite sure about task 2. The GC content indicates what proportion of the DNA is GC, so AT must make up 60 % and GC 40 %. Is the calculation then as follows?

$$I=-0.4\log_2(0.4)-0.4\log_2(0.4)-0.6\log_2(0.6)-0.6\log_2(0.6)=1.94$$
 
GravityX said:
Unfortunately, I am not quite sure about task 2, but a GC content indicates how high the proportion of GC is in the DNA, so it means that AT must be present at 60 % and GC at 40 %. Is the calculation then as follows:$$I=-0.4log_2(0.4)-0.4log_2(0.4)-0.6log_2(0.6)-0.6log_2(0.6)=1.94$$
Not familiar with the topic, but if the formula in 'Relevant Equations' is correct then, for task 2, the four probabilities should presumably be:
P(G) = 0.2 (i.e. not 0.4)
P(C) = 0.2 (i.e. not 0.4)
P(A) = 0.3 (i.e. not 0.6)
P(T) = 0.3 (i.e. not 0.6)
(They have to add up to 1.)
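Assuming these four values are the right ones, inserting them into the formula from 'Relevant Equations' would give (a worked sketch, rounded to three significant figures):

$$I=-2\cdot 0.2\log_2(0.2)-2\cdot 0.3\log_2(0.3)\approx 0.929+1.042\approx 1.97$$

This is slightly below the value of 2 obtained for four equally likely bases.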
 
Not sure I understand the first question. It lists A, C, G and T as the four "base pairs". No, they are bases, the pairs being GC, AT.
But treating it as a matter of base pairs would yield ##-\frac 12\log_2(\frac 12)-\frac 12\log_2(\frac 12)=1##.
The question ought to read "for a single DNA base" (i.e., per base in a single strand).

This may be what fooled you into using 0.4 and 0.6 instead of 0.2 and 0.3 in the second question. But note that we can only get 0.2 and 0.3 by assuming that the orientations of the base pairs (which base is in which strand) are independent. One could imagine some sort of autocorrelation instead.
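To make the independence assumption concrete, here is a minimal Python sketch. The helper names `base_probs_from_gc` and `entropy_bits` are hypothetical, and splitting each pair's share evenly between its two bases is exactly the assumption stated above, not something the task guarantees.

```python
import math

def base_probs_from_gc(gc):
    """Per-base probabilities from the GC fraction.

    Assumes the orientation of each base pair is independent, so each
    pair's share is split evenly between its two bases.
    """
    at = 1.0 - gc
    return {"G": gc / 2, "C": gc / 2, "A": at / 2, "T": at / 2}

def entropy_bits(probs):
    """Shannon entropy in bits of a dict mapping base -> probability."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

probs = base_probs_from_gc(0.4)       # 40 % GC content
print(round(entropy_bits(probs), 3))  # 1.971
```

Under this assumption the result equals 1 bit (which base of the pair is on the strand) plus the entropy of the GC-versus-AT choice, so it reaches its maximum of 2 bits at 50 % GC content.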

As to whether it would be surprising, that might depend on whether we also consider the relative stabilities of the base pairs and the scheme that maps codons to amino acids.
 

haruspex said:
Not sure I understand the first question. It lists A, C, G and T as the four "base pairs". No, they are bases, the pairs being GC, AT.
By convention, when writing base-pair sequences, to be concise we use a single letter for each base pair. The letter is the base present on the ‘forward’ strand. See link below.

With this convention:
‘A’ represents adenine-thymine on the double strand.
‘C’ represents cytosine-guanine on the double strand.
‘G’ represents guanine-cytosine on the double strand.
‘T’ represents thymine-adenine on the double strand.

For example (using lower case for the bases), the sequence TATAGC represents the double strand:
tatagc
atatcg
From https://www.futurelearn.com/info/courses/bacterial-genomes-bioinformatics/0/steps/47002:
“Despite being a double helix of complementary DNA sequences, DNA is almost always represented as a single sequence.”
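The single-letter convention can be sketched in a few lines of Python. The name `paired_strand` is hypothetical, and the second strand is written position by position under the forward strand, as in the TATAGC example, rather than as a reverse complement.

```python
# Watson-Crick pairing: A<->T and C<->G (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def paired_strand(forward):
    """Return the bases paired with each position of the forward strand."""
    return forward.translate(COMPLEMENT)

forward = "tatagc"
print(forward)                 # tatagc
print(paired_strand(forward))  # atatcg
```

Because the second strand is fully determined by the first, writing only the forward strand loses no information.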
 
Steve4Physics said:
By convention, when writing base-pair sequences, to be concise we use a single letter for each base pair. The letter is the base present on the ‘forward’ strand.
Ok, thanks.
 
Thank you Steve4Physics and haruspex for your help 👍, I had completely forgotten that the 0.4 was for the pair and not one base alone.
 