Calculating Shannon Entropy of DNA Sequences

AI Thread Summary
The discussion centers on calculating Shannon entropy for DNA sequences, specifically addressing two tasks. For task 1, the entropy is calculated using equal probabilities for the four bases, resulting in a value of 2. In task 2, there is confusion regarding the probabilities for GC and AT content, with a proposed calculation yielding an entropy of 1.94. Participants clarify that the probabilities should sum to 1 and that the bases should be treated individually rather than as pairs. Ultimately, the correct understanding of base representation and probability distribution is emphasized, leading to a clearer approach to the calculations.
GravityX
Homework Statement
Calculate the information content of a DNA base pair
Relevant Equations
##I=-\sum\limits_{x\in\{A,C,G,T\}}P_x\log_2(P_x)##
Unfortunately, I have problems with the following task

[Attachment: Bildschirmfoto 2023-01-10 um 16.07.44.png (screenshot of the task statement, image not reproduced)]

For task 1, I proceeded as follows. Since the four bases have the same probability, each has ##P=\frac{1}{4}##. I then simply used this probability in the formula for the Shannon entropy:

$$I=-\frac{1}{4}\log_2\left(\frac{1}{4}\right)-\frac{1}{4}\log_2\left(\frac{1}{4}\right)-\frac{1}{4}\log_2\left(\frac{1}{4}\right)-\frac{1}{4}\log_2\left(\frac{1}{4}\right)=2$$
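The task 1 sum can also be checked numerically. The following is a minimal sketch, not part of the original post; the helper name `shannon_entropy` is a hypothetical choice, and it assumes the entropy is measured in bits (base-2 logarithm), as in the formula above.

```python
import math

def shannon_entropy(probabilities):
    """Return the Shannon entropy, in bits, of a discrete distribution."""
    probabilities = list(probabilities)
    # A valid probability distribution must sum to 1.
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 are skipped: p*log2(p) tends to 0 as p -> 0.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Task 1: the four bases A, C, G, T are equally likely.
uniform_bases = [0.25, 0.25, 0.25, 0.25]
print(shannon_entropy(uniform_bases))  # 2.0
```

The sum-to-1 check is a design choice of this sketch: it rejects inputs such as 0.4, 0.4, 0.6, 0.6, which add up to 2 and therefore are not a probability distribution.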

Unfortunately, I am not quite sure about task 2. The GC content indicates how high the proportion of GC is in the DNA, so AT must be present at 60 % and GC at 40 %. Is the calculation then as follows:

$$I=-0.4\log_2(0.4)-0.4\log_2(0.4)-0.6\log_2(0.6)-0.6\log_2(0.6)=1.94$$
 
GravityX said:
Unfortunately, I am not quite sure about task 2, but a GC content indicates how high the proportion of GC is in the DNA, so it means that AT must be present at 60 % and GC at 40 %. Is the calculation then as follows: $$I=-0.4\log_2(0.4)-0.4\log_2(0.4)-0.6\log_2(0.6)-0.6\log_2(0.6)=1.94$$
Not familiar with the topic, but if the formula in 'Relevant Equations' is correct then, for task 2, the four probabilities should presumably be:
P(G) = 0.2 (i.e. not 0.4)
P(C) = 0.2 (i.e. not 0.4)
P(A) = 0.3 (i.e. not 0.6)
P(T) = 0.3 (i.e. not 0.6)
(They have to add up to 1.)
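As an illustration of how these four values follow from the GC content, here is a minimal sketch. It assumes, as in the list above, that the GC fraction is split evenly between G and C and the AT fraction evenly between A and T; the function names `base_probabilities` and `shannon_entropy` are hypothetical.

```python
import math

def base_probabilities(gc_content):
    """Split the GC and AT fractions evenly over the individual bases.

    Assumption: G and C are equally likely, and so are A and T.
    """
    at_content = 1.0 - gc_content
    return {
        "G": gc_content / 2,
        "C": gc_content / 2,
        "A": at_content / 2,
        "T": at_content / 2,
    }

def shannon_entropy(probabilities):
    """Return the Shannon entropy, in bits, of a discrete distribution."""
    probabilities = list(probabilities)
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

probs = base_probabilities(0.4)  # 40 % GC content
print(probs)                     # G and C near 0.2, A and T near 0.3
print(round(shannon_entropy(probs.values()), 3))  # 1.971
```

With these probabilities the entropy comes out at about 1.971 bits per base, slightly below the 2 bits of the uniform case.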
 
Not sure I understand the first question. It lists A, C, G and T as the four "base pairs". No, they are bases, the pairs being GC, AT.
But treating it as a matter of base pairs would yield ##-\frac 12\log_2(\frac 12)-\frac 12\log_2(\frac 12)=1##.
The question ought to read "for a single DNA base" (i.e., per base in a single strand).

This may be what fooled you into using 0.4 and 0.6 instead of 0.2 and 0.3 in the second question. But note that we can only get 0.2 and 0.3 by assuming that the orientations of the base pairs (which base is in which strand) are independent. One could imagine some sort of autocorrelation instead.
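The role of the independence assumption can be sketched numerically. This is only an illustration under stated assumptions: the parameter `orientation_bias` is hypothetical (the probability that the G of a G-C pair, or the A of an A-T pair, sits on the strand being read), and it is taken to be the same for both pair types. The sketch looks at a single position only and does not model correlations along the sequence.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def per_base_entropy(gc_content, orientation_bias=0.5):
    """Entropy of the base read on one strand at a single position.

    orientation_bias is a hypothetical parameter: the chance that G (for a
    G-C pair) or A (for an A-T pair) is on the strand being read.
    """
    q = orientation_bias
    gc, at = gc_content, 1.0 - gc_content
    base_probs = [gc * q, gc * (1 - q), at * q, at * (1 - q)]  # G, C, A, T
    return entropy_bits(base_probs)

pair_entropy = entropy_bits([0.4, 0.6])      # which pair type: GC or AT
print(round(pair_entropy, 3))                # 0.971
print(round(per_base_entropy(0.4), 3))       # 1.971 (fair, independent orientation)
print(round(per_base_entropy(0.4, 1.0), 3))  # 0.971 (orientation fully fixed)
```

With a fair, independent orientation the per-base entropy is the pair-type entropy plus exactly 1 bit, which reproduces the value obtained from 0.2, 0.2, 0.3, 0.3. If the orientation carried no uncertainty, only the pair-type entropy would remain.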

As to whether it would be surprising, that might depend on whether we also consider the relative stabilities of the base pairs and the scheme that maps codons to amino acids.
 

haruspex said:
Not sure I understand the first question. It lists A, C, G and T as the four "base pairs". No, they are bases, the pairs being GC, AT.
By convention, when writing base-pair sequences, to be concise we use a single letter for each base pair. The letter is the base present on the ‘forward’ strand. See link below.

With this convention:
‘A’ represents adenine-thymine on the double strand.
‘C’ represents cytosine-guanine on the double strand.
‘G’ represents guanine-cytosine on the double strand.
‘T’ represents thymine-adenine on the double strand.

For example (using lower case for the bases) the sequence TATAGC represents the double strand:
tatagc
atatcg
From https://www.futurelearn.com/info/courses/bacterial-genomes-bioinformatics/0/steps/47002:
“Despite being a double helix of complementary DNA sequences, DNA is almost always represented as a single sequence.”
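The single-letter convention can be illustrated with a short sketch that expands a forward-strand sequence into both strands. The names `COMPLEMENT` and `paired_strand` are hypothetical, and the second strand is written position by position under the first (not reversed), matching the TATAGC example.

```python
# Watson-Crick pairing: a with t, c with g.
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def paired_strand(forward):
    """Return the base paired with each position of the forward strand.

    The result is in the same left-to-right order as the input, so it lines
    up underneath the forward strand (it is not the reverse complement).
    """
    return "".join(COMPLEMENT[base] for base in forward.lower())

forward = "TATAGC"
print(forward.lower())         # tatagc
print(paired_strand(forward))  # atatcg
```

Each letter of the written sequence therefore fixes one full base pair, which is why a single strand is enough to represent the double helix.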
 
Steve4Physics said:
By convention, when writing base-pair sequences, to be concise we use a single letter for each base pair. The letter is the base present on the ‘forward’ strand.
Ok, thanks.
 
Thank you Steve4Physics and haruspex for your help 👍, I had completely forgotten that the 0.4 was for the pair and not one base alone.
 