Calculating Shannon Entropy of DNA Sequences

SUMMARY

The discussion focuses on calculating the Shannon entropy of DNA sequences, addressing two tasks. For the first task, the entropy is calculated using equal probabilities of 1/4 for the four bases (A, T, G, C), giving 2 bits. In the second task, the user attempts to calculate the entropy for a GC content of 40% and an AT content of 60% by assigning 0.4 to each of G and C and 0.6 to each of A and T, which gives approximately 1.94. These four values do not sum to 1, so the replies suggest 0.2 for G and C and 0.3 for A and T instead, which assumes that the orientation of each base pair (which base lies on which strand) is independent.

PREREQUISITES
  • Understanding of Shannon entropy and its formula
  • Familiarity with DNA base pairs (A, T, G, C)
  • Knowledge of GC content and its significance in DNA
  • Basic probability concepts
NEXT STEPS
  • Study the application of Shannon entropy in bioinformatics
  • Learn about the significance of GC content in genetic stability
  • Explore the concept of base pair probabilities in DNA sequences
  • Investigate the implications of autocorrelation in DNA sequence analysis
USEFUL FOR

Bioinformaticians, geneticists, and researchers analyzing DNA sequences and their information content.

GravityX
Homework Statement
Calculate the information content of a DNA base pair
Relevant Equations
##I=-\sum\limits_{x\in\{A,C,G,T\}}P_x\log_2(P_x)##
Unfortunately, I have problems with the following task:

[Image: screenshot of the task statement (Bildschirmfoto 2023-01-10 um 16.07.44.png)]

For task 1, I proceeded as follows. Since the four bases have the same probability, each has ##P=\frac{1}{4}##. I then simply substituted this probability into the formula for the Shannon entropy:

$$I=-\frac{1}{4}\log_2\left(\frac{1}{4}\right)-\frac{1}{4}\log_2\left(\frac{1}{4}\right)-\frac{1}{4}\log_2\left(\frac{1}{4}\right)-\frac{1}{4}\log_2\left(\frac{1}{4}\right)=2$$
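
The task 1 arithmetic can be checked numerically. The following is a minimal sketch, not part of the original post; the function name `shannon_entropy` is chosen here for illustration, and base-2 logarithms are assumed, as in the formula under 'Relevant Equations'.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    probs = list(probs)
    if not math.isclose(sum(probs), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing and are skipped to avoid log2(0).
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Task 1: the four bases A, C, G, T are equally likely.
uniform = [0.25, 0.25, 0.25, 0.25]
print(shannon_entropy(uniform))  # 2.0
```

The sum-to-1 check is a design choice of this sketch: it makes the function refuse any input that is not a valid distribution rather than return a number for it.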

Unfortunately, I am not quite sure about task 2. The GC content indicates how high the proportion of GC is in the DNA, so AT must be present at 60 % and GC at 40 %. Is the calculation then as follows?

$$I=-0.4\log_2(0.4)-0.4\log_2(0.4)-0.6\log_2(0.6)-0.6\log_2(0.6)=1.94$$
 
GravityX said:
Unfortunately, I am not quite sure about task 2. The GC content indicates how high the proportion of GC is in the DNA, so AT must be present at 60 % and GC at 40 %. Is the calculation then as follows? $$I=-0.4\log_2(0.4)-0.4\log_2(0.4)-0.6\log_2(0.6)-0.6\log_2(0.6)=1.94$$
Not familiar with the topic, but if the formula in 'Relevant Equations' is correct then, for task 2, the four probabilities should presumably be:
P(G) = 0.2 (i.e. not 0.4)
P(C) = 0.2 (i.e. not 0.4)
P(A) = 0.3 (i.e. not 0.6)
P(T) = 0.3 (i.e. not 0.6)
(They have to add up to 1.)
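
As a rough numerical check of the point above, the sketch below first rejects the values 0.4, 0.4, 0.6, 0.6 because they do not sum to 1, then evaluates the entropy for the suggested per-base probabilities. The helper `shannon_entropy` and the variable names are illustrative assumptions, not something given in the task.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits; rejects inputs that are not a distribution."""
    probs = list(probs)
    if not math.isclose(sum(probs), 1.0):
        raise ValueError(f"probabilities sum to {sum(probs):.2f}, expected 1")
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Values from the original attempt: they sum to about 2, so they are rejected.
attempt = [0.4, 0.4, 0.6, 0.6]
try:
    shannon_entropy(attempt)
    attempt_valid = True
except ValueError:
    attempt_valid = False

# Per-base probabilities suggested in the reply.
corrected = {"G": 0.2, "C": 0.2, "A": 0.3, "T": 0.3}
h = shannon_entropy(corrected.values())
print(attempt_valid, round(h, 3))  # False 1.971
```

With the corrected probabilities the result is about 1.97 bits, slightly below the 2 bits of the equal-probability case.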
 
Not sure I understand the first question. It lists A, C, G and T as the four "base pairs". No, they are bases, the pairs being GC, AT.
But treating it as a matter of base pairs would yield ##-\frac 12\log_2(\frac 12)-\frac 12\log_2(\frac 12)=1##.
The question ought to read "for a single DNA base" (i.e., per base in a single strand).

This may be what fooled you into using 0.4 and 0.6 instead of 0.2 and 0.3 in the second question. But note that we can only get 0.2 and 0.3 by assuming that the orientations of the base pairs (which base is in which strand) are independent. One could imagine some sort of autocorrelation instead.
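
The independence assumption can be made explicit with a short worked calculation. This is only a sketch: the symbols ##X## and ##O## are introduced here for illustration, and equal likelihood of the two orientations is an extra assumption beyond independence. Let ##X## be the pair type with ##P(GC)=0.4## and ##P(AT)=0.6##, and let ##O## be the orientation, independent of ##X## with probability ##\frac 12## each way. The base on a given strand is then fixed by ##(X,O)##, so ##P(G)=P(C)=0.4\cdot\frac 12=0.2## and ##P(A)=P(T)=0.6\cdot\frac 12=0.3##, and the two entropies add:

$$I=H(X)+H(O)=\left[-0.4\log_2(0.4)-0.6\log_2(0.6)\right]+1\approx 0.971+1=1.971$$

If the orientations were correlated along the sequence, this addition would no longer hold as written.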

As to whether it would be surprising, that might depend on whether we also consider the relative stabilities of the base pairs and the scheme that maps codons to amino acids.
 

haruspex said:
Not sure I understand the first question. It lists A, C, G and T as the four "base pairs". No, they are bases, the pairs being GC, AT.
By convention, when writing base-pair sequences, to be concise we use a single letter for each base pair. The letter is the base present on the ‘forward’ strand. See link below.

With this convention:
‘A’ represents adenine-thymine on the double strand.
‘C’ represents cytosine-guanine on the double strand.
‘G’ represents guanine-cytosine on the double strand.
‘T’ represents thymine-adenine on the double strand.

For example (using lower case for the bases), the sequence TATAGC represents the double strand:
tatagc
atatcg
From https://www.futurelearn.com/info/courses/bacterial-genomes-bioinformatics/0/steps/47002:
“Despite being a double helix of complementary DNA sequences, DNA is almost always represented as a single sequence.”
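
The single-letter convention described above can be sketched in code: given the forward-strand letters, the paired base at each position follows from the A-T and C-G pairing. The names `COMPLEMENT` and `complementary_strand` are hypothetical, and the output is written position by position in the same left-to-right order as the example in this post (it is not the reverse complement).

```python
# Translation table for Watson-Crick pairing: A<->T and C<->G, both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complementary_strand(forward):
    """Return the base paired with each forward-strand base, in the same order."""
    return forward.translate(COMPLEMENT)

forward = "tatagc"
print(forward)                        # tatagc
print(complementary_strand(forward))  # atatcg
```

Each printed column is one base pair, so the six letters of TATAGC stand for six pairs.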
 
Steve4Physics said:
By convention, when writing base-pair sequences, to be concise we use a single letter for each base pair. The letter is the base present on the ‘forward’ strand.
Ok, thanks.
 
Thank you Steve4Physics and haruspex for your help 👍, I had completely forgotten that the 0.4 was for the pair and not one base alone.
 
