Calculating Shannon Entropy of DNA Sequences

In summary: The poster calculates the Shannon entropy per DNA base for two tasks and is unsure about task 2, which involves a GC content of 40 %. The first attempt, $$I=-0.4log_2(0.4)-0.4log_2(0.4)-0.6log_2(0.6)-0.6log_2(0.6)=1.94$$ uses the pair fractions 0.4 and 0.6 as single-base probabilities. The replies point out that the four base probabilities must add up to 1, giving P(G) = P(C) = 0.2 and P(A) = P(T) = 0.3.
  • #1
GravityX
Homework Statement
Calculate the information content of a DNA base pair
Relevant Equations
##I(A)=-\sum\limits_{x=A}^{}P_xlog_2(P_x)##
Unfortunately, I have problems with the following task

[Image: screenshot of the task statement (Bildschirmfoto 2023-01-10 um 16.07.44.png)]

For task 1, I proceeded as follows. Since the four bases have the same probability, each has ##P=\frac{1}{4}##. I then simply used this probability in the formula for the Shannon entropy:

$$I=-\frac{1}{4}log_2(\frac{1}{4})-\frac{1}{4}log_2(\frac{1}{4})-\frac{1}{4}log_2(\frac{1}{4})-\frac{1}{4}log_2(\frac{1}{4})=2$$

Unfortunately, I am not quite sure about task 2, but a GC content indicates how high the proportion of GC is in the DNA, so it means that AT must be present at 60 % and GC at 40 %. Is the calculation then as follows:

$$I=-0.4log_2(0.4)-0.4log_2(0.4)-0.6log_2(0.6)-0.6log_2(0.6)=1.94$$
 
  • #2
GravityX said:
Unfortunately, I am not quite sure about task 2, but a GC content indicates how high the proportion of GC is in the DNA, so it means that AT must be present at 60 % and GC at 40 %. Is the calculation then as follows:$$I=-0.4log_2(0.4)-0.4log_2(0.4)-0.6log_2(0.6)-0.6log_2(0.6)=1.94$$
Not familiar with the topic, but if the formula in 'Relevant Equations' is correct then, for task 2, the four probabilities should presumably be:
P(G) = 0.2 (i.e. not 0.4)
P(C) = 0.2 (i.e. not 0.4)
P(A) = 0.3 (i.e. not 0.6)
P(T) = 0.3 (i.e. not 0.6)
(They have to add up to 1.)
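As a quick numerical check of the two tasks, here is a minimal Python sketch. The function name `shannon_entropy` and the check that the probabilities sum to 1 are illustrative choices, not part of the original task.

```python
import math

def shannon_entropy(probabilities):
    """Return the Shannon entropy in bits of a discrete distribution."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 are skipped: p*log2(p) tends to 0 as p -> 0.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Task 1: four equally likely bases.
uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])

# Task 2: 40 % GC split evenly between G and C, 60 % AT split between A and T.
gc40 = shannon_entropy([0.2, 0.2, 0.3, 0.3])

print(f"{uniform:.3f} bits, {gc40:.3f} bits")  # 2.000 bits, 1.971 bits
```

Passing `[0.4, 0.4, 0.6, 0.6]` to this function raises a `ValueError`, because those values sum to 2 and so are not a valid probability distribution.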
 
  • #3
Not sure I understand the first question. It lists A, C, G and T as the four "base pairs". No, they are bases, the pairs being GC, AT.
But treating it as a matter of base pairs would yield ##-\frac 12\log_2(\frac 12)-\frac 12\log_2(\frac 12)=1##.
The question ought to read "for a single DNA base" (i.e., per base in a single strand).

This may be what fooled you into using 0.4 and 0.6 instead of 0.2 and 0.3 in the second question. But note that we can only get 0.2 and 0.3 by assuming that the orientations of the base pairs (which base is in which strand) are independent. One could imagine some sort of autocorrelation instead.
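A short worked version of this step, assuming each orientation of a base pair is equally likely and independent of its neighbours. The symbol ##g## for the GC content is introduced here only for illustration:

$$P(G)=P(C)=\frac{g}{2},\qquad P(A)=P(T)=\frac{1-g}{2}$$

$$I=-g\log_2\left(\frac{g}{2}\right)-(1-g)\log_2\left(\frac{1-g}{2}\right)=1-g\log_2(g)-(1-g)\log_2(1-g)$$

For ##g=0.4## this gives ##I\approx 1.97## bits per base, and for ##g=0.5## it reduces to the 2 bits of the equal-probability case.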

As to whether it would be surprising, that might depend on whether we also consider the relative stabilities of the base pairs and the scheme that maps codons to amino acids.
 
  • #4

haruspex said:
Not sure I understand the first question. It lists A, C, G and T as the four "base pairs". No, they are bases, the pairs being GC, AT.
By convention, when writing base-pair sequences, to be concise we use a single letter for each base pair. The letter is the base present on the ‘forward’ strand. See link below.

With this convention:
‘A’ represents adenine-thymine on the double strand.
‘C’ represents cytosine-guanine on the double strand.
‘G’ represents guanine-cytosine on the double strand.
‘T’ represents thymine-adenine on the double strand.

For example (using lower case for the bases) the sequence TATAGC represents the double strand:
tatagc
atatcg
From https://www.futurelearn.com/info/courses/bacterial-genomes-bioinformatics/0/steps/47002:
“Despite being a double helix of complementary DNA sequences, DNA is almost always represented as a single sequence.”
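The single-letter convention can be sketched in a few lines of Python. The names `COMPLEMENT` and `paired_strand` are hypothetical, chosen for this illustration. The partner strand is written position by position, as in the example above, not as a reverse complement.

```python
# Watson-Crick pairing: each base on the forward strand fixes its partner.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def paired_strand(forward):
    """Return the bases paired with `forward`, position by position."""
    return "".join(COMPLEMENT[base] for base in forward.upper())

print(paired_strand("TATAGC"))  # ATATCG
```

Because the second strand is fully determined by the first, listing it adds no information, which is why one letter per base pair is enough.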
 
  • #5
Steve4Physics said:
By convention, when writing base-pair sequences, to be concise we use a single letter for each base pair. The letter is the base present on the ‘forward’ strand.
Ok, thanks.
 
  • #6
Thank you Steve4Physics and haruspex for your help 👍, I had completely forgotten that the 0.4 was for the pair and not one base alone.
 

1. What is Shannon Entropy and why is it important in DNA sequencing?

Shannon Entropy is a measure of the uncertainty or randomness in a sequence of symbols. In DNA sequencing, it is used to measure the diversity or complexity of a DNA sequence. This is important because it can provide insights into the functional significance of different regions of the DNA and help identify important genetic information.

2. How is Shannon Entropy calculated for DNA sequences?

Shannon Entropy is calculated using the formula H = -∑ p_i log2(p_i), where p_i is the relative frequency of nucleotide i in the sequence and the sum runs over the nucleotides that occur. The result depends only on how the nucleotides are distributed, not on the length of the sequence.
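A minimal Python sketch of this calculation, estimating the probabilities from the base counts of a sequence. The function name `sequence_entropy` and the example sequence are made up for illustration.

```python
import math
from collections import Counter

def sequence_entropy(sequence):
    """Shannon entropy in bits per base, from observed base frequencies."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

example = "GGCCAAATTT"  # hypothetical sequence with 40 % GC content
print(round(sequence_entropy(example), 3))  # 1.971
```

Only bases that actually occur appear in `counts`, so no zero-probability terms need special handling.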

3. What is the range of values for Shannon Entropy in DNA sequences?

The range of values for Shannon Entropy in DNA sequences is 0 to log2(n), where n is the number of possible symbols in the sequence. In DNA sequences, n is typically 4 (representing the four nucleotides A, C, G, and T), so the maximum entropy value is log2(4) = 2.

4. How can Shannon Entropy be used to compare DNA sequences?

Shannon Entropy can be used to compare DNA sequences by calculating the entropy value for each sequence and then comparing them. A higher entropy value indicates a more diverse or complex sequence, while a lower entropy value indicates a less diverse or simpler sequence.

5. Are there any limitations to using Shannon Entropy in DNA sequencing?

Yes, there are some limitations to using Shannon Entropy in DNA sequencing. Computed from nucleotide frequencies alone, it ignores the order of the nucleotides, so two sequences with the same composition have the same entropy however they are arranged. It may also be less useful for highly repetitive sequences, and for short sequences the observed frequencies are poor estimates of the underlying probabilities.
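The insensitivity to order can be seen in a small Python sketch. The sequences and the helper name `sequence_entropy` are hypothetical, and the fixed seed only makes the run repeatable.

```python
import math
import random
from collections import Counter

def sequence_entropy(sequence):
    """Shannon entropy in bits per base, from observed base frequencies."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

ordered = "AAAACCCCGGGGTTTT"      # hypothetical, highly structured sequence
bases = list(ordered)
random.Random(0).shuffle(bases)   # fixed seed so the run is repeatable
shuffled = "".join(bases)

print(sequence_entropy(ordered), sequence_entropy(shuffled))  # 2.0 2.0
```

Both sequences contain each base four times, so the frequency-based entropy is 2 bits in both cases even though one is obviously more structured than the other.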
