Text file that is read into an array

  • Thread starter Sue Parks
  • #26
38
0
subroutine countAminoAcids:
input: array (baseq)
output: count of total amino acids

subroutine AAFrequencyTitin
input: array(baseq)
output: frequency count of all amino acids
 
  • #27
jbriggs444
Science Advisor
Homework Helper
10,239
4,860
For documentation, I was looking for something more like...

baseq is a character string containing the amino acid sequence, coded with one character for each residue in the chain: "A" for alanine, "C" for cysteine, etc. It may be padded with blanks. It is case sensitive: the A's, C's, etc. must all be in upper case.

SumAminoAcids is the number of amino acids in the chain encoded by baseq, not counting any trailing blanks.

CountA, CountC, etc. are the numbers of "A"s, "C"s, etc. in baseq.
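
The contract described above can be checked quickly in Python before writing the Fortran. This is only a sketch: the names `sum_amino_acids` and `count_residue` are hypothetical, chosen to mirror SumAminoAcids and CountA, CountC, etc., and the sample string is made up.

```python
def sum_amino_acids(baseq):
    # Number of amino acids in the chain: the string length, ignoring trailing blanks.
    return len(baseq.rstrip(" "))

def count_residue(baseq, letter):
    # Case-sensitive count of a single one-letter amino acid code.
    return baseq.count(letter)

baseq = "MTTQAPTFTQ   "  # hypothetical sample, padded with three blanks
print(sum_amino_acids(baseq))     # 10
print(count_residue(baseq, "T"))  # 4
print(count_residue(baseq, "t"))  # 0, because lower case does not match
```

Writing the expected values down for a tiny input like this gives something concrete to compare the Fortran output against.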
 
  • #28
35,287
7,140
Granted, the code below is Python, but my intent is to show how to write the code in a modular fashion, with each function carrying out a specific task. Each function is passed the information it needs through the parameter list, and, for some functions, returns the processed information for other parts of the code to use.
Python:
def readfile(file_name):
  # Read the given file line by line, and store each line of the file in a list of lines.
  # The file header (a line starting with '>') is skipped.
  # Returns a list containing the lines of the file.
  lines = []  # Local list, so the function does not depend on a global variable
  with open(file_name, "r") as infile:  # The file is closed automatically
     for line in infile:
        if line[0:1] == '>': continue  # Skip the header
        lines.append(line)
  return lines

def printfile(lines):
  # Print the list of lines of the file.
  for line in lines:
     print(line)


def processData(lines):
  # Process the data by storing amino acid frequencies in a dictionary, a Python data type that consists of (key, value) pairs.
  # Each amino acid is represented by a one-letter key; e.g. 'A'.
  # Each occurrence of a particular amino acid causes the value portion to be incremented.
  # Characters that don't match any amino acid are tracked by a catchall key, Misc.
  # Returns the dictionary and the total count as a tuple, which is how Python
  #   returns more than one thing.
  dataDict = dict(A=0,C=0,D=0,E=0, F=0, G=0, H=0, I=0, K=0, L=0, M=0, N=0, P=0, Q=0, T=0, V=0, W=0, Y=0, Misc=0)
  count = 0

  for line in lines:
     for i in range(len(line)):
        key = line[i:i+1]
        if key in dataDict:
           dataDict[key] += 1
        else:
           dataDict['Misc'] += 1
        count += 1
  return dataDict, count

def printSummary(dataDict, count):
  # Print a summary, with each amino acid, how often it occurred, and its relative proportion overall.
  for key in dataDict:
     print("Amino acid: ", key, "\tCount: ", dataDict[key], "\tProportion: ", dataDict[key]/count * 100, "%")


# main program
fn = "TitinFasta.txt"
lines = []
lines = readfile(fn)
print(lines[0])
dataDict, count = processData(lines)
printSummary(dataDict, count)
print(count)

Output from the above:
Code:
Amino acid:  M  Count:  398  Proportion:  1.1396827214936143 %
Amino acid:  I  Count:  2062  Proportion:  5.904587366130233 %
Amino acid:  F  Count:  908  Proportion:  2.600080178683924 %
Amino acid:  Q  Count:  942  Proportion:  2.697440009163278 %
Amino acid:  K  Count:  2943  Proportion:  8.427352385315846 %
Amino acid:  G  Count:  2066  Proportion:  5.916041463833686 %
Amino acid:  H  Count:  478  Proportion:  1.3687646755626826 %
Amino acid:  V  Count:  3184  Proportion:  9.117461771948914 %
Amino acid:  P  Count:  2517  Proportion:  7.207490979898059 %
Amino acid:  E  Count:  3193  Proportion:  9.143233491781686 %
Amino acid:  Y  Count:  999  Proportion:  2.8606609014374893 %
Amino acid:  D  Count:  1720  Proportion:  4.925262012484966 %
Amino acid:  L  Count:  2117  Proportion:  6.0620812095527175 %
Amino acid:  N  Count:  1111  Proportion:  3.1813756371341846 %
Amino acid:  A  Count:  2084  Proportion:  5.967584903499227 %
Amino acid:  C  Count:  513  Proportion:  1.4689880304679 %
Amino acid:  Misc  Count:  4675  Proportion:  13.386976690911172 %
Amino acid:  W  Count:  466  Proportion:  1.3344023824523223 %
Amino acid:  T  Count:  2546  Proportion:  7.290533188248095 %
34922
 
  • #29
38
0
How could I count the number of characters in my variable baseq?
 
  • #30
35,287
7,140
baseq(i:i) is the one-character substring that starts at index i.
baseq(i:i+1) would be the two-character substring starting at index i.
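
Since this thread moves between Fortran and Python, it may help to see the same substrings in Python terms. Fortran's `baseq(i:j)` is 1-based and includes both ends, while a Python slice is 0-based and excludes its end. The helper `fortran_substring` below is hypothetical, written only to illustrate the mapping, and the sample string is made up. `len` and `rstrip` play roughly the roles of Fortran's `LEN` and `LEN_TRIM`, which is how the character count itself can be obtained.

```python
def fortran_substring(s, i, j):
    # Emulate Fortran's s(i:j): 1-based, both ends inclusive.
    return s[i - 1:j]

baseq = "MTTQAP  "  # hypothetical sample with two trailing blanks

print(fortran_substring(baseq, 2, 2))  # T   (one character, like baseq(2:2))
print(fortran_substring(baseq, 2, 3))  # TT  (two characters, like baseq(2:3))
print(len(baseq))                      # 8   (full length, like LEN(baseq))
print(len(baseq.rstrip(" ")))          # 6   (without trailing blanks, like LEN_TRIM(baseq))
```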
 
  • #31
35,287
7,140
In post #24 you said this:
! Here is my Data (list of ALL possible amino acids
DATA /'A,C,D,E,F,G,H,I,K,L,M,N,P,Q,E,A,T,V,W,Y'/
Might have been a copy/paste mistake, but the list above has duplicates for E and A, and is missing R and S. The data file that you attached at the beginning of this thread contains numerous R's and S's. The output from my Python code in post #28 doesn't have entries for R and S, which I presume are valid amino acids. I didn't include these two because you didn't list them above.
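
A short check along these lines catches this kind of list error mechanically. It is a sketch: `check_residue_list` is a hypothetical helper, and the reference set is assumed to be the 20 standard one-letter amino acid codes.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard one-letter codes

def check_residue_list(csv_codes):
    # Return (duplicates, missing) for a comma-separated list of one-letter codes.
    codes = csv_codes.split(",")
    duplicates = sorted({c for c in codes if codes.count(c) > 1})
    missing = sorted(STANDARD_AMINO_ACIDS - set(codes))
    return duplicates, missing

duplicates, missing = check_residue_list("A,C,D,E,F,G,H,I,K,L,M,N,P,Q,E,A,T,V,W,Y")
print("Duplicates:", duplicates)  # Duplicates: ['A', 'E']
print("Missing:", missing)        # Missing: ['R', 'S']
```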

I have to ask: Is there some reason you're doing this with Fortran? To me, using Fortran in the context of this problem is something like trying to make fine furniture using only a hammer.
 
  • #32
35,287
7,140
Adding R and S as possible amino acids, and tweaking the output format slightly, this is what I'm now getting. The 'Misc' category now consists only of the newline characters that were in your input textfile.
Code:
Amino acid   Count   Proportion
  E           3193    9.143 %
  N           1111    3.181 %
  H            478    1.369 %
  M            398    1.140 %
  W            466    1.334 %
  I           2062    5.905 %
  Q            942    2.697 %
  A           2084    5.968 %
  K           2943    8.427 %
  Y            999    2.861 %
  D           1720    4.925 %
  L           2117    6.062 %
  C            513    1.469 %
  S           2463    7.053 %
  Misc         572    1.638 %
  G           2066    5.916 %
  T           2546    7.291 %
  F            908    2.600 %
  P           2517    7.207 %
  V           3184    9.117 %
  R           1640    4.696 %
Cumulative total percentages: 100.00%
Characters processed:  34922
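
The revised code was not posted, so the following is only a guess at what the change might look like: the same counting loop with R and S added to the keys, plus a formatted report. The names `tally` and `format_summary`, the column widths, and the two-line sample input are all assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard codes, now including R and S

def tally(lines):
    # Count each amino acid; anything else (e.g. newlines) goes to 'Misc'.
    counts = {code: 0 for code in AMINO_ACIDS}
    counts["Misc"] = 0
    total = 0
    for line in lines:
        for ch in line:
            if ch in counts:
                counts[ch] += 1
            else:
                counts["Misc"] += 1
            total += 1
    return counts, total

def format_summary(counts, total):
    # Build the report lines: a header, one row per key, then the totals.
    rows = ["Amino acid   Count   Proportion"]
    cumulative = 0.0
    for key, n in counts.items():
        pct = n / total * 100
        cumulative += pct
        rows.append(f"  {key:<8}{n:>7}   {pct:6.3f} %")
    rows.append(f"Cumulative total percentages: {cumulative:.2f}%")
    rows.append(f"Characters processed:  {total}")
    return rows

sample = ["MRS\n", "AAX\n"]  # hypothetical two-line input, not the Titin data
counts, total = tally(sample)
print("\n".join(format_summary(counts, total)))
```

With this sample, 'Misc' picks up the two newlines and the unrecognised 'X', which matches the observation above that only non-residue characters end up in that category.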
 
  • #33
38
0
This is a practice simulation in fortran. I have a good foundation in Python. We (YOU & I) know Fortran is not the best way to go about solving this problem, but it can be done.
 
  • #34
35,287
7,140
Sure.

Here's what you posted earlier:
Fortran:
subroutine AAFrequencyTitin

Integer:: countA  , averageA  , countC , countD, average D
do i =1, len(String)
    if ( String == 'A')
        countA = countA + 1
        averageA = countA/sumAminoAcids
       
    else if (String == 'D')
        countD = countD + 1
        averageD = countD/sumAminoAcids
    
end do
end subroutine AAFrequencyTitin
Something like this will work, but it needs some work.
1. The subroutine should have at least one parameter, the string (CHARACTER* xxxxx) that was read earlier in another subroutine.
2. You could use countA, countC, countD, etc. to store the counts of the various amino acids, but you DON'T need separate variables for the relative frequencies. Just keep track of the total number of amino acids, and display countA/totalCount for the relative proportion of A, and so on. Convert to REAL before dividing, because dividing one INTEGER by a larger one truncates to 0.
3. The string (passed as a parameter) can be read one character at a time, using baseq(i:i). Get the character in the i-th position, and run it through a chain of IF ... ELSE IF ... ELSE IF ... statements, incrementing the appropriate countX when the IF clause is matched.
Once you have cycled through the string, and all of the countX variables are set, you could store these values in an array of suitable size (one-dimensional, with one cell for each amino acid count). That array could be an INTENT(OUT) argument of your subroutine, so that some other subroutine can use it, similar to what I did in my Python code.
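
The structure in points 1 to 3 can be prototyped in Python first and then carried over to Fortran. This is a sketch under assumptions: the names `aa_frequency` and `print_frequencies`, the ordering in `CODES`, and the sample string are hypothetical, and a lookup with `find` stands in for the IF ... ELSE IF chain. Characters that are not recognised codes are simply skipped here.

```python
CODES = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering: counts[k] belongs to CODES[k]

def aa_frequency(baseq):
    # Walk the string one character at a time and fill one count per amino acid.
    counts = [0] * len(CODES)
    total = 0
    for i in range(len(baseq.rstrip(" "))):  # ignore trailing blanks
        ch = baseq[i:i + 1]   # Python analogue of Fortran's baseq(i:i)
        k = CODES.find(ch)    # -1 when ch is not a recognised code
        if k >= 0:
            counts[k] += 1
            total += 1
    return counts, total      # plays the role of the OUT array plus the total

def print_frequencies(counts, total):
    # Proportions are derived at display time, so no averageA, averageD, ... are needed.
    for code, n in zip(CODES, counts):
        if n > 0:
            print(f"{code}  {n:6d}  {n / total:.3f}")

counts, total = aa_frequency("ADDAC   ")  # hypothetical sample padded with blanks
print_frequencies(counts, total)
```

Returning the counts array and the total from one routine keeps the counting separate from the printing, which is the same split used in the Python code earlier in the thread.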
 
