Text file that is read into an array

  • Thread starter Sue Parks
  • Start date
  • #1
38
0
I have a program "probe.f90" that reads in a text file. It runs. I am having trouble reading in the second text file (when prompted by user).

How could I turn this bit of code into a function? I was thinking about a function that strips the header from the fasta file. Then read into an array. Then compute amino acid frequency (last)!

I looked at some examples of character functions in fortran 90, but I see mostly integer functions.
Mod note: Edited the code so that the indentation matches the program structure.
Fortran:
    PROGRAM PROBE 
    implicit none
    character(len=200) :: string
    character, allocatable :: baseq(:), tmpstr(:), newstr(:)
    integer :: nbase, errstat, tmplen, i

    nbase = 0
    errstat = 0

    allocate( baseq(0) )

    open(unit=101, file="TitinFastaFormat.txt", form="formatted", status="old")
    do
       read (101,*,iostat=errstat) string
       if (errstat .ne. 0) exit
       if ( string(1:1) .ne. ">" ) then
            tmplen = len_trim(string)
            nbase = nbase + tmplen
            allocate ( tmpstr( tmplen ) )
            do i=1,tmplen
               tmpstr(i) = string(i:i)
            end do
            allocate( newstr( nbase ))
            newstr(1:size(baseq)) = baseq
            call move_alloc(newstr,baseq)
            baseq(nbase-tmplen+1:nbase) = tmpstr
            deallocate ( tmpstr )
       end if
    end do
    close(101)
    write (*,"(A6, tr1, 999999(A1))"),baseq
    !write (*,"(A6, tr1, i0)") "NBASE:", nbase
    deallocate (baseq)

end program probe
 

Attachments

  • TitinFastaFormat.txt
    34.7 KB · Views: 452
Last edited by a moderator:

Answers and Replies

  • #2
35,287
7,140
I have a program "probe.f90" that reads in a text file. It runs. I am having trouble reading in the second text file (when prompted by user).

How could I turn this bit of code into a function? I was thinking about a function that strips the header from the fasta file. Then read into an array. Then compute amino acid frequency (last)!
I don't recommend doing this. A function (or subroutine) should essentially do one thing, and the name should be chosen so that the one thing it does is obvious from the name. A better strategy, I believe, would be to write one subroutine that strips the header from the file, and a second subroutine that reads data from the file into an array, and finally, a third subroutine that does the amino acid frequency calculations.

I see all of these as subroutines, rather than functions, because each one performs some operation, but doesn't return a value, as a function should do. Each sub performs an action, so the name of each should be a verb, such as StripHeader for the first, and ReadData for the second, and maybe CalcAminoAcidFreq for the third.

If you decide to go this route, you'll then need to figure out what parameters each sub needs so that it can perform its task efficiently and effectively.
Sue Parks said:
I looked at some examples of character functions in fortran 90, but I see mostly integer functions.
Mod note: Edited the code so that the indentation matches the program structure.
Fortran:
    PROGRAM PROBE
    implicit none
    character(len=200) :: string
    character, allocatable :: baseq(:), tmpstr(:), newstr(:)
    integer :: nbase, errstat, tmplen, i

    nbase = 0
    errstat = 0

    allocate( baseq(0) )

    open(unit=101, file="TitinFastaFormat.txt", form="formatted", status="old")
    do
       read (101,*,iostat=errstat) string
       if (errstat .ne. 0) exit
       if ( string(1:1) .ne. ">" ) then
            tmplen = len_trim(string)
            nbase = nbase + tmplen
            allocate ( tmpstr( tmplen ) )
            do i=1,tmplen
               tmpstr(i) = string(i:i)
            end do
            allocate( newstr( nbase ))
            newstr(1:size(baseq)) = baseq
            call move_alloc(newstr,baseq)
            baseq(nbase-tmplen+1:nbase) = tmpstr
            deallocate ( tmpstr )
       end if
    end do
    close(101)
    write (*,"(A6, tr1, 999999(A1))"),baseq
    !write (*,"(A6, tr1, i0)") "NBASE:", nbase
    deallocate (baseq)

end program probe
 
  • Like
Likes Sue Parks
  • #3
38
0
Three subroutines makes more sense. Question: could I have two more subroutines that add the characters to an array. What would be the best way to sort through the parameters?
 
  • #4
35,287
7,140
Three subroutines makes more sense. Question: could I have two more subroutines that add the characters to an array. What would be the best way to sort through the parameters?
My suggestion of three subs came from you description of what seemed to me to be three separate tasks, although, after taking another look at your file, I don't think it would hurt to combine the header stripping and reading into an array into one sub, especially since Fortran doesn't really treat files as first-class objects. By that, I mean, I don't think there's an elegant way to pass the file into and out of a subroutine. Many other programming languages have that ability

By "sorting through the parameters" I don't think you really mean actually sorting through them -- instead, just figuring out what they are. In each case, just identify what information the sub needs to do its job.

So far, the main tasks seem to be 1) pulling the data out of the file and storing it in an array, and 2) calculating the amino acid frequencies.
For the first task, one parameter (in) would be a string that contains the name of the file, and another would be the array (out) into which the data would be stored (omitting the header, which seems to be the first line of the text file). For the second task, the subroutine would need the array (in) , and possibly another array (out) with the frequencies.
 
  • Like
Likes Sue Parks
  • #5
38
0
Thank you!
 
  • #6
35,287
7,140
Is there some reason you're using Fortran for this problem? For working with strings, many other languages are better choices. That's not to say you can't work with strings in Fortran, but it's easier to do this kind of work in many other languages.

Out of curiosity, what are you planning to do with the text file? I've copied the first four lines of the file you provided. What sort of analysis is your program supposed to do with the alphabet soup of the 2nd and following lines? (I'm assuming that the first line is the header.)

Code:
>sp|Q8WZ42|TITIN_HUMAN Titin OS=Homo sapiens GN=TTN PE=1 SV=4
MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGVQISFSD
GRAKLTIPAVTKANSGRYSLKATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGDLYSLLIAEAYPEDSGTYSVNATN
 
  • #7
38
0
I am using fortran for the first time. Typically, I would do this type of problem in Python or BioPython. As you can see, the first line is a header. That is the first 4 lines of the FASTA file of the human gene Titin. I have another FASTA file (with a probe). I would like to read these "sequences" into an array. Find the frequency and also find the number of matches of the probe for the Titan gene. Technically, the "alphabet soup" is an amino acid sequence that codes for the Titin gene.
 
  • #8
38
0
I am having some difficulty taking my first program and converting that to a subroutine. I was only working the text file. I was trying to get the sequence to print to the screen.

Fortran:
        program aa
   
        character:: filename*20 
        character, allocatable :: baseq(:)
        call readfile("TitinFastaFormat.txt")
       
    end program aa

      subroutine readfile(filename) 
          integer ::eof, tempLength, nBase, i 
          character, allocatable :: baseq(:), tempString(:), newString(:) 
          character:: filename*20 
        character :: stringFile*200
     
      eof=0 
          open (102, file=filename, Form="formatted", status="old" ) 
          Read(102,*) 
          do
              !Read(102,*  iostate = eof) stringFile  ! <---------------------
          if (eof .ne. 0) exit
          if(stringFile(1:1) .ne. ">") then
              tempLength = len_trim(stringFile)
              nBase = nBase + tempLength
          allocate ( tempString(tempLength))
     
          do i=1, tempLength
              tempString(i) = stringFile(i:i)
          end do
     
          allocate (newString (nBase))
          newString (1:size(baseq)) = baseq
          call move_alloc(newString, baseq)
          baseq(nbase - tempLength+1:nBase) = tempString
          deallocate(tempString)
     
          end if 
        end do 
        close(101) 
       
        write (*,"(A6, tr1, 999999(A1))") "BASEQ:", baseq 
        write (*,"(A6, tr1, i0)") "NBASE:", nbase 

        deallocate (baseq)
   
    end
 
  • #9
38
0
I know the subroutine is long. How do I manipulate the intent to get my array to print?



I fixed this command
Fortran:
end program aa
.
.
.
Read(102,  iostat = eof) stringFile
 
  • #10
35,287
7,140
I am using fortran for the first time. Typically, I would do this type of problem in Python or BioPython.
I think Python would be a good choice. I don't know anything about BioPython, so don't have any knowledge of its capabilities.

I'll take a look at the latest code you included, and make some comments.
Sue Parks said:
As you can see, the first line is a header. That is the first 4 lines of the FASTA file of the human gene Titin. I have another FASTA file (with a probe). I would like to read these "sequences" into an array. Find the frequency and also find the number of matches of the probe for the Titan gene. Technically, the "alphabet soup" is an amino acid sequence that codes for the Titin gene.
 
  • #11
35,287
7,140
I know the subroutine is long.
It's not that long...
Sue Parks said:
How do I manipulate the intent to get my array to print?



I fixed this command
Fortran:
end program aa
.
.
.
Read(102,  iostat = eof) stringFile
Let me address your readfile subroutine first, and then I'll come back to this.

Fortran:
subroutine readfile(filename)
    integer ::eof, tempLength, nBase, i
   character, allocatable :: baseq(:), tempString(:), newString(:)
   character:: filename*20
   character :: stringFile*200   ! *** stringFile is declared to hold 200 characters. That's definitely not enough to hold the entire contents of the file I saw.
     
   eof=0
   open (102, file=filename, Form="formatted", status="old" )
   Read(102,*)    ! ***  ??? This doesn't do anything -- there's no variable to read into
   do
          !Read(102,*  iostate = eof) stringFile  ! <---------------------
          if (eof .ne. 0) exit
          if(stringFile(1:1) .ne. ">") then
              tempLength = len_trim(stringFile)
              nBase = nBase + tempLength
              allocate ( tempString(tempLength))
    
              do i=1, tempLength
                  tempString(i) = stringFile(i:i)
              end do
    
              allocate (newString (nBase))
              newString (1:size(baseq)) = baseq
              call move_alloc(newString, baseq)
              baseq(nbase - tempLength+1:nBase) = tempString
              deallocate(tempString)
          end if
    end do
   close(101)
       
   write (*,"(A6, tr1, 999999(A1))") "BASEQ:", baseq
   write (*,"(A6, tr1, i0)") "NBASE:", nbase

   deallocate (baseq)

end  ! *** Needs to be end subroutine
I added a few comments, the ones that start with ! ***

The subroutine should have two parameters: a string to hold the filename, and an array that will hold the contents of the file. That array would be declared in your main program. When the readfile sub finishes, the array will be initialized appropriately.

I don't know how big the array should be, but the file you attached is about 36 KB in size. I don't know whether you program needs to be able to hold the entire contents of this textfile or not.
 
  • Like
Likes Sue Parks
  • #12
35,287
7,140
Python:
def ReadFile(file_name):
    infile = open(file_name, "r")
    while (True):
       line = infile.readline()
       if line[0:1] == '>': continue  # Skip the header
       if line == "":
           infile.close()
           return lines
           break
       lines.append(line)

def PrintFile(lines):
    for line in lines:
       print (line)


inFile = "TitinFasta.txt"
lines = []
lines = ReadFile(inFile)
PrintFle(lines)
The "main" part of this Python program is the four lines at the bottom. The lines variable is a list (like an array, but not exactly the same) will hold the lines of text from the input file.

The ReadFile() function takes as its parameter the name of the file to open, and returns the filled-in list. This function opens the file in read mode, as a text file, and then reads lines of text from the input file. It skips the entire first line. Each line that is read is appended to the lists variable. When the final line is read (when line == "", or empty string), the function returns the lines list.

The PrintFile() function has a list parameter (lines). For each line of text in the lines list, it prints that line.

I'm hopeful that you'll be able to follow this code to see what your Fortran program needs to do (or even abandon your attempt to use Fortran to do this...)

The code above is short and sweet, and seems to work well.
 
  • Like
Likes Sue Parks
  • #13
38
0
I was under the assumption that Fortran read files line by line. The empty Read(102,*) was to skip that line of the file. I may be wrong.
 
  • #14
35,287
7,140
I was under the assumption that Fortran read files line by line. The empty Read(102,*) was to skip that line of the file. I may be wrong.
I could be wrong (it's been many years since I had a Fortran compiler), but I don't believe that the statement above does anything if there's no variable at the end of the read statement. You could have this
read(102, *) str
where str is declared as a CHARACTER string variable that is large enough to hold the longest line in your file. You don't need to do anything with the str variable, so you're effectively discarding the first line.
 
  • Like
Likes Sue Parks
  • #15
jbriggs444
Science Advisor
Homework Helper
10,239
4,860
My years-old recollection is that the iolist can be empty. That matches the Fortran77 documentation that I can quickly google up.

"Input List
iolist can be empty or can contain input items or implied DO lists"

https://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vn79/index.html

A Fortran READ will generally read as many lines as it takes to fill in the variables. But it will always read at least one line. [I found myself often reading a line and then using internal reads to parse the result]
 
  • Like
Likes Sue Parks
  • #16
35,287
7,140
My years-old recollection is that the iolist can be empty. That matches the Fortran77 documentation that I can quickly google up.

"Input List
iolist can be empty or can contain input items or implied DO lists"

https://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vn79/index.html

A Fortran READ will generally read as many lines as it takes to fill in the variables. But it will always read at least one line. [I found myself often reading a line and then using internal reads to parse the result]
Good to know. Thanks for setting me straight @jbriggs444...
 
  • Like
Likes Sue Parks
  • #17
38
0
I'm not sure how to use the subroutine to take in the file and put the characters in an array. I may have some problems declaring the array in the subroutine and in the main program
 
  • #18
35,287
7,140
I'm not sure how to use the subroutine to take in the file and put the characters in an array. I may have some problems declaring the array in the subroutine and in the main program
Something like this...
Note that the names of the actual arguments (the variables in main and used in the call to the subroutine) can be the same as the names of the formal parameters in the subroutine, but don't have to be.
Fortran:
! Main program
! Declare variables
CHARACTER * 20 :: fileName
CHARACTER * 10000 :: data  ! Can contain 10,000 characters, maybe not enough for your application
.
.
.
! Get a value for fileName
.
.
.
CALL ReadFile(fileName, data)
.
.
.
END PROGRAM

SUBROUTINE ReadFile(fileName, data)
IMPLICIT NONE
CHARACTER * 20, INTENT(IN) :: fileName
CHARACTER * 10000, INTENT(OUT) :: data

   ! Open file
   ! Strip off header line
   ! Read data from file into array

END SUBROUTINE ReadFile
 
  • Like
Likes Sue Parks
  • #19
38
0
ok. Let me see. I have been going in circles.
 
  • #20
38
0
I'm stuck

Fortran:
    program aa
   
        character*20:: filename  = "TitinFastaFormat.txt"
        character*10000 :: baseq
        call readfile(filename, baseq)
        !deallocate
   
    end program aa
   
      subroutine readfile(filename, baseq) 
      implicit none
          integer ::eof,  nBase, i, tempLength 
          character, allocatable ::  tempString(:), newString(:)
          character, intent(in):: filename*20
          character, intent(out):: baseq*10000
        character  :: stringFile*10000
     
          eof=0 
          allocate( baseq)
          open (102, file=filename, Form="formatted", status="old" ) 
          Read(102,*) stringFile
          !Read(102,  iostate = eof) stringFile 
          do
          !Read(102,  iostate = eof) stringFile  
          if (eof .ne. 0) exit
          if(stringFile(1:1) .ne. ">") then
              tempLength = len_trim(string(1:1))
              nBase = nBase + tempLength
              allocate ( tempString(tempLength))
   
              do i=1, tempLength
                  tempString(i) = stringFile(:)
              end do
   
              allocate (newString (nBase))
              newString (1:size(baseq)) = baseq
              call move_alloc(newString, baseq)
              baseq(nbase - tempLength+1:nBase) = tempString
              deallocate(tempString)
          end if
    end do
   close(102)
       
   !write (*,"(A6, tr1, 999999(A1))") "BASEQ:", baseq
   !write (*,"(A6, tr1, i0)") "NBASE:", nbase

   deallocate (baseq)

end  subroutine read file
 
  • #21
35,287
7,140
I can't tell what you're trying to do in your readfile subroutine. The only things that it should are 1) open the file, and 2) read the data from the file into an array. After that it's done.

One mistake I see is that you have a line -- nBase = nBase + tempLength
nBase is an uninitialized variable. So the line of code above adds tempLength to a garbage value (nBase) and stores that in nBase, which results in a garbage value.

You have a lot of stuff in your code whose purpose escapes me. What I had in mind is more like the following. Note that I changed the size of the data array to 40000. You need to make the same change for the corresponding array.

I think the following will work (i.e., the single read statement will suck up all of the data), but I don't have a Fortran compiler to test it, so caveat emptor.

Fortran:
subroutine ReadFile(filename, baseq)
   implicit none
   integer ::eof
   character, intent(in):: filename*20
   character, intent(out):: baseq*40000
   eof=0
   open (102, file=filename, Form="formatted", status="old" )
   Read(102,*) baseq
end subroutine ReadFile
 
  • #22
38
0
I was trying to remove the header (first line of the file). How would I call the subroutine from the main. I just want to make sure the characters are in an array. I'm close!

output:
Sues-MacBook-Air:fortran sueparks$ gfortran a.f90
Sues-MacBook-Air:fortran sueparks$ ./a.out
Backtrace for this error:
#0 0x10f112092
#1 0x10f1113b0
#2 0x7fff8767ff19
Segmentation fault: 11
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.



Fortran:
program a
   
        character*20:: filename  
        character, allocatable :: baseq*40000
        call ReadFile("TitinFastaFormat.txt", baseq)
        Write (102,*) baseq ! <-------
   
   
    end program a

    subroutine ReadFile(filename, baseq)
          implicit none
          !integer ::eof
          character, intent(in):: filename*20
          character, intent(out):: baseq*40000

          open (102, file=filename, Form="formatted", status="old" )
          Read(102,*) baseq
          Write (102, *) baseq
          close(102)
end subroutine ReadFile
 
  • #23
35,287
7,140
I believe your problem comes from not actually allocating storage for your baseq array.
Fortran:
program a
  character*20:: filename 
  character, allocatable :: baseq*40000  ! <--- this line
  call ReadFile("TitinFastaFormat.txt", baseq)
  Write (102,*) baseq !
end program a
Merely stating that it is "allocatable" doesn't allocate any storage. To do that, you need to have a statement something like
allocate(baseq(40000)) following your declaration of the baseq variable.

Ideally your code would determine the size of the file, and then allocate the correct amount of storage (using allocate(baseq(size)) ).

I don't know if that approach is feasible, given your current working knowledge of Fortran, so IMO using "allocatable" and "allocate" seems like extra complication for no gain. Instead of using dynamic array allocation (i.e., using allocatable and allocate), just declare an array that is big enough, like so:
Fortran:
   character*40000 :: baseq
 
  • Like
Likes Sue Parks
  • #24
38
0
That was extremely helpful! I have another question. I have a short algorithm to help determine the Amino acid frequency. Could you take a look? I am just giving you an example of what I am thinking, mostly because I am not 100% sure how to extract the information from the array (baseq in the case).

Fortran:
String = "ABDJFAAUGBJDK"   ! Just found an example to determine functionality
Write (*,*) String ! checking 

! seems plausible to have some sort of subroutine to count the "elements"
    subroutine countAminoAcids(sumAminoAcids, baseq)
Integer :: sumAminoAcids
! intent (in) :: baseq
! intent (out) sumAminoAcids
    sum all elements
    do i=1, len(array)
        sumAminoAcids = sumAminoAcids + 1
    end do
    end subroutine countAminoAcids
       



! Here is my Data (list of ALL possible amino acids
DATA /'A,C,D,E,F,G,H,I,K,L,M,N,P,Q,E,A,T,V,W,Y'/


! seems plausible to loop through the array to sum (countA) and avg(averageA)
! Is seems easier to do this is one subroutine
subroutine AAFrequencyTitin

Integer:: countA  , averageA  , countC , countD, average D
do i =1, len(String)
    if ( String == 'A')
        countA = countA + 1
        averageA = countA/sumAminoAcids
       
    else if (String == 'D')
        countD = countD + 1
        averageD = countD/sumAminoAcids
    
end do
end subroutine AAFrequencyTitin
 
  • #25
jbriggs444
Science Advisor
Homework Helper
10,239
4,860
That was extremely helpful! I have another question. I have a short algorithm to help determine the Amino acid frequency. Could you take a look? I am just giving you an example of what I am thinking, mostly because I am not 100% sure how to extract the information from the array (baseq in the case).

Fortran:
! seems plausible to have some sort of subroutine to count the "elements"
    subroutine countAminoAcids(sumAminoAcids, baseq)
Integer :: sumAminoAcids
! intent (in) :: baseq
! intent (out) sumAminoAcids
    sum all elements
    do i=1, len(array)
        sumAminoAcids = sumAminoAcids + 1
    end do
   end subroutine countAminoAcids
You never use baseq. As written, this is equivalent to "sumAminoAcids = len(array)"

Documentation is important. What is this subroutine supposed to do? What are its inputs. What are its outputs?

Fortran:
! seems plausible to loop through the array to sum (countA) and avg(averageA)
! Is seems easier to do this is one subroutine
subroutine AAFrequencyTitin

Integer:: countA  , averageA  , countC , countD, average D
do i =1, len(String)
    if ( String == 'A')
If you want to access character number i in String, that would be String(i:i)

averageD = countD/sumAminoAcids
You've declared both countD and sumAminoAcids as integers. What happens in Fortran when an integer division would give a result that is between zero and one?
 
  • Like
Likes Sue Parks

Related Threads on Text file that is read into an array

Replies
4
Views
10K
Replies
3
Views
2K
Replies
6
Views
3K
Replies
12
Views
14K
  • Last Post
Replies
4
Views
2K
Replies
1
Views
3K
Top