# Text file that is read into an array

1. Dec 2, 2015

### Sue Parks

I have a program "probe.f90" that reads in a text file. It runs. I am having trouble reading in the second text file (when prompted by user).

How could I turn this bit of code into a function? I was thinking about a function that strips the header from the fasta file. Then read into an array. Then compute amino acid frequency (last)!

I looked at some examples of character functions in fortran 90, but I see mostly integer functions.
Mod note: Edited the code so that the indentation matches the program structure.
Code (Fortran):

PROGRAM PROBE
implicit none
character(len=200) :: string
character, allocatable :: baseq(:), tmpstr(:), newstr(:)
integer :: nbase, errstat, tmplen, i

nbase = 0
errstat = 0

allocate( baseq(0) )

open(unit=101, file="TitinFastaFormat.txt", form="formatted", status="old")
do
if (errstat .ne. 0) exit
if ( string(1:1) .ne. ">" ) then
tmplen = len_trim(string)
nbase = nbase + tmplen
allocate ( tmpstr( tmplen ) )
do i=1,tmplen
tmpstr(i) = string(i:i)
end do
allocate( newstr( nbase ))
newstr(1:size(baseq)) = baseq
call move_alloc(newstr,baseq)
baseq(nbase-tmplen+1:nbase) = tmpstr
deallocate ( tmpstr )
end if
end do
close(101)
write (*,"(A6, tr1, 999999(A1))"),baseq
!write (*,"(A6, tr1, i0)") "NBASE:", nbase
deallocate (baseq)

end program probe

#### Attached Files:

• ###### TitinFastaFormat.txt
File size:
34.7 KB
Views:
85
Last edited by a moderator: Dec 3, 2015
2. Dec 3, 2015

### Staff: Mentor

I don't recommend doing this. A function (or subroutine) should essentially do one thing, and the name should be chosen so that the one thing it does is obvious from the name. A better strategy, I believe, would be to write one subroutine that strips the header from the file, and a second subroutine that reads data from the file into an array, and finally, a third subroutine that does the amino acid frequency calculations.

I see all of these as subroutines, rather than functions, because each one performs some operation, but doesn't return a value, as a function should do. Each sub performs an action, so the name of each should be a verb, such as StripHeader for the first, and ReadData for the second, and maybe CalcAminoAcidFreq for the third.

If you decide to go this route, you'll then need to figure out what parameters each sub needs so that it can perform its task efficiently and effectively.

3. Dec 3, 2015

### Sue Parks

Three subroutines makes more sense. Question: could I have two more subroutines that add the characters to an array. What would be the best way to sort through the parameters?

4. Dec 3, 2015

### Staff: Mentor

My suggestion of three subs came from you description of what seemed to me to be three separate tasks, although, after taking another look at your file, I don't think it would hurt to combine the header stripping and reading into an array into one sub, especially since Fortran doesn't really treat files as first-class objects. By that, I mean, I don't think there's an elegant way to pass the file into and out of a subroutine. Many other programming languages have that ability

By "sorting through the parameters" I don't think you really mean actually sorting through them -- instead, just figuring out what they are. In each case, just identify what information the sub needs to do its job.

So far, the main tasks seem to be 1) pulling the data out of the file and storing it in an array, and 2) calculating the amino acid frequencies.
For the first task, one parameter (in) would be a string that contains the name of the file, and another would be the array (out) into which the data would be stored (omitting the header, which seems to be the first line of the text file). For the second task, the subroutine would need the array (in) , and possibly another array (out) with the frequencies.

5. Dec 3, 2015

Thank you!

6. Dec 3, 2015

### Staff: Mentor

Is there some reason you're using Fortran for this problem? For working with strings, many other languages are better choices. That's not to say you can't work with strings in Fortran, but it's easier to do this kind of work in many other languages.

Out of curiosity, what are you planning to do with the text file? I've copied the first four lines of the file you provided. What sort of analysis is your program supposed to do with the alphabet soup of the 2nd and following lines? (I'm assuming that the first line is the header.)

Code (Text):

>sp|Q8WZ42|TITIN_HUMAN Titin OS=Homo sapiens GN=TTN PE=1 SV=4
MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGVQISFSD
GRAKLTIPAVTKANSGRYSLKATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGDLYSLLIAEAYPEDSGTYSVNATN

7. Dec 3, 2015

### Sue Parks

I am using fortran for the first time. Typically, I would do this type of problem in Python or BioPython. As you can see, the first line is a header. That is the first 4 lines of the FASTA file of the human gene Titin. I have another FASTA file (with a probe). I would like to read these "sequences" into an array. Find the frequency and also find the number of matches of the probe for the Titan gene. Technically, the "alphabet soup" is an amino acid sequence that codes for the Titin gene.

8. Dec 3, 2015

### Sue Parks

I am having some difficulty taking my first program and converting that to a subroutine. I was only working the text file. I was trying to get the sequence to print to the screen.

Code (Fortran):

program aa

character:: filename*20
character, allocatable :: baseq(:)

end program aa

integer ::eof, tempLength, nBase, i
character, allocatable :: baseq(:), tempString(:), newString(:)
character:: filename*20
character :: stringFile*200

eof=0
open (102, file=filename, Form="formatted", status="old" )
do
!Read(102,*  iostate = eof) stringFile  ! <---------------------
if (eof .ne. 0) exit
if(stringFile(1:1) .ne. ">") then
tempLength = len_trim(stringFile)
nBase = nBase + tempLength
allocate ( tempString(tempLength))

do i=1, tempLength
tempString(i) = stringFile(i:i)
end do

allocate (newString (nBase))
newString (1:size(baseq)) = baseq
call move_alloc(newString, baseq)
baseq(nbase - tempLength+1:nBase) = tempString
deallocate(tempString)

end if
end do
close(101)

write (*,"(A6, tr1, 999999(A1))") "BASEQ:", baseq
write (*,"(A6, tr1, i0)") "NBASE:", nbase

deallocate (baseq)

end

9. Dec 3, 2015

### Sue Parks

I know the subroutine is long. How do I manipulate the intent to get my array to print?

I fixed this command
Code (Fortran):

end program aa
.
.
.

10. Dec 3, 2015

### Staff: Mentor

I think Python would be a good choice. I don't know anything about BioPython, so don't have any knowledge of its capabilities.

I'll take a look at the latest code you included, and make some comments.

11. Dec 3, 2015

### Staff: Mentor

It's not that long...

Code (Fortran):

integer ::eof, tempLength, nBase, i
character, allocatable :: baseq(:), tempString(:), newString(:)
character:: filename*20
character :: stringFile*200   ! *** stringFile is declared to hold 200 characters. That's definitely not enough to hold the entire contents of the file I saw.

eof=0
open (102, file=filename, Form="formatted", status="old" )
Read(102,*)    ! ***  ??? This doesn't do anything -- there's no variable to read into
do
!Read(102,*  iostate = eof) stringFile  ! <---------------------
if (eof .ne. 0) exit
if(stringFile(1:1) .ne. ">") then
tempLength = len_trim(stringFile)
nBase = nBase + tempLength
allocate ( tempString(tempLength))

do i=1, tempLength
tempString(i) = stringFile(i:i)
end do

allocate (newString (nBase))
newString (1:size(baseq)) = baseq
call move_alloc(newString, baseq)
baseq(nbase - tempLength+1:nBase) = tempString
deallocate(tempString)
end if
end do
close(101)

write (*,"(A6, tr1, 999999(A1))") "BASEQ:", baseq
write (*,"(A6, tr1, i0)") "NBASE:", nbase

deallocate (baseq)

end  ! *** Needs to be end subroutine

The subroutine should have two parameters: a string to hold the filename, and an array that will hold the contents of the file. That array would be declared in your main program. When the readfile sub finishes, the array will be initialized appropriately.

I don't know how big the array should be, but the file you attached is about 36 KB in size. I don't know whether you program needs to be able to hold the entire contents of this textfile or not.

12. Dec 4, 2015

### Staff: Mentor

Code (Python):

infile = open(file_name, "r")
while (True):
if line[0:1] == '>': continue  # Skip the header
if line == "":
infile.close()
return lines
break
lines.append(line)

def PrintFile(lines):
for line in lines:
print (line)

inFile = "TitinFasta.txt"
lines = []
PrintFle(lines)
The "main" part of this Python program is the four lines at the bottom. The lines variable is a list (like an array, but not exactly the same) will hold the lines of text from the input file.

The ReadFile() function takes as its parameter the name of the file to open, and returns the filled-in list. This function opens the file in read mode, as a text file, and then reads lines of text from the input file. It skips the entire first line. Each line that is read is appended to the lists variable. When the final line is read (when line == "", or empty string), the function returns the lines list.

The PrintFile() function has a list parameter (lines). For each line of text in the lines list, it prints that line.

I'm hopeful that you'll be able to follow this code to see what your Fortran program needs to do (or even abandon your attempt to use Fortran to do this...)

The code above is short and sweet, and seems to work well.

13. Dec 4, 2015

### Sue Parks

I was under the assumption that Fortran read files line by line. The empty Read(102,*) was to skip that line of the file. I may be wrong.

14. Dec 4, 2015

### Staff: Mentor

I could be wrong (it's been many years since I had a Fortran compiler), but I don't believe that the statement above does anything if there's no variable at the end of the read statement. You could have this
where str is declared as a CHARACTER string variable that is large enough to hold the longest line in your file. You don't need to do anything with the str variable, so you're effectively discarding the first line.

15. Dec 4, 2015

### jbriggs444

My years-old recollection is that the iolist can be empty. That matches the Fortran77 documentation that I can quickly google up.

"Input List
iolist can be empty or can contain input items or implied DO lists"

https://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vn79/index.html

A Fortran READ will generally read as many lines as it takes to fill in the variables. But it will always read at least one line. [I found myself often reading a line and then using internal reads to parse the result]

16. Dec 5, 2015

### Staff: Mentor

Good to know. Thanks for setting me straight @jbriggs444...

17. Dec 5, 2015

### Sue Parks

I'm not sure how to use the subroutine to take in the file and put the characters in an array. I may have some problems declaring the array in the subroutine and in the main program

18. Dec 5, 2015

### Staff: Mentor

Something like this...
Note that the names of the actual arguments (the variables in main and used in the call to the subroutine) can be the same as the names of the formal parameters in the subroutine, but don't have to be.
Code (Fortran):

! Main program
! Declare variables
CHARACTER * 20 :: fileName
CHARACTER * 10000 :: data  ! Can contain 10,000 characters, maybe not enough for your application
.
.
.
! Get a value for fileName
.
.
.
.
.
.
END PROGRAM

IMPLICIT NONE
CHARACTER * 20, INTENT(IN) :: fileName
CHARACTER * 10000, INTENT(OUT) :: data

! Open file
! Read data from file into array

19. Dec 5, 2015

### Sue Parks

ok. Let me see. I have been going in circles.

20. Dec 5, 2015

### Sue Parks

I'm stuck

Code (Fortran):

program aa

character*20:: filename  = "TitinFastaFormat.txt"
character*10000 :: baseq
!deallocate

end program aa

implicit none
integer ::eof,  nBase, i, tempLength
character, allocatable ::  tempString(:), newString(:)
character, intent(in):: filename*20
character, intent(out):: baseq*10000
character  :: stringFile*10000

eof=0
allocate( baseq)
open (102, file=filename, Form="formatted", status="old" )
do
if (eof .ne. 0) exit
if(stringFile(1:1) .ne. ">") then
tempLength = len_trim(string(1:1))
nBase = nBase + tempLength
allocate ( tempString(tempLength))

do i=1, tempLength
tempString(i) = stringFile(:)
end do

allocate (newString (nBase))
newString (1:size(baseq)) = baseq
call move_alloc(newString, baseq)
baseq(nbase - tempLength+1:nBase) = tempString
deallocate(tempString)
end if
end do
close(102)

!write (*,"(A6, tr1, 999999(A1))") "BASEQ:", baseq
!write (*,"(A6, tr1, i0)") "NBASE:", nbase

deallocate (baseq)

21. Dec 5, 2015

### Staff: Mentor

I can't tell what you're trying to do in your readfile subroutine. The only things that it should are 1) open the file, and 2) read the data from the file into an array. After that it's done.

One mistake I see is that you have a line -- nBase = nBase + tempLength
nBase is an uninitialized variable. So the line of code above adds tempLength to a garbage value (nBase) and stores that in nBase, which results in a garbage value.

You have a lot of stuff in your code whose purpose escapes me. What I had in mind is more like the following. Note that I changed the size of the data array to 40000. You need to make the same change for the corresponding array.

I think the following will work (i.e., the single read statement will suck up all of the data), but I don't have a Fortran compiler to test it, so caveat emptor.

Code (Fortran):
implicit none
integer ::eof
character, intent(in):: filename*20
character, intent(out):: baseq*40000
eof=0
open (102, file=filename, Form="formatted", status="old" )

22. Dec 5, 2015

### Sue Parks

I was trying to remove the header (first line of the file). How would I call the subroutine from the main. I just want to make sure the characters are in an array. I'm close!

output:
Sues-MacBook-Air:fortran sueparks$gfortran a.f90 Sues-MacBook-Air:fortran sueparks$ ./a.out
Backtrace for this error:
#0 0x10f112092
#1 0x10f1113b0
#2 0x7fff8767ff19
Segmentation fault: 11
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Code (Fortran):

program a

character*20:: filename
character, allocatable :: baseq*40000
Write (102,*) baseq ! <-------

end program a

implicit none
!integer ::eof
character, intent(in):: filename*20
character, intent(out):: baseq*40000

open (102, file=filename, Form="formatted", status="old" )
Write (102, *) baseq
close(102)

23. Dec 6, 2015

### Staff: Mentor

I believe your problem comes from not actually allocating storage for your baseq array.
Code (Fortran):
program a
character*20:: filename
character, allocatable :: baseq*40000  ! <--- this line
Write (102,*) baseq !
end program a
Merely stating that it is "allocatable" doesn't allocate any storage. To do that, you need to have a statement something like
allocate(baseq(40000)) following your declaration of the baseq variable.

Ideally your code would determine the size of the file, and then allocate the correct amount of storage (using allocate(baseq(size)) ).

I don't know if that approach is feasible, given your current working knowledge of Fortran, so IMO using "allocatable" and "allocate" seems like extra complication for no gain. Instead of using dynamic array allocation (i.e., using allocatable and allocate), just declare an array that is big enough, like so:
Code (Fortran):

character*40000 :: baseq

24. Dec 7, 2015

### Sue Parks

That was extremely helpful! I have another question. I have a short algorithm to help determine the Amino acid frequency. Could you take a look? I am just giving you an example of what I am thinking, mostly because I am not 100% sure how to extract the information from the array (baseq in the case).

Code (Fortran):

String = "ABDJFAAUGBJDK"   ! Just found an example to determine functionality
Write (*,*) String ! checking

! seems plausible to have some sort of subroutine to count the "elements"
subroutine countAminoAcids(sumAminoAcids, baseq)
Integer :: sumAminoAcids
! intent (in) :: baseq
! intent (out) sumAminoAcids
sum all elements
do i=1, len(array)
sumAminoAcids = sumAminoAcids + 1
end do
end subroutine countAminoAcids

! Here is my Data (list of ALL possible amino acids
DATA /'A,C,D,E,F,G,H,I,K,L,M,N,P,Q,E,A,T,V,W,Y'/

! seems plausible to loop through the array to sum (countA) and avg(averageA)
! Is seems easier to do this is one subroutine
subroutine AAFrequencyTitin

Integer:: countA  , averageA  , countC , countD, average D
do i =1, len(String)
if ( String == 'A')
countA = countA + 1
averageA = countA/sumAminoAcids

else if (String == 'D')
countD = countD + 1
averageD = countD/sumAminoAcids

end do
end subroutine AAFrequencyTitin

25. Dec 7, 2015

### jbriggs444

You never use baseq. As written, this is equivalent to "sumAminoAcids = len(array)"

Documentation is important. What is this subroutine supposed to do? What are its inputs. What are its outputs?

If you want to access character number i in String, that would be String(i:i)

You've declared both countD and sumAminoAcids as integers. What happens in Fortran when an integer division would give a result that is between zero and one?