Fortran [FORTRAN] How did I screw up my code?

Click For Summary
A user is seeking help to debug their FORTRAN code, which has been problematic for two weeks. The code involves complex calculations related to electron and baryon densities, mixing angles, and mass parameters in a neutrino sphere context. Key issues include the initialization of matrices and the handling of input data files, particularly for electron fraction and density profiles. The user has provided extensive code snippets, indicating a focus on ensuring correct calculations and matrix properties. The discussion highlights the challenges of debugging scientific computation code in FORTRAN.
  • #31
Some old advice on debugging screwy fortran problems:

(0) if its not your code, make it your code ie reformat it, add comments and debugging print statements at the entry points of subroutines
(1) If it worked reasonably well before then look closely at your changes
(2) Use a debugger and step through your code and take notes as to what is happening
(3) Use an IDE as it may help with debugging, refactoring and reformatting code (Eclipse IDE supports Fortran with GCC)

Where bugs usually crop up:

http://www.fortrantutorial.com/debugging-tips/index.php

In my experience:
(1) Look at your conditionals and make sure they do what you want. Add prints so you know which way the conditional branched and what conditions were true or false.
(2) Are you passing any constants to subroutines which in turn may change the value of the argument?
Years ago I passed an integer 3 into a subroutine and inadvertently changed it to an integer 5 and because argument passing was done as call by reference all numeric 3 constants in my program changed to 5 resulting in a very difficult bug to find. I think this has been fixed in later compiler design to no longer link all literal references back to the same memory location.
(3) Are you using any uninitialized variables? Your strategically placed prints should root these problems out.
(4) Are you using any common blocks? Make sure the variables in each block match up with other related blocks in datatype and for your sanity in name.
One piece of inherited code had changed the names of the block variables and my new code changed the datatype causing a hidden mismatch which took a few days to uncover even though I had searched for all references I didn't make the common block connection and didn't notice the name change.

The key point was that I had to question every aspect of my changes to find the problem.

Create a small set of logging functions instead of using prints or writes as it makes it easier to turn logging on and off.

Lastly, from what I've seen here is that no one can really help you. You have the code, the runtime environment and now you must really dig down deep to find the source of your problem. It may be in your changes, it may be a bug that's been there for ever, it may hidden in a support library that you use or even an odd compiler/runtime issue.

The key here is that until you can isolate it down to one line. We won't be able to really help debug this remotely.
 
Last edited:
Technology news on Phys.org
  • #32
One other long range thought is to convert your code to Julia or Python and in doing so begin to understand how it processes your data.

Julia is perhaps the better choice as it works well with fortran, runs at comparable speeds and has a MATLAB like syntax which many students learn nowadays.

In addition, the IJulia tool would allow you to debug parts of your code as you go along.
 
  • #33
jedishrfu said:
Some old advice on debugging screwy fortran problems:

(0) if its not your code, make it your code ie reformat it, add comments and debugging print statements at the entry points of subroutines
I would add: use implicit none and declare all variables by hand. It can be a lot of work with long programs, but you eliminate a great source of bugs, and you also learn some things about the program when you have to list all variables.
 
  • Like
Likes jedishrfu
  • #34
Mark44 said:
Using a small set of data (like your input file with 6 elements, I think you said), tell us what you expect the code to produce as output, and what it actually produces as output. The more precise information about what is happening you can provide, the better chance we have of figuring out what the problem is.

My code outputs a lot of stuff, but the main output that I can read on my end is (sort of) an average over a bunch of energy and angle bins of a ton of wave-functions (actually the output absolute squares the wave function so what I'm seeing are actual probabilities) which basically started out as (1,0,0,0,0,0) or (0,1,0,0,0,0) or (0,0,1,0,0,0) etc. So it outputs a 6x6 grid (for the physics, the dimension is 6 due to neutrino-anti-neutrino mixing). It's very very long, but if you want me to copy paste the output here I can. The gist of it is, I take a look at what happens to these wave functions as a neutrino propagates out from a supernova. Let's say I take a look at the probability that an electron neutrino turns into a muon neutrino. This is basically the number that appears in the second slot of the (1,0,0,0,0,0) wave function. The "usual output" (omitting a bunch of data) looks something like this:
Radius: Probability:
900, 0
901, 1E-9
902, 1E-9
950, 1E-8
1000, 1E-6
1500, 1E-3 etc.

The bugged output looks like this:

Radius: Probability:

900, 0
900.001, 1E-9
900.01, 1E-3
901, 1E-1
1000, 1E-1
1500, 1E-1 etc.

As you can see, the probability jumps up very very quickly in the bugged output. And in fact, no matter what I manually set Yevar to, this output will not change as long as the lines reading in ryegrid and yegrid are included in the code.

Is this specific enough?

Earlier I asked you this:

Your reply was this:

I said I didn't understand the last part of what you said. What does "I don't actually know of one a chunk of code like this" mean?

Just about any debugger would let you set a breakpoint in your main program (or anywhere else in any of the called subroutines and functions), and then start single-stepping through your code.

Also, what compiler are you using?

The only time previously where I've used a debugger has been on some kind of platform like Microsoft Visual Studio (for C++) or Silverfrost. I used a program where I wrote the code inside the program and then it has a "debug" button at top which I can click and it walks through the code looking at the variables I want to look at. I've only done this for small pieces of code (e.g. 200-300 lines).

I run this large fortran code in an ubuntu terminal. I installed gfortran to compile it, and just run it inside the terminal. I don't know how to debug something inside a terminal, and I also am not familiar with what to do once a ton of modules and subroutines and functions etc. come into play.
 
  • #35
Checkout gdb it should allow you debug in a trrminal with simple commands to step through or set breakpoints.
 
  • #36
Could it be possible to see the code where you call varelec_init?
 
  • #37
Here's a link to a man page on gdb, the debugger that jedishrfu suggested. http://man.yolinux.com/cgi-bin/man2html?cgi_command=gdb&cgi_section=&cgi_keyword=m
The page here says that "support for Fortran will be added when a GNU Fortran compiler is ready." The page was written in 2002, and I'm pretty sure that gfortran is a GNU Fortran compiler.

The best advice we can give you is to jump right in with some debugger -- pretty much any debugger. It looks pretty straightforward to set breakpoints - set one in any function that calculates a probability and see why the flaky numbers are showing up.
 
  • #38
And here's a tutorial to go along with Mark44's man page:

http://people.sc.fsu.edu/~jburkardt/f_src/gdb/gdb.html

However, in your case, I suspect you'll need logging of intermediate results so you can track through the code and verify that things are working as intended until you find the source of your error.

I recently had a similar problem with some Java code where my new version of a program's output didn't match the old version. I could swear I got it right and dreaded having to add extensive logging in both old and new programs to see where things varied and then I saw the problem. When mine launched it didn't get the last bearing value and so assumed zero degrees. Basically the programs were looking in different directions only.
 
  • #39
DrClaude said:
Could it be possible to see the code where you call varelec_init?

varelec_init is called in the main program, which is several thousand lines long. Do you want to see the call line itself, or like the whole code? o.o

To be clear, I can leave in the call to varelec_init, and as long as I comment out the 5 lines which actually read in ryegrid and yegrid inside varelec_init, the bug will disappear.

@Mark44 & @jedishrfu Thanks for the advice, I will take a look and try.
 
  • #40
Matterwave said:
varelec_init is called in the main program, which is several thousand lines long.
Not a reflection on you, Matterwave, but whoever wrote this code seems to have been completely unaware of one of the main principles of good programming practice - modularization, breaking a large program into smaller, more easily manageable chunks.
 
  • Like
Likes jedishrfu
  • #41
Mark44 said:
Not a reflection on you, Matterwave, but whoever wrote this code seems to have been completely unaware of one of the main principles of good programming practice - modularization, breaking a large program into smaller, more easily manageable chunks.

I mean, the code is comprised of 6-8 modules too...they are all a bit long XD
 
  • #42
Oh, I've been giving you guys the wrong information up to now...the bug is still there even if the read statements for ryegrid and yegrid are commented out. The only way I can remove the bug is to comment out the call to varelec_init totally. I thought I had commented that back in when I did the test runs with the ryegrid and yegrid read statements commented out...-.- Sorry. So the problem seems to be coming from simply calling varelec_init.

The call statement looks like this:

Fortran:
call varelec_init(varelecfrac,elecprofile,effilename,elecfracname)

I don't think I got this wrong...hopefully.
 
  • #43
Matterwave said:
So the problem seems to be coming from simply calling varelec_init.

The call statement looks like this:

Fortran:
call varelec_init(varelecfrac,elecprofile,effilename,elecfracname)
Can you show the lines where varelecfrac,elecprofile,effilename,elecfracname are declared in the part of the code where this call is?
 
  • #44
DrClaude said:
Can you show the lines where varelecfrac,elecprofile,effilename,elecfracname are declared in the part of the code where this call is?

Certainly. This stuff is read in from a file, the code there looks like this:

Fortran:
if(myid.eq.0) then
  read(5,*) irestart,isconsist,iprofile,spinco,varelecfrac,rhofilename,effilename
  read(5,*) dm21sq,dm32sq,theta12,theta13,theta23
  read(5,*) rsrad,drsinit,dr0,rfinal
  read(5,*) lum(1:nflavor,1:2)
  read(5,*) errtol,drwrite,itwrite
  read(5,*) drwvwrite,itwvwrite
  read(5,*) timemax
  read(5,*) ranrhoescale,ranrhodrscale

  dm21sq=dm21sq*1.d-12
  dm32sq=dm32sq*1.d-12

  do iphi=1,nphi
  do idr=1,nranrhodr
  ranrhoe(idr,iphi)=1.d0+ranrhoescale*rannyu(0)
  enddo
  enddo

!  convert maximum time to hours
  timemax(:)=timemax(:)*3600.d0
  endif

I didn't read in elecprofile because that's only used if varelecfrac=1...it is uninitialized. Maybe that's a problem? But it's not used right now anyways...elecfracname is just passed to the subroutine as a variable and it will be declared in the subroutine itself if certain conditionals are met (which they are not right now). I'm actually...not sure why this is... it was this way with helec_init, so I just kept the same structure as that chunk of code.

Uhm, some updates: editing out this whole chunk of code:

Fortran:
!  read data file
  iend=0
  iegrid=0
  do while (iend.eq.0)
  iegrid=iegrid+1
  read(21,*,iostat=ierr) ryegrid(iegrid), yegrid(iegrid)
  if(ierr.ne.0) then
  iend=1
  iegrid=iegrid-1
  close(21)
  endif
  enddo

  write(6,*) ' Electron fraction grid contains ',iegrid,'values'
  write(6,*) ' Beginning of Y_e grid '
  do i=1,min(5,iegrid)
  write(6,'(f15.7,e18.9)') ryegrid(i),yegrid(i)
  enddo
  write(6,*) ' End of Y_e grid '
  do i=max(1,iegrid-4),iegrid
  write(6,'(f15.7,e18.9)') ryegrid(i),yegrid(i)
  enddo

Does remove the bug... so it seems the bug is in these lines...maybe...I will keep testing.
 
  • #45
I was asking for the part of the code where the variables are declared, meaning where their type is defined.

Your bug is reminiscent of out-of-bounds array assignments. It seems that part of the memory is being modified when it shouldn't, so I would like to see what which are in the call varelec_init are actually like.
 
  • #46
DrClaude said:
I was asking for the part of the code where the variables are declared, meaning where their type is defined.

Your bug is reminiscent of out-of-bounds array assignments. It seems that part of the memory is being modified when it shouldn't, so I would like to see what which are in the call varelec_init are actually like.

Ah ok, they are declared in different places. varelecfrac is declared in module params:
Fortran:
integer :: varelecfrac
Elecprofile is declared in the main code (I might move it to params...):
Fortran:
integer :: elecprofile
effilename and elecfrac name are also decalred in the main code:
Fortran:
character*80 helname,rhofilename,effilename,elecfracname
 
  • #47
UPDATE: Ok, something funky is happening with iegrid. I can change the output of the code by changing iegrid from 0 (regular bug) to 30 (new bug) to 292 (works like usual). It must be messing around with helec_init...and this is probably another instance where an implicit variable is screwing me. :( I think that's what's messing the code up. I'm going to go ahead and rename this variable and see what happens.

EDIT UPDATE: Yes...renaming iegrid to iegrid2 has removed the bug...I don't know...what...the program was doing with it though... .____. I thought it was just an index for a do-loop like "i" or "j" that's frequently used throughout the code...
 
Last edited:
  • #48
Matterwave said:
UPDATE: Ok, something funky is happening with iegrid.
I'm not at all surprised.

Looking at the code again for the varelec_init() sub I see the following:
1. The elecprofilef parameter is never used in the subroutine. The question naturally arises, why is it a parameter?
2. A number of variables appear (as if by magic!) in varelec_init(). These are myvid, icomments, iend, iegrid, ryegrid, and yegrid. They are either local variables that aren't declared, or (much worse) variables that are declared in a module. Either way, these aren't good news.
Matterwave said:
I can change the output of the code by changing iegrid from 0 (regular bug) to 30 (new bug) to 292 (works like usual). It must be messing around with helec_init...and this is probably another instance where an implicit variable is screwing me. :( I think that's what's messing the code up. I'm going to go ahead and rename this variable and see what happens.
 
Last edited:
  • #49
iend and iegrid were never declared as far as I can tell, they appear like all other do loop indices out of thin air...I have since declared them at the top of varelec_init. ryegrid and yegrid are declared at the top of the module just like rhogrid,rhobgrid, rgrid, etc., which are used in helec_init, are declared at the top of the module... myid...I have to go look for where that is defined...

Oh wait, iegrid was declared at the top of the module...uh...
 
  • #50
Mark44 said:
2. A number of variables appear (as if by magic!) in varelec_init(). These are myvid, icomments, iend, iegrid, ryegrid, and yegrid.
I see ryegrid and yegrid declarations at the top of the helectron module. I can't find one for myvid. The helectron module uses two other modules, params and parallel, so maybe myvid is declared in one of these.
 
  • #51
It might be useful to run a fortran lint command against your source code to see what other odd things lurk therein.

Some fortran tools to check out:

http://www.dmoz.org/Computers/Programming/Languages/Fortran/Tools/Code_Analysis/
 
  • #52
Ok, after extensive testing, changing iegrid to iegrid2 (having declared it at the top of varelec_init) has solved the bug issue it seems!

myid was declared in the module parallel...I suppose it has something to do with parallelization of the code. I have not had to work with multiple processors as of yet.
 
Last edited:
  • #53
Working with multiple processors and/or threads can be very tricky, make sure you read up on it and look at simple examples and learn about ways to synchronize your code within a fortran environment. I found these examples:

http://people.sc.fsu.edu/~jburkardt/f_src/openmp/openmp.html

and this:

https://computing.llnl.gov/tutorials/parallel_comp/
 
  • #54
Just a thought, Matterwave. Maybe you could put all of your code in github so that anyone interested could look at the whole code. At least temporarily (it's pretty easy to delete the github repository). In the readme file you could put an example of the input and the expected output.
If you're interested let us know and we'll give you the commands to enter in the terminal.
 
  • #55
fluidistic said:
Just a thought, Matterwave. Maybe you could put all of your code in github so that anyone interested could look at the whole code. At least temporarily (it's pretty easy to delete the github repository). In the readme file you could put an example of the input and the expected output.
If you're interested let us know and we'll give you the commands to enter in the terminal.

This should be checked with his supervisor as some colleges and universities look dimly on posting proprietary code online.
 

Similar threads

  • · Replies 16 ·
Replies
16
Views
2K
  • · Replies 6 ·
Replies
6
Views
3K
  • · Replies 1 ·
Replies
1
Views
2K
  • · Replies 2 ·
Replies
2
Views
3K
  • · Replies 2 ·
Replies
2
Views
5K
  • · Replies 6 ·
Replies
6
Views
5K
  • · Replies 1 ·
Replies
1
Views
11K
  • · Replies 6 ·
Replies
6
Views
2K