Fortran Input/Output error with error code -5

The discussion centers on troubleshooting a FORTRAN program that runs successfully on a personal computer but encounters random input/output errors when executed on a high-performance computing cluster. The user reports that the program stops with an error message related to writing output, specifically during "write" or "print" functions, and the errors occur at different lines and time steps in each run. Despite having sufficient disk space on the scratch disk, the issue persists.

Key points raised include the importance of ensuring that multiple nodes are not writing to the same file simultaneously, which could lead to conflicts. Suggestions for troubleshooting involve checking the program's compilation environment, verifying that it is compatible with the cluster, and considering whether the program can run in a single instance without utilizing the cluster's full capabilities. Additionally, the potential for buffer overflows or network-related issues is mentioned. The recommendation is to consult cluster support staff for further assistance, as they would have better access to the system's specifics.
kelvin490
I have a problem running my FORTRAN program on a high-performance computing cluster. It runs well on my PC, but I want to mass-produce data with different initial conditions, so I put it on a cluster node with eight cores to simulate eight sets of data.

The program runs without problems in my home directory, but since I need extra space, a scratch hard disk was added and I now run the programs from this disk.

After a while the program stops with this error message:

PGFIO/stdio: Input/output error
PGFIO-F-/formatted write/unit=6/error code returned by host stdio - 5.
File name = stdout formatted, sequential access record = 181
In source file TipNew8.f90, at line number 2018
FORTRAN STOP

I have run it several times. A similar error occurs each time, but at different lines and at different time steps, so it seems quite random. It always occurs at a line with a "write" or "print" statement. The program runs without problems on my PC using Microsoft Visual Studio with the PGI compiler.

Does anyone have ideas what's wrong with the program?
 
Is your program writing to a disk file? Do you have enough disk space on this scratch disk?

You could check with the "df -h" command if this is Linux.
 
jedishrfu said:
Is your program writing to a disk file? Do you have enough disk space on this scratch disk?

You could check with the "df -h" command if this is Linux.

I have checked and there is enough space.
 
I suggest you contact the support staff responsible for the cluster. I don't think there is much we can do to help you without access to the system.
 
I don't know exactly what "cluster" means here, but here are some ideas:
  • Are you compiling your program on your PC and then running it on the cluster?
  • Can you run just one instance of your program on the cluster, without taking advantage of the cluster aspect? Does it run that way?
  • Can you compile your program on the cluster itself, to ensure compatibility?
  • What does "cluster" mean? Many independent instances of the same program? Are they all writing to the exact same file, or are the file names different?
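To illustrate the last question in the list above, here is a minimal sketch of how each of the eight instances could write to its own file instead of sharing one. It assumes, purely for illustration, that a run index is passed as the first command-line argument; the program name, the file name pattern `run_NNN.log`, and unit number 20 are all hypothetical and not taken from the original program.

```fortran
program unique_output
  implicit none
  integer, parameter :: out_unit = 20   ! hypothetical unit number
  character(len=32)  :: arg
  character(len=64)  :: fname
  integer :: run_id, ios

  ! Assumed convention: the run index is the first command-line
  ! argument, e.g. "./a.out 3". Fall back to 1 if it is missing or invalid.
  run_id = 1
  if (command_argument_count() >= 1) then
    call get_command_argument(1, arg)
    read(arg, *, iostat=ios) run_id
    if (ios /= 0) run_id = 1
  end if

  ! Build a file name that no other instance shares, e.g. run_003.log
  write(fname, '(a,i3.3,a)') 'run_', run_id, '.log'

  open(unit=out_unit, file=trim(fname), status='replace', &
       action='write', iostat=ios)
  if (ios /= 0) then
    print *, 'could not open ', trim(fname), ', iostat = ', ios
    stop 1
  end if

  write(out_unit, '(a,i0)') 'output of run ', run_id
  close(out_unit)
end program unique_output
```

Each instance runs as a separate process, so reusing the same unit number in all of them is harmless; only the file name has to differ.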
 
kelvin490 said:
I have checked and there is enough space.
Space shouldn't matter if your program checks for it before doing the heavy processing, in which case running out of space would be one termination mode.
Are all channel resources guaranteed to be allocated before the run?
What about the program's error handling?

It sounds as though something similar is happening, such as a buffer overflow somewhere, or a node conflict and timeout to disk access.

Does the error come from your software or from the cluster? I don't know enough about it. Could it come from the network links?

Random means that the error is indeterminate, i.e. everything works well until the error occurs and then you have a complete collapse, for example if adding the scratch disk has led to an overwhelming accumulation of data.

That's about all I know.
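On the error-handling point, the message in the first post refers to unit=6 (stdout), and if the cluster runs Linux, host error code 5 corresponds to EIO, which matches the "Input/output error" text. The sketch below is one possible way to keep a failed write from killing the run: it writes progress messages to a file and checks `iostat=` on every write. The file name `progress.log`, the unit number, and the loop are assumptions made for illustration, not part of the original program.

```fortran
program checked_write
  implicit none
  integer, parameter :: log_unit = 21   ! hypothetical unit number
  integer :: ios, step, n_failed
  character(len=256) :: msg

  msg = ''
  n_failed = 0

  open(unit=log_unit, file='progress.log', status='replace', &
       action='write', iostat=ios, iomsg=msg)
  if (ios /= 0) then
    print *, 'open failed: ', trim(msg)
    stop 1
  end if

  do step = 1, 5
    ! With iostat= present, a failed write returns a nonzero code
    ! instead of terminating the program.
    write(log_unit, '(a,i0)', iostat=ios, iomsg=msg) 'finished step ', step
    if (ios /= 0) then
      ! Count the failure and carry on; a real program might
      ! save its state here and stop cleanly instead.
      n_failed = n_failed + 1
    end if
  end do

  close(log_unit)
  if (n_failed > 0) print *, 'number of failed writes: ', n_failed
end program checked_write
```

This does not remove the underlying cause, but it shows whether the failures are limited to stdout and lets the computation continue while the cluster support staff investigate.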
 
256bits said:
or a node conflict and timeout to disk access.
Now that you mention it, this is what I would investigate first. You should be careful that different nodes are not trying to write to a file at the same time. It is very good practice to have one node handle all input/output.
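As a rough sketch of the "one node handles all input/output" advice: the thread does not say the program uses MPI (the first post describes eight independent runs), so the example below only applies if it were parallelised that way. It assumes an MPI library is available; the placeholder result, the file name `results.dat`, and unit number 22 are hypothetical.

```fortran
program single_writer
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  double precision :: local_value
  double precision, allocatable :: all_values(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Placeholder for the result each process computes
  local_value = dble(rank) * 1.5d0

  ! The receive buffer only matters on the root process, but it is
  ! allocated everywhere to keep the sketch simple.
  allocate(all_values(nprocs))
  call MPI_Gather(local_value, 1, MPI_DOUBLE_PRECISION, &
                  all_values, 1, MPI_DOUBLE_PRECISION, &
                  0, MPI_COMM_WORLD, ierr)

  ! Only rank 0 touches the file, so no two processes write at once.
  if (rank == 0) then
    open(unit=22, file='results.dat', status='replace', action='write')
    write(22, '(es15.7)') all_values
    close(22)
  end if

  deallocate(all_values)
  call MPI_Finalize(ierr)
end program single_writer
```

It would typically be built with the MPI compiler wrapper provided on the cluster and started through the cluster's job launcher; the exact commands depend on the installation.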
 
