Input/Output error with error code -5

Discussion Overview

The discussion revolves around an input/output error encountered while running a FORTRAN program on a high-performance computing cluster. Participants explore potential causes and solutions related to disk space, program compatibility, and resource management in a multi-node environment.

Discussion Character

  • Technical explanation
  • Debate/contested

Main Points Raised

  • One participant describes encountering a random input/output error when running a FORTRAN program on a cluster, noting that the error occurs at different lines and time steps each time.
  • Another participant asks if the program is writing to a disk file and suggests checking disk space using the "df -h" command.
  • A subsequent reply confirms that there is enough disk space available.
  • One participant recommends contacting the support staff for the cluster, indicating that assistance may be limited without access to the system.
  • Another participant expresses uncertainty about the concept of a cluster and poses questions regarding program compilation and execution, including whether multiple instances are writing to the same file.
  • A participant suggests that the error might be related to resource allocation or a buffer overflow, speculating about potential conflicts with node access to the disk.
  • Another reply emphasizes the importance of ensuring that different nodes do not attempt to write to the same file simultaneously, advocating for a single node to manage all input/output operations.

Areas of Agreement / Disagreement

Participants express various hypotheses about the cause of the error, with no consensus reached on a specific solution or underlying issue. Multiple competing views regarding the nature of the problem and potential resolutions remain present.

Contextual Notes

There are unresolved questions regarding the program's compatibility with the cluster environment, the handling of file I/O across multiple nodes, and the specifics of error handling within the program.

kelvin490
I have a problem running my FORTRAN program on a high-performance computing cluster. It runs well on my PC, but I want to mass-produce data with different initial conditions, so I put it on a cluster node with eight cores to simulate eight sets of data.

The program runs without problems in the home directory, but since I need extra storage space, a scratch hard disk was added and I now run the programs on that disk.

After a while the program stops with this error message:

PGFIO/stdio: Input/output error
PGFIO-F-/formatted write/unit=6/error code returned by host stdio - 5.
File name = stdout formatted, sequential access record = 181
In source file TipNew8.f90, at line number 2018
FORTRAN STOP

I have run it several times and a similar error occurs each time, but at a different line and a different time step, so it seems quite random. It always occurs at a line containing a "write" or "print" statement. The program runs without problems on my PC using Microsoft Visual Studio with the PGI compiler.

Does anyone have an idea what is wrong with the program?
 
Is your program writing to a disk file? Do you have enough disk space on this scratch disk?

You could check with the "df -h" command if this is Linux.
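A rough sketch of a slightly fuller check than "df -h" alone, assuming a Linux-like shell. `SCRATCH_DIR`, `free_kb` and `can_write` are made-up names for illustration, and `df -i` (inode usage) is not offered by every `df`, so the script tolerates its absence. It reports free space and free inodes, then tries to create and remove a small file.

```shell
#!/bin/sh
# Sketch only: SCRATCH_DIR is a hypothetical placeholder for the scratch disk path.
SCRATCH_DIR="${SCRATCH_DIR:-.}"

# Print the free space of the filesystem holding "$1", in 1 KiB blocks.
# -P selects the POSIX output format (one line per filesystem), -k uses 1 KiB units.
free_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Succeed only if a small file can be created and removed in "$1".
can_write() {
    probe="$1/.write_probe.$$"
    touch "$probe" 2>/dev/null && rm -f "$probe"
}

df -h "$SCRATCH_DIR"
df -i "$SCRATCH_DIR" 2>/dev/null || echo "inode report not supported by this df"
echo "free space: $(free_kb "$SCRATCH_DIR") KiB"
if can_write "$SCRATCH_DIR"; then
    echo "write test: ok"
else
    echo "write test: FAILED"
fi
```

A disk can show free space and still refuse writes if it is out of inodes or over a quota, which is why the sketch looks at more than one number.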
 
jedishrfu said:
Is your program writing to a disk file? Do you have enough disk space on this scratch disk?

You could check with the "df -h" command if this is Linux.

I have checked and there is enough space.
 
I suggest you contact the support staff responsible for the cluster. I don't think there is much we can do to help you without access to the system.
 
I don't know exactly what "cluster" means, but here are some ideas:
  • Are you compiling your program on your PC and then running it on the cluster?
  • Can you run your program on the cluster without taking advantage of the cluster aspect, i.e. as just one instance? Does it run that way?
  • Can you compile your program on the cluster itself, to ensure compatibility?
  • What does "cluster" mean here? Many independent instances of the same program? Are they all writing to the exact same file, or are the file names different?
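On the last question, one way to rule out shared output is to give every instance its own directory and its own stdout file. The error above is reported on unit 6 (stdout), so this also takes the terminal out of the picture, though that may or may not be the cause. This is only a sketch, assuming independent serial instances; `launch_runs`, the `run_<i>` layout and the log file names are invented for illustration.

```shell
#!/bin/sh
# launch_runs BASE N CMD [ARGS...]
# Start N copies of CMD, each inside its own directory BASE/run_<i>,
# with stdout and stderr redirected to private log files.
# CMD should be an absolute path, because each copy runs after a cd.
launch_runs() {
    base="$1"
    n="$2"
    shift 2
    i=1
    while [ "$i" -le "$n" ]; do
        run_dir="$base/run_$i"
        mkdir -p "$run_dir" || return 1
        # The subshell keeps the cd local to this instance.
        ( cd "$run_dir" && "$@" > stdout.log 2> stderr.log ) &
        i=$((i + 1))
    done
    wait   # block until every instance has finished
}
```

Usage would look something like `launch_runs /path/to/scratch 8 /absolute/path/to/program`, where both paths are placeholders. Any other output files the program opens by a fixed name then also land in separate directories.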
 
kelvin490 said:
I have checked and there is enough space.
Space shouldn't matter if your program checks for it before starting the heavy processing; running out of space would then be just one known way for the run to terminate.
Are all channel resources guaranteed to be allocated before the run?
Program error handling ...

It sounds, though, as if something similar is happening, such as a buffer overflow somewhere, or a node conflict and timeout to disk access.

Is that coming from your software or from the cluster? I don't know enough about it. Could it be coming from the network links?

Random means that the error is indeterminate, i.e. everything works well until the error occurs and then you have a complete collapse; for example, adding the scratch disk may have led to an overwhelming accumulation of data.

That's about all I know.
 
256bits said:
or a node conflict and timeout to disk access.
Now that you mention it, this is what I would investigate first. You should be careful that different nodes are not trying to write to a file at the same time. It is very good practice to have one node handle all input/output.
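A minimal sketch of the single-writer idea for independent runs, assuming each instance has already written to its own `run_<i>/stdout.log` (a hypothetical layout, not something from the original post). Only one process, the collector below, ever writes to the combined file. `merge_logs` is an illustrative name.

```shell
#!/bin/sh
# merge_logs BASE OUT
# Concatenate every BASE/run_*/stdout.log into OUT, one after another,
# with a header line naming the file each section came from.
merge_logs() {
    base="$1"
    out="$2"
    : > "$out" || return 1          # create or truncate the combined file
    for f in "$base"/run_*/stdout.log; do
        [ -f "$f" ] || continue     # skip the unexpanded pattern if nothing matches
        printf '==> %s <==\n' "$f" >> "$out"
        cat "$f" >> "$out"
    done
}
```

Because the merge runs after the instances have finished, no two processes hold the combined file open for writing at the same time.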
 
