MCNP6 with MPI failed with signal 11 (Segmentation fault)


Discussion Overview

The discussion revolves around troubleshooting a segmentation fault encountered while running MCNP6 with MPI. Participants explore potential causes, resource management issues, and input file considerations related to memory usage and process management.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Experimental/applied

Main Points Raised

  • One participant reports a segmentation fault during execution, noting that the input file runs normally outside of the MPI context and that there is ample memory available.
  • Another participant suggests that the segmentation fault could stem from various causes and requests a simplified version of the input file for further analysis. They also question the appropriateness of using 50 processes given the machine's core count.
  • A later reply mentions that the issue may not be related to the input file but rather to resource reservation and management, suggesting that MCNP may fail to secure necessary resources if other processes are running concurrently.
  • One participant shares their experience with similar issues on a UNIX system, indicating that resource contention could lead to failures during synchronization of data batches, and describes a solution involving reserving entire servers for exclusive use by MCNP.

Areas of Agreement / Disagreement

Participants express varying opinions on the causes of the segmentation fault, with no consensus on a single solution or definitive cause. Multiple competing views on resource management and input file relevance remain present.

Contextual Notes

Participants highlight potential limitations related to resource allocation and management, as well as the complexity of diagnosing segmentation faults in MPI environments. Specific details about system configurations and input file characteristics are noted but not resolved.

Albert ZHANG
I use Python scripts to run mcnp.mpi, like
mpirun -np 50 i=inp_...
and I encountered this error report:
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 31 with PID 0 on node Ubuntu exited on signal 11 (Segmentation fault).
The script had run normally for a few hours. I extracted the inp file and it runs normally on its own.
I searched on the Internet and it seems to be a memory-related problem, but I checked the log and there is still 100+ GB available, so I don't really know how to solve it.
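
For context, here is a minimal Python sketch of the kind of wrapper script being described, launching MCNP under mpirun via subprocess and surfacing the exit code. The executable name, file names, and process count are assumptions, not the poster's actual script:

import subprocess

# Hypothetical sketch of a wrapper script like the one described above; the
# executable name, input/output file names, and process count are assumptions.
cmd = ["mpirun", "-np", "50", "mcnp.mpi", "i=inp_case", "o=out_case"]

result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
    # A rank killed by SIGSEGV surfaces here as a non-zero exit code.
    print("MCNP run failed with exit code", result.returncode)
    print(result.stderr)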
 
So many things cause MCNP to segfault that it's difficult to say. Can you share the input file, or a cut-down version of the input file that causes the same error? A PID of 0 seems weird; a quick Google says that is the paging process. Are you doing any big mesh tallies?

The only other thing I can think of is that 50 seems like a lot of copies. How many cores does the machine have?
 
Alex A said:
So many things cause MCNP to segfault that it's difficult to say. Can you share the input file, or a cut-down version of the input file that causes the same error? A PID of 0 seems weird; a quick Google says that is the paging process. Are you doing any big mesh tallies?

The only other thing I can think of is that 50 seems like a lot of copies. How many cores does the machine have?
Thanks for your reply, Alex. My script updates the input file, so it may not be a problem with the input file itself. I searched some other results and found that PRDMP may help, but I am still not sure. The input file is focused on radiation shielding, so I didn't set up any mesh tallies.
I run MCNP on a server with 32 cores and 128 processors.
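
For anyone hitting the same question about -np versus core count, here is a small Python sanity-check sketch; the requested value of 50 comes from this thread, but treat the check itself as an illustrative assumption rather than a fix:

import os

# Quick sanity check, not a fix: compare the requested number of MPI ranks
# with the logical processors the operating system reports.
requested = 50                # value passed to mpirun -np
available = os.cpu_count()    # logical CPUs visible to the OS

if available is not None and requested > available:
    print(f"Asking for {requested} ranks on {available} logical CPUs; "
          "consider lowering -np to avoid oversubscription.")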
 
Sorry to do the old post thing. Maybe it helps somebody else.

I experienced similar things on a UNIX system running MCNP with MPI. The problem turned out to be that MCNP is not very clever about reserving the resources it needs: memory, connections between nodes, and such. So if some other process started and took one of those resources, MCNP might look for it and fail to get it. And this could happen after many particles had already been run, for example when nodes did a sync-up with their data batches and attempted to start a new batch of particles.

If that's the problem, how you fix it may be dependent on details of your system.

We managed to solve the problem by reserving entire servers for MCNP, and not letting anything else run on those servers. Literally nothing else, not even the sys-admin logging in, was permitted on those servers during our runs. And we had to do some script hacking to make sure that MCNP only ran on our reserved servers and not on any of the rest of the system.
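
For anyone trying to reproduce that kind of setup, one possible (hypothetical) way to pin an MPI run to reserved servers is an Open MPI hostfile; the hostnames, slot counts, process count, and file names below are illustrative only, and the exact flags depend on the MPI implementation and scheduler in use:

import subprocess

# One possible form of that "script hacking": write an Open MPI hostfile naming
# only the reserved servers, then launch MCNP restricted to them. Hostnames,
# slot counts, and file names here are hypothetical.
reserved = {"node01": 16, "node02": 16}

with open("mcnp_hosts", "w") as f:
    for host, slots in reserved.items():
        f.write(f"{host} slots={slots}\n")

cmd = ["mpirun", "--hostfile", "mcnp_hosts", "-np", "32",
       "mcnp.mpi", "i=inp_case", "o=out_case"]
subprocess.run(cmd, check=True)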
 
