MCNP6 with mpi failed with signal 11 (Segmentation fault)

SUMMARY

The forum discussion centers on a segmentation fault (signal 11) encountered while running MCNP6 with MPI on a server using the command mpirun -np 50 i=inp_.... Users reported that the issue may stem from resource allocation problems, particularly memory and node connections, rather than the input file itself. One user resolved the issue by reserving entire servers exclusively for MCNP, ensuring no other processes interfered during execution. The discussion highlights the importance of resource management in high-performance computing environments.

PREREQUISITES
  • Familiarity with MCNP6 and its MPI implementation
  • Understanding of parallel computing concepts
  • Knowledge of resource management in high-performance computing
  • Basic proficiency in Python scripting for automation
NEXT STEPS
  • Investigate memory management techniques in MCNP6
  • Learn about resource reservation strategies for high-performance computing
  • Explore the use of PRDMP for debugging MCNP6 issues
  • Study the impact of node synchronization on parallel processing
USEFUL FOR

Researchers, computational physicists, and system administrators working with MCNP6 in high-performance computing environments who need to troubleshoot segmentation faults and optimize resource allocation.

Albert ZHANG
Messages
3
Reaction score
0
I use Python scripts to run mcnp.mpi like this:
mpirun -np 50 i=inp_...
And I encountered this bug report
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 31 with PID 0 on node Ubuntu exited on signal 11 (Segmentation fault).
The script had run normally for a few hours. I extracted the inp file and it runs normally on its own.
I searched on the Internet and found it seems to be a memory-related problem, but I checked the log and there were still 100+ GB available. So I don't really know how to solve the problem.
 
So many things cause MCNP to segfault that it's difficult to say. Can you share the input file, or a cut-down version of the input file that causes the same error? A PID of 0 seems weird; a quick Google says that is the paging process. Are you doing any big mesh tallies?

The only other thing I can think of is that 50 seems like a lot of copies. How many cores does the machine have?
 
Alex A said:
So many things cause MCNP to segfault that it's difficult to say. Can you share the input file, or a cut-down version of the input file that causes the same error? A PID of 0 seems weird; a quick Google says that is the paging process. Are you doing any big mesh tallies?

The only other thing I can think of is that 50 seems like a lot of copies. How many cores does the machine have?
Thanks for your reply, Alex. My script updates the input file, so it may not be a problem with the input file itself. I found in other search results that PRDMP may help, but I am still not sure. The input file focuses on radiation shielding, so I didn't set up any mesh tallies.
I run MCNP on a server with 32 cores and 128 processors.
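If the machine really exposes fewer than 50 usable cores, oversubscription alone could destabilise a run. A minimal sketch along those lines (the helper name `build_mpirun_cmd` is my own, not from the thread; the `mcnp6.mpi i=` form follows the command Albert posted) caps the rank count at whatever `os.cpu_count()` reports:

```python
import os

def build_mpirun_cmd(inp_file, requested_ranks=50):
    """Build an mpirun command line, never asking for more ranks
    than the machine has visible CPUs.

    Oversubscribing cores is one hypothesised cause of the crash;
    this simply clamps -np to os.cpu_count().
    """
    cores = os.cpu_count() or 1
    ranks = min(requested_ranks, cores)
    # "mcnp6.mpi" and the "i=" keyword follow the thread's command;
    # adjust the executable name for your installation.
    return ["mpirun", "-np", str(ranks), "mcnp6.mpi", f"i={inp_file}"]
```

Note that `os.cpu_count()` counts logical CPUs; on a hyperthreaded box you may want to clamp to half that for a compute-bound code like MCNP.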
 
Sorry to do the old post thing. Maybe it helps somebody else.

I experienced similar things on a UNIX system running MCNP with MPI. The problem turned out to be that MCNP is not very clever about reserving the resources it needs: memory, connections between nodes, and so on. So, if some other process started up and took one of those resources, MCNP might look around and fail to get it. And this could happen after many particles had already been run, for example when nodes synced up their data batches and attempted to start a new batch of particles.

If that's the problem, how you fix it may be dependent on details of your system.

We managed to solve the problem by reserving entire servers for MCNP, and not letting anything else run on those servers. Literally nothing else, not even the sys-admin logging in, was permitted on those servers during our runs. And we had to do some script hacking to make sure that MCNP only ran on our reserved servers and not on any of the rest of the system.
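The "script hacking" will depend on the MPI stack, but with Open MPI one common way to pin a job to reserved machines is an explicit hostfile. A sketch only; the node names and slot count here are hypothetical:

```python
def write_hostfile(path, nodes, slots_per_node):
    """Write an Open MPI hostfile: one '<hostname> slots=<n>' line per node."""
    with open(path, "w") as f:
        for node in nodes:
            f.write(f"{node} slots={slots_per_node}\n")

# Hypothetical reserved servers, with nothing else scheduled on them:
write_hostfile("mcnp_hosts", ["node01", "node02"], 32)

# mpirun then places ranks only on the listed hosts:
launch = ["mpirun", "--hostfile", "mcnp_hosts", "-np", "50",
          "mcnp6.mpi", "i=inp_run1"]
```

Keeping the hostfile to machines no one else can log into reproduces the "reserved servers" setup described above.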
 
