MCNP6 with mpi failed with signal 11 (Segmentation fault)

AI Thread Summary
The user encountered a segmentation fault while running MCNP6 with MPI, specifically noting that one process returned a non-zero exit code after several hours of normal operation. Despite having ample memory available, the issue may stem from resource reservation conflicts, as MCNP can struggle to secure the necessary resources when other processes are active. Suggestions include sharing a simplified input file for troubleshooting and considering the high number of processes (50) in relation to the server's 32 cores. Another user resolved similar issues by reserving entire servers for MCNP runs, ensuring no other processes interfered. This approach highlights the importance of resource management when running computationally intensive simulations.
Albert ZHANG
I use Python scripts to run mcnp.mpi, like:
mpirun -np 50 mcnp.mpi i=inp_...
And I encountered this error report:
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 31 with PID 0 on node Ubuntu exited on signal 11 (Segmentation fault).
The script had been running normally for a few hours before the crash. I extracted the input file, and it runs normally on its own.
I searched on the Internet and found it seems to be a memory-related problem, but I checked the log and there's still 100+ GB available. So I don't really know how to solve the problem.
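For reference, the launcher is roughly like this (a simplified sketch; "mcnp.mpi" and "inp_case" stand in for the real names, and the real script edits the input file first):

import subprocess

# Simplified sketch of the launcher; the real script updates the
# input file before each run. Names here are placeholders.
cmd = ["mpirun", "-np", "50", "mcnp.mpi", "i=inp_case"]
result = subprocess.run(cmd, capture_output=True, text=True)

# Record the exit status and stderr so a crash like the one above
# at least leaves a trace of which rank failed and why.
print("exit code:", result.returncode)
if result.returncode != 0:
    print(result.stderr)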
 
So many things can cause MCNP to segfault that it's difficult to say. Can you share the input file, or a cut-down version of it that produces the same error? A PID of 0 seems weird; a quick Google search says that is the paging process. Are you doing any big mesh tallies?

The only other thing I can think of is that 50 seems like a lot of copies. How many cores does the machine have?
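As a quick sanity check (just a sketch, nothing MCNP-specific), you could size -np from the actual core count rather than hard-coding 50:

import os

# Size the MPI job from the machine instead of hard-coding -np 50.
# MCNP's MPI runs use a master rank that mostly coordinates, so
# whether to count it against the cores is worth checking for your
# version.
ncores = os.cpu_count() or 1
nprocs = max(1, ncores - 1)  # leave headroom for the master/OS
print(f"mpirun -np {nprocs} mcnp.mpi i=inp_case")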
 
Thanks for your reply, Alex. My script updates the input file between runs, and the extracted file runs fine on its own, so it may not be a problem with the input file. I searched some other results and found that the PRDMP card may help, but I am still not sure. The input file is a radiation shielding problem, so I didn't set up any mesh tallies.
I run MCNP on a server; it has 32 cores and 128 processors.
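For anyone searching later: the PRDMP card takes five entries (NDP NDM MCT NDMP DMMP). A tentative example that just limits how many dumps accumulate on the RUNTPE file; the entries below are illustrative, not a confirmed fix for this crash:

c  PRDMP  NDP  NDM  MCT  NDMP  DMMP
c  j = keep the default print/dump increments; 1 = write a MCTAL
c  file; 2 = retain only the last two dumps on the RUNTPE file.
PRDMP  j  j  1  2  j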
 
Sorry to do the old post thing. Maybe it helps somebody else.

I experienced something similar on a UNIX system running MCNP with MPI. The problem turned out to be that MCNP is not very clever about reserving the resources it needs: memory, connections between nodes, and such. So if some other process started up and took one of those resources, MCNP might go looking for it and fail to get it. And this could happen after many particles had already been run, for example when the nodes did a sync-up with their data batches and attempted to start a new batch of particles.

If that's the problem, how you fix it will depend on the details of your system.

We managed to solve it by reserving entire servers for MCNP and not letting anything else run on them. Literally nothing else, not even the sysadmin logging in, was permitted on those servers during our runs. And we had to do some script hacking to make sure that MCNP ran only on our reserved servers and not on any of the rest of the system.
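The script hacking mostly amounted to pinning the job to the reserved machines. With Open MPI (an assumption; your MPI may differ) that can be done with a hostfile; host names and slot counts below are placeholders:

import subprocess

# Write a hostfile naming only the reserved servers (placeholder
# names), then launch MCNP restricted to those hosts. Assumes Open
# MPI's --hostfile option and its "slots=" hostfile syntax.
with open("mcnp_hosts", "w") as f:
    f.write("node01 slots=32\n")
    f.write("node02 slots=32\n")

subprocess.run(
    ["mpirun", "--hostfile", "mcnp_hosts", "-np", "32",
     "mcnp.mpi", "i=shield_inp"],
    check=True,
)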
 