MCNP6 with MPI failed with signal 11 (Segmentation fault)


Discussion Overview

The discussion revolves around troubleshooting a segmentation fault encountered while running MCNP6 with MPI. Participants explore potential causes, resource management issues, and input file considerations related to memory usage and process management.

Discussion Character

  • Technical explanation
  • Debate/contested
  • Experimental/applied

Main Points Raised

  • One participant reports a segmentation fault during execution, noting that the input file runs normally outside of the MPI context and that there is ample memory available.
  • Another participant suggests that the segmentation fault could stem from various causes and requests a simplified version of the input file for further analysis. They also question the appropriateness of using 50 processes given the machine's core count.
  • A later reply mentions that the issue may not be related to the input file but rather to resource reservation and management, suggesting that MCNP may fail to secure necessary resources if other processes are running concurrently.
  • One participant shares their experience with similar issues on a UNIX system, indicating that resource contention could lead to failures during synchronization of data batches, and describes a solution involving reserving entire servers for exclusive use by MCNP.

Areas of Agreement / Disagreement

Participants express varying opinions on the causes of the segmentation fault, with no consensus on a single solution or definitive cause. Multiple competing views on resource management and input file relevance remain present.

Contextual Notes

Participants highlight potential limitations related to resource allocation and management, as well as the complexity of diagnosing segmentation faults in MPI environments. Specific details about system configurations and input file characteristics are noted but not resolved.

Albert ZHANG
I use Python scripts to run mcnp.mpi, like
mpirun -np 50 i=inp_...
and I encountered this error report:
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun noticed that process rank 31 with PID 0 on node Ubuntu exited on signal 11 (Segmentation fault).
The script had run normally for a few hours. I extracted the inp file and it runs normally on its own.
I searched on the Internet and it seems to be a memory-related problem, but I checked the log and there is still 100+ GB available, so I don't really know how to solve it.
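
For context, here is a minimal Python sketch of the kind of wrapper script being described, launching MCNP under mpirun via subprocess and surfacing the exit code. The executable name, file names, and process count are assumptions, not the poster's actual script:

import subprocess

# Hypothetical sketch of a wrapper script like the one described above; the
# executable name, input/output file names, and process count are assumptions.
cmd = ["mpirun", "-np", "50", "mcnp.mpi", "i=inp_case", "o=out_case"]

result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
    # A rank killed by SIGSEGV surfaces here as a non-zero exit code.
    print("MCNP run failed with exit code", result.returncode)
    print(result.stderr)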
 
So many things cause MCNP to segfault that it's difficult to say. Can you share the input file, or a cut-down version of the input file that causes the same error? A PID of 0 seems weird; a quick Google says that is the paging process. Are you doing any big mesh tallies?

The only other thing I can think of is that 50 seems like a lot of copies. How many cores does the machine have?
 
Alex A said:
So many things cause MCNP to segfault that it's difficult to say. Can you share the input file, or a cut-down version of the input file that causes the same error? A PID of 0 seems weird; a quick Google says that is the paging process. Are you doing any big mesh tallies?

The only other thing I can think of is that 50 seems like a lot of copies. How many cores does the machine have?
Thanks for your reply, Alex. My script updates the input file, so it may not be a problem with the input file itself. I searched some other results and found that PRDMP may help, but I am still not sure. The input file is focused on radiation shielding, so I didn't set up any mesh tallies.
I run MCNP on a server with 32 cores and 128 processors.
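
For anyone hitting the same question about -np versus core count, here is a small Python sanity-check sketch; the requested value of 50 comes from this thread, but treat the check itself as an illustrative assumption rather than a fix:

import os

# Quick sanity check, not a fix: compare the requested number of MPI ranks
# with the logical processors the operating system reports.
requested = 50                # value passed to mpirun -np
available = os.cpu_count()    # logical CPUs visible to the OS

if available is not None and requested > available:
    print(f"Asking for {requested} ranks on {available} logical CPUs; "
          "consider lowering -np to avoid oversubscription.")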
 
Sorry to do the old post thing. Maybe it helps somebody else.

I experienced similar things on a UNIX system running MCNP with MPI. The problem turned out to be that MCNP is not very clever about reserving the resources it needs: memory, connections between nodes, and such. So if some other process started and took one of those resources, MCNP might look for it and fail to get it. And this could happen after many particles had already been run, for example when nodes did a sync-up with their data batches and attempted to start a new batch of particles.

If that's the problem, how you fix it may be dependent on details of your system.

We managed to solve the problem by reserving entire servers for MCNP, and not letting anything else run on those servers. Literally nothing else, not even the sys-admin logging in, was permitted on those servers during our runs. And we had to do some script hacking to make sure that MCNP only ran on our reserved servers and not on any of the rest of the system.
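
For anyone trying to reproduce that kind of setup, one possible (hypothetical) way to pin an MPI run to reserved servers is an Open MPI hostfile; the hostnames, slot counts, process count, and file names below are illustrative only, and the exact flags depend on the MPI implementation and scheduler in use:

import subprocess

# One possible form of that "script hacking": write an Open MPI hostfile naming
# only the reserved servers, then launch MCNP restricted to them. Hostnames,
# slot counts, and file names here are hypothetical.
reserved = {"node01": 16, "node02": 16}

with open("mcnp_hosts", "w") as f:
    for host, slots in reserved.items():
        f.write(f"{host} slots={slots}\n")

cmd = ["mpirun", "--hostfile", "mcnp_hosts", "-np", "32",
       "mcnp.mpi", "i=inp_case", "o=out_case"]
subprocess.run(cmd, check=True)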
 
