OpenMPI perl script will not start MPI
Posted: 2017/09/12 23:06:14
CentOS 6.7 (Final), RPMs installed: openmpi-1.8-1.8.1-5.el6.x86_64 and openmpi-1.8-devel-1.8.1-5.el6.x86_64
A program called "maker" was built using this MPI. At least that is what I believe I told it to do like this:
When an attempt is made to run maker
It does this (with or without the --prefix):
There is another mpi installed in /usr/local. I didn't put it there and cannot remove it (lest it break something else).
The things it says it cannot find seem to be present:
maker itself is a perl script, and it runs on a multithreaded version of perl 5.20 which is configured by perlbrew. That perl
binary is also the one referenced on the first line of the script. With some DEBUG print statements the source all all those error messages was tracked down to this line:
which was preceded many lines above by:
The odd thing is that as far as I can tell those pieces are not installed in perl, they may be part of Maker. MPI_Init()
definitely is, it is in ~/src/maker/perl/lib/Parallel/Application/MPI.pm and consists of:
The problem is traced to C_MPI_Init(), which never returns. Elsewhere in MPI.pm one finds:
and DEBUG statements entered before the "bind" show that mpicc and mpidir point to /usr/lib64/openmpi/bin/mpicc and
/usr/include/openmpi-x86_64, which is correct, as far as I can tell.
Any suggestions on what is going wrong here and how to fix it?
A program called "maker" was built using this MPI. At least that is what I believe I told it to do like this:
Code:
perl Build.PL 2>&1 | tee ../build_pl.log
#prompts from Build.PL and the answers given regarding the MPI installation:
#mpi query: Y
#specify path to mpicc: /usr/lib64/openmpi/bin/mpicc
#specify path to mpi.h: /usr/include/openmpi-x86_64
./Build install 2>&1 | tee maker_install.log
When an attempt is made to run maker:
Code:
module add openmpi-x86_64
echo $LD_LIBRARY_PATH
/usr/lib64/openmpi/lib
echo $PATH
/usr/lib64/openmpi/bin:/home/mathog/perl5/perlbrew/bin:/home/mathog/perl5/perlbrew/perls/perl-5.20.0t/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/mathog/bin
nice /usr/lib64/openmpi/bin/mpiexec --prefix /usr/lib64/openmpi -n 1 \
/home/mathog/src/maker/bin/maker \
</dev/null >try_maker_1.log 2>&1 &
It does this (with or without the --prefix):
Code:
[machinename:70819] mca: base: component_find: unable to open /usr/lib64/openmpi/lib/openmpi/mca_shmem_mmap: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[machinename:70819] mca: base: component_find: unable to open /usr/lib64/openmpi/lib/openmpi/mca_shmem_posix: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
opal_init failed
--> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[machinename:70819] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[machinename:70819] mca: base: component_find: unable to open /usr/lib64/openmpi/lib/openmpi/mca_shmem_sysv: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
opal_shmem_base_select failed
--> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_mpi_init: ompi_rte_init failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[10350,1],0]
Exit code: 1
--------------------------------------------------------------------------
There is another MPI installed in /usr/local. I didn't put it there and cannot remove it (lest it break something else).
The things it says it cannot find seem to be present:
Code:
ls -al /usr/lib64/openmpi/lib/openmpi/mca_shmem*
-rwxr-xr-x 1 root root 12352 May 11 2016 /usr/lib64/openmpi/lib/openmpi/mca_shmem_mmap.so
-rwxr-xr-x 1 root root 11304 May 11 2016 /usr/lib64/openmpi/lib/openmpi/mca_shmem_posix.so
-rwxr-xr-x 1 root root 8832 May 11 2016 /usr/lib64/openmpi/lib/openmpi/mca_shmem_sysv.so
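In case it is relevant, here is one way to check whether those components resolve against the intended libraries rather than the other MPI under /usr/local (the grep patterns are just my guess at what would be suspicious):
Code:
# Check what each shmem component links against; libraries resolved from
# /usr/local, or left unresolved, would point at the conflicting MPI.
# (Some undefined symbols can be normal for plugins, so this is only a hint.)
for f in /usr/lib64/openmpi/lib/openmpi/mca_shmem_*.so ; do
    echo "== $f"
    ldd -r "$f" 2>&1 | grep -E 'undefined|not found|/usr/local'
done
# Also check whether a second mpiexec or mpicc shadows the intended one:
type -a mpiexec mpicc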
maker itself is a perl script, and it runs on a multithreaded version of perl 5.20, which is configured by perlbrew. That perl binary is also the one referenced on the first line of the script. With some DEBUG print statements, the source of all those error messages was tracked down to this line:
Code:
MPI_Init();
which was preceded many lines above by:
Code:
use Process::MpiChunk;
use Process::MpiTiers;
use Parallel::Application::MPI qw(:all);
The odd thing is that, as far as I can tell, those pieces are not installed in perl; they may be part of Maker. MPI_Init() definitely is part of Maker: it is in ~/src/maker/perl/lib/Parallel/Application/MPI.pm and consists of:
Code:
sub MPI_Init {
my $stat = 0;
if($$ != 0 && !$INITIALIZED && _load()){
# allow signals to interrupt blocked MPI calls
UNSAFE_SIGNALS {
$stat = C_MPI_Init();
};
$INITIALIZED = 1;
}
return $stat;
}
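(To double-check where perl actually loads those modules from, something like this should print the resolved path; the -I directory is taken from my tree and assumes the module can be loaded on its own:)
Code:
# Print the file Parallel::Application::MPI is loaded from; a path under
# ~/src/maker would confirm it ships with Maker rather than with perl.
perl -I ~/src/maker/perl/lib -MParallel::Application::MPI \
     -e 'print $INC{"Parallel/Application/MPI.pm"}, "\n"'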
The problem is traced to C_MPI_Init(), which never returns. Elsewhere in MPI.pm one finds:
Code:
eval{
#this comment is just a way to force Inline::C to recompile on changing MPICC and MPIDIR
my $comment = "void _comment() {\nchar comment[] = \"MPICC=$mpicc, MPIDIR=$mpidir, CCFLAGSEX=$extra\";\n}\n";
Inline->bind(C => $CODE . $comment,
NAME => 'Parallel::Application::MPI',
DIRECTORY => $loc,
CC => $mpicc,
LD => $mpicc,
CCFLAGSEX => $extra,
INC => '-I'.$mpidir,);
};
and DEBUG statements entered before the "bind" show that $mpicc and $mpidir point to /usr/lib64/openmpi/bin/mpicc and /usr/include/openmpi-x86_64, which is correct, as far as I can tell.
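For what it is worth, a bare C test compiled with the same mpicc should show whether MPI_Init() fails even with Perl and Inline::C out of the picture entirely; a minimal sketch:
Code:
# Minimal MPI program to separate the Open MPI install from the Perl layer.
cat > /tmp/mpi_hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* If this aborts with the same orte_init/opal_init errors, the
       problem is in the MPI installation, not in maker or Inline::C. */
    MPI_Init(&argc, &argv);
    printf("MPI_Init succeeded\n");
    MPI_Finalize();
    return 0;
}
EOF
/usr/lib64/openmpi/bin/mpicc -o /tmp/mpi_hello /tmp/mpi_hello.c
/usr/lib64/openmpi/bin/mpiexec -n 1 /tmp/mpi_hello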
Any suggestions on what is going wrong here and how to fix it?