OpenMPI perl script will not start MPI

mathog
Posts: 258
Joined: 2008/07/09 23:52:06

OpenMPI perl script will not start MPI

Post by mathog » 2017/09/12 23:06:14

CentOS 6.7 (Final), with the RPMs openmpi-1.8-1.8.1-5.el6.x86_64 and openmpi-1.8-devel-1.8.1-5.el6.x86_64 installed.

A program called "maker" was built against this MPI. At least, that is what I believe I told it to do, like this:

Code: Select all

perl Build.PL 2>&1 | tee ../build_pl.log
#query:prompt relative to MPI installation
#mpi query: Y
#specify path to mpicc: /usr/lib64/openmpi/bin/mpicc
#specify path to mpi.h: /usr/include/openmpi-x86_64
./Build install 2>&1 | tee maker_install.log
When an attempt is made to run maker:

Code: Select all

module add  openmpi-x86_64
echo $LD_LIBRARY_PATH
/usr/lib64/openmpi/lib
echo $PATH
/usr/lib64/openmpi/bin:/home/mathog/perl5/perlbrew/bin:/home/mathog/perl5/perlbrew/perls/perl-5.20.0t/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/mathog/bin
nice /usr/lib64/openmpi/bin/mpiexec --prefix /usr/lib64/openmpi -n 1 \
  /home/mathog/src/maker/bin/maker \
   </dev/null >try_maker_1.log 2>&1 &
It does this (with or without the --prefix):

Code: Select all

[machinename:70819] mca: base: component_find: unable to open /usr/lib64/openmpi/lib/openmpi/mca_shmem_mmap: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[machinename:70819] mca: base: component_find: unable to open /usr/lib64/openmpi/lib/openmpi/mca_shmem_posix: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[machinename:70819] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[machinename:70819] mca: base: component_find: unable to open /usr/lib64/openmpi/lib/openmpi/mca_shmem_sysv: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_shmem_base_select failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[10350,1],0]
  Exit code:    1
--------------------------------------------------------------------------
There is another MPI installed in /usr/local. I did not put it there and cannot remove it (lest it break something else).
The components it says it cannot open do seem to be present:

Code: Select all

ls -al /usr/lib64/openmpi/lib/openmpi/mca_shmem*
-rwxr-xr-x 1 root root 12352 May 11  2016 /usr/lib64/openmpi/lib/openmpi/mca_shmem_mmap.so
-rwxr-xr-x 1 root root 11304 May 11  2016 /usr/lib64/openmpi/lib/openmpi/mca_shmem_posix.so
-rwxr-xr-x 1 root root  8832 May 11  2016 /usr/lib64/openmpi/lib/openmpi/mca_shmem_sysv.so
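One thing worth checking, given the "compiled for a different version of Open MPI" wording, is whether those plugins actually resolve their symbols against the libraries on the current search path, and whether the stray MPI under /usr/local is getting in the way. Something like this (the /usr/local library names below are a guess):

Code: Select all

module add openmpi-x86_64
# confirm which launcher and wrapper compiler win on PATH
which mpiexec mpicc
# run the dynamic linker over each shmem plugin and flag unresolved symbols
for f in /usr/lib64/openmpi/lib/openmpi/mca_shmem_*.so ; do
    echo "== $f"
    ldd -r "$f" 2>&1 | grep -E -i 'libmpi|libopen|not found|undefined'
done
# see whether the other MPI in /usr/local ships libraries that could shadow these
ls /usr/local/lib/libmpi* /usr/local/lib/libopen-* 2>/dev/null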
maker itself is a perl script, and it runs on a multithreaded version of perl 5.20 configured by perlbrew. That perl
binary is also the one referenced on the first line of the script. With some DEBUG print statements, the source of all those error messages was tracked down to this line:

Code: Select all

MPI_Init();
which was preceded many lines above by:

Code: Select all

use Process::MpiChunk;
use Process::MpiTiers;
use Parallel::Application::MPI qw(:all);
The odd thing is that, as far as I can tell, those modules are not installed in perl itself; they appear to be part of maker. MPI_Init()
definitely is: it is in ~/src/maker/perl/lib/Parallel/Application/MPI.pm and consists of:

Code: Select all

sub MPI_Init {
    my $stat = 0;
    if($$ != 0 && !$INITIALIZED && _load()){
        # allow signals to interrupt blocked MPI calls
        UNSAFE_SIGNALS {
            $stat = C_MPI_Init();
        };
        $INITIALIZED = 1;
    }
    return $stat;
}
The problem was traced to C_MPI_Init(), which never returns. Elsewhere in MPI.pm one finds:

Code: Select all

    eval{
	#this comment is just a way to force Inline::C to recompile on changing MPICC and MPIDIR
	my $comment = "void _comment() {\nchar comment[] = \"MPICC=$mpicc, MPIDIR=$mpidir, CCFLAGSEX=$extra\";\n}\n"; 
	Inline->bind(C => $CODE . $comment,
		     NAME => 'Parallel::Application::MPI',
		     DIRECTORY => $loc,
		     CC => $mpicc,
		     LD => $mpicc,
		     CCFLAGSEX => $extra,
		     INC => '-I'.$mpidir,);
    };
and DEBUG statements inserted before the bind show that $mpicc and $mpidir point to /usr/lib64/openmpi/bin/mpicc and
/usr/include/openmpi-x86_64, which is correct as far as I can tell.
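One more sanity check might be to ask the wrapper compiler directly what it compiles and links with (Open MPI's wrappers accept -showme), and then run ldd over whatever shared object Inline::C produced, to see which libmpi it ended up bound to. The find pattern below is a guess, since DIRECTORY => $loc is chosen by maker:

Code: Select all

# what the wrapper adds at compile and link time (Open MPI wrapper option)
/usr/lib64/openmpi/bin/mpicc -showme:compile
/usr/lib64/openmpi/bin/mpicc -showme:link
# locate the shared object Inline::C built for Parallel::Application::MPI
# (its location depends on $loc, so this path pattern is a guess)
find ~ -name 'MPI.so' -path '*Parallel*' 2>/dev/null | while read so ; do
    echo "== $so"
    ldd "$so" | grep -i mpi
done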

Any suggestions on what is going wrong here and how to fix it?

mathog
Posts: 258
Joined: 2008/07/09 23:52:06

Re: OpenMPI perl script will not start MPI

Post by mathog » 2017/09/13 00:58:42

Tried again on a different machine. This environment differed in that perl 5.20 was not compiled with threads and it was CentOS 6.9
instead of 6.7. The exact same results were obtained, apart from the process IDs and time stamps.

mathog
Posts: 258
Joined: 2008/07/09 23:52:06

Re: OpenMPI perl script will not start MPI

Post by mathog » 2017/09/13 16:55:05

Tested the installation with an mpi_hello_world from
http://mpitutorial.com/tutorials/mpi-hello-world/
and it ran normally with

module add openmpi-x86_64
mpicc -o mpi_hello_world mpi_hello_world.c
mpirun -n 28 ./mpi_hello_world

I then modified that program to spin in a loop for about a minute:

Code: Select all

diff -u mpi_hello_world.c mpi_hello_world_spin.c
--- mpi_hello_world.c   2017-09-13 09:39:07.245510349 -0700
+++ mpi_hello_world_spin.c      2017-09-13 09:49:12.744755640 -0700
@@ -17,11 +17,17 @@
     char processor_name[MPI_MAX_PROCESSOR_NAME];
     int name_len;
     MPI_Get_processor_name(processor_name, &name_len);
+    
+    long long i;
+    unsigned long long sum=0;
+    for (i=0;i<20000000000;i++){
+       sum+=i;
+    }
 
     // Print off a hello world message
     printf("Hello world from processor %s, rank %d"
-           " out of %d processors\n",
-           processor_name, world_rank, world_size);
+           " out of %d processors sum %llu\n",
+           processor_name, world_rank, world_size, sum);
 
     // Finalize the MPI environment.
     MPI_Finalize();


and verified with "top" that it was actually using CPU time on the appropriate number of processors.
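
Another quick way to confirm that the Open MPI install itself is healthy would be to check that ompi_info lists the shmem components the maker run complained about, something like:

Code: Select all

module add openmpi-x86_64
# the shmem components from the earlier error should all show up here
/usr/lib64/openmpi/bin/ompi_info | grep -i shmem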

So the issue must be with the perl program, not OpenMPI or CentOS. Will pursue this issue further there.

mathog
Posts: 258
Joined: 2008/07/09 23:52:06

Re: OpenMPI perl script will not start MPI

Post by mathog » 2017/09/13 17:32:18

mathog wrote:Tested the installation with an mpi_hello_world from
http://mpitutorial.com/tutorials/mpi-hello-world/
and it ran normally [...] and verified with "top" that it was actually using CPU time on the appropriate number of processors.

I also added as the last line of .bash_profile

Code: Select all

module add openmpi-x86_64
That made no difference whatsoever in the maker error messages.

So the issue must be with the perl program, not OpenMPI or CentOS. Will pursue this issue further there.

mathog
Posts: 258
Joined: 2008/07/09 23:52:06

Re: OpenMPI perl script will not start MPI

Post by mathog » 2017/09/13 17:33:48

mathog wrote:
mathog wrote:Tested the installation with an mpi_hello_world from
http://mpitutorial.com/tutorials/mpi-hello-world/
and it ran normally [...] and verified with "top" that it was actually using CPU time on the appropriate number of processors.

I also added as the last line of .bash_profile

Code: Select all

module add openmpi-x86_64
That made no difference whatsoever in the maker error messages. I verified that LD_LIBRARY_PATH=/usr/lib64/openmpi/lib
was set and that /usr/lib64/openmpi/bin was the first entry in PATH.

So the issue must be with the perl program, not OpenMPI or CentOS. Will pursue this issue further there.

mathog
Posts: 258
Joined: 2008/07/09 23:52:06

Re: OpenMPI perl script will not start MPI

Post by mathog » 2017/09/13 23:20:24

The problem was that setting LD_LIBRARY_PATH was not sufficient: LD_PRELOAD also had to be set to the full path of libmpi.so. Something about the Perl Inline::C module, or whatever that module builds during installation, apparently requires this. The installation notes for the software said as much, but the wrong LD_* variable got stuck in my head.
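
For the record, the working invocation presumably ends up looking roughly like this (the libmpi.so path is the one the -devel RPM installs under /usr/lib64/openmpi/lib; adjust if yours differs):

Code: Select all

module add openmpi-x86_64
# LD_LIBRARY_PATH alone is not enough here; libmpi has to be preloaded so the
# Inline::C-built Parallel::Application::MPI code can bring up MPI
export LD_PRELOAD=/usr/lib64/openmpi/lib/libmpi.so
nice /usr/lib64/openmpi/bin/mpiexec -n 1 \
  /home/mathog/src/maker/bin/maker \
  </dev/null >try_maker_1.log 2>&1 &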
