Ubuntu 16.04: Slurm srun does not work with Intel MPI?

I am trying to set up Slurm on a working cluster that runs Ubuntu 16.04.

I am using Intel MPI, and it is installed on the head node in /opt/intel/impi_5.01.

According to the Slurm MPI guide, I should export a variable that points to libpmi.so: https://slurm.schedmd.com/mpi_guide.html#intel_mpi

However, I installed slurm-llnl from the Ubuntu packages,

sudo apt-get install slurm-llnl

so I am not sure where libpmi.so is located. I searched and found the file here. Is this the file I am looking for?

/usr/lib/x86_64-linux-gnu/libpmi.so
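For reference, the Intel MPI section of the linked Slurm guide names the variable I_MPI_PMI_LIBRARY. A minimal sketch of the export, assuming the path found above is the right library; set_pmi_library is a hypothetical helper, not part of Slurm or Intel MPI:

```shell
# Hypothetical helper: export I_MPI_PMI_LIBRARY only if the given file exists.
set_pmi_library() {
    lib="$1"
    if [ -f "$lib" ]; then
        export I_MPI_PMI_LIBRARY="$lib"
        echo "I_MPI_PMI_LIBRARY=$I_MPI_PMI_LIBRARY"
    else
        echo "libpmi.so not found at $lib" >&2
        return 1
    fi
}

# Path reported by the search above; adjust it if your system differs.
set_pmi_library /usr/lib/x86_64-linux-gnu/libpmi.so || true
```

The export has to be in effect in the shell (or batch script) that calls srun, since the launched tasks inherit the environment from there.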

Anyway, I exported the variable and tried

srun -p old -N3 -n24 hostname

It returns:

rolly@head:~$ srun -p old -N3 -n24 hostname
node02
node02
node02
node02
node02
node02
node02
node02
node01
node01
head
head
node01
head
head
head
node01
node01
head
node01
head
head
node01
node01

This seems to work.
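In the output above, the 24 tasks are spread evenly, eight per node. A small sketch for checking that without counting by hand; count_hosts is a hypothetical helper that tallies the hostnames it reads on standard input:

```shell
# Hypothetical helper: count how many lines (tasks) each hostname produced.
count_hosts() {
    awk '{ n[$1]++ } END { for (h in n) print h, n[h] }' | sort
}

# Example with Slurm available: srun -p old -N3 -n24 hostname | count_hosts
```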

But when I run my actual job,

srun -p old -N3 -n24 ~/QE530-CPU/espresso-5.3.0/bin/pw.x

it produces these errors:

mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)

I believe the error messages come from mpiexec being run with Intel MPI; it should use mpirun instead.

How can I fix the problem?

Thanks!

asked 14 January 2017 at 23:38

1 answer

I found my own solution.

1) sudo apt-get install mpich

2) srun --mpi=pmi2

3) Make sure MKL and the Intel-related environment variables are loaded correctly.
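The steps above could be sketched as follows. This reuses the partition, node count, and binary path from the question; run_pw and the DRY_RUN switch are hypothetical names introduced for illustration only:

```shell
# Step 1, run once per node (not executed here): sudo apt-get install mpich

# Hypothetical wrapper around the srun call from the question, with the
# PMI2 plugin selected. Set DRY_RUN=1 to print the command instead of
# launching it.
run_pw() {
    set -- srun --mpi=pmi2 -p old -N3 -n24 \
        "$HOME/QE530-CPU/espresso-5.3.0/bin/pw.x" "$@"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Calling run_pw with DRY_RUN=1 set shows the exact command line before anything is submitted. On a machine with Slurm installed, srun --mpi=list reports which MPI plugin types are available, which is a quick way to confirm that pmi2 is among them.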

I hope this helps someone with a similar problem.

answered 7 November 2019 at 04:00
