I am trying to set up Slurm on a cluster running Ubuntu 16.04.
I am using Intel MPI, and its installation directory on the head node is /opt/intel/impi_5.01.
According to the Slurm MPI guide, an environment variable pointing to libpmi.so should be exported: https://slurm.schedmd.com/mpi_guide.html#intel_mpi
However, I installed slurm-llnl from the Ubuntu package repository with
sudo apt-get install slurm-llnl
and I am not sure where libpmi.so is located. I searched and found the file here. Is this the file I am looking for?
/usr/lib/x86_64-linux-gnu/libpmi.so
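For anyone reproducing this: the Intel MPI section of the guide linked above names the variable I_MPI_PMI_LIBRARY. Below is a minimal sketch of the export, assuming the path found by the search is the right library. The helper name set_pmi_lib is made up for illustration and is not part of Slurm or Intel MPI.

```shell
# Hypothetical helper: export I_MPI_PMI_LIBRARY only if the file really exists.
set_pmi_lib() {
    lib="$1"
    if [ -f "$lib" ]; then
        export I_MPI_PMI_LIBRARY="$lib"
        echo "I_MPI_PMI_LIBRARY=$I_MPI_PMI_LIBRARY"
    else
        echo "libpmi.so not found at $lib" >&2
        return 1
    fi
}

# Path reported by the search above; adjust it if the package puts the file elsewhere.
set_pmi_lib /usr/lib/x86_64-linux-gnu/libpmi.so || true
```

Checking that the file exists before exporting avoids silently pointing Intel MPI at a missing library.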
In any case, I exported the variable and tried
srun -p old -N3 -n24 hostname
This returns:
rolly@head:~$ srun -p old -N3 -n24 hostname
node02
node02
node02
node02
node02
node02
node02
node02
node01
node01
head
head
node01
head
head
head
node01
node01
head
node01
head
head
node01
node01
This seems to work.
But when I run my actual job,
srun -p old -N3 -n24 ~/QE530-CPU/espresso-5.3.0/bin/pw.x
it produces these errors:
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
I believe these errors appear because mpiexec is being executed with Intel MPI, when mpirun should be used instead.
How can I fix the problem?
Thanks!
I found my own solution:
1) sudo apt-get install mpich
2) srun --mpi=pmi2
3) Check that MKL and the Intel-related environment variables are loaded correctly.
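Step 2 applied to the job from the question might look like the sketch below. The wrapper name run_pmi2 and the DRY_RUN switch are made up for illustration. The partition, node count, and task count are simply the ones from the failing run, and I am assuming the pmi2 plugin is available in the packaged Slurm build (srun --mpi=list shows the supported types).

```shell
# Hypothetical wrapper around step 2: always launch through Slurm's PMI2 plugin.
# Set DRY_RUN=1 to print the command instead of submitting it.
run_pmi2() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo srun --mpi=pmi2 "$@"
    else
        srun --mpi=pmi2 "$@"
    fi
}

# Same partition, node count, and task count as the failing run in the question.
DRY_RUN=1 run_pmi2 -p old -N3 -n24 "$HOME/QE530-CPU/espresso-5.3.0/bin/pw.x"
```

Drop DRY_RUN=1 to actually submit the job.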
I hope this helps someone with a similar problem.