NAME
srun_cr - run parallel jobs with checkpoint/restart support
SYNOPSIS
srun_cr [OPTIONS...]
DESCRIPTION
The design of srun_cr is inspired by mpiexec_cr from MVAPICH2 and cr_restart
from BLCR. It is a wrapper around the srun command that enables batch job
checkpoint/restart support when used with SLURM's checkpoint/blcr plugin.
OPTIONS
The srun_cr command-line options are identical to those of the srun command.
See "man srun" for details.
DETAILS
After initialization, srun_cr registers a thread context callback function. It
then forks a process and executes "cr_run --omit srun" with its arguments.
cr_run is used to exclude the srun process from being dumped upon checkpoint.
All catchable signals except SIGCHLD sent to srun_cr are forwarded to the child
srun process. SIGCHLD is caught so that srun_cr can mimic the exit status of
srun when srun exits. srun_cr then loops, waiting for the tasks launched by
srun to terminate.
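The forwarding behaviour described above can be sketched in Python. This is an illustration only, not the srun_cr implementation: the function name run_and_forward and the FORWARDED list are assumptions, only a handful of common signals are relayed rather than every catchable one, and Popen.wait() stands in for the SIGCHLD handling.

```python
import signal
import subprocess

# Assumption: a small sample of catchable signals; SIGCHLD is deliberately absent.
FORWARDED = (signal.SIGHUP, signal.SIGINT, signal.SIGQUIT, signal.SIGTERM,
             signal.SIGUSR1, signal.SIGUSR2)


def run_and_forward(argv):
    """Run argv as a child, relay signals to it, and return its exit status."""
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Relay the signal; ignore the race where the child has already exited.
        try:
            child.send_signal(signum)
        except ProcessLookupError:
            pass

    previous = {}
    for sig in FORWARDED:
        try:
            previous[sig] = signal.signal(sig, forward)
        except (OSError, ValueError):
            pass  # e.g. handlers can only be installed from the main thread

    try:
        # Collect the child's status instead of forwarding SIGCHLD.
        status = child.wait()
    finally:
        # Put back whatever handlers were installed before.
        for sig, handler in previous.items():
            if handler is not None:
                signal.signal(sig, handler)

    # Popen reports death by signal N as -N; shells conventionally use 128 + N.
    return status if status >= 0 else 128 - status
```

A caller would typically finish with sys.exit(run_and_forward(argv)) so that the wrapper's own exit status matches that of the wrapped command.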
The step launch logic of SLURM is augmented to check whether srun is running
under srun_cr. If so, the environment variable SLURM_SRUN_CR_SOCKET should be
present; its value is the address of a Unix domain socket created and listened
on by srun_cr. After launching the tasks, srun tries to connect to the socket
and sends the job ID, step ID and the nodes allocated to the step to srun_cr.
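A minimal Python sketch of this handshake follows. The wire format is not described here, so the newline-separated text message is purely an assumption, as are the function names and the sample job ID, step ID and node list. Only the environment variable name comes from the description above.

```python
import os
import socket
import tempfile

ENV_VAR = "SLURM_SRUN_CR_SOCKET"


def listen_for_step(path):
    """Wrapper side: create the Unix domain socket and start listening."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(path)
    server.listen(1)
    return server


def receive_step_info(server):
    """Wrapper side: accept one connection and decode the step details."""
    conn, _ = server.accept()
    with conn:
        data = b""
        while chunk := conn.recv(4096):
            data += chunk
    # Hypothetical message layout: job ID, step ID and node list, one per line.
    job_id, step_id, nodes = data.decode().split("\n")
    return int(job_id), int(step_id), nodes


def report_step(env, job_id, step_id, nodes):
    """Launcher side: report the step only when the variable is present."""
    path = env.get(ENV_VAR)
    if path is None:
        return False  # not running under the wrapper
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(path)
        client.sendall(f"{job_id}\n{step_id}\n{nodes}".encode())
    return True


def demo():
    """Run both sides in one process; the listen backlog holds the connection."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cr.sock")
        server = listen_for_step(path)
        try:
            report_step({ENV_VAR: path}, 1234, 0, "node[01-02]")
            return receive_step_info(server)
        finally:
            server.close()
```

Passing the socket address through the environment means the launcher needs no extra command-line option, and a launcher started without the wrapper simply skips the report.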
Upon checkpoint, srun_cr checks whether the tasks have been launched. If so,
srun_cr first forwards the checkpoint request to the tasks by calling the
SLURM API slurm_checkpoint_tasks() before dumping its process context.
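The ordering of the two checkpoint steps can be sketched in Python, assuming the request is forwarded only once the step details are known. The names on_checkpoint, forward_to_tasks and dump_self are hypothetical: the two callables are stand-ins for slurm_checkpoint_tasks() and for the dump of the wrapper's own process context, not real bindings.

```python
def on_checkpoint(step, forward_to_tasks, dump_self):
    """Checkpoint the tasks first (if any were launched), then the wrapper.

    step is None until the launcher has reported (job_id, step_id, nodes).
    """
    if step is not None:
        # Tasks exist, so their checkpoint is requested before anything else.
        forward_to_tasks(*step)
    # The wrapper's own process context is dumped last in either case.
    dump_self()
```

Forwarding first means the wrapper's image is written only after the task checkpoint has been requested, which matches the order given above.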
Upon restart, srun_cr checks whether the tasks were previously launched and
checkpointed. If so, the environment variable SLURM_RESTART_DIR is set to the
directory containing the checkpoint image files of the tasks. srun is then
forked and executed again, and uses the environment variable to restart
execution of the tasks from the previous checkpoint.
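The environment handling on restart can be sketched in Python as follows. The function name restart_environment and the idea of passing None when no task checkpoint exists are assumptions made for illustration; only the variable name SLURM_RESTART_DIR comes from the description above.

```python
def restart_environment(base_env, image_dir):
    """Return the environment for the re-executed launcher.

    image_dir is the directory holding the task checkpoint images, or None
    when the tasks were never launched and checkpointed.
    """
    env = dict(base_env)  # copy so the caller's environment is left untouched
    if image_dir is not None:
        env["SLURM_RESTART_DIR"] = image_dir
    return env
```

The resulting dictionary could then be passed as the env argument of subprocess.Popen when the launcher is started again, so the variable is visible only to that child.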
COPYING
Copyright (C) 2009 National University of Defense Technology, China. Produced at
National University of Defense Technology, China (cf, DISCLAIMER).
This file is part of SLURM, a resource management program. For details, see
<http://slurm.schedmd.com/>.
SLURM is free software; you can redistribute it and/or modify it under the terms
of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.
SLURM is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
A PARTICULAR PURPOSE. See the GNU General Public License for more details.
SEE ALSO
srun(1)