.TH srun_cr "1" "Slurm Commands" "April 2015" "Slurm Commands"
.SH "NAME"
srun_cr \- run parallel jobs with checkpoint/restart support
.SH SYNOPSIS
\fBsrun_cr\fR [\fIOPTIONS\fR...]
.SH DESCRIPTION
The design of \fBsrun_cr\fR is inspired by \fBmpiexec_cr\fR from MVAPICH2 and \fBcr_restart\fR from BLCR.
It is a wrapper around the \fBsrun\fR command to enable batch job checkpoint/restart support when used with Slurm's \fBcheckpoint/blcr\fR plugin.
.SH "OPTIONS"
The \fBsrun_cr\fR execute line options are identical to those of the \fBsrun\fR command.
See "man srun" for details.
.SH "DETAILS"
After initialization, \fBsrun_cr\fR registers a thread context callback function.
Then it forks a process and executes "cr_run \-\-omit srun" with its arguments.
\fBcr_run\fR is employed to exclude the \fBsrun\fR process from being dumped upon checkpoint.
All catchable signals except SIGCHLD sent to \fBsrun_cr\fR will be forwarded to the child \fBsrun\fR process.
SIGCHLD will be captured to mimic the exit status of \fBsrun\fR when it exits.
Then \fBsrun_cr\fR loops waiting for termination of tasks being launched from \fBsrun\fR.
.LP
The step launch logic of Slurm is augmented to check if \fBsrun\fR is running under \fBsrun_cr\fR.
If so, the environment variable \fBSLURM_SRUN_CR_SOCKET\fR is present, the value of which is the address of a Unix domain socket created and listened on by \fBsrun_cr\fR.
After launching the tasks, \fBsrun\fR connects to the socket and sends the job ID, step ID and the nodes allocated to the step to \fBsrun_cr\fR.
.LP
Upon checkpoint, \fBsrun_cr\fR checks to see if the tasks have been launched.
If so, \fBsrun_cr\fR first forwards the checkpoint request to the tasks by calling the Slurm API \fBslurm_checkpoint_tasks()\fR before dumping its process context.
.LP
Upon restart, \fBsrun_cr\fR checks to see if the tasks have been previously launched and checkpointed.
If true, the environment variable \fBSLURM_RESTART_DIR\fR is set to the directory of the checkpoint image files of the tasks.
Then \fBsrun\fR is forked and executed again.
The environment variable will be used by the \fBsrun\fR command to restart execution of the tasks from the previous checkpoint.
.SH "COPYING"
Copyright (C) 2009 National University of Defense Technology, China.
Produced at National University of Defense Technology, China (cf, DISCLAIMER).
.LP
This file is part of Slurm, a resource management program.
For details, see .
.LP
Slurm is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version.
.LP
Slurm is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more
details.
.SH "SEE ALSO"
\fBsrun\fR(1)
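.SH "EXAMPLE"
The socket handshake described under DETAILS can be sketched as follows.
This is an illustrative Python sketch only, not the Slurm implementation.
The text wire format ("job_id step_id nodes") and the function names wrapper_listen, wrapper_receive, launcher_report and demo are assumptions made for the example; only the environment variable name is taken from this page.
.nf

```python
import os
import socket
import tempfile
import threading

# Variable name taken from the DETAILS section; everything else is hypothetical.
ENV_NAME = "SLURM_SRUN_CR_SOCKET"


def wrapper_listen(sock_path):
    """Wrapper side: create the Unix domain socket and start listening."""
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(sock_path)
    server.listen(1)
    return server


def wrapper_receive(server):
    """Wrapper side: accept one connection and parse the step record."""
    conn, _ = server.accept()
    with conn:
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # peer closed the connection: record is complete
                break
            chunks.append(data)
    # Assumed wire format: "job_id step_id nodes" as plain text.
    job_id, step_id, nodes = b"".join(chunks).decode().split(" ", 2)
    return int(job_id), int(step_id), nodes


def launcher_report(env, job_id, step_id, nodes):
    """Launcher side: report the step only when running under the wrapper."""
    sock_path = env.get(ENV_NAME)
    if sock_path is None:
        return False  # variable absent: not running under the wrapper
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(sock_path)
        client.sendall(f"{job_id} {step_id} {nodes}".encode())
    return True


def demo():
    """Run both sides in one process and return what the wrapper received."""
    with tempfile.TemporaryDirectory() as tmp:
        sock_path = os.path.join(tmp, "cr.sock")
        server = wrapper_listen(sock_path)
        env = {ENV_NAME: sock_path}
        result = {}
        receiver = threading.Thread(
            target=lambda: result.update(step=wrapper_receive(server))
        )
        receiver.start()
        launcher_report(env, 1234, 0, "node[01-04]")
        receiver.join()
        server.close()
        return result["step"]
```

.fi
Passing the socket address through the environment lets the launcher detect the wrapper without any extra command line option: when the variable is absent, launcher_report returns without contacting anything.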