NAME¶
cr_restart - restarts a process, process group, or session from a checkpoint
file.
SYNOPSIS¶
cr_restart [options] [checkpoint_file]
DESCRIPTION¶
cr_restart restarts a process (or set of processes) from a checkpoint file
created with cr_checkpoint(1).
A restarted process has all of the attributes it had at checkpoint time,
including its process id. If any resource needed by the processes in a
checkpoint file cannot be obtained (e.g., a pid is in use), cr_restart will
fail. If a process group or session is restarted, all parent/child relations,
pipes, etc., between the processes in the checkpoint will be correctly restored.
If the stdin/stdout/stderr of any restarted process was directed to a terminal
at checkpoint time, it is redirected to the controlling terminal of the
cr_restart program.
The current working directory of a restarted process is the same as when it was
checkpointed, regardless of where the context file is located, or where
cr_restart is invoked.
The cr_restart process becomes the parent of the 'eldest' process in any
restarted job. This means that getppid(2) may return a different value
to the eldest process after restart. When the eldest restarted process exits
(or dies from a signal), cr_restart will exit with the same exit code (or
kill itself with the same signal), so it is largely invisible. It is necessary
to keep cr_restart 'in between' your shell and the restarted processes,
however, as most Unix shells get quite confused if they observe their children
changing process ids.
Signals¶
By default, restarted processes begin to run after the restart is complete.
Alternatively, you may specify that they be stopped (via --stop), or
terminated/aborted/killed (via --term, --abort, or --kill). This is done by
sending the appropriate signal to every process that is part of the restart.
If the processes were stopped at the time the checkpoint was requested, then
--cont may be used to send SIGCONT to all processes after the restart is
completed.
Error handling¶
By default cr_restart will block until the restarted process has completed, and
will exit with the same exit value as the restarted process (even if the
restarted process died with a fatal signal). This can make it nearly
impossible to determine whether a non-zero exit from cr_restart is due to a
failure to restart or is the exit code of a correctly restarted process. The
simple approach of looking for 'Restart failed:' is not reliable. Therefore, the
--run-on-* family of flags is available to supply alternative (or
supplementary) error handling. When any of the --run-on-* flags is
passed, a hook is installed for the given category of failure (or success), as
defined below. When an error (or success) is detected and a corresponding hook
is installed, the hook is run via the system(3) function. If the exit
code of the hook is non-zero, then cr_restart returns this value, suppressing
any error message that would otherwise be generated. If no hook is installed,
if the hook is an empty string, or if the hook returns an exit code of zero,
then an explanatory error message is printed and an exit code related to the
errno value at the time of failure is returned.
- --run-on-success='cmd'
- Runs the given command as soon as the restarted process(es) are known to
be running. If the return value of 'cmd' is non-zero, this also results in
cr_restart terminating without waiting on termination of the restarted
process(es).
- --run-on-fail-args='cmd'
- Runs the given command if the arguments are invalid. This includes the
case in which the given context file is missing or unreadable.
- --run-on-fail-temp='cmd'
- Runs the given command if a "temporary" failure is detected.
This includes the case of a required pid being in use.
- --run-on-fail-perm='cmd'
- Runs the given command if a "permanent" failure is detected.
This is most commonly due to a corrupted context file.
- --run-on-fail-env='cmd'
- Runs the given command if an "environmental" failure is
detected. This includes when files required for restarting are missing or
inaccessible.
- --run-on-failure='cmd'
- This installs the given command for all of the --run-on-fail-*
hooks.
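One way to use these hooks from a wrapper script is sketched below in Python. It installs a failure hook that simply exits with a sentinel value, so that a failed restart can be told apart from the exit status of the restarted job. The sentinel 99 and the names build_restart_argv, classify and restart are illustrative assumptions, not part of cr_restart. The approach remains ambiguous if the restarted job can itself exit with the sentinel value.

```python
import subprocess

# Hypothetical sentinel: a non-zero value the restarted job is unlikely to use.
RESTART_FAILED = 99


def build_restart_argv(context_file, sentinel=RESTART_FAILED):
    """Build a cr_restart command line whose failure hook exits with `sentinel`."""
    # The hook is run via system(3), so the shell builtin `exit` can be used.
    return ["cr_restart", f"--run-on-failure=exit {sentinel}", context_file]


def classify(returncode, sentinel=RESTART_FAILED):
    """Interpret a subprocess-style return code obtained from cr_restart."""
    if returncode == sentinel:
        return "restart failed"
    if returncode < 0:
        # subprocess reports death by signal N as -N.
        return f"restarted job killed by signal {-returncode}"
    return f"restarted job exited with status {returncode}"


def restart(context_file):
    """Run cr_restart (which must be installed) and classify the outcome."""
    result = subprocess.run(build_restart_argv(context_file))
    return classify(result.returncode)
```

A caller would invoke restart() with the path of a context file and branch on the returned description.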
File relocation¶
By default, files and directories are saved `by reference', storing their full
pathname in the context file. This includes files associated with a process
via open(2) and/or mmap(2) and directories associated via opendir(3) or as the
current working directory. Use of --relocate oldpath=newpath allows remapping
of such paths to new locations at restart-time.
When parsing the --relocate argument the sequences `\=' and `\\' are
interpreted as `=' and `\', respectively, to allow for paths that contain the
`=' character. The `\' character is not special in any other context. (Note
that command shells also have special treatment of `\' and you may therefore
need quotes or additional `\' characters to pass the argument you intend.)
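The escaping rule above can be illustrated with a short Python sketch. It is not the parser used by cr_restart. The function name parse_relocate is hypothetical, and the sketch assumes that the argument is split at the first unescaped `=' and that any later `=' belongs to newpath, which the text above does not specify.

```python
def parse_relocate(arg):
    """Split 'oldpath=newpath' at the first unescaped '=' (assumed behavior)."""
    parts = [[]]
    i = 0
    while i < len(arg):
        ch = arg[i]
        # Only the sequences \= and \\ are escapes; any other backslash is literal.
        if ch == "\\" and i + 1 < len(arg) and arg[i + 1] in "=\\":
            parts[-1].append(arg[i + 1])
            i += 2
            continue
        if ch == "=" and len(parts) == 1:
            parts.append([])  # the first unescaped '=' separates old from new
        else:
            parts[-1].append(ch)
        i += 1
    if len(parts) != 2:
        raise ValueError("expected oldpath=newpath")
    return "".join(parts[0]), "".join(parts[1])
```

For example, the argument /data/a\=b=/data/c yields an oldpath of /data/a=b and a newpath of /data/c.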
When file or directory associations are restored, the oldpath is compared
to the saved fullpath of each file or directory. If it matches the leading
components of the path, the matching portion is replaced by the value of
newpath. Note that oldpath must match entire path components, and only
leading components. Therefore an oldpath of /tmp/foo will match /tmp/foo or
/tmp/foo/1, but will not match /tmp/fooz (not matching the full component
fooz) or /var/tmp/foo (not matching the leading component /var).
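The matching rule above can be expressed as a small Python sketch. This is an illustration of the described behavior, not the code used by cr_restart. The function name relocate and the newpath value /data are hypothetical, and ignoring a trailing `/' on oldpath is an assumption.

```python
def relocate(path, oldpath, newpath):
    """Return `path` with a leading `oldpath` replaced by `newpath`, or None."""
    old = oldpath.rstrip("/")  # assumption: a trailing slash is ignored
    if path == old:
        return newpath
    # Requiring the following "/" ensures that whole components are matched.
    if path.startswith(old + "/"):
        return newpath + path[len(old):]
    return None
```

With an oldpath of /tmp/foo and a newpath of /data, the path /tmp/foo/1 becomes /data/1, while /tmp/fooz and /var/tmp/foo are left unmatched.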
It is important to be aware that the saved fullpaths in a context file are the
canonical paths. Therefore the oldpath you provide must also be a
canonical path, though the newpath doesn't need to be. For instance, if
/tmp is a symbolic link to /var/tmp, then if your application
opens the file /tmp/work/1234 the path stored in the context file will
be /var/tmp/work/1234. Therefore,
--relocate /tmp/work=/tmp/play
would not work as desired, but either of the following would:
--relocate /var/tmp/work=/tmp/play
--relocate /var/tmp/work=/var/tmp/play
If the --relocate option is passed multiple times, all are applied to
restored file or directory associations, but only the first match is applied
to any given path. Currently a maximum of 16 relocations is supported.
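The first-match rule for multiple relocations can be sketched in Python as follows. This is an illustration only. The name apply_relocations and the example paths are hypothetical, and the sketch assumes that relocations are tried in the order in which they were given.

```python
MAX_RELOCATIONS = 16  # limit stated in the text above


def apply_relocations(path, relocations):
    """Apply the first matching (oldpath, newpath) pair to `path`, if any."""
    if len(relocations) > MAX_RELOCATIONS:
        raise ValueError("at most 16 relocations are supported")
    for oldpath, newpath in relocations:
        old = oldpath.rstrip("/")
        # Match whole leading path components only.
        if path == old or path.startswith(old + "/"):
            # Stop at the first match; later rules are not applied to the result.
            return newpath + path[len(old):]
    return path
```

Given the pairs (/var/tmp/work, /tmp/play) and (/var/tmp, /scratch) in that order, /var/tmp/work/1234 maps to /tmp/play/1234, while /var/tmp/other maps to /scratch/other.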
By default, processes are restarted with the same pid and thread id (as returned
by getpid(2) and gettid(2), respectively). This default ensures
that processes and threads that signal each other and processes that wait on
children will continue to function correctly. However, this prevents
restarting concurrent instances of the same context file.
By default, the process group and session (as returned by getpgrp(2) and
getsid(2)) are set to those of the cr_restart program. This ensures
that job control via the requester's session leader (typically a login shell)
will continue to function correctly. However, this interferes with any job
control or process group signaling that may take place among the restarted
processes.
There are options to individually control whether the pid, process group and
session are restored to their saved values or assume new values (the process
group and session inherited from cr_restart and a fresh pid obtained from
fork(2)). There is no separate control for the thread ids, as they must
always follow the same policy as the pid. The following describes each option,
along with outlining some of the risks associated with the non-default ones:
- --restore-pid
- (default) This causes pid and thread ids to be restored to their saved
values.
- --no-restore-pid
- This causes pid and thread ids to assume new values. Any multi-threaded
process has the possibility of using functions like tkill(2) which
will not behave as desired if the thread ids are not restored. Similarly,
any multi-process application may make use of kill(2) or
waitpid(2), among others, that require restored pids for correct
operation. It is also worth noting that many versions of glibc will cache
the result of getpid(), which may result in calls after restore returning
the original value, even though the pid was changed by the restart.
- --restore-pgid
- This causes the process group ids to be restored to their saved values.
This is required for correct operation of any multi-process application
that may perform signal or wait operations on process groups (as by
passing a negative pid value to kill(2) or waitpid(2), among
others), or which uses process groups for POSIX job control operations.
This is NOT the default behavior because restoring the process group ids
will prevent job control by the requester's shell (or other controlling
process).
- --no-restore-pgid
- (default) This causes the restarted processes to join the process group of
the cr_restart process.
- --restore-sid
- This causes the session ids to be restored to their saved values. This is
required, for instance, for systems that are performing batch accounting
based on the session id.
- --no-restore-sid
- (default) This causes the restarted processes to join the session of the
cr_restart process.
Note that use of --restore-pgid or --restore-sid will produce an
error in the case that the required identifiers are in use in the system. This
includes the possibility that they conflict with the process group or session
of cr_restart.
OPTIONS¶
General options:¶
- -?, --help
- print this help message.
- -v, --version
- print version information.
- -q, --quiet
- suppress error/warning messages to stderr.
Options for source location of the checkpoint:¶
- -d, --dir DIR
- checkpoint read from directory DIR, with one 'context.ID' file per process
(unimplemented).
- -f, --file FILE
- checkpoint read from FILE.
- -F, --fd FD
- checkpoint read from an open file descriptor.
- Options in this group are mutually exclusive. If no option is given from
this group, the default is to take the final argument as FILE.
Options for signal sent to process(es) after restart:¶
- --run
- no signal sent: continue execution (default).
- -S, --signal NUM
- signal NUM sent to all processes/threads.
- --stop
- SIGSTOP sent to all processes.
- --term
- SIGTERM sent to all processes.
- --abort
- SIGABRT sent to all processes.
- --kill
- SIGKILL sent to all processes.
- --cont
- SIGCONT sent to all processes.
- Options in this group are mutually exclusive. If more than one is given
then only the last will be honored.
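The "last one wins" rule for this group can be illustrated with a short Python sketch. It is not the option parser of cr_restart. The name effective_signal is hypothetical, and -S/--signal NUM is left out for brevity.

```python
import signal

# Signal sent by each option, as listed above; --run sends no signal.
SIGNAL_OPTIONS = {
    "--run": None,
    "--stop": signal.SIGSTOP,
    "--term": signal.SIGTERM,
    "--abort": signal.SIGABRT,
    "--kill": signal.SIGKILL,
    "--cont": signal.SIGCONT,
}


def effective_signal(argv):
    """Return the signal selected by the last option of this group, or None."""
    chosen = None  # default is --run: no signal is sent
    for arg in argv:
        if arg in SIGNAL_OPTIONS:
            chosen = SIGNAL_OPTIONS[arg]
    return chosen
```

Under this sketch, passing --stop followed by --kill selects SIGKILL, and a trailing --run cancels any earlier choice.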
Options for checkpoints of restarted process(es):¶
- --omit-maybe
- use a heuristic to omit cr_restart from checkpoints (default)
- --omit-always
- always omit cr_restart from checkpoints
- --omit-never
- never omit cr_restart from checkpoints
Options for alternate error handling:¶
- --run-on-success='cmd'
- run the given command on success
- --run-on-fail-args='cmd'
- run the given command on invalid arguments
- --run-on-fail-temp='cmd'
- run the given command on 'temporary' failure
- --run-on-fail-env='cmd'
- run the given command on 'environmental' failure
- --run-on-fail-perm='cmd'
- run the given command on 'permanent' failure
- --run-on-failure='cmd'
- run the given command on any failure
Options for relocation:¶
- --relocate OLDPATH=NEWPATH
- map paths of files and directories to new locations by prefix
replacement.
Options for restoring pid, process group and session ids:¶
- --restore-pid
- restore pids to saved values (default).
- --no-restore-pid
- restart with new pids.
- --restore-pgid
- restore pgid to saved values.
- --no-restore-pgid
- restart with new pgids (default).
- --restore-sid
- restore sid to saved values.
- --no-restore-sid
- restart with new sids (default).
- Options in each restore/no-restore pair are mutually exclusive. If both
are given then only the last will be honored.
Options for kernel log messages (default is --kmsg-error):¶
- --kmsg-none
- don't report any kernel messages.
- --kmsg-error
- on restart failure, report on stderr any kernel messages associated with
the restart request.
- --kmsg-warning
- report on stderr any kernel messages associated with the restart request,
regardless of success or failure. Messages generated in the absence of
failure are considered to be warnings.
- Options in this group are mutually exclusive. If more than one is given
then only the last will be honored. Note that --quiet suppresses
all stderr output, including these messages.
AUTHORS¶
Jason Duell, Paul Hargrove, and Eric Roman, Lawrence Berkeley National
Laboratory.
REPORTING BUGS¶
Bug reports may be filed on the web at http://mantis.lbl.gov/bugzilla.
SEE ALSO¶
cr_run(1),
cr_checkpoint(1),