NAME¶
checkpoint - Sun Grid Engine checkpointing environment configuration file format
DESCRIPTION¶
Checkpointing is a facility to save the complete state of an executing program
or job and to restore and restart from this so-called checkpoint at a later
point in time if the original program or job was halted, e.g. through a system
crash.
Sun Grid Engine provides various levels of checkpointing support (see
sge_ckpt(1)). The checkpointing environment described here is a means to
configure the different types of checkpointing in use for your Sun Grid Engine
cluster or parts thereof. For that purpose you can define the operations to be
executed when initiating a checkpoint generation, migrating a checkpoint to
another host, or restarting a checkpointed application, as well as the list of
queues which are eligible for a checkpointing method.
Supporting different operating systems may force Sun Grid Engine to introduce
operating system dependencies into the checkpointing configuration file, and
updates of the supported operating system versions may lead to frequently
changing implementation details. Please refer to the <sge_root>/ckpt directory
for more information.
Please use the -ackpt, -dckpt, -mckpt or -sckpt options to the qconf(1) command
to manipulate checkpointing environments from the command line, or use the
corresponding qmon(1) dialogue for X-Windows based interactive configuration.
Note that Sun Grid Engine allows backslashes (\) to be used to escape newline
(\newline) characters. The backslash and the newline are replaced with a space
(" ") character before any interpretation.
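The continuation rule above can be sketched in a few lines of Python. This is
only an illustration of the stated substitution (backslash plus newline becomes
one space), not Sun Grid Engine's own parser. The function name
join_continuations, the script path and the --verbose argument are hypothetical.

```python
def join_continuations(text: str) -> str:
    """Replace every backslash-newline pair with a single space."""
    return text.replace("\\\n", " ")


# Hypothetical fragment of a checkpoint file with one continued line.
sample = "ckpt_command /path/to/ckpt_script \\\n--verbose\nwhen sx\n"

# After joining, ckpt_command and its argument sit on one logical line.
print(join_continuations(sample))
```

Because the substitution happens before any interpretation, a reader of the
file would apply a step like this first and only then look at the parameters.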
The format of a checkpoint file is defined as follows:
ckpt_name¶
The name of the checkpointing environment as defined for ckpt_name in
sge_types(1). To be used in the qsub(1) -ckpt switch or for the qconf(1)
options mentioned above.
interface¶
The type of checkpointing to be used. Currently, the following types are valid:
- hibernator
- The Hibernator kernel level checkpointing is
interfaced.
- cpr
- The SGI kernel level checkpointing is used.
- cray-ckpt
- The Cray kernel level checkpointing is assumed.
- transparent
- Sun Grid Engine assumes that the jobs submitted with
reference to this checkpointing interface use a checkpointing library such
as provided by the public domain package Condor.
- userdefined
- Sun Grid Engine assumes that the jobs submitted with
reference to this checkpointing interface perform their private
checkpointing method.
- application-level
- Uses all of the interface commands configured in the
checkpointing object, as with the kernel level checkpointing interfaces
(cpr, cray-ckpt, etc.), except for the restart_command (see below). The
restart_command is not used, even if it is configured; the job script is
invoked instead when the job is restarted.
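As a rough illustration of the list above, the following Python sketch checks
an interface value against the types named in this document and reports
whether restart_command is used. The names VALID_INTERFACES, check_interface
and uses_restart_command are hypothetical, and the exact-match, lower-case
comparison is an assumption; the text does not say how Sun Grid Engine itself
compares these names.

```python
from typing import Optional

# Interface types named in this document.
KERNEL_LEVEL = {"hibernator", "cpr", "cray-ckpt"}
VALID_INTERFACES = KERNEL_LEVEL | {"transparent", "userdefined", "application-level"}


def check_interface(name: str) -> str:
    """Return the name unchanged if it is a listed type, else raise ValueError."""
    if name not in VALID_INTERFACES:
        raise ValueError(f"unknown checkpointing interface: {name!r}")
    return name


def uses_restart_command(name: str) -> Optional[bool]:
    """True for kernel level types, False for application-level.

    Returns None for transparent and userdefined, because the text above
    does not state how a restart is handled for them.
    """
    check_interface(name)
    if name in KERNEL_LEVEL:
        return True
    if name == "application-level":
        return False
    return None
```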
ckpt_command¶
A command-line type command string to be executed by Sun Grid Engine in order to
initiate a checkpoint.
migr_command¶
A command-line type command string to be executed by Sun Grid Engine during a
migration of a checkpointing job from one host to another.
restart_command¶
A command-line type command string to be executed by Sun Grid Engine when
restarting a previously checkpointed application.
clean_command¶
A command-line type command string to be executed by Sun Grid Engine in order to
clean up after a checkpointed application has finished.
ckpt_dir¶
A file system location in which checkpoints of potentially considerable size
should be stored.
ckpt_signal¶
A Unix signal to be sent to a job by Sun Grid Engine to initiate a checkpoint
generation. The value for this field can either be a symbolic name from the
list produced by the -l option of the kill(1) command or an integer number
which must be a valid signal on the systems used for checkpointing.
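A candidate value for this field can be sanity-checked with Python's standard
signal module, as in the sketch below. The function name resolve_ckpt_signal is
hypothetical, and accepting names both with and without the SIG prefix is an
assumption made for convenience. The check only reflects the signals known on
the host where it runs, which is not necessarily the set available on every
system used for checkpointing.

```python
import signal


def resolve_ckpt_signal(value: str) -> signal.Signals:
    """Map a symbolic name (with or without 'SIG') or a number to a signal."""
    text = value.strip()
    if text.isdigit():
        # Raises ValueError if the number is not a valid signal on this host.
        return signal.Signals(int(text))
    name = text.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name
    try:
        return signal.Signals[name]
    except KeyError:
        raise ValueError(f"unknown signal name: {value!r}") from None
```

For example, resolve_ckpt_signal("TERM") and resolve_ckpt_signal("SIGTERM")
both return signal.SIGTERM, while an unknown name or an out-of-range number
raises ValueError.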
when¶
The points in time when checkpoints are expected to be generated. Valid values
for this parameter are composed of the letters s, m, x and r, in any
combination and without any separating characters. The same letters are allowed
for the -c option of the qsub(1) command, which overrides the definitions in
the checkpointing environment used. The meaning of the letters is defined as
follows:
- s
- A job is checkpointed, aborted and if possible migrated if
the corresponding sge_execd(8) is shut down on the job's machine.
- m
- Checkpoints are generated periodically at the
min_cpu_interval interval defined by the queue (see queue_conf(5)) in which
a job executes.
- x
- A job is checkpointed, aborted and if possible migrated as
soon as the job gets suspended (manually as well as automatically).
- r
- A job will be rescheduled (not checkpointed) when the host
on which the job currently runs goes into unknown state and the time
interval reschedule_unknown (see sge_conf(5)), defined in the global/local
cluster configuration, is exceeded.
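The letter combinations described above can be validated with a short Python
sketch. The names WHEN_LETTERS and parse_when are hypothetical. The text does
not say whether an empty value or repeated letters are accepted, so this sketch
accepts both and simply collapses duplicates; treat that as an assumption.

```python
# Short summaries of the letters described above.
WHEN_LETTERS = {
    "s": "checkpoint, abort and migrate on shutdown on the job's machine",
    "m": "checkpoint periodically at min_cpu_interval",
    "x": "checkpoint, abort and migrate when the job is suspended",
    "r": "reschedule when the host stays in unknown state",
}


def parse_when(value: str) -> set:
    """Return the set of letters in value, rejecting anything but s, m, x, r."""
    unknown = sorted(set(value) - set(WHEN_LETTERS))
    if unknown:
        raise ValueError(f"invalid letters in when value: {unknown}")
    return set(value)


# Print the meaning of each letter in a sample value.
for letter in sorted(parse_when("sx")):
    print(letter, "-", WHEN_LETTERS[letter])
```

Separators are rejected by design: a value such as "s,x" fails because the
comma is not one of the four allowed letters.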
RESTRICTIONS¶
Note that the checkpointing, migration and restart procedures provided by
default with the Sun Grid Engine distribution, as well as the way they are
invoked in the ckpt_command, migr_command or restart_command parameters of any
default checkpointing environment, should not be changed. Otherwise their
functionality remains the full responsibility of the administrator configuring
the checkpointing environment. Sun Grid Engine just invokes these procedures
and evaluates their exit status. If the procedures do not perform their tasks
properly or are not invoked in a proper fashion, the checkpointing mechanism
may behave unexpectedly; Sun Grid Engine has no means to detect this.
SEE ALSO¶
COPYRIGHT¶
See sge_intro(1) for a full statement of rights and permissions.