NAME¶
checkpoint - Grid Engine checkpointing environment configuration file format
DESCRIPTION¶
Checkpointing is a facility to save the complete status of an executing program
or job and to restore and restart from this so-called checkpoint at a later
point in time if the original program or job was halted, e.g. through a system
crash.
Grid Engine provides various levels of checkpointing support. The
checkpointing environment described here is a means to configure the different
types of checkpointing in use for your Grid Engine cluster or parts thereof.
For that purpose you can define the operations which have to be executed in
initiating a checkpoint generation, a migration of a checkpoint to another
host, or a restart of a checkpointed application.
Supporting different operating systems may force Grid Engine to introduce
operating system dependencies into the checkpointing configuration file, and
updates of the supported operating system versions may lead to frequently
changing implementation details. Please refer to the
<sge_root>/ckpt directory for more information.
Please use the -ackpt, -dckpt, -mckpt or -sckpt options to the command to
manipulate checkpointing environments from the command-line or use the
corresponding dialogue for X-Windows based interactive configuration.
Note that Grid Engine allows backslashes (\) to be used to escape newline characters.
The backslash and the newline are replaced with a space character before any
interpretation.
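The continuation rule above can be illustrated with a minimal Python sketch. The function name `join_continuations` and the sample text are hypothetical and not part of Grid Engine, and the "name value" layout of the sample line is an assumption. The sketch only mirrors the stated rule: each backslash-newline pair becomes a single space before any further interpretation.

```python
def join_continuations(text: str) -> str:
    """Replace every backslash-newline pair with a single space."""
    return text.replace("\\\n", " ")


# Hypothetical sample: one logical line split across two physical lines.
sample = "ckpt_command /path/to/ckpt_script \\\n$job_pid $ckpt_dir\n"
print(join_continuations(sample))
```

A newline that is not preceded by a backslash is left alone, so it still ends the logical line.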
The format of a checkpoint file is defined as follows:
ckpt_name¶
The name of the checkpointing environment, in the format defined for
ckpt_name. It is used in the -ckpt switch or with the options mentioned above.
interface¶
The type of checkpointing to be used. Currently, the following types are valid:
- hibernator
- The Hibernator kernel level checkpointing is interfaced.
- cpr
- The SGI kernel level checkpointing is used.
- transparent
- Grid Engine assumes that the jobs submitted with reference to this
checkpointing interface use a checkpointing library such as provided by
the free package Condor.
- userdefined
- Grid Engine assumes that the jobs submitted with reference to this
checkpointing interface perform their private checkpointing method.
- application-level
- Uses all of the interface commands configured in the checkpointing object,
as with the kernel level checkpointing interfaces (cpr, etc.), except for the
restart_command (see below). That command is not used, even if it is
configured; the job script is invoked on restart instead.
ckpt_command¶
A command-line type command string to be executed by Grid Engine in order to
initiate a checkpoint. The following pseudo-variables are available to be
substituted in the value:
- $host
- The name of the host on which the command is executed.
- $ja_task_id
- The array job task index (0 if not an array job).
- $job_owner
- The user name of the job owner.
- $job_id
- Grid Engine's unique job identification number.
- $job_name
- The name of the job.
- $queue
- The cluster queue name of the master queue instance, on which the command
is started.
- $job_pid
- The process id of the job/task to checkpoint.
- $ckpt_dir
- See ckpt_dir below.
- $ckpt_signal
- See signal below.
- $sge_cell
- The SGE_CELL environment variable (useful for locating files).
- $sge_root
- The SGE_ROOT environment variable (useful for locating files).
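As an illustration of how the pseudo-variables listed above could appear in a command string, here is a minimal Python sketch of the substitution. It is not Grid Engine's implementation: the function `expand`, the script path and all sample values are hypothetical, and leaving a pseudo-variable untouched when no value is supplied is an assumption made for this sketch.

```python
import re

# Pseudo-variable names as listed in this section.
PSEUDO_VARIABLES = (
    "host", "ja_task_id", "job_owner", "job_id", "job_name", "queue",
    "job_pid", "ckpt_dir", "ckpt_signal", "sge_cell", "sge_root",
)

# Match "$name" only when it is followed by a word boundary.
_PATTERN = re.compile(
    r"\$(" + "|".join(sorted(PSEUDO_VARIABLES, key=len, reverse=True)) + r")\b"
)


def expand(command: str, values: dict) -> str:
    """Substitute known pseudo-variables; leave those without a value untouched."""
    def repl(match):
        return str(values.get(match.group(1), match.group(0)))
    return _PATTERN.sub(repl, command)


# Hypothetical values for a single job.
values = {"job_id": 4711, "job_pid": 12345, "ckpt_dir": "/scratch/ckpt"}
print(expand("/path/to/do_ckpt $job_pid $ckpt_dir/$job_id $host", values))
# prints: /path/to/do_ckpt 12345 /scratch/ckpt/4711 $host
```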
migr_command¶
A command-line type command string to be executed by Grid Engine during a
migration of a checkpointing job from one host to another. The same
pseudo-variables are available as for
ckpt_command. Note that the
command is expected to create a checkpoint itself - the checkpointing command
isn't called automatically on migration.
restart_command¶
A command-line type command string to be executed by Grid Engine when restarting
a previously checkpointed application. The same pseudo-variables are available
as for
ckpt_command.
clean_command¶
A command-line type command string to be executed by Grid Engine in order to
clean up after a checkpointed application has finished. The same
pseudo-variables are available as for
ckpt_command.
ckpt_dir¶
A file system location in which checkpoints, potentially of considerable size,
should be stored.
signal¶
A Unix signal to be sent to a job by Grid Engine to initiate checkpoint
generation. The value for this field can either be a symbolic name from the
list produced by the
-l option of the command or an integer number
which must be a valid signal on the systems used for checkpointing.
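The two accepted forms of this value, a symbolic name or an integer, can be checked with a short Python sketch. The function `parse_signal` is hypothetical. Accepting names both with and without the "SIG" prefix is an assumption. The check runs against the signals known to the host running Python, which is not necessarily one of the systems used for checkpointing.

```python
import signal


def parse_signal(value: str) -> signal.Signals:
    """Return the signal given as a symbolic name or as an integer string.

    Raises ValueError if the value is not a valid signal on this system.
    """
    value = value.strip()
    if value.isdigit():
        return signal.Signals(int(value))  # ValueError for unknown numbers
    name = value.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name  # assumption: accept both "TERM" and "SIGTERM"
    try:
        return signal.Signals[name]
    except KeyError:
        raise ValueError(f"unknown signal name: {value!r}") from None


print(parse_signal("TERM").name)
# prints: SIGTERM
```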
when¶
The points in time when checkpoints are expected to be generated. Valid values
for this parameter are composed of the letters s, m, x, r, and any
combination thereof, without any separating character in between. The same
letters are allowed for the -c option of the command,
which overrides the definitions in the checkpointing environment used.
The meaning of the letters is as follows:
- s
- A job is checkpointed, aborted and, if possible, migrated if the
corresponding execution daemon is shut down on the job's host. This operation
is handled by the specified migr_command.
- m
- Checkpoints are generated periodically at the min_cpu_interval
interval defined by the queue in which a job executes.
- x
- A job is checkpointed, aborted and, if possible, migrated as soon as the
job gets suspended (manually as well as automatically). This operation is
handled by the specified migr_command.
- r
- A job will be rescheduled (not checkpointed) when the host on which the
job currently runs goes into the "unknown" state and the time
interval reschedule_unknown defined in the global/local
cluster configuration is exceeded.
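The composition rule for this value (letters s, m, x and r, in any combination, with no separators) can be checked with a minimal Python sketch. The function `parse_when` is hypothetical. Rejecting an empty value and collapsing repeated letters are assumptions of this sketch, not behaviour stated in this page.

```python
# Short summaries of the occasions described above.
WHEN_MEANINGS = {
    "s": "checkpoint, abort and, if possible, migrate on shutdown on the job's host",
    "m": "checkpoint periodically at the queue's min_cpu_interval",
    "x": "checkpoint, abort and, if possible, migrate when the job is suspended",
    "r": "reschedule (not checkpoint) when the host stays unknown too long",
}


def parse_when(value: str) -> list:
    """Return the distinct occasion letters of a when value, in order of appearance."""
    if not value:
        raise ValueError("empty when value")  # assumption: empty is rejected
    letters = []
    for ch in value:
        if ch not in WHEN_MEANINGS:
            raise ValueError(f"invalid letter {ch!r} in when value {value!r}")
        if ch not in letters:
            letters.append(ch)
    return letters


for letter in parse_when("sx"):
    print(letter, "-", WHEN_MEANINGS[letter])
```

A value such as "s,x" is rejected because the comma is a separating character.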
ENVIRONMENT VARIABLES¶
SGE_BINDING and SGE_CKPT_DIR may be specified on job submission.
RESTRICTIONS¶
Note that the functionality of any checkpointing, migration or restart
procedures provided by default with the Grid Engine distribution, as well as
the way they are invoked in the ckpt_command, migr_command or
restart_command parameters of any default checkpointing
environments, should not be changed; otherwise the functionality remains the
full responsibility of the administrator configuring the checkpointing
environment. Grid Engine will just invoke these procedures and evaluate their
exit status. If the procedures do not perform their tasks properly, or are not
invoked in a proper fashion, the checkpointing mechanism may behave
unexpectedly. Grid Engine has no means to detect this: all exit codes are
treated as success, except in the case of kernel checkpointing.
SEE ALSO¶
COPYRIGHT¶
See for a full statement of rights and permissions.