NAME
sge_ckpt - the Grid Engine checkpointing mechanism and checkpointing support
DESCRIPTION
Grid Engine supports two levels of checkpointing: the user level and an
operating system-provided transparent level. User level checkpointing refers
to applications which do their own checkpointing by writing restart files at
certain times or algorithmic steps and by properly processing these restart
files when restarted.
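A minimal sketch of user level checkpointing, assuming a job whose whole state fits in a small JSON restart file. The function names, the file layout and the stop_after parameter (which only simulates an abort) are illustrative assumptions and not part of Grid Engine.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically so an abort never leaves a partial restart file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # rename over the old restart file in one step


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no restart file exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=10, stop_after=None):
    """Sum the integers 0..n_steps-1, resuming from the restart file if present.

    stop_after (hypothetical) simulates the job being aborted after that many
    steps of the current run; in that case None is returned.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated abort; work since the last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

When the job is restarted, calling run() with the same restart file path repeats only the steps taken since the last checkpoint was written.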
Transparent checkpointing has to be provided by the operating system and is
usually integrated in the operating system kernel. An example of a
kernel-integrated checkpointing facility is the Hibernator package from Softway
for SGI IRIX platforms.
Checkpointing jobs need to be identified to the Grid Engine system by using the
-ckpt option of the qsub(1) command. The argument to this flag refers to a
so-called checkpointing environment, which defines the attributes of the
checkpointing method to be used (see checkpoint(5) for details). Checkpointing
environments are set up with the qconf(1) options -ackpt, -dckpt, -mckpt and
-sckpt. The qsub(1) option -c can be used to override the when attribute of the
referenced checkpointing environment.
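As an illustration, a submission command line using these options could be assembled as follows. This is a sketch only: it assumes the submit command is qsub(1), and the environment name my_ckpt_env and the script name job.sh are hypothetical placeholders. The code only builds the argument list and does not contact a Grid Engine cluster.

```python
import shlex


def build_submit_command(script, ckpt_env, occasion=None):
    """Build an argument list that submits script as a checkpointing job.

    ckpt_env is the name of an existing checkpointing environment; occasion,
    if given, is passed to -c to override the environment's "when" attribute.
    """
    argv = ["qsub", "-ckpt", ckpt_env]
    if occasion is not None:
        argv += ["-c", occasion]
    argv.append(script)
    return argv


if __name__ == "__main__":
    # Hypothetical names; "x" is the occasion specifier for suspension.
    print(shlex.join(build_submit_command("job.sh", "my_ckpt_env", occasion="x")))
```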
As opposed to the behavior for regular batch jobs, checkpointing jobs (see the
-ckpt option to qsub(1)) are aborted under conditions for which batch or
interactive jobs are suspended or even stay unaffected. These conditions are:
- Explicit suspension of the queue or job via qmod(1) by the cluster
  administration or a queue owner, if the x occasion specifier (see the -c
  option to qsub(1) and checkpoint(5)) was assigned to the job.
- A load average value exceeding the suspend threshold configured for the
  corresponding queues (see queue_conf(5)).
- Shutdown of the Grid Engine execution daemon responsible for the
  checkpointing job.
After they are aborted, jobs will migrate to other hosts, and possibly other
cluster queues, unless they were submitted to a specific one by an explicit
user request. The migration of jobs leads to a dynamic load balancing.
Note: Aborting checkpointed jobs frees all resources (memory, swap space)
which the job occupies at that time. This contrasts with suspended regular
jobs, which still use virtual memory and other consumable resources.
RESTRICTIONS
When a job migrates to another machine, at present no files are transferred
automatically to that machine. This means that all files which are used
throughout the entire job, including restart files, executables, and scratch
files, must be visible or transferred explicitly (e.g. at the beginning of the
job script).
There are also some practical limitations regarding use of disk space for
transparently checkpointing jobs. Checkpoints of a transparently checkpointed
application are usually stored in a checkpoint file or directory by the
operating system. The file or directory contains all the text, data, and stack
space for the process, along with some additional control information. This
means jobs which use a very large virtual address space will generate very
large checkpoint files. In addition, the workstations on which the jobs will
actually execute may have little free disk space. Thus it is not always
possible to transfer a transparent checkpointing job to a machine, even though
that machine is idle. Since large virtual memory jobs must wait for a machine
that is both idle and has sufficient free disk space, such jobs may suffer long
turnaround times.
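The disk space constraint can be checked before a checkpoint is written. The sketch below assumes that the expected checkpoint size is roughly the job's virtual address space and that the caller can estimate it; the function name and the safety factor of 1.5 are arbitrary illustrative choices, not Grid Engine settings.

```python
import shutil


def has_room_for_checkpoint(directory, expected_bytes, safety_factor=1.5):
    """Return True if directory's file system has room for a checkpoint.

    expected_bytes is the caller's estimate of the checkpoint size (for a
    transparently checkpointed job, roughly its virtual address space).
    safety_factor leaves headroom for control information and other users.
    """
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= expected_bytes * safety_factor
```

A job script could call such a check at startup and exit early with a clear message rather than fail later while the checkpoint is being written.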
There is currently no mechanism for restarting jobs with the same resources they
were granted originally. That might be important if they were submitted with a
choice or range of resources and configured themselves at startup according to
what they were given.
Similarly, with heterogeneous execution hosts, jobs may need to restart on a
host which supports a superset of the instruction set where the job originally
ran if the checkpoint mechanism (e.g. BLCR or DMTCP) dumps an image of the
running process. Runtime libraries, in particular, may initialize themselves
depending on details of the architecture they start up on - say to use a
specific type of vector unit. Then, they may fail if moved to an older host of
similar architecture which lacks that feature, even if they were compiled for
a common instruction set.
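The superset condition can be expressed as a set comparison of CPU feature flags. The sketch below assumes Linux on x86, where /proc/cpuinfo contains a "flags" line (other architectures use different field names); the function names are illustrative, and Grid Engine does not perform this check itself.

```python
def parse_cpu_flags(cpuinfo_text):
    """Extract the CPU feature flags from the text of /proc/cpuinfo.

    Only the first "flags" line is used; an empty set is returned if none exists.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def can_restart_on(origin_flags, target_flags):
    """Return True if the target host offers every feature of the original host."""
    return origin_flags <= target_flags
```

Recording the flags of the original host next to the checkpoint would allow a restart script to refuse an unsuitable host before restoring the process image.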
SEE ALSO
COPYRIGHT
See sge_intro(1) for a full statement of rights and permissions.