DMTCP(1) | Distributed MultiThreaded CheckPointing | DMTCP(1) |
NAME
dmtcp - Distributed MultiThreaded Checkpointing
SYNOPSIS
dmtcp_coordinator [port]
DESCRIPTION
DMTCP is a tool for transparently checkpointing the state of an arbitrary group of programs spread across many machines and connected by sockets. It modifies neither the user's program nor the operating system. MTCP is a standalone component of DMTCP, available as a checkpointing library for a single process.
OPTIONS
For each command, the --help or -h flag shows the command-line options. Most command-line options can also be controlled through environment variables. These can be set in bash with "export NAME=value" or in tcsh with "setenv NAME value".
- DMTCP_CHECKPOINT_INTERVAL=integer
- Time in seconds between automatic checkpoints. Checkpoints
can also be initiated manually by typing 'c' into the coordinator.
(default: 0, disabled; dmtcp_coordinator only)
- DMTCP_HOST=string
- Hostname where the cluster-wide coordinator is running.
(default: localhost; dmtcp_checkpoint, dmtcp_restart only)
- DMTCP_PORT=integer
- The port the cluster-wide coordinator listens on. (default:
7779)
- DMTCP_GZIP=(1|0)
- Set to "0" to disable compression of checkpoint
images. (default: 1, compression enabled; dmtcp_checkpoint only) WARNING:
gzip adds seconds to checkpoint time. Without gzip, checkpoint/restart often takes less than 1 s.
- DMTCP_CHECKPOINT_DIR=path
- Directory to store checkpoint images in. (default: ./)
- DMTCP_SIGCKPT=integer
- Internal signal number to use for checkpointing. Must not be used by the user program. (default: SIGUSR2; dmtcp_checkpoint only)
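As a minimal sketch, the checkpoint-related variables above could be set in a bash or POSIX shell session before launching any DMTCP command. The values 60 and /tmp/dmtcp-ckpt are illustrative assumptions, and print_dmtcp_settings is a hypothetical helper (not part of DMTCP) that only echoes the effective values, falling back to the defaults listed above.

```shell
# Illustrative values (assumptions): checkpoint every 60 s, no compression.
export DMTCP_CHECKPOINT_INTERVAL=60
export DMTCP_GZIP=0
export DMTCP_CHECKPOINT_DIR=/tmp/dmtcp-ckpt   # hypothetical directory

# Hypothetical helper: print the effective settings, substituting the
# documented default for any variable that is unset or empty.
print_dmtcp_settings() {
    echo "interval=${DMTCP_CHECKPOINT_INTERVAL:-0}"
    echo "gzip=${DMTCP_GZIP:-1}"
    echo "dir=${DMTCP_CHECKPOINT_DIR:-./}"
}

print_dmtcp_settings
```

In tcsh the same settings would use "setenv NAME value" instead of export.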
DMTCP_COORDINATOR
Each computation to be checkpointed must include a DMTCP coordinator process. One can start a coordinator explicitly through dmtcp_coordinator, or allow dmtcp_checkpoint or dmtcp_restart to start one implicitly in the background. The address of the unique coordinator should be specified to dmtcp_checkpoint, dmtcp_restart, and dmtcp_command either through the --host and --port command-line flags or through the DMTCP_HOST and DMTCP_PORT environment variables. If neither is given, the host-port pair defaults to localhost-7779. The host-port pair associated with a particular coordinator is given by the command-line flags used in the dmtcp_coordinator command, or the environment variables then in effect, or the default of localhost-7779.
The following commands can be typed into the coordinator:
l : List connected nodes
s : Print status message
c : Checkpoint all nodes
f : Force a restart even if there are missing nodes (debugging)
k : Kill all nodes
q : Kill all nodes and quit
? : Show this message
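The address lookup described at the start of this section can be sketched as a small shell function. This is only an illustration: coordinator_address is a hypothetical helper, not a DMTCP command, and it assumes that an explicit flag value takes precedence over the corresponding environment variable, which in turn takes precedence over the default.

```shell
# Hypothetical helper: print "host:port" for the coordinator.
# Optional arguments: $1 = value given to --host, $2 = value given to --port.
# Assumption: flag value first, then environment variable, then default.
coordinator_address() {
    flag_host=${1:-}
    flag_port=${2:-}
    host=${flag_host:-${DMTCP_HOST:-localhost}}
    port=${flag_port:-${DMTCP_PORT:-7779}}
    echo "$host:$port"
}

# No flags: uses DMTCP_HOST and DMTCP_PORT if set, else localhost and 7779.
coordinator_address
# Explicit values (hypothetical host name and port): prints "node01:7800".
coordinator_address node01 7800
```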
DMTCP_INSPECTOR
dmtcp_inspector is a tool for offline checkpoint analysis. It provides information about socket connections and parent-child relations between the processes of a distributed program. The output is in Graphviz format and can be rendered using Graphviz tools such as dot, neato, twopi, circo, fdp and sfdp. The command-line options are as follows:
- -o, --out <file>
- Write output to <file>
- -t, --tool
- Graphviz tool to use. By default, no Graphviz tool is run and the output is in a dot-like format
- -c, --cred
- Add information about parent-child relations on the graph
- -d, --no-sock
- Remove information about socket connections
- -a, --sock-all
- Add verbose information about socket connections
- -z, --sock-half
- Also represent half-connections (when some *.dmtcp files are missing)
- -n, --node
- Show verbose node names
- -h, --help
- Display this help
EXAMPLE USAGE
- 1. In a separate terminal window, start the dmtcp_coordinator.
- (See previous section.)
dmtcp_coordinator
- 2. In separate terminal(s), replace each command with "dmtcp_checkpoint
- [command]". The checkpointed program will connect to
the coordinator specified by DMTCP_HOST and DMTCP_PORT. New threads will
be checkpointed as part of the process. Child processes will automatically
be checkpointed. Remote processes started via ssh will
automatically be checkpointed. (Internally, DMTCP modifies the ssh
command line to call dmtcp_checkpoint on the remote host.)
dmtcp_checkpoint ./myprogram
- 3. To manually initiate a checkpoint, either run the command below
- or type "c" followed by <return> into the
coordinator. Checkpoint files for each process will be written to
DMTCP_CHECKPOINT_DIR. The dmtcp_coordinator will write
"dmtcp_restart_script.sh" to its working directory. This script
contains the necessary calls to dmtcp_restart to restart the entire
computation, including remote processes created via ssh.
dmtcp_command -c
- 4. To restart, one should execute dmtcp_restart_script.sh, which is
- created by the dmtcp_coordinator in its working directory
at the time of checkpoint. One can optionally edit this script to migrate
processes to different hosts. By default, only one process will
be restarted in the foreground and receive the standard input. The script
may be edited to choose which process will be restarted in the foreground.
./dmtcp_restart_script.sh
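As a hedged sketch of step 4, a wrapper could confirm that the restart script exists before running it. restart_computation is a hypothetical helper, not part of DMTCP; it assumes only what is stated above, namely that the coordinator writes dmtcp_restart_script.sh to its working directory at checkpoint time.

```shell
# Hypothetical helper: run the restart script found in the given directory
# (default: the current directory). The directory should be the
# coordinator's working directory at the time of the checkpoint.
restart_computation() {
    dir=${1:-.}
    script=$dir/dmtcp_restart_script.sh
    if [ ! -f "$script" ]; then
        echo "no dmtcp_restart_script.sh in $dir; was a checkpoint taken?" >&2
        return 1
    fi
    if [ ! -x "$script" ]; then
        echo "$script is not executable" >&2
        return 1
    fi
    # Run from inside the directory, as in step 4; the subshell keeps the
    # caller's working directory unchanged.
    (cd "$dir" && ./dmtcp_restart_script.sh)
}

# Example call (hypothetical path):
# restart_computation /home/user/run1
```

The function returns a non-zero status without running anything when the script is missing, so a calling script can distinguish "no checkpoint yet" from a failed restart.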
DMTCPAWARE API
DMTCP provides a programming interface that allows checkpointed applications to interact with DMTCP. In the source distribution, see dmtcpaware/dmtcpaware.h for the functions available and test/dmtcpaware[123].c for three example applications. For an example of its usage, try:
cd test; rm dmtcpaware1; make dmtcpaware1; ./autotest -v dmtcpaware1
DMTCP PLUGIN MODULES
The source distribution includes a top-level plugin directory, with examples of how to write a plugin module for DMTCP. Further examples are in the test/plugin directory. The plugin feature adds three new user-programmable capabilities. A plugin may: add wrappers around system calls; take special actions during certain events (e.g. pre-checkpoint, resume/post-checkpoint, restart); and insert key-value pairs into a database at restart time, which the restarted processes of a computation can then query. (The events available to the plugin feature form a superset of the events available with the dmtcpaware interface.) One or more plugins are invoked via a list of colon-separated absolute pathnames.
dmtcp_checkpoint --with-plugin PLUGIN1[:PLUGIN2]...
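The colon-separated list can be assembled in the shell. The sketch below uses a hypothetical helper, join_plugins, which is not part of DMTCP; the plugin file names are invented for illustration, and the .so suffix is an assumption. It rejects relative paths because the list must contain absolute pathnames.

```shell
# Hypothetical helper: join plugin paths with ':' for --with-plugin,
# refusing any path that is not absolute.
join_plugins() {
    list=
    for p in "$@"; do
        case $p in
            /*) ;;  # absolute path: accepted
            *)  echo "not an absolute path: $p" >&2
                return 1 ;;
        esac
        # Prepend "existing-list:" only when the list is already non-empty.
        list=${list:+$list:}$p
    done
    echo "$list"
}

# Hypothetical plugin files; the command is printed, not executed.
plugins=$(join_plugins /opt/plugins/libfoo.so /opt/plugins/libbar.so)
echo "dmtcp_checkpoint --with-plugin $plugins ./myprogram"
```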
RETURN CODE
A target program under DMTCP control normally returns the same return code as if executed without DMTCP. However, if DMTCP fails (as opposed to the target program failing), DMTCP returns a DMTCP-specific return code, rc (or rc+1, rc+2 for two special cases), where rc is the integer value of the environment variable DMTCP_FAIL_RC if set, or else the default value, 99.
SEE ALSO
Full documentation is available from http://dmtcp.sourceforge.net/
AUTHORS
DMTCP and its standalone single-process component MTCP (MultiThreaded CheckPointing) were created and are maintained by Jason Ansel, Kapil Arya, Gene Cooperman, Artem Y. Polyakov, Mike Rieker, Ana-Maria Visan, and a series of newer contributors including Alex Brick, Tyler Denniston, Rohan Garg, Gregory Kerr, and others.
June 17, 2008 | Jason Ansel |