'\" t
.\" Title: pegasus-cluster
.\" Author: [see the "Authors" section]
.\" Generator: DocBook XSL Stylesheets v1.79.1
.\" Date: 11/09/2018
.\" Manual: Pegasus Manual
.\" Source: Pegasus 4.4.0
.\" Language: English
.\"
.TH "PEGASUS\-CLUSTER" "1" "11/09/2018" "Pegasus 4\&.4\&.0" "Pegasus Manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
pegasus-cluster \- run a list of applications
.SH "SYNOPSIS"
.sp
.nf
\fBpegasus\-cluster\fR [\fB\-d\fR] [\fB\-e\fR | \fB\-f\fR] [\fB\-S ec\fR] [\fB\-s fn\fR] [\fB\-R fn\fR] [\fB\-n nr\fR] [\fBinputfile\fR]
.fi
.SH "DESCRIPTION"
.sp
The \fBpegasus\-cluster\fR tool executes a list of applications in the order specified (assuming sequential mode)\&. It is generally used for horizontal clustering of independent applications, and does not care about any application failures\&. Such failures should be caught by using \fBpegasus\-kickstart\fR to start each application\&.
.sp
In vertical clustering mode, the \fIhard failure\fR mode is encouraged, ending execution as soon as one application fails\&.
When running a complex workflow through \fBpegasus\-cluster\fR, the order of applications in the input file must be topologically sorted\&.
.sp
Applications are usually executed through \fBpegasus\-kickstart\fR\&. In that case, all invocations of \fBpegasus\-kickstart\fR except the first should add the \fBpegasus\-kickstart\fR option \fI\-H\fR to suppress repeating the XML preamble and certain other headers that are of no interest when repeated\&.
.sp
\fBpegasus\-cluster\fR permits shell\-style quoting\&. One level of quoting is removed from the arguments\&. Please note that \fBpegasus\-kickstart\fR will also remove one level of quoting\&.
.SH "ARGUMENTS"
.PP
\fB\-d\fR
.RS 4
This option increases the debug level\&. Debug messages are generated on \fIstdout\fR\&. By default, debugging is minimal\&.
.RE
.PP
\fB\-e\fR
.RS 4
This flag turns on the old behavior of \fBpegasus\-cluster\fR to always run everything \fIand\fR return success no matter what\&. The \fB\-e\fR flag is mutually exclusive with the \fB\-f\fR flag\&. By default, all applications are executed regardless of failures\&. Any detected application failure results in a non\-zero exit status from \fBpegasus\-cluster\fR\&.
.RE
.PP
\fB\-f\fR
.RS 4
In hard failure mode, as soon as one application fails, either through a non\-zero exit code or by dying on a signal, further execution is stopped\&. In parallel execution mode, one or more other applications later in the sequence file may have been started already by the time the failure is detected\&. \fBpegasus\-cluster\fR will wait for the completion of these applications, but will not start new ones\&. The \fB\-f\fR flag is mutually exclusive with the \fB\-e\fR flag\&. By default, all applications are executed regardless of failures\&. Any detected application failure results in a non\-zero exit status from \fBpegasus\-cluster\fR\&.
.RE
.PP
\fB\-h\fR
.RS 4
This option prints the help message and exits the program\&.
.RE
.PP
\fB\-s fn\fR
.RS 4
This option will send protocol messages (for Mei) to the specified file\&. By default, all messages are written to \fIstdout\fR\&.
.RE
.PP
\fB\-R fn\fR
.RS 4
The progress reporting feature, if turned on, will write one event record whenever an application is started, and one event record whenever an application finishes\&. This enables tracking of jobs in progress\&. By default, track logs are not written, unless the environment variable \fISEQEXEC_PROGRESS_REPORT\fR is set\&. If set, progress reports are appended to the file pointed to by the environment variable\&.
.RE
.PP
\fB\-S ec\fR
.RS 4
This option is a multi\-option, which may be used multiple times\&. Each given non\-zero application exit code is marked as a form of success\&. In \fB\-f\fR mode, this means that \fBpegasus\-cluster\fR will not fail when seeing this exit code from any application it runs\&. By default, all non\-zero exit codes constitute failure\&.
.RE
.PP
\fB\-n nr\fR
.RS 4
This option determines the amount of parallel execution\&. Typically, parallel execution is only recommended on multi\-core systems, and must be deployed rather carefully, i\&.e\&. only jobs that are completely independent across the whole \fIinputfile\fR should ever be run in parallel\&. The argument \fBnr\fR is the number of parallel jobs that should be used\&. In addition to a non\-negative integer, the word \fIauto\fR is also understood\&. When \fIauto\fR is specified, \fBpegasus\-cluster\fR will attempt to automatically determine the number of cores available in the system\&. Strictly sequential execution, as if \fInr\fR was 1, is the default\&. If the environment variable \fISEQEXEC_CPUS\fR is set, it will determine the default number of CPUs\&.
.RE
.PP
\fBinputfile\fR
.RS 4
The input file specifies a list of applications to run, one per line\&. Comments and empty lines are permitted\&. The comment character is the octothorpe (#), and a comment extends to the end of the line\&.
By default, \fBpegasus\-cluster\fR uses \fIstdin\fR to read the list of applications to execute\&.
.RE
.SH "RETURN VALUE"
.sp
The \fBpegasus\-cluster\fR tool returns 1 if an illegal option was used\&. It returns 2 if the status file from option \fB\-s\fR cannot be opened\&. It returns 3 if the input file cannot be opened\&. It does \fInot\fR return any failure for failed applications in old\-exit \fB\-e\fR mode\&. In \fIdefault\fR and hard failure \fB\-f\fR mode, it will return 5 for true failure\&. The determination of failure is modified by the \fB\-S\fR option\&.
.sp
All other internal errors being absent, \fBpegasus\-cluster\fR will always return 0 when run with \fB\-e\fR\&. Unlike a shell, it will \fInot\fR return the last application\(cqs exit code\&. In \fIdefault\fR mode, it will return 5 if any application failed\&. Again, unlike a shell, it will \fInot\fR return the last application\(cqs exit code\&. However, it will execute all applications\&. The determination of failure is modified by the \fB\-S\fR flag\&. In \fB\-f\fR mode, \fBpegasus\-cluster\fR returns either 0 if all main sequence applications succeeded, or 5 if one failed (or more than one in parallel execution mode)\&. It will run only as long as applications were successful\&. As before, the \fB\-S\fR flag determines what constitutes a failure\&.
.sp
The \fBpegasus\-cluster\fR application will also create a small summary on \fIstdout\fR for each job, and one for itself, about the success and failure\&. The field \fBfailed\fR counts any application that ended with a non\-zero exit code or was terminated by a signal\&. It does \fInot\fR include non\-zero exit codes that were marked as success using the \fB\-S\fR option\&.
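.sp
As an illustration of the exit codes described in this section, the following Python sketch maps each documented value to a short description\&. The names \fIEXIT_MEANINGS\fR and \fIdescribe_exit\fR are hypothetical and not part of Pegasus; the sketch assumes only the codes listed above and treats anything else as undocumented\&.
.sp
.nf
```python
# Exit codes as listed in the RETURN VALUE section.
# The table and helper names are illustrative only, not part of Pegasus.
EXIT_MEANINGS = {
    0: "success",
    1: "illegal option",
    2: "status file from -s cannot be opened",
    3: "input file cannot be opened",
    5: "application failure",
}


def describe_exit(code):
    """Return a short description of a pegasus-cluster exit code."""
    return EXIT_MEANINGS.get(code, "undocumented exit code %d" % code)


if __name__ == "__main__":
    # Print the description for each documented code plus one unknown code.
    for code in (0, 1, 2, 3, 5, 42):
        print(code, describe_exit(code))
```
.fi
.sp
A wrapper script could pass the exit status it obtained from running \fBpegasus\-cluster\fR to such a helper to produce a readable log message\&. Remember that in \fB\-e\fR mode application failures are not reflected in the exit code\&.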
.SH "TASK SUMMARY"
.sp
Each task executed by \fBpegasus\-cluster\fR generates a record bracketed by square brackets like this (each entry is broken over two lines for readability):
.sp
.if n \{\
.RS 4
.\}
.nf
[cluster\-task id=1, start="2011\-04\-27T14:31:25\&.340\-07:00", duration=0\&.521,
 status=0, line=1, pid=18543, app="/bin/usleep"]
[cluster\-task id=2, start="2011\-04\-27T14:31:25\&.342\-07:00", duration=0\&.619,
 status=0, line=2, pid=18544, app="/bin/usleep"]
[cluster\-task id=3, start="2011\-04\-27T14:31:25\&.862\-07:00", duration=0\&.619,
 status=0, line=3, pid=18549, app="/bin/usleep"]
.fi
.if n \{\
.RE
.\}
.sp
Each record is introduced by the string \fIcluster\-task\fR with the following constituents, where strings are quoted:
.PP
\fBid\fR
.RS 4
This is a numerical value for main sequence applications, indicating the application\(cqs place in the sequence file\&. The setup task uses the string \fIsetup\fR, and the cleanup task uses the string \fIcleanup\fR\&.
.RE
.PP
\fBstart\fR
.RS 4
is the ISO 8601 time stamp, with millisecond resolution, when the application was started\&. This string is quoted\&.
.RE
.PP
\fBduration\fR
.RS 4
is the application wall\-time duration in seconds, with millisecond resolution\&.
.RE
.PP
\fBstatus\fR
.RS 4
is the \fIraw\fR exit status as returned by the \fIwait\fR family of system calls\&. Typically, the exit code is found in the high byte, and the signal of death in the low byte\&. Typically, 0 indicates a successful execution, and any other value a problem\&. However, details could differ between systems, and exit codes are only meaningful on the same OS and architecture\&.
.RE
.PP
\fBline\fR
.RS 4
is the line number where the task was found in the main sequence file\&. Setup and cleanup tasks don\(cqt have this attribute\&.
.RE
.PP
\fBpid\fR
.RS 4
is the process id under which the application ran\&.
.RE
.PP
\fBapp\fR
.RS 4
is the path to the application that was started\&.
As with the progress record, any \fBpegasus\-kickstart\fR invocation will be parsed out so that you see the true application\&.
.RE
.SH "PEGASUS\-CLUSTER SUMMARY"
.sp
The final summary of counts is a record bracketed by square brackets like this (broken over two lines for readability):
.sp
.if n \{\
.RS 4
.\}
.nf
[cluster\-summary stat="ok", lines=3, tasks=3, succeeded=3, failed=0, extra=0,
 duration=1\&.143, start="2011\-04\-27T14:31:25\&.338\-07:00", pid=18542, app="\&./seqexec"]
.fi
.if n \{\
.RE
.\}
.sp
The record is introduced by the string \fIcluster\-summary\fR with the following constituents:
.PP
\fBstat\fR
.RS 4
is the string \fIfail\fR when \fBpegasus\-cluster\fR would return with an exit status of 5\&. Concretely, this is any failure in \fIdefault\fR mode, and the first failure in \fB\-f\fR mode\&. Otherwise, it will always be the string \fIok\fR, if the record is produced\&.
.RE
.PP
\fBlines\fR
.RS 4
is the stopping line number of the input sequence file, indicating how far processing got\&. In \fB\-f\fR mode, additional lines, up to the number of cores, may have been parsed\&.
.RE
.PP
\fBtasks\fR
.RS 4
is the number of tasks processed\&.
.RE
.PP
\fBsucceeded\fR
.RS 4
is the number of main sequence jobs that succeeded\&.
.RE
.PP
\fBfailed\fR
.RS 4
is the number of main sequence jobs that failed\&. The failure condition also depends on the \fB\-S\fR settings\&.
.RE
.PP
\fBextra\fR
.RS 4
is 0, 1 or 2, depending on the existence of setup and cleanup jobs\&.
.RE
.PP
\fBduration\fR
.RS 4
is the time in seconds, with millisecond resolution, that \fBpegasus\-cluster\fR ran\&.
.RE
.PP
\fBstart\fR
.RS 4
is the start time of \fBpegasus\-cluster\fR as an ISO 8601 time stamp\&.
.RE
.SH "SEE ALSO"
.sp
\fBpegasus\-kickstart(1)\fR
.SH "CAVEATS"
.sp
The \fB\-S\fR option sets success codes globally\&. It is not possible to activate success codes only for one specific application, and doing so would break the shell compatibility\&.
Due to their global nature, use success codes sparingly, as a last\-resort emergency handler\&. In environments that allow better planning, you should use an application wrapper instead\&.
.SH "EXAMPLE"
.sp
The following shows an example input file to \fBpegasus\-cluster\fR making use of \fBpegasus\-kickstart\fR to track applications\&.
.sp
.if n \{\
.RS 4
.\}
.nf
#
# mkdir
/path/to/pegasus\-kickstart \-R HPC \-n mkdir /bin/mkdir \-m 2755 \-p split\-corpus split\-ne\-corpus
#
# drop\-dian
/path/to/pegasus\-kickstart \-H \-R HPC \-n drop\-dian \-o \*(Aq^f\-new\&.plain\*(Aq /path/to/drop\-dian /path/to/f\-tok\&.plain /path/to/f\-tok\&.NE
#
# split\-corpus
/path/to/pegasus\-kickstart \-H \-R HPC \-n split\-corpus /path/to/split\-seq\-new\&.pl 23 f\-new\&.plain split\-corpus/corpus\&.
#
# split\-corpus
/path/to/pegasus\-kickstart \-H \-R HPC \-n split\-corpus /path/to/split\-seq\-new\&.pl 23 /path/to/f\-tok\&.NE split\-ne\-corpus/corpus\&.
.fi
.if n \{\
.RE
.\}
.SH "ENVIRONMENT VARIABLES"
.sp
A number of environment variables influence the behavior of \fBpegasus\-cluster\fR at run time\&.
.PP
\fBSEQEXEC_PROGRESS_REPORT\fR
.RS 4
If this variable is set, and points to a writable file location, progress report records are appended to the file\&. While care is taken to atomically append records to the log file, in case concurrent instances of \fBpegasus\-cluster\fR are running, broken Linux NFS may still garble some content\&.
.RE
.PP
\fBSEQEXEC_CPUS\fR
.RS 4
If this variable is set to a non\-negative integer, \fBpegasus\-cluster\fR attempts to use that many CPUs\&. The special value \fIauto\fR makes \fBpegasus\-cluster\fR auto\-detect the number of CPUs available to it on the system\&.
.RE
.PP
\fBSEQEXEC_SETUP\fR
.RS 4
If this variable is set, and contains a single fully\-qualified path to an executable and arguments, this executable will be run before any jobs are started\&. The exit code of this setup job will have no effect upon the main job sequence\&.
Success or failure will not be counted towards the summary\&.
.RE
.PP
\fBSEQEXEC_CLEANUP\fR
.RS 4
If this variable is set, and contains a single fully\-qualified path to an executable and arguments, this executable will be run before \fBpegasus\-cluster\fR quits\&. Failure of any previous job will have no effect on the ability to run this job\&. The exit code of the cleanup job will have no effect on the overall success or failure state\&. Success or failure will not be counted towards the summary\&.
.RE
.SH "HISTORY"
.sp
As you may have noticed, \fBpegasus\-cluster\fR had the name \fBseqexec\fR in previous incarnations\&. We are slowly moving to the new name to avoid clashes in a larger OS installation setting\&. However, there is no pressing need to change the internal name, too, as no name clashes are expected\&.
.SH "AUTHORS"
.sp
Jens\-S\&. Vöckler
.sp
Pegasus \fBhttp://pegasus\&.isi\&.edu/\fR