.TH "work_queue_factory" 1 "" "CCTools 8.0.0 DEVELOPMENT" "Cooperative Computing Tools" .SH NAME .LP \fBwork_queue_factory\fP - maintain a pool of Work Queue workers on a batch system. .SH SYNOPSIS .LP \FC\fBwork_queue_factory -M -T [options]\fP\FT .SH DESCRIPTION .LP \fBwork_queue_factory\fP submits and maintains a number of \fBwork_queue_worker(1)\fP processes on various batch systems, such as Condor and SGE. All the workers managed by a \fBwork_queue_factory\fP process will be directed to work for a specific manager, or any set of managers matching a given project name. \fBwork_queue_factory\fP will automatically determine the correct number of workers to have running, based on criteria set on the command line. The decision on how many workers to run is reconsidered once per minute. .PP By default, \fBwork_queue_factory\fP will run as many workers as the indicated managers have tasks ready to run. If there are multiple managers, then enough workers will be started to satisfy their collective needs. For example, if there are two managers with the same project name, each with 10 tasks to run, then \fBwork_queue_factory\fP will start a total of 20 workers. .PP If the number of needed workers increases, \fBwork_queue_factory\fP will submit more workers to meet the desired need. However, it will not run more than a fixed maximum number of workers, given by the -W option. .PP If the need for workers drops, \fBwork_queue_factory\fP does not remove them immediately, but waits to them to exit on their own. (This happens when the worker has been idle for a certain time.) A minimum number of workers will be maintained, given by the -w option. .PP If given the -c option, then \fBwork_queue_factory\fP will consider the capacity reported by each manager. The capacity is the estimated number of workers that the manager thinks it can handle, based on the task execution and data transfer times currently observed at the manager. With the -c option on, \fBwork_queue_factory\fP will consider the manager's capacity to be the maximum number of workers to run. .PP If \fBwork_queue_factory\fP receives a terminating signal, it will attempt to remove all running workers before exiting. .SH OPTIONS .LP General options: .LP .TP .B \ -T . Batch system type (required). One of: local, wq, condor, sge, pbs, lsf, torque, moab, mpi, slurm, chirp, amazon, amazon-batch, lambda, mesos, k8s, dryrun .TP .B \ -C . Use configuration file . .TP .B \ -M . Project name of managers to server, can be regex .TP .B \ -F . Foremen to serve, can be a regular expression. .TP .B \ --catalog= . Catalog server to query for managers. .TP .B \ -P . Password file for workers to authenticate. .TP .B \ -S . Use this scratch dir for factory. .TP .B \ (default:) /tmp/wq-factory-$uid . . .TP .B \ --run-factory-as-manager . Force factory to run itself as a manager. .TP .B \ --parent-death . Exit if parent process dies. .TP .B \ -d . Enable debugging for this subsystem. .TP .B \ -o . Send debugging to this file. .TP .B \ -O . Specify the size of the debug file. .TP .B \ -v . Show the version string. .TP .B \ -h . Show this screen. Concurrent control options: .LP .TP .B \ -w . Minimum workers running (default=5). .TP .B \ -W . Maximum workers running (default=100). .TP .B \ --workers-per-cycle . Max number of new workers per 30s (default=5) .TP .B \ -t . Workers abort after idle time (default=300). .TP .B \ --factory-timeout . Exit after no manager seen in seconds. .TP .B \ --tasks-per-worker . Average tasks per worker (default=one per core). 
.SH OPTIONS
.LP
General options:
.LP
.TP
.B \ -T <type>
Batch system type (required). One of: local, wq, condor, sge, pbs, lsf, torque, moab, mpi, slurm, chirp, amazon, amazon-batch, lambda, mesos, k8s, dryrun
.TP
.B \ -C <file>
Use the given configuration file.
.TP
.B \ -M <project>
Project name of the managers to serve; can be a regular expression.
.TP
.B \ -F <project>
Foremen to serve; can be a regular expression.
.TP
.B \ --catalog=<host:port>
Catalog server to query for managers.
.TP
.B \ -P <file>
Password file for workers to authenticate.
.TP
.B \ -S <dir>
Use this scratch dir for the factory (default: /tmp/wq-factory-$uid).
.TP
.B \ --run-factory-as-manager
Force the factory to run itself as a manager.
.TP
.B \ --parent-death
Exit if the parent process dies.
.TP
.B \ -d <subsystem>
Enable debugging for this subsystem.
.TP
.B \ -o <file>
Send debugging output to this file.
.TP
.B \ -O <size>
Specify the maximum size of the debug file.
.TP
.B \ -v
Show the version string.
.TP
.B \ -h
Show this help screen.
.LP
Concurrency control options:
.LP
.TP
.B \ -w <count>
Minimum number of workers to keep running (default=5).
.TP
.B \ -W <count>
Maximum number of workers to run (default=100).
.TP
.B \ --workers-per-cycle=<count>
Maximum number of new workers submitted per 30-second cycle (default=5).
.TP
.B \ -t <time>
Workers abort after this amount of idle time, in seconds (default=300).
.TP
.B \ --factory-timeout=<seconds>
Exit if no manager has been seen for this many seconds.
.TP
.B \ --tasks-per-worker=<count>
Average number of tasks per worker (default: one task per core).
.TP
.B \ -c
Use the worker capacity reported by each manager.
.LP
Resource management options:
.LP
.TP
.B \ --cores=<n>
Set the number of cores requested per worker.
.TP
.B \ --gpus=<n>
Set the number of GPUs requested per worker.
.TP
.B \ --memory=<mb>
Set the amount of memory (in MB) requested per worker.
.TP
.B \ --disk=<mb>
Set the amount of disk (in MB) requested per worker.
.TP
.B \ --autosize
Automatically size each worker to the slot it lands in (Condor, Mesos, K8S).
.LP
Worker environment options:
.LP
.TP
.B \ --env=<variable=value>
Environment variable to add to the worker.
.TP
.B \ -E <options>
Extra options to pass to each worker.
.TP
.B \ --worker-binary=<file>
Alternate binary to run instead of work_queue_worker.
.TP
.B \ --wrapper=<command>
Wrap each worker with this command prefix.
.TP
.B \ --wrapper-input=<file>
Add this input file needed by the wrapper.
.TP
.B \ --runos=<img>
Use the runos tool to create the worker environment (ND only).
.TP
.B \ --python-package=<pkg>
Run each worker inside this Python package.
.LP
Options specific to batch systems:
.LP
.TP
.B \ -B <options>
Generic batch system options.
.TP
.B \ --amazon-config=<file>
Specify the Amazon configuration file.
.TP
.B \ --condor-requirements=<reqs>
Set requirements for the workers as Condor jobs.
.TP
.B \ --mesos-master=<host>
Host name of the Mesos manager node.
.TP
.B \ --mesos-path=<path>
Path to the Mesos Python library.
.TP
.B \ --mesos-preload=<paths>
Libraries needed for running Mesos.
.TP
.B \ --k8s-image=<image>
Container image for Kubernetes.
.TP
.B \ --k8s-worker-image=<image>
Container image that includes the worker, for Kubernetes.
.SH EXIT STATUS
.LP
On success, returns zero. On failure, returns non-zero.
.SH EXAMPLES
.LP
Suppose you have a Work Queue manager with a project name of "barney". To maintain workers for barney, do this:
.fam C
.nf
.nh
.IP "" 8
work_queue_factory -T condor -M barney
.fi
.hy
.fam
.P
To maintain a maximum of 100 workers on an SGE batch system, do this:
.fam C
.nf
.nh
.IP "" 8
work_queue_factory -T sge -M barney -W 100
.fi
.hy
.fam
.P
To start workers such that each worker exits after 5 minutes (300s) of idleness:
.fam C
.nf
.nh
.IP "" 8
work_queue_factory -T condor -M barney -t 300
.fi
.hy
.fam
.P
If you want to start workers that match any project name beginning with "barney", use a regular expression:
.fam C
.nf
.nh
.IP "" 8
work_queue_factory -T condor -M "barney.*" -t 300
.fi
.hy
.fam
.P
If running on Condor, you may manually specify Condor requirements:
.fam C
.nf
.nh
.IP "" 8
work_queue_factory -T condor -M barney --condor-requirements 'MachineGroup == "disc"' --condor-requirements 'has_matlab == true'
.fi
.hy
.fam
.P
Repeated uses of \FC--condor-requirements\FT are and-ed together. The previous example will produce a statement equivalent to:
.PP
\FCrequirements = ((MachineGroup == "disc") && (has_matlab == true))\FT
.PP
To use the configuration file \fBmy_conf\fP:
.fam C
.nf
.nh
.IP "" 8
work_queue_factory -C my_conf
.fi
.hy
.fam
.P
\fBmy_conf\fP should be a proper JSON document, such as:
.fam C
.nf
.nh
.IP "" 8
{
        "manager-name": "my_manager.*",
        "max-workers": 100,
        "min-workers": 0
}
.fi
.hy
.fam
.P
Valid configuration fields are:
.fam C
.nf
.nh
.IP "" 8
manager-name
foremen-name
min-workers
max-workers
workers-per-cycle
tasks-per-worker
timeout
worker-extra-options
condor-requirements
cores
memory
disk
.fi
.hy
.fam
.P
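The resource management options can be combined with any of the above. As a sketch, to request workers with 4 cores, 4 GB of memory, and 10 GB of disk each on a SLURM cluster (the specific values here are placeholders, not recommendations):
.fam C
.nf
.nh
.IP "" 8
work_queue_factory -T slurm -M barney --cores=4 --memory=4096 --disk=10000
.fi
.hy
.fam
.P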
.SH KNOWN BUGS
.LP
The capacity measurement currently assumes single-core tasks running on single-core workers, and behaves unexpectedly with multi-core tasks or multi-core workers.
.SH COPYRIGHT
.LP
The Cooperative Computing Tools are Copyright (C) 2005-2019 The University of Notre Dame. This software is distributed under the GNU General Public License. See the file COPYING for details.
.SH SEE ALSO
.LP
.IP \(bu 4
\fBCooperative Computing Tools Documentation\fP
.IP \(bu 4
\fBWork Queue User Manual\fP
.IP \(bu 4
\fBwork_queue_worker(1)\fP, \fBwork_queue_status(1)\fP, \fBwork_queue_factory(1)\fP, \fBcondor_submit_workers(1)\fP, \fBsge_submit_workers(1)\fP, \fBtorque_submit_workers(1)\fP