|work_queue_factory(1)||Cooperative Computing Tools||work_queue_factory(1)|
work_queue_factory - maintain a pool of Work Queue workers on a batch system.
work_queue_factory -M <project-name> -T <batch-type> [options]
work_queue_factory submits and maintains a number of work_queue_worker(1) processes on various batch systems, such as Condor and SGE. All the workers managed by a work_queue_factory process will be directed to work for a specific master, or any set of masters matching a given project name. work_queue_factory will automatically determine the correct number of workers to have running, based on criteria set on the command line. The decision on how many workers to run is reconsidered once per minute.
By default, work_queue_factory will run as many workers as the indicated masters have tasks ready to run. If there are multiple masters, then enough workers will be started to satisfy their collective needs. For example, if there are two masters with the same project name, each with 10 tasks to run, then work_queue_factory will start a total of 20 workers.
If the number of needed workers increases, work_queue_factory will submit more workers to meet the desired need. However, it will not run more than a fixed maximum number of workers, given by the -W option.
If the need for workers drops, work_queue_factory does not remove them immediately, but waits for them to exit on their own. (This happens when a worker has been idle for a certain time.) A minimum number of workers will be maintained, given by the -w option.
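The sizing rule described above can be sketched as a small shell function. This is only an illustration of the arithmetic, not the factory's actual implementation: desired_workers is a hypothetical helper, and it assumes the target is simply the sum of ready tasks across masters, clamped to the limits given by -w and -W.

```shell
# desired_workers MIN MAX TASKS...
# Hypothetical helper (not part of CCTools): sums the ready tasks
# reported by each master, then clamps the total to [MIN, MAX],
# mirroring the -w and -W limits.
desired_workers() {
    min=$1
    max=$2
    shift 2
    total=0
    for tasks in "$@"; do
        total=$((total + tasks))
    done
    if [ "$total" -lt "$min" ]; then total=$min; fi
    if [ "$total" -gt "$max" ]; then total=$max; fi
    echo "$total"
}

# Two masters with 10 ready tasks each, default limits of 5 and 100.
desired_workers 5 100 10 10    # prints 20
```

Note that this only computes a target. As described above, when the target falls below the number of running workers, the surplus workers are left to time out on their own.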
If given the -c option, then work_queue_factory will consider the capacity reported by each master. The capacity is the estimated number of workers that the master thinks it can handle, based on the task execution and data transfer times currently observed at the master. With the -c option on, work_queue_factory will consider the master's capacity to be the maximum number of workers to run.
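As a rough sketch of the effect of -c, the need attributed to a master can be thought of as its ready tasks limited by its reported capacity. The helper name capped_need is hypothetical, and applying the cap per master is an assumption made here for illustration.

```shell
# capped_need TASKS CAPACITY
# Hypothetical helper (not part of CCTools): with -c in effect, a
# master's reported capacity acts as an upper bound on the workers
# run on its behalf.
capped_need() {
    tasks=$1
    capacity=$2
    if [ "$tasks" -gt "$capacity" ]; then
        echo "$capacity"
    else
        echo "$tasks"
    fi
}

# A master with 50 ready tasks that estimates it can handle 12 workers.
capped_need 50 12    # prints 12
```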
If work_queue_factory receives a terminating signal, it will attempt to remove all running workers before exiting.
- -M, --master-name=<project>
- Project name of masters to serve, can be a regular expression.
- -F, --foremen-name=<project>
- Foremen to serve, can be a regular expression.
- --catalog <catalog>
- Catalog server to query for masters (default: catalog.cse.nd.edu,backup-catalog.cse.nd.edu:9097).
- -T, --batch-type=<type>
- Batch system type (required). One of: local, wq, condor, sge, torque, mesos, k8s, moab, slurm, chirp, amazon, lambda, dryrun, amazon-batch
- -B, --batch-options=<options>
- Add these options to all batch submit files.
- -P, --password=<file>
- Password file for workers to authenticate to master.
- -C, --config-file=<file>
- Use the configuration file <file>.
- -w, --min-workers=<workers>
- Minimum workers running. (default=5)
- -W, --max-workers=<workers>
- Maximum workers running. (default=100)
- --workers-per-cycle <workers>
- Maximum number of new workers per 30 seconds (less than 1 disables the limit, default=5).
- --tasks-per-worker <workers>
- Average tasks per worker (default=one task per core).
- -t, --timeout=<time>
- Workers abort after this amount of idle time (default=300).
- --env <variable=value>
- Environment variable that should be added to the worker (May be specified multiple times).
- -E, --extra-options=<options>
- Extra options that should be added to the worker.
- --cores <n>
- Set the number of cores requested per worker.
- --gpus <n>
- Set the number of GPUs requested per worker.
- --memory <mb>
- Set the amount of memory (in MB) requested per worker.
- --disk <mb>
- Set the amount of disk (in MB) requested per worker.
- Automatically size a worker to an available slot (Condor, Mesos, and Kubernetes).
- Set requirements for the workers as Condor jobs with --condor-requirements. May be specified several times, with the expressions and-ed together (Condor only).
- --factory-timeout <n>
- Exit after no master has been seen in <n> seconds.
- -S, --scratch-dir=<file>
- Use this scratch dir for temporary files (default is /tmp/wq-pool-$uid).
- -c, --capacity
- Use worker capacity reported by masters.
- -d, --debug=<subsystem>
- Enable debugging for this subsystem.
- Specify Amazon config file (for use with -T amazon).
- Wrap factory with this command prefix.
- Add this input file needed by the wrapper.
- --mesos-master <hostname>
- Specify the host name of the Mesos master node (for use with -T mesos).
- --mesos-path <filepath>
- Specify path to mesos python library (for use with -T mesos).
- --mesos-preload <library>
- Specify the linking libraries for running mesos (for use with -T mesos).
- Specify the container image for using Kubernetes (for use with -T k8s).
- Specify the container image that contains work_queue_worker, for use with Kubernetes (-T k8s).
- -o, --debug-file=<file>
- Send debugging to this file (can also be :stderr, :stdout, :syslog, or :journal).
- -O, --debug-file-size=<mb>
- Specify the size of the debug file (must use with -o option).
- --worker-binary <file>
- Specify the binary to use for the worker (relative or absolute path). It should accept the same arguments as the default work_queue_worker.
- --runos <img>
- Will make a best attempt to ensure the worker will execute in the specified OS environment, regardless of the underlying OS.
- Force factory to run itself as a work queue master.
- -v, --version
- Show the version string.
- -h, --help
- Show this screen.
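The --workers-per-cycle limit listed above bounds how quickly a pool can grow. A back-of-the-envelope estimate of the number of submission cycles needed to reach a target follows. ramp_cycles is a hypothetical helper, and the sketch assumes the pool starts empty and that every submitted worker actually starts.

```shell
# ramp_cycles TARGET PER_CYCLE
# Hypothetical helper (not part of CCTools): number of submission
# cycles needed to reach TARGET workers when at most PER_CYCLE new
# workers are submitted per cycle (ceiling division).
ramp_cycles() {
    target=$1
    per_cycle=$2
    if [ "$per_cycle" -lt 1 ]; then
        # A value below 1 disables the limit: one cycle suffices.
        echo 1
        return
    fi
    echo $(( (target + per_cycle - 1) / per_cycle ))
}

# Default limit of 5 new workers per cycle, aiming for 100 workers.
ramp_cycles 100 5    # prints 20
```

Under these assumptions, raising --workers-per-cycle shortens the ramp-up at the cost of larger bursts of submissions to the batch system.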
On success, returns zero. On failure, returns non-zero.
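A wrapper script can act on this exit status. The following is a minimal sketch: run_and_report is a hypothetical helper that runs whatever command it is given and reports whether it returned zero.

```shell
# run_and_report COMMAND [ARGS...]
# Hypothetical helper: runs the command, reports on its exit status,
# and passes that status back to the caller.
run_and_report() {
    "$@"
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "factory exited cleanly"
    else
        echo "factory failed with status $status" >&2
    fi
    return "$status"
}

# Usage (not executed here):
#   run_and_report work_queue_factory -T condor -M barney
```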
Suppose you have a Work Queue master with a project name of "barney". To maintain workers for barney, do this:
work_queue_factory -T condor -M barney
To maintain a maximum of 100 workers on an SGE batch system, do this:
work_queue_factory -T sge -M barney -W 100
To start workers such that the workers exit after 5 minutes (300s) of idleness:
work_queue_factory -T condor -M barney -t 300
If you want to start workers that match any project that begins with barney, use a regular expression:
work_queue_factory -T condor -M 'barney.*' -t 300
If running on condor, you may manually specify condor requirements:
work_queue_factory -T condor -M barney --condor-requirements 'MachineGroup == "disc"' --condor-requirements 'has_matlab == true'
Repeated uses of condor-requirements are and-ed together. The previous example will produce a statement equivalent to:
requirements = ((MachineGroup == "disc") && (has_matlab == true))
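The and-ing of repeated expressions can be reproduced with a short shell function, which may help when checking what a given set of options will produce. join_requirements is a hypothetical helper written for illustration. It only mimics the format of the statement shown above and is not taken from the factory's source.

```shell
# join_requirements EXPR...
# Hypothetical helper (not part of CCTools): wraps each expression in
# parentheses, joins them with &&, and prints the result as a
# requirements statement.
join_requirements() {
    joined=""
    for expr in "$@"; do
        if [ -z "$joined" ]; then
            joined="($expr)"
        else
            joined="$joined && ($expr)"
        fi
    done
    printf 'requirements = (%s)\n' "$joined"
}

join_requirements 'MachineGroup == "disc"' 'has_matlab == true'
# prints: requirements = ((MachineGroup == "disc") && (has_matlab == true))
```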
Use the configuration file my_conf:
work_queue_factory -T condor -C my_conf
my_conf should be a proper JSON document, as:
{ "min-workers": 0 }
Valid configuration fields are:
master-name foremen-name min-workers max-workers workers-per-cycle task-per-worker timeout worker-extra-options condor-requirements cores memory disk
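A fuller configuration file might be generated as in the sketch below. The field names come from the list above, but the values and their JSON types (a string for master-name, integers for the rest) are assumptions made for illustration, and write_factory_config is a hypothetical helper.

```shell
# write_factory_config FILE
# Hypothetical helper: writes an example JSON configuration to FILE.
# The values and their types are illustrative assumptions.
write_factory_config() {
    cat > "$1" <<'EOF'
{
    "master-name": "barney.*",
    "min-workers": 0,
    "max-workers": 100,
    "timeout": 300
}
EOF
}

config_file="${TMPDIR:-/tmp}/my_conf.$$"
write_factory_config "$config_file"
cat "$config_file"
```

The resulting file can then be passed to the factory with the -C option, for example work_queue_factory -T condor -C "$config_file".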
The capacity measurement currently assumes single-core tasks running on single-core workers, and behaves unexpectedly with multi-core tasks or multi-core workers.
The Cooperative Computing Tools are Copyright (C) 2005-2019 The University of Notre Dame. This software is distributed under the GNU General Public License. See the file COPYING for details.
|CCTools 7.1.2 FINAL|