NAME¶
strigger - Used to set, get or clear Slurm trigger information.
SYNOPSIS¶
strigger --set [OPTIONS...]
strigger --get [OPTIONS...]
strigger --clear [OPTIONS...]
DESCRIPTION¶
strigger is used to set, get or clear Slurm trigger information. Triggers
include events such as a node failing, a job reaching its time limit or a job
terminating. These events can cause actions such as the execution of an
arbitrary script. Typical uses include notifying system administrators of node
failures and gracefully terminating a job when its time limit is approaching.
A hostlist expression for the nodelist or job ID is passed as an argument to
the program.
Trigger events are not processed instantly, but a check is performed for trigger
events on a periodic basis (currently every 15 seconds). Any trigger events
which occur within that interval will be compared against the trigger programs
set at the end of the time interval. The trigger program will be executed once
for any event occurring in that interval. The record of those events (e.g.
nodes which went DOWN in the previous 15 seconds) will then be cleared. The
trigger program must set a new trigger before the end of the next interval to
ensure that no trigger events are missed. If desired, multiple trigger
programs can be set for the same event.
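The argument handling described above can be sketched as a small handler skeleton. This is only an illustration: the function names expand_hostlist and handle_event are hypothetical, and the sketch assumes that "scontrol show hostnames" is available to expand a hostlist expression, falling back to the raw argument when it is not. A real trigger program would also set its next trigger before doing any other work.

```shell
#!/bin/sh
# Hypothetical handler sketch; function names are illustrative only.

# Expand a hostlist expression (e.g. "tux[1-3]") into one host per line.
# Falls back to printing the argument unchanged if scontrol is unavailable.
expand_hostlist() {
    if command -v scontrol >/dev/null 2>&1; then
        scontrol show hostnames "$1" 2>/dev/null && return 0
    fi
    printf '%s\n' "$1"
}

# Handle one trigger invocation; "$1" is the argument passed by Slurm.
# A real program would re-register its trigger with strigger --set first.
handle_event() {
    expand_hostlist "$1" | while read -r host; do
        printf 'event on %s\n' "$host"
    done
}
```

Calling handle_event "$1" from the trigger program would then process each affected node separately.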
IMPORTANT NOTE: This command can only set triggers if run by the user
SlurmUser unless SlurmUser is configured as user root. This is required for the
slurmctld daemon to set the appropriate user and group IDs for the executed
program. Also note that the program is executed on the same node that the
slurmctld daemon uses rather than some allocated compute node. To check the
value of SlurmUser, run the command:
scontrol show config | grep SlurmUser
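The rule in the note above can be checked from a script before attempting to set a trigger. The following is a minimal sketch, not part of strigger itself: the helper names are hypothetical, and it assumes the configuration line has the form "SlurmUser = name(uid)", which may differ between Slurm versions.

```shell
#!/bin/sh
# Hypothetical helpers; the "SlurmUser = name(uid)" format is an assumption.

# Read "scontrol show config" output on stdin and print the SlurmUser name.
parse_slurm_user() {
    sed -n 's/^SlurmUser[[:space:]]*=[[:space:]]*\([^(]*\)(.*$/\1/p'
}

# Succeed if the given user may set triggers under the rule in the note:
# either the user is SlurmUser, or SlurmUser is configured as root.
# $1: invoking user name, $2: configured SlurmUser
may_set_triggers() {
    [ "$1" = "$2" ] || [ "$2" = "root" ]
}
```

A caller might combine the two as: may_set_triggers "$(id -un)" "$(scontrol show config | parse_slurm_user)".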
ARGUMENTS¶
- -a, --primary_slurmctld_failure
- Trigger an event when the primary slurmctld fails.
- -A, --primary_slurmctld_resumed_operation
- Trigger an event when the primary slurmctld resumes
operation after failure.
- -b, --primary_slurmctld_resumed_control
- Trigger an event when primary slurmctld resumes control.
- --block_err
- Trigger an event when a BlueGene block enters an ERROR
state.
- -B, --backup_slurmctld_failure
- Trigger an event when the backup slurmctld fails.
- -c, --backup_slurmctld_resumed_operation
- Trigger an event when the backup slurmctld resumes
operation after failure.
- -C, --backup_slurmctld_assumed_control
- Trigger an event when the backup slurmctld assumes control.
- --clear
- Clear or delete a previously defined event trigger. The
--id, --jobid or --user option must be specified to
identify the trigger(s) to be cleared.
- -d, --down
- Trigger an event if the specified node goes into a DOWN
state.
- -D, --drained
- Trigger an event if the specified node goes into a DRAINED
state.
- -e, --primary_slurmctld_acct_buffer_full
- Trigger an event when primary slurmctld accounting buffer
is full.
- -F, --fail
- Trigger an event if the specified node goes into a FAILING
state.
- -f, --fini
- Trigger an event when the specified job completes
execution.
- --front_end
- Trigger events based upon changes in state of front end
nodes rather than compute nodes. Applies to BlueGene and Cray
architectures only, where the slurmd daemon executes on front end nodes
rather than the compute nodes. Use this option with either the --up
or --down option.
- -g, --primary_slurmdbd_failure
- Trigger an event when the primary slurmdbd fails.
- -G, --primary_slurmdbd_resumed_operation
- Trigger an event when the primary slurmdbd resumes
operation after failure.
- --get
- Show registered event triggers. Options can be used for
filtering purposes.
- -h, --primary_database_failure
- Trigger an event when the primary database fails.
- -H, --primary_database_resumed_operation
- Trigger an event when the primary database resumes
operation after failure.
- -i, --id=id
- Trigger ID number.
- -I, --idle
- Trigger an event if the specified node remains in an IDLE
state for at least the time period specified by the --offset
option. This can be useful to hibernate a node that remains idle, thus
reducing power consumption.
- -j, --jobid=id
- Job ID of interest. NOTE: The --jobid option
cannot be used in conjunction with the --node option. When the
--jobid option is used in conjunction with the --up or
--down option, all nodes allocated to that job will be considered the
nodes used as a trigger event.
- -n, --node[=host]
- Host name(s) of interest. By default, all nodes associated
with the job (if --jobid is specified) or on the system are
considered for event triggers. NOTE: The --node option cannot
be used in conjunction with the --jobid option. When the
--jobid option is used in conjunction with the --up,
--down or --drained option, all nodes allocated to that job
will be considered the nodes used as a trigger event.
- -M, --clusters=<string>
- Clusters to issue commands to.
- -o, --offset=seconds
- The specified action should follow the event by this time
interval. Specify a negative value if the action should precede the event.
The default value is zero if no --offset option is specified. The
resolution of this time is about 20 seconds, so to execute a script not
less than five minutes prior to a job reaching its time limit, specify
--offset=-320 (5 minutes plus 20 seconds, negated).
- -p, --program=path
- Execute the program at the specified fully qualified
pathname when the event occurs. The program will be executed as the user
who sets the trigger. If the program fails to terminate within 5 minutes,
it will be killed along with any spawned processes.
- -Q, --quiet
- Do not report non-fatal errors. This can be useful to clear
triggers which may have already been purged.
- -r, --reconfig
- Trigger an event when the system configuration changes.
- --set
- Register an event trigger based upon the supplied options.
NOTE: An event is only triggered once. A new event trigger must be set
for future events of the same type to be processed.
- -t, --time
- Trigger an event when the specified job's time limit is
reached. This must be used in conjunction with the --jobid option.
- -u, --up
- Trigger an event if the specified node is returned to
service from a DOWN state.
- --user=user_name_or_id
- Clear or get triggers associated with the specified user.
Specify either a user name or user ID.
- -v, --verbose
- Print detailed event logging. This includes time-stamps on
data structures, record counts, etc.
- -V, --version
- Print version information and exit.
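The arithmetic behind the --offset option can be wrapped in a small helper. This is a sketch under stated assumptions: the function name offset_before_event is hypothetical, and the 20 second padding is taken from the resolution quoted in the --offset description rather than from any guarantee.

```shell
#!/bin/sh
# Hypothetical helper; the 20 second pad mirrors the documented resolution.

# Convert "N minutes before the event" into an --offset value in seconds.
# The result is negative because the action must precede the event.
offset_before_event() {
    minutes=$1
    echo "-$(( minutes * 60 + 20 ))"
}
```

For instance, --offset="$(offset_before_event 5)" could be passed together with --jobid and --time to run a program no less than five minutes before a job reaches its time limit.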
OUTPUT FIELD DESCRIPTIONS¶
- TRIG_ID
- Trigger ID number.
- RES_TYPE
- Resource type: job or node
- RES_ID
- Resource ID: job ID or host names or "*" for any
host
- TYPE
- Trigger type: time or fini (for jobs only),
down or up (for jobs or nodes), or drained,
idle or reconfig (for nodes only)
- OFFSET
- Time offset in seconds. Negative numbers indicate the
action should occur before the event (if possible)
- USER
- Name of the user requesting the action
- PROGRAM
- Pathname of the program to execute when the event occurs
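The fields above can be picked apart with standard text tools. The sketch below assumes that the output of strigger --get is a header line followed by whitespace-separated columns in the order listed above, and that program pathnames contain no spaces. The function name trigger_ids_by_type is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper; assumes whitespace-separated columns in the order
# TRIG_ID RES_TYPE RES_ID TYPE OFFSET USER PROGRAM with one header line.

# Read trigger listing on stdin and print the TRIG_ID of every row
# whose TYPE column equals $1 (e.g. "down" or "time").
trigger_ids_by_type() {
    awk -v type="$1" 'NR > 1 && $4 == type { print $1 }'
}
```

Piping strigger --get into trigger_ids_by_type down would list the IDs of all node-down triggers, which could then be handed to strigger --clear --id one at a time.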
EXAMPLES¶
Execute the program "/usr/sbin/primary_slurmctld_failure" whenever the
primary slurmctld fails.
> cat /usr/sbin/primary_slurmctld_failure
#!/bin/bash
# Submit trigger for next primary slurmctld failure event
strigger --set --primary_slurmctld_failure \
--program=/usr/sbin/primary_slurmctld_failure
# Notify the administrator of the failure by e-mail
/usr/bin/mail slurm_admin@site.com -s Primary_SLURMCTLD_FAILURE
> strigger --set --primary_slurmctld_failure \
--program=/usr/sbin/primary_slurmctld_failure
Execute the program "/usr/sbin/slurm_admin_notify" whenever any node
in the cluster goes down. The subject line will include the node names which
have entered the down state (passed as an argument to the script by SLURM).
> cat /usr/sbin/slurm_admin_notify
#!/bin/bash
# Submit trigger for next event
strigger --set --node --down \
--program=/usr/sbin/slurm_admin_notify
# Notify the administrator by e-mail
/usr/bin/mail slurm_admin@site.com -s NodesDown:$*
> strigger --set --node --down \
--program=/usr/sbin/slurm_admin_notify
Execute the program "/usr/sbin/slurm_suspend_node" whenever any node
in the cluster remains in the idle state for at least 600 seconds.
> strigger --set --node --idle --offset=600 \
--program=/usr/sbin/slurm_suspend_node
Execute the program "/home/joe/clean_up" when job 1234 is within 10
minutes of reaching its time limit.
> strigger --set --jobid=1234 --time --offset=-600 \
--program=/home/joe/clean_up
Execute the program "/home/joe/node_died" when any node allocated to
job 1234 enters the DOWN state.
> strigger --set --jobid=1234 --down \
--program=/home/joe/node_died
Show all triggers associated with job 1235.
> strigger --get --jobid=1235
TRIG_ID RES_TYPE RES_ID TYPE OFFSET USER PROGRAM
123 job 1235 time -600 joe /home/bob/clean_up
125 job 1235 down 0 joe /home/bob/node_died
Delete event trigger 125.
> strigger --clear --id=125
Execute /home/joe/job_fini upon completion of job 1237.
> strigger --set --jobid=1237 --fini --program=/home/joe/job_fini
COPYING¶
Copyright (C) 2007 The Regents of the University of California. Copyright (C)
2008-2010 Lawrence Livermore National Security. Produced at Lawrence Livermore
National Laboratory (cf, DISCLAIMER). CODE-OCEC-09-009. All rights reserved.
This file is part of SLURM, a resource management program. For details, see
<http://www.schedmd.com/slurmdocs/>.
SLURM is free software; you can redistribute it and/or modify it under the terms
of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.
SLURM is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
A PARTICULAR PURPOSE. See the GNU General Public License for more details.
SEE ALSO¶
scontrol(1), sinfo(1), squeue(1)