NAME¶
sched_conf - Sun Grid Engine default scheduler configuration file
DESCRIPTION¶
sched_conf defines the configuration file format for Sun Grid Engine's
scheduler. To modify the configuration, use the graphical user interface or
the
-msconf option of the qconf command. A default configuration
is provided together with the Sun Grid Engine distribution package.
Note that Sun Grid Engine allows backslashes (\) to be used to escape newline
(\newline) characters. The backslash and the newline are replaced with a space
(" ") character before any interpretation.
The following parameters are recognized by the Sun Grid Engine scheduler if
present in
sched_conf:
algorithm¶
Note: Deprecated, may be removed in future release.
Allows for the selection of alternative scheduling algorithms.
Currently
default is the only allowed setting.
load_formula¶
A simple algebraic expression used to derive a single weighted load value from
all or part of the load parameters reported for each host and from all or
part of the consumable resources being maintained for each host. The load
formula expression syntax is that of a summation of weighted load values, that
is:
{w1|load_val1[*w1]}[{+|-}{w2|load_val2[*w2]}[{+|-}...]]
Note, no blanks are allowed in the load formula.
The load values and consumable resources (load_val1, ...) are specified by the
name defined in the complex.
Note: Administrator defined load values (see the
load_sensor
parameter) and consumable resources available for all hosts
may be used as well as Sun Grid Engine default load parameters.
The weighting factors (w1, ...) are positive integers. After the expression is
evaluated for each host, the results are assigned to the hosts and are used to
sort the hosts according to the weighted load. The sorted host list is
subsequently used to sort queues.
The default load formula is "np_load_avg".
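The evaluation and host sorting described above can be illustrated with a minimal Python sketch. It is not the scheduler's implementation: the function names, the host data and the load value name mem_used are hypothetical, and ascending order (least loaded host first) is an assumption.

```python
import re


def evaluate_load_formula(formula, load_values):
    """Evaluate a formula such as 'np_load_avg+mem_used*2' for one host.

    load_values maps load value names to numbers for that host.
    """
    if any(ch.isspace() for ch in formula):
        raise ValueError("no blanks are allowed in the load formula")
    # Split into signed terms, e.g. ['np_load_avg', '+mem_used*2', '-3'].
    terms = re.findall(r"[+-]?[^+-]+", formula)
    if not terms or "".join(terms) != formula:
        raise ValueError("malformed load formula: %r" % formula)
    total = 0.0
    for term in terms:
        sign = -1.0 if term[0] == "-" else 1.0
        body = term.lstrip("+-")
        name, _, weight = body.partition("*")
        # A term is either a bare integer or a load value name.
        value = float(name) if name.isdigit() else float(load_values[name])
        factor = int(weight) if weight else 1  # weights are integers
        total += sign * value * factor
    return total


def sort_hosts(formula, hosts):
    """Return host names ordered by their weighted load (assumed ascending)."""
    return sorted(hosts, key=lambda h: evaluate_load_formula(formula, hosts[h]))


# Hypothetical load reports for two hosts.
hosts = {
    "hostA": {"np_load_avg": 0.5, "mem_used": 2.0},
    "hostB": {"np_load_avg": 0.2, "mem_used": 4.0},
}
print(sort_hosts("np_load_avg", hosts))
print(sort_hosts("np_load_avg+mem_used*2", hosts))
```

With the default formula only np_load_avg decides the order; adding a weighted second term can reverse it.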
job_load_adjustments¶
The load imposed by the Sun Grid Engine jobs running on a system
varies over time and often, e.g. for the CPU load, requires some amount of time
to be reported in the appropriate quantity by the operating system.
Consequently, if a job was started very recently, the reported load may not
provide a sufficient representation of the load which is already imposed on
that host by the job. The reported load will adapt to the real load over time,
but the period of time, in which the reported load is too low, may already
lead to an oversubscription of that host. Sun Grid Engine allows the
administrator to specify
job_load_adjustments which are used in the Sun
Grid Engine scheduler to compensate for this problem.
The
job_load_adjustments are specified as a comma separated list of
arbitrary load parameters or consumable resources and (separated by an equal
sign) an associated load correction value. Whenever a job is dispatched to a
host by the scheduler, the load parameter and consumable value set of that
host is increased by the values provided in the
job_load_adjustments
list. These correction values are decayed linearly over time until,
load_adjustment_decay_time after the job start, the corrections reach the
value 0. If the
job_load_adjustments list is assigned the special
value NONE, no load corrections are performed.
The adjusted load and consumable values are used to compute the combined and
weighted load of the hosts with the
load_formula (see above) and to
compare the load and consumable values against the load threshold lists
defined in the queue configurations. If the
load_formula consists
simply of the default CPU load average parameter
np_load_avg, and if
the jobs are very compute intensive, one might want to set the
job_load_adjustments list to
np_load_avg=1.00, which means that
every new job dispatched to a host will require 100 % CPU time, and thus the
machine's load is instantly increased by 1.00.
load_adjustment_decay_time¶
The load corrections in the "
job_load_adjustments" list above
are decayed linearly over time from the point of the job start, where the
corresponding load or consumable parameter is raised by the full correction
value, until after a time period of "
load_adjustment_decay_time", where the correction becomes 0.
Proper values for "
load_adjustment_decay_time" greatly
depend upon the load or consumable parameters used and the specific operating
system(s). Therefore, they can only be determined on-site and experimentally.
For the default
np_load_avg load parameter a "
load_adjustment_decay_time" of 7 minutes has proven to yield
reasonable results.
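The linear decay described above can be written as a small Python sketch. The function name and arguments are hypothetical; only the rule itself (full correction at job start, 0 after load_adjustment_decay_time) is taken from the text.

```python
def adjusted_load(reported, correction, elapsed, decay_time):
    """Return the reported load plus the linearly decayed correction.

    reported   -- load value reported for the host
    correction -- value from job_load_adjustments (e.g. 1.00)
    elapsed    -- seconds since the job was started
    decay_time -- load_adjustment_decay_time in seconds
    """
    if decay_time <= 0 or elapsed >= decay_time:
        return reported  # the correction has fully decayed
    remaining = 1.0 - max(elapsed, 0) / decay_time
    return reported + correction * remaining


# np_load_avg=1.00 with a decay time of 7 minutes (420 seconds).
for seconds in (0, 210, 420):
    print(seconds, adjusted_load(0.3, 1.00, seconds, 420))
```

Halfway through the decay time, half of the correction is still applied; once the decay time has passed, only the reported load remains.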
maxujobs¶
The maximum number of jobs any user may have running in a Sun Grid Engine
cluster at the same time. If set to 0 (default) the users may run an arbitrary
number of jobs.
schedule_interval¶
At the time the scheduler thread initially registers at the event master thread,
schedule_interval is used to set the time interval in which the
event master thread sends scheduling event updates to the scheduler thread. A
scheduling event is a status change that may trigger
or affect scheduler decisions (e.g. a job has finished and thus the allocated
resources are available again).
In the Sun Grid Engine default scheduler the arrival of a scheduling event
report triggers a scheduler run. The scheduler waits for event reports
otherwise.
schedule_interval is a time value.
queue_sort_method¶
This parameter determines the order in which several criteria are taken into
account to produce a sorted queue list. Currently, two settings are valid:
seqno and
load. In both cases, however, Sun Grid Engine attempts
to maximize the number of soft requests (see
-s option) being fulfilled
by the queues for a particular job as the primary criterion.
Then, if the
queue_sort_method parameter is set to
seqno, Sun Grid
Engine will use the
seq_no parameter as configured in the current queue
configurations as the next criterion to sort the queue list. The
load_formula (see above) is only relevant if two queues have equal
sequence numbers. If
queue_sort_method is set to
load, the load
according to the
load_formula is the criterion after maximizing a job's
soft requests and the sequence number is only used if two hosts have the same
load. The sequence number sorting is most useful if you want to define a fixed
order in which queues are to be filled (e.g. the cheapest resource first).
The default for this parameter is
load.
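The two orderings can be expressed as sort keys in a short Python sketch. The dictionary fields (soft_fulfilled, seq_no, load) and the queue data are hypothetical, and ascending order for sequence numbers and load is an assumption.

```python
def sort_queues(queues, method="load"):
    """Order queues as described for queue_sort_method.

    Each queue is a dict with the hypothetical keys 'soft_fulfilled'
    (number of soft requests it satisfies), 'seq_no' and 'load'
    (result of the load formula for the queue's host).
    """
    if method == "seqno":
        key = lambda q: (-q["soft_fulfilled"], q["seq_no"], q["load"])
    elif method == "load":
        key = lambda q: (-q["soft_fulfilled"], q["load"], q["seq_no"])
    else:
        raise ValueError("queue_sort_method must be 'seqno' or 'load'")
    return sorted(queues, key=key)


queues = [
    {"name": "q1", "soft_fulfilled": 1, "seq_no": 20, "load": 0.1},
    {"name": "q2", "soft_fulfilled": 1, "seq_no": 10, "load": 0.9},
    {"name": "q3", "soft_fulfilled": 2, "seq_no": 30, "load": 0.5},
]
print([q["name"] for q in sort_queues(queues, "seqno")])
print([q["name"] for q in sort_queues(queues, "load")])
```

In both cases q3 comes first because it fulfills the most soft requests; the remaining queues are ordered by sequence number or by load depending on the setting.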
halftime¶
When executing under a share based policy, the scheduler "ages" (i.e.
decreases) usage to implement a sliding window for achieving the share
entitlements as defined by the share tree. The
halftime defines the
time interval in which accumulated usage will have been decayed to half its
original value. Valid values are specified in hours or in the Sun Grid Engine
time format.
If the value is set to 0, the usage is not decayed.
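A half-life of this kind can be illustrated with the following Python sketch. It assumes a continuous exponential decay, which matches the definition above at multiples of the halftime; how the scheduler applies the decay between those points is not stated here, and the function name is hypothetical.

```python
def decayed_usage(usage, elapsed_hours, halftime_hours):
    """Return usage after aging it for elapsed_hours.

    A halftime of 0 means the usage is not decayed.
    """
    if halftime_hours == 0:
        return usage
    return usage * 0.5 ** (elapsed_hours / halftime_hours)


# With a halftime of 168 hours, usage halves every week.
print(decayed_usage(1000.0, 168, 168))
print(decayed_usage(1000.0, 336, 168))
```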
usage_weight_list¶
Sun Grid Engine accounts for the consumption of the resources CPU-time, memory
and IO to determine the usage which is imposed on a system by a job. A single
usage value is computed from these three input parameters by multiplying the
individual values by weights and adding them up. The weights are defined in
the
usage_weight_list. The format of the list is
cpu=wcpu,mem=wmem,io=wio
where wcpu, wmem and wio are the configurable weights. The weights are real
numbers. The sum of all three weights should be 1.
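As a worked example of the weighted sum, the following Python sketch combines the three usage values. The function name and the sample numbers are hypothetical.

```python
def combined_usage(cpu, mem, io, wcpu, wmem, wio):
    """Combine CPU, memory and IO usage into a single usage value."""
    return cpu * wcpu + mem * wmem + io * wio


# Weights chosen so that they sum to 1.
print(combined_usage(100.0, 50.0, 10.0, wcpu=0.5, wmem=0.25, wio=0.25))
```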
compensation_factor¶
Determines how fast Sun Grid Engine should compensate for past usage below or
above the share entitlement defined in the share tree. Recommended values are
between 2 and 10, where 10 means faster compensation.
weight_user¶
The relative importance of the user shares in the functional policy. Values are
of type real.
weight_project¶
The relative importance of the project shares in the functional policy. Values
are of type real.
weight_department¶
The relative importance of the department shares in the functional policy.
Values are of type real.
weight_job¶
The relative importance of the job shares in the functional policy. Values are
of type real.
weight_tickets_functional¶
The maximum number of functional tickets available for distribution by Sun Grid
Engine. Determines the relative importance of the functional policy. See under
for an overview on job priorities.
weight_tickets_share¶
The maximum number of share based tickets available for distribution by Sun Grid
Engine. Determines the relative importance of the share tree policy. See under
for an overview on job priorities.
weight_deadline¶
The weight applied to the remaining time until a job's latest start time.
Determines the relative importance of the deadline. See under for an overview
on job priorities.
weight_waiting_time¶
The weight applied to a job's waiting time since submission. Determines the
relative importance of the waiting time. See under for an overview on job
priorities.
weight_urgency¶
The weight applied to a job's normalized urgency when determining the final
priority. Determines the relative importance of urgency. See under for an overview
on job priorities.
weight_priority¶
The weight applied to a job's normalized POSIX priority when determining the
final priority. Determines the relative importance of POSIX priority. See under
for an overview on job priorities.
weight_ticket¶
The weight applied to the normalized ticket amount when determining the final
priority. Determines the relative importance of the ticket policies. See under for
an overview on job priorities.
flush_finish_sec¶
This parameter is provided for tuning the system's scheduling behavior. By
default, a scheduler run is triggered once per schedule interval. When this
parameter is set to 1 or larger, a scheduler run is triggered that many seconds
after a job has finished. Setting this parameter to 0 disables the flush after
a job has finished.
flush_submit_sec¶
This parameter is provided for tuning the system's scheduling behavior. By
default, a scheduler run is triggered once per schedule interval. When this
parameter is set to 1 or larger, a scheduler run is triggered that many seconds
after a job was submitted to the system. Setting this parameter to 0 disables
the flush after a job was submitted.
schedd_job_info¶
The default scheduler can keep track of why jobs could not be scheduled during
the last scheduler run. This parameter enables or disables the observation. The
value
true enables the monitoring;
false turns it off.
It is also possible to activate the observation only for certain jobs. This will
be done if the parameter is set to
job_list followed by a comma
separated list of job ids.
The user can obtain the collected information with the command qstat -j.
params¶
This parameter is intended for passing additional parameters to the Sun Grid Engine
scheduler. The following values are recognized:
- DURATION_OFFSET
- If set, overrides the default value of 60 seconds. This
parameter is used by the Sun Grid Engine scheduler when planning resource
utilization as the delta between net job runtimes and total time until
resources become available again. Net job runtime as specified with -l
h_rt=... or -l s_rt=... or default_duration always differs from
total job runtime due to delays before and after actual job start and
finish. Among the delays before job start are the time until the end of a
schedule_interval, the time it takes to deliver a job, and the
delays caused by prolog, start_proc_args and
starter_method. The delays after job finish include delays due
to a forced job termination (notify, terminate_method or
checkpointing), procedures run after actual job finish, such as
stop_proc_args or epilog, and the delay until a new
schedule_interval.
If the offset is too low, resource reservations (see max_reservation)
can be delayed repeatedly due to an overly optimistic job circulation
time.
- JC_FILTER
- Note: Deprecated, may be removed in future release.
If set to true, the scheduler limits the number of jobs it looks at during a
scheduling run. At the beginning of the scheduling run it assigns each job
a specific category, which is based on the job's requests, priority
settings, and the job owner. All scheduling policies will assign the same
importance to each job in one category. Therefore the jobs in each
category have a FIFO order and their number can be limited to the number of
free slots in the system.
An exception is jobs that request a resource reservation. They are
included regardless of the number of jobs in a category.
This setting is turned off by default because, in very rare cases, the
scheduler can make a wrong decision. It is also advised to turn
report_pjob_tickets off. Otherwise qstat -ext can report outdated ticket
amounts. The information shown with a qstat -j for a job, that was
excluded in a scheduling run, is very limited.
- PROFILE
- If set equal to 1, the scheduler logs profiling information
summarizing each scheduling run.
- MONITOR
- If set equal to 1, the scheduler records information for
each scheduling run, allowing job resource utilization to be reproduced, in the
file <sge_root>/<cell>/common/schedule.
- PE_RANGE_ALG
- This parameter sets the algorithm for the pe range
computation. The default is automatic, which means that the scheduler will
select the best one, and it should not be necessary to change it to a
different setting in normal operation. If a custom setting is needed, the
following values are available:
auto : the scheduler selects the best algorithm
least : starts the resource matching with the lowest slot amount first
bin : starts the resource matching in the middle of the pe slot range
highest : starts the resource matching with the highest slot amount
first
Changing
params will take immediate effect. The default for
params
is none.
reprioritize_interval¶
Interval (HH:MM:SS) to reprioritize jobs on the execution hosts based on the
current ticket amount for the running jobs. If the interval is set to 00:00:00
the reprioritization is turned off. The default value is 00:00:00. The
reprioritization tickets are calculated by the scheduler and update events for
running jobs are only sent after the scheduler has calculated new values. How
often the scheduler should calculate the tickets is defined by the
reprioritize_interval. Because the scheduler is only triggered in a specific
interval (schedule_interval), the reprioritize_interval is only
meaningful if set greater than the schedule_interval. For example, if the
schedule_interval is 2 minutes and reprioritize_interval is set to 10
seconds, the jobs get re-prioritized every 2 minutes.
report_pjob_tickets¶
This parameter allows tuning of the system's scheduling run time. It is used to
enable / disable the reporting of pending job tickets to the qmaster. It does
not influence the tickets calculation. When the reporting is turned off, the
sort order of jobs in qstat and qmon is based only on the submit time.
The reporting should be turned off in a system with a very large number of jobs
by setting this parameter to "false".
halflife_decay_list¶
The halflife_decay_list allows configuring different decay rates for the
"finished_jobs" usage types, which are used in the pending job ticket
calculation to account for jobs which have just ended. This allows
the pending jobs algorithm to count finished jobs against a user or project
for a configurable decayed time period. This feature is turned off by default,
and the halftime is used instead.
The halflife_decay_list also allows one to configure different decay rates for
each usage type being tracked (cpu, io, and mem). The list is specified in the
following format:
<USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>[:<USAGE_TYPE>=<TIME>]]
<USAGE_TYPE> can be one of the following: cpu, io, or mem.
<TIME> can be -1, 0 or a timespan specified in minutes. If <TIME> is
-1, only the usage of currently running jobs is used. 0 means that the usage
is not decayed.
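The list format above can be parsed with a few lines of Python. This is an illustrative sketch: the function name is hypothetical, and treating <TIME> as a whole number of minutes and rejecting repeated usage types are assumptions.

```python
VALID_USAGE_TYPES = {"cpu", "io", "mem"}


def parse_halflife_decay_list(text):
    """Parse '<USAGE_TYPE>=<TIME>[:...]' into a dict of minutes.

    -1 means only the usage of currently running jobs is used,
    0 means the usage is not decayed.
    """
    result = {}
    for entry in text.split(":"):
        usage_type, sep, time_text = entry.partition("=")
        if not sep or usage_type not in VALID_USAGE_TYPES:
            raise ValueError("invalid entry: %r" % entry)
        if usage_type in result:
            raise ValueError("duplicate usage type: %r" % usage_type)
        minutes = int(time_text)  # assumed to be whole minutes
        if minutes < -1:
            raise ValueError("invalid time: %r" % time_text)
        result[usage_type] = minutes
    return result


print(parse_halflife_decay_list("cpu=30:mem=-1:io=0"))
```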
policy_hierarchy¶
This parameter sets up a dependency chain of ticket based policies. Each ticket
based policy in the dependency chain is influenced by the previous policies
and influences the following policies. A typical scenario is to assign
precedence for the override policy over the share-based policy. The override
policy determines in such a case how share-based tickets are assigned among
jobs of the same user or project. Note that all policies contribute to the
ticket amount assigned to a particular job regardless of the policy hierarchy
definition. Yet the tickets calculated in each of the policies can be
different depending on "
POLICY_HIERARCHY".
The "
POLICY_HIERARCHY" parameter can be a combination of up to 3
of the first letters of the 3 ticket based policies S(hare-based),
F(unctional) and O(verride). So a value "OFS" means that the
override policy takes precedence over the functional policy, which finally
influences the share-based policy. Fewer than 3 letters mean that some of the
policies do not influence other policies and also are not influenced by other
policies. So a value of "FS" means that the functional policy
influences the share-based policy and that there is no interference with the
other policies.
The special value "NONE" switches off policy hierarchies.
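The letter encoding can be decoded with a short Python sketch. The function name is hypothetical, and rejecting repeated letters is an assumption that the text above does not state explicitly.

```python
POLICY_NAMES = {"S": "share-based", "F": "functional", "O": "override"}


def parse_policy_hierarchy(value):
    """Return the dependency chain as a list of policy names.

    'NONE' switches off policy hierarchies and yields an empty list.
    """
    if value == "NONE":
        return []
    if not 1 <= len(value) <= 3:
        raise ValueError("expected 1 to 3 letters or NONE")
    if len(set(value)) != len(value):
        raise ValueError("each policy letter may appear only once")
    try:
        return [POLICY_NAMES[letter] for letter in value]
    except KeyError as exc:
        raise ValueError("unknown policy letter: %s" % exc) from None


print(parse_policy_hierarchy("OFS"))
print(parse_policy_hierarchy("FS"))
```

Earlier entries in the returned list influence later ones, so "OFS" places the override policy first.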
share_override_tickets¶
If set to "true" or "1", override tickets of any override
object instance are shared equally among all running jobs associated with the
object. Pending jobs get as many override tickets as they would
have if they were running. If set to "false" or "0",
each job gets the full value of the override tickets associated with the
object. The default value is "true".
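The difference between the two settings can be shown with a minimal Python sketch. The function name and the ticket numbers are hypothetical, and the sketch covers running jobs only.

```python
def override_tickets_per_job(object_tickets, running_jobs, share=True):
    """Return the override tickets each running job of an object receives."""
    if running_jobs <= 0:
        return 0.0
    if share:
        # share_override_tickets=true: split equally among running jobs.
        return object_tickets / running_jobs
    # share_override_tickets=false: every job gets the full value.
    return float(object_tickets)


print(override_tickets_per_job(100, 4, share=True))
print(override_tickets_per_job(100, 4, share=False))
```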
share_functional_shares¶
If set to "true" or "1", functional shares of any functional
object instance are shared among all the jobs associated with the object. If
set to "false" or "0", each job associated with a
functional object gets the full functional shares of that object. The default
value is "true".
max_functional_jobs_to_schedule¶
The maximum number of pending jobs to schedule in the functional policy. The
default value is 200.
max_pending_tasks_per_job¶
The maximum number of subtasks per pending array job to schedule. This parameter
exists in order to reduce scheduling overhead. The default value is 50.
max_reservation¶
The maximum number of reservations scheduled within a schedule interval. When a
runnable job can not be started due to a shortage of resources a reservation
can be scheduled instead. A reservation can cover consumable resources with
the global host, any execution host and any queue. For parallel jobs,
reservations are also made for the slots resource. The job runtime is assumed
to be the maximum of the times specified with -l h_rt=... or -l s_rt=... .
For jobs that have neither of them the default_duration is assumed.
Reservations prevent jobs of lower priority from utilizing the
reserved resource quota during the time of reservation. Jobs of lower priority
are allowed to utilize those reserved resources only if their prospective job
end is before the start of the reservation (backfilling). Reservation is done
only for non-immediate jobs (-now no) that request reservation (-R y). If
max_reservation is set to "0" no job reservation is done.
Note that reservation scheduling can be costly in terms of performance and hence
is switched off by default. Since the performance cost of reservation
scheduling is known to grow with the number of pending
jobs, the use of the -R y option is recommended only for those jobs actually
queuing for bottleneck resources. Together with the max_reservation parameter
this technique can be used to limit the performance impact.
default_duration¶
When job reservation is enabled through the max_reservation parameter, the default
duration is assumed as runtime for jobs that have neither -l h_rt=... nor -l
s_rt=... specified. In contrast to an h_rt/s_rt time limit the default_duration
is not enforced.
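The runtime assumption and the backfilling rule used for reservations can be sketched in Python as follows. The function names are hypothetical, times are plain seconds, DURATION_OFFSET is ignored, and allowing a job to end exactly at the reservation start is an assumption.

```python
def assumed_runtime(default_duration, h_rt=None, s_rt=None):
    """Return the runtime assumed for a job when planning reservations.

    The maximum of h_rt and s_rt is used if either is given,
    otherwise default_duration.
    """
    limits = [t for t in (h_rt, s_rt) if t is not None]
    return max(limits) if limits else default_duration


def can_backfill(now, runtime, reservation_start):
    """A lower priority job may use reserved resources only if its
    prospective end does not run into the reservation."""
    return now + runtime <= reservation_start


print(assumed_runtime(600, h_rt=3600, s_rt=1800))
print(assumed_runtime(600))
print(can_backfill(now=0, runtime=600, reservation_start=900))
print(can_backfill(now=0, runtime=1200, reservation_start=900))
```

Because default_duration is not enforced, a job without h_rt or s_rt may run longer than assumed.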
FILES¶
<sge_root>/<cell>/common/sched_configuration
scheduler thread configuration
SEE ALSO¶
Sun Grid Engine Installation and Administration Guide
COPYRIGHT¶
See for a full statement of rights and permissions.