table of contents
- NAME
- 1. Overview
- 1.1 Problem
- 1.2. Design Requirements
- 1.3. Hardware Considerations and Requirements
- 1.3.1. Concurrent, Synchronous, Read/Write Access
- 1.3.2. Bargain-basement JBODs need not apply
- 1.3.3. Fencing is Required
- 1.4. Limitations
- 2. Algorithms
- 2.1. Heartbeating & Liveliness Determination
- 2.2. Scoring & Heuristics
- 2.3. Master Election
- 2.4. Master Duties
- 2.5. How it All Ties Together
- 3. Configuration
- 3.1. The <quorumd> tag
- 3.1.1. Quorum Disk Timings
- 3.2. The <heuristic> tag
- 3.3. Examples
- 3.3.1. 3 cluster nodes & 3 routers
- 3.3.2. 2 cluster nodes & 1 IP tiebreaker
- 3.4. Heuristic score considerations
- 3.5. Creating a quorum disk partition
- SEE ALSO
QDisk(5) | Cluster Quorum Disk | QDisk(5) |
NAME
qdisk - a disk-based quorum daemon for CMAN / Linux-Cluster
1. Overview
1.1 Problem
In some situations, it may be necessary or desirable to sustain a majority node failure of a cluster without introducing the need for asymmetric cluster configurations (e.g. client-server, or heavily-weighted voting nodes).
1.2. Design Requirements
* Ability to sustain the simultaneous failure of anywhere from 1 to (n-1) of the n nodes in the cluster, without the danger of a simple network partition causing a split brain. That is, we need to be able to ensure that the majority failure case is not merely the result of a network partition.
1.3. Hardware Considerations and Requirements
1.3.1. Concurrent, Synchronous, Read/Write Access
This quorum daemon requires a shared block device with concurrent read/write access from all nodes in the cluster. The shared block device can be a multi-port SCSI RAID array, a Fibre Channel RAID SAN, a RAIDed iSCSI target, or even GNBD. The quorum daemon uses O_DIRECT to write to the device.
1.3.2. Bargain-basement JBODs need not apply
There is a minimum performance requirement inherent when using disk-based cluster quorum algorithms, so design your cluster accordingly. Using a cheap JBOD with old SCSI2 disks on a multi-initiator bus will cause problems at the first load spike. Plan your loads accordingly; a node's inability to write to the quorum disk in a timely manner will cause the cluster to evict the node. Using host-RAID or multi-initiator parallel SCSI configurations with the qdisk daemon is unlikely to work, and will probably cause administrators a lot of frustration. That having been said, because the timeouts are configurable, most hardware should work if the timeouts are set high enough.
1.3.3. Fencing is Required
In order to maintain data integrity under all failure scenarios, use of this quorum daemon requires adequate fencing, preferably power-based fencing. Watchdog timers and software-based solutions to reboot the node internally, while possibly sufficient, are not considered 'fencing' for the purposes of using the quorum disk.
1.4. Limitations
* At this time, this daemon supports a maximum of 16 nodes. This is primarily a scalability issue: as we increase the node count, we increase the amount of synchronous I/O contention on the shared quorum disk.
2. Algorithms
2.1. Heartbeating & Liveliness Determination
Nodes update individual status blocks on the quorum disk at a user-defined rate. Each write of a status block alters the timestamp, which is what other nodes use to decide whether a node has hung or not. After a user-defined number of 'misses' (that is, failures to update the timestamp), a node is declared offline. After a certain number of 'hits' (changed timestamp + "I am alive" state), the node is declared online.
- Timestamp
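The miss/hit counting described above can be sketched as follows. This is a minimal illustration and not qdiskd's actual code: the NodeTracker class and its method are hypothetical, tko and tko_up stand for the user-defined miss and hit thresholds, and resetting the opposite counter on every cycle is an assumption.

```python
from dataclasses import dataclass


@dataclass
class NodeTracker:
    """Tracks one peer's liveliness from its status block (hypothetical helper)."""
    tko: int                   # misses needed before the node is declared offline
    tko_up: int                # hits needed before the node is declared online
    online: bool = False
    misses: int = 0
    hits: int = 0
    last_timestamp: int = -1

    def observe(self, timestamp: int, alive_state: bool) -> bool:
        """Process one read cycle of the peer's status block; return online state."""
        changed = timestamp != self.last_timestamp
        self.last_timestamp = timestamp
        if changed and alive_state:
            # A hit: the timestamp changed and the node reports itself alive.
            self.misses = 0  # assumption: counters are consecutive, so reset
            self.hits += 1
            if not self.online and self.hits >= self.tko_up:
                self.online = True
        else:
            # A miss: no fresh update was seen in this cycle.
            self.hits = 0
            self.misses += 1
            if self.online and self.misses >= self.tko:
                self.online = False
        return self.online
```

With tko=3 and tko_up=2, a peer is declared online after two fresh updates and offline again after three cycles without one.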
2.2. Scoring & Heuristics
The administrator can configure up to 10 purely arbitrary heuristics, and must exercise caution in doing so. At least one administrator-defined heuristic is required for operation, but it is generally a good idea to have more than one heuristic. By default, only nodes scoring over 1/2 of the total maximum score will claim they are available via the quorum disk, and a node (master or otherwise) whose score drops too low will remove itself (usually by rebooting).
2.3. Master Election
Only one master is present at any one time in the cluster, regardless of how many partitions exist within the cluster itself. The master is elected by a simple voting scheme in which the lowest-numbered node that believes it is capable of running (i.e. scores high enough) bids for master status. If the other nodes agree, it becomes the master. This algorithm is run whenever no master is present.
2.4. Master Duties
The master node decides who is or is not in the master partition, and handles eviction of dead nodes (both via the quorum disk and via the linux-cluster fencing system, using the cman_kill_node() API).
2.5. How it All Ties Together
When a master is present, and if the master believes a node to be online, that node will advertise to CMAN that the quorum disk is available. The master will only grant a node membership if:
(a) CMAN believes the node to be online, and
(b) the node has made enough consistent updates to the quorum disk, and
(c) the node has a high enough score to consider itself online.
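The three conditions can be read as a single conjunction. The sketch below is illustrative only: the function and parameter names are hypothetical, and treating condition (b) as a count of consistent quorum disk updates compared against a required number of cycles is an assumption rather than something the text spells out.

```python
def grants_membership(cman_online, consistent_updates, required_updates,
                      score, min_score):
    """Return True only if conditions (a), (b) and (c) all hold."""
    seen_by_cman = bool(cman_online)                      # (a) CMAN sees the node
    updates_ok = consistent_updates >= required_updates   # (b) assumed reading
    score_ok = score >= min_score                         # (c) score is sufficient
    return seen_by_cman and updates_ok and score_ok
```

A node that CMAN considers offline is refused membership even if its disk updates and score are fine, which is what keeps the quorum disk from overriding CMAN's own view of membership.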
3. Configuration
3.1. The <quorumd> tag
This tag is a child of the top-level <cluster> tag.
<quorumd
interval="1"
This is the frequency of read/write cycles, in seconds.
tko="10"
This is the number of cycles a node must miss in order to be declared dead. The default for this number is dependent on the configured token timeout.
tko_up="X"
This is the number of cycles a node must be seen in order to be declared online. Default is floor(tko/3).
upgrade_wait="2"
This is the number of cycles a node must wait before initiating a bid for master status after heuristic scoring becomes sufficient. The default is 2. This cannot be set to 0, and should not exceed tko.
master_wait="X"
This is the number of cycles a node must wait for votes before declaring itself master after making a bid. Default is floor(tko/2). This cannot be less than 2, must be greater than tko_up, and should not exceed tko.
votes="3"
This is the number of votes the quorum daemon advertises to CMAN when it has a high enough score. The default is the number of nodes in the cluster minus 1. For example, in a 4 node cluster, the default is 3. This value may change during normal operation, for example when adding or removing a node from the cluster.
log_level="4"
This controls the verbosity of the quorum daemon in the system logs. 0 = emergencies; 7 = debug. This option is deprecated.
log_facility="daemon"
This controls the syslog facility used by the quorum daemon when logging. For a complete list of available facilities, see syslog.conf(5). The default value for this is 'daemon'. This option is deprecated.
status_file="/foo"
Write internal states out to this file periodically ("-" = use stdout). This is primarily used for debugging. The default value for this attribute is undefined. This option can be changed while qdiskd is running.
min_score="3"
Absolute minimum score required for a node to consider itself "alive". If omitted, or set to 0, the default function "floor((n+1)/2)" is used, where n is the sum of the score attributes of all defined heuristics. This must never exceed the sum of the heuristic scores, or else the quorum disk will never be available.
reboot="1"
If set to 0 (off), qdiskd will *not* reboot after a negative transition resulting from a change in score (see section 2.2). The default for this value is 1 (on). This option can be changed while qdiskd is running.
master_wins="0"
If set to 1 (on), only the qdiskd master will advertise its votes to CMAN. In a network partition, only the qdisk master will provide votes to CMAN. Consequently, that node will automatically "win" in a fence race.
allow_kill="1"
If set to 0 (off), qdiskd will *not* instruct CMAN to kill nodes it thinks are dead (as a result of not writing to the quorum disk). The default for this value is 1 (on). This option can be changed while qdiskd is running.
paranoid="0"
If set to 1 (on), qdiskd will watch internal timers and reboot the node if it takes more than (interval * tko) seconds to complete a quorum disk pass. The default for this value is 0 (off). This option can be changed while qdiskd is running.
io_timeout="0"
If set to 1 (on), qdiskd will watch internal timers and reboot the node if qdisk is not able to write to disk after (interval * tko) seconds. The default for this value is 0 (off). If io_timeout is active, max_error_cycles is overridden and set to off.
scheduler="rr"
Valid values are 'rr', 'fifo', and 'other'. Selects the scheduling queue in the Linux kernel for operation of the main & score threads (does not affect the heuristics; they are always run in the 'other' queue). Default is 'rr'. See sched_setscheduler(2) for more details.
priority="1"
Valid values for 'rr' and 'fifo' are 1..100 inclusive. Valid values for 'other' are -20..20 inclusive. Sets the priority of the main & score threads. The default value is 1 (in the RR and FIFO queues, higher numbers denote higher priority; in OTHER, lower values denote higher priority). This option can be changed while qdiskd is running.
stop_cman="0"
Ordinarily, cluster membership is left up to CMAN, not qdisk. If this parameter is set to 1 (on), qdiskd will tell CMAN to leave the cluster if it is unable to initialize the quorum disk during startup. This can be used to prevent cluster participation by a node which has been disconnected from the SAN. The default for this value is 0 (off). This option can be changed while qdiskd is running.
use_uptime="1"
If this parameter is set to 1 (on), qdiskd will use values from /proc/uptime for internal timings. This is a bit less precise than gettimeofday(2), but the benefit is that changing the system clock will not affect qdiskd's behavior - even if paranoid is enabled. If set to 0, qdiskd will use gettimeofday(2), which is more precise. The default for this value is 1 (on / use uptime).
device="/dev/sda1"
This is the device the quorum daemon will use. This device must be the same on all nodes.
label="mylabel"
This overrides the device field if present. If specified, the quorum daemon will read /proc/partitions and check for qdisk signatures on every block device found, comparing the label against the specified label. This is useful in configurations where the block device name differs on a per-node basis.
cman_label="mylabel"
This overrides the label advertised to CMAN if present. If specified, the quorum daemon will register with this name instead of the actual device name.
max_error_cycles="0"
If we receive an I/O error during a cycle, we do not poll CMAN and tell it we are alive. If specified, this value will cause qdiskd to exit after the specified number of consecutive cycles during which I/O errors occur. The default is 0 (no maximum). This option can be changed while qdiskd is running. This option is ignored if io_timeout is set to 1.
/>
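The defaults and limits given in the attribute descriptions above can be collected into a small checker. This is a sketch under stated assumptions, not part of qdiskd: the function names and the returned dictionary layout are hypothetical, and only the rules quoted in the descriptions (the floor(tko/3), floor(tko/2) and floor((n+1)/2) defaults, the votes default of node count minus 1, and the stated limits) are encoded.

```python
def derive_defaults(tko, node_count, heuristic_scores):
    """Compute the documented defaults for attributes that were left unset."""
    total_score = sum(heuristic_scores)
    return {
        "tko_up": tko // 3,                  # floor(tko/3)
        "master_wait": tko // 2,             # floor(tko/2)
        "upgrade_wait": 2,
        "votes": node_count - 1,             # number of nodes minus 1
        "min_score": (total_score + 1) // 2,  # floor((n+1)/2)
    }


def check_constraints(tko, tko_up, master_wait, upgrade_wait,
                      min_score, heuristic_scores):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if upgrade_wait == 0:
        problems.append("upgrade_wait cannot be 0")
    if upgrade_wait > tko:
        problems.append("upgrade_wait should not exceed tko")
    if master_wait < 2:
        problems.append("master_wait cannot be less than 2")
    if master_wait <= tko_up:
        problems.append("master_wait must be greater than tko_up")
    if master_wait > tko:
        problems.append("master_wait should not exceed tko")
    if min_score > sum(heuristic_scores):
        problems.append("min_score exceeds the sum of the heuristic scores")
    return problems
```

For example, with tko=10, four nodes and three heuristics of score 1 each, derive_defaults yields tko_up=3, master_wait=5, votes=3 and min_score=2, and check_constraints reports no problems for those values.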
3.1.1. Quorum Disk Timings
Qdiskd should not be used in environments requiring failure detection times of less than approximately 10 seconds.
interval * (tko + master_wait + upgrade_wait)
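As a worked instance of the expression above, the sketch below evaluates it for given settings. The text here does not say what the result is compared against. The second function assumes, following the upstream manual as best recalled, that the totem token timeout should exceed the result by at least one interval. Both function names are hypothetical.

```python
def qdisk_timing_budget(interval, tko, master_wait, upgrade_wait):
    """Evaluate interval * (tko + master_wait + upgrade_wait), in seconds."""
    return interval * (tko + master_wait + upgrade_wait)


def min_token_timeout_ms(interval, tko, master_wait, upgrade_wait):
    """ASSUMPTION: the token timeout should be one interval above the budget."""
    budget = qdisk_timing_budget(interval, tko, master_wait, upgrade_wait)
    return (budget + interval) * 1000  # seconds to milliseconds
```

With interval=1, tko=10, master_wait=5 and upgrade_wait=2, the expression evaluates to 17 seconds.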
3.2. The <heuristic> tag
This tag is a child of the <quorumd> tag. Heuristics may not be changed while qdiskd is running.
<heuristic
program="/test.sh"
This is the program used to determine if this heuristic is alive. This can be anything which may be executed by /bin/sh -c. A return value of zero indicates success; anything else indicates failure. This is required.
score="1"
This is the weight of this heuristic. Be careful when determining scores for heuristics. The default score for each heuristic is 1.
interval="2"
This is the frequency (in seconds) at which we poll the heuristic. The default interval for every heuristic is 2 seconds.
tko="1"
After this many failed attempts to run the heuristic, it is considered DOWN, and its score is removed. The default tko for each heuristic is 1, which may be inadequate for things such as 'ping'.
/>
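How a single heuristic might be evaluated can be sketched as follows. This is not qdiskd's implementation: the helper names are hypothetical, counting only consecutive failures is an assumption, and so is restoring the score after one success. The only behaviour taken from the text is that the program is run through /bin/sh -c, that exit status 0 means success, and that the score is removed after tko failed attempts.

```python
import subprocess


def run_heuristic(program):
    """Run the program via /bin/sh -c; an exit status of 0 means success."""
    result = subprocess.run(
        ["/bin/sh", "-c", program],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class HeuristicState:
    """Tracks failures of one heuristic and its current score contribution."""

    def __init__(self, score=1, tko=1):
        self.score = score
        self.tko = tko
        self.failures = 0

    def record(self, ok):
        """Record one run; return the score this heuristic currently contributes."""
        # Assumption: failures are counted consecutively, and one success
        # brings the heuristic back up.
        self.failures = 0 if ok else self.failures + 1
        return 0 if self.failures >= self.tko else self.score
```

A polling loop would call state.record(run_heuristic(program)) once per heuristic interval and sum the returned values across all heuristics to obtain the node's score.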
3.3. Examples
3.3.1. 3 cluster nodes & 3 routers
<cman expected_votes="6" .../>
<clusternodes>
    <clusternode name="node1" votes="1" ... />
    ...
</clusternodes>
<quorumd ...>
    <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
    ...
</quorumd>
3.3.2. 2 cluster nodes & 1 IP tiebreaker
<cman two_node="0" expected_votes="3" .../>
<clusternodes>
    <clusternode name="node1" votes="1" ... />
    ...
</clusternodes>
<quorumd ...>
    <heuristic program="ping A -c1 -t1" score="1" interval="2" tko="3"/>
</quorumd>
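The vote arithmetic behind the two examples can be checked with a short sketch. It assumes CMAN's usual majority rule, floor(expected_votes / 2) + 1, and infers the quorum disk's votes as expected_votes minus the node votes (3 in the first example, 1 in the second), since the quorumd attributes are not shown in the examples. The function names are hypothetical.

```python
def quorum_threshold(expected_votes):
    """Votes needed for quorum under a simple majority rule (assumed)."""
    return expected_votes // 2 + 1


def is_quorate(node_votes_present, qdisk_votes, expected_votes):
    """True if the visible node votes plus quorum disk votes reach the threshold."""
    total = node_votes_present + qdisk_votes
    return total >= quorum_threshold(expected_votes)


# Example 3.3.1: three 1-vote nodes, quorum disk assumed to carry 3 votes.
single_survivor_with_qdisk = is_quorate(1, 3, 6)
two_nodes_without_qdisk = is_quorate(2, 0, 6)

# Example 3.3.2: two 1-vote nodes, quorum disk assumed to carry 1 vote.
one_node_with_tiebreaker = is_quorate(1, 1, 3)
one_node_alone = is_quorate(1, 0, 3)
```

Under these assumptions a single surviving node that still holds the quorum disk stays quorate in both examples, while nodes that have lost the quorum disk do not.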
3.4. Heuristic score considerations
* Heuristic timeouts should be set high enough to allow the previous run of a given heuristic to complete.
3.5. Creating a quorum disk partition
The mkqdisk utility can create and list currently configured quorum disks visible to the local node; see mkqdisk(8) for more details.
SEE ALSO
mkqdisk(8), qdiskd(8), cman(5), syslog.conf(5), gettimeofday(2)
20 Feb 2007