NAME
sanlock - shared storage lock manager
SYNOPSIS
sanlock [COMMAND] [ACTION] ...
DESCRIPTION
The sanlock daemon manages leases for applications running on a cluster of hosts
with shared storage. All lease management and coordination is done through
reading and writing blocks on the shared storage. Two types of leases are
used, each based on a different algorithm:
"delta leases" are slow to acquire and require regular i/o to shared
storage. A delta lease exists in a single sector of storage. Acquiring a delta
lease involves reads and writes to that sector separated by specific delays.
Once acquired, a lease must be renewed by updating a timestamp in the sector
regularly. sanlock uses a delta lease internally to hold a lease on a host_id.
host_id leases prevent two hosts from using the same host_id and provide basic
host liveness information based on the renewals.
"paxos leases" are generally fast to acquire and sanlock makes them
available to applications as general purpose resource leases. A paxos lease
exists in 1MB of shared storage (8MB for 4k sectors). Acquiring a paxos lease
involves reads and writes to max_hosts (2000) sectors in a specific sequence
specified by the Disk Paxos algorithm. paxos leases use host_id's internally
to indicate the owner of the lease, and the algorithm fails if different hosts
use the same host_id. So, delta leases provide the unique host_id's used in
paxos leases. paxos leases also refer to delta leases to check if a host_id is
alive.
Before sanlock can be used, the user must assign each host a host_id, which is a
number between 1 and 2000. Two hosts should not be given the same host_id
(even though delta leases attempt to detect this mistake.)
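As a minimal sketch of the host_id constraint above, a wrapper script could reject out-of-range values before passing them to sanlock. The function name valid_host_id is hypothetical and not part of sanlock; the check covers only the 1 to 2000 range, not uniqueness across hosts.

```shell
#!/bin/sh
# Hypothetical helper (not part of sanlock): succeed only if $1 is a
# decimal integer in the range 1..2000 described above.
valid_host_id() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 2000 ]
}

valid_host_id 1    && echo "host_id 1 accepted"
valid_host_id 2001 || echo "host_id 2001 rejected"
```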
sanlock views a pool of storage as a "lockspace". Each distinct pool
of storage, e.g. from different sources, would typically be defined as a
separate lockspace, with a unique lockspace name.
Part of this storage space must be reserved and initialized for sanlock to store
delta leases. Each host that wants to use the lockspace must first acquire a
delta lease on its host_id number within the lockspace. (See the add_lockspace
action/api.) The space required for 2000 delta leases in the lockspace (for
2000 possible host_id's) is 1MB (8MB for 4k sectors). (This is the same size
required for a single paxos lease.)
More storage space must be reserved and initialized for paxos leases, according
to the needs of the applications using sanlock.
The following steps illustrate these concepts using the command line.
Applications may choose to do these same steps through libsanlock.
1. Create storage pools and reserve and initialize host_id leases
two different LUNs on a SAN: /dev/sdb, /dev/sdc
# vgcreate pool1 /dev/sdb
# vgcreate pool2 /dev/sdc
# lvcreate -n hostid_leases -L 1MB pool1
# lvcreate -n hostid_leases -L 1MB pool2
# sanlock direct init -s LS1:0:/dev/pool1/hostid_leases:0
# sanlock direct init -s LS2:0:/dev/pool2/hostid_leases:0
2. Start the sanlock daemon on each host
# sanlock daemon
3. Add each lockspace to be used
host1:
# sanlock client add_lockspace -s LS1:1:/dev/pool1/hostid_leases:0
# sanlock client add_lockspace -s LS2:1:/dev/pool2/hostid_leases:0
host2:
# sanlock client add_lockspace -s LS1:2:/dev/pool1/hostid_leases:0
# sanlock client add_lockspace -s LS2:2:/dev/pool2/hostid_leases:0
4. Applications can now reserve/initialize space for resource leases, and then
acquire the leases as they need to access the resources.
The resource leases that are created and how they are used depends on the
application. For example, say application A, running on host1 and host2, needs
to synchronize access to data it stores on /dev/pool1/Adata. A could use a
resource lease as follows:
5. Reserve and initialize a single resource lease for Adata
# lvcreate -n Adata_lease -L 1MB pool1
# sanlock direct init -r LS1:Adata:/dev/pool1/Adata_lease:0
6. Acquire the lease from the app using libsanlock (see sanlock_register,
sanlock_acquire). If the app is already running as pid 123, and has registered
with the sanlock daemon, the lease can be added for it manually.
# sanlock client acquire -r LS1:Adata:/dev/pool1/Adata_lease:0 -p 123
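The per-host part of the steps above (starting the daemon and adding each lockspace) could be scripted. The following is a dry-run sketch that only prints the commands for a given host_id rather than running them. The function name print_host_setup is hypothetical; the lockspace names and paths are the ones from the example.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints (does not run) the commands from
# steps 2 and 3 for one host. Lockspace names and paths are taken from
# the example above.
print_host_setup() {
    host_id=$1
    echo "sanlock daemon"
    for pair in LS1:pool1 LS2:pool2; do
        ls_name=${pair%%:*}    # text before the colon
        pool=${pair#*:}        # text after the colon
        echo "sanlock client add_lockspace -s ${ls_name}:${host_id}:/dev/${pool}/hostid_leases:0"
    done
}

print_host_setup 2
```

Printing the commands first makes it easy to review the host_id in each LOCKSPACE string before anything touches shared storage.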
offsets
offsets must be 1MB aligned for disks with 512 byte sectors, and 8MB aligned for
disks with 4096 byte sectors.
offsets may be used to place leases on the same device rather than using
separate devices and offset 0 as shown in examples above, e.g. these commands
above:
# sanlock direct init -s LS1:0:/dev/pool1/hostid_leases:0
# sanlock direct init -r LS1:Adata:/dev/pool1/Adata_lease:0
could be replaced by:
# sanlock direct init -s LS1:0:/dev/pool1/leases:0
# sanlock direct init -r LS1:Adata:/dev/pool1/leases:1048576
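The offset arithmetic can be sketched as follows, assuming leases are packed back to back in fixed-size slots on one device. The function name lease_offset and the 0-based slot numbering are illustrative only; the alignment values are the ones stated above.

```shell
#!/bin/sh
# Hypothetical helper: byte offset of lease slot N (0-based) on a device,
# using the alignment stated above: 1MB for 512-byte sectors and 8MB for
# 4096-byte sectors.
lease_offset() {
    slot=$1
    sector_size=$2
    case "$sector_size" in
        512)  align=1048576 ;;   # 1MB
        4096) align=8388608 ;;   # 8MB
        *)    echo "unsupported sector size: $sector_size" >&2; return 1 ;;
    esac
    echo $((slot * align))
}

lease_offset 0 512    # slot 0: the host_id leases in the example
lease_offset 1 512    # slot 1: the Adata resource lease in the example
lease_offset 1 4096   # slot 1 on a disk with 4096 byte sectors
```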
failures
If a process holding resource leases fails or exits without releasing its
leases, sanlock will release the leases for it automatically.
If the sanlock daemon cannot renew a lockspace host_id for a specific period of
time (usually because storage access is lost), sanlock will kill any process
holding a resource lease within the lockspace.
If the sanlock daemon crashes or gets stuck, it will no longer renew the expiry
time of its per-host_id connections to the wdmd daemon, and the watchdog
device will reset the host.
watchdog
sanlock uses the wdmd(8) daemon to access /dev/watchdog. A separate
connection to wdmd is maintained for each host_id being renewed. Each
host_id connection has an expiry time for some seconds in the future. After
each successful host_id renewal, sanlock updates the associated expiry time in
wdmd. If wdmd finds any connection expired, it will not pet /dev/watchdog.
After enough successive expired/failed checks, the watchdog device will fire
and reset the host.
After a number of failed attempts to renew a host_id, sanlock kills any process
using that lockspace. Once all those processes have exited, sanlock will
unregister the associated wdmd connection. wdmd will no longer find the
expired connection, and will resume petting /dev/watchdog (assuming it finds
no other failed/expired tests.) If the killed processes did not exit quickly
enough, the expired wdmd connection will not be unregistered, and
/dev/watchdog will reset the host.
Based on these known timeout values, sanlock on another host can calculate,
based on the last host_id renewal, when the failed host will have been reset
by its watchdog (or killed all the necessary processes).
If the sanlock daemon itself fails, crashes, or gets stuck, it will no longer update
the expiry time for its host_id connections to wdmd, which will also lead to
the watchdog resetting the host.
safety
sanlock leases are meant to guarantee that two processes on two hosts are never
allowed to hold the same resource lease at once. If they were, the resource
being protected may be corrupted. There are three levels of protection built
into sanlock itself:
1. The paxos leases and delta leases themselves.
2. If the leases cannot function because storage access is lost (host_id's
cannot be renewed), the sanlock daemon kills any pids using resource leases in
the lockspace.
3. If the pids do not exit after being killed, or if the sanlock daemon fails,
the watchdog device resets the host.
OPTIONS
COMMAND can be one of three primary top level choices:
sanlock daemon    start daemon
sanlock client    send request to daemon (default command if none given)
sanlock direct    access storage directly (no coordination with daemon)
sanlock daemon [options]
-D        no fork and print all logging to stderr
-Q 0|1    quiet error messages for common lock contention
-R 0|1    renewal debugging, log debug info for each renewal
-L pri    write logging at priority level and up to logfile (-1 none)
-S pri    write logging at priority level and up to syslog (-1 none)
-U uid    user id
-G gid    group id
-t num    max worker threads
-w 0|1    use watchdog through wdmd
-h 0|1    use high priority features (realtime scheduling, mlockall)
-a 0|1    use async i/o
-o sec    io timeout in seconds
sanlock client action [options]
sanlock client status
Print processes, lockspaces, and resources being managed by the sanlock daemon.
Add -D to show extra internal daemon status for debugging. Add -o p to show
resources by pid, or -o s to show resources by lockspace.
sanlock client host_status -s LOCKSPACE
Print state of host_id delta leases read during the last renewal. Only
lockspace_name is used from the LOCKSPACE argument. Add -D to show extra
internal daemon status for debugging.
sanlock client log_dump
Print the sanlock daemon internal debug log.
sanlock client shutdown
Ask the sanlock daemon to exit. Without the force option (-f 0), the command
will be ignored if any lockspaces exist. With the force option (-f 1), any
registered processes will be killed, their resource leases released, and
lockspaces removed.
sanlock client init -s LOCKSPACE
sanlock client init -r RESOURCE
Tell the sanlock daemon to initialize storage for lease areas. (See sanlock
direct init.)
sanlock client align -s LOCKSPACE
Tell the sanlock daemon to report the required lease alignment for a storage
path. Only path is used from the LOCKSPACE argument.
sanlock client add_lockspace -s LOCKSPACE
Tell the sanlock daemon to acquire the specified host_id in the lockspace. This
will allow resources to be acquired in the lockspace.
sanlock client inq_lockspace -s LOCKSPACE
Ask the sanlock daemon whether the lockspace has been acquired.
sanlock client rem_lockspace -s LOCKSPACE
Tell the sanlock daemon to release the specified host_id in the lockspace. Any
processes holding resource leases in this lockspace will be killed, and the
resource leases will not be released.
sanlock client command -r RESOURCE -c path args
Register with the sanlock daemon, acquire the specified resource lease, and exec
the command at path with args. When the command exits, the sanlock daemon will
release the lease. -c must be the final option.
sanlock client acquire -r RESOURCE -p pid
sanlock client release -r RESOURCE -p pid
Tell the sanlock daemon to acquire or release the specified resource lease for
the given pid. The pid must be registered with the sanlock daemon. acquire can
optionally take a versioned RESOURCE string RESOURCE:lver, where lver is the
version of the lease that must be acquired, or fail.
sanlock client inquire -p pid
Print the resource leases held by the given pid. The format is a versioned RESOURCE
string "RESOURCE:lver" where lver is the version of the lease held.
sanlock client request -r RESOURCE -f force_mode
Request the owner of a resource do something specified by force_mode. A
versioned RESOURCE:lver string must be used with a greater version than is
presently held. Zero lver and force_mode clears the request.
sanlock client examine -r RESOURCE
Examine the request record for the currently held resource lease and carry out
the action specified by the requested force_mode.
sanlock client examine -s LOCKSPACE
Examine requests for all resource leases currently held in the named lockspace.
Only lockspace_name is used from the LOCKSPACE argument.
sanlock direct action [options]
-a 0|1    use async i/o
-o sec    io timeout in seconds
sanlock direct init -s LOCKSPACE
sanlock direct init -r RESOURCE
Initialize storage for 2000 host_id (delta) leases for the given lockspace, or
initialize storage for one resource (paxos) lease. Both options require 1MB of
space. The host_id in the LOCKSPACE string is not relevant to initialization,
so the value is ignored. (The default of 2000 host_ids can be changed for
special cases using the -n num_hosts and -m max_hosts options.)
sanlock direct read_leader -s LOCKSPACE
sanlock direct read_leader -r RESOURCE
Read a leader record from disk and print the fields. The leader record is the
single sector of a delta lease, or the first sector of a paxos lease.
sanlock direct read_id -s LOCKSPACE
sanlock direct live_id -s LOCKSPACE
read_id reads a host_id and prints the owner. live_id reads a host_id once a
second until the timestamp or owner changes (prints live 1), or until
host_dead_seconds (prints live 0). (host_dead_seconds is derived from the
io_timeout option. The live 0|1 conclusion will not match the sanlock daemon's
conclusion unless the configured timeouts match.)
sanlock direct dump path[:offset]
Read disk sectors and print leader records for delta or paxos leases. Add -f 1
to print the request record values for paxos leases, and host_ids set in delta
lease bitmaps.
LOCKSPACE option string
-s lockspace_name:host_id:path:offset
lockspace_name    name of lockspace
host_id           local host identifier in lockspace
path              path to storage reserved for leases
offset            offset on path (bytes)
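A script that receives a LOCKSPACE string may need its individual fields. The following is a minimal sketch using POSIX parameter expansion. The function name parse_lockspace is hypothetical, and the sketch assumes the lockspace name and host_id contain no ':' characters and that the string has all four fields.

```shell
#!/bin/sh
# Hypothetical helper: split lockspace_name:host_id:path:offset into its
# four fields. Assumes lockspace_name and host_id contain no ':' and that
# all four fields are present.
parse_lockspace() {
    rest=$1
    name=${rest%%:*};    rest=${rest#*:}   # first field
    host_id=${rest%%:*}; rest=${rest#*:}   # second field
    offset=${rest##*:}                     # last field
    path=${rest%:*}                        # everything before the last ':'
    echo "lockspace_name=$name"
    echo "host_id=$host_id"
    echo "path=$path"
    echo "offset=$offset"
}

parse_lockspace LS1:1:/dev/pool1/hostid_leases:0
```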
RESOURCE option string
-r lockspace_name:resource_name:path:offset
lockspace_name    name of lockspace
resource_name     name of resource
path              path to storage reserved for leases
offset            offset on path (bytes)
RESOURCE option string with version
-r lockspace_name:resource_name:path:offset:lver
lver              leader version or SH for shared lease
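To illustrate the difference between the plain and versioned forms, the sketch below assembles a RESOURCE string and appends lver only when one is given. The function name resource_string is hypothetical, and the lver value 5 in the usage lines is an arbitrary example.

```shell
#!/bin/sh
# Hypothetical helper: build a RESOURCE string from its fields.
# Arguments: lockspace_name resource_name path offset [lver]
resource_string() {
    base="$1:$2:$3:$4"
    if [ -n "${5:-}" ]; then
        echo "$base:$5"    # versioned form, lver is a number or SH
    else
        echo "$base"       # plain form
    fi
}

resource_string LS1 Adata /dev/pool1/Adata_lease 0
resource_string LS1 Adata /dev/pool1/Adata_lease 0 5
resource_string LS1 Adata /dev/pool1/Adata_lease 0 SH
```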
Defaults
sanlock help shows the default values for the options above.
sanlock version shows the build version.
USAGE
Request/Examine
The first part of making a request for a resource is writing the request record
of the resource (the sector following the leader record). To make a successful
request:
• RESOURCE:lver must be greater than the lver presently held by the other
  host. This implies the leader record must be read to discover the lver,
  prior to making a request.
• RESOURCE:lver must be greater than or equal to the lver presently written
  to the request record. Two hosts may write a new request at the same time
  for the same lver, in which case both would succeed, but the force_mode
  from the last would win.
• The force_mode must be greater than zero.
• To unconditionally clear the request record (set both lver and force_mode
  to 0), make request with RESOURCE:0 and force_mode 0.
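The conditions listed above can be expressed as a small predicate. This is only a sketch of the stated rules, not sanlock's implementation; the function name request_ok and its argument order are hypothetical, and the lver values would in practice come from the leader record and the request record on disk.

```shell
#!/bin/sh
# Hypothetical check mirroring the rules above. Arguments:
#   $1  lver presently held by the owner (from the leader record)
#   $2  lver presently written to the request record
#   $3  lver in the new request
#   $4  force_mode in the new request
request_ok() {
    held=$1
    recorded=$2
    new=$3
    force=$4
    # lver 0 with force_mode 0 unconditionally clears the request record.
    if [ "$new" -eq 0 ] && [ "$force" -eq 0 ]; then
        return 0
    fi
    [ "$new" -gt "$held" ] &&
    [ "$new" -ge "$recorded" ] &&
    [ "$force" -gt 0 ]
}

request_ok 3 0 4 1 && echo "request for lver 4 accepted"
request_ok 3 0 3 1 || echo "request for lver 3 rejected (not greater than held)"
request_ok 3 4 0 0 && echo "clear request accepted"
```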
The owner of the requested resource will not know of the request unless it is
explicitly told to examine its resources via the "examine"
api/command, or otherwise notified.
The second part of making a request is notifying the resource lease owner that
it should examine the request records of its resource leases. The notification
will cause the lease owner to automatically run the equivalent of
"sanlock client examine -s LOCKSPACE" for the lockspace of the
requested resource.
The notification is made using a bitmap in each host_id delta lease. Each bit
represents one of the possible host_ids (1-2000). If host A wants to notify
host B to examine its resources, A sets the bit in its own bitmap that
corresponds to the host_id of B. When B next renews its delta lease, it reads
the delta leases for all hosts and checks each bitmap to see if its own
host_id has been set. It finds the bit for its own host_id set in A's bitmap,
and examines its resource request records. (The bit remains set in A's bitmap
for request_finish_seconds.)
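To make the bitmap idea concrete, the sketch below computes which byte and bit mask would correspond to a given host_id. The layout is an assumption for illustration (host_id N stored as bit N-1, counting from the least significant bit of byte 0); the text above does not specify the on-disk bit order. The function name host_id_bit is hypothetical.

```shell
#!/bin/sh
# Assumed layout (not stated in this page): host_id N maps to bit N-1,
# counted from the least significant bit of byte 0 in the bitmap.
host_id_bit() {
    idx=$(( $1 - 1 ))
    echo "byte=$(( idx / 8 )) mask=$(( 1 << (idx % 8) ))"
}

host_id_bit 1      # first bit of the first byte
host_id_bit 10     # second bit of the second byte
host_id_bit 2000   # last of the 2000 possible host_ids
```

Under this assumption, 2000 host_ids need 2000 bits, which is 250 bytes of bitmap in each delta lease.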
force_mode determines the action the resource lease owner should take:
1 (KILL_PID): kill the process holding the resource lease. When the
process has exited, the resource lease will be released, and can then be
acquired by anyone.
SEE ALSO
wdmd(8)