THEORY OF OPERATION
The capacity of an NVDIMM REGION (contiguous span of persistent memory) is
accessed via one or more NAMESPACE devices. REGION is the Linux term for what
ACPI and UEFI call a DIMM-interleave-set, or a system-physical-address-range
that is striped (by the memory controller) across one or more memory modules.
The UEFI specification defines the NVDIMM Label Protocol as
the combination of label area access methods and a data format for
provisioning one or more NAMESPACE objects from a REGION. Note that label
support is optional and if Linux does not detect the label capability it
will automatically instantiate a "label-less" namespace per
region. Examples of label-less namespaces are the ones created by the
kernel’s memmap=ss!nn command line option (see the nvdimm wiki
on kernel.org), or NVDIMMs without a valid namespace index in their label
area. Label-less namespaces lack many of the features of their labelled
cousins. For example, their size cannot be modified, and they cannot be fully
destroyed (i.e. the space reclaimed). A destroy operation will zero
any mode-specific metadata. Finally, for create-namespace operations on
label-less namespaces, ndctl bypasses the region capacity availability
checks, and always satisfies the request using the full region capacity. The
only reconfiguration operation supported on a label-less namespace is
changing its mode.
A namespace can be provisioned to operate in one of four modes:
fsdax, devdax, sector, and raw. Here are the
expected usage models for these modes:
•fsdax: Filesystem-DAX mode is the default mode of
a namespace when specifying ndctl create-namespace
with no options. It
creates a block device (/dev/pmemX[.Y]) that supports the DAX capabilities of
Linux filesystems (xfs and ext4 to date). DAX removes the page cache from the
I/O path and allows mmap(2)
to establish direct mappings to persistent memory
media. The DAX capability enables workloads / working-sets that would exceed
the capacity of the page cache to scale up to the capacity of persistent
memory. Workloads that fit in page cache or perform bulk data transfers may
not see benefit from DAX. When in doubt, pick this mode.
•devdax: Device-DAX mode enables similar mmap(2)
DAX mapping capabilities as Filesystem-DAX. However, instead of a block-device
that can support a DAX-enabled filesystem, this mode emits a single character
device file (/dev/daxX.Y). Use this mode to assign persistent memory to a
virtual-machine, register persistent memory for RDMA, or when gigantic
mappings are needed.
•sector: Use this mode to host legacy filesystems
that do not checksum metadata or applications that are not prepared for torn
sectors after a crash. Expected usage for this mode is for small boot volumes.
This mode is compatible with other operating systems.
•raw: Raw mode is effectively just a memory disk
that does not support DAX. Typically this indicates a namespace that was
created by tooling or another operating system that did not know how to create
a Linux fsdax or devdax mode namespace. This mode is compatible
with other operating systems, but again, does not support DAX operation.
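As a rough illustration of the mmap(2) usage model described for the fsdax and devdax modes, the following Python sketch maps a file and stores bytes through the mapping. The helper name and the demo path are hypothetical. Whether the mapping is a true DAX mapping depends on the filesystem and mount options, not on this code: the demo uses an ordinary temporary file, so its stores go through the page cache.

```python
import mmap
import os
import tempfile


def write_through_mapping(path, payload, length=4096):
    """Map `length` bytes of `path` and store `payload` at offset 0.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, length, access=mmap.ACCESS_WRITE) as mm:
            mm[0:len(payload)] = payload  # plain memory store into the mapping
            mm.flush()                    # ask the kernel to write the range back
    finally:
        os.close(fd)


def demo():
    # Stand-in for a file on a DAX-capable filesystem (assumption: any
    # regular file is enough to show the calling pattern).
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.truncate(4096)
        path = f.name
    try:
        write_through_mapping(path, b"hello pmem")
        with open(path, "rb") as f:
            return f.read(10)
    finally:
        os.unlink(path)
```

Calling demo() writes through the mapping and then reads the same bytes back with ordinary file I/O.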
-t, --type=
Create a pmem namespace (subject to available capacity). A pmem namespace
supports the dax (direct access) capability to mmap(2) persistent memory
directly into a process address space.
A blk namespace accesses persistent memory through a block-window-aperture.
Compared to pmem it supports a traditional storage error model (EIO on error
rather than a cpu exception on a bad memory access), but it does not support
dax.
-m, --mode=
•"raw": expose the namespace capacity
directly with limitations. Neither a raw pmem namespace nor a raw blk namespace
supports sector atomicity by default (see "sector" mode below). A raw
pmem namespace may have limited to no dax support depending on the kernel. In
other words operations like direct-I/O targeting a dax buffer may fail for a
pmem namespace in raw mode or indirect through a page-cache buffer. See
"fsdax" and "devdax" mode for dax operation.
•"sector": persistent memory, given that
it is byte addressable, does not support sector atomicity. The problematic
aspect of sector tearing is that most applications do not know they have an
atomic sector update dependency. At least a disk rarely ever tears sectors and
if it does it almost certainly returns a checksum error on access. Persistent
memory devices will always tear and always silently. Until an application is
audited to be robust in the presence of sector-tearing "safe" mode
is recommended. This imposes some performance overhead and disables the dax
capability. (also known as "safe" or "btt" mode)
•"fsdax": A pmem namespace in this mode
supports dax operation with a block-device based filesystem (in previous ndctl
releases this mode was named "memory" mode). This mode comes at the
cost of allocating per-page metadata. The capacity can be allocated from
"System RAM", or from a reserved portion of "Persistent
Memory" (see the --map= option). NOTE: A filesystem that supports DAX is
required for dax operation. If the raw block device (/dev/pmemX) is used
directly without a filesystem, it will use the page cache. See
"devdax" mode for raw device access that supports dax.
•"devdax": The device-dax character
device interface is a statically allocated / raw access analogue of
filesystem-dax (in previous ndctl releases this mode was named "dax"
mode). It allows memory ranges to be mapped without need of an intervening
filesystem. The device-dax interface is strict, precise and predictable.
Specifically the interface:
•Guarantees fault granularity with respect to a
given page size (4K, 2M, or 1G on x86) set at configuration time.
•Enforces deterministic behavior by being strict
about what fault scenarios are supported. I.e. if a device is configured with
a 2M alignment an attempt to fault a 4K aligned offset will result in SIGBUS.
Note: both fsdax and devdax mode require 16MiB physical
alignment to be cross-arch compatible. By default ndctl will block attempts to
create namespaces in these modes when the physical starting address of the
namespace is not 16MiB aligned. The --force option tries to override this
constraint if the platform supports a smaller alignment, but this is not
recommended.
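The two alignment rules described for these modes can be modelled in a few lines of Python. This is only a sketch of the arithmetic, not ndctl or kernel code; the function names are hypothetical, and the second function reduces the "strict fault" behavior of devdax to a simple multiple-of-alignment test.

```python
MIB = 1024 * 1024
CROSS_ARCH_ALIGN = 16 * MIB  # physical alignment quoted in the text


def is_cross_arch_aligned(phys_start):
    """True if the physical start address is a multiple of 16MiB."""
    return phys_start % CROSS_ARCH_ALIGN == 0


def devdax_fault_allowed(offset, align):
    """Simplified model: a fault is accepted only at a multiple of the
    alignment configured for the device; anything else would be SIGBUS."""
    return offset % align == 0
```

For example, with a 2M alignment an offset of 4096 is rejected by the model, while an offset of 4MiB is accepted.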
-s, --size=
For NVDIMM devices that support namespace labels, set the
namespace size in bytes. Otherwise it defaults to the maximum size specified
by platform firmware. This option supports the suffixes "k" or
"K" for KiB, "m" or "M" for MiB, "g"
or "G" for GiB and "t" or "T" for TiB.
For pmem namespaces the size must be a multiple of the
interleave-width and the namespace alignment (see the alignment description
below).
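The suffix convention for the size value can be sketched as follows. This is an illustrative parser written for this page, not the one ndctl uses; the function name is hypothetical, and the sketch only covers the single-letter suffixes listed above.

```python
# Binary multipliers for the suffixes named in the text (case-insensitive).
SUFFIXES = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30, "t": 1 << 40}


def parse_size(text):
    """Convert a string such as '16G' or '4k' into a byte count."""
    text = text.strip()
    if text and text[-1].lower() in SUFFIXES:
        return int(text[:-1]) * SUFFIXES[text[-1].lower()]
    return int(text)  # no suffix: the value is already in bytes
```

A value without a suffix is taken as a plain byte count, matching the "size in bytes" wording above.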
-a, --align
Applications that want to establish dax memory mappings
with page table entries greater than system base page size (4K on x86) need a
persistent memory namespace that is sufficiently aligned. For
"fsdax" and "devdax" mode this defaults to 2M. Note that
"devdax" mode enforces all mappings to be aligned to this value,
i.e. it fails unaligned mapping attempts. The "fsdax" alignment
setting determines the starting alignment of filesystem extents and may limit
the possible granularities. If a large mapping is not possible, it will
silently fall back to a smaller page size.
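The fall-back behavior described for "fsdax" can be pictured with a toy model. The sketch below is an assumption-laden simplification, not the kernel's logic: it treats a page size as usable only when it does not exceed the namespace alignment, the mapping address is a multiple of it, and the mapping is at least that long. The function name is hypothetical.

```python
PAGE_SIZES = (1 << 30, 2 << 20, 4 << 10)  # 1G, 2M, 4K on x86, largest first


def largest_usable_page(addr, length, namespace_align):
    """Pick the largest page size that the toy model considers usable."""
    for size in PAGE_SIZES:
        if size <= namespace_align and addr % size == 0 and length >= size:
            return size
    return PAGE_SIZES[-1]  # base page size is the final fall-back
```

With a 2M namespace alignment, an aligned 4MiB mapping resolves to 2M pages in this model, while the same mapping at a 4K-aligned address drops to 4K pages.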
-e, --reconfig=
Reconfigure an existing namespace. This option is a
shortcut for the following sequence:
•Read all parameters from @victim_namespace
•Destroy @victim_namespace
•Create @new_namespace merging old parameters with new ones
Note that the major implication of a destroy-create cycle is that
data from @victim_namespace is not preserved in @new_namespace. The attributes
transferred from @victim_namespace are the geometry, mode, and name (not uuid
without --uuid=). No attempt is made to preserve the data, and any old data
that is visible in @new_namespace is by coincidence, not convention.
"Backup and restore" is the only reliable method to populate
@new_namespace with data from @victim_namespace.
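The parameter merge in a reconfigure cycle can be sketched with plain dictionaries. This is a model of the description above, not ndctl code: the function name and the dictionary keys are hypothetical, with "size" and "align" standing in for the geometry. Only geometry, mode, and name are inherited; a fresh uuid is generated unless one is supplied, and no data is carried over.

```python
import uuid

INHERITED = ("size", "align", "mode", "name")  # stand-ins for geometry, mode, name


def reconfigure(victim, overrides):
    """Model read -> destroy -> create for a namespace described by a dict."""
    inherited = {key: victim[key] for key in INHERITED if key in victim}
    new = {**inherited, **overrides}           # new parameters win over old ones
    new.setdefault("uuid", str(uuid.uuid4()))  # new uuid unless explicitly given
    return new
```

Passing {"mode": "devdax"} as the overrides keeps the old size and name, switches the mode, and assigns a new uuid.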
-u, --uuid=
This option is not recommended as a new uuid should be
generated every time a namespace is (re-)created. For recovery scenarios
however the uuid may be specified.
-n, --name=
For NVDIMM devices that support namespace labels, specify
a human friendly name for a namespace. This name is available as a device
attribute for use in udev rules.
-l, --sector-size
Specify the logical sector size (LBA size) of the Linux
block device associated with a namespace.
-M, --map=
A pmem namespace in "fsdax" or
"devdax" mode requires allocation of per-page metadata. The
allocation can be drawn from either:
•"mem": typical system memory
•"dev": persistent memory reserved from the namespace
Given the relative capacities of "Persistent Memory" to
"System RAM" the allocation defaults to reserving space out of the
namespace directly ("--map=dev"). The overhead is 64 bytes per 4K
(16GB per 1TB) on x86.
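The quoted overhead follows directly from the per-page figure. A minimal sketch of the arithmetic, using a hypothetical function name and the x86 values from the text as defaults:

```python
def map_overhead_bytes(namespace_bytes, page_size=4096, per_page=64):
    """Per-page metadata needed to cover a namespace of the given size."""
    return (namespace_bytes // page_size) * per_page
```

For a 1TiB namespace this gives 2^28 pages times 64 bytes, i.e. 16GiB, which matches the "16GB per 1TB" figure above.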
-c, --continue
Do not stop after creating one namespace. Instead,
greedily create as many namespaces as possible within the given --bus and
--region filter restrictions. This will abort if any creation attempt results
in an error unless --force is also supplied.
-f, --force
Unless this option is specified the reconfigure
namespace operation will fail if the namespace is presently active.
Specifying --force causes the namespace to be disabled before the operation is
attempted. However, if the namespace is mounted then the disable
namespace and reconfigure namespace operations will be
aborted. The namespace must be unmounted before being reconfigured. When used
in conjunction with --continue, continue the namespace creation loop even if
an error is encountered for intermediate namespaces.
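The interaction between --continue and --force can be modelled as a loop over creation attempts. The sketch below is a simulation with hypothetical names, not ndctl's implementation: each attempt is a callable that either returns a namespace name or raises, and the loop stops at the first error unless force is set.

```python
def create_all(attempts, force=False):
    """Run creation attempts in order and return the names that succeeded."""
    created = []
    for attempt in attempts:
        try:
            created.append(attempt())
        except RuntimeError:
            if not force:
                break  # without force, the first error aborts the loop
    return created


def succeed(name):
    """Demo helper: an attempt that succeeds and yields `name`."""
    return lambda: name


def fail():
    """Demo helper: an attempt that always fails."""
    raise RuntimeError("creation failed")
```

With the attempts [succeed("a"), fail, succeed("b")], the default run stops after "a", while force=True skips the failure and also creates "b".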
-L, --autolabel, --no-autolabel
Legacy NVDIMM devices do not support namespace labels. In
that case the kernel creates region-sized namespaces that can not be deleted.
Their mode can be changed, but they can not be resized smaller than their
parent region. This is termed a "label-less namespace". In contrast,
NVDIMMs and hypervisors that support the ACPI 6.2 label area definition (ACPI
6.2 Section 6.5.10 NVDIMM Label Methods) support "labelled namespace"
operation.
•There are two cases where the kernel will default
to label-less operation:
•NVDIMM does not support labels
•The NVDIMM supports labels, but the Label Index
Block (see UEFI 2.7) is not present and there is no capacity aliasing between
blk and pmem regions.
•In the latter case the configuration can be
upgraded to labelled operation by writing an index block on all DIMMs in a
region and re-enabling that region. The autolabel capability of
ndctl create-namespace --reconfig tries to do this by default if it can
determine that all DIMM capacity is referenced by the namespace being
reconfigured. It will otherwise fail to autolabel and remain in label-less
mode if it finds a DIMM contributes capacity to more than one region. This
check prevents inadvertent data loss if that other region is in active use.
The --autolabel option is implied by default; the --no-autolabel option can be
used to disable this behavior. When automatic labeling fails and labelled
operation is still desired, the safety policy can be bypassed with the following
commands. Note that all data on all regions is forfeited by running these
commands:
ndctl disable-region all
ndctl init-labels all
ndctl enable-region all
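The two decisions described in this section, when the kernel defaults to label-less operation and when autolabel is considered safe, can be written down as small predicates. This is a reading of the prose above with hypothetical function and parameter names, not the kernel's or ndctl's actual checks.

```python
def defaults_to_label_less(supports_labels, has_index_block, has_aliasing):
    """True in the two cases listed above."""
    if not supports_labels:
        return True  # case 1: the NVDIMM has no label support at all
    # case 2: labels supported, but no index block and no blk/pmem aliasing
    return not has_index_block and not has_aliasing


def can_autolabel(regions_per_dimm):
    """regions_per_dimm maps a DIMM name to the number of regions it
    contributes capacity to; autolabel is refused if any DIMM is shared."""
    return all(count == 1 for count in regions_per_dimm.values())
```

A DIMM that contributes to two regions makes can_autolabel return False, which corresponds to the safety policy that the disable-region, init-labels, enable-region sequence bypasses.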
-R, --autorecover, --no-autorecover
By default, if a namespace creation attempt fails, ndctl
will clean up the partially initialized namespace. Use --no-autorecover to
disable this behavior for debug and development scenarios where it is useful to
have the label and info-block state preserved after a failure.
-v, --verbose
Emit debug messages for the namespace creation process.
-r, --region=
A regionX device name, or a region id number.
Restrict the operation to the specified region(s). The keyword all can
be specified to indicate the lack of any restriction, however this is the same
as not supplying a --region option at all.
-b, --bus=
A bus id number, or a provider string (e.g.
"ACPI.NFIT"). Restrict the operation to the specified bus(es). The
keyword all can be specified to indicate the lack of any restriction,
however this is the same as not supplying a --bus option at all.