.\" Automatically generated by Pod::Man 4.14 (Pod::Simple 3.40)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear.  Run.  Save yourself.  No user-serviceable parts.
.    \" fudge factors for nroff and troff
.if n \{\
.    ds #H 0
.    ds #V .8m
.    ds #F .3m
.    ds #[ \f1
.    ds #] \fP
.\}
.if t \{\
.    ds #H ((1u-(\\\\n(.fu%2u))*.13m)
.    ds #V .6m
.    ds #F 0
.    ds #[ \&
.    ds #] \&
.\}
.    \" simple accents for nroff and troff
.if n \{\
.    ds ' \&
.    ds ` \&
.    ds ^ \&
.    ds , \&
.    ds ~ ~
.    ds /
.\}
.if t \{\
.    ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
.    ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
.    ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
.    ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
.    ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
.    ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
.    \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
.    \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
.    \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
.    ds : e
.    ds 8 ss
.    ds o a
.    ds d- d\h'-1'\(ga
.    ds D- D\h'-1'\(hy
.    ds th \o'bp'
.    ds Th \o'LP'
.    ds ae ae
.    ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "xl-numa-placement 7"
.TH xl-numa-placement 7 "2023-03-23" "4.14.5" "Xen"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
xl\-numa\-placement \- Guest Automatic NUMA Placement in libxl and xl
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
.SS "Rationale"
.IX Subsection "Rationale"
\&\s-1NUMA\s0 (which stands for Non-Uniform Memory Access) means that the memory
access times of a program running on a \s-1CPU\s0 depend on the relative
distance between that \s-1CPU\s0 and that memory. In fact, most \s-1NUMA\s0
systems are built in such a way that each processor has its local memory,
on which it can operate very fast. On the other hand, getting and storing
data from and on remote memory (that is, memory local to some other processor)
is considerably more complex and slower. On these machines, a \s-1NUMA\s0 node is usually
defined as a set of processor cores (typically a physical \s-1CPU\s0 package) and
the memory directly attached to the set of cores.
.PP
\&\s-1NUMA\s0 awareness becomes very important as soon as many domains start
running memory-intensive workloads on a shared host. In fact, the cost
of accessing non node-local memory locations is very high, and the
performance degradation is likely to be noticeable.
.PP
For more information, have a look at the Xen \s-1NUMA\s0 Introduction <https://wiki.xenproject.org/wiki/Xen_on_NUMA_Machines>
page on the Wiki.
.SS "Xen and \s-1NUMA\s0 machines: the concept of \fInode-affinity\fP"
.IX Subsection "Xen and NUMA machines: the concept of node-affinity"
The Xen hypervisor deals with \s-1NUMA\s0 machines through the concept of
\&\fInode-affinity\fR. The node-affinity of a domain is the set of \s-1NUMA\s0 nodes
of the host where the memory for the domain is being allocated (mostly,
at domain creation time). This is, at least in principle, different from and
unrelated to the vCPU (hard and soft, see below) scheduling affinity,
which instead is the set of pCPUs where the vCPU is allowed (or prefers)
to run.
.PP
Of course, despite the fact that they belong to and affect different
subsystems, the domain node-affinity and the vCPUs' affinity are not
completely independent.
In fact, if the domain node-affinity is not explicitly specified by the
user, via the proper libxl calls or xl config item, it will be computed
based on the vCPUs' scheduling affinity.
.PP
Notice that, even if the node-affinity of a domain may change on-line,
it is very important to \*(L"place\*(R" the domain correctly when it is first
created, as most of its memory is allocated at that time and cannot
(for now) be moved easily.
.SS "Placing via pinning and cpupools"
.IX Subsection "Placing via pinning and cpupools"
The simplest way of placing a domain on a \s-1NUMA\s0 node is setting the hard
scheduling affinity of the domain's vCPUs to the pCPUs of the node. This
also goes under the name of vCPU pinning, and can be done through the
\&\*(L"cpus=\*(R" option in the config file (more about this below). Another option
is to pool together the pCPUs spanning the node and put the domain in
such a \fIcpupool\fR with the \*(L"pool=\*(R" config option (as documented in our
Wiki <https://wiki.xenproject.org/wiki/Cpupools_Howto>).
.PP
In both the above cases, the domain will not be able to execute outside
the specified set of pCPUs for any reason, even if all those pCPUs are
busy doing something else while there are other, idle pCPUs.
.PP
So, when doing this, local memory accesses are 100% guaranteed, but that
may come at the cost of some load imbalances.
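.PP
As a minimal sketch of the two approaches, assume a host where pCPUs 0 to 3
belong to \s-1NUMA\s0 node 0, and where a cpupool with the hypothetical name
\&\*(L"Pool\-node0\*(R" containing those pCPUs has already been created. Either of the
following xl config fragments would then confine the domain to that node
(the pCPU range and the pool name are illustrative only, not taken from a
real host):
.PP
.Vb 2
\&    # Pin all the vCPUs of the domain to the pCPUs of node 0
\&    cpus = "0\-3"
.Ve
.PP
.Vb 2
\&    # Or: put the domain in a cpupool made of the pCPUs of node 0
\&    pool = "Pool\-node0"
.Ve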
.SS "\s-1NUMA\s0 aware scheduling"
.IX Subsection "NUMA aware scheduling"
If using the credit1 scheduler, and starting from Xen 4.3, the scheduler
itself always tries to run the domain's vCPUs on one of the nodes in
its node-affinity. Only if that turns out to be impossible will it
pick any free pCPU. Locality of access is less guaranteed than in the
pinning case, but that comes along with better chances to exploit all
the host resources (e.g., the pCPUs).
.PP
Starting from Xen 4.5, credit1 supports two forms of affinity: hard and
soft, both on a per-vCPU basis. This means each vCPU can have its own
soft affinity, stating where that vCPU prefers to execute. This is
less strict than what (also starting from 4.5) is called hard affinity,
as the vCPU can potentially run everywhere, it just prefers some pCPUs
rather than others.
In Xen 4.5, therefore, NUMA-aware scheduling is achieved by matching the
soft affinity of the vCPUs of a domain with its node-affinity.
.PP
In fact, as it was for 4.3, if all the pCPUs in a vCPU's soft affinity
are busy, it is possible for the vCPU to run outside of it. The
idea is that slower execution (due to remote memory accesses) is still
better than no execution at all (as would happen with pinning). For
this reason, \s-1NUMA\s0 aware scheduling has the potential of bringing
substantial performance benefits, although this will depend on the
workload.
.PP
Notice that, for each vCPU, the following three scenarios are possible:
.IP "\(bu" 4
a vCPU \fIis pinned\fR to some pCPUs and \fIdoes not have\fR any soft affinity.
In this case, the vCPU is always scheduled on one of the pCPUs to which
it is pinned, without any specific preference among them;
.IP "\(bu" 4
a vCPU \fIhas\fR its own soft affinity and \fIis not\fR pinned to any particular
pCPU. In this case, the vCPU can run on every pCPU. Nevertheless, the
scheduler will try to have it running on one of the pCPUs in its soft
affinity;
.IP "\(bu" 4
a vCPU \fIhas\fR its own soft affinity and \fIis also\fR pinned to some
pCPUs. In this case, the vCPU is always scheduled on one of the pCPUs
to which it is pinned, with, among them, a preference for the ones
that are also in its soft affinity. In case pinning and soft affinity
form two disjoint sets of pCPUs, pinning \*(L"wins\*(R", and the soft affinity
is just ignored.
.SS "Guest placement in xl"
.IX Subsection "Guest placement in xl"
If using xl for creating and managing guests, it is very easy to ask for
either manual or automatic placement of them across the host's \s-1NUMA\s0 nodes.
.PP
Note that xm/xend does a very similar thing, the only differences being
the details of the heuristics adopted for automatic placement (see below),
and the lack of support (in both xm/xend and the Xen versions where that
was the default toolstack) for \s-1NUMA\s0 aware scheduling.
.SS "Placing the guest manually"
.IX Subsection "Placing the guest manually"
Thanks to the \*(L"cpus=\*(R" option, it is possible to specify where a domain
should be created and scheduled, directly in its config file. This
affects \s-1NUMA\s0 placement and memory accesses as, in this case, the
hypervisor constructs the node-affinity of a \s-1VM\s0 based directly on its
vCPU pinning when it is created.
.PP
This is very simple and effective, but requires the user/system
administrator to explicitly specify the pinning for each and every domain,
or Xen won't be able to guarantee the locality for their memory accesses.
.PP
That, of course, also means the vCPUs of the domain will only be able to
execute on those same pCPUs.
.PP
It is also possible to have a \*(L"cpus_soft=\*(R" option in the xl config file,
to specify the soft affinity for all the vCPUs of the domain. This affects
the \s-1NUMA\s0 placement in the following way:
.IP "\(bu" 4
if only \*(L"cpus_soft=\*(R" is present, the \s-1VM\s0's node-affinity will be equal
to the nodes to which the pCPUs in the soft affinity mask belong;
.IP "\(bu" 4
if both \*(L"cpus_soft=\*(R" and \*(L"cpus=\*(R" are present, the \s-1VM\s0's node-affinity
will be equal to the nodes to which the pCPUs present both in hard and
soft affinity belong.
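.PP
For instance, on a hypothetical host where pCPUs 0 to 3 belong to node 0 and
pCPUs 4 to 7 belong to node 1 (a layout assumed only for this example), the
second case could look like the following:
.PP
.Vb 3
\&    # vCPUs may only run on pCPUs 2\-5, and prefer pCPUs 4\-7
\&    cpus      = "2\-5"
\&    cpus_soft = "4\-7"
.Ve
.PP
Here the pCPUs present in both masks are 4 and 5, which belong to node 1, so,
following the rule above, the node-affinity of the \s-1VM\s0 would be node 1.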
.SS "Placing the guest automatically"
.IX Subsection "Placing the guest automatically"
If neither \*(L"cpus=\*(R" nor \*(L"cpus_soft=\*(R" is present in the config file, libxl
tries to figure out on its own on which node(s) the domain could fit best.
If it finds one (some), the domain's node-affinity is set to it (them),
and both memory allocations and \s-1NUMA\s0 aware scheduling (for the credit
scheduler and starting from Xen 4.3) will comply with it. Starting from
Xen 4.5, this also means that the mask resulting from this \*(L"fitting\*(R"
procedure will become the soft affinity of all the vCPUs of the domain.
.PP
It is worth noting that optimally fitting a set of VMs on the \s-1NUMA\s0
nodes of a host is an incarnation of the Bin Packing Problem. In fact,
the various VMs with different memory sizes are the items to be packed,
and the host nodes are the bins. As this problem is known to be NP-hard,
libxl relies on some heuristics.
.PP
The first thing to do is find the nodes or the sets of nodes (from now
on referred to as 'candidates') that have enough free memory and enough
physical CPUs for accommodating the new domain. The idea is to find a
spot for the domain with at least as much free memory as it is configured
to have, and as many pCPUs as it has vCPUs.  After that, the actual
decision on which candidate to pick happens according to the following
heuristics:
.IP "\(bu" 4
candidates involving fewer nodes are considered better. In case
two (or more) candidates span the same number of nodes,
.IP "\(bu" 4
candidates with a smaller number of vCPUs runnable on them (due
to previous placement and/or plain vCPU pinning) are considered
better. In case the same number of vCPUs can run on two (or more)
candidates,
.IP "\(bu" 4
the candidate with the greatest amount of free memory is
considered to be the best one.
.PP
Giving preference to candidates with fewer nodes ensures better
performance for the guest, as it avoids spreading its memory among
different nodes. Favoring candidates with fewer vCPUs already runnable
there ensures a good balance of the overall host load. Finally, if more
than one candidate fulfils these criteria, prioritizing the nodes that have the
largest amounts of free memory helps keep the memory fragmentation
small, and maximizes the probability of being able to put more domains
there.
.SS "Guest placement in libxl"
.IX Subsection "Guest placement in libxl"
xl achieves automatic \s-1NUMA\s0 placement because that is what libxl does
by default. No \s-1API\s0 is provided (yet) for modifying the behaviour of
the placement algorithm. However, if your program is calling libxl,
it is possible to set the \f(CW\*(C`numa_placement\*(C'\fR build info key to \f(CW\*(C`false\*(C'\fR
(it is \f(CW\*(C`true\*(C'\fR by default) with something like the below, to prevent
any placement from happening:
.PP
.Vb 1
\&    libxl_defbool_set(&domain_build_info\->numa_placement, false);
.Ve
.PP
Also, if \f(CW\*(C`numa_placement\*(C'\fR is set to \f(CW\*(C`true\*(C'\fR, the domain's vCPUs must
not be pinned (i.e., \f(CW\*(C`domain_build_info\->cpumap\*(C'\fR must have all its
bits set, as it is by default), or domain creation will fail with
\&\f(CW\*(C`ERROR_INVAL\*(C'\fR.
.PP
Starting from Xen 4.3, in case automatic placement happens (and is
successful), it will affect the domain's node-affinity and \fInot\fR its
vCPU pinning. Namely, the domain's vCPUs will not be pinned to any
pCPU on the host, but the memory for the domain will come from the
selected node(s) and the \s-1NUMA\s0 aware scheduling (if the credit scheduler
is in use) will try to keep the domain's vCPUs there as much as possible.
.PP
Besides that, for looking at and/or tweaking the placement algorithm,
search for \*(L"Automatic \s-1NUMA\s0 placement\*(R" in libxl_internal.h.
.PP
Note this may change in future versions of Xen/libxl.
.SS "Xen < 4.5"
.IX Subsection "Xen < 4.5"
The concept of vCPU soft affinity was introduced for the first time
in Xen 4.5. In 4.3, it is the domain's node-affinity that drives the
NUMA-aware scheduler. The main difference is that soft affinity is per-vCPU,
and so each vCPU can have its own mask of pCPUs, while node-affinity is
per-domain, which is the equivalent of having all the vCPUs with the same
soft affinity.
.SS "Xen < 4.3"
.IX Subsection "Xen < 4.3"
As \s-1NUMA\s0 aware scheduling is a new feature of Xen 4.3, things are a little
bit different for earlier versions of Xen. If no \*(L"cpus=\*(R" option is specified
and Xen 4.2 is in use, the automatic placement algorithm still runs, but
the result is used to \fIpin\fR the vCPUs of the domain to the output node(s).
This is consistent with what was happening with xm/xend.
.PP
On a version of Xen earlier than 4.2, there is no automatic placement at
all in xl or libxl, and hence no node-affinity, vCPU affinity or pinning
is introduced or modified.
.SS "Limitations"
.IX Subsection "Limitations"
Analyzing various possible placement solutions is what makes the
algorithm flexible and quite effective. However, that also means
it won't scale well to systems with an arbitrary number of nodes.
For this reason, automatic placement is disabled (with a warning)
if it is requested on a host with more than 16 \s-1NUMA\s0 nodes.