.\" Automatically generated by Pod::Man 4.10 (Pod::Simple 3.35) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' . ds C` . ds C' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is >0, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .\" .\" Avoid warning from groff about undefined register 'F'. .de IX .. .nr rF 0 .if \n(.g .if rF .nr rF 1 .if (\n(rF:(\n(.g==0)) \{\ . if \nF \{\ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . if !\nF==2 \{\ . nr % 0 . nr F 2 . \} . \} .\} .rr rF .\" .\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2). .\" Fear. Run. Save yourself. No user-serviceable parts. . \" fudge factors for nroff and troff .if n \{\ . ds #H 0 . ds #V .8m . ds #F .3m . ds #[ \f1 . 
ds #] \fP .\} .if t \{\ . ds #H ((1u-(\\\\n(.fu%2u))*.13m) . ds #V .6m . ds #F 0 . ds #[ \& . ds #] \& .\} . \" simple accents for nroff and troff .if n \{\ . ds ' \& . ds ` \& . ds ^ \& . ds , \& . ds ~ ~ . ds / .\} .if t \{\ . ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u" . ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u' . ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u' . ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u' . ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u' . ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u' .\} . \" troff and (daisy-wheel) nroff accents .ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V' .ds 8 \h'\*(#H'\(*b\h'-\*(#H' .ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#] .ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H' .ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u' .ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#] .ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#] .ds ae a\h'-(\w'a'u*4/10)'e .ds Ae A\h'-(\w'A'u*4/10)'E . \" corrections for vroff .if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u' .if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u' . \" for low resolution devices (crt and lpr) .if \n(.H>23 .if \n(.V>19 \ \{\ . ds : e . ds 8 ss . ds o a . ds d- d\h'-1'\(ga . ds D- D\h'-1'\(hy . ds th \o'bp' . ds Th \o'LP' . ds ae ae . ds Ae AE .\} .rm #[ #] #H #V #F C .\" ======================================================================== .\" .IX Title "xen-tscmode 7" .TH xen-tscmode 7 "2021-03-24" "4.11.4" "Xen" .\" For nroff, turn off justification. Always turn off hyphenation; it makes .\" way too many mistakes in technical documents. .if n .ad l .nh .SH "NAME" xen\-tscmode \- Xen TSC (time stamp counter) and timekeeping discussion .SH "OVERVIEW" .IX Header "OVERVIEW" As of Xen 4.0, a new config option called tsc_mode may be specified for each domain. 
The default for tsc_mode handles the vast majority of hardware and software environments. This document is targeted for Xen users and administrators that may need to select a non-default tsc_mode. .PP Proper selection of tsc_mode depends on an understanding not only of the guest operating system (\s-1OS\s0), but also of the application set that will ever run on this guest \s-1OS.\s0 This is because tsc_mode applies equally to both the \s-1OS\s0 and \s-1ALL\s0 apps that are running on this domain, now or in the future. .PP Key questions to be answered for the \s-1OS\s0 and/or each application are: .IP "\(bu" 4 Does the OS/app use the rdtsc instruction at all? (We will explain below how to determine this.) .IP "\(bu" 4 At what frequency is the rdtsc instruction executed by either the \s-1OS\s0 or any running apps? If the sum exceeds about 10,000 rdtsc instructions per second per processor, we call this a \*(L"high-TSC-frequency\*(R" OS/app/environment. (This is relatively rare, and developers of \s-1OS\s0's and apps that are high-TSC-frequency are usually aware of it.) .IP "\(bu" 4 If the OS/app does use rdtsc, will it behave incorrectly if \*(L"time goes backwards\*(R" or if the frequency of the \s-1TSC\s0 suddenly changes? If so, we call this a \*(L"TSC-sensitive\*(R" app or \s-1OS\s0; otherwise it is \*(L"TSC-resilient\*(R". .PP This last is the \s-1US$64,000\s0 question as it may be very difficult (or, for legacy apps, even impossible) to predict all possible failure cases. As a result, unless proven otherwise, any app that uses rdtsc must be assumed to be TSC-sensitive and, as we will see, this is the default starting in Xen 4.0. .PP Xen's new tsc_mode parameter determines the circumstances under which the family of rdtsc instructions are executed \*(L"natively\*(R" vs emulated. 
Roughly speaking, native means rdtsc is fast but TSC-sensitive apps may, under unpredictable circumstances, run incorrectly; emulated means there is some performance degradation (unobservable in most cases), but TSC-sensitive apps will always run correctly. Prior to Xen 4.0, all rdtsc instructions were native: \*(L"fast but potentially incorrect.\*(R" Starting at Xen 4.0, the default is that all rdtsc instructions are \&\*(L"correct but potentially slow\*(R". The tsc_mode parameter in 4.0 provides an intelligent default but allows system administrators to adjust how rdtsc instructions are executed for each domain. .PP The non-default choices for tsc_mode are: .IP "\(bu" 4 \&\fBtsc_mode=1\fR (always emulate). .Sp All rdtsc instructions are emulated; this is the best choice when TSC-sensitive apps are running and it is necessary to understand worst-case performance degradation for a specific hardware environment. .IP "\(bu" 4 \&\fBtsc_mode=2\fR (never emulate). .Sp This is the same as prior to Xen 4.0 and is the best choice if it is certain that all apps running in this \s-1VM\s0 are TSC-resilient and highest performance is required. .IP "\(bu" 4 \&\fBtsc_mode=3\fR (\s-1PVRDTSCP\s0). .Sp High-TSC-frequency apps may be paravirtualized (modified) to obtain both correctness and highest performance; any unmodified apps must be TSC-resilient. .PP If tsc_mode is left unspecified (or set to \fBtsc_mode=0\fR), a hybrid algorithm is utilized to ensure correctness while providing the best performance possible given: .IP "\(bu" 4 the requirement of correctness, .IP "\(bu" 4 the underlying hardware, and .IP "\(bu" 4 whether or not the \s-1VM\s0 has been saved/restored/migrated .PP To understand this in more detail, the rest of this document must be read.
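.PP
As a rough illustration of the choices above, a domain configuration
file might select a mode as sketched below. The guest name is
hypothetical and the numeric values are the ones described in this
document; the spelling accepted for this option should be confirmed
against the xl.cfg documentation for the installed Xen release.
.PP
.Vb 5
\&    # Hypothetical guest configuration fragment (illustration only).
\&    name = "guest1"
\&    # 0 = default hybrid, 1 = always emulate,
\&    # 2 = never emulate,  3 = PVRDTSCP
\&    tsc_mode = 1
.Ve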
.SH "DETERMINING RDTSC FREQUENCY" .IX Header "DETERMINING RDTSC FREQUENCY" To determine the frequency of rdtsc instructions that are emulated, an \*(L"xl\*(R" command can be used by a privileged user of domain0. The command: .PP .Vb 1 \& # xl debug\-key s; xl dmesg | tail .Ve .PP provides information about \s-1TSC\s0 usage in each domain where \s-1TSC\s0 emulation is currently enabled. .SH "TSC HISTORY" .IX Header "TSC HISTORY" To understand tsc_mode completely, some background on \s-1TSC\s0 is required: .PP The x86 \*(L"timestamp counter\*(R", or \s-1TSC,\s0 is a 64\-bit register on each processor that increases monotonically. Historically, \s-1TSC\s0 incremented every processor cycle, but on recent processors, it increases at a constant rate even if the processor changes frequency (for example, to reduce processor power usage). \s-1TSC\s0 is known by x86 programmers as the fastest, highest-precision measurement of the passage of time so it is often used as a foundation for performance monitoring. And since it is guaranteed to be monotonically increasing and, at 64 bits, is guaranteed to not wraparound within 10 years, it is sometimes used as a random number or a unique sequence identifier, such as to stamp transactions so they can be replayed in a specific order. .PP On most older \s-1SMP\s0 and early multi-core machines, \s-1TSC\s0 was not synchronized between processors. Thus if an application were to read the \s-1TSC\s0 on one processor, then was moved by the \s-1OS\s0 to another processor, then read \&\s-1TSC\s0 again, it might appear that \*(L"time went backwards\*(R". This loss of monotonicity resulted in many obscure application bugs when TSC-sensitive apps were ported from a uniprocessor to an \s-1SMP\s0 environment; as a result, many applications \*(-- especially in the Windows world \*(-- removed their dependency on \s-1TSC\s0 and replaced their timestamp needs with OS-specific functions, losing both performance and precision. 
On some more recent generations of multi-core machines, especially multi-socket multi-core machines, the \s-1TSC\s0 was synchronized but if one processor were to enter certain low-power states, its \s-1TSC\s0 would stop, destroying the synchrony and again causing obscure bugs. This reinforced decisions to avoid use of \s-1TSC\s0 altogether. On the most recent generations of multi-core machines, however, synchronization is provided across all processors in all power states, even on multi-socket machines, and a flag is provided that indicates that \s-1TSC\s0 is synchronized and \*(L"invariant\*(R". Thus \&\s-1TSC\s0 is once again useful for applications, and even newer operating systems are using and depending upon \s-1TSC\s0 for critical timekeeping tasks when running on these recent machines. .PP We will refer to hardware that ensures \s-1TSC\s0 is both synchronized and invariant as \*(L"TSC-safe\*(R" and any hardware on which \s-1TSC\s0 is not (or may not remain) synchronized as \*(L"TSC-unsafe\*(R". .PP As a result of \s-1TSC\s0's sordid history, two classes of applications use \&\s-1TSC:\s0 old applications designed for single processors, and the most recent enterprise applications which require high-frequency high-precision timestamping. .PP We will refer to apps that might break if running on a TSC-unsafe machine as \*(L"TSC-sensitive\*(R"; apps that don't use \s-1TSC,\s0 or that use \&\s-1TSC\s0 in a way for which monotonicity and frequency invariance are unimportant, are \*(L"TSC-resilient\*(R". .PP The emergence of virtualization once again complicates the usage of \&\s-1TSC.\s0 When features such as save/restore or live migration are employed, a guest \s-1OS\s0 and all its currently running applications may be invisibly transported to an entirely different physical machine. While \s-1TSC\s0 may be \*(L"safe\*(R" on one machine, it is essentially impossible to precisely synchronize \s-1TSC\s0 across a data center or even a pool of machines. 
As a result, when run in a virtualized environment, rare and obscure \&\*(L"time going backwards\*(R" problems might once again occur for those TSC-sensitive applications. Worse, if a guest \s-1OS\s0 moves from, for example, a 3GHz machine to a 1.5GHz machine, attempts by an OS/app to measure time intervals with \s-1TSC\s0 may without notice be incorrect by a factor of two. .PP The rdtsc (read timestamp counter) instruction is used to read the \&\s-1TSC\s0 register. The rdtscp instruction is a variant of rdtsc on recent processors. We refer to these together as the rdtsc family of instructions, or just \*(L"rdtsc\*(R". Instructions in the rdtsc family are non-privileged, but privileged software may set a control bit to cause all rdtsc family instructions to trap. This trap can be detected by Xen, which can then transparently \*(L"emulate\*(R" the results of the rdtsc instruction and return control to the code following the rdtsc instruction. .PP To provide a \*(L"safe\*(R" \s-1TSC,\s0 i.e. to ensure both \s-1TSC\s0 monotonicity and a fixed rate, Xen provides rdtsc emulation whenever necessary or when explicitly specified by a per-VM configuration option. \s-1TSC\s0 emulation is relatively slow \*(-- roughly 15\-20 times slower than the rdtsc instruction when executed natively. However, except when an \s-1OS\s0 or application uses the rdtsc instruction at a high frequency (e.g. more than about 10,000 times per second per processor), this performance degradation is not noticeable (i.e. <0.3%). In addition, \s-1TSC\s0 emulation is nearly always faster than OS-provided alternatives (e.g. Linux's gettimeofday). For environments where it is certain that all apps are TSC-resilient (i.e. \&\*(L"TSC-safeness\*(R" is not necessary) and highest performance is a requirement, \s-1TSC\s0 emulation may be entirely disabled (tsc_mode==2). .PP The default mode (tsc_mode==0) checks TSC-safeness of the underlying hardware on which the virtual machine is launched. 
If it is TSC-safe, rdtsc will execute at hardware speed; if it is not, rdtsc will be emulated. Once a virtual machine is saved/restored or migrated, however, there are two possibilities: \s-1TSC\s0 remains native \s-1IF\s0 the source physical machine and target physical machine have the same \s-1TSC\s0 frequency (or, for \s-1HVM/PVH\s0 guests, if \s-1TSC\s0 scaling support is available); else \s-1TSC\s0 is emulated. Note that, though emulated, the \*(L"apparent\*(R" \s-1TSC\s0 frequency will be the \s-1TSC\s0 frequency of the initial physical machine, even after migration. .PP For environments where both TSC-safeness \s-1AND\s0 highest performance even across migration are required, application code can be specially modified to use an algorithm explicitly designed into Xen for this purpose. This mode (tsc_mode==3) is called \s-1PVRDTSCP,\s0 because it requires app paravirtualization (awareness by the app that it may be running on top of Xen), and utilizes a variation of the rdtsc instruction called rdtscp that is available on most recent generation processors. (The rdtscp instruction differs from the rdtsc instruction in that it reads not only the \s-1TSC\s0 but an additional register set by system software.) When a pvrdtscp-modified app is running on a processor that is both TSC-safe and supports the rdtscp instruction, information can be obtained about migration and \s-1TSC\s0 frequency/offset adjustment to allow the vast majority of timestamps to be obtained at top performance; when running on a TSC-unsafe processor or a processor that doesn't support the rdtscp instruction, rdtscp is emulated. .PP \&\s-1PVRDTSCP\s0 (tsc_mode==3) has two limitations. First, it applies to all apps running in this virtual machine. This means that all apps must either be TSC-resilient or pvrdtscp-modified. Second, highest performance is only obtained on TSC-safe machines that support the rdtscp instruction; when running on older machines, rdtscp is emulated and thus slower. 
For more information on \s-1PVRDTSCP,\s0 see below. .PP Finally, tsc_mode==1 always enables \s-1TSC\s0 emulation, regardless of the underlying physical hardware. The \*(L"apparent\*(R" \s-1TSC\s0 frequency will be the \s-1TSC\s0 frequency of the initial physical machine, even after migration. This mode is useful to measure any performance degradation that might be encountered by a tsc_mode==0 domain after migration occurs, or a tsc_mode==3 domain when it is running on TSC-unsafe hardware. .PP Note that while Xen ensures that an emulated \s-1TSC\s0 is \*(L"safe\*(R" across migration, it does not ensure that it continues to tick at the same rate during the actual migration. As an oversimplified example, if \s-1TSC\s0 is ticking once per second in a guest, and the guest is saved when the \s-1TSC\s0 is 1000, then restored 30 seconds later, \s-1TSC\s0 is only guaranteed to be greater than or equal to 1001, not precisely 1030. This has some \s-1OS\s0 implications as will be seen in the next section. .SH "TSC INVARIANT BIT and NO_MIGRATE" .IX Header "TSC INVARIANT BIT and NO_MIGRATE" Related to \s-1TSC\s0 emulation, the \*(L"\s-1TSC\s0 Invariant\*(R" bit is architecturally defined in a cpuid bit on the most recent x86 processors. If set, \s-1TSC\s0 invariance ensures that the \s-1TSC\s0 is \*(L"safe\*(R", that is it will increment at a constant rate regardless of power events, will be synchronized across all processors, and was properly initialized to zero on all processors at boot-time by system hardware/BIOS. As long as system software never writes to \s-1TSC, TSC\s0 will be safe and continuously incremented at a fixed rate and thus can be used as a system \*(L"clocksource\*(R". .PP This bit is used by some \s-1OS\s0's, and specifically by Linux starting with version 2.6.30(?), to select \s-1TSC\s0 as a system clocksource. Once selected, \&\s-1TSC\s0 remains the Linux system clocksource unless manually overridden. 
In a virtualized environment, since it is not possible to synchronize \s-1TSC\s0 across all the machines in a pool or data center, a migration may \*(L"break\*(R" \&\s-1TSC\s0 as a usable clocksource; while time will not go backwards, it may not track wallclock time well enough to avoid certain time-sensitive consequences. As a result, Xen can only expose the \s-1TSC\s0 Invariant bit to a guest \s-1OS\s0 if it is certain that the domain will never migrate. As of Xen 4.0, the \*(L"no_migrate=1\*(R" \s-1VM\s0 configuration option may be specified to disable migration. If no_migrate is selected and the \s-1VM\s0 is running on a physical machine with \*(L"\s-1TSC\s0 Invariant\*(R", Linux 2.6.30+ will safely use \s-1TSC\s0 as the system clocksource. But, attempts to migrate or, once saved, restore this domain will fail. .PP There is another cpuid-related complication: The x86 cpuid instruction is non-privileged. \s-1HVM\s0 domains are configured to always trap this instruction to Xen, where Xen can \*(L"filter\*(R" the result. In a \s-1PV OS,\s0 all cpuid instructions have been replaced by a paravirtualized equivalent of the cpuid instruction (\*(L"pvcpuid\*(R") and also trap to Xen. But apps in a \s-1PV\s0 guest that use a cpuid instruction execute it directly, without a trap to Xen. As a result, an app may directly examine the physical \s-1TSC\s0 Invariant cpuid bit and make decisions based on that bit. This is still an unsolved problem, though a workaround exists as part of the \s-1PVRDTSCP\s0 tsc_mode for apps that can be modified. .SH "MORE ON PVRDTSCP" .IX Header "MORE ON PVRDTSCP" Paravirtualized \s-1OS\s0's use the \*(L"pvclock\*(R" algorithm to manage the passing of time. This sophisticated algorithm obtains information from a memory page shared between Xen and the \s-1OS\s0 and selects information from this page based on the current virtual \s-1CPU\s0 (vcpu) in order to properly adapt to TSC-unsafe systems and changes that occur across migration. 
Neither this shared page nor the vcpu information is available to a userland app so the pvclock algorithm cannot be directly used by an app, at least without performance degradation roughly equal to the cost of just emulating an rdtsc. .PP As a result, as of 4.0, Xen provides capabilities for a userland app to obtain key time values similar to the information accessible to the \s-1PV OS\s0 pvclock algorithm. The app uses the rdtscp instruction which is defined in recent processors to obtain both the \s-1TSC\s0 and an auxiliary value called \s-1TSC_AUX.\s0 Xen is responsible for setting \s-1TSC_AUX\s0 to the same value on all vcpus running any domain with tsc_mode==3; further, Xen tools are responsible for monotonically incrementing \s-1TSC_AUX\s0 anytime the domain is restored/migrated (thus changing key time values); and, when the domain is running on a physical machine that either is not TSC-safe or does not support the rdtscp instruction, Xen is responsible for emulating the rdtscp instruction and for setting \&\s-1TSC_AUX\s0 to zero on all processors. .PP Xen also provides pvclock information via a \*(L"pvcpuid\*(R" instruction. While this results in a slow trap, the information changes (and thus must be reobtained via pvcpuid) \s-1ONLY\s0 when \s-1TSC_AUX\s0 has changed, which should be very rare relative to a high frequency of rdtscp instructions. .PP Finally, Xen provides additional time-related information via other pvcpuid instructions. First, an app is capable of determining if it is currently running on Xen, next the tsc_mode setting of the domain in which it is running, and finally whether the underlying hardware is TSC-safe and supports the rdtscp instruction. .PP As a result, a pvrdtscp-modified app has sufficient information to compute the pvclock \*(L"elapsed nanoseconds\*(R" which can be used as a timestamp. 
And this can be done nearly as fast as a native rdtsc instruction, much faster than emulation, and also much faster than nearly all OS-provided time mechanisms. While pvrdtscp is too complex for most apps, certain enterprise TSC-sensitive high-TSC-frequency apps may find it useful to obtain a significant performance gain. .SH "HARDWARE TSC SCALING" .IX Header "HARDWARE TSC SCALING" Intel \s-1VMX TSC\s0 scaling and \s-1AMD SVM TSC\s0 ratio allow the guest \s-1TSC\s0 read by guest rdtsc/rdtscp to increase at a different frequency than the host \&\s-1TSC\s0 frequency. .PP If a \s-1HVM\s0 container in default \s-1TSC\s0 mode (tsc_mode=0) or \s-1PVRDTSCP\s0 mode (tsc_mode=3) is created on a host that provides constant \s-1TSC,\s0 its guest \s-1TSC\s0 frequency will be the same as the host. If it is later migrated to another host that provides constant \s-1TSC\s0 and supports Intel \&\s-1VMX TSC\s0 scaling/AMD \s-1SVM TSC\s0 ratio, its guest \s-1TSC\s0 frequency will be the same before and after migration. .PP For the above \s-1HVM\s0 container in default \s-1TSC\s0 mode (tsc_mode=0), if both hosts support rdtscp, both guest rdtsc and rdtscp instructions will be executed natively before and after migration. .PP For the above \s-1HVM\s0 container in \s-1PVRDTSCP\s0 mode (tsc_mode=3), if the destination host does not support rdtscp, the guest rdtscp instruction will be emulated with the guest \s-1TSC\s0 frequency. .SH "AUTHORS" .IX Header "AUTHORS" Dan Magenheimer