Documentation/arch/x86/tdx.rst

.. SPDX-License-Identifier: GPL-2.0

=====================================
Intel Trust Domain Extensions (TDX)
=====================================

Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from
the host and physical attacks by isolating the guest register state and by
encrypting the guest memory. In TDX, a special module running in a special
mode sits between the host and the guest and manages the guest/host
separation.

TDX Host Kernel Support
=======================

TDX introduces a new CPU mode called Secure Arbitration Mode (SEAM) and
a new isolated range pointed to by the SEAM Range Register (SEAMRR).  A
CPU-attested software module called 'the TDX module' runs inside the new
isolated range to provide the functionality needed to manage and run
protected VMs.

TDX also leverages Intel Multi-Key Total Memory Encryption (MKTME) to
provide crypto-protection to the VMs.  TDX reserves part of the MKTME
KeyIDs as TDX private KeyIDs, which are only accessible within SEAM mode.
The BIOS is responsible for partitioning legacy MKTME KeyIDs and TDX
KeyIDs.

Before the TDX module can be used to create and run protected VMs, it
must be loaded into the isolated range and properly initialized.  The TDX
architecture doesn't require the BIOS to load the TDX module, but the
kernel assumes it is loaded by the BIOS.

TDX boot-time detection
-----------------------

The kernel detects TDX by detecting TDX private KeyIDs during kernel
boot.  The following dmesg line appears when the BIOS has enabled TDX::

  [..] virt/tdx: BIOS enabled: private KeyID range: [16, 64)
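
The range in that message is half-open, so ``[16, 64)`` means 48 private
KeyIDs.  As a minimal userspace sketch, a monitoring tool could extract the
count as shown below.  The helper name ``tdx_private_keyid_count()`` is
hypothetical, and the message format is taken from the example line above;
it is not a stable interface.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse the half-open private KeyID range "[first, end)" from a dmesg
 * line.  Returns the number of private KeyIDs, or -1 if the line does
 * not contain a well-formed range.
 */
static int tdx_private_keyid_count(const char *line)
{
	const char *p = strstr(line, "private KeyID range: [");
	unsigned int first, end;

	if (!p)
		return -1;
	if (sscanf(p, "private KeyID range: [%u, %u)", &first, &end) != 2)
		return -1;
	if (end <= first)
		return -1;

	/* Half-open interval: 'end' itself is not a private KeyID. */
	return (int)(end - first);
}
```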

TDX module initialization
-------------------------

The kernel talks to the TDX module via the new SEAMCALL instruction.  The
TDX module implements SEAMCALL leaf functions to allow the kernel to
initialize it.

If the TDX module isn't loaded, the SEAMCALL instruction fails with a
special error.  In this case the kernel fails the module initialization
and reports that the module isn't loaded::

  [..] virt/tdx: module not loaded

Initializing the TDX module consumes roughly 1/256th of system RAM, which
is used as 'metadata' for the TDX memory.  It also takes additional CPU
time to initialize that metadata along with the TDX module itself.
Neither cost is trivial, so the kernel initializes the TDX module at
runtime, on demand.
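
To get a feel for the memory cost, the 1/256th figure can be turned into a
back-of-the-envelope estimate.  The helper below is only an illustration
with a hypothetical name; the real metadata size depends on the memory
layout and is reported by the kernel in dmesg.

```c
#include <stdint.h>

/*
 * Rough metadata cost of TDX module initialization: about 1/256th of
 * the memory handed to TDX.  This is an estimate, not the exact
 * allocation performed by the kernel.
 */
static uint64_t tdx_metadata_estimate_bytes(uint64_t tdx_memory_bytes)
{
	return tdx_memory_bytes / 256;
}
```

For example, 64 GiB of TDX memory comes out at about 256 MiB of metadata,
or roughly 0.4% of RAM.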

Besides initializing the TDX module, a per-cpu initialization SEAMCALL
must be done on a cpu before any other SEAMCALL can be made on that
cpu.

The kernel provides two functions, tdx_enable() and tdx_cpu_enable(), to
allow the user of TDX to enable the TDX module and to enable TDX on the
local cpu, respectively.

Making a SEAMCALL requires that VMXON has been done on that CPU.
Currently only KVM implements VMXON.  For now, tdx_enable() and
tdx_cpu_enable() don't do VMXON internally (doing so is not trivial), but
depend on the caller to guarantee it.

To enable TDX, the caller of TDX should: 1) temporarily disable CPU
hotplug; 2) do VMXON and tdx_cpu_enable() on all online cpus; 3) call
tdx_enable().  For example::

        cpus_read_lock();
        /* vmxon_and_tdx_cpu_enable() is a helper supplied by the caller */
        on_each_cpu(vmxon_and_tdx_cpu_enable, NULL, 1);
        ret = tdx_enable();
        cpus_read_unlock();
        if (ret)
                goto no_tdx;
        /* TDX is ready to use */

The caller of TDX must also guarantee that tdx_cpu_enable() has completed
successfully on a cpu before running any other SEAMCALL on that cpu.
A typical approach is to do both VMXON and tdx_cpu_enable() in the CPU
hotplug online callback, and to refuse to online the cpu if
tdx_cpu_enable() fails.

Users can consult dmesg to see whether the TDX module has been initialized.

If the TDX module is initialized successfully, dmesg shows something
like the following::

  [..] virt/tdx: 262668 KBs allocated for PAMT
  [..] virt/tdx: module initialized

If the TDX module fails to initialize, dmesg reports the failure::

  [..] virt/tdx: module initialization failed ...

TDX Interaction with Other Kernel Components
--------------------------------------------

TDX Memory Policy
~~~~~~~~~~~~~~~~~

TDX reports a list of "Convertible Memory Regions" (CMRs) to tell the
kernel which memory is TDX compatible.  The kernel needs to build a list
of memory regions (out of the CMRs) as "TDX-usable" memory and pass those
regions to the TDX module.  Once this is done, those "TDX-usable" memory
regions are fixed for the module's lifetime.

To keep things simple, the kernel currently guarantees that all pages
in the page allocator are TDX memory.  Specifically, the kernel uses all
system memory in the core-mm "at the time of TDX module initialization"
as TDX memory, and from then on refuses to online any non-TDX memory
through memory hotplug.
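
The hotplug policy boils down to a containment check: a range may only be
onlined if it lies entirely within memory that was handed to the TDX
module.  The following userspace sketch models that check.  The structure
and function names are illustrative assumptions, not the kernel's actual
data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* One contiguous block of TDX-usable memory, as a half-open PFN range. */
struct tdx_mem_block {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * Return true only if [start_pfn, end_pfn) lies entirely inside one of
 * the blocks recorded when the TDX module was initialized.  A range that
 * is only partly covered is rejected.
 */
static bool range_is_tdx_memory(const struct tdx_mem_block *blocks, size_t n,
				unsigned long start_pfn, unsigned long end_pfn)
{
	for (size_t i = 0; i < n; i++) {
		if (start_pfn >= blocks[i].start_pfn &&
		    end_pfn <= blocks[i].end_pfn)
			return true;
	}
	return false;
}
```

A memory hotplug notifier built on such a check would veto the online
request whenever it returns false.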

Physical Memory Hotplug
~~~~~~~~~~~~~~~~~~~~~~~

Note that TDX assumes convertible memory is always physically present
during the machine's runtime.  A non-buggy BIOS should never support
hot-removal of any convertible memory.  This implementation doesn't handle
ACPI memory removal but depends on the BIOS to behave correctly.

CPU Hotplug
~~~~~~~~~~~

The TDX module requires that the per-cpu initialization SEAMCALL be done
on a cpu before any other SEAMCALL can be made on that cpu.  The kernel
provides tdx_cpu_enable() to let the user of TDX do this when it wants to
use a new cpu for a TDX task.

TDX doesn't support physical (ACPI) CPU hotplug.  During machine boot,
TDX verifies that all boot-time present logical CPUs are TDX compatible
before enabling TDX.  A non-buggy BIOS should never support hot-add or
hot-removal of a physical CPU.  Currently the kernel doesn't handle
physical CPU hotplug, but depends on the BIOS to behave correctly.

Note that TDX works with logical CPU online/offline, so the kernel still
allows a logical CPU to be taken offline and brought online again.

Kexec()
~~~~~~~

TDX host support currently lacks the ability to handle kexec.  For
simplicity, only one of TDX host support and kexec can be enabled in the
Kconfig.  This will be fixed in the future.

Erratum
~~~~~~~

The first few generations of TDX hardware have an erratum.  A partial
write to a TDX private memory cacheline will silently "poison" the
line.  Subsequent reads will consume the poison and generate a machine
check.

A partial write is a memory write where a write transaction of less than
a cacheline lands at the memory controller.  The CPU does these via
non-temporal write instructions (like MOVNTI), or through UC/WC memory
mappings.  Devices can also do partial writes via DMA.
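
As a rough illustration of what "partial" means here, the check below
classifies a write that bypasses the cache by whether it covers only whole
cachelines.  It is a simplified model: the 64-byte cacheline size and the
function name are assumptions, and real hardware may split or combine
writes before they reach the memory controller.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHELINE_SIZE 64u	/* assumed x86 cacheline size */

/*
 * For a write that bypasses the cache, return true if it leaves at
 * least one cacheline only partly written, i.e. the write does not
 * both start and end on a cacheline boundary.
 */
static bool write_is_partial(uint64_t addr, uint64_t len)
{
	if (len == 0)
		return false;
	return (addr % CACHELINE_SIZE) != 0 || (len % CACHELINE_SIZE) != 0;
}
```

Under this model, an 8-byte MOVNTI store is always partial, while a
64-byte store to a 64-byte-aligned address is not.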

Theoretically, a kernel bug could do a partial write to TDX private memory
and trigger an unexpected machine check.  What's more, the machine check
code will present such an event as a "Hardware error" when it was, in
fact, triggered by software.  In practice, though, this issue is hard to
trigger.

If the platform has this erratum, the kernel prints an additional message
in the machine check handler to tell the user that the machine check may
have been caused by a kernel bug touching TDX private memory.

Interaction vs S3 and deeper states
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

TDX cannot survive S3 and deeper states.  The hardware resets and
disables TDX completely when the platform goes to S3 or deeper.  Both TDX
guests and the TDX module are destroyed permanently.

The kernel uses S3 for suspend-to-ram, and uses S4 and deeper states for
hibernation.  Currently, for simplicity, the kernel makes TDX mutually
exclusive with S3 and hibernation.

The kernel disables TDX during early boot when hibernation support is
available::

  [..] virt/tdx: initialization failed: Hibernation support is enabled

Add 'nohibernate' to the kernel command line to disable hibernation in
order to use TDX.

ACPI S3 is disabled during early kernel boot if TDX is enabled.  The user
needs to turn off TDX in the BIOS in order to use S3.

TDX Guest Support
=================

Since the host cannot directly access guest registers or memory, much
normal functionality of a hypervisor must be moved into the guest. This is
implemented using a Virtualization Exception (#VE) that is handled by the
guest kernel. Some #VEs are handled entirely inside the guest kernel, but
others require the hypervisor to be consulted.

TDX includes new hypercall-like mechanisms for communicating from the
guest to the hypervisor or the TDX module.

New TDX Exceptions
------------------

TDX guests behave differently from bare-metal and traditional VMX guests.
In TDX guests, otherwise normal instructions or memory accesses can cause
#VE or #GP exceptions.

Instructions marked with an '*' conditionally cause exceptions.  The
details for these instructions are discussed below.

Instruction-based #VE
~~~~~~~~~~~~~~~~~~~~~

- Port I/O (INS, OUTS, IN, OUT)
- HLT
- MONITOR, MWAIT
- WBINVD, INVD
- VMCALL
- RDMSR*, WRMSR*
- CPUID*

Instruction-based #GP
~~~~~~~~~~~~~~~~~~~~~

- All VMX instructions: INVEPT, INVVPID, VMCLEAR, VMFUNC, VMLAUNCH,
  VMPTRLD, VMPTRST, VMREAD, VMRESUME, VMWRITE, VMXOFF, VMXON
- ENCLS, ENCLU
- GETSEC
- RSM
- ENQCMD
- RDMSR*, WRMSR*

RDMSR/WRMSR Behavior
~~~~~~~~~~~~~~~~~~~~

MSR access behavior falls into three categories:

- #GP generated
- #VE generated
- "Just works"

In general, the #GP MSRs should not be used in guests.  Their use likely
indicates a bug in the guest.  The guest may try to handle the #GP with a
hypercall but it is unlikely to succeed.

The #VE MSRs are typically able to be handled by the hypervisor.  Guests
can make a hypercall to the hypervisor to handle the #VE.

The "just works" MSRs do not need any special guest handling.  They might
be implemented by directly passing through the MSR to the hardware or by
trapping and handling in the TDX module.  Other than possibly being slow,
these MSRs appear to function just as they would on bare metal.

CPUID Behavior
~~~~~~~~~~~~~~

For some CPUID leaves and sub-leaves, the virtualized bit fields of CPUID
return values (in guest EAX/EBX/ECX/EDX) are configurable by the
hypervisor. For such cases, the Intel TDX module architecture defines two
virtualization types:

- Bit fields for which the hypervisor controls the value seen by the guest
  TD.

- Bit fields for which the hypervisor configures the value such that the
  guest TD either sees their native value or a value of 0.  For these bit
  fields, the hypervisor can mask off the native values, but it cannot
  turn *on* values.
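
The difference between the two types can be expressed as simple bit
arithmetic.  The sketch below is an illustration only; the function names
and the idea of passing a field mask are assumptions made for the example,
not part of the TDX module interface.

```c
#include <stdint.h>

/* Type 1: the hypervisor chooses the value of the bit field directly. */
static uint32_t cpuid_configured(uint32_t field_mask, uint32_t host_value)
{
	return host_value & field_mask;
}

/*
 * Type 2: each bit is either the native bit or 0.  ANDing with the
 * native value means the hypervisor can clear bits but never set a bit
 * that the hardware does not report.
 */
static uint32_t cpuid_native_or_zero(uint32_t field_mask, uint32_t native,
				     uint32_t host_allow)
{
	return native & host_allow & field_mask;
}
```

With a native value of 0x0a and a hypervisor mask of 0x06, the second type
yields 0x02: bit 1 is passed through, bit 3 is masked off, and bit 2 stays
clear even though the hypervisor "allowed" it.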

A #VE is generated for CPUID leaves and sub-leaves that the TDX module does
not know how to handle. The guest kernel may ask the hypervisor for the
value with a hypercall.

#VE on Memory Accesses
----------------------

There are essentially two classes of TDX memory: private and shared.
Private memory receives full TDX protections.  Its content is protected
against access from the hypervisor.  Shared memory is expected to be
shared between guest and hypervisor and does not receive full TDX
protections.

A TD guest is in control of whether its memory accesses are treated as
private or shared.  It selects the behavior with a bit in its page table
entries.  This helps ensure that a guest does not place sensitive
information in shared memory, exposing it to the untrusted hypervisor.
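
As a small sketch of how such a bit can be manipulated, the helpers below
set, clear and test a "shared" bit in a page table entry value.  They
assume the shared bit is the topmost bit of the guest physical address
(bit 47 or bit 51, depending on the configured address width); the helper
names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the shared bit is the top bit of the guest physical address. */
static uint64_t shared_mask(unsigned int gpa_width)
{
	return 1ULL << (gpa_width - 1);
}

/* Mark a mapping as shared with the hypervisor. */
static uint64_t mark_shared(uint64_t pte, unsigned int gpa_width)
{
	return pte | shared_mask(gpa_width);
}

/* Mark a mapping as private (the default, fully protected state). */
static uint64_t mark_private(uint64_t pte, unsigned int gpa_width)
{
	return pte & ~shared_mask(gpa_width);
}

static bool is_shared(uint64_t pte, unsigned int gpa_width)
{
	return (pte & shared_mask(gpa_width)) != 0;
}
```

Because private is the state with the bit clear, a mapping is never shared
by accident; the guest has to opt in explicitly.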

#VE on Shared Memory
~~~~~~~~~~~~~~~~~~~~

Access to shared mappings can cause a #VE.  The hypervisor ultimately
controls whether a shared memory access causes a #VE, so the guest must be
careful to only reference shared pages where it can safely handle a #VE.
For instance, the guest should be careful not to access shared memory in
the #VE handler before it reads the #VE info structure (TDG.VP.VEINFO.GET).

Shared mapping content is entirely controlled by the hypervisor. The guest
should only use shared mappings for communicating with the hypervisor.
Shared mappings must never be used for sensitive memory content like kernel
stacks.  A good rule of thumb is that hypervisor-shared memory should be
treated the same as memory mapped to userspace.  Both the hypervisor and
userspace are completely untrusted.

MMIO for virtual devices is implemented as shared memory.  The guest must
be careful not to access device MMIO regions unless it is also prepared to
handle a #VE.

#VE on Private Pages
~~~~~~~~~~~~~~~~~~~~

An access to private mappings can also cause a #VE.  Since all kernel
memory is also private memory, the kernel might theoretically need to
handle a #VE on arbitrary kernel memory accesses.  This is not feasible, so
TDX guests ensure that all guest memory has been "accepted" before the
memory is used by the kernel.

A modest amount of memory (typically 512M) is pre-accepted by the firmware
before the kernel runs to ensure that the kernel can start up without
being subjected to a #VE.

The hypervisor is permitted to unilaterally move accepted pages to a
"blocked" state. However, if it does this, page access will not generate a
#VE.  It will, instead, cause a "TD Exit" where the hypervisor is required
to handle the exception.

Linux #VE handler
-----------------

Just like page faults or #GPs, #VE exceptions can be either handled or
fatal.  Typically, an unhandled userspace #VE results in a SIGSEGV.
An unhandled kernel #VE results in an oops.

Handling nested exceptions on x86 is typically nasty business.  A #VE
could be interrupted by an NMI which triggers another #VE and hilarity
ensues.  The TDX #VE architecture anticipated this scenario and includes a
feature to make it slightly less nasty.

During #VE handling, the TDX module ensures that all interrupts (including
NMIs) are blocked.  The block remains in place until the guest makes a
TDG.VP.VEINFO.GET TDCALL.  This allows the guest to control when interrupts
or a new #VE can be delivered.

However, the guest kernel must still be careful to avoid potential
#VE-triggering actions (discussed above) while this block is in place.
While the block is in place, any #VE is elevated to a double fault (#DF)
which is not recoverable.
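
The rule described above can be modelled as a tiny state machine.  This
is a conceptual sketch only; the type and function names are invented for
the example and do not correspond to kernel or TDX module interfaces.

```c
#include <stdbool.h>

enum ve_outcome { VE_DELIVERED, VE_DOUBLE_FAULT };

struct ve_state {
	bool blocked;	/* true from #VE delivery until VEINFO.GET */
};

/*
 * A new #VE is raised.  If the previous #VE info has not been retrieved
 * yet, the block is still in place and the result is a fatal #DF.
 */
static enum ve_outcome ve_raise(struct ve_state *s)
{
	if (s->blocked)
		return VE_DOUBLE_FAULT;
	s->blocked = true;
	return VE_DELIVERED;
}

/* Models the guest issuing TDG.VP.VEINFO.GET, which lifts the block. */
static void ve_info_get(struct ve_state *s)
{
	s->blocked = false;
}
```

The model makes the handler's obligation visible: between ``ve_raise()``
and ``ve_info_get()`` the handler must not do anything that could raise
another #VE.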

MMIO handling
-------------

In non-TDX VMs, MMIO is usually implemented by giving a guest access to a
mapping which will cause a VMEXIT on access, and then the hypervisor
emulates the access.  That is not possible in TDX guests because a VMEXIT
would expose the register state to the host. TDX guests don't trust the
host and can't have their state exposed to the host.

In TDX, MMIO regions typically trigger a #VE exception in the guest.  The
guest #VE handler then emulates the MMIO instruction inside the guest and
converts it into a controlled TDCALL to the host, rather than exposing
guest state to the host.

MMIO addresses on x86 are just special physical addresses. They can
theoretically be accessed with any instruction that accesses memory.
However, the kernel instruction decoding method is limited. It is only
designed to decode instructions like those generated by io.h macros.

MMIO access via other means (like structure overlays) may result in an
oops.

Shared Memory Conversions
-------------------------

All TDX guest memory starts out as private at boot.  This memory cannot
be accessed by the hypervisor.  However, some kernel users like device
drivers might have a need to share data with the hypervisor.  To do this,
memory must be converted between shared and private.  This can be
accomplished using some existing memory encryption helpers:

 * set_memory_decrypted() converts a range of pages to shared.
 * set_memory_encrypted() converts memory back to private.

Device drivers are the primary users of shared memory, but there's no need
to touch every driver. DMA buffers and ioremap() do the conversions
automatically.

TDX uses SWIOTLB for most DMA allocations. The SWIOTLB buffer is
converted to shared on boot.

For coherent DMA allocations, the DMA buffer is converted at allocation
time. Check force_dma_unencrypted() for details.

Attestation
===========

Attestation is used to verify the trustworthiness of the TDX guest to
other entities before provisioning secrets to the guest. For example, a
key server may want to use attestation to verify that the guest is the
desired one before releasing the encryption keys to mount the encrypted
rootfs or a secondary drive.

The TDX module records the state of the TDX guest in various stages of
the guest boot process using the build time measurement register (MRTD)
and runtime measurement registers (RTMR). Measurements related to the
guest initial configuration and firmware image are recorded in the MRTD
register. Measurements related to initial state, kernel image, firmware
image, command line options, initrd, ACPI tables, etc. are recorded in
RTMR registers. For more details, as an example, please refer to the TDX
Virtual Firmware design specification, section titled "TD Measurement".
At TDX guest runtime, the attestation process is used to attest to these
measurements.

The attestation process consists of two steps: TDREPORT generation and
Quote generation.

A TDX guest uses TDCALL[TDG.MR.REPORT] to get the TDREPORT
(TDREPORT_STRUCT) from the TDX module. TDREPORT is a fixed-size data
structure generated by the TDX module which contains guest-specific
information (such as build and boot measurements), the platform security
version, and the MAC that protects the integrity of the TDREPORT. A
user-provided 64-byte REPORTDATA is used as input and included in the
TDREPORT. Typically it is a nonce provided by the attestation service so
that the TDREPORT can be verified uniquely. More details about the
TDREPORT can be found in the Intel TDX Module specification, section
titled "TDG.MR.REPORT Leaf".
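
From guest userspace, the first step might look like the sketch below.
It assumes the guest kernel exposes a ``/dev/tdx_guest`` device with a
"get report" ioctl; the request layout, the 1024-byte report size and the
ioctl number are local assumptions modelled on the kernel's
``linux/tdx-guest.h`` UAPI header and should be checked against the target
kernel.  The names ``report_req``, ``GET_REPORT0`` and ``get_tdreport()``
are made up for this example.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define REPORTDATA_LEN 64	/* 64-byte REPORTDATA, as described above */
#define TDREPORT_LEN   1024	/* assumed size of TDREPORT_STRUCT */

/* Local mirror of the request layout assumed for the guest driver. */
struct report_req {
	uint8_t reportdata[REPORTDATA_LEN];	/* input: caller's nonce */
	uint8_t tdreport[TDREPORT_LEN];		/* output: the TDREPORT */
};

/* Assumed ioctl number; verify against the kernel's UAPI header. */
#define GET_REPORT0 _IOWR('T', 1, struct report_req)

/* Returns 0 and fills 'out' on success, -1 on any failure. */
static int get_tdreport(const uint8_t nonce[REPORTDATA_LEN],
			uint8_t out[TDREPORT_LEN])
{
	struct report_req req;
	int fd, ret;

	fd = open("/dev/tdx_guest", O_RDWR);
	if (fd < 0)
		return -1;	/* not a TDX guest, or driver not loaded */

	memset(&req, 0, sizeof(req));
	memcpy(req.reportdata, nonce, REPORTDATA_LEN);

	ret = ioctl(fd, GET_REPORT0, &req);
	if (ret == 0)
		memcpy(out, req.tdreport, TDREPORT_LEN);

	close(fd);
	return ret == 0 ? 0 : -1;
}
```

The nonce supplied by the attestation service goes in as REPORTDATA, which
ties the returned TDREPORT to that particular attestation request.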

After getting the TDREPORT, the second step of the attestation process
is to send it to the Quoting Enclave (QE) to generate the Quote. By
design, the TDREPORT can only be verified on the local platform because
the MAC key is bound to the platform. To support remote verification of
the TDREPORT, TDX leverages the Intel SGX Quoting Enclave to verify the
TDREPORT locally and convert it to a remotely verifiable Quote. The method
of sending the TDREPORT to the QE is implementation specific. Attestation
software can choose whatever communication channel is available (e.g.
vsock or TCP/IP) to send the TDREPORT to the QE and receive the Quote.

References
==========

TDX reference material is collected here:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html