Documentation/virt/hyperv/vpci.rst

   1 .. SPDX-License-Identifier: GPL-2.0
   2
   3 PCI pass-thru devices
   4 =========================
   5 In a Hyper-V guest VM, PCI pass-thru devices (also called
   6 virtual PCI devices, or vPCI devices) are physical PCI devices
   7 that are mapped directly into the VM's physical address space.
   8 Guest device drivers can interact directly with the hardware
   9 without intermediation by the host hypervisor.  This approach
  10 provides higher bandwidth access to the device with lower
  11 latency, compared with devices that are virtualized by the
  12 hypervisor.  The device should appear to the guest just as it
  13 would when running on bare metal, so no changes are required
  14 to the Linux device drivers for the device.
  15
  16 Hyper-V terminology for vPCI devices is "Discrete Device
  17 Assignment" (DDA).  Public documentation for Hyper-V DDA is
  18 available here: `DDA`_
  19
  20 .. _DDA: https://learn.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment
  21
  22 DDA is typically used for storage controllers, such as NVMe,
  23 and for GPUs.  A similar mechanism for NICs is called SR-IOV
  24 and produces the same benefits by allowing a guest device
  25 driver to interact directly with the hardware.  See Hyper-V
  26 public documentation here: `SR-IOV`_
  27
  28 .. _SR-IOV: https://learn.microsoft.com/en-us/windows-hardware/drivers/network/overview-of-single-root-i-o-virtualization--sr-iov-
  29
  30 This discussion of vPCI devices includes DDA and SR-IOV
  31 devices.
  32
  33 Device Presentation
  34 -------------------
  35 Hyper-V provides full PCI functionality for a vPCI device when
  36 it is operating, so the Linux device driver for the device can
  37 be used unchanged, provided it uses the correct Linux kernel
  38 APIs for accessing PCI config space and for other integration
  39 with Linux.  But the initial detection of the PCI device and
  40 its integration with the Linux PCI subsystem must use Hyper-V
  41 specific mechanisms.  Consequently, vPCI devices on Hyper-V
  42 have a dual identity.  They are initially presented to Linux
  43 guests as VMBus devices via the standard VMBus "offer"
  44 mechanism, so they have a VMBus identity and appear under
  45 /sys/bus/vmbus/devices.  The VMBus vPCI driver in Linux at
  46 drivers/pci/controller/pci-hyperv.c handles a newly introduced
  47 vPCI device by fabricating a PCI bus topology and creating all
  48 the normal PCI device data structures in Linux that would
  49 exist if the PCI device were discovered via ACPI on a
  50 bare-metal system.  Once those data structures are set up, the
  51 device also has a normal PCI identity in Linux, and the normal
  52 Linux device driver for the vPCI device can function as if it
  53 were running in Linux on bare-metal.  Because vPCI devices are
  54 presented dynamically through the VMBus offer mechanism, they
  55 do not appear in the Linux guest's ACPI tables.  vPCI devices
  56 may be added to a VM or removed from a VM at any time during
  57 the life of the VM, and not just during initial boot.
  58
  59 With this approach, the vPCI device is a VMBus device and a
  60 PCI device at the same time.  In response to the VMBus offer
  61 message, the hv_pci_probe() function runs and establishes a
  62 VMBus connection to the vPCI VSP on the Hyper-V host.  That
  63 connection has a single VMBus channel.  The channel is used to
  64 exchange messages with the vPCI VSP for the purpose of setting
  65 up and configuring the vPCI device in Linux.  Once the device
  66 is fully configured in Linux as a PCI device, the VMBus
  67 channel is used only if Linux changes the vCPU to be interrupted
  68 in the guest, or if the vPCI device is removed from
  69 the VM while the VM is running.  The ongoing operation of the
  70 device happens directly between the Linux device driver for
  71 the device and the hardware, with VMBus and the VMBus channel
  72 playing no role.
  73
  74 PCI Device Setup
  75 ----------------
  76 PCI device setup follows a sequence that Hyper-V originally
  77 created for Windows guests, and that can be ill-suited for
  78 Linux guests due to differences in the overall structure of
  79 the Linux PCI subsystem compared with Windows.  Nonetheless,
  80 with a bit of hackery in the Hyper-V virtual PCI driver for
  81 Linux, the virtual PCI device is set up in Linux so that
  82 generic Linux PCI subsystem code and the Linux driver for the
  83 device "just work".
  84
  85 Each vPCI device is set up in Linux to be in its own PCI
  86 domain with a host bridge.  The PCI domainID is derived from
  87 bytes 4 and 5 of the instance GUID assigned to the VMBus vPCI
  88 device.  The Hyper-V host does not guarantee that these bytes
  89 are unique, so hv_pci_probe() has an algorithm to resolve
  90 collisions.  The collision resolution is intended to be stable
  91 across reboots of the same VM so that the PCI domainIDs don't
  92 change, as the domainID appears in the user space
  93 configuration of some devices.
  94
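The derivation and collision handling described above can be illustrated with a small user-space sketch.  It is not the driver's code: the names domain_from_guid() and claim_domain(), the byte order used to combine the two GUID bytes, and the lowest-free-ID fallback are all assumptions made for illustration.  Because the fallback is deterministic, devices probed in the same order receive the same IDs, which is the kind of stability the text describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DOMAIN_COUNT 0x10000u	/* 16-bit domain ID space in this sketch */

/* One flag per domain ID; true means the ID is already claimed. */
static bool domain_in_use[DOMAIN_COUNT];

/*
 * Combine bytes 4 and 5 of the instance GUID into a 16-bit candidate.
 * Treating byte 5 as the high byte is an assumption of this sketch.
 */
uint16_t domain_from_guid(const uint8_t guid[16])
{
	assert(guid != NULL);
	return (uint16_t)((guid[5] << 8) | guid[4]);
}

/*
 * Claim the requested ID if it is free; otherwise fall back to the
 * lowest free ID (hypothetical policy).  Returns -1 if all are taken.
 */
int32_t claim_domain(uint16_t requested)
{
	if (!domain_in_use[requested]) {
		domain_in_use[requested] = true;
		return requested;
	}
	for (uint32_t id = 0; id < DOMAIN_COUNT; id++) {
		if (!domain_in_use[id]) {
			domain_in_use[id] = true;
			return (int32_t)id;
		}
	}
	return -1;
}
```
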
  95 hv_pci_probe() allocates a guest MMIO range to be used as PCI
  96 config space for the device.  This MMIO range is communicated
  97 to the Hyper-V host over the VMBus channel as part of telling
  98 the host that the device is ready to enter D0.  See
  99 hv_pci_enter_d0().  When the guest subsequently accesses this
 100 MMIO range, the Hyper-V host intercepts the accesses and maps
 101 them to the physical device PCI config space.
 102
 103 hv_pci_probe() also gets BAR information for the device from
 104 the Hyper-V host, and uses this information to allocate MMIO
 105 space for the BARs.  That MMIO space is then set up to be
 106 associated with the host bridge so that it works when generic
 107 PCI subsystem code in Linux processes the BARs.
 108
 109 Finally, hv_pci_probe() creates the root PCI bus.  At this
 110 point the Hyper-V virtual PCI driver hackery is done, and the
 111 normal Linux PCI machinery for scanning the root bus works to
 112 detect the device, to perform driver matching, and to
 113 initialize the driver and device.
 114
 115 PCI Device Removal
 116 ------------------
 117 A Hyper-V host may initiate removal of a vPCI device from a
 118 guest VM at any time during the life of the VM.  The removal
 119 is instigated by an admin action taken on the Hyper-V host and
 120 is not under the control of the guest OS.
 121
 122 A guest VM is notified of the removal by an unsolicited
 123 "Eject" message sent from the host to the guest over the VMBus
 124 channel associated with the vPCI device.  Upon receipt of such
 125 a message, the Hyper-V virtual PCI driver in Linux
 126 asynchronously invokes Linux kernel PCI subsystem calls to
 127 shut down and remove the device.  When those calls are
 128 complete, an "Ejection Complete" message is sent back to
 129 Hyper-V over the VMBus channel indicating that the device has
 130 been removed.  At this point, Hyper-V sends a VMBus rescind
 131 message to the Linux guest, which the VMBus driver in Linux
 132 processes by removing the VMBus identity for the device.  Once
 133 that processing is complete, all vestiges of the device having
 134 been present are gone from the Linux kernel.  The rescind
 135 message also indicates to the guest that Hyper-V has stopped
 136 providing support for the vPCI device in the guest.  If the
 137 guest were to attempt to access that device's MMIO space, it
 138 would be an invalid reference. Hypercalls affecting the device
 139 return errors, and any further messages sent in the VMBus
 140 channel are ignored.
 141
 142 After sending the Eject message, Hyper-V allows the guest VM
 143 60 seconds to cleanly shut down the device and respond with
 144 Ejection Complete before sending the VMBus rescind
 145 message.  If for any reason the Eject steps don't complete
 146 within the allowed 60 seconds, the Hyper-V host forcibly
 147 performs the rescind steps, which will likely result in
 148 cascading errors in the guest because the device is now no
 149 longer present from the guest standpoint and accessing the
 150 device MMIO space will fail.
 151
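The two ways an ejection can end can be modeled in a few lines of user-space C.  This is only a sketch of the timeline described above: the type and function names are hypothetical, and how the host treats a reply arriving exactly at the 60-second mark is an assumption (the sketch counts it as too late).  For example, simulate_eject(5) models a guest that finishes cleanup after 5 seconds, while simulate_eject(90) models one that misses the window.

```c
#include <assert.h>

#define EJECT_TIMEOUT_SECONDS 60	/* window stated in the text above */

enum rescind_kind {
	RESCIND_AFTER_COMPLETE,	/* guest sent Ejection Complete in time */
	RESCIND_FORCED		/* host stopped waiting and rescinded anyway */
};

struct eject_outcome {
	enum rescind_kind kind;
	int rescind_at_seconds;	/* seconds after Eject when rescind is sent */
};

/*
 * Model the host's behavior given how long the guest needs to shut
 * down and remove the device after receiving the Eject message.
 */
struct eject_outcome simulate_eject(int guest_cleanup_seconds)
{
	struct eject_outcome out;

	assert(guest_cleanup_seconds >= 0);
	if (guest_cleanup_seconds < EJECT_TIMEOUT_SECONDS) {
		/* Clean path: rescind follows Ejection Complete. */
		out.kind = RESCIND_AFTER_COMPLETE;
		out.rescind_at_seconds = guest_cleanup_seconds;
	} else {
		/* Timeout path: the device disappears under the guest. */
		out.kind = RESCIND_FORCED;
		out.rescind_at_seconds = EJECT_TIMEOUT_SECONDS;
	}
	return out;
}
```
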
 152 Because ejection is asynchronous and can happen at any point
 153 during the guest VM lifecycle, proper synchronization in the
 154 Hyper-V virtual PCI driver is very tricky.  Ejection has been
 155 observed even before a newly offered vPCI device has been
 156 fully set up.  The Hyper-V virtual PCI driver has been updated
 157 several times over the years to fix race conditions when
 158 ejections happen at inopportune times. Care must be taken when
 159 modifying this code to prevent re-introducing such problems.
 160 See comments in the code.
 161
 162 Interrupt Assignment
 163 --------------------
 164 The Hyper-V virtual PCI driver supports vPCI devices using
 165 MSI, multi-MSI, or MSI-X.  Assigning the guest vCPU that will
 166 receive the interrupt for a particular MSI or MSI-X message is
 167 complex because of the way the Linux setup of IRQs maps onto
 168 the Hyper-V interfaces.  For the single-MSI and MSI-X cases,
 169 Linux calls hv_compose_msi_msg() twice, with the first call
 170 containing a dummy vCPU and the second call containing the
 171 real vCPU.  Furthermore, hv_irq_unmask() is finally called
 172 (on x86) or the GICD registers are set (on arm64) to specify
 173 the real vCPU again.  Each of these three calls interacts
 174 with Hyper-V, which must decide which physical CPU should
 175 receive the interrupt before it is forwarded to the guest VM.
 176 Unfortunately, the Hyper-V decision-making process is a bit
 177 limited, and can result in concentrating the physical
 178 interrupts on a single CPU, causing a performance bottleneck.
 179 See details about how this is resolved in the extensive
 180 comment above the function hv_compose_msi_req_get_cpu().
 181
 182 The Hyper-V virtual PCI driver implements the
 183 irq_chip.irq_compose_msi_msg function as hv_compose_msi_msg().
 184 Unfortunately, on Hyper-V the implementation requires sending
 185 a VMBus message to the Hyper-V host and awaiting an interrupt
 186 indicating receipt of a reply message.  Since
 187 irq_chip.irq_compose_msi_msg can be called with IRQ locks
 188 held, it doesn't work to do the normal sleep until awakened by
 189 the interrupt. Instead hv_compose_msi_msg() must send the
 190 VMBus message, and then poll for the completion message. As
 191 further complexity, the vPCI device could be ejected/rescinded
 192 while the polling is in progress, so this scenario must be
 193 detected as well.  See comments in the code regarding this
 194 very tricky area.
 195
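The send-then-poll pattern, including the check for a device that disappears mid-poll, can be sketched in user-space C as follows.  Everything here is hypothetical and simplified: poll_for_reply(), struct poll_ops, and the fake_host helpers are illustrative names rather than driver interfaces, and the bounded iteration count stands in for whatever delay and retry policy the real code uses.  A caller fills in struct poll_ops with its own callbacks and treats POLL_DEVICE_GONE as a signal to abandon the request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum poll_result { POLL_COMPLETED, POLL_DEVICE_GONE, POLL_GAVE_UP };

/* Hypothetical callbacks supplied by the caller. */
struct poll_ops {
	bool (*reply_arrived)(void *ctx);	/* has the host replied? */
	bool (*device_rescinded)(void *ctx);	/* was the device removed? */
	void *ctx;
};

/*
 * Poll instead of sleeping, because the caller may hold locks that
 * forbid sleeping.  Check for removal on every pass so a vanished
 * device does not leave the caller spinning on a reply that never comes.
 */
enum poll_result poll_for_reply(const struct poll_ops *ops,
				unsigned int max_polls)
{
	assert(ops != NULL && ops->reply_arrived != NULL &&
	       ops->device_rescinded != NULL);
	for (unsigned int i = 0; i < max_polls; i++) {
		if (ops->device_rescinded(ops->ctx))
			return POLL_DEVICE_GONE;
		if (ops->reply_arrived(ops->ctx))
			return POLL_COMPLETED;
		/* A real driver would insert a short busy-wait here. */
	}
	return POLL_GAVE_UP;
}

/* Fake host for demonstration; a negative threshold means "never". */
struct fake_host {
	int polls;		/* number of reply checks made so far */
	int reply_after;	/* reply visible once polls reaches this */
	int rescind_after;	/* device gone once polls reaches this */
};

bool fake_reply_arrived(void *ctx)
{
	struct fake_host *h = ctx;

	h->polls++;
	return h->reply_after >= 0 && h->polls >= h->reply_after;
}

bool fake_device_rescinded(void *ctx)
{
	struct fake_host *h = ctx;

	return h->rescind_after >= 0 && h->polls >= h->rescind_after;
}
```
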
 196 Most of the code in the Hyper-V virtual PCI driver
 197 (pci-hyperv.c) applies to Hyper-V and Linux guests running on
 198 x86 and on arm64 architectures.  But there are differences in how
 199 interrupt assignments are managed.  On x86, the Hyper-V
 200 virtual PCI driver in the guest must make a hypercall to tell
 201 Hyper-V which guest vCPU should be interrupted by each
 202 MSI/MSI-X interrupt, and the x86 interrupt vector number that
 203 the x86_vector IRQ domain has picked for the interrupt.  This
 204 hypercall is made by hv_arch_irq_unmask().  On arm64, the
 205 Hyper-V virtual PCI driver manages the allocation of an SPI
 206 for each MSI/MSI-X interrupt.  The Hyper-V virtual PCI driver
 207 stores the allocated SPI in the architectural GICD registers,
 208 which Hyper-V emulates, so no hypercall is necessary as with
 209 x86.  Hyper-V does not support using LPIs for vPCI devices in
 210 arm64 guest VMs because it does not emulate a GICv3 ITS.
 211
 212 The Hyper-V virtual PCI driver in Linux supports vPCI devices
 213 whose drivers create managed or unmanaged Linux IRQs.  If the
 214 smp_affinity for an unmanaged IRQ is updated via the /proc/irq
 215 interface, the Hyper-V virtual PCI driver is called to tell
 216 the Hyper-V host to change the interrupt targeting and
 217 everything works properly.  However, on x86 if the x86_vector
 218 IRQ domain needs to reassign an interrupt vector due to
 219 running out of vectors on a CPU, there's no path to inform the
 220 Hyper-V host of the change, and things break.  Fortunately,
 221 guest VMs operate in a constrained device environment where
 222 using all the vectors on a CPU doesn't happen. Since such a
 223 problem is only a theoretical concern rather than a practical
 224 concern, it has been left unaddressed.
 225
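The /proc/irq interface mentioned above accepts the affinity as a hexadecimal CPU bitmask, written as comma-separated 32-bit groups with the most significant group first.  The helper below is a user-space sketch that builds the mask string for a single target CPU; the function name format_smp_affinity() is hypothetical.  The resulting string is what a tool would write to the smp_affinity file of the IRQ in question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a bitmask selecting exactly one CPU.  Returns the string
 * length on success, or -1 if the buffer is too small.
 */
int format_smp_affinity(unsigned int cpu, char *buf, size_t len)
{
	unsigned int low_groups = cpu / 32;	/* all-zero groups below the bit */
	size_t used;
	int n;

	assert(buf != NULL);

	/* Most significant group: the one containing the CPU's bit. */
	n = snprintf(buf, len, "%x", 1u << (cpu % 32));
	if (n < 0 || (size_t)n >= len)
		return -1;
	used = (size_t)n;

	/* Remaining lower groups are zero, each padded to 8 hex digits. */
	for (unsigned int i = 0; i < low_groups; i++) {
		n = snprintf(buf + used, len - used, ",%08x", 0u);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	return (int)used;
}
```
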
 226 DMA
 227 ---
 228 By default, Hyper-V pins all guest VM memory in the host
 229 when the VM is created, and programs the physical IOMMU to
 230 allow the VM to have DMA access to all its memory.  Hence
 231 it is safe to assign PCI devices to the VM, and allow the
 232 guest operating system to program the DMA transfers.  The
 233 physical IOMMU prevents a malicious guest from initiating
 234 DMA to memory belonging to the host or to other VMs on the
 235 host. From the Linux guest standpoint, such DMA transfers
 236 are in "direct" mode since Hyper-V does not provide a virtual
 237 IOMMU in the guest.
 238
 239 Hyper-V assumes that physical PCI devices always perform
 240 cache-coherent DMA.  When running on x86, this behavior is
 241 required by the architecture.  When running on arm64, the
 242 architecture allows for both cache-coherent and
 243 non-cache-coherent devices, with the behavior of each device
 244 specified in the ACPI DSDT.  But when a PCI device is assigned
 245 to a guest VM, that device does not appear in the DSDT, so the
 246 Hyper-V VMBus driver propagates cache-coherency information
 247 from the VMBus node in the ACPI DSDT to all VMBus devices,
 248 including vPCI devices (since they have a dual identity as a VMBus
 249 device and as a PCI device).  See vmbus_dma_configure().
 250 Current Hyper-V versions always indicate that the VMBus is
 251 cache coherent, so vPCI devices on arm64 always get marked as
 252 cache coherent and the CPU does not perform any sync
 253 operations as part of dma_map/unmap_*() calls.
 254
 255 vPCI protocol versions
 256 ----------------------
 257 As previously described, during vPCI device setup and teardown,
 258 messages are passed over a VMBus channel between the Hyper-V
 259 host and the Hyper-V vPCI driver in the Linux guest.  Some
 260 messages have been revised in newer versions of Hyper-V, so
 261 the guest and host must agree on the vPCI protocol version to
 262 be used.  The version is negotiated when communication over
 263 the VMBus channel is first established.  See
 264 hv_pci_protocol_negotiation(). Newer versions of the protocol
 265 extend support to VMs with more than 64 vCPUs, and provide
 266 additional information about the vPCI device, such as the
 267 guest virtual NUMA node to which it is most closely affined in
 268 the underlying hardware.
 269
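One common way to reach such an agreement is for the guest to offer its supported versions from newest to oldest until the host accepts one.  The user-space sketch below illustrates that idea only; it is not a transcription of hv_pci_protocol_negotiation().  The version encoding in VPCI_VERSION(), the version list, and the callback-based fake host are assumptions made for illustration.  negotiate_version() returns the index of the first accepted entry in the guest's list, or -1 if the two sides share no version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding: major in the high 16 bits, minor in the low. */
#define VPCI_VERSION(major, minor) \
	(((uint32_t)(major) << 16) | (uint32_t)(minor))

/* Versions the guest can speak, newest first (illustrative values). */
const uint32_t guest_versions[] = {
	VPCI_VERSION(1, 4),
	VPCI_VERSION(1, 3),
	VPCI_VERSION(1, 2),
	VPCI_VERSION(1, 1),
};
#define GUEST_VERSION_COUNT \
	(sizeof(guest_versions) / sizeof(guest_versions[0]))

/* Stand-in for one request/response exchange with the host. */
typedef bool (*host_accepts_fn)(uint32_t version, void *ctx);

/* Offer each version in order; stop at the first one the host accepts. */
int negotiate_version(const uint32_t *versions, size_t count,
		      host_accepts_fn host_accepts, void *ctx)
{
	assert(versions != NULL && host_accepts != NULL);
	for (size_t i = 0; i < count; i++) {
		if (host_accepts(versions[i], ctx))
			return (int)i;
	}
	return -1;
}

/* Fake host that accepts any version up to the one ctx points at. */
bool host_accepts_up_to(uint32_t version, void *ctx)
{
	const uint32_t *newest_supported = ctx;

	return version <= *newest_supported;
}
```
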
 270 Guest NUMA node affinity
 271 ------------------------
 272 When the vPCI protocol version provides it, the guest NUMA
 273 node affinity of the vPCI device is stored as part of the Linux
 274 device information for subsequent use by the Linux driver. See
 275 hv_pci_assign_numa_node().  If the negotiated protocol version
 276 does not support the host providing NUMA affinity information,
 277 the Linux guest defaults the device NUMA node to 0.  But even
 278 when the negotiated protocol version includes NUMA affinity
 279 information, the ability of the host to provide such
 280 information depends on certain host configuration options.  If
 281 the guest receives NUMA node value "0", it could mean NUMA
 282 node 0, or it could mean "no information is available".
 283 Unfortunately it is not possible to distinguish the two cases
 284 from the guest side.
 285
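The defaulting rule and the resulting ambiguity can be made concrete with a short user-space sketch.  The function pick_numa_node() and its parameters are hypothetical, and the range check against the guest's node count is an assumption of this sketch rather than something stated above.  Note that pick_numa_node(true, 0, 4) and pick_numa_node(false, 3, 4) both return 0, so a caller cannot tell "node 0" apart from "no information".

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Choose the NUMA node to record for a device.
 *
 * protocol_reports_numa: negotiated protocol carries NUMA affinity
 * host_node:             node value received from the host
 * guest_node_count:      number of NUMA nodes the guest knows about
 */
int pick_numa_node(bool protocol_reports_numa, unsigned int host_node,
		   unsigned int guest_node_count)
{
	assert(guest_node_count > 0);

	/* Older protocol: no affinity data, so default to node 0. */
	if (!protocol_reports_numa)
		return 0;

	/* Assumption: ignore values the guest cannot map to a node. */
	if (host_node >= guest_node_count)
		return 0;

	return (int)host_node;
}
```
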
 286 PCI config space access in a CoCo VM
 287 ------------------------------------
 288 Linux PCI device drivers access PCI config space using a
 289 standard set of functions provided by the Linux PCI subsystem.
 290 In Hyper-V guests these standard functions map to functions
 291 hv_pcifront_read_config() and hv_pcifront_write_config()
 292 in the Hyper-V virtual PCI driver.  In normal VMs,
 293 these hv_pcifront_*() functions directly access the PCI config
 294 space, and the accesses trap to Hyper-V to be handled.
 295 But in CoCo VMs, memory encryption prevents Hyper-V
 296 from reading the guest instruction stream to emulate the
 297 access, so the hv_pcifront_*() functions must invoke
 298 hypercalls with explicit arguments describing the access to be
 299 made.
 300
 301 Config Block back-channel
 302 -------------------------
 303 The Hyper-V host and Hyper-V virtual PCI driver in Linux
 304 together implement a non-standard back-channel communication
 305 path between the host and guest.  The back-channel path uses
 306 messages sent over the VMBus channel associated with the vPCI
 307 device.  The functions hyperv_read_cfg_blk() and
 308 hyperv_write_cfg_blk() are the primary interfaces provided to
 309 other parts of the Linux kernel.  As of this writing, these
 310 interfaces are used only by the Mellanox mlx5 driver to pass
 311 diagnostic data to a Hyper-V host running in the Azure public
 312 cloud.  The functions hyperv_read_cfg_blk() and
 313 hyperv_write_cfg_blk() are implemented in a separate module
 314 (pci-hyperv-intf.c, under CONFIG_PCI_HYPERV_INTERFACE) that
 315 effectively stubs them out when running in non-Hyper-V
 316 environments.