Documentation/virt/kvm/vcpu-requests.rst

   1 .. SPDX-License-Identifier: GPL-2.0
   2
   3 =================
   4 KVM VCPU Requests
   5 =================
   6
   7 Overview
   8 ========
   9
  10 KVM supports an internal API enabling threads to request a VCPU thread to
  11 perform some activity.  For example, a thread may request a VCPU to flush
  12 its TLB with a VCPU request.  The API consists of the following functions::
  13
  14   /* Check if any requests are pending for VCPU @vcpu. */
  15   bool kvm_request_pending(struct kvm_vcpu *vcpu);
  16
  17   /* Check if VCPU @vcpu has request @req pending. */
  18   bool kvm_test_request(int req, struct kvm_vcpu *vcpu);
  19
  20   /* Clear request @req for VCPU @vcpu. */
  21   void kvm_clear_request(int req, struct kvm_vcpu *vcpu);
  22
  23   /*
  24    * Check if VCPU @vcpu has request @req pending. When the request is
  25    * pending it will be cleared and a memory barrier, which pairs with
  26    * another in kvm_make_request(), will be issued.
  27    */
  28   bool kvm_check_request(int req, struct kvm_vcpu *vcpu);
  29
  30   /*
  31    * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs
  32    * with another in kvm_check_request(), prior to setting the request.
  33    */
  34   void kvm_make_request(int req, struct kvm_vcpu *vcpu);
  35
  36   /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
  37   bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);
  38
  39 Typically a requester wants the VCPU to perform the activity as soon
  40 as possible after making the request.  This means most requests
  41 (kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(),
  42 and kvm_make_all_cpus_request() has the kicking of all VCPUs built
  43 into it.
  44
  45 VCPU Kicks
  46 ----------
  47
  48 The goal of a VCPU kick is to bring a VCPU thread out of guest mode in
  49 order to perform some KVM maintenance.  To do so, an IPI is sent, forcing
  50 a guest mode exit.  However, a VCPU thread may not be in guest mode at the
  51 time of the kick.  Therefore, depending on the mode and state of the VCPU
  52 thread, there are two other actions a kick may take.  All three actions
  53 are listed below:
  54
  55 1) Send an IPI.  This forces a guest mode exit.
   56 2) Wake a sleeping VCPU.  Sleeping VCPUs are VCPU threads outside guest
  57    mode that wait on waitqueues.  Waking them removes the threads from
  58    the waitqueues, allowing the threads to run again.  This behavior
  59    may be suppressed, see KVM_REQUEST_NO_WAKEUP below.
  60 3) Nothing.  When the VCPU is not in guest mode and the VCPU thread is not
  61    sleeping, then there is nothing to do.
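
The three kick actions can be summarized as a small decision function.  The
following is only a userspace sketch, not KVM code: every ``demo_`` and
``DEMO_`` name is hypothetical, and the three-way state enum is a
simplification of the mode and sleep state that KVM actually tracks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of where a VCPU thread currently is. */
enum demo_vcpu_state {
    DEMO_IN_GUEST,      /* executing guest code */
    DEMO_SLEEPING,      /* outside guest mode, waiting on a waitqueue */
    DEMO_RUNNING_HOST,  /* outside guest mode and not sleeping */
};

/* The three actions a kick may take, as listed above. */
enum demo_kick_action {
    DEMO_KICK_SEND_IPI,
    DEMO_KICK_WAKE_UP,
    DEMO_KICK_NOTHING,
};

/*
 * Pick the action for a kick.  @no_wakeup models a request that carries
 * KVM_REQUEST_NO_WAKEUP, which suppresses waking a sleeping VCPU.
 */
static enum demo_kick_action
demo_pick_kick_action(enum demo_vcpu_state state, bool no_wakeup)
{
    switch (state) {
    case DEMO_IN_GUEST:
        return DEMO_KICK_SEND_IPI;
    case DEMO_SLEEPING:
        return no_wakeup ? DEMO_KICK_NOTHING : DEMO_KICK_WAKE_UP;
    case DEMO_RUNNING_HOST:
        return DEMO_KICK_NOTHING;
    }
    assert(!"unknown VCPU state");
    return DEMO_KICK_NOTHING;
}
```

In this model a sleeping VCPU targeted by a no-wakeup request is simply left
alone; it picks the request up when it is next awakened for another reason.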
  62
  63 VCPU Mode
  64 ---------
  65
   66 VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the
   67 VCPU is running in guest mode or not, as well as some specific
  68 outside guest mode states.  The architecture may use ``vcpu->mode`` to
  69 ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"),
  70 as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and
  71 even to ensure IPI acknowledgements are waited upon (see "Waiting for
  72 Acknowledgements").  The following modes are defined:
  73
  74 OUTSIDE_GUEST_MODE
  75
  76   The VCPU thread is outside guest mode.
  77
  78 IN_GUEST_MODE
  79
  80   The VCPU thread is in guest mode.
  81
  82 EXITING_GUEST_MODE
  83
  84   The VCPU thread is transitioning from IN_GUEST_MODE to
  85   OUTSIDE_GUEST_MODE.
  86
  87 READING_SHADOW_PAGE_TABLES
  88
  89   The VCPU thread is outside guest mode, but it wants the sender of
  90   certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU
  91   thread is done reading the page tables.
  92
  93 VCPU Request Internals
  94 ======================
  95
  96 VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
   97 This means general bitops, like those documented in [atomic-ops]_, could
  98 also be used, e.g. ::
  99
 100   clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests);
 101
 102 However, VCPU request users should refrain from doing so, as it would
 103 break the abstraction.  The first 8 bits are reserved for architecture
  104 independent requests; all additional bits are available for architecture
 105 dependent requests.
 106
 107 Architecture Independent Requests
 108 ---------------------------------
 109
 110 KVM_REQ_TLB_FLUSH
 111
 112   KVM's common MMU notifier may need to flush all of a guest's TLB
 113   entries, calling kvm_flush_remote_tlbs() to do so.  Architectures that
 114   choose to use the common kvm_flush_remote_tlbs() implementation will
 115   need to handle this VCPU request.
 116
 117 KVM_REQ_VM_DEAD
 118
 119   This request informs all VCPUs that the VM is dead and unusable, e.g. due to
 120   fatal error or because the VM's state has been intentionally destroyed.
 121
 122 KVM_REQ_UNBLOCK
 123
  124   This request tells the vCPU to exit kvm_vcpu_block().  It is used, for
  125   example, from timer handlers that run on the host on behalf of a vCPU,
 126   or in order to update the interrupt routing and ensure that assigned
 127   devices will wake up the vCPU.
 128
 129 KVM_REQ_UNHALT
 130
 131   This request may be made from the KVM common function kvm_vcpu_block(),
 132   which is used to emulate an instruction that causes a CPU to halt until
  133   one of an architecture-specific set of events and/or interrupts is
 134   received (determined by checking kvm_arch_vcpu_runnable()).  When that
 135   event or interrupt arrives kvm_vcpu_block() makes the request.  This is
 136   in contrast to when kvm_vcpu_block() returns due to any other reason,
 137   such as a pending signal, which does not indicate the VCPU's halt
 138   emulation should stop, and therefore does not make the request.
 139
 140 KVM_REQ_OUTSIDE_GUEST_MODE
 141
 142   This "request" ensures the target vCPU has exited guest mode prior to the
  143   sender of the request continuing on.  No action needs to be taken by the target,
 144   and so no request is actually logged for the target.  This request is similar
 145   to a "kick", but unlike a kick it guarantees the vCPU has actually exited
 146   guest mode.  A kick only guarantees the vCPU will exit at some point in the
 147   future, e.g. a previous kick may have started the process, but there's no
 148   guarantee the to-be-kicked vCPU has fully exited guest mode.
 149
 150 KVM_REQUEST_MASK
 151 ----------------
 152
 153 VCPU requests should be masked by KVM_REQUEST_MASK before using them with
 154 bitops.  This is because only the lower 8 bits are used to represent the
 155 request's number.  The upper bits are used as flags.  Currently only two
 156 flags are defined.
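
To illustrate the split between request number and flags, here is a minimal
userspace sketch.  The ``DEMO_`` constants and helpers are hypothetical
stand-ins; in particular, the flag bit positions and the example request
number 3 are assumptions chosen for illustration, not values taken from the
KVM headers.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define DEMO_REQUEST_MASK       0xffu      /* low 8 bits: request number */
#define DEMO_REQUEST_NO_WAKEUP  (1u << 8)  /* assumed flag position */
#define DEMO_REQUEST_WAIT       (1u << 9)  /* assumed flag position */

/* Flags must never overlap the bits that hold the request number. */
static_assert(((DEMO_REQUEST_NO_WAKEUP | DEMO_REQUEST_WAIT) &
               DEMO_REQUEST_MASK) == 0, "flags overlap request number");

/* Hypothetical request: number 3, carrying both flags. */
#define DEMO_REQ_EXAMPLE (3u | DEMO_REQUEST_WAIT | DEMO_REQUEST_NO_WAKEUP)

/* Strip the flags, leaving the bit index into the requests bitmap. */
static unsigned int demo_req_nr(unsigned int req)
{
    return req & DEMO_REQUEST_MASK;
}

static bool demo_req_has_flag(unsigned int req, unsigned int flag)
{
    return (req & flag) != 0;
}

/* Return @requests with the bit for @req set; only the number is used. */
static unsigned long demo_set_request(unsigned long requests, unsigned int req)
{
    unsigned int nr = demo_req_nr(req);

    assert(nr < sizeof(requests) * CHAR_BIT);
    return requests | (1ul << nr);
}
```

Passing the unmasked value to a bitop would index far beyond the intended
bit, which is why the mask is applied before the shift.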
 157
 158 VCPU Request Flags
 159 ------------------
 160
 161 KVM_REQUEST_NO_WAKEUP
 162
 163   This flag is applied to requests that only need immediate attention
 164   from VCPUs running in guest mode.  That is, sleeping VCPUs do not need
  165   to be awakened for these requests.  Sleeping VCPUs will handle the
  166   requests when they are awakened later for some other reason.
 167
 168 KVM_REQUEST_WAIT
 169
 170   When requests with this flag are made with kvm_make_all_cpus_request(),
 171   then the caller will wait for each VCPU to acknowledge its IPI before
 172   proceeding.  This flag only applies to VCPUs that would receive IPIs.
 173   If, for example, the VCPU is sleeping, so no IPI is necessary, then
 174   the requesting thread does not wait.  This means that this flag may be
 175   safely combined with KVM_REQUEST_NO_WAKEUP.  See "Waiting for
 176   Acknowledgements" for more information about requests with
 177   KVM_REQUEST_WAIT.
 178
 179 VCPU Requests with Associated State
 180 ===================================
 181
 182 Requesters that want the receiving VCPU to handle new state need to ensure
 183 the newly written state is observable to the receiving VCPU thread's CPU
 184 by the time it observes the request.  This means a write memory barrier
 185 must be inserted after writing the new state and before setting the VCPU
 186 request bit.  Additionally, on the receiving VCPU thread's side, a
 187 corresponding read barrier must be inserted after reading the request bit
 188 and before proceeding to read the new state associated with it.  See
 189 scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation
 190 [memory-barriers]_.
 191
 192 The pair of functions, kvm_check_request() and kvm_make_request(), provide
 193 the memory barriers, allowing this requirement to be handled internally by
 194 the API.
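
The message-and-flag ordering described above can be modeled in userspace
with C11 atomics.  This is a sketch under stated assumptions, not the kernel
implementation: the ``demo_`` names are hypothetical, a single boolean stands
in for one bit of ``vcpu->requests``, release/acquire ordering stands in for
the kernel's write and read barriers, and only one requester is assumed to
write the state at a time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_vcpu {
    int new_state;        /* state associated with the request */
    atomic_bool request;  /* stands in for one bit of vcpu->requests */
};

/* Requester: write the state first, then publish the request. */
static void demo_make_request(struct demo_vcpu *vcpu, int state)
{
    assert(vcpu);
    vcpu->new_state = state;
    /* Release: the state write is ordered before the request bit. */
    atomic_store_explicit(&vcpu->request, true, memory_order_release);
}

/*
 * Receiver: test and clear the request, then read the state.
 * Returns true and fills @state if a request was pending.
 */
static bool demo_check_request(struct demo_vcpu *vcpu, int *state)
{
    assert(vcpu && state);
    /* Acquire: the state read is ordered after the request bit read. */
    if (!atomic_exchange_explicit(&vcpu->request, false,
                                  memory_order_acquire))
        return false;
    *state = vcpu->new_state;
    return true;
}
```

If the receiver observes the request, the release/acquire pairing guarantees
it also observes the state written before the request was made.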
 195
 196 Ensuring Requests Are Seen
 197 ==========================
 198
 199 When making requests to VCPUs, we want to avoid the receiving VCPU
  200 executing in guest mode for an arbitrarily long time without handling the
 201 request.  We can be sure this won't happen as long as we ensure the VCPU
 202 thread checks kvm_request_pending() before entering guest mode and that a
 203 kick will send an IPI to force an exit from guest mode when necessary.
 204 Extra care must be taken to cover the period after the VCPU thread's last
 205 kvm_request_pending() check and before it has entered guest mode, as kick
 206 IPIs will only trigger guest mode exits for VCPU threads that are in guest
 207 mode or at least have already disabled interrupts in order to prepare to
 208 enter guest mode.  This means that an optimized implementation (see "IPI
 209 Reduction") must be certain when it's safe to not send the IPI.  One
 210 solution, which all architectures except s390 apply, is to:
 211
 212 - set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and
 213   the last kvm_request_pending() check;
 214 - enable interrupts atomically when entering the guest.
 215
 216 This solution also requires memory barriers to be placed carefully in both
 217 the requesting thread and the receiving VCPU.  With the memory barriers we
 218 can exclude the possibility of a VCPU thread observing
 219 !kvm_request_pending() on its last check and then not receiving an IPI for
 220 the next request made of it, even if the request is made immediately after
 221 the check.  This is done by way of the Dekker memory barrier pattern
 222 (scenario 10 of [lwn-mb]_).  As the Dekker pattern requires two variables,
 223 this solution pairs ``vcpu->mode`` with ``vcpu->requests``.  Substituting
 224 them into the pattern gives::
 225
 226   CPU1                                    CPU2
 227   =================                       =================
 228   local_irq_disable();
 229   WRITE_ONCE(vcpu->mode, IN_GUEST_MODE);  kvm_make_request(REQ, vcpu);
 230   smp_mb();                               smp_mb();
 231   if (kvm_request_pending(vcpu)) {        if (READ_ONCE(vcpu->mode) ==
 232                                               IN_GUEST_MODE) {
 233       ...abort guest entry...                 ...send IPI...
 234   }                                       }
 235
 236 As stated above, the IPI is only useful for VCPU threads in guest mode or
 237 that have already disabled interrupts.  This is why this specific case of
 238 the Dekker pattern has been extended to disable interrupts before setting
 239 ``vcpu->mode`` to IN_GUEST_MODE.  WRITE_ONCE() and READ_ONCE() are used to
 240 pedantically implement the memory barrier pattern, guaranteeing the
 241 compiler doesn't interfere with ``vcpu->mode``'s carefully planned
 242 accesses.
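
The two columns of the pattern above can be written as a pair of functions
in a userspace model.  This is only a sketch: the ``demo_`` names are
hypothetical, sequentially consistent C11 fences stand in for smp_mb(),
relaxed atomic accesses stand in for WRITE_ONCE()/READ_ONCE(), and the
interrupt disabling has no userspace equivalent and is only noted in a
comment.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { DEMO_OUTSIDE_GUEST_MODE, DEMO_IN_GUEST_MODE };

struct demo_vcpu {
    atomic_int mode;
    atomic_ulong requests;
};

/* CPU1 column: returns true if guest entry may proceed. */
static bool demo_vcpu_prepare_entry(struct demo_vcpu *vcpu)
{
    /* The kernel would have called local_irq_disable() before this. */
    atomic_store_explicit(&vcpu->mode, DEMO_IN_GUEST_MODE,
                          memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* smp_mb() stand-in */
    if (atomic_load_explicit(&vcpu->requests, memory_order_relaxed)) {
        /* Abort guest entry so the pending request gets handled. */
        atomic_store_explicit(&vcpu->mode, DEMO_OUTSIDE_GUEST_MODE,
                              memory_order_relaxed);
        return false;
    }
    return true;
}

/* CPU2 column: returns true if an IPI must be sent. */
static bool demo_make_request_and_kick(struct demo_vcpu *vcpu,
                                       unsigned int nr)
{
    assert(nr < sizeof(unsigned long) * CHAR_BIT);
    atomic_fetch_or_explicit(&vcpu->requests, 1ul << nr,
                             memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* smp_mb() stand-in */
    return atomic_load_explicit(&vcpu->mode, memory_order_relaxed) ==
           DEMO_IN_GUEST_MODE;
}
```

With both fences in place, at least one side observes the other's store:
either the VCPU sees the request and aborts entry, or the requester sees
IN_GUEST_MODE and sends the IPI.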
 243
 244 IPI Reduction
 245 -------------
 246
  247 As only one IPI is needed to get a VCPU to check for any/all requests,
  248 IPIs may be coalesced.  This is easily done by having the first
  249 IPI-sending kick also change the VCPU mode to something !IN_GUEST_MODE.  The
 250 transitional state, EXITING_GUEST_MODE, is used for this purpose.
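
One way to picture the coalescing is an atomic compare-and-exchange on the
mode, so that only the kick that wins the transition sends an IPI.  The
sketch below is a userspace model with hypothetical ``demo_`` names; using
a compare-and-exchange here is an assumption about how the transition could
be made atomic, not a statement about the KVM code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum {
    DEMO_OUTSIDE_GUEST_MODE,
    DEMO_IN_GUEST_MODE,
    DEMO_EXITING_GUEST_MODE,
};

/*
 * Returns true only for the kick that moves the VCPU from
 * IN_GUEST_MODE to EXITING_GUEST_MODE; that kick sends the IPI.
 * Later kicks see EXITING_GUEST_MODE and send nothing.
 */
static bool demo_kick_needs_ipi(atomic_int *mode)
{
    int expected = DEMO_IN_GUEST_MODE;

    assert(mode);
    return atomic_compare_exchange_strong(mode, &expected,
                                          DEMO_EXITING_GUEST_MODE);
}
```

Two kicks issued back to back against a VCPU in guest mode therefore result
in a single IPI in this model.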
 251
 252 Waiting for Acknowledgements
 253 ----------------------------
 254
 255 Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to
 256 be sent, and the acknowledgements to be waited upon, even when the target
 257 VCPU threads are in modes other than IN_GUEST_MODE.  For example, one case
 258 is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which
 259 is set after disabling interrupts.  To support these cases, the
 260 KVM_REQUEST_WAIT flag changes the condition for sending an IPI from
 261 checking that the VCPU is IN_GUEST_MODE to checking that it is not
 262 OUTSIDE_GUEST_MODE.
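
The changed condition can be expressed as a single predicate.  This is a
hedged userspace sketch with hypothetical ``demo_`` names; it models only
the decision described above, not the sending of the IPI or the wait for
its acknowledgement.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_mode {
    DEMO_OUTSIDE_GUEST_MODE,
    DEMO_IN_GUEST_MODE,
    DEMO_EXITING_GUEST_MODE,
    DEMO_READING_SHADOW_PAGE_TABLES,
};

/*
 * Decide whether a request should trigger an IPI.  @wait models a
 * request carrying KVM_REQUEST_WAIT.
 */
static bool demo_should_send_ipi(enum demo_mode mode, bool wait)
{
    assert(mode >= DEMO_OUTSIDE_GUEST_MODE &&
           mode <= DEMO_READING_SHADOW_PAGE_TABLES);
    if (wait)
        return mode != DEMO_OUTSIDE_GUEST_MODE;
    return mode == DEMO_IN_GUEST_MODE;
}
```

A VCPU in READING_SHADOW_PAGE_TABLES mode thus receives an IPI only for
requests that carry the wait flag.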
 263
 264 Request-less VCPU Kicks
 265 -----------------------
 266
  267 As the determination of whether or not to send an IPI depends on the
  268 two-variable Dekker memory barrier pattern, it's clear that
  269 request-less VCPU kicks are almost never correct.  Without the assurance
  270 that a non-IPI generating kick will still result in an action by the
  271 receiving VCPU, as the final kvm_request_pending() check does for
  272 request-accompanying kicks, the kick may not do anything useful at
 273 all.  If, for instance, a request-less kick was made to a VCPU that was
 274 just about to set its mode to IN_GUEST_MODE, meaning no IPI is sent, then
 275 the VCPU thread may continue its entry without actually having done
 276 whatever it was the kick was meant to initiate.
 277
 278 One exception is x86's posted interrupt mechanism.  In this case, however,
 279 even the request-less VCPU kick is coupled with the same
 280 local_irq_disable() + smp_mb() pattern described above; the ON bit
 281 (Outstanding Notification) in the posted interrupt descriptor takes the
 282 role of ``vcpu->requests``.  When sending a posted interrupt, PIR.ON is
 283 set before reading ``vcpu->mode``; dually, in the VCPU thread,
 284 vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to
 285 IN_GUEST_MODE.
 286
 287 Additional Considerations
 288 =========================
 289
 290 Sleeping VCPUs
 291 --------------
 292
 293 VCPU threads may need to consider requests before and/or after calling
 294 functions that may put them to sleep, e.g. kvm_vcpu_block().  Whether they
 295 do or not, and, if they do, which requests need consideration, is
 296 architecture dependent.  kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
 297 to check if it should awaken.  One reason to do so is to provide
 298 architectures a function where requests may be checked if necessary.
 299
 300 Clearing Requests
 301 -----------------
 302
 303 Generally it only makes sense for the receiving VCPU thread to clear a
  304 request.  However, in some circumstances, such as when the requesting
  305 thread and the receiving VCPU thread are executed serially, e.g. when
  306 they are the same thread or when they are using some form of concurrency
  307 control to temporarily execute synchronously, it's possible to know
 308 that the request may be cleared immediately, rather than waiting for the
 309 receiving VCPU thread to handle the request in VCPU RUN.  The only current
 310 examples of this are kvm_vcpu_block() calls made by VCPUs to block
 311 themselves.  A possible side-effect of that call is to make the
 312 KVM_REQ_UNHALT request, which may then be cleared immediately when the
 313 VCPU returns from the call.
 314
 315 References
 316 ==========
 317
 318 .. [atomic-ops] Documentation/atomic_bitops.txt and Documentation/atomic_t.txt
 319 .. [memory-barriers] Documentation/memory-barriers.txt
 320 .. [lwn-mb] https://lwn.net/Articles/573436/