.. SPDX-License-Identifier: GPL-2.0

KVM supports an internal API enabling threads to request a VCPU thread to
perform some activity. For example, a thread may request a VCPU to flush
its TLB with a VCPU request. The API consists of the following functions::

  /* Check if any requests are pending for VCPU @vcpu. */
  bool kvm_request_pending(struct kvm_vcpu *vcpu);

  /* Check if VCPU @vcpu has request @req pending. */
  bool kvm_test_request(int req, struct kvm_vcpu *vcpu);

  /* Clear request @req for VCPU @vcpu. */
  void kvm_clear_request(int req, struct kvm_vcpu *vcpu);

  /*
   * Check if VCPU @vcpu has request @req pending. When the request is
   * pending it will be cleared and a memory barrier, which pairs with
   * another in kvm_make_request(), will be issued.
   */
  bool kvm_check_request(int req, struct kvm_vcpu *vcpu);

  /*
   * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs
   * with another in kvm_check_request(), prior to setting the request.
   */
  void kvm_make_request(int req, struct kvm_vcpu *vcpu);

  /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
  bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);

Typically a requester wants the VCPU to perform the activity as soon
as possible after making the request. This means most requests
(kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(),
and kvm_make_all_cpus_request() has the kicking of all VCPUs built in.

The goal of a VCPU kick is to bring a VCPU thread out of guest mode in
order to perform some KVM maintenance. To do so, an IPI is sent, forcing
a guest mode exit. However, a VCPU thread may not be in guest mode at the
time of the kick. Therefore, depending on the mode and state of the VCPU
thread, there are two other actions a kick may take. All three actions
are listed below:

1) Send an IPI. This forces a guest mode exit.
2) Wake a sleeping VCPU. Sleeping VCPUs are VCPU threads outside guest
   mode that wait on waitqueues. Waking them removes the threads from
   the waitqueues, allowing the threads to run again. This behavior
   may be suppressed, see KVM_REQUEST_NO_WAKEUP below.
3) Do nothing. When the VCPU is not in guest mode and the VCPU thread is
   not sleeping, there is nothing to do.

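
The three outcomes can be read as a small decision function. The following
is a user-space sketch only, not the kernel's kvm_vcpu_kick(): the types
``model_mode``, ``kick_action`` and ``vcpu_model``, the ``sleeping`` field
and the ``no_wakeup`` parameter are all hypothetical names introduced for
illustration.

```c
#include <stdbool.h>

/* Hypothetical model types; none of these names exist in KVM. */
enum model_mode { MODEL_OUTSIDE_GUEST_MODE, MODEL_IN_GUEST_MODE };

enum kick_action { KICK_NOTHING, KICK_WAKE, KICK_SEND_IPI };

struct vcpu_model {
	enum model_mode mode;	/* stand-in for vcpu->mode */
	bool sleeping;		/* thread is waiting on a waitqueue */
};

/*
 * Pick the action a kick takes for the given VCPU state.
 * @no_wakeup models a request carrying KVM_REQUEST_NO_WAKEUP.
 */
enum kick_action kick_action_for(const struct vcpu_model *v, bool no_wakeup)
{
	if (v->mode == MODEL_IN_GUEST_MODE)
		return KICK_SEND_IPI;	/* 1) force a guest mode exit */
	if (v->sleeping && !no_wakeup)
		return KICK_WAKE;	/* 2) remove from the waitqueue */
	return KICK_NOTHING;		/* 3) nothing to do */
}
```

The model treats the mode and the sleeping state as a consistent snapshot;
a real kick has to cope with both changing concurrently, which is the
subject of the later sections.
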
VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the
VCPU thread is running in guest mode or not, as well as some specific
outside guest mode states. The architecture may use ``vcpu->mode`` to
ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"),
as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and
even to ensure IPI acknowledgements are waited upon (see "Waiting for
Acknowledgements"). The following modes are defined:

OUTSIDE_GUEST_MODE

  The VCPU thread is outside guest mode.

IN_GUEST_MODE

  The VCPU thread is in guest mode.

EXITING_GUEST_MODE

  The VCPU thread is transitioning from IN_GUEST_MODE to
  OUTSIDE_GUEST_MODE.

READING_SHADOW_PAGE_TABLES

  The VCPU thread is outside guest mode, but it wants the sender of
  certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU
  thread is done reading the page tables.

VCPU Request Internals
======================

VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
This means general bitops, like those documented in [atomic-ops]_ could
also be used, e.g. ::

  clear_bit(KVM_REQ_UNBLOCK & KVM_REQUEST_MASK, &vcpu->requests);

However, VCPU request users should refrain from doing so, as it would
break the abstraction. The first 8 bits are reserved for architecture
independent requests; all additional bits are available for architecture
dependent requests.

Architecture Independent Requests
---------------------------------

KVM_REQ_TLB_FLUSH

  KVM's common MMU notifier may need to flush all of a guest's TLB
  entries, calling kvm_flush_remote_tlbs() to do so. Architectures that
  choose to use the common kvm_flush_remote_tlbs() implementation will
  need to handle this VCPU request.

KVM_REQ_VM_DEAD

  This request informs all VCPUs that the VM is dead and unusable, e.g. due
  to a fatal error or because the VM's state has been intentionally
  destroyed.

KVM_REQ_UNBLOCK

  This request informs the VCPU to exit kvm_vcpu_block(). It is used, for
  example, from timer handlers that run on the host on behalf of a VCPU,
  or in order to update the interrupt routing and ensure that assigned
  devices will wake up the VCPU.

KVM_REQ_OUTSIDE_GUEST_MODE

  This "request" ensures the target VCPU has exited guest mode prior to the
  sender of the request continuing on. No action needs to be taken by the
  target, and so no request is actually logged for the target. This request
  is similar to a "kick", but unlike a kick it guarantees the VCPU has
  actually exited guest mode. A kick only guarantees the VCPU will exit at
  some point in the future, e.g. a previous kick may have started the
  process, but there's no guarantee the to-be-kicked VCPU has fully exited
  guest mode.

KVM_REQUEST_MASK
----------------

VCPU requests should be masked by KVM_REQUEST_MASK before using them with
bitops. This is because only the lower 8 bits are used to represent the
request's number. The upper bits are used as flags. Currently only two
flags are defined.

VCPU Request Flags
------------------

KVM_REQUEST_NO_WAKEUP

  This flag is applied to requests that only need immediate attention
  from VCPUs running in guest mode. That is, sleeping VCPUs do not need
  to be awakened for these requests. Sleeping VCPUs will handle the
  requests when they are awakened later for some other reason.

KVM_REQUEST_WAIT

  When requests with this flag are made with kvm_make_all_cpus_request(),
  the caller will wait for each VCPU to acknowledge its IPI before
  proceeding. This flag only applies to VCPUs that would receive IPIs.
  If, for example, the VCPU is sleeping, so no IPI is necessary, then
  the requesting thread does not wait. This means that this flag may be
  safely combined with KVM_REQUEST_NO_WAKEUP. See "Waiting for
  Acknowledgements" for more information about requests with
  KVM_REQUEST_WAIT.

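
The split between request number and flags can be illustrated with a small
user-space model. This is a sketch under stated assumptions, not the kernel
definitions: the ``MODEL_*`` macros, the flag bit positions, the example
request and the plain ``uint64_t`` bitmap are all hypothetical; only the
"low 8 bits are the number, upper bits are flags" layout comes from the
text above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative values only: the low 8 bits hold the request number and
 * the upper bits hold flags. The exact flag positions are assumptions of
 * this sketch.
 */
#define MODEL_REQUEST_MASK	0xffu
#define MODEL_REQUEST_NO_WAKEUP	(1u << 8)
#define MODEL_REQUEST_WAIT	(1u << 9)

/* Hypothetical request: number 0, waited upon, no wakeup needed. */
#define MODEL_REQ_EXAMPLE \
	(0u | MODEL_REQUEST_WAIT | MODEL_REQUEST_NO_WAKEUP)

/* Bit index into the requests bitmap: the flags must be masked off. */
unsigned int model_req_bit(unsigned int req)
{
	return req & MODEL_REQUEST_MASK;
}

/* True if a sleeping VCPU should be woken for this request. */
bool model_req_needs_wakeup(unsigned int req)
{
	return (req & MODEL_REQUEST_NO_WAKEUP) == 0;
}

/* True if the requester waits for IPI acknowledgements. */
bool model_req_waits_for_ack(unsigned int req)
{
	return (req & MODEL_REQUEST_WAIT) != 0;
}

/* Set the request's bit in a plain (non-atomic) model bitmap. */
void model_set_req(uint64_t *requests, unsigned int req)
{
	*requests |= UINT64_C(1) << model_req_bit(req);
}
```

Using the unmasked value as a bit index would shift by the flag bits as
well, which is why the masking step matters.
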
VCPU Requests with Associated State
===================================

Requesters that want the receiving VCPU to handle new state need to ensure
the newly written state is observable to the receiving VCPU thread's CPU
by the time it observes the request. This means a write memory barrier
must be inserted after writing the new state and before setting the VCPU
request bit. Additionally, on the receiving VCPU thread's side, a
corresponding read barrier must be inserted after reading the request bit
and before proceeding to read the new state associated with it. See
scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation
[memory-barriers]_.

The pair of functions, kvm_check_request() and kvm_make_request(), provide
the memory barriers, allowing this requirement to be handled internally by
the API.

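
The message-and-flag ordering can be sketched in user space with C11
atomics. This is an illustration of the pattern, not the KVM
implementation: ``vcpu_model``, ``new_state``, ``MODEL_REQ_BIT`` and the
two ``model_*`` functions are hypothetical, and C11 release and acquire
fences stand in for the kernel's write and read barriers.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of one VCPU with a single request and its payload. */
struct vcpu_model {
	int new_state;		/* state associated with the request */
	atomic_uint requests;	/* stand-in for vcpu->requests */
};

#define MODEL_REQ_BIT	(1u << 0)

/* Requester: publish the state, then set the request bit. */
void model_make_request(struct vcpu_model *v, int state)
{
	v->new_state = state;
	/* Write barrier: order the state store before the request store. */
	atomic_thread_fence(memory_order_release);
	atomic_fetch_or_explicit(&v->requests, MODEL_REQ_BIT,
				 memory_order_relaxed);
}

/* Receiver: consume the request bit, then read the state. */
bool model_check_request(struct vcpu_model *v, int *state)
{
	unsigned int old;

	old = atomic_fetch_and_explicit(&v->requests, ~MODEL_REQ_BIT,
					memory_order_relaxed);
	if (!(old & MODEL_REQ_BIT))
		return false;
	/* Read barrier: order the request load before the state load. */
	atomic_thread_fence(memory_order_acquire);
	*state = v->new_state;
	return true;
}
```

If either fence were dropped, the receiver could observe the request bit
and still read stale state, which is exactly the failure the paired
barriers rule out.
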
Ensuring Requests Are Seen
==========================

When making requests to VCPUs, we want to avoid the receiving VCPU
executing in guest mode for an arbitrarily long time without handling the
request. We can be sure this won't happen as long as we ensure the VCPU
thread checks kvm_request_pending() before entering guest mode and that a
kick will send an IPI to force an exit from guest mode when necessary.
Extra care must be taken to cover the period after the VCPU thread's last
kvm_request_pending() check and before it has entered guest mode, as kick
IPIs will only trigger guest mode exits for VCPU threads that are in guest
mode or at least have already disabled interrupts in order to prepare to
enter guest mode. This means that an optimized implementation (see "IPI
Reduction") must be certain when it's safe to not send the IPI. One
solution, which all architectures except s390 apply, is to:

- set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and
  the last kvm_request_pending() check;
- enable interrupts atomically when entering the guest.

This solution also requires memory barriers to be placed carefully in both
the requesting thread and the receiving VCPU. With the memory barriers we
can exclude the possibility of a VCPU thread observing
!kvm_request_pending() on its last check and then not receiving an IPI for
the next request made of it, even if the request is made immediately after
the check. This is done by way of the Dekker memory barrier pattern
(scenario 10 of [lwn-mb]_). As the Dekker pattern requires two variables,
this solution pairs ``vcpu->mode`` with ``vcpu->requests``. Substituting
them into the pattern gives::

  CPU1                                    CPU2
  =================                       =================
  local_irq_disable();
  WRITE_ONCE(vcpu->mode, IN_GUEST_MODE);  kvm_make_request(REQ, vcpu);
  smp_mb();                               smp_mb();
  if (kvm_request_pending(vcpu)) {        if (READ_ONCE(vcpu->mode) ==
                                              IN_GUEST_MODE) {
      ...abort guest entry...                 ...send IPI...
  }                                       }

As stated above, the IPI is only useful for VCPU threads in guest mode or
that have already disabled interrupts. This is why this specific case of
the Dekker pattern has been extended to disable interrupts before setting
``vcpu->mode`` to IN_GUEST_MODE. WRITE_ONCE() and READ_ONCE() are used to
pedantically implement the memory barrier pattern, guaranteeing the
compiler doesn't interfere with ``vcpu->mode``'s carefully planned
accesses.

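
The two sides of the pattern can be modelled in user space with C11
atomics. This is a sketch under stated assumptions rather than KVM code:
``vcpu_model`` and both ``model_*`` functions are hypothetical, sequentially
consistent fences stand in for the full barriers, and the model has no
notion of interrupts, so the interrupt-disabling step is only noted in a
comment.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum model_mode { MODEL_OUTSIDE_GUEST_MODE, MODEL_IN_GUEST_MODE };

struct vcpu_model {
	atomic_int mode;	/* stand-in for vcpu->mode */
	atomic_uint requests;	/* stand-in for vcpu->requests */
};

/*
 * VCPU side: returns true if guest entry may proceed, false if a pending
 * request forces the entry to be aborted. A real implementation would
 * have interrupts disabled before the mode store.
 */
bool model_try_enter_guest(struct vcpu_model *v)
{
	atomic_store_explicit(&v->mode, MODEL_IN_GUEST_MODE,
			      memory_order_relaxed);
	/* Full barrier between the mode store and the requests load. */
	atomic_thread_fence(memory_order_seq_cst);
	if (atomic_load_explicit(&v->requests, memory_order_relaxed)) {
		atomic_store_explicit(&v->mode, MODEL_OUTSIDE_GUEST_MODE,
				      memory_order_relaxed);
		return false;
	}
	return true;
}

/* Requester side: returns true if an IPI must be sent. */
bool model_make_request_and_kick(struct vcpu_model *v, unsigned int bit)
{
	atomic_fetch_or_explicit(&v->requests, 1u << bit,
				 memory_order_relaxed);
	/* Full barrier between the requests store and the mode load. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&v->mode, memory_order_relaxed) ==
	       MODEL_IN_GUEST_MODE;
}
```

With both fences in place, at least one side observes the other's store:
either the VCPU sees the request and aborts its entry, or the requester
sees IN_GUEST_MODE and sends the IPI.
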
IPI Reduction
-------------

Because only one IPI is needed to get a VCPU to check for any and all
requests, IPIs may be coalesced. This is easily done by having the first
IPI-sending kick also change the VCPU mode to something other than
IN_GUEST_MODE. The transitional state, EXITING_GUEST_MODE, is used for
this purpose.

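
One way to picture the coalescing is an atomic compare-and-exchange on the
mode: only the kicker that wins the transition sends the IPI. The sketch
below is a user-space model using C11 atomics; ``model_kick_needs_ipi()``
and the ``MODEL_*`` names are hypothetical and are not claimed to match the
kernel's helpers.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum model_mode {
	MODEL_OUTSIDE_GUEST_MODE,
	MODEL_IN_GUEST_MODE,
	MODEL_EXITING_GUEST_MODE,
};

/*
 * Returns true if the caller must send the IPI. Only the kicker that wins
 * the IN_GUEST_MODE -> EXITING_GUEST_MODE transition sends it; any later
 * kicker finds a mode other than IN_GUEST_MODE and sends nothing.
 */
bool model_kick_needs_ipi(atomic_int *mode)
{
	int expected = MODEL_IN_GUEST_MODE;

	return atomic_compare_exchange_strong(mode, &expected,
					      MODEL_EXITING_GUEST_MODE);
}
```

Two back-to-back kicks against a VCPU in guest mode therefore produce a
single IPI in this model.
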
Waiting for Acknowledgements
----------------------------

Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to
be sent, and the acknowledgements to be waited upon, even when the target
VCPU threads are in modes other than IN_GUEST_MODE. For example, one case
is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which
is set after disabling interrupts. To support these cases, the
KVM_REQUEST_WAIT flag changes the condition for sending an IPI from
checking that the VCPU is IN_GUEST_MODE to checking that it is not
OUTSIDE_GUEST_MODE.

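
The changed condition fits in a few lines. This is a hypothetical
user-space model, not kernel code: ``model_should_send_ipi()`` and the
``MODEL_*`` names are invented for illustration, and the sketch ignores IPI
coalescing.

```c
#include <stdbool.h>

enum model_mode {
	MODEL_OUTSIDE_GUEST_MODE,
	MODEL_IN_GUEST_MODE,
	MODEL_EXITING_GUEST_MODE,
	MODEL_READING_SHADOW_PAGE_TABLES,
};

/* @wait models a request that carries KVM_REQUEST_WAIT. */
bool model_should_send_ipi(enum model_mode mode, bool wait)
{
	if (wait)
		return mode != MODEL_OUTSIDE_GUEST_MODE;
	return mode == MODEL_IN_GUEST_MODE;
}
```

A VCPU in READING_SHADOW_PAGE_TABLES mode is thus skipped by an ordinary
request but receives an IPI, and is waited upon, when the request carries
KVM_REQUEST_WAIT.
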
Request-less VCPU Kicks
-----------------------

Because the determination of whether or not to send an IPI depends on the
two-variable Dekker memory barrier pattern, request-less VCPU kicks are
almost never correct. Without the assurance that a non-IPI generating
kick will still result in an action by the receiving VCPU, as the final
kvm_request_pending() check provides for request-accompanying kicks, the
kick may not do anything useful at all. If, for instance, a request-less
kick were made to a VCPU that was just about to set its mode to
IN_GUEST_MODE, meaning no IPI is sent, the VCPU thread may continue its
entry without actually having done whatever the kick was meant to
initiate.

One exception is x86's posted interrupt mechanism. In this case, however,
even the request-less VCPU kick is coupled with the same
local_irq_disable() + smp_mb() pattern described above; the ON bit
(Outstanding Notification) in the posted interrupt descriptor takes the
role of ``vcpu->requests``. When sending a posted interrupt, PIR.ON is
set before reading ``vcpu->mode``; dually, in the VCPU thread,
vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to
IN_GUEST_MODE.

Additional Considerations
=========================

Sleeping VCPUs
--------------

VCPU threads may need to consider requests before and/or after calling
functions that may put them to sleep, e.g. kvm_vcpu_block(). Whether they
do or not, and, if they do, which requests need consideration, is
architecture dependent. kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
to check if it should awaken. One reason to do so is to provide
architectures a function where requests may be checked if necessary.

References
==========

.. [atomic-ops] Documentation/atomic_bitops.txt and Documentation/atomic_t.txt
.. [memory-barriers] Documentation/memory-barriers.txt
.. [lwn-mb] https://lwn.net/Articles/573436/