.. SPDX-License-Identifier: GPL-2.0

=====================================
Scaling in the Linux Networking Stack
=====================================

This document describes a set of complementary techniques in the Linux
networking stack to increase parallelism and improve performance for
multi-processor systems.

The following technologies are described:

- RSS: Receive Side Scaling
- RPS: Receive Packet Steering
- RFS: Receive Flow Steering
- Accelerated Receive Flow Steering
- XPS: Transmit Packet Steering

RSS: Receive Side Scaling
=========================

Contemporary NICs support multiple receive and transmit descriptor queues
(multi-queue). On reception, a NIC can send different packets to different
queues to distribute processing among CPUs. The NIC distributes packets by
applying a filter to each packet that assigns it to one of a small number
of logical flows. Packets for each flow are steered to a separate receive
queue, which in turn can be processed by separate CPUs. This mechanism is
generally known as “Receive-side Scaling” (RSS). The goal of RSS and
the other scaling techniques is to increase performance uniformly.
Multi-queue distribution can also be used for traffic prioritization, but
that is not the focus of these techniques.

The filter used in RSS is typically a hash function over the network
and/or transport layer headers, for example a 4-tuple hash over the
IP addresses and TCP ports of a packet. The most common hardware
implementation of RSS uses a 128-entry indirection table where each entry
stores a queue number. The receive queue for a packet is determined
by masking the computed hash for the packet (usually a Toeplitz hash)
down to its low-order seven bits, taking this number as a key into the
indirection table and reading the corresponding value.

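
The lookup described above can be sketched in a few lines of Python. This is
only an illustrative model: real NICs perform the lookup in hardware, and the
function names and the evenly spread default table are assumptions made for
the example, not a driver API.

```python
# Illustrative model only; real NICs do this in hardware.
TABLE_SIZE = 128  # entries in the indirection table, as described above


def build_indirection_table(num_queues):
    """Spread the queues evenly over the table (assumed driver default)."""
    return [entry % num_queues for entry in range(TABLE_SIZE)]


def rss_queue(packet_hash, table):
    """Keep the low-order seven bits of the hash and use them as the index."""
    index = packet_hash & (TABLE_SIZE - 1)
    return table[index]


table = build_indirection_table(num_queues=13)
print(table[8:16])                   # [8, 9, 10, 11, 12, 0, 1, 2]
print(rss_queue(0xDEADBEEF, table))  # 7
```

Because only seven bits of the hash are used, changing the table contents
changes which queue a flow lands on without touching the hash function.
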
Some NICs support symmetric RSS hashing where, if the IP (source address,
destination address) and TCP/UDP (source port, destination port) tuples
are swapped, the computed hash is the same. This is beneficial in some
applications that monitor TCP/IP flows (IDS, firewalls, etc.) and need
both directions of the flow to land on the same Rx queue (and CPU).
"Symmetric-XOR" is a type of RSS algorithm that achieves this hash
symmetry by XORing the input source and destination fields of the IP
and/or L4 protocols. This, however, results in reduced input entropy and
could potentially be exploited. Specifically, the algorithm XORs the input
as follows::

  # (SRC_IP ^ DST_IP, SRC_IP ^ DST_IP, SRC_PORT ^ DST_PORT, SRC_PORT ^ DST_PORT)

The result is then fed to the underlying RSS algorithm.

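
A minimal sketch of the XOR step, assuming the addresses and ports are simply
treated as integers. The function name and the sample addresses are
hypothetical, and the hash that consumes the tuple is left out.

```python
import ipaddress


def symmetric_xor_input(src_ip, dst_ip, src_port, dst_port):
    """Build the XORed 4-tuple that is handed to the underlying RSS hash."""
    ip_xor = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    port_xor = src_port ^ dst_port
    return (ip_xor, ip_xor, port_xor, port_xor)


forward = symmetric_xor_input("192.0.2.1", "198.51.100.7", 40000, 443)
reverse = symmetric_xor_input("198.51.100.7", "192.0.2.1", 443, 40000)
print(forward == reverse)  # True
```

The same sketch illustrates the entropy loss: any packet whose source and
destination fields are equal produces the all-zero input, whatever the actual
addresses and ports are.
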
Some advanced NICs allow steering packets to queues based on
programmable filters. For example, webserver bound TCP port 80 packets
can be directed to their own receive queue. Such “n-tuple” filters can
be configured from ethtool (--config-ntuple).

RSS Configuration
-----------------

The driver for a multi-queue capable NIC typically provides a kernel
module parameter for specifying the number of hardware queues to
configure. In the bnx2x driver, for instance, this parameter is called
num_queues. A typical RSS configuration would be to have one receive queue
for each CPU if the device supports enough queues, or otherwise at least
one for each memory domain, where a memory domain is a set of CPUs that
share a particular memory level (L1, L2, NUMA node, etc.).

The indirection table of an RSS device, which resolves a queue by masked
hash, is usually programmed by the driver at initialization. The
default mapping is to distribute the queues evenly in the table, but the
indirection table can be retrieved and modified at runtime using ethtool
commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
indirection table could be done to give different queues different
relative weights.

RSS IRQ Configuration
~~~~~~~~~~~~~~~~~~~~~

Each receive queue has a separate IRQ associated with it. The NIC triggers
this to notify a CPU when new packets arrive on the given queue. The
signaling path for PCIe devices uses message signaled interrupts (MSI-X),
which can route each interrupt to a particular CPU. The active mapping
of queues to IRQs can be determined from /proc/interrupts. By default,
an IRQ may be handled on any CPU. Because a non-negligible part of packet
processing takes place in receive interrupt handling, it is advantageous
to spread receive interrupts between CPUs. To manually adjust the IRQ
affinity of each interrupt, see Documentation/core-api/irq/irq-affinity.rst.
Some systems will be running irqbalance, a daemon that dynamically
optimizes IRQ assignments and as a result may override any manual settings.

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

RSS should be enabled when latency is a concern or whenever receive
interrupt processing forms a bottleneck. Spreading load between CPUs
decreases queue length. For low latency networking, the optimal setting
is to allocate as many queues as there are CPUs in the system (or the
NIC maximum, if lower). The most efficient high-rate configuration
is likely the one with the smallest number of receive queues where no
receive queue overflows due to a saturated CPU, because in default
mode with interrupt coalescing enabled, the aggregate number of
interrupts (and thus work) grows with each additional queue.

Per-CPU load can be observed using the mpstat utility, but note that on
processors with hyperthreading (HT), each hyperthread is represented as
a separate CPU. For interrupt handling, HT has shown no benefit in
initial tests, so limit the number of queues to the number of CPU cores
in the system.

Dedicated RSS contexts
~~~~~~~~~~~~~~~~~~~~~~

Modern NICs support creating multiple co-existing RSS configurations
which are selected based on explicit matching rules. This can be very
useful when an application wants to constrain the set of queues receiving
traffic for e.g. a particular destination port or IP address.
The example below shows how to direct all traffic to TCP port 22
to queues 0 and 1.

To create an additional RSS context use::

  # ethtool -X eth0 hfunc toeplitz context new

The kernel reports back the ID of the allocated context (the default,
always present RSS context has ID of 0). The new context can be queried
and modified using the same APIs as the default context::

  # ethtool -x eth0 context 1
  RX flow hash indirection table for eth0 with 13 RX ring(s):
      0:      0     1     2     3     4     5     6     7
      8:      8     9    10    11    12     0     1     2
  [...]
  # ethtool -X eth0 equal 2 context 1
  # ethtool -x eth0 context 1
  RX flow hash indirection table for eth0 with 13 RX ring(s):
      0:      0     1     0     1     0     1     0     1
      8:      0     1     0     1     0     1     0     1
  [...]

To make use of the new context, direct traffic to it using an n-tuple
filter::

  # ethtool -N eth0 flow-type tcp6 dst-port 22 context 1
  Added rule with ID 1023

When done, remove the context and the rule::

  # ethtool -N eth0 delete 1023
  # ethtool -X eth0 context 1 delete

RPS: Receive Packet Steering
============================

Receive Packet Steering (RPS) is logically a software implementation of
RSS. Being in software, it is necessarily called later in the datapath.
Whereas RSS selects the queue and hence CPU that will run the hardware
interrupt handler, RPS selects the CPU to perform protocol processing
above the interrupt handler. This is accomplished by placing the packet
on the desired CPU’s backlog queue and waking up the CPU for processing.
RPS has some advantages over RSS:

1) it can be used with any NIC
2) software filters can easily be added to hash over new protocols
3) it does not increase hardware device interrupt rate (although it does
   introduce inter-processor interrupts (IPIs))

RPS is called during the bottom half of the receive interrupt handler,
when a driver sends a packet up the network stack with netif_rx() or
netif_receive_skb(). These call the get_rps_cpu() function, which
selects the CPU, and thus the backlog queue, that should process a packet.

The first step in determining the target CPU for RPS is to calculate a
flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
depending on the protocol). This serves as a consistent hash of the
associated flow of the packet. The hash is either provided by hardware
or will be computed in the stack. Capable hardware can pass the hash in
the receive descriptor for the packet; this would usually be the same
hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
skb->hash and can be used elsewhere in the stack as a hash of the
packet’s flow.

Each receive hardware queue has an associated list of CPUs to which
RPS may enqueue packets for processing. For each received packet,
an index into the list is computed from the flow hash modulo the size
of the list. The indexed CPU is the target for processing the packet,
and the packet is queued to the tail of that CPU’s backlog queue. At
the end of the bottom half routine, IPIs are sent to any CPUs for which
packets have been queued to their backlog queue. The IPI wakes backlog
processing on the remote CPU, and any queued packets are then processed
up the networking stack.

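
The CPU selection step can be modelled as below. This is a sketch that follows
the description above literally (hash modulo list size); the kernel may map
the hash to an index differently, and the function name and the CPU list are
hypothetical.

```python
def rps_target_cpu(flow_hash, rps_cpu_list):
    """Index the queue's CPU list with the flow hash modulo the list size."""
    if not rps_cpu_list:
        return None  # RPS disabled: process on the interrupting CPU
    return rps_cpu_list[flow_hash % len(rps_cpu_list)]


cpus_for_rx0 = [2, 3, 6, 7]  # hypothetical CPU list for one receive queue
print(rps_target_cpu(0x1234ABCD, cpus_for_rx0))  # 3
```

Since the hash is constant for a flow, every packet of that flow is steered to
the same CPU, which is what keeps packets in order.
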
RPS Configuration
-----------------

RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
by default for SMP). Even when compiled in, RPS remains disabled until
explicitly configured. The list of CPUs to which RPS may forward traffic
can be configured for each receive queue using a sysfs file entry::

  /sys/class/net/<dev>/queues/rx-<n>/rps_cpus

This file implements a bitmap of CPUs. RPS is disabled when it is zero
(the default), in which case packets are processed on the interrupting
CPU. Documentation/core-api/irq/irq-affinity.rst explains how CPUs are
assigned to the bitmap.

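
As a convenience, the bitmap value for a list of CPUs can be computed with a
small helper. This sketch assumes the usual kernel bitmap notation of
hexadecimal digits in comma-separated 32-bit groups, most significant group
first; the helper name is hypothetical.

```python
def cpus_to_bitmap(cpus):
    """Return a hex CPU bitmap, in comma-separated 32-bit groups."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    groups = []
    while True:
        groups.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if mask == 0:
            break
    groups.reverse()  # most significant group first
    return ",".join(
        format(group, "x") if i == 0 else format(group, "08x")
        for i, group in enumerate(groups)
    )


print(cpus_to_bitmap([0, 1, 2, 3]))  # f
print(cpus_to_bitmap([4, 5, 6, 7]))  # f0
print(cpus_to_bitmap([0, 32]))       # 1,00000001
```

The resulting string is what would be written, as root, to the rps_cpus file
of the chosen receive queue.
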
Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a single queue device, a typical RPS configuration would be to set
the rps_cpus to the CPUs in the same memory domain of the interrupting
CPU. If NUMA locality is not an issue, this could also be all CPUs in
the system. At high interrupt rate, it might be wise to exclude the
interrupting CPU from the map since that already performs much work.

For a multi-queue system, if RSS is configured so that a hardware
receive queue is mapped to each CPU, then RPS is probably redundant
and unnecessary. If there are fewer hardware queues than CPUs, then
RPS might be beneficial if the rps_cpus for each queue are the ones that
share the same memory domain as the interrupting CPU for that queue.

RPS Flow Limit
--------------

RPS scales kernel receive processing across CPUs without introducing
reordering. The trade-off to sending all packets from the same flow
to the same CPU is CPU load imbalance if flows vary in packet rate.
In the extreme case, a single flow dominates traffic. Especially on
common server workloads with many concurrent connections, such
behavior indicates a problem such as a misconfiguration or a spoofed
source Denial of Service attack.

Flow Limit is an optional RPS feature that prioritizes small flows
during CPU contention by dropping packets from large flows slightly
ahead of those from small flows. It is active only when an RPS or RFS
destination CPU approaches saturation. Once a CPU's input packet
queue exceeds half the maximum queue length (as set by sysctl
net.core.netdev_max_backlog), the kernel starts a per-flow packet
count over the last 256 packets. If a flow exceeds a set ratio (by
default, half) of these packets when a new packet arrives, then the
new packet is dropped. Packets from other flows are still only
dropped once the input packet queue reaches netdev_max_backlog.
No packets are dropped when the input packet queue length is below
the threshold, so flow limit does not sever connections outright:
even large flows maintain connectivity.

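
The drop policy can be illustrated with a toy model. This is a simplified
sketch of the behaviour described above, not the kernel code: the class name,
the modulo bucket selection and the 4096-bucket table are assumptions made
for the example.

```python
from collections import deque

HISTORY_LEN = 256   # packets remembered, as described above
NUM_BUCKETS = 4096  # assumed hash table size for this sketch


class FlowLimit:
    """Toy per-CPU flow limit: counts flows over the last HISTORY_LEN packets."""

    def __init__(self, max_backlog=1000, ratio=0.5):
        self.max_backlog = max_backlog
        self.ratio = ratio
        self.history = deque()
        self.buckets = [0] * NUM_BUCKETS

    def should_drop(self, flow_hash, queue_len):
        """Return True if a newly arrived packet should be dropped."""
        if queue_len >= self.max_backlog:
            return True   # queue full: packets of every flow are dropped
        if queue_len <= self.max_backlog // 2:
            return False  # below the threshold: nothing is counted
        bucket = flow_hash % NUM_BUCKETS
        if len(self.history) == HISTORY_LEN:
            self.buckets[self.history.popleft()] -= 1  # forget oldest packet
        self.history.append(bucket)
        self.buckets[bucket] += 1
        return self.buckets[bucket] > self.ratio * HISTORY_LEN


limiter = FlowLimit()
drops = [limiter.should_drop(flow_hash=7, queue_len=600) for _ in range(200)]
print(drops.count(True))                                 # 72
print(limiter.should_drop(flow_hash=99, queue_len=600))  # False
```

In the run above the dominant flow starts losing packets once it accounts for
more than 128 of the recent packets, while a packet from a small flow arriving
at the same queue length is still accepted.
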
Interface
~~~~~~~~~

Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
turned on. It is implemented for each CPU independently (to avoid lock
and cache contention) and toggled per CPU by setting the relevant bit
in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
bitmap interface as rps_cpus (see above) when called from procfs::

  /proc/sys/net/core/flow_limit_cpu_bitmap

Per-flow rate is calculated by hashing each packet into a hashtable
bucket and incrementing a per-bucket counter. The hash function is
the same that selects a CPU in RPS, but as the number of buckets can
be much larger than the number of CPUs, flow limit has finer-grained
identification of large flows and fewer false positives. The default
table has 4096 buckets. This value can be modified through sysctl::

  net.core.flow_limit_table_len

The value is only consulted when a new table is allocated. Modifying
it does not update active tables.

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Flow limit is useful on systems with many concurrent connections,
where a single connection taking up 50% of a CPU indicates a problem.
In such environments, enable the feature on all CPUs that handle
network rx interrupts (as set in /proc/irq/N/smp_affinity).

The feature depends on the input packet queue length exceeding
the flow limit threshold (50%) plus the flow history length (256).
Setting net.core.netdev_max_backlog to either 1000 or 10000
performed well in experiments.

RFS: Receive Flow Steering
==========================

While RPS steers packets solely based on hash, and thus generally
provides good load distribution, it does not take into account
application locality. This is accomplished by Receive Flow Steering
(RFS). The goal of RFS is to increase data cache hit rate by steering
kernel processing of packets to the CPU where the application thread
consuming the packet is running. RFS relies on the same RPS mechanisms
to enqueue packets onto the backlog of another CPU and to wake up that
CPU.

In RFS, packets are not forwarded directly by the value of their hash,
but the hash is used as an index into a flow lookup table. This table maps
flows to the CPUs where those flows are being processed. The flow hash
(see RPS section above) is used to calculate the index into this table.
The CPU recorded in each entry is the one which last processed the flow.
If an entry does not hold a valid CPU, then packets mapped to that entry
are steered using plain RPS. Multiple table entries may point to the
same CPU. Indeed, with many flows and few CPUs, it is very likely that
a single application thread handles flows with many different flow hashes.

rps_sock_flow_table is a global flow table that contains the *desired* CPU
for flows: the CPU that is currently processing the flow in userspace.
Each table value is a CPU index that is updated during calls to recvmsg
and sendmsg (for example, in inet_recvmsg() and inet_sendmsg()).

When the scheduler moves a thread to a new CPU while it has outstanding
receive packets on the old CPU, packets may arrive out of order. To
avoid this, RFS uses a second flow table to track outstanding packets
for each flow: rps_dev_flow_table is a table specific to each hardware
receive queue of each device. Each table value stores a CPU index and a
counter. The CPU index represents the *current* CPU onto which packets
for this flow are enqueued for further kernel processing. Ideally, kernel
and userspace processing occur on the same CPU, and hence the CPU index
in both tables is identical. This is likely false if the scheduler has
recently migrated a userspace thread while the kernel still has packets
enqueued for kernel processing on the old CPU.

The counter in rps_dev_flow_table values records the length of the current
CPU's backlog when a packet in this flow was last enqueued. Each backlog
queue has a head counter that is incremented on dequeue. A tail counter
is computed as head counter + queue length. In other words, the counter
in rps_dev_flow[i] records the last element in flow i that has
been enqueued onto the currently designated CPU for flow i (of course,
entry i is actually selected by hash and multiple flows may hash to the
same entry i).

And now the trick for avoiding out of order packets: when selecting the
CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
and the rps_dev_flow table of the queue that the packet was received on
are compared. If the desired CPU for the flow (found in the
rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
table), the packet is enqueued onto that CPU’s backlog. If they differ,
the current CPU is updated to match the desired CPU if one of the
following is true:

- The current CPU's queue head counter >= the recorded tail counter
  value in rps_dev_flow[i]
- The current CPU is unset (>= nr_cpu_ids)
- The current CPU is offline

After this check, the packet is sent to the (possibly updated) current
CPU. These rules aim to ensure that a flow only moves to a new CPU when
there are no packets outstanding on the old CPU, as the outstanding
packets could arrive later than those about to be processed on the new
CPU.

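
The comparison of the two tables can be sketched as follows. This is a
simplified model for one flow entry: the names DevFlowEntry, rfs_select_cpu
and NR_CPU_IDS are hypothetical, backlogs are reduced to a head counter and a
length per CPU, and the fallback to plain RPS for an invalid desired CPU is
left out.

```python
from dataclasses import dataclass

NR_CPU_IDS = 8  # hypothetical number of possible CPUs


@dataclass
class DevFlowEntry:
    """One per-queue flow entry: the current CPU and the recorded tail counter."""
    cpu: int = NR_CPU_IDS  # a value >= NR_CPU_IDS means "unset"
    last_qtail: int = 0


def rfs_select_cpu(entry, desired_cpu, backlog_head, backlog_len, online):
    """Pick the CPU for a packet, moving the flow only when it is safe.

    backlog_head and backlog_len map a CPU to its head counter and queue
    length; online is the set of online CPUs.
    """
    if desired_cpu != entry.cpu:
        unset = entry.cpu >= NR_CPU_IDS
        offline = not unset and entry.cpu not in online
        drained = not unset and backlog_head[entry.cpu] >= entry.last_qtail
        if unset or offline or drained:
            entry.cpu = desired_cpu
    # Enqueue on the chosen CPU and record its tail counter (head + length).
    backlog_len[entry.cpu] += 1
    entry.last_qtail = backlog_head[entry.cpu] + backlog_len[entry.cpu]
    return entry.cpu


head = {0: 0, 1: 0}
length = {0: 0, 1: 0}
entry = DevFlowEntry()
first = rfs_select_cpu(entry, 0, head, length, {0, 1})   # unset: take CPU 0
second = rfs_select_cpu(entry, 1, head, length, {0, 1})  # CPU 0 still busy
head[0] += length[0]                                     # CPU 0 drains
length[0] = 0
third = rfs_select_cpu(entry, 1, head, length, {0, 1})   # now safe to move
print(first, second, third)  # 0 0 1
```

The second packet stays on CPU 0 even though the application has moved,
because an earlier packet of the flow is still queued there.
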
RFS Configuration
-----------------

RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
by default for SMP). The functionality remains disabled until explicitly
configured. The number of entries in the global flow table is set through::

  /proc/sys/net/core/rps_sock_flow_entries

The number of entries in the per-queue flow table is set through::

  /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

Both of these need to be set before RFS is enabled for a receive queue.
Values for both are rounded up to the nearest power of two. The
suggested flow count depends on the expected number of active connections
at any given time, which may be significantly less than the number of open
connections. We have found that a value of 32768 for rps_sock_flow_entries
works fairly well on a moderately loaded server.

For a single queue device, the rps_flow_cnt value for the single queue
would normally be configured to the same value as rps_sock_flow_entries.
For a multi-queue device, the rps_flow_cnt for each queue might be
configured as rps_sock_flow_entries / N, where N is the number of
queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
are 16 configured receive queues, rps_flow_cnt for each queue might be
configured as 2048.

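
The sizing arithmetic above can be written out as a short helper. The function
names are hypothetical; the rounding mirrors the statement that values are
rounded up to the nearest power of two.

```python
def round_up_pow2(n):
    """Round a positive integer up to the nearest power of two."""
    return 1 << (n - 1).bit_length()


def suggested_rps_flow_cnt(rps_sock_flow_entries, num_rx_queues):
    """Split the global table size evenly over the receive queues."""
    per_queue = max(1, rps_sock_flow_entries // num_rx_queues)
    return round_up_pow2(per_queue)


print(round_up_pow2(30000))               # 32768
print(suggested_rps_flow_cnt(32768, 16))  # 2048
print(suggested_rps_flow_cnt(32768, 1))   # 32768
```
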
Accelerated RFS
===============

Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
balancing mechanism that uses soft state to steer flows based on where
the application thread consuming the packets of each flow is running.
Accelerated RFS should perform better than RFS since packets are sent
directly to a CPU local to the thread consuming the data. The target CPU
will either be the same CPU where the application runs, or at least a CPU
which is local to the application thread’s CPU in the cache hierarchy.

To enable accelerated RFS, the networking stack calls the
ndo_rx_flow_steer driver function to communicate the desired hardware
queue for packets matching a particular flow. The network stack
automatically calls this function every time a flow entry in
rps_dev_flow_table is updated. The driver in turn uses a device specific
method to program the NIC to steer the packets.

The hardware queue for a flow is derived from the CPU recorded in
rps_dev_flow_table. The stack consults a CPU to hardware queue map which
is maintained by the NIC driver. This is an auto-generated reverse map of
the IRQ affinity table shown by /proc/interrupts. Drivers can use
functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
to populate the map. For each CPU, the corresponding queue in the map is
set to be one whose processing CPU is closest in cache locality.

Accelerated RFS Configuration
-----------------------------

Accelerated RFS is only available if the kernel is compiled with
CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
It also requires that ntuple filtering is enabled via ethtool. The map
of CPU to queues is automatically deduced from the IRQ affinities
configured for each receive queue by the driver, so no additional
configuration should be necessary.

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

This technique should be enabled whenever one wants to use RFS and the
NIC supports hardware acceleration.

XPS: Transmit Packet Steering
=============================

Transmit Packet Steering is a mechanism for intelligently selecting
which transmit queue to use when transmitting a packet on a multi-queue
device. This can be accomplished by recording two kinds of maps, either
a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
to hardware transmit queue(s).

1. XPS using CPUs map

The goal of this mapping is usually to assign queues
exclusively to a subset of CPUs, where the transmit completions for
these queues are processed on a CPU within this set. This choice
provides two benefits. First, contention on the device queue lock is
significantly reduced since fewer CPUs contend for the same queue
(contention can be eliminated completely if each CPU has its own
transmit queue). Secondly, cache miss rate on transmit completion is
reduced, in particular for data cache lines that hold the sk_buff
structures.

2. XPS using receive queues map

This mapping is used to pick the transmit queue based on the receive
queue(s) map configuration set by the administrator. A set of receive
queues can be mapped to a set of transmit queues (many:many), although
the common use case is a 1:1 mapping. This will enable sending packets
on the same queue associations for transmit and receive. This is useful for
busy polling multi-threaded workloads where there are challenges in
associating a given CPU to a given application thread. The application
threads are not pinned to CPUs and each thread handles packets
received on a single queue. The receive queue number is cached in the
socket for the connection. In this model, sending the packets on the same
transmit queue corresponding to the associated receive queue has benefits
in keeping the CPU overhead low. Transmit completion work is locked into
the same queue-association that a given application is polling on. This
avoids the overhead of triggering an interrupt on another CPU. When the
application cleans up the packets during the busy poll, transmit completion
may be processed along with it in the same thread context and so result in
reduced latency.

XPS is configured per transmit queue by setting a bitmap of
CPUs/receive-queues that may use that queue to transmit. The reverse
mapping, from CPUs to transmit queues or from receive-queues to transmit
queues, is computed and maintained for each network device. When
transmitting the first packet in a flow, the function get_xps_queue() is
called to select a queue. This function uses the ID of the receive queue
for the socket connection for a match in the receive queue-to-transmit queue
lookup table. Alternatively, this function can also use the ID of the
running CPU as a key into the CPU-to-queue lookup table. If the
ID matches a single queue, that is used for transmission. If multiple
queues match, one is selected by using the flow hash to compute an index
into the set. When selecting the transmit queue based on the receive
queue(s) map, the transmit device is not validated against the receive
device, as that would require an expensive lookup operation in the datapath.

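
The selection logic can be sketched as below. This is a model of the
description above rather than the kernel function: build_reverse_map and
xps_select_queue are hypothetical names, the maps are plain dictionaries, and
the hash-to-index step is assumed to be a simple modulo.

```python
def build_reverse_map(per_tx_queue):
    """Invert {tx queue: set of CPUs or rx queues} into {key: [tx queues]}."""
    reverse = {}
    for txq in sorted(per_tx_queue):
        for key in sorted(per_tx_queue[txq]):
            reverse.setdefault(key, []).append(txq)
    return reverse


def xps_select_queue(flow_hash, cpu, rx_queue, cpu_map, rxq_map):
    """Return a transmit queue index, or None when XPS has no mapping."""
    candidates = None
    if rx_queue is not None:
        candidates = rxq_map.get(rx_queue)  # receive-queue map first
    if not candidates:
        candidates = cpu_map.get(cpu)       # otherwise the CPUs map
    if not candidates:
        return None                         # caller must choose another way
    if len(candidates) == 1:
        return candidates[0]
    return candidates[flow_hash % len(candidates)]


cpu_map = build_reverse_map({0: {0, 1}, 1: {2, 3}, 2: {3}})
rxq_map = build_reverse_map({1: {5}})
print(xps_select_queue(10, cpu=0, rx_queue=None, cpu_map=cpu_map, rxq_map=rxq_map))  # 0
print(xps_select_queue(11, cpu=3, rx_queue=None, cpu_map=cpu_map, rxq_map=rxq_map))  # 2
print(xps_select_queue(10, cpu=3, rx_queue=5, cpu_map=cpu_map, rxq_map=rxq_map))     # 1
```

In the example, CPU 3 is allowed to use two transmit queues, so the flow hash
decides between them, while a socket with a cached receive queue of 5 follows
the receive-queue map instead.
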
The queue chosen for transmitting a particular flow is saved in the
corresponding socket structure for the flow (e.g. a TCP connection).
This transmit queue is used for subsequent packets sent on the flow to
prevent out of order (ooo) packets. The choice also amortizes the cost
of calling get_xps_queue() over all packets in the flow. To avoid
ooo packets, the queue for a flow can subsequently only be changed if
skb->ooo_okay is set for a packet in the flow. This flag indicates that
there are no outstanding packets in the flow, so the transmit queue can
change without the risk of generating out of order packets. The
transport layer is responsible for setting ooo_okay appropriately. TCP,
for instance, sets the flag when all data for a connection has been
acknowledged.

XPS Configuration
-----------------

XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
default for SMP). If compiled in, it is driver dependent whether, and
how, XPS is configured at device init. The mapping of CPUs/receive-queues
to transmit queue can be inspected and configured using sysfs:

For selection based on CPUs map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

For selection based on receive-queues map::

  /sys/class/net/<dev>/queues/tx-<n>/xps_rxqs

Suggested Configuration
~~~~~~~~~~~~~~~~~~~~~~~

For a network device with a single transmission queue, XPS configuration
has no effect, since there is no choice in this case. In a multi-queue
system, XPS is preferably configured so that each CPU maps onto one queue.
If there are as many queues as there are CPUs in the system, then each
queue can also map onto one CPU, resulting in exclusive pairings that
experience no contention. If there are fewer queues than CPUs, then the
best CPUs to share a given queue are probably those that share the cache
with the CPU that processes transmit completions for that queue
(transmit interrupts).

For transmit queue selection based on receive queue(s), XPS has to be
explicitly configured by mapping receive-queue(s) to transmit queue(s).
If the user configuration for the receive-queue map does not apply, then
the transmit queue is selected based on the CPUs map.

Per TX Queue rate limitation
============================

These are rate-limitation mechanisms implemented by HW, where currently
a max-rate attribute is supported, by setting a Mbps value to::

  /sys/class/net/<dev>/queues/tx-<n>/tx_maxrate

A value of zero means disabled, and this is the default.

Further Information
===================

RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
2.6.38. Original patches were submitted by Tom Herbert
(therbert@google.com).

Accelerated RFS was introduced in 2.6.35. Original patches were
submitted by Ben Hutchings (bwh@kernel.org).

Authors
=======

- Tom Herbert (therbert@google.com)
- Willem de Bruijn (willemb@google.com)