.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===============
Multi-PF Netdev
===============

Contents
========

- `Background`_
- `Overview`_
- `mlx5 implementation`_
- `Channels distribution`_
- `Observability`_
- `Steering`_
- `Mutually exclusive features`_

Background
==========

The Multi-PF NIC technology enables several CPUs within a multi-socket server to connect directly to
the network, each through its own dedicated PCIe interface. This is done either through a connection
harness that splits the PCIe lanes between two cards, or by bifurcating a PCIe slot for a single
card. As a result, network traffic no longer traverses the internal bus between the sockets, which
significantly reduces overhead and latency, lowers CPU utilization, and increases network
throughput.

Overview
========

The feature adds support for combining multiple PFs of the same port in a Multi-PF environment under
one netdev instance. It is implemented in the netdev layer. Lower-layer instances like the PCI
function, sysfs entry, and devlink are kept separate.
Passing traffic through different devices belonging to different NUMA sockets saves cross-NUMA
traffic and allows apps running on the same netdev from different NUMA nodes to stay close to the
device and achieve improved performance.

mlx5 implementation
===================

Multi-PF or Socket-direct in mlx5 is achieved by grouping together PFs that belong to the same
NIC and have the socket-direct property enabled. Once all PFs are probed, we create a single netdev
to represent all of them. Symmetrically, we destroy the netdev whenever any of the PFs is removed.

The netdev network channels are distributed between all devices. A proper configuration would
utilize the correct close NUMA node when working on a certain app/CPU.

We pick one PF to be a primary (leader), and it fills a special role. The other devices
(secondaries) are disconnected from the network at the chip level (set to silent mode). In silent
mode, no south <-> north traffic flows directly through a secondary PF. It needs the assistance of
the leader PF (east <-> west traffic) to function. All Rx/Tx traffic is steered through the primary
to/from the secondaries.

Currently, we limit the support to PFs only, and up to two PFs (sockets).

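The netdev lifecycle rule in this section (create the netdev once all PFs of the group are probed,
destroy it when any PF is removed) can be modelled with a minimal Python sketch. This is only a toy
model of the rule and not the mlx5 driver code; the class and method names (``SdGroup``, ``probe``,
``remove``) are hypothetical and chosen for illustration.

.. code-block:: python

   class SdGroup:
       """Toy model of a Multi-PF group; not the mlx5 data structures."""

       MAX_PFS = 2  # the text limits support to two PFs

       def __init__(self, expected_pfs):
           if not 1 <= len(expected_pfs) <= self.MAX_PFS:
               raise ValueError("unsupported number of PFs")
           self.expected = set(expected_pfs)
           self.probed = set()
           self.netdev_up = False

       def probe(self, pf):
           if pf not in self.expected:
               raise ValueError("PF does not belong to this group")
           self.probed.add(pf)
           # The shared netdev appears only once every PF is probed.
           if self.probed == self.expected:
               self.netdev_up = True

       def remove(self, pf):
           self.probed.discard(pf)
           # Removing any single PF tears the shared netdev down.
           self.netdev_up = False


   group = SdGroup(["0000:08:00.0", "0000:09:00.0"])
   group.probe("0000:08:00.0")
   print(group.netdev_up)   # False: one PF is still missing
   group.probe("0000:09:00.0")
   print(group.netdev_up)   # True: all PFs probed
   group.remove("0000:09:00.0")
   print(group.netdev_up)   # False: a PF was removed

The PCI addresses are reused from the Observability example purely as labels.
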
Channels distribution
=====================

We distribute the channels between the different PFs to achieve local NUMA node performance
on multiple NUMA nodes.

Each combined channel works against one specific PF, creating all its datapath queues against it. We
distribute channels to PFs in a round-robin policy.

::

        Example for 2 PFs and 5 channels:
        +--------+--------+
        | ch idx | PF idx |
        +--------+--------+
        |    0   |    0   |
        |    1   |    1   |
        |    2   |    0   |
        |    3   |    1   |
        |    4   |    0   |
        +--------+--------+


We prefer round-robin because it is less influenced by changes in the number of channels. The
mapping between a channel index and a PF is fixed, no matter how many channels the user configures.
Because channel stats persist across a channel's closure, changing the mapping on every
reconfiguration would make the accumulated stats less representative of the channel's history.

This is achieved by using the correct core device instance (mdev) in each channel, instead of them
all using the same instance under "priv->mdev".

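The round-robin policy and its stability under channel-count changes can be sketched in a few lines
of Python. The function names are hypothetical, and ``block_distribute`` is a made-up contiguous
split included only for contrast; neither is taken from the driver.

.. code-block:: python

   def distribute(num_channels, num_pfs):
       """Round-robin: channel i is served by PF (i mod num_pfs)."""
       return [ch % num_pfs for ch in range(num_channels)]


   def block_distribute(num_channels, num_pfs):
       """Contrast case: contiguous blocks of channels per PF."""
       per_pf = -(-num_channels // num_pfs)  # ceiling division
       return [ch // per_pf for ch in range(num_channels)]


   print(distribute(5, 2))  # [0, 1, 0, 1, 0], as in the table above

   # Growing from 5 to 8 channels keeps the first 5 assignments intact.
   assert distribute(8, 2)[:5] == distribute(5, 2)

   # With a block split, channel 3 moves from PF 1 to PF 0.
   assert block_distribute(5, 2)[3] == 1
   assert block_distribute(8, 2)[3] == 0

With the block split, the accumulated stats of channel 3 would mix traffic handled by two different
PFs after the reconfiguration, which is the effect the fixed round-robin mapping avoids.
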
Observability
=============

The relation between PF, irq, napi, and queue can be observed via netlink spec::

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump queue-get --json='{"ifindex": 13}'
  [{'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'rx'},
   {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'rx'},
   {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'rx'},
   {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'rx'},
   {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'rx'},
   {'id': 0, 'ifindex': 13, 'napi-id': 539, 'type': 'tx'},
   {'id': 1, 'ifindex': 13, 'napi-id': 540, 'type': 'tx'},
   {'id': 2, 'ifindex': 13, 'napi-id': 541, 'type': 'tx'},
   {'id': 3, 'ifindex': 13, 'napi-id': 542, 'type': 'tx'},
   {'id': 4, 'ifindex': 13, 'napi-id': 543, 'type': 'tx'}]

  $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump napi-get --json='{"ifindex": 13}'
  [{'id': 543, 'ifindex': 13, 'irq': 42},
   {'id': 542, 'ifindex': 13, 'irq': 41},
   {'id': 541, 'ifindex': 13, 'irq': 40},
   {'id': 540, 'ifindex': 13, 'irq': 39},
   {'id': 539, 'ifindex': 13, 'irq': 36}]

Here you can clearly observe our channels distribution policy::

  $ ls /proc/irq/{36,39,40,41,42}/mlx5* -d -1
  /proc/irq/36/mlx5_comp1@pci:0000:08:00.0
  /proc/irq/39/mlx5_comp1@pci:0000:09:00.0
  /proc/irq/40/mlx5_comp2@pci:0000:08:00.0
  /proc/irq/41/mlx5_comp2@pci:0000:09:00.0
  /proc/irq/42/mlx5_comp3@pci:0000:08:00.0

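The three dumps above can be joined by hand: queue to NAPI id, NAPI id to IRQ, IRQ to the PCI
function named in the IRQ directory entry. A minimal Python sketch of that join follows. It hardcodes
the sample values shown above rather than querying a live system, and the helper name
``queue_to_pci`` is hypothetical.

.. code-block:: python

   import re

   # Sample data mirroring the queue-get, napi-get and /proc/irq output above.
   queues = [{"id": i, "ifindex": 13, "napi-id": 539 + i, "type": t}
             for t in ("rx", "tx") for i in range(5)]
   napis = [{"id": 543, "ifindex": 13, "irq": 42},
            {"id": 542, "ifindex": 13, "irq": 41},
            {"id": 541, "ifindex": 13, "irq": 40},
            {"id": 540, "ifindex": 13, "irq": 39},
            {"id": 539, "ifindex": 13, "irq": 36}]
   irq_names = {36: "mlx5_comp1@pci:0000:08:00.0",
                39: "mlx5_comp1@pci:0000:09:00.0",
                40: "mlx5_comp2@pci:0000:08:00.0",
                41: "mlx5_comp2@pci:0000:09:00.0",
                42: "mlx5_comp3@pci:0000:08:00.0"}


   def queue_to_pci(queues, napis, irq_names, qtype="rx"):
       """Map each queue id of the given type to the PCI address of its IRQ."""
       napi_irq = {n["id"]: n["irq"] for n in napis}
       mapping = {}
       for q in queues:
           if q["type"] != qtype:
               continue
           irq = napi_irq[q["napi-id"]]
           match = re.search(r"@pci:(\S+)$", irq_names[irq])
           mapping[q["id"]] = match.group(1)
       return mapping


   for qid, pci in sorted(queue_to_pci(queues, napis, irq_names).items()):
       print(qid, pci)

For the sample data, even queue ids resolve to ``0000:08:00.0`` and odd ones to ``0000:09:00.0``,
which is the round-robin pattern described in `Channels distribution`_.
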
Steering
========

Secondary PFs are set to "silent" mode, meaning they are disconnected from the network.

In Rx, the steering tables belong to the primary PF only, and it is its role to distribute incoming
traffic to other PFs, via cross-vhca steering capabilities. We still maintain a single default RSS
table, which is capable of pointing to the receive queues of a different PF.

In Tx, the primary PF creates a new Tx flow table, which is aliased by the secondaries, so they can
go out to the network through it.

In addition, we set a default XPS configuration that, based on the CPU, selects an SQ belonging to
the PF on the same node as the CPU.

XPS default config example:

NUMA node(s):          2
NUMA node0 CPU(s):     0-11
NUMA node1 CPU(s):     12-23

PF0 on node0, PF1 on node1.

- /sys/class/net/eth2/queues/tx-0/xps_cpus:000001
- /sys/class/net/eth2/queues/tx-1/xps_cpus:001000
- /sys/class/net/eth2/queues/tx-2/xps_cpus:000002
- /sys/class/net/eth2/queues/tx-3/xps_cpus:002000
- /sys/class/net/eth2/queues/tx-4/xps_cpus:000004
- /sys/class/net/eth2/queues/tx-5/xps_cpus:004000
- /sys/class/net/eth2/queues/tx-6/xps_cpus:000008
- /sys/class/net/eth2/queues/tx-7/xps_cpus:008000
- /sys/class/net/eth2/queues/tx-8/xps_cpus:000010
- /sys/class/net/eth2/queues/tx-9/xps_cpus:010000
- /sys/class/net/eth2/queues/tx-10/xps_cpus:000020
- /sys/class/net/eth2/queues/tx-11/xps_cpus:020000
- /sys/class/net/eth2/queues/tx-12/xps_cpus:000040
- /sys/class/net/eth2/queues/tx-13/xps_cpus:040000
- /sys/class/net/eth2/queues/tx-14/xps_cpus:000080
- /sys/class/net/eth2/queues/tx-15/xps_cpus:080000
- /sys/class/net/eth2/queues/tx-16/xps_cpus:000100
- /sys/class/net/eth2/queues/tx-17/xps_cpus:100000
- /sys/class/net/eth2/queues/tx-18/xps_cpus:000200
- /sys/class/net/eth2/queues/tx-19/xps_cpus:200000
- /sys/class/net/eth2/queues/tx-20/xps_cpus:000400
- /sys/class/net/eth2/queues/tx-21/xps_cpus:400000
- /sys/class/net/eth2/queues/tx-22/xps_cpus:000800
- /sys/class/net/eth2/queues/tx-23/xps_cpus:800000

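The masks in the example above follow a regular pattern: Tx queue ``q`` belongs to PF ``q % 2``
(round-robin), and its mask has a single bit set for one CPU on that PF's node. The Python sketch
below reproduces the listed values under those assumptions. It is a reconstruction of the example
only, not the driver's selection logic; the function name ``default_xps_masks`` is hypothetical, and
the fixed-width hex formatting is chosen to match the 24-CPU listing.

.. code-block:: python

   def default_xps_masks(node_cpus, num_queues):
       """Return one hex CPU mask per Tx queue.

       node_cpus[i] lists the CPUs on the NUMA node local to PF i.
       """
       num_pfs = len(node_cpus)
       total_cpus = sum(len(cpus) for cpus in node_cpus)
       width = (total_cpus + 3) // 4  # hex digits needed for all CPUs
       masks = []
       for q in range(num_queues):
           pf = q % num_pfs                      # round-robin queue -> PF
           local = node_cpus[pf]
           cpu = local[(q // num_pfs) % len(local)]
           masks.append(format(1 << cpu, "0{}x".format(width)))
       return masks


   # node0 has CPUs 0-11 (PF0), node1 has CPUs 12-23 (PF1), 24 Tx queues.
   masks = default_xps_masks([list(range(0, 12)), list(range(12, 24))], 24)
   for q, mask in enumerate(masks):
       print("tx-{}: {}".format(q, mask))

Each mask selects exactly one CPU, so a thread running on a given CPU transmits through an SQ of the
PF that sits on its own NUMA node.
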
Mutually exclusive features
===========================

The nature of Multi-PF, where different channels work with different PFs, conflicts with
stateful features where the state is maintained in one of the PFs.
For example, in the TLS device-offload feature, special context objects are created per connection
and maintained in the PF.  Transitioning between different RQs/SQs would break the feature. Hence,
we disable this combination for now.