.. SPDX-License-Identifier: GPL-2.0

A ``devlink-port`` is a port that exists on the device. It is a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour, along with port attributes,
describes what a port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of a PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of a PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of a PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - This indicates a virtual port for the PCI virtual function.

A devlink port can have a different type depending on its link layer, as
described below.

.. list-table:: List of devlink port types

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - Driver should set this port type when the link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - Driver should set this port type when the link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when the driver should detect the
       port type automatically.

In most cases a PCI device has only one controller. A controller consists of
potentially multiple physical functions, virtual functions and subfunctions.
A function consists of one or more ports. Each such port is represented by a
devlink eswitch port.

A PCI device connected to multiple CPUs or multiple PCI root complexes, or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
An eswitch on the PCI device supports ports of multiple controllers.

An example view of a system with two controllers::

                 ---------------------------------------------------------
                 |                                                       |
                 |           --------- ---------         ------- ------- |
    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
    | connect |  | -------                       -------                 |
    -----------  |     | controller_num=1 (no eswitch)                   |
                 ------|--------------------------------------------------
                       |
                       |
                 ---------------------------------------------------------
                 | devlink eswitch ports and reps                        |
                 | ----------------------------------------------------- |
                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 |                                                       |
                 |                                                       |
    -----------  |           --------- ---------         ------- ------- |
    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
    -----------  | -------                       -------                 |
                 |                                                       |
                 |  local controller_num=0 (eswitch)                     |
                 ---------------------------------------------------------

In the above example, the external controller (identified by controller
number = 1) doesn't have the eswitch. The local controller (identified by
controller number = 0) has the eswitch. The devlink instance on the local
controller has eswitch devlink ports for both controllers.

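As a rough illustration of how the controller number distinguishes ports, the
sketch below extracts it from one line of ``devlink port show`` style output.
The helper name ``port_controller`` and both sample lines are hypothetical, and
treating a missing ``controller`` token as controller 0 is an assumption made
for this sketch, not documented behaviour.

```shell
# port_controller LINE: print the controller number found in one line of
# port output; print 0 when the line has no "controller" token (assumption).
port_controller() {
    printf '%s\n' "$1" | awk '{
        c = 0
        for (i = 1; i < NF; i++)
            if ($i == "controller")
                c = $(i + 1)
        print c
    }'
}

# Hypothetical sample lines, one per controller.
local_port='pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1'
ext_port='pci/0000:06:00.0/3: type eth netdev enp6s0c1pf0vf1 flavour pcivf controller 1 pfnum 0 vfnum 1'

port_controller "$local_port"
port_controller "$ext_port"
```
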
Function configuration
======================

Users can configure one or more function attributes before enumerating the PCI
function. Usually this means that the user should configure a function
attribute before a bus-specific device for the function is created. However,
when SRIOV is enabled, virtual function devices are created on the PCI bus.
Hence, a function attribute should be configured before binding the virtual
function device to the driver. For subfunctions, this means the user should
configure the port function attribute before activating the port function.

A user may set the hardware address of the function using the
``devlink port function set hw_addr`` command. For an Ethernet port function
this means a MAC address.

Users may also set the RoCE capability of the function using the
``devlink port function set roce`` command.

Users may also set the function as migratable using the
``devlink port function set migratable`` command.

Users may also set the IPsec crypto capability of the function using the
``devlink port function set ipsec_crypto`` command.

Users may also set the IPsec packet capability of the function using the
``devlink port function set ipsec_packet`` command.

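The four capability attributes above share the same command shape, so a small
wrapper can validate its arguments before calling devlink. This is only a
sketch: the function name ``set_port_fn_cap`` and the ``DEVLINK`` dry-run
variable are hypothetical conveniences, and by default the command is printed
rather than executed.

```shell
# Dry run by default: the command is echoed, not executed.
# Set DEVLINK=devlink to run it for real (needs root and driver support).
DEVLINK="${DEVLINK:-echo devlink}"

# set_port_fn_cap PORT CAP STATE
#   PORT   devlink port handle, e.g. pci/0000:06:00.0/2
#   CAP    roce, migratable, ipsec_crypto or ipsec_packet
#   STATE  enable or disable
set_port_fn_cap() {
    port=$1 cap=$2 state=$3
    case $cap in
        roce|migratable|ipsec_crypto|ipsec_packet) ;;
        *) echo "unknown capability: $cap" >&2; return 1 ;;
    esac
    case $state in
        enable|disable) ;;
        *) echo "state must be enable or disable: $state" >&2; return 1 ;;
    esac
    $DEVLINK port function set "$port" "$cap" "$state"
}

set_port_fn_cap pci/0000:06:00.0/2 roce disable
```
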
MAC address setup
-----------------
The configured MAC address of the PCI VF/SF will be used by the netdevice and
rdma device created for the PCI VF/SF.

- Get the MAC address of the VF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00

- Set the MAC address of the VF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:11:22:33:44:55

- Get the MAC address of the SF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
        function:
            hw_addr 00:00:00:00:00:00

- Set the MAC address of the SF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
        function:
            hw_addr 00:00:00:00:88:88

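When the address comes from a script or an inventory file, it can be worth
checking its format before handing it to devlink. The helpers below are a
minimal sketch; ``is_mac``, ``set_port_fn_mac`` and the ``DEVLINK`` dry-run
variable are hypothetical names, and the check only covers the six-octet,
colon-separated form used in the examples above.

```shell
# Dry run by default: the command is echoed, not executed.
DEVLINK="${DEVLINK:-echo devlink}"

# is_mac ADDR: succeed if ADDR is six colon-separated hex octets.
is_mac() {
    printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# set_port_fn_mac PORT ADDR: validate ADDR, then set it as hw_addr of PORT.
set_port_fn_mac() {
    port=$1 mac=$2
    if ! is_mac "$mac"; then
        echo "invalid MAC address: $mac" >&2
        return 1
    fi
    $DEVLINK port function set "$port" hw_addr "$mac"
}

set_port_fn_mac pci/0000:06:00.0/32768 00:00:00:00:88:88
```
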
RoCE capability setup
---------------------
Not all PCI VFs/SFs require RoCE capability.

Disabling RoCE capability saves system memory per PCI VF/SF.

When the user disables RoCE capability for a VF/SF, user applications cannot
send or receive any RoCE packets through this VF/SF, and the RoCE GID table for
this PCI VF/SF will be empty.

When RoCE capability is disabled in the device using the port function
attribute, the VF/SF driver cannot override it.

- Get RoCE capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce enable

- Set RoCE capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 roce disable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 roce disable

migratable capability setup
---------------------------
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.

Users who want PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.

When the user enables migratable capability for a VF, and the hypervisor (HV)
binds the VF to a VFIO driver with migration support, the user can migrate the
VM with this VF from one HV to a different one.

However, when migratable capability is enabled, the device will disable
features which cannot be migrated. The migratable capability can therefore
impose limitations on a VF, so the decision is left to the user.

Example of live migration (LM) with migratable function configuration:

- Get migratable capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable disable

- Set migratable capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 migratable enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 migratable enable

- Bind VF to VFIO driver with migration support::

    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
    $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind

Perform live migration.

IPsec crypto capability setup
-----------------------------
When the user enables IPsec crypto capability for a VF, user applications can
offload XFRM state crypto operations (Encrypt/Decrypt) to this VF.

When IPsec crypto capability is disabled (default) for a VF, the XFRM state is
processed in software by the kernel.

- Get IPsec crypto capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_crypto disabled

- Set IPsec crypto capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_crypto enabled

IPsec packet capability setup
-----------------------------
When the user enables IPsec packet capability for a VF, user applications can
offload XFRM state and policy crypto operations (Encrypt/Decrypt) to this VF,
as well as IPsec encapsulation.

When IPsec packet capability is disabled (default) for a VF, the XFRM state and
policy are processed in software by the kernel.

- Get IPsec packet capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet disabled

- Set IPsec packet capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
        function:
            hw_addr 00:00:00:00:00:00 ipsec_packet enabled

Subfunction
===========

A subfunction is a lightweight function that has a parent PCI function on which
it is deployed. A subfunction is created and deployed in units of one. Unlike
SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.

To use a subfunction, a three-step setup sequence is followed:

1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction;

Subfunction management is done using the devlink port user interface.
The user performs setup on the subfunction management device.

Create
------
A subfunction is created using the devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to the subfunction management driver (devlink ops) and
asks it to create a subfunction devlink port. The driver then instantiates the
subfunction port and any associated objects such as health reporters and the
representor netdevice.

Configure
---------
A subfunction devlink port is created but it is not active yet. That means the
entities are created on the devlink side and the e-switch port representor is
created, but the subfunction device itself is not created. A user might use the
e-switch port representor to do settings, put it into a bridge, add TC rules,
etc. A user might also configure the hardware address (such as the MAC
address) of the subfunction while the subfunction is inactive.

Activate
--------
Once a subfunction is configured, the user must activate it to use it. Upon
activation, the subfunction management driver asks the subfunction management
device to instantiate the subfunction device on a particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.

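The create, configure and activate steps can be sketched with the devlink
command-line tool as follows. The PCI address, ``sfnum`` 88 and port index
32768 are reused from the examples earlier in this document; on a real system
the port index is reported by the driver when the port is added, so the fixed
``PORT`` value here is an assumption. The ``sf_*`` helper names and the
``DEVLINK`` dry-run variable are hypothetical.

```shell
# Dry run by default: commands are echoed, not executed.
# Set DEVLINK=devlink to run them for real (needs root and driver support).
DEVLINK="${DEVLINK:-echo devlink}"

DEV=pci/0000:06:00.0   # subfunction management device (example address)
SFNUM=88               # user-chosen subfunction number
PORT=$DEV/32768        # assumed port index; normally reported after creation

# 1) create: add a devlink port of subfunction flavour
sf_create()    { $DEVLINK port add "$DEV" flavour pcisf pfnum 0 sfnum "$SFNUM"; }
# 2) configure: set attributes while the port function is still inactive
sf_configure() { $DEVLINK port function set "$PORT" hw_addr "$1"; }
# 3) deploy: activate the port function
sf_activate()  { $DEVLINK port function set "$PORT" state active; }

sf_create
sf_configure 00:00:00:00:88:88
sf_activate
```

Keeping the hardware address step before activation follows the ordering rule
stated under function configuration: attributes are set while the subfunction
is still inactive.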
Rate object management
======================

Devlink provides an API to manage TX rates of a single devlink port or a group.
This is done through rate objects, which can be one of two types:

``leaf``
  Represents a single devlink port; created/destroyed by the driver. Since a
  leaf has a 1:1 mapping to its devlink port, in userspace it is referred to as
  ``pci/<bus_addr>/<port_index>``;

``node``
  Represents a group of rate objects (leafs and/or nodes); created/deleted by
  request from userspace; initially empty (no rate objects added). In
  userspace it is referred to as ``pci/<bus_addr>/<node_name>``, where
  ``node_name`` can be any identifier except a decimal number, to avoid
  collisions with leafs.

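The naming rule above means the last path component alone tells a leaf from a
node. A minimal sketch of that check follows; the function name
``rate_object_kind`` is hypothetical and ``group1`` is an arbitrary example
node name.

```shell
# rate_object_kind HANDLE: print "leaf" or "node" for pci/<bus_addr>/<id>.
rate_object_kind() {
    id=${1##*/}                     # text after the last '/'
    case $id in
        '')       echo "invalid"; return 1 ;;
        *[!0-9]*) echo "node" ;;    # contains a non-digit: node name
        *)        echo "leaf" ;;    # decimal number: port index
    esac
}

rate_object_kind pci/0000:06:00.0/1
rate_object_kind pci/0000:06:00.0/group1
```
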
The API allows configuring the following rate object parameters:

``tx_share``
  Minimum TX rate value shared among all other rate objects, or among the rate
  objects that are part of the parent group, if the object belongs to a group.

``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows for usage of a strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority, the higher the probability that the node will get selected for
  arbitration.

``tx_weight``
  Allows for usage of the Weighted Fair Queuing (WFQ) arbitration scheme among
  siblings. This arbitration scheme can be used simultaneously with the
  strict priority. As a node is configured with a higher rate it gets more
  BW relative to its siblings. Values are relative, like percentage
  points; they tell how much BW a node should take relative to
  its siblings.

``parent``
  Parent node name. Parent node rate limits are considered as additional limits
  to all node children limits. ``tx_max`` is an upper limit for children.
  ``tx_share`` is the total bandwidth distributed among children.

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

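As an illustration, the sketch below creates one node, caps it, and attaches
two leafs that share a priority but differ in weight. It assumes an iproute2
``devlink`` build whose ``port function rate`` subcommand accepts the
``tx_priority`` and ``tx_weight`` keywords, and a driver that implements them;
the node name ``group1``, the rates and the ``rate_*`` helper names are
hypothetical.

```shell
# Dry run by default: commands are echoed, not executed.
DEVLINK="${DEVLINK:-echo devlink}"
DEV=pci/0000:06:00.0

# rate_node_add NODE TX_MAX: create an empty node and cap the whole group.
rate_node_add() {
    $DEVLINK port function rate add "$DEV/$1"
    $DEVLINK port function rate set "$DEV/$1" tx_max "$2"
}

# rate_leaf_join PORT_INDEX NODE PRIORITY WEIGHT: attach a leaf to a node.
rate_leaf_join() {
    $DEVLINK port function rate set "$DEV/$1" \
        tx_priority "$3" tx_weight "$4" parent "$2"
}

rate_node_add group1 100mbit
rate_leaf_join 1 group1 1 30
rate_leaf_join 2 group1 1 70
```

With equal priorities the two leafs form a single WFQ subgroup, so only their
weights decide how the node's bandwidth is divided between them.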
Arbitration flow from a high level:

#. Choose a node, or group of nodes, with the highest priority that stays
   within the BW limit and is not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.

#. If a group of nodes has the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.

#. Select the winner node, and continue the arbitration flow among its
   children until a leaf node is reached and the winner is established.

#. If all the nodes from the highest priority sub-group are satisfied, or
   have overused their assigned BW, move to the lower priority nodes.

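The first two steps of this flow can be modelled for a single sibling group as
follows. This is a simplified sketch, not how any driver or device implements
arbitration: it assumes a larger ``tx_priority`` number means a higher
priority, that every input line describes a sibling that is within its BW
limit and not blocked, and it reports static percentage shares instead of
scheduling packets. The function name ``pick_winners`` and the sample names
are hypothetical.

```shell
# pick_winners: read "NAME PRIORITY WEIGHT" lines for eligible siblings and
# print "NAME PERCENT" for each member of the highest-priority subgroup.
pick_winners() {
    awk '
        {
            name[NR] = $1; prio[NR] = $2 + 0; weight[NR] = $3 + 0
            if (NR == 1 || prio[NR] > max)
                max = prio[NR]          # step 1: highest priority wins
        }
        END {
            for (i = 1; i <= NR; i++)
                if (prio[i] == max)
                    total += weight[i]
            if (total > 0)              # step 2: split by weight (WFQ)
                for (i = 1; i <= NR; i++)
                    if (prio[i] == max)
                        printf "%s %d\n", name[i], int(weight[i] * 100 / total)
        }'
}

printf '%s\n' 'vf0 2 10' 'vf1 2 30' 'sf0 1 50' | pick_winners
```

In this sample ``sf0`` is left out because both ``vf0`` and ``vf1`` have a
higher priority; it would only be considered once they are satisfied, as in
the last step of the flow.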
Driver implementations are allowed to support either or both rate object types
and setting methods of their parameters. Additionally, a driver implementation
may export nodes/leafs and their child-parent relationships.

Terms and Definitions
=====================

.. list-table:: Terms and Definitions

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device having one or more PCI buses consists of one or
       more PCI controllers.
   * - ``PCI controller``
     - A controller consists of potentially multiple physical functions,
       virtual functions and subfunctions.
   * - ``Port function``
     - An object to manage the function of a port.
   * - ``Subfunction``
     - A lightweight function that has a parent PCI function on which it is
       deployed.
   * - ``Subfunction device``
     - A bus device of the subfunction, usually on an auxiliary bus.
   * - ``Subfunction driver``
     - A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     - A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     - A device driver for a PCI physical function that supports
       subfunction management using the devlink port interface.
   * - ``Subfunction host driver``
     - A device driver for a PCI physical function that hosts subfunction
       devices. In most cases it is the same as the subfunction management
       driver. When a subfunction is used on an external controller, the
       subfunction management and host drivers are different.