.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism is intended for real-time alerting, so that
the user knows when something bad has happened to a PCI device. It aims to:

  * Provide alert debug information.
  * Enable self-healing.
  * Provide a way to gather all needed debugging information when a problem
    needs vendor support.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, fw error, rx/tx
error) or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter", e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures
  * Out-of-box (OOB) initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.
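
The relationship between a driver, its reporters, and the optional callbacks
can be sketched with a small user-space model. This is not kernel code and not
the ``devlink`` API: ``ReporterOps``, ``Reporter``, ``DevlinkModel``, the
reporter names ``tx`` and ``fw``, and the parameter values are all hypothetical
and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ReporterOps:
    """Optional per-reporter callbacks supplied by the driver (model only)."""
    name: str
    recover: Optional[Callable[[], bool]] = None    # recovery procedure
    diagnose: Optional[Callable[[], dict]] = None   # diagnostics procedure
    dump: Optional[Callable[[], dict]] = None       # object dump procedure


@dataclass
class Reporter:
    ops: ReporterOps
    # Out-of-box initial parameters chosen by the driver (illustrative values).
    grace_period_ms: int = 0
    auto_recover: bool = True


class DevlinkModel:
    """Stand-in for a devlink instance that owns all reporters of a device."""

    def __init__(self) -> None:
        self.reporters: Dict[str, Reporter] = {}

    def reporter_create(self, ops: ReporterOps, **initial_params) -> Reporter:
        if ops.name in self.reporters:
            raise ValueError(f"reporter {ops.name!r} already exists")
        reporter = Reporter(ops=ops, **initial_params)
        self.reporters[ops.name] = reporter
        return reporter


devlink = DevlinkModel()
# Two parts of the same (hypothetical) driver register different reporter types.
devlink.reporter_create(
    ReporterOps(name="tx", recover=lambda: True, diagnose=lambda: {"sq": "ok"}),
    grace_period_ms=500,
)
devlink.reporter_create(
    ReporterOps(name="fw", dump=lambda: {"fw_trace": "..."}),
    auto_recover=False,
)
```

In this model each reporter carries only the callbacks that make sense for its
error type, which mirrors the point that different parts of a driver can
register reporters with different handlers.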

Actions
=======

Once an error is reported, devlink health will perform the following actions:

  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored)
  * An auto-recovery attempt is made, depending on:

    - Auto-recovery configuration
    - Grace period vs. time passed since last recovery
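
The sequence of actions above can be sketched as a user-space model. It is a
simplification that follows only the text of this section; the kernel
implementation may apply additional conditions. All names (``Reporter``,
``health_report``, the field names) and the millisecond timestamps are
assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Reporter:
    recover: Callable[[], bool]          # driver recovery callback
    dump: Callable[[], dict]             # driver object dump callback
    auto_recover: bool = True
    grace_period_ms: int = 0
    healthy: bool = True
    error_count: int = 0
    recover_count: int = 0
    last_recovery_ms: Optional[int] = None
    stored_dump: Optional[dict] = None
    trace: List[str] = field(default_factory=list)  # stands in for trace events


def health_report(r: Reporter, msg: str, now_ms: int) -> bool:
    """Handle one error report; return True if auto-recovery ran and succeeded."""
    r.trace.append(msg)                   # 1. log the event
    r.healthy = False                     # 2. update status and statistics
    r.error_count += 1
    if r.stored_dump is None:             # 3. take a dump only if none is stored
        r.stored_dump = r.dump()
    if not r.auto_recover:                # 4a. auto-recovery configuration
        return False
    in_grace = (r.last_recovery_ms is not None
                and now_ms - r.last_recovery_ms < r.grace_period_ms)
    if in_grace:                          # 4b. grace period has not elapsed yet
        return False
    ok = r.recover()
    if ok:
        r.last_recovery_ms = now_ms
        r.recover_count += 1
        r.healthy = True
    return ok


rep = Reporter(recover=lambda: True, dump=lambda: {"regs": [1, 2, 3]},
               grace_period_ms=500)
first = health_report(rep, "tx timeout", now_ms=1000)    # no prior recovery
second = health_report(rep, "tx timeout", now_ms=1200)   # inside grace period
third = health_report(rep, "tx timeout", now_ms=1600)    # grace period elapsed
```

In this sketch the second report is logged and counted but does not trigger
recovery, because only 200 ms have passed since the last successful recovery
and the grace period is 500 ms.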

User Interface
==============

The user can access/change each reporter's parameters and driver-specific
callbacks via ``devlink``, e.g. per error type (per health reporter):

  * Configure the reporter's generic parameters (like: disable/enable auto
    recovery)
  * Invoke the recovery procedure
  * Run diagnostics
  * Get an object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
     - Triggers a fake health event on the reporter. The effects of the test
       event in terms of recovery flow should follow closely that of a real
       event.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves current device state related to the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.
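
The single-dump behaviour of ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET`` and
``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` described in the table can be
sketched as a one-slot store. This is a user-space model, not the netlink
interface; ``DumpStore``, ``make_dump`` and the dump contents are hypothetical,
and the model assumes that a dump generated on demand is also stored.

```python
from typing import Callable, List, Optional


class DumpStore:
    """Single-slot dump storage for one reporter (model only)."""

    def __init__(self, generate: Callable[[], dict]) -> None:
        self._generate = generate          # reporter-defined dump callback
        self._stored: Optional[dict] = None

    def dump_get(self) -> dict:
        # Return the stored dump, or generate and store a new one if absent.
        if self._stored is None:
            self._stored = self._generate()
        return self._stored

    def dump_clear(self) -> None:
        # Drop the stored dump so that the next dump_get() generates a new one.
        self._stored = None


calls: List[int] = []


def make_dump() -> dict:
    # Hypothetical reporter callback; the sequence number shows regeneration.
    calls.append(1)
    return {"seq": len(calls)}


store = DumpStore(make_dump)
a = store.dump_get()   # nothing stored yet: a dump is generated
b = store.dump_get()   # the stored dump is returned, nothing is regenerated
store.dump_clear()
c = store.dump_get()   # the slot was cleared: a new dump is generated
```

Keeping a single slot means the first dump after a failure is preserved until
the user explicitly clears it.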

The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
      driver                               devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+
118     |        +---------------------------->                          |
119     +--------+                            +--------------------------+