Error Detection And Correction (EDAC) Devices
=============================================

Main Concepts Used in the EDAC Subsystem
----------------------------------------

There are several things to be aware of that aren't at all obvious, like
*sockets*, *socket sets*, *banks*, *rows*, *chip-select rows*, *channels*,
etc.

These are some of the many terms that are thrown about that don't always
mean what people think they mean (Inconceivable!).  In the interest of
creating a common ground for discussion, terms and their definitions
will be established.

* Memory devices

The individual DRAM chips on a memory stick.  These devices commonly
output 4 or 8 bits each (x4, x8). Grouping several of these in parallel
provides the number of bits that the memory controller expects:
typically 72 bits, in order to provide 64 bits of data + 8 bits of ECC.

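As a quick check of the arithmetic above, the following sketch computes how
many devices must be grouped in parallel to fill the bus. The function name
``devices_in_parallel`` is made up for illustration and is not part of any
EDAC API.

.. code-block:: python

    def devices_in_parallel(bus_width_bits: int, device_width_bits: int) -> int:
        """Return how many DRAM devices are needed to fill the memory bus."""
        if device_width_bits <= 0 or bus_width_bits % device_width_bits:
            raise ValueError("bus width must be a multiple of the device width")
        return bus_width_bits // device_width_bits


    # A 72-bit bus (64 data bits + 8 ECC bits) built from x4 or x8 devices.
    for width in (4, 8):
        print(f"x{width}: {devices_in_parallel(72, width)} devices")

With a 72-bit bus this gives 18 x4 devices or 9 x8 devices.
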
* Memory Stick

A printed circuit board that aggregates multiple memory devices in
parallel.  In general, this is the Field Replaceable Unit (FRU) which
gets replaced in the case of excessive errors. Most often it is also
called a DIMM (Dual Inline Memory Module).

* Memory Socket

A physical connector on the motherboard that accepts a single memory
stick. Also called a "slot" in several datasheets.

* Channel

A memory controller channel, responsible for communicating with a group of
DIMMs. Each channel has its own independent control (command) and data
bus, and can be used independently or grouped with other channels.

* Branch

It is typically the highest level of the hierarchy on a Fully-Buffered DIMM
memory controller. Typically, it contains two channels. Two channels on the
same branch can be used in single mode or in lockstep mode. When
lockstep is enabled, the cacheline is doubled, but it generally brings
some performance penalty. Also, it is generally not possible to point to
just one memory stick when an error occurs, as the error correction code
is calculated using two DIMMs instead of one. For the same reason, lockstep
mode is capable of correcting more errors than single mode.

* Single-channel

The data accessed by the memory controller is contained in one DIMM
only. E.g. if the data is 64 bits wide, the data flows to the CPU using
one 64-bit parallel access. Typically used with SDR, DDR, DDR2 and DDR3
memories. FB-DIMM and RAMBUS use a different concept for channel, so
this concept doesn't apply there.

* Double-channel

The data accessed by the memory controller is interleaved across two
DIMMs, accessed at the same time. E.g. if the DIMM is 64 bits wide (72
bits with ECC), the data flows to the CPU using a 128-bit parallel
access.

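The access widths in the single-channel and double-channel definitions
follow directly from how many DIMMs are accessed at once. A minimal sketch,
assuming 64 data bits plus 8 ECC bits per DIMM as described above;
``access_width`` is a hypothetical helper, and the 144-bit figure is simply
two 72-bit DIMMs added together, not a number taken from a datasheet.

.. code-block:: python

    def access_width(dimms_in_parallel: int, data_bits: int = 64, ecc_bits: int = 8):
        """Return (data width, total width including ECC) of one parallel access."""
        if dimms_in_parallel < 1:
            raise ValueError("at least one DIMM is needed")
        return (dimms_in_parallel * data_bits,
                dimms_in_parallel * (data_bits + ecc_bits))


    print(access_width(1))  # single-channel: (64, 72)
    print(access_width(2))  # double-channel: (128, 144)
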
* Chip-select row

This is the name of the DRAM signal used to select the DRAM ranks to be
accessed. Common chip-select rows for single channel are 64 bits, for
dual channel 128 bits. It may not be visible to the memory controller,
as some DIMM types have a memory buffer that can hide direct access to
it from the memory controller.

* Single-Ranked stick

A single-ranked stick has 1 chip-select row of memory. Motherboards
commonly drive two chip-select pins to a memory stick. A single-ranked
stick will occupy only one of those rows. The other will be unused.

.. _doubleranked:

* Double-Ranked stick

A double-ranked stick has two chip-select rows which access different
sets of memory devices.  The two rows cannot be accessed concurrently.

* Double-sided stick

**DEPRECATED TERM**, see :ref:`Double-Ranked stick <doubleranked>`.

A double-sided stick has two chip-select rows which access different sets
of memory devices. The two rows cannot be accessed concurrently.
"Double-sided" is irrespective of the memory devices being mounted on
both sides of the memory stick.

* Socket set

All of the memory sticks that are required for a single memory access or
all of the memory sticks spanned by a chip-select row.  A single socket
set has two chip-select rows and if double-sided sticks are used these
will occupy those chip-select rows.

* Bank

This term is avoided because it is unclear when needing to distinguish
between chip-select rows and socket sets.


Memory Controllers
------------------

Most of the EDAC core is focused on doing Memory Controller error detection.
Memory controllers are allocated with :c:func:`edac_mc_alloc`, which
internally uses the struct ``mem_ctl_info`` to describe them. This is an
opaque struct for the EDAC drivers. Only the EDAC core is allowed to touch it.

.. kernel-doc:: include/linux/edac.h

.. kernel-doc:: drivers/edac/edac_mc.h

PCI Controllers
---------------

The EDAC subsystem provides a mechanism to handle PCI controllers by calling
:c:func:`edac_pci_alloc_ctl_info`. It uses the struct
:c:type:`edac_pci_ctl_info` to describe the PCI controllers.

.. kernel-doc:: drivers/edac/edac_pci.h

EDAC Blocks
-----------

The EDAC subsystem also provides a generic mechanism to report errors on
other parts of the hardware via the :c:func:`edac_device_alloc_ctl_info`
function.

The structures :c:type:`edac_dev_sysfs_block_attribute`,
:c:type:`edac_device_block`, :c:type:`edac_device_instance` and
:c:type:`edac_device_ctl_info` provide a generic or abstract 'edac_device'
representation in sysfs.

This set of structures, and the code that implements the APIs for them,
provides for registering EDAC-type devices which are NOT standard memory or
PCI, like:

- CPU caches (L1 and L2)
- DMA engines
- Core CPU switches
- Fabric switch units
- PCIe interface controllers
- other EDAC/ECC type devices that can be monitored for
  errors, etc.

It allows for a two-level hierarchy.

For example, a cache could be composed of L1, L2 and L3 levels of cache.
Each CPU core would have its own L1 cache, while sharing L2 and maybe L3
caches. In such a case, those can be represented via the following sysfs
nodes::

        /sys/devices/system/edac/..

        pci/            <existing pci directory (if available)>
        mc/             <existing memory device directory>
        cpu/cpu0/..     <L1 and L2 block directory>
                /L1-cache/ce_count
                         /ue_count
                /L2-cache/ce_count
                         /ue_count
        cpu/cpu1/..     <L1 and L2 block directory>
                /L1-cache/ce_count
                         /ue_count
                /L2-cache/ce_count
                         /ue_count
        ...

        the L1 and L2 directories would be "edac_device_block's"

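From userspace, a tree shaped like the example above can be walked with a
few lines of Python. This is only a sketch: ``read_device_counts`` is a
hypothetical helper, and it assumes the illustrative
``<device>/<instance>/<block>/{ce_count,ue_count}`` layout shown above. The
directories and attribute names present on a real system depend on the
driver that registered the device.

.. code-block:: python

    from pathlib import Path


    def read_device_counts(device_dir):
        """Walk <device>/<instance>/<block>/{ce_count,ue_count}.

        Returns {instance: {block: {"ce_count": int, "ue_count": int}}}.
        Directories that hold no counters are skipped.
        """
        tree = {}
        for instance in sorted(p for p in Path(device_dir).iterdir() if p.is_dir()):
            blocks = {}
            for block in sorted(p for p in instance.iterdir() if p.is_dir()):
                counters = {}
                for name in ("ce_count", "ue_count"):
                    attr = block / name
                    if attr.is_file():
                        counters[name] = int(attr.read_text().strip())
                if counters:
                    blocks[block.name] = counters
            if blocks:
                tree[instance.name] = blocks
        return tree

Calling ``read_device_counts("/sys/devices/system/edac/cpu")`` on a system
that exposes the layout above would return one entry per ``cpuN`` instance,
each holding the counters of its ``L1-cache`` and ``L2-cache`` blocks.
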
.. kernel-doc:: drivers/edac/edac_device.h