Oracle Data Analytics Accelerator (DAX)
---------------------------------------

DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
(DAX2) processor chips, and has direct access to the CPU's L3 caches
as well as physical memory. It can perform several operations on data
streams with various input and output formats.  A driver provides a
transport mechanism and has limited knowledge of the various opcodes
and data formats. A user space library provides high level services
and translates these into low level commands which are then passed
into the driver and subsequently the Hypervisor and the coprocessor.
The library is the recommended way for applications to use the
coprocessor, and the driver interface is not intended for general use.
This document describes the general flow of the driver, its
structures, and its programmatic interface. It also provides example
code sufficient to write user or kernel applications that use DAX
functionality.

The user library is open source and available at:
    https://oss.oracle.com/git/gitweb.cgi?p=libdax.git

The Hypervisor interface to the coprocessor is described in detail in
the accompanying document, dax-hv-api.txt, which is a plain text
excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
Specification" version 3.0.20+15, dated 2017-09-25.


High Level Overview
-------------------

A coprocessor request is described by a Command Control Block
(CCB). The CCB contains an opcode and various parameters. The opcode
specifies what operation is to be done, and the parameters specify
options, flags, sizes, and addresses.  The CCB (or an array of CCBs)
is passed to the Hypervisor, which handles queueing and scheduling of
requests to the available coprocessor execution units. A status code
returned indicates if the request was submitted successfully or if
there was an error.  One of the addresses given in each CCB is a
pointer to a "completion area", which is a 128 byte memory block that
is written by the coprocessor to provide execution status. No
interrupt is generated upon completion; the completion area must be
polled by software to find out when a transaction has finished.
However, the M7 and later processors provide a mechanism to pause the
virtual processor until the completion status has been updated by the
coprocessor. This is done using the monitored load and mwait
instructions, which are described in more detail later.  The DAX
coprocessor was designed so that after a request is submitted, the
kernel is no longer involved in the processing of it.  The polling is
done at the user level, which results in almost zero latency between
completion of a request and resumption of execution of the requesting
thread.


Addressing Memory
-----------------

The kernel does not have access to physical memory in the Sun4v
architecture, as there is an additional level of memory virtualization
present. This intermediate level is called "real" memory, and the
kernel treats this as if it were physical.  The Hypervisor handles the
translations between real memory and physical so that each logical
domain (LDOM) can have a partition of physical memory that is isolated
from that of other LDOMs.  When the kernel sets up a virtual mapping,
it specifies a virtual address and the real address to which it should
be mapped.

The DAX coprocessor can only operate on physical memory, so before a
request can be fed to the coprocessor, all the addresses in a CCB must
be converted into physical addresses. The kernel cannot do this since
it has no visibility into physical addresses. Instead, a CCB may
contain virtual addresses, real addresses, or a combination of the two
for its buffers. An "address type" field is available for each address
that may be given in the CCB. In all cases, the Hypervisor will
translate all the addresses to physical before dispatching to
hardware. Address translations are performed using the context of the
process initiating the request.


The Driver API
--------------

An application makes requests to the driver via the write() system
call, and gets results (if any) via read(). The completion areas are
made accessible via mmap(), and are read-only for the application.

The request may either be an immediate command or an array of CCBs to
be submitted to the hardware.

Each open instance of the device is exclusive to the thread that
opened it, and must be used by that thread for all subsequent
operations. The driver open function creates a new context for the
thread and initializes it for use.  This context contains pointers and
values used internally by the driver to keep track of submitted
requests. The completion area buffer is also allocated, and this is
large enough to contain the completion areas for many concurrent
requests.  When the device is closed, any outstanding transactions are
flushed and the context is cleaned up.

On a DAX1 system (M7), the device will be called "oradax1", while on a
DAX2 system (M8) it will be "oradax2". If an application requires one
or the other, it should simply attempt to open the appropriate
device. Only one of the devices will exist on any given system, so the
name can be used to determine what the platform supports.

The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
all of these, success is indicated by a return value from write()
equal to the number of bytes given in the call. Otherwise -1 is
returned and errno is set.

CCB_DEQUEUE

Tells the driver to clean up resources associated with past
requests. Since no interrupt is generated upon the completion of a
request, the driver must be told when it may reclaim resources.  No
further status information is returned, so the user should not
subsequently call read().

CCB_KILL

Kills a CCB during execution. The CCB is guaranteed not to continue
executing once this call returns successfully. On success, read() must
be called to retrieve the result of the action.

CCB_INFO

Retrieves information about a currently executing CCB. Note that some
Hypervisors might return 'notfound' when the CCB is in 'inprogress'
state. To ensure a CCB in the 'notfound' state will never be executed,
CCB_KILL must be invoked on that CCB. Upon success, read() must be
called to retrieve the details of the action.

Submission of an array of CCBs for execution

A write() whose length is a multiple of the CCB size is treated as a
submit operation. The file offset is treated as the index of the
completion area to use, and may be set via lseek() or using the
pwrite() system call. If -1 is returned then errno is set to indicate
the error. Otherwise, the return value is the length of the array that
was actually accepted by the coprocessor. If the accepted length is
equal to the requested length, then the submission was completely
successful and there is no further status needed; hence, the user
should not subsequently call read(). Partial acceptance of the CCB
array is indicated by a return value less than the requested length,
and read() must be called to retrieve further status information.  The
status will reflect the error caused by the first CCB that was not
accepted, and status_data will provide additional data in some cases.
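
The three possible outcomes of a submit write() described above can be
captured in a small helper. This is only a sketch and not part of the
driver API: the names submit_outcome and classify_submit are
hypothetical, and CCB_SIZE restates the 64-byte CCB size (eight 64-bit
words).

```c
#include <stddef.h>
#include <sys/types.h>

#define CCB_SIZE 64     /* one CCB is eight 64-bit words */

enum submit_outcome {
        SUBMIT_ERROR,    /* write() returned -1; consult errno */
        SUBMIT_PARTIAL,  /* some CCBs were not accepted; call read() */
        SUBMIT_COMPLETE  /* every CCB was accepted; do not call read() */
};

/*
 * Hypothetical helper: interpret the return value of a submit write().
 * ret is the value returned by write()/pwrite(), requested is the byte
 * length that was passed in, and *ccbs_accepted receives the number of
 * whole CCBs the coprocessor accepted.
 */
static enum submit_outcome classify_submit(ssize_t ret, size_t requested,
                                           size_t *ccbs_accepted)
{
        if (ret < 0) {
                *ccbs_accepted = 0;
                return SUBMIT_ERROR;
        }
        *ccbs_accepted = (size_t)ret / CCB_SIZE;
        return ((size_t)ret == requested) ? SUBMIT_COMPLETE : SUBMIT_PARTIAL;
}
```

A caller would pass the result of pwrite() straight into
classify_submit() and call read() only when SUBMIT_PARTIAL is returned.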

MMAP

The mmap() function provides access to the completion area allocated
in the driver.  Note that the completion area is not writeable by the
user process, and the mmap call must not specify PROT_WRITE.


Completion of a Request
-----------------------

The first byte in each completion area is the command status, which is
updated by the coprocessor hardware. Software may take advantage of
new M7/M8 processor capabilities to efficiently poll this status byte.
First, a "monitored load" is achieved via a Load from Alternate Space
(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY).  Second, a
"monitored wait" is achieved via the mwait instruction (a write to
%asr28). This instruction is like pause in that it suspends execution
of the virtual processor for the given number of nanoseconds, but in
addition will terminate early when one of several events occurs. If the
block of data containing the monitored location is modified, then the
mwait terminates. This causes software to resume execution immediately
(without a context switch or kernel to user transition) after a
transaction completes. Thus the latency between transaction completion
and resumption of execution may be just a few nanoseconds.
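
The monitored load and mwait sequence described above can be wrapped in
a bounded polling helper. The following is a sketch under stated
assumptions: the name dax_poll_status and the max_polls limit are
hypothetical, and the non-SPARC branch is only a stand-in (a plain
volatile load) so that the logic can be compiled and exercised on other
machines.

```c
#include <stdint.h>

/*
 * Hypothetical helper: poll the status byte of one completion area.
 * Returns the non-zero command status once it has been written, or 0
 * if max_polls iterations pass while the command is still in progress.
 */
static uint8_t dax_poll_status(const volatile uint8_t *completion_area,
                               unsigned long max_polls)
{
        unsigned long status = 0;
        unsigned long i;

        for (i = 0; i < max_polls; i++) {
#if defined(__sparc__)
                /* Monitored load with ASI 0x84 (ASI_MONITOR_PRIMARY) */
                __asm__ __volatile__("lduba [%1] 0x84, %0\n"
                                     : "=r" (status)
                                     : "r"  (completion_area)
                                     : "memory");
#else
                /* Assumption: ordinary load as a stand-in off SPARC */
                status = completion_area[0];
#endif
                if (status)          /* 0 indicates command in progress */
                        break;
#if defined(__sparc__)
                /* mwait: sleep up to 1000 ns or until the block changes */
                __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::: "memory");
#endif
        }
        return (uint8_t)status;
}
```

Bounding the loop is a design choice of this sketch; it lets the caller
fall back to CCB_INFO or CCB_KILL if a request appears to be stuck.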


Application Life Cycle of a DAX Submission
------------------------------------------

 - open dax device
 - call mmap() to get the completion area address
 - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
 - submit CCB via write() or pwrite()
 - go into a loop executing monitored load + monitored wait and
   terminate when the command status indicates the request is complete
   (CCB_KILL or CCB_INFO may be used any time as necessary)
 - perform a CCB_DEQUEUE
 - call munmap() for completion area
 - close the dax device


Memory Constraints
------------------

The DAX hardware operates only on physical addresses. Therefore, it is
not aware of virtual memory mappings and the discontiguities that may
exist in the physical memory that a virtual buffer maps to. There is
no I/O TLB or any scatter/gather mechanism. All buffers, whether input
or output, must reside in a physically contiguous region of memory.

The Hypervisor translates all addresses within a CCB to physical
before handing off the CCB to DAX. The Hypervisor determines the
virtual page size for each virtual address given, and uses this to
program a size limit for each address. This prevents the coprocessor
from reading or writing beyond the bound of the virtual page, even
though it is accessing physical memory directly. A simpler way of
saying this is that a DAX operation will never "cross" a virtual page
boundary. If an 8k virtual page is used, then the data is strictly
limited to 8k. If a user's buffer is larger than 8k, then a larger
page size must be used, or the transaction size will be truncated to
8k.

Huge pages. A user may allocate huge pages using standard interfaces.
Memory buffers residing on huge pages may be used to achieve much
larger DAX transaction sizes, but the rules must still be followed,
and no transaction will cross a page boundary, even a huge page.  A
major caveat is that Linux on SPARC presents 8MB as one of the huge
page sizes. SPARC does not actually provide an 8MB hardware page size,
and this size is synthesized by pasting together two 4MB pages. The
reasons for this are historical, and it creates an issue because only
half of this 8MB page can actually be used for any given buffer in a
DAX request, and it must be either the first half or the second half;
it cannot be a 4MB chunk in the middle, since that crosses a
(hardware) page boundary. Note that this entire issue may be hidden by
higher level libraries.
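
The page-boundary rule above can be checked in software before a
request is built. The helper below is a minimal sketch; the name
fits_in_one_page is hypothetical, and the caller is assumed to know the
hardware page size backing the buffer (for example, 4MB rather than 8MB
when the buffer lives on an 8MB huge page).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return true if the buffer [addr, addr + len)
 * lies entirely within one page of page_size bytes.
 * page_size must be a power of two.
 */
static bool fits_in_one_page(uintptr_t addr, size_t len, size_t page_size)
{
        /* Offset of the first byte within its page */
        uintptr_t offset = addr & (uintptr_t)(page_size - 1);

        /* The buffer fits if it ends at or before the end of that page */
        return len != 0 && len <= page_size - offset;
}
```

With a 4MB page size, a 4MB buffer starting 2MB into an 8MB huge page
fails this check, while one starting at either 4MB-aligned half passes.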


CCB Structure
-------------

A CCB is an array of 8 64-bit words. Several of these words provide
command opcodes, parameters, flags, etc., and the rest are addresses
for the completion area, output buffer, and various inputs:

   struct ccb {
       u64   control;
       u64   completion;
       u64   input0;
       u64   access;
       u64   input1;
       u64   op_data;
       u64   output;
       u64   table;
   };

See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
each of these fields, and see dax-hv-api.txt for a complete description
of the Hypervisor API available to the guest OS (i.e., the Linux kernel).

The first word (control) is examined by the driver for the following:
 - CCB version, which must be consistent with hardware version
 - Opcode, which must be one of the documented allowable commands
 - Address types, which must be set to "virtual" for all the addresses
   given by the user, thereby ensuring that the application can
   only access memory that it owns


Example Code
------------

The DAX is accessible to both user and kernel code.  The kernel code
can make hypercalls directly while the user code must use wrappers
provided by the driver. The setup of the CCB is nearly identical for
both; the only difference is in preparation of the completion area. An
example of user code is given now, with kernel code afterwards.

In order to program using the driver API, the file
arch/sparc/include/uapi/asm/oradax.h must be included.

First, the proper device must be opened. For M7 it will be
/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
procedure is to attempt to open both, as only one will succeed:

        fd = open("/dev/oradax1", O_RDWR);
        if (fd < 0)
                fd = open("/dev/oradax2", O_RDWR);
        if (fd < 0) {
                /* No DAX found */
        }

Next, the completion area must be mapped:

        completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);

All input and output buffers must be fully contained in one hardware
page since, as explained above, the DAX is strictly constrained by
virtual page boundaries.  In addition, the output buffer must be
64-byte aligned and its size must be a multiple of 64 bytes because
the coprocessor writes in units of cache lines.
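
One way to satisfy the alignment and size rules for an output bit
vector is sketched below. The names dax_output_size and
dax_alloc_output are hypothetical, and the use of C11 aligned_alloc()
is an assumption of this sketch rather than something the driver
requires. Note that this only handles the 64-byte rules; it does not by
itself guarantee that the buffer stays within one page.

```c
#include <stddef.h>
#include <stdlib.h>

#define DAX_OUTPUT_ALIGN 64     /* cache line size, per the text above */

/*
 * Hypothetical helper: number of bytes needed to hold an nbits-long
 * bit vector, rounded up to a multiple of 64 bytes.
 */
static size_t dax_output_size(size_t nbits)
{
        size_t bytes = (nbits + 7) / 8;

        return (bytes + DAX_OUTPUT_ALIGN - 1) / DAX_OUTPUT_ALIGN
                * DAX_OUTPUT_ALIGN;
}

/*
 * Hypothetical helper: allocate a 64-byte aligned output buffer for an
 * nbits-long bit vector. Returns NULL on failure or if nbits is 0.
 * The caller releases the buffer with free().
 */
static void *dax_alloc_output(size_t nbits)
{
        size_t size = dax_output_size(nbits);

        if (size == 0)
                return NULL;
        return aligned_alloc(DAX_OUTPUT_ALIGN, size);
}
```

For example, a 1000-bit result needs 125 bytes, which this sketch
rounds up to a 128-byte allocation.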

This example demonstrates the DAX Scan command, which takes as input a
vector and a match value, and produces a bitmap as the output. For
each input element that matches the value, the corresponding bit is
set in the output.

In this example, the input vector consists of a series of single bits,
and the match value is 0. So each 0 bit in the input will produce a 1
in the output, and vice versa. The output bitmap is therefore the
input bitmap inverted.
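
For checking results, the expected output of this particular request
can be computed in software. The function below is a hypothetical
reference model, not the hardware algorithm: the name scan_value_model
is invented for illustration, and it assumes that bit i of a stream is
taken MSB-first within each byte. For 1-bit elements the result is the
same for any bit order, as long as input and output use the same one.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical software model of the 1-bit Scan Value example: output
 * bit i is set when input bit i equals match (0 or 1).
 * Assumption: bits are numbered MSB-first within each byte.
 * output must hold at least (nbits + 7) / 8 bytes.
 */
static void scan_value_model(const uint8_t *input, size_t nbits,
                             unsigned int match, uint8_t *output)
{
        size_t i;

        memset(output, 0, (nbits + 7) / 8);
        for (i = 0; i < nbits; i++) {
                uint8_t mask = (uint8_t)(0x80u >> (i % 8));
                unsigned int bit = (input[i / 8] & mask) ? 1u : 0u;

                if (bit == match)
                        output[i / 8] |= mask;
        }
}
```

With a match value of 0 this reduces to a bitwise inversion of the
first nbits of the input, which is what the DAX request in this example
is expected to produce.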

For details of all the parameters and bits used in this CCB, please
refer to section 36.2.1.3 of the DAX Hypervisor API document, which
describes the Scan command in detail.

        ccb->control =          /* Table 36.1, CCB Header Format */
                  (2L << 48)    /* command = Scan Value */
                | (3L << 40)    /* output address type = primary virtual */
                | (3L << 34)    /* primary input address type = primary virtual */
                                /* Section 36.2.1, Query CCB Command Formats */
                | (1 << 28)     /* 36.2.1.1.1 primary input format = fixed width bit packed */
                | (0 << 23)     /* 36.2.1.1.2 primary input element size = 0 (1 bit) */
                | (8 << 10)     /* 36.2.1.1.6 output format = bit vector */
                | (0 <<  5)     /* 36.2.1.3 First scan criteria size = 0 (1 byte) */
                | (31 << 0);    /* 36.2.1.3 Disable second scan criteria */

        ccb->completion = 0;    /* Completion area address, to be filled in by driver */

        ccb->input0 = (unsigned long) input; /* primary input address */

        ccb->access =           /* Section 36.2.1.2, Data Access Control */
                  (2 << 24)     /* Primary input length format = bits */
                | (nbits - 1);  /* number of bits in primary input stream, minus 1 */

        ccb->input1 = 0;        /* secondary input address, unused */

        ccb->op_data = 0;       /* scan criteria (value to be matched) */

        ccb->output = (unsigned long) output;   /* output address */

        ccb->table = 0;         /* table address, unused */

The CCB submission is a write() or pwrite() system call to the
driver. If the call fails, then a read() must be used to retrieve the
status:

        if (pwrite(fd, ccb, 64, 0) != 64) {
                struct ccb_exec_result status;
                read(fd, &status, sizeof(status));
                /* bail out */
        }

After a successful submission of the CCB, the completion area may be
polled to determine when the DAX is finished. Detailed information on
the contents of the completion area can be found in section 36.2.2 of
the DAX HV API document.

        while (1) {
                /* Monitored Load */
                __asm__ __volatile__("lduba [%1] 0x84, %0\n"
                                     : "=r" (status)
                                     : "r"  (completion_area));

                if (status)          /* 0 indicates command in progress */
                        break;

                /* MWAIT */
                __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
        }

A completion area status of 1 indicates successful completion of the
CCB and validity of the output bitmap, which may be used immediately.
All other non-zero values indicate error conditions which are
described in section 36.2.2.

        if (completion_area[0] != 1) {  /* section 36.2.2, 1 = command ran and succeeded */
                /* completion_area[0] contains the completion status */
                /* completion_area[1] contains an error code, see 36.2.2 */
        }

After the completion area has been processed, the driver must be
notified that it can release any resources associated with the
request. This is done via the dequeue operation:

        struct dax_command cmd;
        cmd.command = CCB_DEQUEUE;
        if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
                /* bail out */
        }

Finally, normal program cleanup should be done, i.e., unmapping the
completion area, closing the dax device, freeing memory, etc.

[Kernel example]

The only difference in using the DAX in kernel code is the treatment
of the completion area. Unlike user applications which mmap the
completion area allocated by the driver, kernel code must allocate its
own memory to use for the completion area, and this address and its
type must be given in the CCB:

        ccb->control |=         /* Table 36.1, CCB Header Format */
                (3L << 32);     /* completion area address type = primary virtual */

        ccb->completion = (unsigned long) completion_area;   /* Completion area address */

The dax submit hypercall is made directly. The flags used in the
ccb_submit call are documented in the DAX HV API in section 36.3.1.

#include <asm/hypervisor.h>

        hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
                                 HV_CCB_QUERY_CMD |
                                 HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
                                 HV_CCB_VA_PRIVILEGED,
                                 0, &bytes_accepted, &status_data);

        if (hv_rv != HV_EOK) {
                /* hv_rv is an error code, status_data contains */
                /* potential additional status, see 36.3.1.1 */
        }

After the submission, the completion area polling code is identical to
that in user land:

        while (1) {
                /* Monitored Load */
                __asm__ __volatile__("lduba [%1] 0x84, %0\n"
                                     : "=r" (status)
                                     : "r"  (completion_area));

                if (status)          /* 0 indicates command in progress */
                        break;

                /* MWAIT */
                __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
        }

        if (completion_area[0] != 1) {  /* section 36.2.2, 1 = command ran and succeeded */
                /* completion_area[0] contains the completion status */
                /* completion_area[1] contains an error code, see 36.2.2 */
        }

The output bitmap is ready for consumption immediately after the
completion status indicates success.