doc/mim.md

   1 # Component Relationships
   2
   3 ## MIM
   4
   5 Language with Muddle's control flow, object model, and syntax, but
   6 much more static semantics, straightforwardly compilable to MIM
   7 ops. Restrictions compared to Muddle:
   8 - no EVAL, BIND, or QUOTEd args
   9 - local variables must be resolvable lexically (explicit LVAL
  10   operations available for creating SPECIAL variables)
  11 - the first position of a form is bound to a function at compile time
  12 - FUNCTIONs can't be rebound from the perspective of MIM code (MUM can
  13   override anything for other MUM code, but the implementation of
  14   builtins written in MIM won't use the new versions)
  15 Most of these restrictions would have been in place in Muddle 105's
  16 MIMOC--at the time it was common practice for dynamic languages to
  17 have different semantics when compiled (full Muddle semantics are
  18 impossible to properly AOT compile; see *The Theory of Fexprs is
  19 Trivial*, M. Wand, '98).
  20
  21 ## MIMC
  22
  23 Library defining MIM as a collection of SUBRs that construct
  24 bytecode when invoked, so that a program written in MIM can be
  25 compiled by running it, and arbitrary code can be used in a macro-like
  26 manner. E.g. after importing MIMC, this call will push a single MIM
  27 operation into the current code buffer, looking up the buffer and the
  28 local slots in the current dynamic scope:
  29     <cons x xs = xxs>
  30 (This is actually a great example of the usefulness of dynamic scoping
  31 in enabling DSLs when context is important.) More complex MIM builtins
  32 (e.g. FCN, MAPF) will emit multiple opcodes, insert branches, build
  33 MSUBR headers, etc.
  34
  35 Generally, libraries written in MIM (incl. most of the stdlib) will be
  36 compiled AOT to reduce load time. We'll probably want to compile
  37 per-file and link separately (faster turnaround time when working on
  38 large libraries; enforces modularity).
  39
  40 ## MUMC
  41
  42 MUMC can be a method JIT implemented as an extension of MIMC. This
  43 eliminates the work of managing the correspondence between MUM's
  44 environment and control flow and MIM's. This mainly requires
  45 implementing some helper functions for complex MUM builtins, and
  46 supporting alternate semantics for some operations (e.g. when
  47 compiling in MUM mode locals are SPECIAL by default [just like
  48 changing Muddle 105's SPECIAL-MODE flag], and so new scopes will emit
  49 the LBIND-managing ops).
  50
  51 As mentioned above, full Muddle can't be AOT-compiled; but it's
  52 straightforward to method-JIT it to MIM *speculatively*, when the
  53 source is still available as a fallback if "dynamic behavior" is
  54 observed. This in itself doesn't have any benefit over only
  55 source-interpreting with an efficient format for structure constants,
  56 (packed into a vector in depth-first order). What it gets us is the
  57 power to perform some important optimizations relatively easily.
  58 - inline short MIM builtins (many operations are simpler to perform
  59   directly than to APPLY)
  60 - inline FEXPRs with constant folding and DCE
  61 We can eliminate the guards on the hotpath and pay almost no runtime
  62 cost to have the dynamic paths available (that is, when the user isn't
  63 actually redefining core builtins, UNASSIGNing the locals of their
  64 ancestors' scopes, and modifying the code of their callsites), if we
  65 set hooks for GVAL modifications to functions that are called by
  66 functions that have a JIT resident.
  67
  68 Also, once we have a basic working system and want to make it decently
  69 fast, for such a dynamic language the only way to go is with a tracing
  70 JIT for the hot loops--but just tracing an interpreter accomplishes
  71 nothing (see *Tracing The Meta-Level*, C. Bolz, '09). With the MUM
  72 method-jitted to MIM ops, a simple MIM bytecode JIT will also be a
  73 tracing JIT for MUM.
  74
  75 Example of what compiling some MUM-calls to MIM ops might look like
  76 (and where guards would be needed), along with structure constants
  77 (for fallback to EVAL) for the call
  78 `<A <B1> <B2 <C>>>`:
  79         consts:
  80                 call = [A B1 FORM:1 B2 C FORM:1 FORM:2 FORM:3] ;structured objects stored in depth-first vector
  81         entry:
  82                 ; [GVAL modification hook (A): enable bypass to slowpath-a]
  83                 <APPLYGLI B1 0 = PUSH> ; <APPLY G/LVAL IMM local:atom args:byte =>
  84                 ; [GVAL modification hook (B2): rewrite B1's return to slowpath-b2]
  85                 <APPLYGLI C 0 = PUSH>
  86                 <APPLYGLI B2 1 = PUSH>
  87         b2-done:
  88                 <APPLYGLI A 2 = ret>
  89         a-done:
  90                 <RET ret>
  91         slowpath-a:
  92                 <LOADRPN A.. = call>
  93                 <EVAL call = ret>
  94                 <JUMP a-done>
  95         slowpath-b2:
  96                 <LOADRPN B2.. = b2-call>
  97                 <EVAL b2-call = PUSH>
  98                 <JUMP b2-done>
  99
 100 # The Bytecode
 101
 102 ## Format
 103
 104 The MIM design is a 3-operand register machine. This is an excellent
 105 instruction format for enabling the implementation of an efficient
 106 interpreter, though it has a higher minimum complexity than other
 107 common layouts.
 108
 109 ## Encoding
 110
 111 The docs we have are mum about instruction encoding, but given this
 112 set of opcodes we're going to want something like this:
 113
 114 bit31                        bit0
 115 v                               v
 116 .-------------------------------.
 117 |   D   |   C   |   B   |  op   |
 118 '-------------------------------'
 119
 120 bit31                        bit0
 121 v                               v
 122 .-------------------------------.
 123 |      CD       |   B   |   op  |
 124 '-------------------------------'
 125
 126 32-bit fixed-width operations with 1-byte fields are easy to decode.
 127 Storing the op at the numerically low end makes it easiest to read
 128 first. Operand B can be decoded with the opcode before dispatching,
 129 which enables simplifications like a fallthrough implementation for
 130 ops that have one variant that takes an immediate in the B position
 131 and another variant (with a different opcode) that takes a local.
 132
 133 The address modes are:
 134 - B: immediate Byte (incl. signed offset for most jump operations)
 135 - H: immediate Halfword (16-bit operand, incl long jumps JUMP/EMP?/FALSE?)
 136 - W: IP-relative Word (32 bit constant inlined in code)
 137 - L: index into the current function's Locals table
 138 - LL: index into the Locals table, result indirected through a locative
 139 The RVECTOR is placed at local0, so that static values can be accessed
 140 via NTHR and locals with static names via NTHRL.
 141
 142 An opcode interprets each of its arguments according to a fixed
 143 address mode; any necessary variants are encoded as separate opcodes
 144 (distinguished in assembly by the suffixes listed above) rather than
 145 branching on flag bits. The only common non-guard branches are the
 146 opcode dispatch, (occasionally prefixed-opcode dispatch), and dispatch
 147 for polymorphic/mistyped ops.
 148 Some ops use use base/length pairs to blit ranges of locals/constants.
 149
 150 Some ~1/3 of the operations in the MIM docs are both rare (unlikely to
 151 occur in hot loops) and short (don't need all 3 operand bytes), so we
 152 can encode them all under 1 prefix opcode and sub-dispatch on operand
 153 B, freeing a lot of opcode space for more critical instructions.
 154
 155 ## Type System
 156
 157 The type system is mostly the same between the level of MIM ops and
 158 MUM, with some caveats. MIM mostly looks at PRIMTYPEs and doesn't care
 159 about the high bits, while MIM-level PRIMTYPEs have a couple of bits
 160 that are masked out from MUM's view, to support the implementation of
 161 weak pointers (STACK objects) and large allocations (see below).
 162
 163 STRING will need to have a different PRIMTYPE from BYTES, because it
 164 would be unsafe to allow CHTYPE-ing arbitrary BYTES to utf-8 STRINGs
 165 (and would expose the implementation detail of our string encoding;
 166 eventually we might want to switch to Python 3's uniform-1/2/4 format,
 167 trading some type system complexity for more speed).
 168
 169 To be compatible with MUM's polymorphic builtins, operations like
 170 <REST> are polymorphic in bytecode. Operations that need to be fast
 171 (like the primary looping operator, used for MAPF) also have
 172 type-specialized versions for the major types. At opcode level, typing
 173 is weak: when a typed opcode is applied to the wrong kind of object,
 174 it redispatches to the polymorphic implementation; an ERROR only
 175 occurs if no version of the operation is applicable to the
 176 operand. This allows the implementation to support inline caching:
 177 when a polymorphic loop operator is invoked, after dispatching on the
 178 collection's type it overwrites the instruction pointer with the
 179 type-specialized version of its opcode. Thus only the first iteration
 180 of any loop requires polymorphic dispatch.
 181
 182 ## Object Format
 183
 184 We can fit them in 64 bits!
 185
 186 A typical object layout, common to UBLOCK/BYTES/STRING:
 187 bit63
 188 v
 189 .-------------------------------.
 190 |            end-ptr            |
 191 +-------------------------------+
 192 |     length    |  Typ  | Prim  |
 193 '-------------------------------'
 194                                 ^
 195                                                          bit0
 196 The address is 32-bit. Length is 16-bit, with an interpretation that
 197 scales by object size. REST and EMPTY? don't need to tell the 3 object
 198 types apart (less dispatching), but NTH needs to know the item
 199 length. A "normal" allocation can be used for lengths up to 2**16
 200 objects.
 201
 202 Large object layout for UBLOCK/BYTES/STRING:
 203 .-------------------------------.
 204 |            cur-pos            |
 205 +-------------------------------+
 206 |   end-align   |  Typ  | Prim  |
 207 '-------------------------------'
 208 Cur-pos is a pointer, potentially offset from a static
 209 large-allocations base and scaled by object size. End-align indicates
 210 the number of trailing zero bits in the allocation's past-the-end
 211 address.
 212
 213 This large object model comes out of two observations:
 214 - Language semantics require that typical operations on collections be
 215   polymorphic anyway, so there's little cost to potentially
 216   dispatching to one more type of alternate format (especially since
 217   we'll never have to dispatch in loops).
 218 - For objects larger than 64KB we probably want to mmap the allocation
 219   outside the normal heap in any case, so at that point we can find a
 220   highly-aligned location and use the alignment to only need to store
 221   the log2 of the length.
 222
 223 ## Additional/Modified Opcodes
 224
 225 - <FALSE? local:any tgt:2byte>
 226   - This seems essential, not sure why there's nothing like it in the
 227     docs.
 228 - LENL: added a max-value
 229   - LENGTH can be implemented in terms of LENGTH? with an optional
 230     max, but not the other way around
 231 - <ERROR local:atom local:any local:any>
 232   - Seems necessary.
 233 - <DESTRUC(L/U/B/R/.../x)? target:byte =>
 234   - Single-op implementation of idiomatic loops (MAPF/MAPR). Has the
 235     semantics of REST-NTH_1-EMPTY?-JUMP, but with much less work.
 236 - implementing RETURN:
 237   - CALLs (and RESUMEs) are decoded twice: the 1st time does a call
 238     (ignoring the parameter specifying the result destination); the
 239     2nd reads only the result byte (of the instruction that is
 240     *before* the return address, e.g. last executed before call) so it
 241     knows what local to set with its result. (Result parameters are
 242     always in the D position, so RETURN doesn't need to know *what*
 243     it's returning to.)
 244
 245 # GC interface
 246
 247 The bulk of the GC (marker/sweeper, compactor) can be written in MIM,
 248 and run in an internal PROCESS, though some fundamental ops need to be
 249 compiled for a particular type of GC implementation (because hooks for
 250 allocation and read/write barriers need to be as fast as possible).