docs/gettingstarted/developers/vlib.md

   1
   2 VLIB (Vector Processing Library)
   3 ================================
   4
   5 The files associated with vlib are located in the ./src/{vlib,
   6 vlibapi, vlibmemory} folders. These libraries provide vector
   7 processing support including graph-node scheduling, reliable multicast
   8 support, ultra-lightweight cooperative multi-tasking threads, a CLI,
   9 plug in .DLL support, physical memory and Linux epoll support. Parts of
  10 this library embody US Patent 7,961,636.
  11
  12 Init function discovery
  13 -----------------------
  14
  15 vlib applications register for various \[initialization\] events by
  16 placing structures and \_\_attribute\_\_((constructor)) functions into
  17 the image. At appropriate times, the vlib framework walks
  18 constructor-generated singly-linked structure lists, calling the
  19 indicated functions. vlib applications create graph nodes, add CLI
  20 functions, start cooperative multi-tasking threads, etc. etc. using this
  21 mechanism.
  22
  23 vlib applications invariably include a number of VLIB\_INIT\_FUNCTION
  24 (my\_init\_function) macros.
  25
  26 Each init / configure / etc. function has the return type clib\_error\_t
  27 \*. Make sure that the function returns 0 if all is well, otherwise the
  28 framework will announce an error and exit.
  29
  30 vlib applications must link against vppinfra, and often link against
  31 other libraries such as VNET. In the latter case, it may be necessary to
  32 explicitly reference symbol(s) otherwise large portions of the library
  33 may be AWOL at runtime.
  34
  35 Node Graph Initialization
  36 -------------------------
  37
  38 vlib packet-processing applications invariably define a set of graph
  39 nodes to process packets.
  40
  41 One constructs a vlib\_node\_registration\_t, most often via the
  42 VLIB\_REGISTER\_NODE macro. At runtime, the framework processes the set
  43 of such registrations into a directed graph. It is easy enough to add
  44 nodes to the graph at runtime. The framework does not support removing
  45 nodes.
  46
  47 vlib provides several types of vector-processing graph nodes, primarily
  48 to control framework dispatch behaviors. The type member of the
  49 vlib\_node\_registration\_t functions as follows:
  50
  51 -   VLIB\_NODE\_TYPE\_PRE\_INPUT - run before all other node types
  52 -   VLIB\_NODE\_TYPE\_INPUT - run as often as possible, after pre\_input
  53     nodes
  54 -   VLIB\_NODE\_TYPE\_INTERNAL - only when explicitly made runnable by
  55     adding pending frames for processing
  56 -   VLIB\_NODE\_TYPE\_PROCESS - only when explicitly made runnable.
  57     "Process" nodes are actually cooperative multi-tasking threads. They
  58     **must** explicitly suspend after a reasonably short period of time.
  59
  60 For a precise understanding of the graph node dispatcher, please read
  61 ./src/vlib/main.c:vlib\_main\_loop.
  62
  63 Graph node dispatcher
  64 ---------------------
  65
  66 Vlib\_main\_loop() dispatches graph nodes. The basic vector processing
  67 algorithm is diabolically simple, but may not be obvious from even a
  68 long stare at the code. Here's how it works: some input node, or set of
  69 input nodes, produce a vector of work to process. The graph node
  70 dispatcher pushes the work vector through the directed graph,
  71 subdividing it as needed, until the original work vector has been
  72 completely processed. At that point, the process recurs.
  73
  74 This scheme yields a stable equilibrium in frame size, by construction.
  75 Here's why: as the frame size increases, the per-frame-element
  76 processing time decreases. There are several related forces at work; the
  77 simplest to describe is the effect of vector processing on the CPU L1
  78 I-cache. The first frame element \[packet\] processed by a given node
  79 warms up the node dispatch function in the L1 I-cache. All subsequent
  80 frame elements profit. As we increase the number of frame elements, the
  81 cost per element goes down.
  82
  83 Under light load, it is a crazy waste of CPU cycles to run the graph
  84 node dispatcher flat-out. So, the graph node dispatcher arranges to wait
  85 for work by sitting in a timed epoll wait if the prevailing frame size
  86 is low. The scheme has a certain amount of hysteresis to avoid
  87 constantly toggling back and forth between interrupt and polling mode.
  88 Although the graph dispatcher supports interrupt and polling modes, our
  89 current default device drivers do not.
  90
  91 The graph node scheduler uses a hierarchical timer wheel to reschedule
  92 process nodes upon timer expiration.
  93
  94 Graph dispatcher internals
  95 --------------------------
  96
  97 This section may be safely skipped. It's not necessary to understand
  98 graph dispatcher internals to create graph nodes.
  99
 100 Vector Data Structure
 101 ---------------------
 102
 103 In vpp / vlib, we represent vectors as instances of the vlib_frame_t type:
 104
 105 ```c
 106     typedef struct vlib_frame_t
 107     {
 108       /* Frame flags. */
 109       u16 flags;
 110
 111       /* Number of scalar bytes in arguments. */
 112       u8 scalar_size;
 113
 114       /* Number of bytes per vector argument. */
 115       u8 vector_size;
 116
 117       /* Number of vector elements currently in frame. */
 118       u16 n_vectors;
 119
 120       /* Scalar and vector arguments to next node. */
 121       u8 arguments[0];
 122     } vlib_frame_t;
 123 ```
 124
 125 Note that one _could_ construct all kinds of vectors - including
 126 vectors with some associated scalar data - using this structure. In
 127 the vpp application, vectors typically use a 4-byte vector element
 128 size, and zero bytes' worth of associated per-frame scalar data.
 129
 130 Frames are always allocated on CLIB_CACHE_LINE_BYTES boundaries.
 131 Frames have u32 indices which make use of the alignment property, so
 132 the maximum feasible main heap offset of a frame is
 133 CLIB_CACHE_LINE_BYTES * 0xFFFFFFFF: 64*4 = 256 Gbytes.
 134
 135 Scheduling Vectors
 136 ------------------
 137
 138 As you can see, vectors are not directly associated with graph
 139 nodes. We represent that association in a couple of ways.  The
 140 simplest is the vlib\_pending\_frame\_t:
 141
 142 ```c
 143     /* A frame pending dispatch by main loop. */
 144     typedef struct
 145     {
 146       /* Node and runtime for this frame. */
 147       u32 node_runtime_index;
 148
 149       /* Frame index (in the heap). */
 150       u32 frame_index;
 151
 152       /* Start of next frames for this node. */
 153       u32 next_frame_index;
 154
 155       /* Special value for next_frame_index when there is no next frame. */
 156     #define VLIB_PENDING_FRAME_NO_NEXT_FRAME ((u32) ~0)
 157     } vlib_pending_frame_t;
 158 ```
 159
 160 Here is the code in .../src/vlib/main.c:vlib_main_or_worker_loop()
 161 which processes frames:
 162
 163 ```c
 164       /*
 165        * Input nodes may have added work to the pending vector.
 166        * Process pending vector until there is nothing left.
 167        * All pending vectors will be processed from input -> output.
 168        */
 169       for (i = 0; i < _vec_len (nm->pending_frames); i++)
 170         cpu_time_now = dispatch_pending_node (vm, i, cpu_time_now);
 171       /* Reset pending vector for next iteration. */
 172 ```
 173
 174 The pending frame node_runtime_index associates the frame with the
 175 node which will process it.
 176
 177 Complications
 178 -------------
 179
 180 Fasten your seatbelt. Here's where the story - and the data structures
 181 \- become quite complicated...
 182
 183 At 100,000 feet: vpp uses a directed graph, not a directed _acyclic_
 184 graph. It's really quite normal for a packet to visit ip\[46\]-lookup
 185 multiple times. The worst-case: a graph node which enqueues packets to
 186 itself.
 187
 188 To deal with this issue, the graph dispatcher must force allocation of
 189 a new frame if the current graph node's dispatch function happens to
 190 enqueue a packet back to itself.
 191
 192 There are no guarantees that a pending frame will be processed
 193 immediately, which means that more packets may be added to the
 194 underlying vlib_frame_t after it has been attached to a
 195 vlib_pending_frame_t. Care must be taken to allocate new
 196 frames and pending frames if a (pending\_frame, frame) pair fills.
 197
 198 Next frames, next frame ownership
 199 ---------------------------------
 200
 201 The vlib\_next\_frame\_t is the last key graph dispatcher data structure:
 202
 203 ```c
 204     typedef struct
 205     {
 206       /* Frame index. */
 207       u32 frame_index;
 208
 209       /* Node runtime for this next. */
 210       u32 node_runtime_index;
 211
 212       /* Next frame flags. */
 213       u32 flags;
 214
 215       /* Reflects node frame-used flag for this next. */
 216     #define VLIB_FRAME_NO_FREE_AFTER_DISPATCH \
 217       VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH
 218
 219       /* This next frame owns enqueue to node
 220          corresponding to node_runtime_index. */
 221     #define VLIB_FRAME_OWNER (1 << 15)
 222
 223       /* Set when frame has been allocated for this next. */
 224     #define VLIB_FRAME_IS_ALLOCATED     VLIB_NODE_FLAG_IS_OUTPUT
 225
 226       /* Set when frame has been added to pending vector. */
 227     #define VLIB_FRAME_PENDING VLIB_NODE_FLAG_IS_DROP
 228
 229       /* Set when frame is to be freed after dispatch. */
 230     #define VLIB_FRAME_FREE_AFTER_DISPATCH VLIB_NODE_FLAG_IS_PUNT
 231
 232       /* Set when frame has traced packets. */
 233     #define VLIB_FRAME_TRACE VLIB_NODE_FLAG_TRACE
 234
 235       /* Number of vectors enqueue to this next since last overflow. */
 236       u32 vectors_since_last_overflow;
 237     } vlib_next_frame_t;
 238 ```
 239
 240 Graph node dispatch functions call vlib\_get\_next\_frame (...)  to
 241 set "(u32 \*)to_next" to the right place in the vlib_frame_t
 242 corresponding to the ith arc (aka next0) from the current node to the
 243 indicated next node.
 244
 245 After some scuffling around - two levels of macros - processing
 246 reaches vlib\_get\_next\_frame_internal (...). Get-next-frame-internal
 247 digs up the vlib\_next\_frame\_t corresponding to the desired graph
 248 arc.
 249
 250 The next frame data structure amounts to a graph-arc-centric frame
 251 cache. Once a node finishes adding element to a frame, it will acquire
 252 a vlib_pending_frame_t and end up on the graph dispatcher's
 253 run-queue. But there's no guarantee that more vector elements won't be
 254 added to the underlying frame from the same (source\_node,
 255 next\_index) arc or from a different (source\_node, next\_index) arc.
 256
 257 Maintaining consistency of the arc-to-frame cache is necessary. The
 258 first step in maintaining consistency is to make sure that only one
 259 graph node at a time thinks it "owns" the target vlib\_frame\_t.
 260
 261 Back to the graph node dispatch function. In the usual case, a certain
 262 number of packets will be added to the vlib\_frame\_t acquired by
 263 calling vlib\_get\_next\_frame (...).
 264
 265 Before a dispatch function returns, it's required to call
 266 vlib\_put\_next\_frame (...) for all of the graph arcs it actually
 267 used.  This action adds a vlib\_pending\_frame\_t to the graph
 268 dispatcher's pending frame vector.
 269
 270 Vlib\_put\_next\_frame makes a note in the pending frame of the frame
 271 index, and also of the vlib\_next\_frame\_t index.
 272
 273 dispatch\_pending\_node actions
 274 -------------------------------
 275
 276 The main graph dispatch loop calls dispatch pending node as shown
 277 above.
 278
 279 Dispatch\_pending\_node recovers the pending frame, and the graph node
 280 runtime / dispatch function. Further, it recovers the next\_frame
 281 currently associated with the vlib\_frame\_t, and detaches the
 282 vlib\_frame\_t from the next\_frame.
 283
 284 In .../src/vlib/main.c:dispatch\_pending\_node(...), note this stanza:
 285
 286 ```c
 287   /* Force allocation of new frame while current frame is being
 288      dispatched. */
 289   restore_frame_index = ~0;
 290   if (nf->frame_index == p->frame_index)
 291     {
 292       nf->frame_index = ~0;
 293       nf->flags &= ~VLIB_FRAME_IS_ALLOCATED;
 294       if (!(n->flags & VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH))
 295         restore_frame_index = p->frame_index;
 296     }
 297 ```
 298
 299 dispatch\_pending\_node is worth a hard stare due to the several
 300 second-order optimizations it implements. Almost as an afterthought,
 301 it calls dispatch_node which actually calls the graph node dispatch
 302 function.
 303
 304 Process / thread model
 305 ----------------------
 306
 307 vlib provides an ultra-lightweight cooperative multi-tasking thread
 308 model. The graph node scheduler invokes these processes in much the same
 309 way as traditional vector-processing run-to-completion graph nodes;
 310 plus-or-minus a setjmp/longjmp pair required to switch stacks. Simply
 311 set the vlib\_node\_registration\_t type field to
 312 vlib\_NODE\_TYPE\_PROCESS. Yes, process is a misnomer. These are
 313 cooperative multi-tasking threads.
 314
 315 As of this writing, the default stack size is 2<<15 = 32kb.
 316 Initialize the node registration's process\_log2\_n\_stack\_bytes member
 317 as needed. The graph node dispatcher makes some effort to detect stack
 318 overrun, e.g. by mapping a no-access page below each thread stack.
 319
 320 Process node dispatch functions are expected to be "while(1) { }" loops
 321 which suspend when not otherwise occupied, and which must not run for
 322 unreasonably long periods of time.
 323
 324 "Unreasonably long" is an application-dependent concept. Over the years,
 325 we have constructed frame-size sensitive control-plane nodes which will
 326 use a much higher fraction of the available CPU bandwidth when the frame
 327 size is low. The classic example: modifying forwarding tables. So long
 328 as the table-builder leaves the forwarding tables in a valid state, one
 329 can suspend the table builder to avoid dropping packets as a result of
 330 control-plane activity.
 331
 332 Process nodes can suspend for fixed amounts of time, or until another
 333 entity signals an event, or both. See the next section for a description
 334 of the vlib process event mechanism.
 335
 336 When running in vlib process context, one must pay strict attention to
 337 loop invariant issues. If one walks a data structure and calls a
 338 function which may suspend, one had best know by construction that it
 339 cannot change. Often, it's best to simply make a snapshot copy of a data
 340 structure, walk the copy at leisure, then free the copy.
 341
 342 Process events
 343 --------------
 344
 345 The vlib process event mechanism API is extremely lightweight and easy
 346 to use. Here is a typical example:
 347
 348 ```c
 349     vlib_main_t *vm = &vlib_global_main;
 350     uword event_type, * event_data = 0;
 351
 352     while (1)
 353     {
 354        vlib_process_wait_for_event_or_clock (vm, 5.0 /* seconds */);
 355
 356        event_type = vlib_process_get_events (vm, &event_data);
 357
 358        switch (event_type) {
 359        case EVENT1:
 360            handle_event1s (event_data);
 361            break;
 362
 363        case EVENT2:
 364            handle_event2s (event_data);
 365            break;
 366
 367        case ~0: /* 5-second idle/periodic */
 368            handle_idle ();
 369            break;
 370
 371        default: /* bug! */
 372            ASSERT (0);
 373        }
 374
 375        vec_reset_length(event_data);
 376     }
 377 ```
 378
 379 In this example, the VLIB process node waits for an event to occur, or
 380 for 5 seconds to elapse. The code demuxes on the event type, calling
 381 the appropriate handler function. Each call to
 382 vlib\_process\_get\_events returns a vector of per-event-type data
 383 passed to successive vlib\_process\_signal\_event calls; it is a
 384 serious error to process only event\_data\[0\].
 385
 386 Resetting the event\_data vector-length to 0 \[instead of calling
 387 vec\_free\] means that the event scheme doesn't burn cycles continuously
 388 allocating and freeing the event data vector. This is a common vppinfra
 389 / vlib coding pattern, well worth using when appropriate.
 390
 391 Signaling an event is easy, for example:
 392
 393 ```c
 394     vlib_process_signal_event (vm, process_node_index, EVENT1,
 395         (uword)arbitrary_event1_data); /* and so forth */
 396 ```
 397
 398 One can either know the process node index by construction - dig it out
 399 of the appropriate vlib\_node\_registration\_t - or by finding the
 400 vlib\_node\_t with vlib\_get\_node\_by\_name(...).
 401
 402 Buffers
 403 -------
 404
 405 vlib buffering solves the usual set of packet-processing problems,
 406 albeit at high performance. Key in terms of performance: one ordinarily
 407 allocates / frees N buffers at a time rather than one at a time. Except
 408 when operating directly on a specific buffer, one deals with buffers by
 409 index, not by pointer.
 410
 411 Packet-processing frames are u32\[\] arrays, not
 412 vlib\_buffer\_t\[\] arrays.
 413
 414 Packets comprise one or more vlib buffers, chained together as required.
 415 Multiple particle sizes are supported; hardware input nodes simply ask
 416 for the required size(s). Coalescing support is available. For obvious
 417 reasons one is discouraged from writing one's own wild and wacky buffer
 418 chain traversal code.
 419
 420 vlib buffer headers are allocated immediately prior to the buffer data
 421 area. In typical packet processing this saves a dependent read wait:
 422 given a buffer's address, one can prefetch the buffer header
 423 \[metadata\] at the same time as the first cache line of buffer data.
 424
 425 Buffer header metadata (vlib\_buffer\_t) includes the usual rewrite
 426 expansion space, a current\_data offset, RX and TX interface indices,
 427 packet trace information, and a opaque areas.
 428
 429 The opaque data is intended to control packet processing in arbitrary
 430 subgraph-dependent ways. The programmer shoulders responsibility for
 431 data lifetime analysis, type-checking, etc.
 432
 433 Buffers have reference-counts in support of e.g. multicast replication.
 434
 435 Shared-memory message API
 436 -------------------------
 437
 438 Local control-plane and application processes interact with the vpp
 439 dataplane via asynchronous message-passing in shared memory over
 440 unidirectional queues. The same application APIs are available via
 441 sockets.
 442
 443 Capturing API traces and replaying them in a simulation environment
 444 requires a disciplined approach to the problem. This seems like a
 445 make-work task, but it is not. When something goes wrong in the
 446 control-plane after 300,000 or 3,000,000 operations, high-speed replay
 447 of the events leading up to the accident is a huge win.
 448
 449 The shared-memory message API message allocator vl\_api\_msg\_alloc uses
 450 a particularly cute trick. Since messages are processed in order, we try
 451 to allocate message buffering from a set of fixed-size, preallocated
 452 rings. Each ring item has a "busy" bit. Freeing one of the preallocated
 453 message buffers merely requires the message consumer to clear the busy
 454 bit. No locking required.
 455
 456 Debug CLI
 457 ---------
 458
 459 Adding debug CLI commands to VLIB applications is very simple.
 460
 461 Here is a complete example:
 462
 463 ```c
 464     static clib_error_t *
 465     show_ip_tuple_match (vlib_main_t * vm,
 466                          unformat_input_t * input,
 467                          vlib_cli_command_t * cmd)
 468     {
 469         vlib_cli_output (vm, "%U\n", format_ip_tuple_match_tables, &routing_main);
 470         return 0;
 471     }
 472
 473     /* *INDENT-OFF* */
 474     static VLIB_CLI_COMMAND (show_ip_tuple_command) =
 475     {
 476         .path = "show ip tuple match",
 477         .short_help = "Show ip 5-tuple match-and-broadcast tables",
 478         .function = show_ip_tuple_match,
 479     };
 480     /* *INDENT-ON* */
 481 ```
 482
 483 This example implements the "show ip tuple match" debug cli
 484 command. In ordinary usage, the vlib cli is available via the "vppctl"
 485 applicationn, which sends traffic to a named pipe. One can configure
 486 debug CLI telnet access on a configurable port.
 487
 488 The cli implementation has an output redirection facility which makes it
 489 simple to deliver cli output via shared-memory API messaging,
 490
 491 Particularly for debug or "show tech support" type commands, it would be
 492 wasteful to write vlib application code to pack binary data, write more
 493 code elsewhere to unpack the data and finally print the answer. If a
 494 certain cli command has the potential to hurt packet processing
 495 performance by running for too long, do the work incrementally in a
 496 process node. The client can wait.
 497
 498 Handing off buffers between threads
 499 -----------------------------------
 500
 501 Vlib includes an easy-to-use mechanism for handing off buffers between
 502 worker threads. A typical use-case: software ingress flow hashing. At
 503 a high level, one creates a per-worker-thread queue which sends packets
 504 to a specific graph node in the indicated worker thread. With the
 505 queue in hand, enqueue packets to the worker thread of your choice.
 506
 507 ### Initialize a handoff queue
 508
 509 Simple enough, call vlib_frame_queue_main_init:
 510
 511 ```c
 512    main_ptr->frame_queue_index
 513        = vlib_frame_queue_main_init (dest_node.index, frame_queue_size);
 514 ```
 515
 516 Frame_queue_size means what it says: the number of frames which may be
 517 queued. Since frames contain 1...256 packets, frame_queue_size should
 518 be a reasonably small number (32...64). If the frame queue producer(s)
 519 are faster than the frame queue consumer(s), congestion will
 520 occur. Suggest letting the enqueue operator deal with queue
 521 congestion, as shown in the enqueue example below.
 522
 523 Under the floorboards, vlib_frame_queue_main_init creates an input queue
 524 for each worker thread.
 525
 526 Please do NOT create frame queues until it's clear that they will be
 527 used. Although the main dispatch loop is reasonably smart about how
 528 often it polls the (entire set of) frame queues, polling unused frame
 529 queues is a waste of clock cycles.
 530
 531 ### Hand off packets
 532
 533 The actual handoff mechanics are simple, and integrate nicely with
 534 a typical graph-node dispatch function:
 535
 536 ```c
 537     always_inline uword
 538     do_handoff_inline (vlib_main_t * vm,
 539                        vlib_node_runtime_t * node, vlib_frame_t * frame,
 540                        int is_ip4, int is_trace)
 541     {
 542       u32 n_left_from, *from;
 543       vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b;
 544       u16 thread_indices [VLIB_FRAME_SIZE];
 545       u16 nexts[VLIB_FRAME_SIZE], *next;
 546       u32 n_enq;
 547       htest_main_t *hmp = &htest_main;
 548       int i;
 549
 550       from = vlib_frame_vector_args (frame);
 551       n_left_from = frame->n_vectors;
 552
 553       vlib_get_buffers (vm, from, bufs, n_left_from);
 554       next = nexts;
 555       b = bufs;
 556
 557       /*
 558        * Typical frame traversal loop, details vary with
 559        * use case. Make sure to set thread_indices[i] with
 560        * the desired destination thread index. You may
 561        * or may not bother to set next[i].
 562        */
 563
 564       for (i = 0; i < frame->n_vectors; i++)
 565         {
 566           <snip>
 567           /* Pick a thread to handle this packet */
 568           thread_indices[i] = f (packet_data_or_whatever);
 569           <snip>
 570
 571           b += 1;
 572           next += 1;
 573           n_left_from -= 1;
 574         }
 575
 576        /* Enqueue buffers to threads */
 577        n_enq =
 578         vlib_buffer_enqueue_to_thread (vm, hmp->frame_queue_index,
 579                                        from, thread_indices, frame->n_vectors,
 580                                        1 /* drop on congestion */);
 581        /* Typical counters,
 582       if (n_enq < frame->n_vectors)
 583         vlib_node_increment_counter (vm, node->node_index,
 584                                  XXX_ERROR_CONGESTION_DROP,
 585                                  frame->n_vectors - n_enq);
 586       vlib_node_increment_counter (vm, node->node_index,
 587                                  XXX_ERROR_HANDED_OFF, n_enq);
 588       return frame->n_vectors;
 589 }
 590 ```
 591
 592 Notes about calling vlib_buffer_enqueue_to_thread(...):
 593
 594 * If you pass "drop on congestion" non-zero, all packets in the
 595 inbound frame will be consumed one way or the other. This is the
 596 recommended setting.
 597
 598 * In the drop-on-congestion case, please don't try to "help" in the
 599 enqueue node by freeing dropped packets, or by pushing them to
 600 "error-drop." Either of those actions would be a severe error.
 601
 602 * It's perfectly OK to enqueue packets to the current thread.