VLIB (Vector Processing Library)
================================

The files associated with vlib are located in the ./src/{vlib,
vlibapi, vlibmemory} folders. These libraries provide vector
processing support including graph-node scheduling, reliable multicast
support, ultra-lightweight cooperative multi-tasking threads, a CLI,
plug-in .DLL support, physical memory and Linux epoll support. Parts of
this library embody US Patent 7,961,636.

Init function discovery
-----------------------

vlib applications register for various \[initialization\] events by
placing structures and \_\_attribute\_\_((constructor)) functions into
the image. At appropriate times, the vlib framework walks
constructor-generated singly-linked structure lists, performs a
topological sort based on specified constraints, and calls the
indicated functions. Vlib applications create graph nodes, add CLI
functions, start cooperative multi-tasking threads, etc. using
this mechanism.

vlib applications invariably include a number of VLIB\_INIT\_FUNCTION
(my\_init\_function) macros.

Each init / configure / etc. function has the return type clib\_error\_t
\*. Make sure that the function returns 0 if all is well, otherwise the
framework will announce an error and exit.

vlib applications must link against vppinfra, and often link against
other libraries such as VNET. In the latter case, it may be necessary to
explicitly reference symbol(s), otherwise large portions of the library
may be AWOL at runtime.

### Init function construction and constraint specification

It's easy to add an init function:

    static clib_error_t *my_init_function (vlib_main_t *vm)
    {
       /* ... initialize things ... */

       return 0; // or return clib_error_return (0, "BROKEN!");
    }
    VLIB_INIT_FUNCTION(my_init_function);

As given, my_init_function will be executed "at some point," but with
no ordering guarantees.

Specifying ordering constraints is easy:

    VLIB_INIT_FUNCTION(my_init_function) =
    {
       .runs_before = VLIB_INITS("we_run_before_function_1",
                                 "we_run_before_function_2"),
       .runs_after = VLIB_INITS("we_run_after_function_1",
                                "we_run_after_function_2"),
    };

It's also easy to specify bulk ordering constraints of the form "a
then b then c then d":

    VLIB_INIT_FUNCTION(my_init_function) =
    {
       .init_order = VLIB_INITS("a", "b", "c", "d"),
    };

It's OK to specify all three sorts of ordering constraints for a
single init function, although it's hard to imagine why it would be
necessary.

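To make the ordering step concrete, here is a minimal standalone sketch
of a constraint-driven topological sort (Kahn's algorithm). It is not
the vlib implementation: init\_fn\_t, find\_fn, sort\_init\_fns and the
"before" matrix are hypothetical names invented for illustration, and
the real framework may use a different algorithm and data layout.

```c
#include <assert.h>
#include <string.h>

#define MAX_FNS 16

/* Hypothetical stand-in for a registered init function. */
typedef struct
{
  const char *name;
} init_fn_t;

/* Return the index of name in fns[], or -1 if it is not registered. */
static int
find_fn (const init_fn_t *fns, int n, const char *name)
{
  for (int i = 0; i < n; i++)
    if (strcmp (fns[i].name, name) == 0)
      return i;
  return -1;
}

/*
 * Kahn's algorithm. before[i][j] != 0 means "function i must run
 * before function j". Writes a valid order into order[] and returns 0,
 * or returns -1 if the constraints contain a cycle.
 */
static int
sort_init_fns (int n, unsigned char before[MAX_FNS][MAX_FNS], int *order)
{
  int indegree[MAX_FNS] = { 0 };
  unsigned char done[MAX_FNS] = { 0 };
  int n_sorted = 0;

  assert (n <= MAX_FNS);

  /* Count, for each function, how many others must run first. */
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      if (before[i][j])
        indegree[j]++;

  while (n_sorted < n)
    {
      int picked = -1;

      /* Pick the first function with no unsatisfied predecessor. */
      for (int i = 0; i < n && picked < 0; i++)
        if (!done[i] && indegree[i] == 0)
          picked = i;

      if (picked < 0)
        return -1; /* everything left is part of a constraint cycle */

      done[picked] = 1;
      order[n_sorted++] = picked;

      /* Release the functions that were waiting on the one just picked. */
      for (int j = 0; j < n; j++)
        if (before[picked][j])
          indegree[j]--;
    }
  return 0;
}
```

In this sketch a "runs\_before" constraint sets before\[me\]\[other\], a
"runs\_after" constraint sets before\[other\]\[me\], and a bulk "a then b
then c" constraint sets one entry per adjacent pair.
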
Node Graph Initialization
-------------------------

vlib packet-processing applications invariably define a set of graph
nodes to process packets.

One constructs a vlib\_node\_registration\_t, most often via the
VLIB\_REGISTER\_NODE macro. At runtime, the framework processes the set
of such registrations into a directed graph. It is easy enough to add
nodes to the graph at runtime. The framework does not support removing
nodes.

vlib provides several types of vector-processing graph nodes, primarily
to control framework dispatch behaviors. The type member of the
vlib\_node\_registration\_t functions as follows:

- VLIB\_NODE\_TYPE\_PRE\_INPUT - run before all other node types
- VLIB\_NODE\_TYPE\_INPUT - run as often as possible, after pre\_input
  nodes
- VLIB\_NODE\_TYPE\_INTERNAL - only when explicitly made runnable by
  adding pending frames for processing
- VLIB\_NODE\_TYPE\_PROCESS - only when explicitly made runnable.
  "Process" nodes are actually cooperative multi-tasking threads. They
  **must** explicitly suspend after a reasonably short period of time.

For a precise understanding of the graph node dispatcher, please read
./src/vlib/main.c:vlib\_main\_loop.

Graph node dispatcher
---------------------

Vlib\_main\_loop() dispatches graph nodes. The basic vector processing
algorithm is diabolically simple, but may not be obvious from even a
long stare at the code. Here's how it works: some input node, or set of
input nodes, produces a vector of work to process. The graph node
dispatcher pushes the work vector through the directed graph,
subdividing it as needed, until the original work vector has been
completely processed. At that point, the process recurs.

This scheme yields a stable equilibrium in frame size, by construction.
Here's why: as the frame size increases, the per-frame-element
processing time decreases. There are several related forces at work; the
simplest to describe is the effect of vector processing on the CPU L1
I-cache. The first frame element \[packet\] processed by a given node
warms up the node dispatch function in the L1 I-cache. All subsequent
frame elements profit. As we increase the number of frame elements, the
cost per element goes down.

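The amortization argument can be put into numbers with a toy cost
model. This is a sketch under stated assumptions, not a measurement:
FIXED\_COST\_PER\_FRAME and COST\_PER\_ELEMENT are made-up cycle counts,
and cycles\_per\_element is a hypothetical helper.

```c
#include <assert.h>

/* Hypothetical cycle counts, chosen only to illustrate the shape. */
#define FIXED_COST_PER_FRAME 1000u /* e.g. I-cache warm-up, dispatch overhead */
#define COST_PER_ELEMENT 100u      /* work done for each frame element */

/* Average cycles spent per element when a frame holds n_elements. */
static unsigned
cycles_per_element (unsigned n_elements)
{
  assert (n_elements > 0);
  return (FIXED_COST_PER_FRAME + COST_PER_ELEMENT * n_elements) / n_elements;
}
```

With these numbers a 1-element frame costs 1100 cycles per element, a
10-element frame 200, and a 250-element frame 104. A dispatcher that
falls behind therefore sees larger frames on its next pass, processes
each element more cheaply, and catches up.
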
Under light load, it is a crazy waste of CPU cycles to run the graph
node dispatcher flat-out. So, the graph node dispatcher arranges to wait
for work by sitting in a timed epoll wait if the prevailing frame size
is low. The scheme has a certain amount of hysteresis to avoid
constantly toggling back and forth between interrupt and polling mode.
Although the graph dispatcher supports interrupt and polling modes, our
current default device drivers do not.

The graph node scheduler uses a hierarchical timer wheel to reschedule
process nodes upon timer expiration.

Graph dispatcher internals
--------------------------

This section may be safely skipped. It's not necessary to understand
graph dispatcher internals to create graph nodes.

Vector Data Structure
---------------------

In vpp / vlib, we represent vectors as instances of the vlib_frame_t type:

    typedef struct vlib_frame_t
    {
      /* Number of scalar bytes in arguments. */
      u8 scalar_size;

      /* Number of bytes per vector argument. */
      u8 vector_size;

      /* Number of vector elements currently in frame. */
      u16 n_vectors;

      /* Scalar and vector arguments to next node. */
      u8 arguments[0];
    } vlib_frame_t;

Note that one _could_ construct all kinds of vectors - including
vectors with some associated scalar data - using this structure. In
the vpp application, vectors typically use a 4-byte vector element
size, and zero bytes' worth of associated per-frame scalar data.

Frames are always allocated on CLIB_CACHE_LINE_BYTES boundaries.
Frames have u32 indices which make use of the alignment property, so
the maximum feasible main heap offset of a frame is
CLIB_CACHE_LINE_BYTES * 0xFFFFFFFF: with 64-byte cache lines, roughly
64 * 4 G = 256 Gbytes.

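The index arithmetic can be checked with a short standalone sketch. It
assumes a 64-byte cache line; CACHE\_LINE\_BYTES, frame\_index\_to\_offset
and frame\_offset\_to\_index are hypothetical names, not vlib functions.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u /* assumed value of CLIB_CACHE_LINE_BYTES */

/* A frame index counts cache lines from the start of the heap. */
static uint64_t
frame_index_to_offset (uint32_t frame_index)
{
  return (uint64_t) frame_index * CACHE_LINE_BYTES;
}

/* Only cache-line-aligned offsets below the 32-bit limit have an index. */
static uint32_t
frame_offset_to_index (uint64_t heap_offset)
{
  assert (heap_offset % CACHE_LINE_BYTES == 0);
  assert (heap_offset / CACHE_LINE_BYTES <= UINT32_MAX);
  return (uint32_t) (heap_offset / CACHE_LINE_BYTES);
}
```

The largest index, 0xFFFFFFFF, maps to offset 2^38 - 64 bytes, which is
just under 256 Gbytes.
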
As you can see, vectors are not directly associated with graph
nodes. We represent that association in a couple of ways. The
simplest is the vlib\_pending\_frame\_t:

    /* A frame pending dispatch by main loop. */
    typedef struct
    {
      /* Node and runtime for this frame. */
      u32 node_runtime_index;

      /* Frame index (in the heap). */
      u32 frame_index;

      /* Start of next frames for this node. */
      u32 next_frame_index;

      /* Special value for next_frame_index when there is no next frame. */
    #define VLIB_PENDING_FRAME_NO_NEXT_FRAME ((u32) ~0)
    } vlib_pending_frame_t;

Here is the code in .../src/vlib/main.c:vlib_main_or_worker_loop()
which processes frames:

    /*
     * Input nodes may have added work to the pending vector.
     * Process pending vector until there is nothing left.
     * All pending vectors will be processed from input -> output.
     */
    for (i = 0; i < _vec_len (nm->pending_frames); i++)
      cpu_time_now = dispatch_pending_node (vm, i, cpu_time_now);
    /* Reset pending vector for next iteration. */
    _vec_len (nm->pending_frames) = 0;

The pending frame node_runtime_index associates the frame with the
node which will process it.

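One detail of that loop is easy to miss: the pending vector grows while
it is being walked, because dispatching one node typically queues work
for the next. The following standalone sketch simulates this with a
hypothetical three-node chain. None of the names (pending\_t,
add\_pending, dispatch\_pending, run\_one\_iteration) come from vlib.

```c
#include <assert.h>

#define MAX_PENDING 32
#define NO_NEXT (-1)

/* Hypothetical chain: node 0 -> node 1 -> node 2, where node 2 is a sink. */
static const int next_node[3] = { 1, 2, NO_NEXT };

typedef struct
{
  int node;      /* node that will process this work */
  int n_vectors; /* number of elements in the work vector */
} pending_t;

static pending_t pending[MAX_PENDING];
static int n_pending;
static int visit_order[MAX_PENDING];
static int n_visits;

static void
add_pending (int node, int n_vectors)
{
  assert (n_pending < MAX_PENDING);
  pending[n_pending].node = node;
  pending[n_pending].n_vectors = n_vectors;
  n_pending++;
}

/* "Process" one pending entry and hand its work to the next node. */
static void
dispatch_pending (int i)
{
  pending_t p = pending[i]; /* copy first: the array grows below */

  visit_order[n_visits++] = p.node;
  if (next_node[p.node] != NO_NEXT)
    add_pending (next_node[p.node], p.n_vectors);
}

/* One main-loop pass; returns the number of pending entries processed. */
static int
run_one_iteration (int input_vectors)
{
  int processed;

  n_pending = 0;
  n_visits = 0;
  add_pending (0, input_vectors); /* what an input node would do */

  /* The bound is re-read on every pass, so newly added work is seen. */
  for (int i = 0; i < n_pending; i++)
    dispatch_pending (i);

  processed = n_pending;
  n_pending = 0; /* reset pending vector for next iteration */
  return processed;
}
```

A single input vector produces three pending entries in this sketch, one
per node, visited in graph order.
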
Fasten your seatbelt. Here's where the story - and the data structures -
become quite complicated...

At 100,000 feet: vpp uses a directed graph, not a directed _acyclic_
graph. It's really quite normal for a packet to visit ip\[46\]-lookup
multiple times. The worst-case: a graph node which enqueues packets to
itself.

To deal with this issue, the graph dispatcher must force allocation of
a new frame if the current graph node's dispatch function happens to
enqueue a packet back to itself.

There are no guarantees that a pending frame will be processed
immediately, which means that more packets may be added to the
underlying vlib_frame_t after it has been attached to a
vlib_pending_frame_t. Care must be taken to allocate new
frames and pending frames if a (pending\_frame, frame) pair fills.

Next frames, next frame ownership
---------------------------------

The vlib\_next\_frame\_t is the last key graph dispatcher data structure:

    typedef struct
    {
      /* Frame index. */
      u32 frame_index;

      /* Node runtime for this next. */
      u32 node_runtime_index;

      /* Next frame flags. */
      u32 flags;

      /* Reflects node frame-used flag for this next. */
    #define VLIB_FRAME_NO_FREE_AFTER_DISPATCH \
      VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH

      /* This next frame owns enqueue to node
         corresponding to node_runtime_index. */
    #define VLIB_FRAME_OWNER (1 << 15)

      /* Set when frame has been allocated for this next. */
    #define VLIB_FRAME_IS_ALLOCATED VLIB_NODE_FLAG_IS_OUTPUT

      /* Set when frame has been added to pending vector. */
    #define VLIB_FRAME_PENDING VLIB_NODE_FLAG_IS_DROP

      /* Set when frame is to be freed after dispatch. */
    #define VLIB_FRAME_FREE_AFTER_DISPATCH VLIB_NODE_FLAG_IS_PUNT

      /* Set when frame has traced packets. */
    #define VLIB_FRAME_TRACE VLIB_NODE_FLAG_TRACE

      /* Number of vectors enqueue to this next since last overflow. */
      u32 vectors_since_last_overflow;
    } vlib_next_frame_t;

Graph node dispatch functions call vlib\_get\_next\_frame (...) to
set "(u32 \*)to_next" to the right place in the vlib_frame_t
corresponding to the ith arc (aka next0) from the current node to the
next node.

After some scuffling around - two levels of macros - processing
reaches vlib\_get\_next\_frame_internal (...). Get-next-frame-internal
digs up the vlib\_next\_frame\_t corresponding to the desired graph
arc.

The next frame data structure amounts to a graph-arc-centric frame
cache. Once a node finishes adding elements to a frame, the frame will
acquire a vlib_pending_frame_t and end up on the graph dispatcher's
run-queue. But there's no guarantee that more vector elements won't be
added to the underlying frame from the same (source\_node,
next\_index) arc or from a different (source\_node, next\_index) arc.

Maintaining consistency of the arc-to-frame cache is necessary. The
first step in maintaining consistency is to make sure that only one
graph node at a time thinks it "owns" the target vlib\_frame\_t.

Back to the graph node dispatch function. In the usual case, a certain
number of packets will be added to the vlib\_frame\_t acquired by
calling vlib\_get\_next\_frame (...).

Before a dispatch function returns, it's required to call
vlib\_put\_next\_frame (...) for all of the graph arcs it actually
used. This action adds a vlib\_pending\_frame\_t to the graph
dispatcher's pending frame vector.

Vlib\_put\_next\_frame makes a note in the pending frame of the frame
index, and also of the vlib\_next\_frame\_t index.

dispatch\_pending\_node actions
-------------------------------

The main graph dispatch loop calls dispatch pending node as shown
above.

Dispatch\_pending\_node recovers the pending frame, and the graph node
runtime / dispatch function. Further, it recovers the next\_frame
currently associated with the vlib\_frame\_t, and detaches the
vlib\_frame\_t from the next\_frame.

In .../src/vlib/main.c:dispatch\_pending\_node(...), note this stanza:

    /* Force allocation of new frame while current frame is being
       dispatched. */
    restore_frame_index = ~0;
    if (nf->frame_index == p->frame_index)
      {
        nf->frame_index = ~0;
        nf->flags &= ~VLIB_FRAME_IS_ALLOCATED;
        if (!(n->flags & VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH))
          restore_frame_index = p->frame_index;
      }

dispatch\_pending\_node is worth a hard stare due to the several
second-order optimizations it implements. Almost as an afterthought,
it calls dispatch_node which actually calls the graph node dispatch
function.

Process / thread model
----------------------

vlib provides an ultra-lightweight cooperative multi-tasking thread
model. The graph node scheduler invokes these processes in much the same
way as traditional vector-processing run-to-completion graph nodes;
plus-or-minus a setjmp/longjmp pair required to switch stacks. Simply
set the vlib\_node\_registration\_t type field to
VLIB\_NODE\_TYPE\_PROCESS. Yes, process is a misnomer. These are
cooperative multi-tasking threads.

As of this writing, the default stack size is 1<<15 = 32 KB.
Initialize the node registration's process\_log2\_n\_stack\_bytes member
as needed. The graph node dispatcher makes some effort to detect stack
overrun, e.g. by mapping a no-access page below each thread stack.

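Because the stack size is given as a base-2 logarithm, it helps to see
the conversion spelled out. The sketch below is standalone and makes two
assumptions that are not confirmed by the text above: that the default
exponent is 15, and that a smaller requested exponent falls back to the
default. DEFAULT\_LOG2\_STACK\_BYTES and process\_stack\_bytes are
hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed default exponent: 1 << 15 = 32768 bytes = 32 KB. */
#define DEFAULT_LOG2_STACK_BYTES 15u

/* Convert a requested log2 stack size to bytes, never below the default. */
static size_t
process_stack_bytes (unsigned requested_log2)
{
  unsigned log2_bytes = requested_log2 > DEFAULT_LOG2_STACK_BYTES
                          ? requested_log2
                          : DEFAULT_LOG2_STACK_BYTES;

  assert (log2_bytes < 8 * sizeof (size_t)); /* the shift must not overflow */
  return (size_t) 1 << log2_bytes;
}
```

Under these assumptions, leaving the member at 0 yields 32 KB, and
setting it to 16 yields 64 KB.
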
Process node dispatch functions are expected to be "while(1) { }" loops
which suspend when not otherwise occupied, and which must not run for
unreasonably long periods of time.

"Unreasonably long" is an application-dependent concept. Over the years,
we have constructed frame-size sensitive control-plane nodes which will
use a much higher fraction of the available CPU bandwidth when the frame
size is low. The classic example: modifying forwarding tables. So long
as the table-builder leaves the forwarding tables in a valid state, one
can suspend the table builder to avoid dropping packets as a result of
control-plane activity.

Process nodes can suspend for fixed amounts of time, or until another
entity signals an event, or both. See the next section for a description
of the vlib process event mechanism.

When running in vlib process context, one must pay strict attention to
loop invariant issues. If one walks a data structure and calls a
function which may suspend, one had best know by construction that it
cannot change. Often, it's best to simply make a snapshot copy of a data
structure, walk the copy at leisure, then free the copy.

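The snapshot pattern can be illustrated without any vlib machinery. In
this standalone sketch, suspend\_and\_let\_others\_run is a hypothetical
stand-in for any call that may suspend; it simulates another process
deleting a table entry while the walker is suspended. The table and
function names are invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared table that other processes may modify. */
static int live_table[8] = { 1, 2, 3 };
static size_t n_live = 3;

/* Stand-in for a suspending call: while we sleep, an entry disappears. */
static void
suspend_and_let_others_run (void)
{
  if (n_live > 0)
    n_live--;
}

/* Sum the entries that existed when the walk started. */
static int
sum_with_snapshot (void)
{
  size_t n = n_live;
  int *copy = malloc (n * sizeof (copy[0]));
  int sum = 0;

  assert (copy != NULL || n == 0);
  if (n > 0)
    memcpy (copy, live_table, n * sizeof (copy[0]));

  /* Walk the private copy; changes to live_table cannot affect the loop. */
  for (size_t i = 0; i < n; i++)
    {
      sum += copy[i];
      suspend_and_let_others_run ();
    }

  free (copy);
  return sum;
}
```

Had the loop been bounded by n\_live directly, it would have stopped
early as entries vanished. The snapshot makes the loop bound and the
data it reads invariant across suspensions.
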
The vlib process event mechanism API is extremely lightweight and easy
to use. Here is a typical example:

    vlib_main_t *vm = &vlib_global_main;
    uword event_type, * event_data = 0;

    while (1)
      {
        vlib_process_wait_for_event_or_clock (vm, 5.0 /* seconds */);

        event_type = vlib_process_get_events (vm, &event_data);

        switch (event_type) {
        case EVENT1:
            handle_event1s (event_data);
            break;

        case EVENT2:
            handle_event2s (event_data);
            break;

        case ~0: /* 5-second idle/periodic */
            break;
        }

        vec_reset_length(event_data);
      }

In this example, the VLIB process node waits for an event to occur, or
for 5 seconds to elapse. The code demuxes on the event type, calling
the appropriate handler function. Each call to
vlib\_process\_get\_events returns a vector of per-event-type data
passed to successive vlib\_process\_signal\_event calls; it is a
serious error to process only event\_data\[0\].

Resetting the event\_data vector-length to 0 \[instead of calling
vec\_free\] means that the event scheme doesn't burn cycles continuously
allocating and freeing the event data vector. This is a common vppinfra
/ vlib coding pattern, well worth using when appropriate.

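The reset-instead-of-free pattern is easy to demonstrate with a tiny
growable array. This is a standalone sketch, not vppinfra: event\_vec\_t
and its helper functions are hypothetical, and vppinfra vectors are
implemented differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable array of event data words. */
typedef struct
{
  unsigned long *data;
  size_t len; /* elements in use */
  size_t cap; /* elements allocated */
} event_vec_t;

/* Append one element, growing the allocation only when it is full. */
static void
event_vec_add (event_vec_t *v, unsigned long x)
{
  if (v->len == v->cap)
    {
      size_t new_cap = v->cap ? 2 * v->cap : 4;
      unsigned long *p = realloc (v->data, new_cap * sizeof (p[0]));

      assert (p != NULL);
      v->data = p;
      v->cap = new_cap;
    }
  v->data[v->len++] = x;
}

/* Forget the contents but keep the memory for the next round. */
static void
event_vec_reset_length (event_vec_t *v)
{
  v->len = 0;
}

/* Release the memory; the next add must allocate again. */
static void
event_vec_free (event_vec_t *v)
{
  free (v->data);
  v->data = NULL;
  v->len = v->cap = 0;
}
```

After event\_vec\_reset\_length the next batch of events lands in the
same allocation, so a long-running loop settles into zero allocator
calls once the array has grown to its working size.
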
Signaling an event is easy, for example:

    vlib_process_signal_event (vm, process_node_index, EVENT1,
        (uword)arbitrary_event1_data); /* and so forth */

One can either know the process node index by construction, digging it
out of the appropriate vlib\_node\_registration\_t, or look it up by
finding the vlib\_node\_t with vlib\_get\_node\_by\_name(...).

vlib buffering solves the usual set of packet-processing problems,
albeit at high performance. Key in terms of performance: one ordinarily
allocates / frees N buffers at a time rather than one at a time. Except
when operating directly on a specific buffer, one deals with buffers by
index, not by pointer.

Packet-processing frames are u32\[\] arrays, not
vlib\_buffer\_t\[\] arrays.

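To show what "frames of indices" means in practice, here is a standalone
sketch that resolves a u32 index array into buffer pointers in one
batch, then walks the pointers. The pool layout and all names
(sketch\_buffer\_t, sketch\_get\_buffers, frame\_total\_bytes) are
hypothetical; the real buffer header and index-to-pointer translation
are more involved.

```c
#include <assert.h>
#include <stdint.h>

#define N_BUFFERS 8u
#define MAX_FRAME 4u

/* Hypothetical, heavily simplified buffer: a header field plus data. */
typedef struct
{
  uint16_t current_length; /* bytes of packet data in this buffer */
  uint8_t data[64];
} sketch_buffer_t;

static sketch_buffer_t buffer_pool[N_BUFFERS];

/* Translate one buffer index into a pointer. */
static sketch_buffer_t *
sketch_get_buffer (uint32_t bi)
{
  assert (bi < N_BUFFERS);
  return &buffer_pool[bi];
}

/* Translate a whole frame of indices in one pass. */
static void
sketch_get_buffers (const uint32_t *bis, sketch_buffer_t **bufs, uint32_t n)
{
  for (uint32_t i = 0; i < n; i++)
    bufs[i] = sketch_get_buffer (bis[i]);
}

/* Example consumer: total packet bytes referenced by a frame. */
static uint32_t
frame_total_bytes (const uint32_t *frame, uint32_t n_vectors)
{
  sketch_buffer_t *bufs[MAX_FRAME];
  uint32_t total = 0;

  assert (n_vectors <= MAX_FRAME);
  sketch_get_buffers (frame, bufs, n_vectors);
  for (uint32_t i = 0; i < n_vectors; i++)
    total += bufs[i]->current_length;
  return total;
}
```

Passing 4-byte indices between nodes keeps frames compact, and a pointer
is materialized only by the node that actually touches the buffer.
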
Packets comprise one or more vlib buffers, chained together as required.
Multiple particle sizes are supported; hardware input nodes simply ask
for the required size(s). Coalescing support is available. For obvious
reasons one is discouraged from writing one's own wild and wacky buffer
chain traversal code.

vlib buffer headers are allocated immediately prior to the buffer data
area. In typical packet processing this saves a dependent read wait:
given a buffer's address, one can prefetch the buffer header
\[metadata\] at the same time as the first cache line of buffer data.

Buffer header metadata (vlib\_buffer\_t) includes the usual rewrite
expansion space, a current\_data offset, RX and TX interface indices,
packet trace information, and opaque areas.

The opaque data is intended to control packet processing in arbitrary
subgraph-dependent ways. The programmer shoulders responsibility for
data lifetime analysis, type-checking, etc.

Buffers have reference-counts in support of e.g. multicast replication.

Shared-memory message API
-------------------------

Local control-plane and application processes interact with the vpp
dataplane via asynchronous message-passing in shared memory over
unidirectional queues. The same application APIs are available via
sockets.

Capturing API traces and replaying them in a simulation environment
requires a disciplined approach to the problem. This seems like a
make-work task, but it is not. When something goes wrong in the
control-plane after 300,000 or 3,000,000 operations, high-speed replay
of the events leading up to the accident is a huge win.

The shared-memory message API message allocator vl\_api\_msg\_alloc uses
a particularly cute trick. Since messages are processed in order, we try
to allocate message buffering from a set of fixed-size, preallocated
rings. Each ring item has a "busy" bit. Freeing one of the preallocated
message buffers merely requires the message consumer to clear the busy
bit. No locking required.

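The busy-bit trick can be sketched in a few lines. This is a
single-threaded illustration of the idea only, not the
vl\_api\_msg\_alloc implementation: msg\_ring\_t, ring\_alloc and
ring\_free are hypothetical names, the sizes are arbitrary, and a real
producer/consumer pair on different CPUs would also need appropriate
memory barriers.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 4
#define MSG_BYTES 64

typedef struct
{
  volatile unsigned char busy; /* set by allocator, cleared by consumer */
  unsigned char payload[MSG_BYTES];
} ring_item_t;

typedef struct
{
  ring_item_t items[RING_SLOTS];
  unsigned next; /* slot the allocator will try next */
} msg_ring_t;

/*
 * Hand out slots strictly in order. If the next slot is still busy the
 * ring is exhausted; return NULL so the caller can fall back to a
 * slower allocator.
 */
static void *
ring_alloc (msg_ring_t *r)
{
  ring_item_t *it = &r->items[r->next];

  if (it->busy)
    return NULL;
  it->busy = 1;
  r->next = (r->next + 1) % RING_SLOTS;
  return it->payload;
}

/* Freeing is a single store by the consumer: clear the busy bit. */
static void
ring_free (void *payload)
{
  ring_item_t *it = (ring_item_t *) ((unsigned char *) payload
                                     - offsetof (ring_item_t, payload));

  assert (it->busy);
  it->busy = 0;
}
```

Because messages are consumed in the order they were allocated, the
oldest slot is always the first to be released, so checking only the
next slot is enough.
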
Adding debug CLI commands to VLIB applications is very simple.

Here is a complete example:

    static clib_error_t *
    show_ip_tuple_match (vlib_main_t * vm,
                         unformat_input_t * input,
                         vlib_cli_command_t * cmd)
    {
        vlib_cli_output (vm, "%U\n", format_ip_tuple_match_tables, &routing_main);
        return 0;
    }

    static VLIB_CLI_COMMAND (show_ip_tuple_command) =
    {
        .path = "show ip tuple match",
        .short_help = "Show ip 5-tuple match-and-broadcast tables",
        .function = show_ip_tuple_match,
    };

This example implements the "show ip tuple match" debug cli
command. In ordinary usage, the vlib cli is available via the "vppctl"
application, which sends traffic to a named pipe. One can configure
debug CLI telnet access on a configurable port.

The cli implementation has an output redirection facility which makes it
simple to deliver cli output via shared-memory API messaging.

Particularly for debug or "show tech support" type commands, it would be
wasteful to write vlib application code to pack binary data, write more
code elsewhere to unpack the data and finally print the answer. If a
certain cli command has the potential to hurt packet processing
performance by running for too long, do the work incrementally in a
process node. The client can wait.

Handing off buffers between threads
-----------------------------------

Vlib includes an easy-to-use mechanism for handing off buffers between
worker threads. A typical use-case: software ingress flow hashing. At
a high level, one creates a per-worker-thread queue which sends packets
to a specific graph node in the indicated worker thread. With the
queue in hand, enqueue packets to the worker thread of your choice.

### Initialize a handoff queue

Simple enough, call vlib_frame_queue_main_init:

    main_ptr->frame_queue_index
        = vlib_frame_queue_main_init (dest_node.index, frame_queue_size);

Frame_queue_size means what it says: the number of frames which may be
queued. Since frames contain 1...256 packets, frame_queue_size should
be a reasonably small number (32...64). If the frame queue producer(s)
are faster than the frame queue consumer(s), congestion will
occur. We suggest letting the enqueue operator deal with queue
congestion, as shown in the enqueue example below.

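The congestion behavior of a bounded frame queue can be sketched
independently of vlib. In this standalone example, handoff\_queue\_t,
enqueue\_frames and dequeue\_frame are hypothetical names and the queue
holds only 4 entries to keep the numbers small; the real frame queue is
a per-thread structure shared between CPUs and is more elaborate.

```c
#include <assert.h>

#define QUEUE_SLOTS 4 /* plays the role of frame_queue_size */

typedef struct
{
  unsigned frames[QUEUE_SLOTS];
  unsigned head; /* total frames ever enqueued */
  unsigned tail; /* total frames ever dequeued */
} handoff_queue_t;

/*
 * Enqueue up to n frames. Frames that do not fit are counted in
 * *n_dropped instead of blocking the producer.
 */
static unsigned
enqueue_frames (handoff_queue_t *q, const unsigned *frames, unsigned n,
                unsigned *n_dropped)
{
  unsigned n_enq = 0;

  for (unsigned i = 0; i < n; i++)
    {
      assert (q->head - q->tail <= QUEUE_SLOTS);
      if (q->head - q->tail == QUEUE_SLOTS)
        break; /* queue full: congestion */
      q->frames[q->head % QUEUE_SLOTS] = frames[i];
      q->head++;
      n_enq++;
    }
  *n_dropped = n - n_enq;
  return n_enq;
}

/* Dequeue one frame; returns 1 on success, 0 if the queue is empty. */
static int
dequeue_frame (handoff_queue_t *q, unsigned *frame)
{
  if (q->head == q->tail)
    return 0;
  *frame = q->frames[q->tail % QUEUE_SLOTS];
  q->tail++;
  return 1;
}
```

A producer that offers 6 frames to an empty 4-slot queue gets 4 accepted
and 2 reported as dropped. The caller can then account for the drops in
a counter, which is the same shape as the enqueue example later in this
section.
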
Under the floorboards, vlib_frame_queue_main_init creates an input queue
for each worker thread.

Please do NOT create frame queues until it's clear that they will be
used. Although the main dispatch loop is reasonably smart about how
often it polls the (entire set of) frame queues, polling unused frame
queues is a waste of clock cycles.

The actual handoff mechanics are simple, and integrate nicely with
a typical graph-node dispatch function:

    always_inline uword
    do_handoff_inline (vlib_main_t * vm,
                       vlib_node_runtime_t * node, vlib_frame_t * frame,
                       int is_ip4, int is_trace)
    {
      u32 n_left_from, *from;
      vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b;
      u16 thread_indices [VLIB_FRAME_SIZE];
      u16 nexts[VLIB_FRAME_SIZE], *next;
      u32 n_enq;
      htest_main_t *hmp = &htest_main;
      int i;

      from = vlib_frame_vector_args (frame);
      n_left_from = frame->n_vectors;

      vlib_get_buffers (vm, from, bufs, n_left_from);

      /*
       * Typical frame traversal loop, details vary with
       * use case. Make sure to set thread_indices[i] with
       * the desired destination thread index. You may
       * or may not bother to set next[i].
       */

      for (i = 0; i < frame->n_vectors; i++)
        {
          /* Pick a thread to handle this packet */
          thread_indices[i] = f (packet_data_or_whatever);
        }

      /* Enqueue buffers to threads */
      n_enq =
        vlib_buffer_enqueue_to_thread (vm, hmp->frame_queue_index,
                                       from, thread_indices, frame->n_vectors,
                                       1 /* drop on congestion */);

      if (n_enq < frame->n_vectors)
        vlib_node_increment_counter (vm, node->node_index,
                                     XXX_ERROR_CONGESTION_DROP,
                                     frame->n_vectors - n_enq);
      vlib_node_increment_counter (vm, node->node_index,
                                   XXX_ERROR_HANDED_OFF, n_enq);
      return frame->n_vectors;
    }

Notes about calling vlib_buffer_enqueue_to_thread(...):

* If you pass "drop on congestion" non-zero, all packets in the
  inbound frame will be consumed one way or the other. This is the
  recommended setting.

* In the drop-on-congestion case, please don't try to "help" in the
  enqueue node by freeing dropped packets, or by pushing them to
  "error-drop." Either of those actions would be a severe error.

* It's perfectly OK to enqueue packets to the current thread.