docs/gettingstarted/developers/infrastructure.md

   1 VPPINFRA (Infrastructure)
   2 =========================
   3
   4 The files associated with the VPP Infrastructure layer are located in
   5 the ./src/vppinfra folder.
   6
   7 VPPinfra is a collection of basic c-library services, quite
   8 sufficient to build standalone programs to run directly on bare metal.
   9 It also provides high-performance dynamic arrays, hashes, bitmaps,
  10 high-precision real-time clock support, fine-grained event-logging, and
  11 data structure serialization.
  12
  13 One fair comment / fair warning about vppinfra: you can\'t always tell a
  14 macro from an inline function from an ordinary function simply by name.
  15 Macros are used to avoid function calls in the typical case, and to
  16 cause (intentional) side-effects.
  17
  18 Vppinfra has been around for almost 20 years and tends not to change
  19 frequently. The VPP Infrastructure layer contains the following
  20 functions:
  21
  22 Vectors
  23 -------
  24
  25 Vppinfra vectors are ubiquitous dynamically resized arrays with by user
  26 defined \"headers\". Many vpppinfra data structures (e.g. hash, heap,
  27 pool) are vectors with various different headers.
  28
  29 The memory layout looks like this:
  30
  31 ```
  32                    User header (optional, uword aligned)
  33                    Alignment padding (if needed)
  34                    Vector length in elements
  35  User's pointer -> Vector element 0
  36                    Vector element 1
  37                    ...
  38                    Vector element N-1
  39 ```
  40
  41 As shown above, the vector APIs deal with pointers to the 0th element of
  42 a vector. Null pointers are valid vectors of length zero.
  43
  44 To avoid thrashing the memory allocator, one often resets the length of
  45 a vector to zero while retaining the memory allocation. Set the vector
  46 length field to zero via the vec\_reset\_length(v) macro. \[Use the
  47 macro! It's smart about NULL pointers.\]
  48
  49 Typically, the user header is not present. User headers allow for other
  50 data structures to be built atop vppinfra vectors. Users may specify the
  51 alignment for first data element of a vector via the \[vec\]()\*\_aligned
  52 macros.
  53
  54 Vector elements can be any C type e.g. (int, double, struct bar). This
  55 is also true for data types built atop vectors (e.g. heap, pool, etc.).
  56 Many macros have \_a variants supporting alignment of vector elements
  57 and \_h variants supporting non-zero-length vector headers. The \_ha
  58 variants support both.  Additionally cacheline alignment within a
  59 vector element structure can be specified using the
  60 \[CLIB_CACHE_LINE_ALIGN_MARK\]() macro.
  61
  62 Inconsistent usage of header and/or alignment related macro variants
  63 will cause delayed, confusing failures.
  64
  65 Standard programming error: memorize a pointer to the ith element of a
  66 vector, and then expand the vector. Vectors expand by 3/2, so such code
  67 may appear to work for a period of time. Correct code almost always
  68 memorizes vector **indices** which are invariant across reallocations.
  69
  70 In typical application images, one supplies a set of global functions
  71 designed to be called from gdb. Here are a few examples:
  72
  73 -   vl(v) - prints vec\_len(v)
  74 -   pe(p) - prints pool\_elts(p)
  75 -   pifi(p, index) - prints pool\_is\_free\_index(p, index)
  76 -   debug\_hex\_bytes (p, nbytes) - hex memory dump nbytes starting at p
  77
  78 Use the "show gdb" debug CLI command to print the current set.
  79
  80 Bitmaps
  81 -------
  82
  83 Vppinfra bitmaps are dynamic, built using the vppinfra vector APIs.
  84 Quite handy for a variety jobs.
  85
  86 Pools
  87 -----
  88
  89 Vppinfra pools combine vectors and bitmaps to rapidly allocate and free
  90 fixed-size data structures with independent lifetimes. Pools are perfect
  91 for allocating per-session structures.
  92
  93 Hashes
  94 ------
  95
  96 Vppinfra provides several hash flavors. Data plane problems involving
  97 packet classification / session lookup often use
  98 ./src/vppinfra/bihash\_template.\[ch\] bounded-index extensible
  99 hashes. These templates are instantiated multiple times, to efficiently
 100 service different fixed-key sizes.
 101
 102 Bihashes are thread-safe. Read-locking is not required. A simple
 103 spin-lock ensures that only one thread writes an entry at a time.
 104
 105 The original vppinfra hash implementation in
 106 ./src/vppinfra/hash.\[ch\] are simple to use, and are often used in
 107 control-plane code which needs exact-string-matching.
 108
 109 In either case, one almost always looks up a key in a hash table to
 110 obtain an index in a related vector or pool. The APIs are simple enough,
 111 but one must take care when using the unmanaged arbitrary-sized key
 112 variant. Hash\_set\_mem (hash\_table, key\_pointer, value) memorizes
 113 key\_pointer. It is usually a bad mistake to pass the address of a
 114 vector element as the second argument to hash\_set\_mem. It is perfectly
 115 fine to memorize constant string addresses in the text segment.
 116
 117 Timekeeping
 118 -----------
 119
 120 Vppinfra includes high-precision, low-cost timing services. The
 121 datatype clib_time_t and associated functions reside in
 122 ./src/vppinfra/time.\[ch\]. Call clib_time_init (clib_time_t \*cp) to
 123 initialize the clib_time_t object.
 124
 125 Clib_time_init(...) can use a variety of different ways to establish
 126 the hardware clock frequency. At the end of the day, vppinfra
 127 timekeeping takes the attitude that the operating system's clock is
 128 the closest thing to a gold standard it has handy.
 129
 130 When properly configured, NTP maintains kernel clock synchronization
 131 with a highly accurate off-premises reference clock.  Notwithstanding
 132 network propagation delays, a synchronized NTP client will keep the
 133 kernel clock accurate to within 50ms or so.
 134
 135 Why should one care? Simply put, oscillators used to generate CPU
 136 ticks aren't super accurate. They work pretty well, but a 0.1% error
 137 wouldn't be out of the question. That's a minute and a half's worth of
 138 error in 1 day. The error changes constantly, due to temperature
 139 variation, and a host of other physical factors.
 140
 141 It's far too expensive to use system calls for timing, so we're left
 142 with the problem of continously adjusting our view of the CPU tick
 143 register's clocks_per_second parameter.
 144
 145 The clock rate adjustment algorithm measures the number of cpu ticks
 146 and the "gold standard" reference time across an interval of
 147 approximately 16 seconds. We calculate clocks_per_second for the
 148 interval: use rdtsc (on x86_64) and a system call to get the latest
 149 cpu tick count and the kernel's latest nanosecond timestamp. We
 150 subtract the previous interval end values, and use exponential
 151 smoothing to merge the new clock rate sample into the clocks_per_second
 152 parameter.
 153
 154 As of this writing, we maintain the clock rate by way of the following
 155 first-order differential equation:
 156
 157
 158 ```
 159    clocks_per_second(t) = clocks_per_second(t-1) * K + sample_cps(t)*(1-K)
 160    where K = e**(-1.0/3.75);
 161 ```
 162
 163 This yields a per observation "half-life" of 1 minute. Empirically,
 164 the clock rate converges within 5 minutes, and appears to maintain
 165 near-perfect agreement with the kernel clock in the face of ongoing
 166 NTP time adjustments.
 167
 168 See ./src/vppinfra/time.c:clib_time_verify_frequency(...) to look at
 169 the rate adjustment algorithm. The code rejects frequency samples
 170 corresponding to the sort of adjustment which might occur if someone
 171 changes the gold standard kernel clock by several seconds.
 172
 173 ### Monotonic timebase support
 174
 175 Particularly during system initialization, the "gold standard" system
 176 reference clock can change by a large amount, in an instant. It's not
 177 a best practice to yank the reference clock - in either direction - by
 178 hours or days. In fact, some poorly-constructed use-cases do so.
 179
 180 To deal with this reality, clib_time_now(...) returns the number of
 181 seconds since vpp started, *guaranteed to be monotonically
 182 increasing, no matter what happens to the system reference clock*.
 183
 184 This is first-order important, to avoid breaking every active timer in
 185 the system. The vpp host stack alone may account for tens of millions
 186 of active timers. It's utterly impractical to track down and fix
 187 timers, so we must deal with the issue at the timebase level.
 188
 189 Here's how it works. Prior to adjusting the clock rate, we collect the
 190 kernel reference clock and the cpu clock:
 191
 192 ```
 193   /* Ask the kernel and the CPU what time it is... */
 194   now_reference = unix_time_now ();
 195   now_clock = clib_cpu_time_now ();
 196 ```
 197
 198 Compute changes for both clocks since the last rate adjustment,
 199 roughly 15 seconds ago:
 200
 201 ```
 202   /* Compute change in the reference clock */
 203   delta_reference = now_reference - c->last_verify_reference_time;
 204
 205   /* And change in the CPU clock */
 206   delta_clock_in_seconds = (f64) (now_clock - c->last_verify_cpu_time) *
 207     c->seconds_per_clock;
 208 ```
 209
 210 Delta_reference is key. Almost 100% of the time, delta_reference and
 211 delta_clock_in_seconds are identical modulo one system-call
 212 time. However, NTP or a privileged user can yank the system reference
 213 time - in either direction - by an hour, a day, or a decade.
 214
 215 As described above, clib_time_now(...) must return monotonically
 216 increasing answers to the question "how long has it been since vpp
 217 started, in seconds." To do that, the clock rate adjustment algorithm
 218 begins by recomputing the initial reference time:
 219
 220 ```
 221   c->init_reference_time += (delta_reference - delta_clock_in_seconds);
 222 ```
 223
 224 It's easy to convince yourself that if the reference clock changes by
 225 15.000000 seconds and the cpu clock tick time changes by 15.000000
 226 seconds, the initial reference time won't change.
 227
 228 If, on the other hand, delta_reference is -86400.0 and delta clock is
 229 15.0 - reference time jumped backwards by exactly one day in a
 230 15-second rate update interval - we add -86415.0 to the initial
 231 reference time.
 232
 233 Given the corrected initial reference time, we recompute the total
 234 number of cpu ticks which have occurred since the corrected initial
 235 reference time, at the current clock tick rate:
 236
 237 ```
 238   c->total_cpu_time = (now_reference - c->init_reference_time)
 239     * c->clocks_per_second;
 240 ```
 241
 242 ### Timebase precision
 243
 244 Cognoscenti may notice that vlib/clib\_time\_now(...) return a 64-bit
 245 floating-point value; the number of seconds since vpp started.
 246
 247 Please see [this Wikipedia
 248 article](https://en.wikipedia.org/wiki/Double-precision_floating-point_format)
 249 for more information. C double-precision floating point numbers
 250 (called f64 in the vpp code base) have a 53-bit effective mantissa,
 251 and can accurately represent 15 decimal digits' worth of precision.
 252
 253 There are 315,360,000.000001 seconds in ten years plus one
 254 microsecond. That string has exactly 15 decimal digits. The vpp time
 255 base retains 1us precision for roughly 30 years.
 256
 257 vlib/clib\_time\_now do *not* provide precision in excess of 1e-6
 258 seconds. If necessary, please use clib_cpu_time_now(...) for direct
 259 access to the CPU clock-cycle counter. Note that the number of CPU
 260 clock cycles per second varies significantly across CPU architectures.
 261
 262 Timer Wheels
 263 ------------
 264
 265 Vppinfra includes configurable timer wheel support. See the source
 266 code in .../src/vppinfra/tw_timer_template.[ch], as well as a
 267 considerable number of template instances defined in
 268 .../src/vppinfra/tw_timer_<wheel-geometry-spec>.[ch].
 269
 270 Instantiation of tw_timer_template.h generates named structures to
 271 implement specific timer wheel geometries. Choices include: number of
 272 timer wheels (currently, 1 or 2), number of slots per ring (a power of
 273 two), and the number of timers per "object handle".
 274
 275 Internally, user object/timer handles are 32-bit integers, so if one
 276 selects 16 timers/object (4 bits), the resulting timer wheel handle is
 277 limited to 2**28 objects.
 278
 279 Here are the specific settings required to generate a single 2048 slot
 280 wheel which supports 2 timers per object:
 281
 282 ```
 283     #define TW_TIMER_WHEELS 1
 284     #define TW_SLOTS_PER_RING 2048
 285     #define TW_RING_SHIFT 11
 286     #define TW_RING_MASK (TW_SLOTS_PER_RING -1)
 287     #define TW_TIMERS_PER_OBJECT 2
 288     #define LOG2_TW_TIMERS_PER_OBJECT 1
 289     #define TW_SUFFIX _2t_1w_2048sl
 290     #define TW_FAST_WHEEL_BITMAP 0
 291     #define TW_TIMER_ALLOW_DUPLICATE_STOP 0
 292 ```
 293
 294 See tw_timer_2t_1w_2048sl.h for a complete
 295 example.
 296
 297 tw_timer_template.h is not intended to be #included directly. Client
 298 codes can include multiple timer geometry header files, although
 299 extreme caution would required to use the TW and TWT macros in such a
 300 case.
 301
 302 ### API usage examples
 303
 304 The unit test code in .../src/vppinfra/test_tw_timer.c provides a
 305 concrete API usage example. It uses a synthetic clock to rapidly
 306 exercise the underlying tw_timer_expire_timers(...) template.
 307
 308 There are not many API routines to call.
 309
 310 #### Initialize a two-timer, single 2048-slot wheel w/ a 1-second timer granularity
 311
 312 ```
 313     tw_timer_wheel_init_2t_1w_2048sl (&tm->single_wheel,
 314                                      expired_timer_single_callback,
 315                                       1.0 / * timer interval * / );
 316 ```
 317
 318 #### Start a timer
 319
 320 ```
 321     handle = tw_timer_start_2t_1w_2048sl (&tm->single_wheel, elt_index,
 322                                           [0 | 1] / * timer id * / ,
 323                                           expiration_time_in_u32_ticks);
 324 ```
 325
 326 #### Stop a timer
 327
 328 ```
 329     tw_timer_stop_2t_1w_2048sl (&tm->single_wheel, handle);
 330 ```
 331
 332 #### An expired timer callback
 333
 334 ```
 335     static void
 336     expired_timer_single_callback (u32 * expired_timers)
 337     {
 338         int i;
 339         u32 pool_index, timer_id;
 340         tw_timer_test_elt_t *e;
 341         tw_timer_test_main_t *tm = &tw_timer_test_main;
 342
 343         for (i = 0; i < vec_len (expired_timers);
 344             {
 345             pool_index = expired_timers[i] & 0x7FFFFFFF;
 346             timer_id = expired_timers[i] >> 31;
 347
 348             ASSERT (timer_id == 1);
 349
 350             e = pool_elt_at_index (tm->test_elts, pool_index);
 351
 352             if (e->expected_to_expire != tm->single_wheel.current_tick)
 353               {
 354                 fformat (stdout, "[%d] expired at %d not %d\n",
 355                          e - tm->test_elts, tm->single_wheel.current_tick,
 356                          e->expected_to_expire);
 357               }
 358          pool_put (tm->test_elts, e);
 359          }
 360      }
 361 ```
 362
 363 We use wheel timers extensively in the vpp host stack. Each TCP
 364 session needs 5 timers, so supporting 10 million flows requires up to
 365 50 million concurrent timers.
 366
 367 Timers rarely expire, so it's of utmost important that stopping and
 368 restarting a timer costs as few clock cycles as possible.
 369
 370 Stopping a timer costs a doubly-linked list dequeue. Starting a timer
 371 involves modular arithmetic to determine the correct timer wheel and
 372 slot, and a list head enqueue.
 373
 374 Expired timer processing generally involves bulk link-list retirement
 375 with user callback presentation. Some additional complexity at wheel
 376 wrap time, to relocate timers from slower-turning timer wheels into
 377 faster-turning wheels.
 378
 379 Format
 380 ------
 381
 382 Vppinfra format is roughly equivalent to printf.
 383
 384 Format has a few properties worth mentioning. Format's first argument is
 385 a (u8 \*) vector to which it appends the result of the current format
 386 operation. Chaining calls is very easy:
 387
 388 ```c
 389     u8 * result;
 390
 391     result = format (0, "junk = %d, ", junk);
 392     result = format (result, "more junk = %d\n", more_junk);
 393 ```
 394
 395 As previously noted, NULL pointers are perfectly proper 0-length
 396 vectors. Format returns a (u8 \*) vector, **not** a C-string. If you
 397 wish to print a (u8 \*) vector, use the "%v" format string. If you need
 398 a (u8 \*) vector which is also a proper C-string, either of these
 399 schemes may be used:
 400
 401 ```c
 402     vec_add1 (result, 0)
 403     or
 404     result = format (result, "<whatever>%c", 0);
 405 ```
 406
 407 Remember to vec\_free() the result if appropriate. Be careful not to
 408 pass format an uninitialized (u8 \*).
 409
 410 Format implements a particularly handy user-format scheme via the "%U"
 411 format specification. For example:
 412
 413 ```c
 414     u8 * format_junk (u8 * s, va_list *va)
 415     {
 416       junk = va_arg (va, u32);
 417       s = format (s, "%s", junk);
 418       return s;
 419     }
 420
 421     result = format (0, "junk = %U, format_junk, "This is some junk");
 422 ```
 423
 424 format\_junk() can invoke other user-format functions if desired. The
 425 programmer shoulders responsibility for argument type-checking. It is
 426 typical for user format functions to blow up spectacularly if the
 427 va\_arg(va, type) macros don't match the caller's idea of reality.
 428
 429 Unformat
 430 --------
 431
 432 Vppinfra unformat is vaguely related to scanf, but considerably more
 433 general.
 434
 435 A typical use case involves initializing an unformat\_input\_t from
 436 either a C-string or a (u8 \*) vector, then parsing via unformat() as
 437 follows:
 438
 439 ```c
 440     unformat_input_t input;
 441     u8 *s = "<some-C-string>";
 442
 443     unformat_init_string (&input, (char *) s, strlen((char *) s));
 444     /* or */
 445     unformat_init_vector (&input, <u8-vector>);
 446 ```
 447
 448 Then loop parsing individual elements:
 449
 450 ```c
 451     while (unformat_check_input (&input) != UNFORMAT_END_OF_INPUT)
 452     {
 453       if (unformat (&input, "value1 %d", &value1))
 454         ;/* unformat sets value1 */
 455       else if (unformat (&input, "value2 %d", &value2)
 456         ;/* unformat sets value2 */
 457       else
 458         return clib_error_return (0, "unknown input '%U'",
 459                                   format_unformat_error, input);
 460     }
 461 ```
 462
 463 As with format, unformat implements a user-unformat function capability
 464 via a "%U" user unformat function scheme. Generally, one can trivially
 465 transform "format (s, "foo %d", foo) -> "unformat (input, "foo %d", &foo)".
 466
 467 Unformat implements a couple of handy non-scanf-like format specifiers:
 468
 469 ```c
 470     unformat (input, "enable %=", &enable, 1 /* defaults to 1 */);
 471     unformat (input, "bitzero %|", &mask, (1<<0));
 472     unformat (input, "bitone %|", &mask, (1<<1));
 473     <etc>
 474 ```
 475
 476 The phrase "enable %=" means "set the supplied variable to the default
 477 value" if unformat parses the "enable" keyword all by itself. If
 478 unformat parses "enable 123" set the supplied variable to 123.
 479
 480 We could clean up a number of hand-rolled "verbose" + "verbose %d"
 481 argument parsing codes using "%=".
 482
 483 The phrase "bitzero %|" means "set the specified bit in the supplied
 484 bitmask" if unformat parses "bitzero". Although it looks like it could
 485 be fairly handy, it's very lightly used in the code base.
 486
 487 `%_` toggles whether or not to skip input white space.
 488
 489 For transition from skip to no-skip in middle of format string, skip input white space.  For example, the following:
 490
 491 ```c
 492 fmt = "%_%d.%d%_->%_%d.%d%_"
 493 unformat (input, fmt, &one, &two, &three, &four);
 494 ```
 495 matches input "1.2 -> 3.4".
 496 Without this, the space after -> does not get skipped.
 497
 498
 499 ```
 500
 501 ### How to parse a single input line
 502
 503 Debug CLI command functions MUST NOT accidentally consume input
 504 belonging to other debug CLI commands. Otherwise, it's impossible to
 505 script a set of debug CLI commands which "work fine" when issued one
 506 at a time.
 507
 508 This bit of code is NOT correct:
 509
 510 ```c
 511   /* Eats script input NOT beloging to it, and chokes! */
 512   while (unformat_check_input (input) != UNFORMAT_END_OF_INPUT)
 513     {
 514       if (unformat (input, ...))
 515         ;
 516       else if (unformat (input, ...))
 517         ;
 518       else
 519         return clib_error_return (0, "parse error: '%U'",
 520                                      format_unformat_error, input);
 521         }
 522     }
 523 ```
 524
 525 When executed as part of a script, such a function will return "parse
 526 error: '<next-command-text>'" every time, unless it happens to be the
 527 last command in the script.
 528
 529 Instead, use "unformat_line_input" to consume the rest of a line's
 530 worth of input - everything past the path specified in the
 531 VLIB_CLI_COMMAND declaration.
 532
 533 For example, unformat_line_input with "my_command" set up as shown
 534 below and user input "my path is clear" will produce an
 535 unformat_input_t that contains "is clear".
 536
 537 ```c
 538     VLIB_CLI_COMMAND (...) = {
 539         .path = "my path",
 540     };
 541 ```
 542
 543 Here's a bit of code which shows the required mechanics, in full:
 544
 545 ```c
 546     static clib_error_t *
 547     my_command_fn (vlib_main_t * vm,
 548                    unformat_input_t * input,
 549                    vlib_cli_command_t * cmd)
 550     {
 551       unformat_input_t _line_input, *line_input = &_line_input;
 552       u32 this, that;
 553       clib_error_t *error = 0;
 554
 555       if (!unformat_user (input, unformat_line_input, line_input))
 556         return 0;
 557
 558       /*
 559        * Here, UNFORMAT_END_OF_INPUT is at the end of the line we consumed,
 560        * not at the end of the script...
 561        */
 562       while (unformat_check_input (line_input) != UNFORMAT_END_OF_INPUT)
 563         {
 564            if (unformat (line_input, "this %u", &this))
 565              ;
 566            else if (unformat (line_input, "that %u", &that))
 567              ;
 568            else
 569              {
 570                error = clib_error_return (0, "parse error: '%U'",
 571                                      format_unformat_error, line_input);
 572                goto done;
 573              }
 574           }
 575
 576     <do something based on "this" and "that", etc>
 577
 578     done:
 579       unformat_free (line_input);
 580       return error;
 581     }
 582    /* *INDENT-OFF* */
 583    VLIB_CLI_COMMAND (my_command, static) = {
 584      .path = "my path",
 585      .function = my_command_fn",
 586    };
 587    /* *INDENT-ON* */
 588
 589 ```
 590
 591
 592 Vppinfra errors and warnings
 593 ----------------------------
 594
 595 Many functions within the vpp dataplane have return-values of type
 596 clib\_error\_t \*. Clib\_error\_t's are arbitrary strings with a bit of
 597 metadata \[fatal, warning\] and are easy to announce. Returning a NULL
 598 clib\_error\_t \* indicates "A-OK, no error."
 599
 600 Clib\_warning(format-args) is a handy way to add debugging
 601 output; clib warnings prepend function:line info to unambiguously locate
 602 the message source. Clib\_unix\_warning() adds perror()-style Linux
 603 system-call information. In production images, clib\_warnings result in
 604 syslog entries.
 605
 606 Serialization
 607 -------------
 608
 609 Vppinfra serialization support allows the programmer to easily serialize
 610 and unserialize complex data structures.
 611
 612 The underlying primitive serialize/unserialize functions use network
 613 byte-order, so there are no structural issues serializing on a
 614 little-endian host and unserializing on a big-endian host.