docs/developer/corearchitecture/infrastructure.rst

   1 VPPINFRA (Infrastructure)
   2 =========================
   3
   4 The files associated with the VPP Infrastructure layer are located in
   5 the ``./src/vppinfra`` folder.
   6
   7 VPPinfra is a collection of basic C-library services, quite sufficient
   8 to build standalone programs to run directly on bare metal. It also
   9 provides high-performance dynamic arrays, hashes, bitmaps,
  10 high-precision real-time clock support, fine-grained event-logging, and
  11 data structure serialization.
  12
  13 One fair comment / fair warning about vppinfra: you can't always tell a
  14 macro from an inline function from an ordinary function simply by name.
  15 Macros are used to avoid function calls in the typical case, and to
  16 cause (intentional) side-effects.
  17
  18 Vppinfra has been around for almost 20 years and tends not to change
  19 frequently. The VPP Infrastructure layer contains the following
  20 functions:
  21
  22 Vectors
  23 -------
  24
  25 Vppinfra vectors are ubiquitous dynamically resized arrays with optional
  26 user-defined "headers". Many vppinfra data structures (e.g. hash, heap,
  27 pool) are vectors with various different headers.
  28
  29 The memory layout looks like this:
  30
  31 ::
  32
  33                       User header (optional, uword aligned)
  34                       Alignment padding (if needed)
  35                       Vector length in elements
  36     User's pointer -> Vector element 0
  37                       Vector element 1
  38                       ...
  39                       Vector element N-1
  40
  41 As shown above, the vector APIs deal with pointers to the 0th element of
  42 a vector. Null pointers are valid vectors of length zero.
  43
  44 To avoid thrashing the memory allocator, one often resets the length of
  45 a vector to zero while retaining the memory allocation. Set the vector
  46 length field to zero via the vec_reset_length(v) macro. [Use the macro!
  47 It’s smart about NULL pointers.]
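
     The layout and the NULL-is-empty convention can be sketched without
     vppinfra. What follows is a minimal stand-alone model, not the vppinfra
     implementation: the toy_vec_* names and the two-field header are
     invented for illustration, and only int elements are handled.

     .. code:: c

        #include <assert.h>
        #include <stddef.h>
        #include <stdlib.h>

        /* Hypothetical header kept immediately before element 0. */
        typedef struct
        {
          size_t len;   /* elements in use */
          size_t cap;   /* elements allocated */
        } toy_vec_header_t;

        toy_vec_header_t *
        toy_vec_header (void *v)
        {
          return (toy_vec_header_t *) v - 1;
        }

        /* A NULL pointer is a valid vector of length zero. */
        size_t
        toy_vec_len (void *v)
        {
          return v ? toy_vec_header (v)->len : 0;
        }

        /* Forget the elements but keep the allocation; NULL is accepted. */
        void
        toy_vec_reset_length (void *v)
        {
          if (v)
            toy_vec_header (v)->len = 0;
        }

        /* Append one int, growing by roughly 3/2 when full. Returns the
           (possibly moved) pointer to element 0. */
        int *
        toy_vec_add1 (int *v, int value)
        {
          size_t len = toy_vec_len (v);
          size_t cap = v ? toy_vec_header (v)->cap : 0;

          if (len == cap)
            {
              size_t new_cap = cap ? cap + cap / 2 + 1 : 4;
              toy_vec_header_t *h =
                realloc (v ? toy_vec_header (v) : NULL,
                         sizeof (*h) + new_cap * sizeof (int));
              assert (h != NULL);
              h->len = len;
              h->cap = new_cap;
              v = (int *) (h + 1);
            }
          v[len] = value;
          toy_vec_header (v)->len = len + 1;
          return v;
        }

        void
        toy_vec_free (int *v)
        {
          if (v)
            free (toy_vec_header (v));
        }

     Callers always write ``v = toy_vec_add1 (v, x)``, because the pointer
     to element 0 may change whenever the vector grows.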
  48
  49 Typically, the user header is not present. User headers allow for other
  50 data structures to be built atop vppinfra vectors. Users may specify the
  51 alignment for the first data element of a vector via the ``vec_*_aligned``
  52 macros.
  53
  54 Vector elements can be any C type (e.g. int, double, struct bar). This
  55 is also true for data types built atop vectors (e.g. heap, pool, etc.).
  56 Many macros have \_a variants supporting alignment of vector elements
  57 and \_h variants supporting non-zero-length vector headers. The \_ha
  58 variants support both. Additionally, cacheline alignment within a vector
  59 element structure can be specified using the
  60 ``CLIB_CACHE_LINE_ALIGN_MARK`` macro.
  61
  62 Inconsistent usage of header and/or alignment related macro variants
  63 will cause delayed, confusing failures.
  64
  65 Standard programming error: memorize a pointer to the ith element of a
  66 vector, and then expand the vector. Vectors expand by 3/2, so such code
  67 may appear to work for a period of time. Correct code almost always
  68 memorizes vector **indices** which are invariant across reallocations.
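
     The index-versus-pointer rule is easy to show with plain realloc(). This
     is a sketch under that assumption only; the grow() and
     read_after_growth() helpers are hypothetical and do not use the
     vppinfra vector macros.

     .. code:: c

        #include <assert.h>
        #include <stdlib.h>

        /* Resize an int array to new_count elements; the block may move. */
        int *
        grow (int *array, size_t new_count)
        {
          int *moved = realloc (array, new_count * sizeof (int));
          assert (moved != NULL);
          return moved;
        }

        /* Remember an index, grow the array, then re-derive the element
           address from the (possibly new) base pointer. */
        int
        read_after_growth (void)
        {
          int *array = grow (NULL, 4);
          size_t remembered_index = 2;
          int value;

          array[remembered_index] = 42;
          array = grow (array, 4096);   /* array may now live elsewhere */
          value = array[remembered_index];
          free (array);
          return value;
        }

     Had the code saved ``&array[2]`` before the second grow() call, that
     pointer could refer to freed memory afterwards. The saved index stays
     valid either way.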
  69
  70 In typical application images, one supplies a set of global functions
  71 designed to be called from gdb. Here are a few examples:
  72
  73 -  vl(v) - prints vec_len(v)
  74 -  pe(p) - prints pool_elts(p)
  75 -  pifi(p, index) - prints pool_is_free_index(p, index)
  76 -  debug_hex_bytes (p, nbytes) - hex memory dump nbytes starting at p
  77
  78 Use the “show gdb” debug CLI command to print the current set.
  79
  80 Bitmaps
  81 -------
  82
  83 Vppinfra bitmaps are dynamic, built using the vppinfra vector APIs.
  84 Quite handy for a variety of jobs.
  85
  86 Pools
  87 -----
  88
  89 Vppinfra pools combine vectors and bitmaps to rapidly allocate and free
  90 fixed-size data structures with independent lifetimes. Pools are perfect
  91 for allocating per-session structures.
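
     As a rough illustration of the idea, here is a stand-alone toy pool: a
     fixed array of elements plus a free bitmap, handing out indices. The
     toy_pool_* names, the 64-element limit, and the toy_session_t type are
     assumptions made for the sketch; real vppinfra pools grow dynamically.

     .. code:: c

        #include <assert.h>
        #include <stdint.h>

        #define TOY_POOL_SIZE 64

        typedef struct
        {
          uint32_t session_id;
        } toy_session_t;

        typedef struct
        {
          toy_session_t elts[TOY_POOL_SIZE];
          uint64_t free_bitmap;   /* bit i set => slot i is free */
        } toy_pool_t;

        void
        toy_pool_init (toy_pool_t *p)
        {
          p->free_bitmap = ~0ULL;
        }

        /* Allocate the lowest free slot; returns its index, or -1 if full. */
        int
        toy_pool_get (toy_pool_t *p)
        {
          for (int i = 0; i < TOY_POOL_SIZE; i++)
            if (p->free_bitmap & (1ULL << i))
              {
                p->free_bitmap &= ~(1ULL << i);
                return i;
              }
          return -1;
        }

        /* Return a slot to the pool; other elements are unaffected. */
        void
        toy_pool_put (toy_pool_t *p, int index)
        {
          assert (index >= 0 && index < TOY_POOL_SIZE);
          p->free_bitmap |= 1ULL << index;
        }

        int
        toy_pool_is_free_index (const toy_pool_t *p, int index)
        {
          assert (index >= 0 && index < TOY_POOL_SIZE);
          return (int) ((p->free_bitmap >> index) & 1u);
        }

     Freed indices are reused by later allocations, which is why other data
     structures should refer to pool elements by index and check that the
     index is still in use.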
  92
  93 Hashes
  94 ------
  95
  96 Vppinfra provides several hash flavors. Data plane problems involving
  97 packet classification / session lookup often use
  98 ./src/vppinfra/bihash_template.[ch] bounded-index extensible hashes.
  99 These templates are instantiated multiple times, to efficiently service
 100 different fixed-key sizes.
 101
 102 Bihashes are thread-safe. Read-locking is not required. A simple
 103 spin-lock ensures that only one thread writes an entry at a time.
 104
 105 The original vppinfra hash implementation in ./src/vppinfra/hash.[ch]
 106 is simple to use, and is often used in control-plane code which needs
 107 exact-string-matching.
 108
 109 In either case, one almost always looks up a key in a hash table to
 110 obtain an index in a related vector or pool. The APIs are simple enough,
 111 but one must take care when using the unmanaged arbitrary-sized key
 112 variant. Hash_set_mem (hash_table, key_pointer, value) memorizes
 113 key_pointer. It is usually a bad mistake to pass the address of a vector
 114 element as the second argument to hash_set_mem. It is perfectly fine to
 115 memorize constant string addresses in the text segment.
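
     The key-pointer hazard can be reproduced with a toy table that, like
     the unmanaged-key variant described above, stores the key pointer
     rather than a copy of the key bytes. Everything here is hypothetical
     (toy_set_mem, toy_get_mem, a linear search instead of real hashing);
     it only models the ownership rule.

     .. code:: c

        #include <assert.h>
        #include <string.h>

        #define TOY_SLOTS 8

        typedef struct
        {
          const char *key;   /* memorized pointer, bytes are NOT copied */
          int value;
        } toy_entry_t;

        typedef struct
        {
          toy_entry_t slots[TOY_SLOTS];
          int n_entries;
        } toy_table_t;

        void
        toy_set_mem (toy_table_t *t, const char *key, int value)
        {
          assert (t->n_entries < TOY_SLOTS);
          t->slots[t->n_entries].key = key;
          t->slots[t->n_entries].value = value;
          t->n_entries++;
        }

        /* Returns the stored value, or -1 if the key is not found. */
        int
        toy_get_mem (const toy_table_t *t, const char *key)
        {
          for (int i = 0; i < t->n_entries; i++)
            if (strcmp (t->slots[i].key, key) == 0)
              return t->slots[i].value;
          return -1;
        }

     If the caller later overwrites the buffer that held a key, the entry
     silently stops matching. A key stored in a vector element is worse
     still: the vector may be reallocated, leaving the table holding a
     stale pointer. String literals do not have this problem.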
 116
 117 Timekeeping
 118 -----------
 119
 120 Vppinfra includes high-precision, low-cost timing services. The datatype
 121 clib_time_t and associated functions reside in ./src/vppinfra/time.[ch].
 122 Call clib_time_init (clib_time_t \*cp) to initialize the clib_time_t
 123 object.
 124
 125 Clib_time_init(…) can use a variety of different ways to establish the
 126 hardware clock frequency. At the end of the day, vppinfra timekeeping
 127 takes the attitude that the operating system’s clock is the closest
 128 thing to a gold standard it has handy.
 129
 130 When properly configured, NTP maintains kernel clock synchronization
 131 with a highly accurate off-premises reference clock. Notwithstanding
 132 network propagation delays, a synchronized NTP client will keep the
 133 kernel clock accurate to within 50ms or so.
 134
 135 Why should one care? Simply put, oscillators used to generate CPU ticks
 136 aren’t super accurate. They work pretty well, but a 0.1% error wouldn’t
 137 be out of the question. That’s a minute and a half’s worth of error in 1
 138 day. The error changes constantly, due to temperature variation, and a
 139 host of other physical factors.
 140
 141 It’s far too expensive to use system calls for timing, so we’re left
 142 with the problem of continuously adjusting our view of the CPU tick
 143 register’s clocks_per_second parameter.
 144
 145 The clock rate adjustment algorithm measures the number of cpu ticks and
 146 the “gold standard” reference time across an interval of approximately
 147 16 seconds. We calculate clocks_per_second for the interval: use rdtsc
 148 (on x86_64) and a system call to get the latest cpu tick count and the
 149 kernel’s latest nanosecond timestamp. We subtract the previous interval
 150 end values, and use exponential smoothing to merge the new clock rate
 151 sample into the clocks_per_second parameter.
 152
 153 As of this writing, we maintain the clock rate by way of the following
 154 first-order difference equation (an exponential moving average):
 155
 156 .. code:: c
 157
 158       clocks_per_second(t) = clocks_per_second(t-1) * K + sample_cps(t)*(1-K)
 159       where K = e**(-1.0/3.75);
 160
 161 This yields a per observation “half-life” of 1 minute. Empirically, the
 162 clock rate converges within 5 minutes, and appears to maintain
 163 near-perfect agreement with the kernel clock in the face of ongoing NTP
 164 time adjustments.
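
     The smoothing step is small enough to try in isolation. The sketch
     below assumes nothing beyond the equation above: the function names
     are made up, and K is written as a rounded literal so that no math
     library is needed.

     .. code:: c

        #include <assert.h>

        /* K = e**(-1.0/3.75), rounded to six digits. */
        #define TOY_K 0.765928

        /* Merge one clock rate sample into the running estimate. */
        double
        toy_smooth_clocks_per_second (double previous_cps, double sample_cps)
        {
          return previous_cps * TOY_K + sample_cps * (1.0 - TOY_K);
        }

        /* Feed the same sample n_updates times; return the final estimate. */
        double
        toy_converge (double initial_cps, double true_cps, int n_updates)
        {
          double cps = initial_cps;

          assert (n_updates >= 0);
          for (int i = 0; i < n_updates; i++)
            cps = toy_smooth_clocks_per_second (cps, true_cps);
          return cps;
        }

     Each update leaves a fraction K of the previous error. Starting 0.1%
     off, 19 updates (about 5 minutes at one observation per 16 seconds)
     leave less than 1% of the initial error.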
 165
 166 See ./src/vppinfra/time.c:clib_time_verify_frequency(…) to look at the
 167 rate adjustment algorithm. The code rejects frequency samples
 168 corresponding to the sort of adjustment which might occur if someone
 169 changes the gold standard kernel clock by several seconds.
 170
 171 Monotonic timebase support
 172 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 173
 174 Particularly during system initialization, the “gold standard” system
 175 reference clock can change by a large amount, in an instant. It’s not a
 176 best practice to yank the reference clock - in either direction - by
 177 hours or days. Nevertheless, some poorly-constructed use-cases do so.
 178
 179 To deal with this reality, clib_time_now(…) returns the number of
 180 seconds since vpp started, *guaranteed to be monotonically increasing,
 181 no matter what happens to the system reference clock*.
 182
 183 This is first-order important, to avoid breaking every active timer in
 184 the system. The vpp host stack alone may account for tens of millions of
 185 active timers. It’s utterly impractical to track down and fix timers, so
 186 we must deal with the issue at the timebase level.
 187
 188 Here’s how it works. Prior to adjusting the clock rate, we collect the
 189 kernel reference clock and the cpu clock:
 190
 191 .. code:: c
 192
 193      /* Ask the kernel and the CPU what time it is... */
 194      now_reference = unix_time_now ();
 195      now_clock = clib_cpu_time_now ();
 196
 197 Compute changes for both clocks since the last rate adjustment, roughly
 198 15 seconds ago:
 199
 200 .. code:: c
 201
 202      /* Compute change in the reference clock */
 203      delta_reference = now_reference - c->last_verify_reference_time;
 204
 205      /* And change in the CPU clock */
 206      delta_clock_in_seconds = (f64) (now_clock - c->last_verify_cpu_time) *
 207        c->seconds_per_clock;
 208
 209 Delta_reference is key. Almost 100% of the time, delta_reference and
 210 delta_clock_in_seconds are identical modulo one system-call time.
 211 However, NTP or a privileged user can yank the system reference time -
 212 in either direction - by an hour, a day, or a decade.
 213
 214 As described above, clib_time_now(…) must return monotonically
 215 increasing answers to the question “how long has it been since vpp
 216 started, in seconds.” To do that, the clock rate adjustment algorithm
 217 begins by recomputing the initial reference time:
 218
 219 .. code:: c
 220
 221      c->init_reference_time += (delta_reference - delta_clock_in_seconds);
 222
 223 It’s easy to convince yourself that if the reference clock changes by
 224 15.000000 seconds and the cpu clock tick time changes by 15.000000
 225 seconds, the initial reference time won’t change.
 226
 227 If, on the other hand, delta_reference is -86400.0 and delta clock is
 228 15.0 - reference time jumped backwards by exactly one day in a 15-second
 229 rate update interval - we add -86415.0 to the initial reference time.
 230
 231 Given the corrected initial reference time, we recompute the total
 232 number of cpu ticks which have occurred since the corrected initial
 233 reference time, at the current clock tick rate:
 234
 235 .. code:: c
 236
 237      c->total_cpu_time = (now_reference - c->init_reference_time)
 238        * c->clocks_per_second;
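
     The arithmetic can be checked with a stand-alone model. This is a
     simplified sketch, not the code in time.c: the toy_timebase_t type is
     hypothetical, and the CPU clock is passed in already converted to
     seconds so that the example needs no tick rate.

     .. code:: c

        #include <assert.h>

        typedef struct
        {
          double init_reference_time;
          double last_verify_reference_time;
          double last_verify_cpu_seconds;
        } toy_timebase_t;

        /* Apply one adjustment step; returns seconds since start as seen
           through the corrected initial reference time. */
        double
        toy_timebase_verify (toy_timebase_t *c, double now_reference,
                             double now_cpu_seconds)
        {
          double delta_reference, delta_clock_in_seconds;

          /* The CPU clock itself never runs backwards. */
          assert (now_cpu_seconds >= c->last_verify_cpu_seconds);

          delta_reference = now_reference - c->last_verify_reference_time;
          delta_clock_in_seconds =
            now_cpu_seconds - c->last_verify_cpu_seconds;

          /* Absorb any jump in the reference clock. */
          c->init_reference_time += delta_reference - delta_clock_in_seconds;

          c->last_verify_reference_time = now_reference;
          c->last_verify_cpu_seconds = now_cpu_seconds;
          return now_reference - c->init_reference_time;
        }

     With a start time of 1000000.0, a normal 15-second step leaves the
     initial reference time alone. If the reference clock then jumps back
     one day during the next 15 seconds, the initial reference time moves
     by -86415.0 and the elapsed time still reads 30 seconds.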
 239
 240 Timebase precision
 241 ~~~~~~~~~~~~~~~~~~
 242
 243 Cognoscenti may notice that vlib/clib_time_now(…) return a 64-bit
 244 floating-point value: the number of seconds since vpp started.
 245
 246 Please see `this Wikipedia
 247 article <https://en.wikipedia.org/wiki/Double-precision_floating-point_format>`__
 248 for more information. C double-precision floating point numbers (called
 249 f64 in the vpp code base) have a 53-bit effective mantissa, and can
 250 accurately represent 15 decimal digits’ worth of precision.
 251
 252 There are 315,360,000.000001 seconds in ten years plus one microsecond.
 253 That string has exactly 15 decimal digits. The vpp time base retains 1us
 254 precision for roughly 30 years.
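
     One way to convince yourself is to probe how small a step a double can
     still register at a given magnitude. The helper below is an
     illustrative sketch with a made-up name; it only examines
     power-of-two steps and uses nothing from vppinfra.

     .. code:: c

        #include <assert.h>

        /* Smallest power-of-two step that still changes t when added. */
        double
        toy_smallest_visible_step (double t)
        {
          double step = 1.0;
          volatile double sum;   /* force rounding to double each time */

          assert (t >= 1.0);
          for (;;)
            {
              sum = t + step / 2.0;
              if (sum == t)
                break;
              step /= 2.0;
            }
          return step;
        }

     At ten years' worth of seconds (315360000.0) the step found this way
     is well under one microsecond, so adding 1e-6 still produces a
     distinct value.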
 255
 256 vlib/clib_time_now do *not* provide precision in excess of 1e-6 seconds.
 257 If necessary, please use clib_cpu_time_now(…) for direct access to the
 258 CPU clock-cycle counter. Note that the number of CPU clock cycles per
 259 second varies significantly across CPU architectures.
 260
 261 Timer Wheels
 262 ------------
 263
 264 Vppinfra includes configurable timer wheel support. See the source code
 265 in …/src/vppinfra/tw_timer_template.[ch], as well as a considerable
 266 number of template instances defined in …/src/vppinfra/tw_timer\_\*.[ch].
 267
 268 Instantiation of tw_timer_template.h generates named structures to
 269 implement specific timer wheel geometries. Choices include: number of
 270 timer wheels (currently, 1 or 2), number of slots per ring (a power of
 271 two), and the number of timers per “object handle”.
 272
 273 Internally, user object/timer handles are 32-bit integers, so if one
 274 selects 16 timers/object (4 bits), the resulting timer wheel handle is
 275 limited to 2**28 objects.
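
     The bit budget can be sketched as follows. The layout assumed here
     (timer id in the top bits, object index in the remaining low bits)
     matches what the expired-timer callback in this section decodes with
     ``& 0x7FFFFFFF`` and ``>> 31`` for 2 timers per object; extending it to
     other geometries, and the toy_* names, are assumptions of the sketch.

     .. code:: c

        #include <assert.h>
        #include <stdint.h>

        /* Number of distinct objects a 32-bit handle can address. */
        uint32_t
        toy_max_objects (unsigned log2_timers_per_object)
        {
          assert (log2_timers_per_object > 0 && log2_timers_per_object < 32);
          return 1u << (32 - log2_timers_per_object);
        }

        uint32_t
        toy_make_handle (uint32_t object_index, uint32_t timer_id,
                         unsigned log2_timers_per_object)
        {
          assert (object_index < toy_max_objects (log2_timers_per_object));
          assert (timer_id < (1u << log2_timers_per_object));
          return (timer_id << (32 - log2_timers_per_object)) | object_index;
        }

        uint32_t
        toy_handle_object_index (uint32_t handle,
                                 unsigned log2_timers_per_object)
        {
          return handle & (toy_max_objects (log2_timers_per_object) - 1u);
        }

        uint32_t
        toy_handle_timer_id (uint32_t handle,
                             unsigned log2_timers_per_object)
        {
          assert (log2_timers_per_object > 0 && log2_timers_per_object < 32);
          return handle >> (32 - log2_timers_per_object);
        }

     With 4 bits of timer id, toy_max_objects (4) is 2**28 = 268435456.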
 276
 277 Here are the specific settings required to generate a single 2048 slot
 278 wheel which supports 2 timers per object:
 279
 280 .. code:: c
 281
 282        #define TW_TIMER_WHEELS 1
 283        #define TW_SLOTS_PER_RING 2048
 284        #define TW_RING_SHIFT 11
 285        #define TW_RING_MASK (TW_SLOTS_PER_RING -1)
 286        #define TW_TIMERS_PER_OBJECT 2
 287        #define LOG2_TW_TIMERS_PER_OBJECT 1
 288        #define TW_SUFFIX _2t_1w_2048sl
 289        #define TW_FAST_WHEEL_BITMAP 0
 290        #define TW_TIMER_ALLOW_DUPLICATE_STOP 0
 291
 292 See tw_timer_2t_1w_2048sl.h for a complete example.
 293
 294 tw_timer_template.h is not intended to be #included directly. Client
 295 code can include multiple timer geometry header files, although extreme
 296 caution would be required to use the TW and TWT macros in such a case.
 297
 298 API usage examples
 299 ~~~~~~~~~~~~~~~~~~
 300
 301 The unit test code in …/src/vppinfra/test_tw_timer.c provides a concrete
 302 API usage example. It uses a synthetic clock to rapidly exercise the
 303 underlying tw_timer_expire_timers(…) template.
 304
 305 There are not many API routines to call.
 306
 307 Initialize a two-timer, single 2048-slot wheel w/ a 1-second timer granularity
 308 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 309
 310 .. code:: c
 311
 312        tw_timer_wheel_init_2t_1w_2048sl (&tm->single_wheel,
 313                                          expired_timer_single_callback,
 314                                          1.0 /* timer interval */ );
 315
 316 Start a timer
 317 ^^^^^^^^^^^^^
 318
 319 .. code:: c
 320
 321        handle = tw_timer_start_2t_1w_2048sl (&tm->single_wheel, elt_index,
 322                                              timer_id /* 0 or 1 */ ,
 323                                              expiration_time_in_u32_ticks);
 324
 325 Stop a timer
 326 ^^^^^^^^^^^^
 327
 328 .. code:: c
 329
 330        tw_timer_stop_2t_1w_2048sl (&tm->single_wheel, handle);
 331
 332 An expired timer callback
 333 ^^^^^^^^^^^^^^^^^^^^^^^^^
 334
 335 .. code:: c
 336
 337        static void
 338        expired_timer_single_callback (u32 * expired_timers)
 339        {
 340          int i;
 341          u32 pool_index, timer_id;
 342          tw_timer_test_elt_t *e;
 343          tw_timer_test_main_t *tm = &tw_timer_test_main;
 344
 345          for (i = 0; i < vec_len (expired_timers); i++)
 346            {
 347              pool_index = expired_timers[i] & 0x7FFFFFFF;  /* low 31 bits */
 348              timer_id = expired_timers[i] >> 31;           /* top bit */
 349
 350              ASSERT (timer_id == 1);
 351
 352              e = pool_elt_at_index (tm->test_elts, pool_index);
 353
 354              if (e->expected_to_expire != tm->single_wheel.current_tick)
 355                {
 356                  fformat (stdout, "[%d] expired at %d not %d\n",
 357                           e - tm->test_elts, tm->single_wheel.current_tick,
 358                           e->expected_to_expire);
 359                }
 360              pool_put (tm->test_elts, e);
 361            }
 362        }
 363
 364 We use wheel timers extensively in the vpp host stack. Each TCP session
 365 needs 5 timers, so supporting 10 million flows requires up to 50 million
 366 concurrent timers.
 367
 368 Timers rarely expire, so it’s of utmost importance that stopping and
 369 restarting a timer costs as few clock cycles as possible.
 370
 371 Stopping a timer costs a doubly-linked list dequeue. Starting a timer
 372 involves modular arithmetic to determine the correct timer wheel and
 373 slot, and a list head enqueue.
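
     A single-wheel toy makes the cost model concrete. This is a sketch
     only: the toy_* names, the fixed table of 16 timers, and the
     index-based links are assumptions, and it leaves out expiration
     processing and multi-wheel cascading entirely.

     .. code:: c

        #include <assert.h>
        #include <stdint.h>

        #define TOY_SLOTS 2048u   /* a power of two */
        #define TOY_MASK (TOY_SLOTS - 1u)
        #define TOY_MAX_TIMERS 16u
        #define TOY_NONE 0xFFFFFFFFu

        typedef struct
        {
          uint32_t prev, next;   /* neighbours in the slot list */
          uint32_t slot;         /* slot this timer is queued on */
        } toy_timer_t;

        typedef struct
        {
          uint32_t head[TOY_SLOTS];   /* first timer per slot, or TOY_NONE */
          toy_timer_t timers[TOY_MAX_TIMERS];
          uint32_t current_tick;
        } toy_wheel_t;

        void
        toy_wheel_init (toy_wheel_t *w)
        {
          for (uint32_t i = 0; i < TOY_SLOTS; i++)
            w->head[i] = TOY_NONE;
          w->current_tick = 0;
        }

        /* Start: modular arithmetic picks the slot, then a head enqueue. */
        void
        toy_timer_start (toy_wheel_t *w, uint32_t t, uint32_t interval)
        {
          toy_timer_t *tm;
          uint32_t slot;

          assert (t < TOY_MAX_TIMERS);
          assert (interval > 0 && interval < TOY_SLOTS);
          slot = (w->current_tick + interval) & TOY_MASK;
          tm = &w->timers[t];
          tm->slot = slot;
          tm->prev = TOY_NONE;
          tm->next = w->head[slot];
          if (tm->next != TOY_NONE)
            w->timers[tm->next].prev = t;
          w->head[slot] = t;
        }

        /* Stop: a constant-time doubly-linked list dequeue. */
        void
        toy_timer_stop (toy_wheel_t *w, uint32_t t)
        {
          toy_timer_t *tm;

          assert (t < TOY_MAX_TIMERS);
          tm = &w->timers[t];
          if (tm->prev != TOY_NONE)
            w->timers[tm->prev].next = tm->next;
          else
            w->head[tm->slot] = tm->next;
          if (tm->next != TOY_NONE)
            w->timers[tm->next].prev = tm->prev;
        }

     Neither operation walks a list, so the cost does not depend on how
     many timers are active.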
 374
 375 Expired timer processing generally involves bulk linked-list retirement
 376 with user callback presentation. There is some additional complexity at
 377 wheel wrap time, to relocate timers from slower-turning timer wheels
 378 into faster-turning wheels.
 379
 380 Format
 381 ------
 382
 383 Vppinfra format is roughly equivalent to printf.
 384
 385 Format has a few properties worth mentioning. Format’s first argument is
 386 a (u8 \*) vector to which it appends the result of the current format
 387 operation. Chaining calls is very easy:
 388
 389 .. code:: c
 390
 391        u8 * result;
 392
 393        result = format (0, "junk = %d, ", junk);
 394        result = format (result, "more junk = %d\n", more_junk);
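
     The same calling convention can be mimicked with the C standard
     library alone. toy_format() below is a hypothetical stand-in, not
     vppinfra's format: it appends to a heap-allocated, NUL-terminated C
     string, accepts NULL as the empty buffer, and returns the buffer,
     which may have moved.

     .. code:: c

        #include <assert.h>
        #include <stdarg.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        char *
        toy_format (char *s, const char *fmt, ...)
        {
          va_list args;
          size_t old_len = s ? strlen (s) : 0;
          int extra;

          /* First pass: measure how many characters will be appended. */
          va_start (args, fmt);
          extra = vsnprintf (NULL, 0, fmt, args);
          va_end (args);
          assert (extra >= 0);

          s = realloc (s, old_len + (size_t) extra + 1);
          assert (s != NULL);

          /* Second pass: write the new text after the existing text. */
          va_start (args, fmt);
          vsnprintf (s + old_len, (size_t) extra + 1, fmt, args);
          va_end (args);
          return s;
        }

     As with format, the result must be assigned back on every call:
     ``r = toy_format (r, ...)``.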
 395
 396 As previously noted, NULL pointers are perfectly proper 0-length
 397 vectors. Format returns a (u8 \*) vector, **not** a C-string. If you
 398 wish to print a (u8 \*) vector, use the “%v” format string. If you need
 399 a (u8 \*) vector which is also a proper C-string, either of these
 400 schemes may be used:
 401
 402 .. code:: c
 403
 404        vec_add1 (result, 0);
 405        /* or */
 406        result = format (result, "<whatever>%c", 0);
 407
 408 Remember to vec_free() the result if appropriate. Be careful not to pass
 409 format an uninitialized (u8 \*).
 410
 411 Format implements a particularly handy user-format scheme via the “%U”
 412 format specification. For example:
 413
 414 .. code:: c
 415
 416        u8 * format_junk (u8 * s, va_list *va)
 417        {
 418          char *junk = va_arg (*va, char *);
 419          s = format (s, "%s", junk);
 420          return s;
 421        }
 422
 423        result = format (0, "junk = %U", format_junk, "This is some junk");
 424
 425 format_junk() can invoke other user-format functions if desired. The
 426 programmer shoulders responsibility for argument type-checking. It is
 427 typical for user format functions to blow up spectacularly if the
 428 va_arg(va, type) macros don’t match the caller’s idea of reality.
 429
 430 Unformat
 431 --------
 432
 433 Vppinfra unformat is vaguely related to scanf, but considerably more
 434 general.
 435
 436 A typical use case involves initializing an unformat_input_t from either
 437 a C-string or a (u8 \*) vector, then parsing via unformat() as follows:
 438
 439 .. code:: c
 440
 441        unformat_input_t input;
 442        u8 *s = "<some-C-string>";
 443
 444        unformat_init_string (&input, (char *) s, strlen((char *) s));
 445        /* or */
 446        unformat_init_vector (&input, <u8-vector>);
 447
 448 Then loop parsing individual elements:
 449
 450 .. code:: c
 451
 452        while (unformat_check_input (&input) != UNFORMAT_END_OF_INPUT)
 453        {
 454          if (unformat (&input, "value1 %d", &value1))
 455            ;/* unformat sets value1 */
 456          else if (unformat (&input, "value2 %d", &value2))
 457            ;/* unformat sets value2 */
 458          else
 459            return clib_error_return (0, "unknown input '%U'",
 460                                      format_unformat_error, &input);
 461        }
 462
 463 As with format, unformat implements a user-unformat function capability
 464 via a “%U” user unformat function scheme. Generally, one can trivially
 465 transform ``format (s, "foo %d", foo)`` into
 466 ``unformat (input, "foo %d", &foo)``.
 467
 468 Unformat implements a couple of handy non-scanf-like format specifiers:
 469
 470 .. code:: c
 471
 472        unformat (input, "enable %=", &enable, 1 /* defaults to 1 */);
 473        unformat (input, "bitzero %|", &mask, (1<<0));
 474        unformat (input, "bitone %|", &mask, (1<<1));
 475        <etc>
 476
 477 The phrase “enable %=” means “set the supplied variable to the default
 478 value” if unformat parses the “enable” keyword all by itself. If
 479 unformat parses “enable 123”, it sets the supplied variable to 123.
 480
 481 We could clean up a number of hand-rolled “verbose” + “verbose %d”
 482 argument parsing code paths using “%=”.
 483
 484 The phrase “bitzero %\|” means “set the specified bit in the supplied
 485 bitmask” if unformat parses “bitzero”. Although it looks like it could
 486 be fairly handy, it’s very lightly used in the code base.
 487
 488 ``%_`` toggles whether or not to skip input white space.
 489
 490 On a transition from skip to no-skip in the middle of a format string,
 491 input white space is skipped. For example, the following:
 492
 493 .. code:: c
 494
 495    fmt = "%_%d.%d%_->%_%d.%d%_";
 496    unformat (input, fmt, &one, &two, &three, &four);
 497
 498 matches input “1.2 -> 3.4”. Without this, the space after -> does not
 499 get skipped.
 500
 501
 502 How to parse a single input line
 503 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 504
 505 Debug CLI command functions MUST NOT accidentally consume input
 506 belonging to other debug CLI commands. Otherwise, it's impossible to
 507 script a set of debug CLI commands which "work fine" when issued one
 508 at a time.
 509
 510 This bit of code is NOT correct:
 511
 512 .. code:: c
 513
 514      /* Eats script input NOT belonging to it, and chokes! */
 515      while (unformat_check_input (input) != UNFORMAT_END_OF_INPUT)
 516        {
 517          if (unformat (input, ...))
 518            ;
 519          else if (unformat (input, ...))
 520            ;
 521          else
 522            return clib_error_return (0, "parse error: '%U'",
 523                                      format_unformat_error, input);
 524        }
 526
 527 When executed as part of a script, such a function will return “parse
 528 error: ‘’” every time, unless it happens to be the last command in the
 529 script.
 530
 531 Instead, use “unformat_line_input” to consume the rest of a line’s worth
 532 of input - everything past the path specified in the VLIB_CLI_COMMAND
 533 declaration.
 534
 535 For example, unformat_line_input with “my_command” set up as shown below
 536 and user input “my path is clear” will produce an unformat_input_t that
 537 contains “is clear”.
 538
 539 .. code:: c
 540
 541        VLIB_CLI_COMMAND (...) = {
 542            .path = "my path",
 543        };
 544
 545 Here’s a bit of code which shows the required mechanics, in full:
 546
 547 .. code:: c
 548
 549        static clib_error_t *
 550        my_command_fn (vlib_main_t * vm,
 551                       unformat_input_t * input,
 552                       vlib_cli_command_t * cmd)
 553        {
 554          unformat_input_t _line_input, *line_input = &_line_input;
 555          u32 this, that;
 556          clib_error_t *error = 0;
 557
 558          if (!unformat_user (input, unformat_line_input, line_input))
 559            return 0;
 560
 561          /*
 562           * Here, UNFORMAT_END_OF_INPUT is at the end of the line we consumed,
 563           * not at the end of the script...
 564           */
 565          while (unformat_check_input (line_input) != UNFORMAT_END_OF_INPUT)
 566            {
 567               if (unformat (line_input, "this %u", &this))
 568                 ;
 569               else if (unformat (line_input, "that %u", &that))
 570                 ;
 571               else
 572                 {
 573                   error = clib_error_return (0, "parse error: '%U'",
 574                                     format_unformat_error, line_input);
 575                   goto done;
 576                 }
 577              }
 578
 579          /* <do something based on "this" and "that", etc> */
 580
 581        done:
 582          unformat_free (line_input);
 583          return error;
 584        }
 585       VLIB_CLI_COMMAND (my_command, static) = {
 586         .path = "my path",
 587         .function = my_command_fn,
 588       };
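
     The line-at-a-time idea can also be sketched without vppinfra. The
     toy_take_line() helper below is hypothetical: it peels the first line
     off a multi-line script so that a command parser sees only its own
     arguments, and leaves the rest of the script for the next command.

     .. code:: c

        #include <assert.h>
        #include <stddef.h>
        #include <string.h>

        /* Copy the first line of *script (without the newline) into
           line[], and advance *script to the start of the next line.
           Returns the number of characters copied. */
        size_t
        toy_take_line (const char **script, char *line, size_t line_size)
        {
          size_t n = strcspn (*script, "\n");

          assert (line_size > 0 && n < line_size);
          memcpy (line, *script, n);
          line[n] = '\0';
          *script += n;
          if (**script == '\n')
            (*script)++;
          return n;
        }

     A parser that loops until the end of ``line`` cannot run on into the
     following command, which is the property the broken example above
     lacks.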
 589
 590 Vppinfra errors and warnings
 591 ----------------------------
 592
 593 Many functions within the vpp dataplane have return-values of type
 594 clib_error_t \*. Clib_error_t’s are arbitrary strings with a bit of
 595 metadata [fatal, warning] and are easy to announce. Returning a NULL
 596 clib_error_t \* indicates “A-OK, no error.”
 597
 598 Clib_warning(format-args) is a handy way to add debugging output; clib
 599 warnings prepend function:line info to unambiguously locate the message
 600 source. Clib_unix_warning() adds perror()-style Linux system-call
 601 information. In production images, clib_warnings result in syslog
 602 entries.
 603
 604 Serialization
 605 -------------
 606
 607 Vppinfra serialization support allows the programmer to easily serialize
 608 and unserialize complex data structures.
 609
 610 The underlying primitive serialize/unserialize functions use network
 611 byte-order, so there are no structural issues serializing on a
 612 little-endian host and unserializing on a big-endian host.
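
     Network byte order means most significant byte first. A minimal
     sketch of what that implies for a 32-bit value is shown below; the
     toy_* helpers are hypothetical and are not the vppinfra serialize API.

     .. code:: c

        #include <assert.h>
        #include <stddef.h>
        #include <stdint.h>

        /* Write value into out[0..3], most significant byte first. */
        void
        toy_serialize_u32 (uint8_t *out, uint32_t value)
        {
          assert (out != NULL);
          out[0] = (uint8_t) (value >> 24);
          out[1] = (uint8_t) (value >> 16);
          out[2] = (uint8_t) (value >> 8);
          out[3] = (uint8_t) value;
        }

        /* Rebuild the value from in[0..3], whatever the host byte order. */
        uint32_t
        toy_unserialize_u32 (const uint8_t *in)
        {
          assert (in != NULL);
          return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16)
            | ((uint32_t) in[2] << 8) | (uint32_t) in[3];
        }

     Because the code uses shifts rather than copying memory, the byte
     stream is identical on little-endian and big-endian hosts.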