docs/developer/corearchitecture/bihash.rst

   1 Bounded-index Extensible Hashing (bihash)
   2 =========================================
   3
   4 Vpp uses bounded-index extensible hashing to solve a variety of
   5 exact-match (key, value) lookup problems. Benefits of the current
   6 implementation:
   7
   8 -  Very high record count scaling, tested to 100,000,000 records.
   9 -  Lookup performance degrades gracefully as the number of records
  10    increases
  11 -  No reader locking required
  12 -  Template implementation, it’s easy to support arbitrary (key,value)
  13    types
  14
  15 Bounded-index extensible hashing has been widely used in databases for
  16 decades.
  17
  18 Bihash uses a two-level data structure:
  19
  20 ::
  21
  22        +-----------------+
  23        | bucket-0        |
  24        |  log2_size      |
  25        |  backing store  |
  26        +-----------------+
  27        | bucket-1        |
  28        |  log2_size      |           +--------------------------------+
  29        |  backing store  | --------> | KVP_PER_PAGE * key-value-pairs |
  30        +-----------------+           | page 0                         |
  31             ...                      +--------------------------------+
  32        +-----------------+           | KVP_PER_PAGE * key-value-pairs |
  33        | bucket-2**N-1   |           | page 1                         |
  34        |  log2_size      |           +--------------------------------+
  35        |  backing store  |                       ---
  36        +-----------------+           +--------------------------------+
  37                                      | KVP_PER_PAGE * key-value-pairs |
  38                                      | page 2**(log2(size)) - 1       |
  39                                      +--------------------------------+
  40
  41 Discussion of the algorithm
  42 ---------------------------
  43
  44 This structure has a couple of major advantages. In practice, each
  45 bucket entry fits into a 64-bit integer. Coincidentally, vpp’s target
  46 CPU architectures support 64-bit atomic operations. When modifying the
  47 contents of a specific bucket, we do the following:
  48
  49 -  Make a working copy of the bucket’s backing storage
  50 -  Atomically swap a pointer to the working copy into the bucket array
  51 -  Change the original backing store data
  52 -  Atomically swap back to the original
  53
  54 So, no reader locking is required to search a bihash table.
  55
  56 At lookup time, the implementation computes a key hash code. We use the
  57 least-significant N bits of the hash to select the bucket.
  58
  59 With the bucket in hand, we learn log2 (nBackingPages) for the selected
  60 bucket. At this point, we use the next log2_size bits from the hash code
  61 to select the specific backing page in which the (key,value) page will
  62 be found.
  63
  64 Net result: we search **one** backing page, not 2**log2_size pages. This
  65 is a key property of the algorithm.
  66
  67 When sufficient collisions occur to fill the backing pages for a given
  68 bucket, we double the bucket size, rehash, and deal the bucket contents
  69 into a double-sized set of backing pages. In the future, we may
  70 represent the size as a linear combination of two powers-of-two, to
  71 increase space efficiency.
  72
  73 To solve the “jackpot case” where a set of records collide under hashing
  74 in a bad way, the implementation will fall back to linear search across
  75 2**log2_size backing pages on a per-bucket basis.
  76
  77 To maintain *space* efficiency, we should configure the bucket array so
  78 that backing pages are effectively utilized. Lookup performance tends to
  79 change *very little* if the bucket array is too small or too large.
  80
  81 Bihash depends on selecting an effective hash function. If one were to
  82 use a truly broken hash function such as “return 1ULL.” bihash would
  83 still work, but it would be equivalent to poorly-programmed linear
  84 search.
  85
  86 We often use cpu intrinsic functions - think crc32 - to rapidly compute
  87 a hash code which has decent statistics.
  88
  89 Bihash Cookbook
  90 ---------------
  91
  92 Using current (key,value) template instance types
  93 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  94
  95 It’s quite easy to use one of the template instance types. As of this
  96 writing, …/src/vppinfra provides pre-built templates for 8, 16, 20, 24,
  97 40, and 48 byte keys, u8 \* vector keys, and 8 byte values.
  98
  99 See …/src/vppinfra/{bihash\_\_8}.h
 100
 101 To define the data types, #include a specific template instance, most
 102 often in a subsystem header file:
 103
 104 .. code:: c
 105
 106         #include <vppinfra/bihash_8_8.h>
 107
 108 If you’re building a standalone application, you’ll need to define the
 109 various functions by #including the method implementation file in a C
 110 source file.
 111
 112 The core vpp engine currently uses most if not all of the known bihash
 113 types, so you probably won’t need to #include the method implementation
 114 file.
 115
 116 .. code:: c
 117
 118         #include <vppinfra/bihash_template.c>
 119
 120 Add an instance of the selected bihash data structure to e.g. a “main_t”
 121 structure:
 122
 123 .. code:: c
 124
 125        typedef struct
 126        {
 127          ...
 128          BVT (clib_bihash) hash_table;
 129          or
 130          clib_bihash_8_8_t hash_table;
 131          ...
 132        } my_main_t;
 133
 134 The BV macro concatenate its argument with the value of the preprocessor
 135 symbol BIHASH_TYPE. The BVT macro concatenates its argument with the
 136 value of BIHASH_TYPE and the fixed-string “_t”. So in the above example,
 137 BVT (clib_bihash) generates “clib_bihash_8_8_t”.
 138
 139 If you’re sure you won’t decide to change the template / type name
 140 later, it’s perfectly OK to code “clib_bihash_8_8_t” and so forth.
 141
 142 In fact, if you #include multiple template instances in a single source
 143 file, you **must** use fully-enumerated type names. The macros stand no
 144 chance of working.
 145
 146 Initializing a bihash table
 147 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 148
 149 Call the init function as shown. As a rough guide, pick a number of
 150 buckets which is approximately
 151 number_of_expected_records/BIHASH_KVP_PER_PAGE from the relevant
 152 template instance header-file. See previous discussion.
 153
 154 The amount of memory selected should easily contain all of the records,
 155 with a generous allowance for hash collisions. Bihash memory is
 156 allocated separately from the main heap, and won’t cost anything except
 157 kernel PTE’s until touched, so it’s OK to be reasonably generous.
 158
 159 For example:
 160
 161 .. code:: c
 162
 163        my_main_t *mm = &my_main;
 164        clib_bihash_8_8_t *h;
 165
 166        h = &mm->hash_table;
 167
 168        clib_bihash_init_8_8 (h, "test", (u32) number_of_buckets,
 169                               (uword) memory_size);
 170
 171 Add or delete a key/value pair
 172 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 173
 174 Use BV(clib_bihash_add_del), or the explicit type variant:
 175
 176 .. code:: c
 177
 178       clib_bihash_kv_8_8_t kv;
 179       clib_bihash_8_8_t * h;
 180       my_main_t *mm = &my_main;
 181       clib_bihash_8_8_t *h;
 182
 183       h = &mm->hash_table;
 184       kv.key = key_to_add_or_delete;
 185       kv.value = value_to_add_or_delete;
 186
 187       clib_bihash_add_del_8_8 (h, &kv, is_add /* 1=add, 0=delete */);
 188
 189 In the delete case, kv.value is irrelevant. To change the value
 190 associated with an existing (key,value) pair, simply re-add the [new]
 191 pair.
 192
 193 Simple search
 194 ~~~~~~~~~~~~~
 195
 196 The simplest possible (key, value) search goes like so:
 197
 198 .. code:: c
 199
 200       clib_bihash_kv_8_8_t search_kv, return_kv;
 201       clib_bihash_8_8_t * h;
 202       my_main_t *mm = &my_main;
 203       clib_bihash_8_8_t *h;
 204
 205       h = &mm->hash_table;
 206       search_kv.key = key_to_add_or_delete;
 207
 208       if (clib_bihash_search_8_8 (h, &search_kv, &return_kv) < 0)
 209         key_not_found();
 210       else
 211         key_found();
 212
 213 Note that it’s perfectly fine to collect the lookup result
 214
 215 .. code:: c
 216
 217       if (clib_bihash_search_8_8 (h, &search_kv, &search_kv))
 218         key_not_found();
 219       etc.
 220
 221 Bihash vector processing
 222 ~~~~~~~~~~~~~~~~~~~~~~~~
 223
 224 When processing a vector of packets which need a certain lookup
 225 performed, it’s worth the trouble to compute the key hash, and prefetch
 226 the correct bucket ahead of time.
 227
 228 Here’s a sketch of one way to write the required code:
 229
 230 Dual-loop: \* 6 packets ahead, prefetch 2x vlib_buffer_t’s and 2x packet
 231 data required to form the record keys \* 4 packets ahead, form 2x record
 232 keys and call BV(clib_bihash_hash) or the explicit hash function to
 233 calculate the record hashes. Call 2x BV(clib_bihash_prefetch_bucket) to
 234 prefetch the buckets \* 2 packets ahead, call 2x
 235 BV(clib_bihash_prefetch_data) to prefetch 2x (key,value) data pages. \*
 236 In the processing section, call 2x
 237 BV(clib_bihash_search_inline_with_hash) to perform the search
 238
 239 Programmer’s choice whether to stash the hash code somewhere in
 240 vnet_buffer(b) metadata, or to use local variables.
 241
 242 Single-loop: \* Use simple search as shown above.
 243
 244 Walking a bihash table
 245 ~~~~~~~~~~~~~~~~~~~~~~
 246
 247 A fairly common scenario to build “show” commands involves walking a
 248 bihash table. It’s simple enough:
 249
 250 .. code:: c
 251
 252       my_main_t *mm = &my_main;
 253       clib_bihash_8_8_t *h;
 254       void callback_fn (clib_bihash_kv_8_8_t *, void *);
 255
 256       h = &mm->hash_table;
 257
 258       BV(clib_bihash_foreach_key_value_pair) (h, callback_fn, (void *) arg);
 259
 260 To nobody’s great surprise: clib_bihash_foreach_key_value_pair iterates
 261 across the entire table, calling callback_fn with active entries.
 262
 263 Bihash table iteration safety
 264 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 265
 266 The iterator template “clib_bihash_foreach_key_value_pair” must be used
 267 with a certain amount of care. For one thing, the iterator template does
 268 *not* take the bihash hash table writer lock. If your use-case requires
 269 it, lock the table.
 270
 271 For another, the iterator template is not safe under all conditions:
 272
 273 -  It’s **OK to delete** bihash table entries during a table-walk. The
 274    iterator checks whether the current bucket has been freed after each
 275    *callback_fn(…)* invocation.
 276
 277 -  It is **not OK to add** entries during a table-walk.
 278
 279 The add-during-walk case involves a jackpot: while processing a
 280 key-value-pair in a particular bucket, add a certain number of entries.
 281 By luck, assume that one or more of the added entries causes the
 282 **current bucket** to split-and-rehash.
 283
 284 Since we rehash KVP’s to different pages based on what amounts to a
 285 different hash function, either of these things can go wrong:
 286
 287 -  We may revisit previously-visited entries. Depending on how one coded
 288    the use-case, we could end up in a recursive-add situation.
 289
 290 -  We may skip entries that have not been visited
 291
 292 One could build an add-safe iterator, at a significant cost in
 293 performance: copy the entire bucket, and walk the copy.
 294
 295 It’s hard to imagine a worthwhile add-during walk use-case in the first
 296 place; let alone one which couldn’t be implemented by walking the table
 297 without modifying it, then adding a set of records.
 298
 299 Creating a new template instance
 300 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 301
 302 Creating a new template is easy. Use one of the existing templates as a
 303 model, and make the obvious changes. The hash and key_compare methods
 304 are performance-critical in multiple senses.
 305
 306 If the key compare method is slow, every lookup will be slow. If the
 307 hash function is slow, same story. If the hash function has poor
 308 statistical properties, space efficiency will suffer. In the limit, a
 309 bad enough hash function will cause large portions of the table to
 310 revert to linear search.
 311
 312 Use of the best available vector unit is well worth the trouble in the
 313 hash and key_compare functions.