Bounded-index Extensible Hashing (bihash)
=========================================
Vpp uses bounded-index extensible hashing to solve a variety of
exact-match (key, value) lookup problems. Benefits of the current
implementation:

* Very high record count scaling, tested to 100,000,000 records.
* Lookup performance degrades gracefully as the number of records increases
* No reader locking required
* Template implementation: it's easy to support arbitrary (key, value) types

Bounded-index extensible hashing has been widely used in databases for
decades.
Bihash uses a two-level data structure:

```
+-----------------+
| bucket-0        |
|  log2_size      |
|  backing store  |
+-----------------+
| bucket-1        |
|  log2_size      |           +--------------------------------+
|  backing store  | --------> | KVP_PER_PAGE * key-value-pairs |
+-----------------+           | page 0                         |
       ...                    +--------------------------------+
+-----------------+           | KVP_PER_PAGE * key-value-pairs |
| bucket-2**N-1   |           | page 1                         |
|  log2_size      |           +--------------------------------+
|  backing store  |                          ...
+-----------------+           +--------------------------------+
                              | KVP_PER_PAGE * key-value-pairs |
                              | page 2**log2_size - 1          |
                              +--------------------------------+
```
Discussion of the algorithm
---------------------------

This structure has a couple of major advantages. In practice, each
bucket entry fits into a 64-bit integer. Coincidentally, vpp's target
CPU architectures support 64-bit atomic operations. When modifying the
contents of a specific bucket, we do the following:

* Make a working copy of the bucket's backing storage
* Atomically swap a pointer to the working copy into the bucket array
* Change the original backing store data
* Atomically swap back to the original

So, no reader locking is required to search a bihash table.
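The update sequence above can be sketched in portable C11. This is not the vppinfra code: the miniature bucket layout, the names, and the plain `atomic_store` calls are simplifying assumptions (the real implementation packs the pointer and metadata into a single 64-bit word, and must not free a working copy while a reader might still hold a pointer to it).

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature bucket: just an atomic pointer to one page of
   key-value pairs. Illustrative only, not the vppinfra layout. */
typedef struct { long key, value; } kvp_t;

typedef struct
{
  _Atomic (kvp_t *) pages; /* readers dereference this without locks */
  int n_kvp;
} bucket_t;

/* Copy-on-write bucket update: readers always see either the working
   copy (old contents) or the fully-updated original, never a torn
   half-written page. */
void
bucket_cow_update (bucket_t *b, long key, long value)
{
  kvp_t *original = atomic_load (&b->pages);

  /* 1. Make a working copy of the bucket's backing storage. */
  kvp_t *working = malloc (b->n_kvp * sizeof (kvp_t));
  memcpy (working, original, b->n_kvp * sizeof (kvp_t));

  /* 2. Atomically swap the working copy into the bucket. */
  atomic_store (&b->pages, working);

  /* 3. Change the original backing store data. */
  original[0].key = key;
  original[0].value = value;

  /* 4. Atomically swap back to the original. */
  atomic_store (&b->pages, original);

  /* Simplification: a real implementation must defer this free until
     no reader can still reference the working copy. */
  free (working);
}
```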
At lookup time, the implementation computes a key hash code. We use
the least-significant N bits of the hash to select the bucket.

With the bucket in hand, we learn log2 (nBackingPages) for the
selected bucket. At this point, we use the next log2_size bits from
the hash code to select the specific backing page in which the
(key,value) pair will be found.

Net result: we search **one** backing page, not 2**log2_size
pages. This is a key property of the algorithm.
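The two bit-slices can be illustrated with a short sketch; the function names and parameters are illustrative, not the vppinfra ones:

```c
#include <stdint.h>

/* Select the bucket from the least-significant bucket_log2 hash bits,
   assuming 2**bucket_log2 buckets. */
static inline uint64_t
bucket_index (uint64_t hash, int bucket_log2)
{
  return hash & ((1ULL << bucket_log2) - 1);
}

/* Select the backing page within the bucket from the next log2_size
   hash bits, assuming 2**log2_size pages. */
static inline uint64_t
page_index (uint64_t hash, int bucket_log2, int log2_size)
{
  return (hash >> bucket_log2) & ((1ULL << log2_size) - 1);
}
```

When a bucket doubles, its log2_size increments and one additional hash bit comes into play, which is why each record still lands on exactly one page after a split.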
When sufficient collisions occur to fill the backing pages for a given
bucket, we double the bucket size, rehash, and deal the bucket
contents into a double-sized set of backing pages. In the future, we
may represent the size as a linear combination of two powers-of-two,
to increase space efficiency.

To solve the "jackpot case" where a set of records collide under
hashing in a bad way, the implementation will fall back to linear
search across 2**log2_size backing pages on a per-bucket basis.
To maintain *space* efficiency, we should configure the bucket array
so that backing pages are effectively utilized. Lookup performance
tends to change *very little* if the bucket array is too small or too
large.

Bihash depends on selecting an effective hash function. If one were to
use a truly broken hash function such as "return 1ULL;", bihash would
still work, but it would be equivalent to poorly-programmed linear
search.
We often use CPU intrinsic functions - think crc32 - to rapidly
compute a hash code which has decent statistics.
### Using current (key,value) template instance types

It's quite easy to use one of the template instance types. As of this
writing, .../src/vppinfra provides pre-built templates for 8, 16, 20,
24, 40, and 48 byte keys, u8 * vector keys, and 8 byte values.

See .../src/vppinfra/{bihash_<key-size>_8}.h
To define the data types, #include a specific template instance, most
often in a subsystem header file:

```c
#include <vppinfra/bihash_8_8.h>
```
If you're building a standalone application, you'll need to define the
various functions by #including the method implementation file in a C
source file:

```c
#include <vppinfra/bihash_template.c>
```

The core vpp engine currently uses most if not all of the known bihash
types, so you probably won't need to #include the method
implementation file.
Add an instance of the selected bihash data structure to, e.g., a
"main_t" structure:

```c
typedef struct
{
  ...
  BVT (clib_bihash) hash;
  /* or */
  clib_bihash_8_8_t hash;
  ...
} my_main_t;
```
The BV macro concatenates its argument with the value of the
preprocessor symbol BIHASH_TYPE. The BVT macro concatenates its
argument with the value of BIHASH_TYPE and the fixed string "_t". So
in the above example, BVT (clib_bihash) generates "clib_bihash_8_8_t".
If you're sure you won't decide to change the template / type name
later, it's perfectly OK to code "clib_bihash_8_8_t" and so forth.

In fact, if you #include multiple template instances in a single
source file, you **must** use fully-enumerated type names. The macros
stand no chance of working.
### Initializing a bihash table

Call the init function as shown. As a rough guide, pick a number of
buckets which is approximately
number_of_expected_records/BIHASH_KVP_PER_PAGE from the relevant
template instance header-file. See previous discussion.
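As a sketch of that arithmetic (BIHASH_KVP_PER_PAGE is 4 in several template instances, but check the header you actually use; the helper name is illustrative, not a vppinfra API):

```c
#include <stdint.h>

/* Rough bucket-count estimate: one bucket per BIHASH_KVP_PER_PAGE
   expected records, rounded up to a power of two, since bihash bucket
   counts are powers of two. */
static uint64_t
pick_nbuckets (uint64_t expected_records, uint64_t kvp_per_page)
{
  uint64_t n = expected_records / kvp_per_page;
  uint64_t p = 1;
  while (p < n)
    p <<= 1; /* round up to the next power of two */
  return p;
}
```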
The amount of memory selected should easily contain all of the
records, with a generous allowance for hash collisions. Bihash memory
is allocated separately from the main heap, and won't cost anything
except kernel PTE's until touched, so it's OK to be reasonably
generous in sizing it.
```c
my_main_t *mm = &my_main;
clib_bihash_8_8_t *h;

h = &mm->hash_table;

clib_bihash_init_8_8 (h, "test", (u32) number_of_buckets,
                      (uword) memory_size);
```
### Add or delete a key/value pair

Use BV(clib_bihash_add_del), or the explicit type variant:
```c
clib_bihash_kv_8_8_t kv;
my_main_t *mm = &my_main;
clib_bihash_8_8_t *h;

h = &mm->hash_table;

kv.key = key_to_add_or_delete;
kv.value = value_to_add_or_delete;

clib_bihash_add_del_8_8 (h, &kv, is_add /* 1=add, 0=delete */);
```
In the delete case, kv.value is irrelevant. To change the value associated
with an existing (key,value) pair, simply re-add the [new] pair.
### Simple searches

The simplest possible (key, value) search goes like so:
```c
clib_bihash_kv_8_8_t search_kv, return_kv;
my_main_t *mm = &my_main;
clib_bihash_8_8_t *h;

h = &mm->hash_table;

search_kv.key = key_to_search_for;

if (clib_bihash_search_8_8 (h, &search_kv, &return_kv) < 0)
  key_not_found ();
else
  key_found ();
```
Note that it's perfectly fine to collect the lookup result in the
search (key,value) pair itself:

```c
if (clib_bihash_search_8_8 (h, &search_kv, &search_kv) < 0)
  key_not_found ();
else
  key_found ();
```
### Bihash vector processing

When processing a vector of packets which need a certain lookup
performed, it's worth the trouble to compute the key hash, and
prefetch the correct bucket ahead of time.

Here's a sketch of one way to write the required code:
Dual-loop:

* 6 packets ahead, prefetch 2x vlib_buffer_t's and 2x packet data
  required to form the record keys
* 4 packets ahead, form 2x record keys and call BV(clib_bihash_hash)
  or the explicit hash function to calculate the record hashes.
  Call 2x BV(clib_bihash_prefetch_bucket) to prefetch the buckets
* 2 packets ahead, call 2x BV(clib_bihash_prefetch_data) to prefetch
  2x (key,value) data pages.
* In the processing section, call 2x BV(clib_bihash_search_inline_with_hash)
  to perform the search
Programmer's choice whether to stash the hash code somewhere in
vnet_buffer(b) metadata, or to use local variables.

Single-loop:

* Use simple search as shown above.
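The staged-prefetch idea can be shown with a toy, self-contained sketch. Here a plain array stands in for the bihash table and `__builtin_prefetch` stands in for BV(clib_bihash_prefetch_bucket) / BV(clib_bihash_prefetch_data); all names are illustrative assumptions:

```c
#include <stdint.h>

#define TABLE_SLOTS 1024

static uint64_t
slot_of (uint64_t hash)
{
  return hash & (TABLE_SLOTS - 1);
}

/* Pipelined lookup: a couple of iterations ahead of the one being
   processed, issue a prefetch for the slot that iteration will read,
   so the memory access overlaps with current work. */
void
lookup_vector (const uint64_t *hashes, uint64_t *results, int n,
               const uint64_t table[TABLE_SLOTS])
{
  for (int i = 0; i < n; i++)
    {
      /* two lookups ahead: warm the cache line we'll need shortly */
      if (i + 2 < n)
        __builtin_prefetch (&table[slot_of (hashes[i + 2])]);

      results[i] = table[slot_of (hashes[i])];
    }
}
```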
### Walking a bihash table

A fairly common scenario to build "show" commands involves walking a
bihash table. It's simple enough:
```c
my_main_t *mm = &my_main;
clib_bihash_8_8_t *h;
void callback_fn (clib_bihash_kv_8_8_t *, void *);

h = &mm->hash_table;

BV (clib_bihash_foreach_key_value_pair) (h, callback_fn, (void *) arg);
```
To nobody's great surprise: clib_bihash_foreach_key_value_pair
iterates across the entire table, calling callback_fn with active
entries.
#### Bihash table iteration safety

The iterator template "clib_bihash_foreach_key_value_pair" must be
used with a certain amount of care. For one thing, the iterator
template does _not_ take the bihash hash table writer lock. If your
use-case requires it, lock the table.

For another, the iterator template is not safe under all conditions:
* It's __OK to delete__ bihash table entries during a table-walk. The
  iterator checks whether the current bucket has been freed after each
  _callback_fn(...)_ invocation.

* It is __not OK to add__ entries during a table-walk.
The add-during-walk case involves a jackpot: while processing a
key-value-pair in a particular bucket, the callback adds a certain
number of entries. By bad luck, one or more of the added entries
causes the __current bucket__ to split-and-rehash.

Since we rehash KVP's to different pages based on what amounts to a
different hash function, either of these things can go wrong:
* We may revisit previously-visited entries. Depending on how one
  coded the use-case, we could end up in a recursive-add situation.

* We may skip entries that have not been visited.
One could build an add-safe iterator, at a significant cost in
performance: copy the entire bucket, and walk the copy.

It's hard to imagine a worthwhile add-during-walk use-case in the
first place, let alone one which couldn't be implemented by walking
the table without modifying it, then adding a set of records.
### Creating a new template instance

Creating a new template is easy. Use one of the existing templates as
a model, and make the obvious changes. The hash and key_compare
methods are performance-critical in multiple senses.

If the key compare method is slow, every lookup will be slow. If the
hash function is slow, same story. If the hash function has poor
statistical properties, space efficiency will suffer. In the limit, a
bad enough hash function will cause large portions of the table to
revert to linear search.

Use of the best available vector unit is well worth the trouble in the
hash and key_compare functions.