Bounded-index Extensible Hashing (bihash)
=========================================
Vpp uses bounded-index extensible hashing to solve a variety of
exact-match (key, value) lookup problems. Benefits of the current
implementation:

* Very high record count scaling, tested to 100,000,000 records.
* Lookup performance degrades gracefully as the number of records increases
* No reader locking required
* Template implementation: it's easy to support arbitrary (key, value) types

Bounded-index extensible hashing has been widely used in databases for
decades.
Bihash uses a two-level data structure:

```
+-----------------+
| bucket-0        |
|  log2_size      |
|  backing store  |
+-----------------+
| bucket-1        |
|  log2_size      |           +--------------------------------+
|  backing store  | --------> | KVP_PER_PAGE * key-value-pairs |
+-----------------+           | page 0                         |
       ...                    +--------------------------------+
+-----------------+           | KVP_PER_PAGE * key-value-pairs |
| bucket-2**N-1   |           | page 1                         |
|  log2_size      |           +--------------------------------+
|  backing store  |                          ...
+-----------------+           +--------------------------------+
                              | KVP_PER_PAGE * key-value-pairs |
                              | page 2**log2_size - 1          |
                              +--------------------------------+
```
Discussion of the algorithm
---------------------------

This structure has a couple of major advantages. In practice, each
bucket entry fits into a 64-bit integer. Coincidentally, vpp's target
CPU architectures support 64-bit atomic operations. When modifying the
contents of a specific bucket, we do the following:

* Make a working copy of the bucket's backing storage
* Atomically swap a pointer to the working copy into the bucket array
* Change the original backing store data
* Atomically swap back to the original

So, no reader locking is required to search a bihash table.
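The update sequence above can be sketched in portable C11. This is not the vppinfra code: the miniature bucket layout, the names, and the plain `atomic_store` calls are simplifying assumptions (the real implementation packs the pointer and metadata into a single 64-bit word, and must not free a working copy while a reader might still hold a pointer to it).

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature bucket: just an atomic pointer to one page of
   key-value pairs. Illustrative only, not the vppinfra layout. */
typedef struct { long key, value; } kvp_t;

typedef struct
{
  _Atomic (kvp_t *) pages; /* readers dereference this without locks */
  int n_kvp;
} bucket_t;

/* Copy-on-write bucket update: readers always see either the working
   copy (old contents) or the fully-updated original, never a torn
   half-written page. */
void
bucket_cow_update (bucket_t *b, long key, long value)
{
  kvp_t *original = atomic_load (&b->pages);

  /* 1. Make a working copy of the bucket's backing storage. */
  kvp_t *working = malloc (b->n_kvp * sizeof (kvp_t));
  memcpy (working, original, b->n_kvp * sizeof (kvp_t));

  /* 2. Atomically swap the working copy into the bucket. */
  atomic_store (&b->pages, working);

  /* 3. Change the original backing store data. */
  original[0].key = key;
  original[0].value = value;

  /* 4. Atomically swap back to the original. */
  atomic_store (&b->pages, original);

  /* Simplification: a real implementation must defer this free until
     no reader can still reference the working copy. */
  free (working);
}
```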
At lookup time, the implementation computes a key hash code. We use
the least-significant N bits of the hash to select the bucket.

With the bucket in hand, we learn log2 (nBackingPages) for the
selected bucket. At this point, we use the next log2_size bits from
the hash code to select the specific backing page in which the
(key,value) pair will be found.

Net result: we search **one** backing page, not 2**log2_size
pages. This is a key property of the algorithm.
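The two bit-slices can be illustrated with a short sketch; the function names and parameters are illustrative, not the vppinfra ones:

```c
#include <stdint.h>

/* Select the bucket from the least-significant bucket_log2 hash bits,
   assuming 2**bucket_log2 buckets. */
static inline uint64_t
bucket_index (uint64_t hash, int bucket_log2)
{
  return hash & ((1ULL << bucket_log2) - 1);
}

/* Select the backing page within the bucket from the next log2_size
   hash bits, assuming 2**log2_size pages. */
static inline uint64_t
page_index (uint64_t hash, int bucket_log2, int log2_size)
{
  return (hash >> bucket_log2) & ((1ULL << log2_size) - 1);
}
```

When a bucket doubles, its log2_size increments and one additional hash bit comes into play, which is why each record still lands on exactly one page after a split.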
When sufficient collisions occur to fill the backing pages for a given
bucket, we double the bucket size, rehash, and deal the bucket
contents into a double-sized set of backing pages. In the future, we
may represent the size as a linear combination of two powers-of-two,
to increase space efficiency.

To solve the "jackpot case" where a set of records collide under
hashing in a bad way, the implementation will fall back to linear
search across 2**log2_size backing pages on a per-bucket basis.
To maintain *space* efficiency, we should configure the bucket array
so that backing pages are effectively utilized. Lookup performance
tends to change *very little* if the bucket array is too small or too
large.

Bihash depends on selecting an effective hash function. If one were to
use a truly broken hash function such as "return 1ULL;", bihash would
still work, but it would be equivalent to poorly-programmed linear
search.
We often use CPU intrinsic functions - think crc32 - to rapidly
compute a hash code which has decent statistics.
### Using current (key,value) template instance types

It's quite easy to use one of the template instance types. As of this
writing, .../src/vppinfra provides pre-built templates for 8, 16, 20,
24, 40, and 48 byte keys, u8 * vector keys, and 8 byte values.

See .../src/vppinfra/{bihash_<key-size>_8}.h
To define the data types, #include a specific template instance, most
often in a subsystem header file:

```c
#include <vppinfra/bihash_8_8.h>
```
If you're building a standalone application, you'll need to define the
various functions by #including the method implementation file in a C
source file:

```c
#include <vppinfra/bihash_template.c>
```

The core vpp engine currently uses most if not all of the known bihash
types, so you probably won't need to #include the method
implementation file.
Add an instance of the selected bihash data structure to, e.g., a
"main_t" structure:

```c
typedef struct
{
  ...
  BVT (clib_bihash) hash;
  /* or */
  clib_bihash_8_8_t hash;
  ...
} my_main_t;
```
The BV macro concatenates its argument with the value of the
preprocessor symbol BIHASH_TYPE. The BVT macro concatenates its
argument with the value of BIHASH_TYPE and the fixed string "_t". So
in the above example, BVT (clib_bihash) generates "clib_bihash_8_8_t".
If you're sure you won't decide to change the template / type name
later, it's perfectly OK to code "clib_bihash_8_8_t" and so forth.

In fact, if you #include multiple template instances in a single
source file, you **must** use fully-enumerated type names. The macros
stand no chance of working.
### Initializing a bihash table

Call the init function as shown. As a rough guide, pick a number of
buckets which is approximately
number_of_expected_records/BIHASH_KVP_PER_PAGE from the relevant
template instance header-file. See previous discussion.
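As a sketch of that arithmetic (BIHASH_KVP_PER_PAGE is 4 in several template instances, but check the header you actually use; the helper name is illustrative, not a vppinfra API):

```c
#include <stdint.h>

/* Rough bucket-count estimate: one bucket per BIHASH_KVP_PER_PAGE
   expected records, rounded up to a power of two, since bihash bucket
   counts are powers of two. */
static uint64_t
pick_nbuckets (uint64_t expected_records, uint64_t kvp_per_page)
{
  uint64_t n = expected_records / kvp_per_page;
  uint64_t p = 1;
  while (p < n)
    p <<= 1; /* round up to the next power of two */
  return p;
}
```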
The amount of memory selected should easily contain all of the
records, with a generous allowance for hash collisions. Bihash memory
is allocated separately from the main heap, and won't cost anything
except kernel PTE's until touched, so it's OK to be reasonably
generous in sizing it.
```c
my_main_t *mm = &my_main;
clib_bihash_8_8_t *h;

h = &mm->hash_table;

clib_bihash_init_8_8 (h, "test", (u32) number_of_buckets,
                      (uword) memory_size);
```
### Add or delete a key/value pair

Use BV(clib_bihash_add_del), or the explicit type variant:
```c
clib_bihash_kv_8_8_t kv;
my_main_t *mm = &my_main;
clib_bihash_8_8_t *h;

h = &mm->hash_table;

kv.key = key_to_add_or_delete;
kv.value = value_to_add_or_delete;

clib_bihash_add_del_8_8 (h, &kv, is_add /* 1=add, 0=delete */);
```
In the delete case, kv.value is irrelevant. To change the value associated
with an existing (key,value) pair, simply re-add the [new] pair.
### Simple searches

The simplest possible (key, value) search goes like so:
```c
clib_bihash_kv_8_8_t search_kv, return_kv;
my_main_t *mm = &my_main;
clib_bihash_8_8_t *h;

h = &mm->hash_table;

search_kv.key = key_to_search_for;

if (clib_bihash_search_8_8 (h, &search_kv, &return_kv) < 0)
  key_not_found ();
else
  key_found ();
```
Note that it's perfectly fine to collect the lookup result in the
search (key,value) pair itself:

```c
if (clib_bihash_search_8_8 (h, &search_kv, &search_kv) < 0)
  key_not_found ();
else
  key_found ();
```
### Bihash vector processing

When processing a vector of packets which need a certain lookup
performed, it's worth the trouble to compute the key hash, and
prefetch the correct bucket ahead of time.

Here's a sketch of one way to write the required code:
Dual-loop:

* 6 packets ahead, prefetch 2x vlib_buffer_t's and 2x packet data
  required to form the record keys
* 4 packets ahead, form 2x record keys and call BV(clib_bihash_hash)
  or the explicit hash function to calculate the record hashes.
  Call 2x BV(clib_bihash_prefetch_bucket) to prefetch the buckets
* 2 packets ahead, call 2x BV(clib_bihash_prefetch_data) to prefetch
  2x (key,value) data pages.
* In the processing section, call 2x BV(clib_bihash_search_inline_with_hash)
  to perform the search
Programmer's choice whether to stash the hash code somewhere in
vnet_buffer(b) metadata, or to use local variables.

Single-loop:

* Use simple search as shown above.
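The staged-prefetch idea can be shown with a toy, self-contained sketch. Here a plain array stands in for the bihash table and `__builtin_prefetch` stands in for BV(clib_bihash_prefetch_bucket) / BV(clib_bihash_prefetch_data); all names are illustrative assumptions:

```c
#include <stdint.h>

#define TABLE_SLOTS 1024

static uint64_t
slot_of (uint64_t hash)
{
  return hash & (TABLE_SLOTS - 1);
}

/* Pipelined lookup: a couple of iterations ahead of the one being
   processed, issue a prefetch for the slot that iteration will read,
   so the memory access overlaps with current work. */
void
lookup_vector (const uint64_t *hashes, uint64_t *results, int n,
               const uint64_t table[TABLE_SLOTS])
{
  for (int i = 0; i < n; i++)
    {
      /* two lookups ahead: warm the cache line we'll need shortly */
      if (i + 2 < n)
        __builtin_prefetch (&table[slot_of (hashes[i + 2])]);

      results[i] = table[slot_of (hashes[i])];
    }
}
```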
### Walking a bihash table

A fairly common scenario to build "show" commands involves walking a
bihash table. It's simple enough:
```c
my_main_t *mm = &my_main;
clib_bihash_8_8_t *h;
void callback_fn (clib_bihash_kv_8_8_t *, void *);

h = &mm->hash_table;

BV (clib_bihash_foreach_key_value_pair) (h, callback_fn, (void *) arg);
```
To nobody's great surprise: clib_bihash_foreach_key_value_pair
iterates across the entire table, calling callback_fn with active
entries.
#### Bihash table iteration safety

The iterator template "clib_bihash_foreach_key_value_pair" must be
used with a certain amount of care. For one thing, the iterator
template does _not_ take the bihash hash table writer lock. If your
use-case requires it, lock the table.

For another, the iterator template is not safe under all conditions:
* It's __OK to delete__ bihash table entries during a table-walk. The
  iterator checks whether the current bucket has been freed after each
  _callback_fn(...)_ invocation.

* It is __not OK to add__ entries during a table-walk.
The add-during-walk case involves a jackpot: while processing a
key-value-pair in a particular bucket, the callback adds a certain
number of entries. By bad luck, one or more of the added entries
causes the __current bucket__ to split-and-rehash.

Since we rehash KVP's to different pages based on what amounts to a
different hash function, either of these things can go wrong:
* We may revisit previously-visited entries. Depending on how one
  coded the use-case, we could end up in a recursive-add situation.

* We may skip entries that have not been visited.
One could build an add-safe iterator, at a significant cost in
performance: copy the entire bucket, and walk the copy.

It's hard to imagine a worthwhile add-during-walk use-case in the
first place, let alone one which couldn't be implemented by walking
the table without modifying it, then adding a set of records.
### Creating a new template instance

Creating a new template is easy. Use one of the existing templates as
a model, and make the obvious changes. The hash and key_compare
methods are performance-critical in multiple senses.

If the key compare method is slow, every lookup will be slow. If the
hash function is slow, same story. If the hash function has poor
statistical properties, space efficiency will suffer. In the limit, a
bad enough hash function will cause large portions of the table to
revert to linear search.

Use of the best available vector unit is well worth the trouble in the
hash and key_compare functions.