Bounded-index Extensible Hashing (bihash)
=========================================
Vpp uses bounded-index extensible hashing to solve a variety of
exact-match (key, value) lookup problems. Benefits of the current
implementation:

- Very high record count scaling, tested to 100,000,000 records.
- Lookup performance degrades gracefully as the number of records
  increases.
- No reader locking required.
- Template implementation; it’s easy to support arbitrary (key, value)
  pair sizes.

Bounded-index extensible hashing has been widely used in databases for
decades.
Bihash uses a two-level data structure:

::

   +-----------------+
   | bucket-0        |
   |  log2_size      |
   |  backing store  |
   +-----------------+
   | bucket-1        |
   |  log2_size      |     +--------------------------------+
   |  backing store  | --> | KVP_PER_PAGE * key-value-pairs |
   +-----------------+     | page 0                         |
         ...               +--------------------------------+
   +-----------------+     | KVP_PER_PAGE * key-value-pairs |
   | bucket-2**N-1   |     | page 1                         |
   |  log2_size      |     +--------------------------------+
   |  backing store  |                   ...
   +-----------------+     +--------------------------------+
                           | KVP_PER_PAGE * key-value-pairs |
                           | page 2**log2_size - 1          |
                           +--------------------------------+
Discussion of the algorithm
---------------------------

This structure has a couple of major advantages. In practice, each
bucket entry fits into a 64-bit integer. Conveniently, vpp’s target
CPU architectures support 64-bit atomic operations. When modifying the
contents of a specific bucket, we do the following:

- Make a working copy of the bucket’s backing storage
- Atomically swap a pointer to the working copy into the bucket array
- Change the original backing store data
- Atomically swap back to the original

So, no reader locking is required to search a bihash table.
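The four steps above can be sketched in miniature. Everything below is a hypothetical, simplified model for illustration (the types and names are not the vppinfra API, and the real bucket encoding differs):

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of the write protocol; not vppinfra code. */
typedef struct { uint64_t kvp[4]; } page_t;

typedef struct
{
  _Atomic (page_t *) pages;	/* the pointer readers dereference */
} bucket_t;

static void
bucket_update (bucket_t *b, int slot, uint64_t new_kvp)
{
  page_t *original = atomic_load (&b->pages);

  /* Make a working copy of the bucket's backing storage */
  page_t *copy = malloc (sizeof (*copy));
  memcpy (copy, original, sizeof (*copy));

  /* Atomically point readers at the (unchanging) working copy */
  atomic_store (&b->pages, copy);

  /* Change the original backing store data */
  original->kvp[slot] = new_kvp;

  /* Atomically swap back to the original */
  atomic_store (&b->pages, original);

  /* NOTE: real code must defer this free until no reader can still
     hold the copy; that reclamation machinery is elided here. */
  free (copy);
}
```

Readers never observe a half-modified page: they see either the frozen working copy or the fully updated original.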
At lookup time, the implementation computes a key hash code. We use the
least-significant N bits of the hash to select the bucket.

With the bucket in hand, we learn log2 (nBackingPages) for the selected
bucket. At this point, we use the next log2_size bits from the hash code
to select the specific backing page in which the (key,value) pair will
be found, **if it exists**.

Net result: we search **one** backing page, not 2**log2_size pages. This
is a key property of the algorithm.
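The two selections are plain bit arithmetic. This standalone sketch uses hypothetical helper names (not the vppinfra API) for a table with 2**n_buckets_log2 buckets:

```c
#include <stdint.h>

/* Select the bucket: the least-significant n_buckets_log2 bits. */
static inline uint64_t
bucket_index (uint64_t hash, int n_buckets_log2)
{
  return hash & ((1ULL << n_buckets_log2) - 1);
}

/* Select the backing page: the next log2_size bits of the hash,
   where 2**log2_size is the bucket's current page count. */
static inline uint64_t
page_index (uint64_t hash, int n_buckets_log2, int log2_size)
{
  return (hash >> n_buckets_log2) & ((1ULL << log2_size) - 1);
}
```

Because each bucket stores its own log2_size, a bucket that splits simply starts consuming one more bit of the same hash code; no global rehash is needed.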
When sufficient collisions occur to fill the backing pages for a given
bucket, we double the bucket size, rehash, and deal the bucket contents
into a double-sized set of backing pages. In the future, we may
represent the size as a linear combination of two powers-of-two, to
increase space efficiency.

To solve the “jackpot case” in which a set of records collides badly
under hashing, the implementation falls back to linear search across
2**log2_size backing pages on a per-bucket basis.
To maintain *space* efficiency, we should configure the bucket array so
that backing pages are effectively utilized. Lookup performance tends to
change *very little* whether the bucket array is too small or too large.

Bihash depends on selecting an effective hash function. If one were to
use a truly broken hash function such as “return 1ULL;”, bihash would
still work, but it would be equivalent to poorly-programmed linear
search.

We often use CPU intrinsics - think crc32 - to rapidly compute a hash
code which has decent statistics.
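For illustration, here is the shape of such a hash function. The bitwise loop below is a portable stand-in for the CRC-32C CPU instruction (e.g. SSE4.2’s _mm_crc32_u64) - same polynomial, far slower - and is not the vppinfra implementation:

```c
#include <stdint.h>

/* Portable CRC-32C (Castagnoli) stand-in for the crc32 instruction. */
static uint32_t
crc32c_u64 (uint32_t crc, uint64_t data)
{
  for (int i = 0; i < 64; i++)
    {
      uint32_t bit = (crc ^ (uint32_t) data) & 1;
      crc >>= 1;
      data >>= 1;
      if (bit)
	crc ^= 0x82f63b78;	/* reversed CRC-32C polynomial */
    }
  return crc;
}

/* Hash an 8-byte key, the way a bihash_8_8-style hash method might. */
static uint64_t
hash_key_8 (uint64_t key)
{
  return crc32c_u64 (0, key);
}
```

CRC32C is not cryptographic, but it mixes the input bits well enough that both the bucket-select and page-select bit fields have decent statistics.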
Bihash Cookbook
---------------

Using current (key,value) template instance types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It’s quite easy to use one of the template instance types. As of this
writing, …/src/vppinfra provides pre-built templates for 8, 16, 20, 24,
40, and 48 byte keys, u8 \* vector keys, and 8 byte values.

See …/src/vppinfra/{bihash\_\_8}.h
To define the data types, #include a specific template instance, most
often in a subsystem header file:

::

   #include <vppinfra/bihash_8_8.h>
If you’re building a standalone application, you’ll need to define the
various functions by #including the method implementation file in a C
source file:

::

   #include <vppinfra/bihash_template.c>

The core vpp engine currently uses most if not all of the known bihash
types, so inside vpp you probably won’t need to #include the method
implementation file yourself.
Add an instance of the selected bihash data structure to e.g. a “main_t”
structure in your subsystem header file:

::

   BVT (clib_bihash) hash_table;

or, equivalently, with the explicit type name:

::

   clib_bihash_8_8_t hash_table;
The BV macro concatenates its argument with the value of the preprocessor
symbol BIHASH_TYPE. The BVT macro concatenates its argument with the
value of BIHASH_TYPE and the fixed string “_t”. So in the above example,
BVT (clib_bihash) generates “clib_bihash_8_8_t”.

If you’re sure you won’t decide to change the template / type name
later, it’s perfectly OK to code “clib_bihash_8_8_t” and so forth.
In fact, if you #include multiple template instances in a single source
file, you **must** use fully-enumerated type names. The macros stand no
chance of working.
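To illustrate the token-pasting involved, here is a self-contained model of the macros. This is a simplified sketch, not the exact vppinfra definitions (which live in the template headers and are slightly more elaborate):

```c
#include <string.h>

/* Model of the BV/BVT token-pasting; not the exact vppinfra source. */
#define BIHASH_TYPE _8_8

#define BV__(a, b) a##b
#define BV_(a, b) BV__ (a, b)	/* expand arguments, then paste */
#define BV(a) BV_ (a, BIHASH_TYPE)
#define BVT(a) BV_ (BV_ (a, BIHASH_TYPE), _t)

typedef struct { int dummy; } clib_bihash_8_8_t;

/* BVT (clib_bihash) expands to clib_bihash_8_8_t: */
BVT (clib_bihash) hash_table;

/* Stringize helpers, to observe the expansion */
#define STR_(x) #x
#define STR(x) STR_ (x)
```

The two-level `BV_`/`BV__` indirection is what forces BIHASH_TYPE to expand before pasting; pasting directly would yield `clib_bihashBIHASH_TYPE`.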
Initializing a bihash table
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Call the init function as shown. As a rough guide, pick a number of
buckets which is approximately
number_of_expected_records/BIHASH_KVP_PER_PAGE from the relevant
template instance header file. See the previous discussion.
The amount of memory selected should easily contain all of the records,
with a generous allowance for hash collisions. Bihash memory is
allocated separately from the main heap, and won’t cost anything except
kernel PTEs until touched, so it’s OK to be reasonably generous.
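As a worked example of the sizing guidance, the helper below (hypothetical, not a vppinfra function) rounds expected_records / kvp_per_page up to a power of two; BIHASH_KVP_PER_PAGE comes from the template instance header:

```c
#include <stdint.h>

/* Hypothetical sizing helper: round expected_records / kvp_per_page
   up to a power of two, per the bucket-count guidance above. */
static uint32_t
suggest_n_buckets (uint64_t expected_records, uint32_t kvp_per_page)
{
  uint64_t target = expected_records / kvp_per_page;
  uint32_t n = 1;
  while (n < target)
    n <<= 1;
  return n;
}
```

For instance, expecting 1,000,000 records with 4 KVPs per page suggests on the order of 250,000 buckets, rounded up to 262,144 (2**18).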
::

   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;

   h = &mm->hash_table;

   clib_bihash_init_8_8 (h, "test", (u32) number_of_buckets,
                         (uword) memory_size);
Add or delete a key/value pair
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use BV(clib_bihash_add_del), or the explicit type variant:

::

   clib_bihash_kv_8_8_t kv;
   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;

   h = &mm->hash_table;
   kv.key = key_to_add_or_delete;
   kv.value = value_to_add_or_delete;

   clib_bihash_add_del_8_8 (h, &kv, is_add /* 1=add, 0=delete */);
In the delete case, kv.value is irrelevant. To change the value
associated with an existing (key,value) pair, simply re-add the [new]
pair.
Simple searches
~~~~~~~~~~~~~~~

The simplest possible (key, value) search goes like so:

::

   clib_bihash_kv_8_8_t search_kv, return_kv;
   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;

   h = &mm->hash_table;
   search_kv.key = key_to_search_for;

   if (clib_bihash_search_8_8 (h, &search_kv, &return_kv) < 0)
     key_not_found ();
   else
     key_found ();

Note that it’s perfectly fine to collect the lookup result into the
search key-value pair:

::

   if (clib_bihash_search_8_8 (h, &search_kv, &search_kv))
     key_not_found ();
Bihash vector processing
~~~~~~~~~~~~~~~~~~~~~~~~

When processing a vector of packets which need a certain lookup
performed, it’s worth the trouble to compute the key hash and prefetch
the correct bucket ahead of time.

Here’s a sketch of one way to write the required code:
Dual-loop:

- 6 packets ahead, prefetch 2x vlib_buffer_t’s and 2x packet data
  required to form the record keys
- 4 packets ahead, form 2x record keys and call BV(clib_bihash_hash) or
  the explicit hash function to calculate the record hashes. Call 2x
  BV(clib_bihash_prefetch_bucket) to prefetch the buckets
- 2 packets ahead, call 2x BV(clib_bihash_prefetch_data) to prefetch 2x
  (key,value) data pages
- In the processing section, call 2x
  BV(clib_bihash_search_inline_with_hash) to perform the search

It’s the programmer’s choice whether to stash the hash code somewhere in
vnet_buffer(b) metadata, or to use local variables.

Single-loop:

- Use the simple search as shown above.
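Concretely, the pipeline stages above might look like the following non-compilable sketch for the 8_8 instance. The buffer handling is elided and extract_key() is a hypothetical placeholder for key formation; only the bihash calls reflect the template API:

```c
/* Sketch only -- software-pipelined lookup, 8_8 template instance.
   b[] is the vlib_buffer_t pointer array for this iteration. */
clib_bihash_kv_8_8_t kv[8];
u64 hash[8];
int rv0, rv1;

/* 6 packets ahead: prefetch buffer headers and packet data */
vlib_prefetch_buffer_header (b[6], LOAD);
vlib_prefetch_buffer_header (b[7], LOAD);
CLIB_PREFETCH (b[6]->data, CLIB_CACHE_LINE_BYTES, LOAD);
CLIB_PREFETCH (b[7]->data, CLIB_CACHE_LINE_BYTES, LOAD);

/* 4 packets ahead: form keys, hash them, prefetch the buckets */
kv[4].key = extract_key (b[4]);
kv[5].key = extract_key (b[5]);
hash[4] = clib_bihash_hash_8_8 (&kv[4]);
hash[5] = clib_bihash_hash_8_8 (&kv[5]);
clib_bihash_prefetch_bucket_8_8 (h, hash[4]);
clib_bihash_prefetch_bucket_8_8 (h, hash[5]);

/* 2 packets ahead: prefetch the (key,value) data pages */
clib_bihash_prefetch_data_8_8 (h, hash[2]);
clib_bihash_prefetch_data_8_8 (h, hash[3]);

/* Current packets: search in place with the precomputed hashes */
rv0 = clib_bihash_search_inline_with_hash_8_8 (h, hash[0], &kv[0]);
rv1 = clib_bihash_search_inline_with_hash_8_8 (h, hash[1], &kv[1]);
```

In a real node the kv[] and hash[] staging shifts down each iteration; the sketch shows only the relative distances between the stages.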
Walking a bihash table
~~~~~~~~~~~~~~~~~~~~~~

A fairly common scenario when building “show” commands involves walking
a bihash table. It’s simple enough:

::

   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;
   void callback_fn (clib_bihash_kv_8_8_t *, void *);

   h = &mm->hash_table;

   BV (clib_bihash_foreach_key_value_pair) (h, callback_fn, (void *) arg);

To nobody’s great surprise, clib_bihash_foreach_key_value_pair iterates
across the entire table, calling callback_fn with each active entry.
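A minimal callback might look like the sketch below. The kv struct is mocked so the example compiles outside vpp; in real code the type comes from the template header, and a “show” command would typically format output via vlib_cli_output rather than printf:

```c
#include <stdint.h>
#include <stdio.h>

/* Mock of the template's kv type, for a standalone sketch only;
   real code gets this from <vppinfra/bihash_8_8.h>. */
typedef struct
{
  uint64_t key;
  uint64_t value;
} clib_bihash_kv_8_8_t;

/* Typical walk callback: report one active entry, and thread a
   counter through the opaque argument. */
static void
callback_fn (clib_bihash_kv_8_8_t *kvp, void *arg)
{
  uint64_t *n_entries = arg;

  (*n_entries)++;
  printf ("key 0x%llx value 0x%llx\n",
	  (unsigned long long) kvp->key, (unsigned long long) kvp->value);
}
```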
Bihash table iteration safety
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The iterator template “clib_bihash_foreach_key_value_pair” must be used
with a certain amount of care. For one thing, the iterator template does
*not* take the bihash table writer lock. If your use-case requires
mutual exclusion with concurrent writers, take the lock yourself.

For another, the iterator template is not safe under all conditions:
- It’s **OK to delete** bihash table entries during a table-walk. The
  iterator checks whether the current bucket has been freed after each
  *callback_fn(…)* invocation.

- It is **not OK to add** entries during a table-walk.
The add-during-walk case involves a jackpot: while processing a
key-value pair in a particular bucket, the callback adds a number of
entries, and by bad luck one or more of the added entries causes the
**current bucket** to split-and-rehash.

Since we rehash KVPs to different pages based on what amounts to a
different hash function, either of these things can go wrong:

- We may revisit previously-visited entries. Depending on how one coded
  the use-case, we could end up in a recursive-add situation.

- We may skip entries that have not yet been visited.
One could build an add-safe iterator, at a significant cost in
performance: copy the entire bucket, and walk the copy.

It’s hard to imagine a worthwhile add-during-walk use-case in the first
place, let alone one which couldn’t be implemented by walking the table
without modifying it, and then adding a set of records.
Creating a new template instance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Creating a new template is easy. Use one of the existing templates as a
model, and make the obvious changes. The hash and key_compare methods
are performance-critical in multiple senses.

If the key compare method is slow, every lookup will be slow. If the
hash function is slow, same story. If the hash function has poor
statistical properties, space efficiency will suffer. In the limit, a
bad enough hash function will cause large portions of the table to
revert to linear search.

Use of the best available vector unit is well worth the trouble in the
hash and key_compare functions.
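As a shape reference, here is a branch-free compare for a hypothetical 16-byte key (not one of the vppinfra templates); the real templates achieve the same effect with the widest available vector loads and compares:

```c
#include <stdint.h>

/* Hypothetical 16-byte key, for illustration. */
typedef struct
{
  uint64_t as_u64[2];
} key16_t;

/* Branch-free equality: XOR both lanes, OR the differences together,
   and test against zero once. A vectorized version would replace the
   two u64 operations with a single 128-bit compare. */
static inline int
key16_equal (const key16_t *a, const key16_t *b)
{
  return ((a->as_u64[0] ^ b->as_u64[0]) |
	  (a->as_u64[1] ^ b->as_u64[1])) == 0;
}
```

Avoiding an early-exit branch per lane keeps the compare free of branch mispredictions, which matters when it runs once per candidate KVP on every lookup.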