   1 .. _fastconvergence:
   2
   3 Fast Convergence
   4 ------------------------------------
   5
   6 This is an excellent description of the topic:
   7
   8 `BGP PIC <https://tools.ietf.org/html/draft-ietf-rtgwg-bgp-pic-12>`_
   9
  10 but if you're interested in my take keep reading...
  11
  12 First some definitions:
  13
  14 - Convergence: When a FIB is forwarding all packets correctly based
  15   on the network topology (i.e. doing what the routing control plane
  16   has instructed it to do), then it is said to be 'converged'.
  17   Not being in a converged state is [hopefully] a transient state,
  18   when either the topology change (e.g. a link failure) has not been
  19   observed or processed by the routing control plane, or the FIB
  20   is still processing routing updates. Convergence is the act of
  21   getting to the converged state.
  22 - Fast: In the shortest time possible. There are no absolute limits
  23   placed on how short this must be, although there is one number often
  24   mentioned. Apparently the human ear can detect loss/delay/jitter in
  25   VoIP of 50ms, therefore network failures should last no longer than
  26   this, and some technologies (notably loop-free alternate fast
  27   reroute) are designed to converge in this time. However, it is
  28   generally accepted that it is not possible to converge a FIB with
  29   tens of millions of routes in this time scale; the industry
  30   'standard' is sub-second.
  31
  32 Converging the FIB quickly is thus a matter of:
  33
  34 - discovering something is down
  35 - updating as few objects as possible
  36 - determining which objects to update as efficiently as possible
  37 - updating each object as quickly as possible
  38
  39 We'll discuss each in turn.
  40 All output came from VPP version 21.01rc0. In what follows I use IPv4
  41 prefixes, addresses and IPv4 host length masks; however, exactly the
  42 same applies to IPv6.
  43
  44
  45 Failure Detection
  46 ^^^^^^^^^^^^^^^^^
  47
  48 The two common forms (we'll see others later on) of failure detection
  49 are:
  50
  51 - link down
  52 - BFD
  53
  54 The FIB needs to hook into these notifications to trigger
  55 convergence.
  56
  57 Whenever an interface goes down, VPP issues a callback to all
  58 registered clients. The adjacency code is such a client. The adjacency
  59 is a leaf node in the FIB control-plane graph (containing fib_path_t,
  60 fib_entry_t etc). A back-walk from the adjacency will trigger a
  61 re-resolution of the paths.
  62
  63 FIB is a client of BFD in order to receive BFD notifications. BFD
  64 comes in two flavours: single and multi hop. Single hop is to protect
  65 a specific peer on an interface; such peers are modelled by an
  66 adjacency. Multi hop is to protect a peer on an unspecified interface
  67 (i.e. a remote peer); this peer is represented by a host-prefix
  68 **fib_entry_t**. In both cases FIB will add a delegate to the
  69 **ip_adjacency_t** or **fib_entry_t** that represents the association
  70 to the BFD session. If the BFD session signals up/down then a back-walk
  71 can be triggered from the object to trigger re-resolution and hence
  72 convergence.
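
The trigger-and-walk pattern described above can be pictured with a small
sketch. This is only an illustration in Python, not VPP code: ``Node`` and
``back_walk`` are hypothetical names, and marking every dependent as
unresolved is a simplification of what re-resolution really does.

.. code-block:: python

    class Node:
        """Hypothetical stand-in for a FIB graph object."""

        def __init__(self, name):
            self.name = name
            self.children = []    # objects that depend on this one
            self.resolved = True

        def add_child(self, child):
            self.children.append(child)
            return child


    def back_walk(parent, up):
        """Visit every dependent of parent, marking it (un)resolved.

        Returns the names in the order they were visited.
        """
        visited = []
        for child in parent.children:
            child.resolved = up
            visited.append(child.name)
            visited.extend(back_walk(child, up))
        return visited


    # adjacency -> path -> path-list -> entry (illustrative names only)
    adj = Node("adj 10.0.0.2 GigEthernet0/0/0")
    path = adj.add_child(Node("path"))
    path_list = path.add_child(Node("path-list"))
    entry = path_list.add_child(Node("entry 1.1.1.1/32"))

    # A link-down callback or a BFD 'down' signal starts the walk at the
    # object that was protected, here the adjacency.
    walked = back_walk(adj, up=False)

The point of the sketch is only that the failure signal arrives at one object
and everything that needs re-resolving is found by following child
dependencies from it.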
  73
  74
  75 Few Updates
  76 ^^^^^^^^^^^
  77
  78 In order to talk about what 'a few' is we have to leave the realm of
  79 the FIB as an abstract graph based object DB and move into the
  80 concrete representation of forwarding in a large network. Large
  81 networks are built in layers, it's how you scale them. We'll take
  82 here a hypothetical service provider (SP) network, but the concepts
  83 apply equally to data center leaf-spines. This is a rudimentary
  84 description, but it should serve our purpose.
  85
  86 An SP manages a BGP autonomous system (AS). The SP's goal is both to
  87 attract traffic into its network to serve its customers, and also to
  88 serve transit traffic passing through it; we'll consider the latter here.
  89 The SP's network is all devices in that AS; these
  90 devices are split into those at the edge (provider edge (PE) routers)
  91 which peer with routers in other SP networks,
  92 and those in the core (termed provider (P) routers). Both the PE and P
  93 routers run the IGP (usually OSPF or ISIS). Only the reachability of the devices
  94 in the AS is advertised in the IGP - thus the scale (i.e. the number
  95 of routes) in the IGP is 'small' -  only the number of
  96 devices that the SP has (typically not more than a few 10k).
  97 PE routers run BGP; they have external BGP sessions to devices in
  98 other ASs and internal BGP sessions to devices in the same AS. BGP is
  99 used to advertise the routes to *all* networks on the internet - at
 100 the time of writing this number is approaching 900k IPv4 routes, hopefully by
 101 the time you are reading this the number of IPv6 routes has caught up ...
 102 If we include the additional routes the SP carries to offer VPN service to its
 103 customers the number of BGP routes can grow to the tens of millions.
 104
 105 BGP scale thus exceeds IGP scale by two orders of magnitude... pause for
 106 a moment and let that sink in...
 107
 108 A comparison of BGP and an IGP is way way beyond the scope of this
 109 documentation (and frankly beyond me) so we'll note only the
 110 difference in the form of the routes they present to FIB. A routing
 111 protocol will produce routes that specify the prefixes that are
 112 reachable through its peers. A good IGP
 113 is link state based, it forms peerings to other devices over these
 114 links, hence its routes specify links/interfaces. In
 115 FIB nomenclature this means an IGP produces routes that are
 116 attached-nexthop, e.g.:
 117
 118 .. code-block:: console
 119
 120     ip route add 1.1.1.1/32 via 10.0.0.1 GigEthernet0/0/0
 121
 122 BGP on the other hand forms peerings only to neighbours, it does not
 123 know, nor care, what interface is used to reach the peer. In FIB
 124 nomenclature therefore BGP produces recursive routes, e.g.:
 125
 126 .. code-block:: console
 127
 128     ip route add 8.0.0.0/16 via 1.1.1.1
 129
 130 where 1.1.1.1 is the BGP peer. It's no accident in this example that
 131 1.1.1.1/32 happens to be the route the IGP advertised... BGP installs
 132 routes for prefixes reachable via other BGP peers, and the IGP installs
 133 the routes to those BGP peers.
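
To make the difference between the two route forms concrete, here is a
minimal sketch of how a recursive path ends up resolving through an
attached-nexthop one. It is an assumption-laden toy, not the FIB's
implementation: the ``routes`` table, ``lookup`` and ``resolve`` are
hypothetical, and there is no protection against recursion loops.

.. code-block:: python

    import ipaddress

    # Hypothetical route table: prefix -> list of (next_hop, interface).
    # interface is None for a recursive path.
    routes = {
        ipaddress.ip_network("1.1.1.1/32"): [("10.0.0.1", "GigEthernet0/0/0")],
        ipaddress.ip_network("8.0.0.0/16"): [("1.1.1.1", None)],
    }


    def lookup(addr):
        """Longest-prefix match of addr against the route table."""
        ip = ipaddress.ip_address(addr)
        matches = [prefix for prefix in routes if ip in prefix]
        return max(matches, key=lambda p: p.prefixlen, default=None)


    def resolve(addr):
        """Follow recursive paths until attached next-hops are reached."""
        prefix = lookup(addr)
        if prefix is None:
            return []
        resolved = []
        for next_hop, interface in routes[prefix]:
            if interface is None:
                # recursive: resolve the next-hop through the table again
                resolved.extend(resolve(next_hop))
            else:
                # attached-nexthop: the interface is already known
                resolved.append((next_hop, interface))
        return resolved


    result = resolve("8.0.1.1")

A packet to 8.0.1.1 matches the BGP route, whose next-hop 1.1.1.1 is in turn
matched by the IGP's /32, which finally names an interface.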
 134
 135 This has been a very long-winded way of describing why the scale of
 136 recursive routes is therefore 2 orders of magnitude greater than
 137 non-recursive/attached-nexthop routes.
 138
 139 If we step back for a moment and recall why we've crawled down this
 140 rabbit hole, we're trying to determine what 'a few' updates means.
 141 Does it include all those recursive routes? Probably not ... let's
 142 keep crawling.
 143
 144 We started this chapter with an abstract description of convergence,
 145 let's now make that more real. In the event of a network failure an SP
 146 is interested in moving to an alternate forwarding path as quickly as
 147 possible. If there is no alternate path, and a converged FIB will drop
 148 the packet, then who cares how fast it converges. In other words the
 149 interesting convergence scenarios are the scenarios where the network has
 150 alternate paths.
 151
 152 PIC Core
 153 ^^^^^^^^
 154
 155 First let's consider alternate paths in the IGP, e.g.:
 156
 157 .. code-block:: console
 158
 159     ip route add 1.1.1.1/32 via 10.0.0.2 GigEthernet0/0/0
 160     ip route add 1.1.1.1/32 via 10.0.1.2 GigEthernet0/0/1
 161
 162 this gives us in the FIB:
 163
 164 .. code-block:: console
 165
 166                 DBGvpp# sh ip fib 1.1.1.1/32
 167                   ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, default-route:1, ]
 168                   1.1.1.1/32 fib:0 index:15 locks:2
 169                     API refs:1 src-flags:added,contributing,active,
 170                       path-list:[23] locks:2 flags:shared, uPRF-list:22 len:2 itfs:[1, 2, ]
 171                         path:[27] pl-index:23 ip4 weight=1 pref=0 attached-nexthop:  oper-flags:resolved,
 172                           10.0.0.2 GigEthernet0/0/0
 173                             [@0]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001111111111dead000000000800
 174                         path:[28] pl-index:23 ip4 weight=1 pref=0 attached-nexthop:  oper-flags:resolved,
 175                            10.0.1.2 GigEthernet0/0/1
 176                              [@0]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
 177
 178                     forwarding:   unicast-ip4-chain
 179                       [@0]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:22 to:[0:0]]
 180                         [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001111111111dead000000000800
 181                         [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
 182
 183 There is ECMP across the two paths. Note that the instance/index of the
 184 load-balance present in the forwarding graph is 17.
 185
 186 Let's add a BGP route via this peer:
 187
 188 .. code-block:: console
 189
 190     ip route add 8.0.0.0/16 via 1.1.1.1
 191
 192 in the FIB we see:
 193
 194
 195 .. code-block:: console
 196
 197     DBGvpp# sh ip fib 8.0.0.0/16
 198         ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:1, default-route:1, ]
 199         8.0.0.0/16 fib:0 index:18 locks:2
 200           API refs:1 src-flags:added,contributing,active,
 201             path-list:[24] locks:2 flags:shared, uPRF-list:21 len:2 itfs:[1, 2, ]
 202               path:[29] pl-index:24 ip4 weight=1 pref=0 recursive:  oper-flags:resolved,
 203                 via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
 204
 205           forwarding:   unicast-ip4-chain
 206             [@0]: dpo-load-balance: [proto:ip4 index:20 buckets:1 uRPF:21 to:[0:0]]
 207                 [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:22 to:[0:0]]
 208                   [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001111111111dead000000000800
 209                   [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
 210
 211 the load-balance object used by this route is index 20, but note that
 212 the next load-balance in the chain is index 17, i.e. it is exactly
 213 the same instance that appears in the forwarding chain for the IGP
 214 route. So in the forwarding plane the packet first encounters
 215 load-balance object 20 (which it will use in ip4-lookup) and then
 216 number 17 (in ip4-load-balance).
 217
 218 What's the significance? Let's shut down one of those IGP paths:
 219
 220 .. code-block:: console
 221
 222     DBGvpp# set in state GigEthernet0/0/0 down
 223
 224 the resulting update to the IGP route is:
 225
 226 .. code-block:: console
 227
 228     DBGvpp# sh ip fib 1.1.1.1/32
 229         ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:1, default-route:1, ]
 230         1.1.1.1/32 fib:0 index:15 locks:4
 231           API refs:1 src-flags:added,contributing,active,
 232             path-list:[23] locks:2 flags:shared, uPRF-list:25 len:2 itfs:[1, 2, ]
 233               path:[27] pl-index:23 ip4 weight=1 pref=0 attached-nexthop:
 234                 10.0.0.2 GigEthernet0/0/0
 235                   [@0]: arp-ipv4: via 10.0.0.2 GigEthernet0/0/0
 236               path:[28] pl-index:23 ip4 weight=1 pref=0 attached-nexthop:  oper-flags:resolved,
 237                 10.0.1.2 GigEthernet0/0/1
 238                   [@0]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
 239
 240           recursive-resolution refs:1 src-flags:added, cover:-1
 241
 242           forwarding:   unicast-ip4-chain
 243             [@0]: dpo-load-balance: [proto:ip4 index:17 buckets:1 uRPF:25 to:[0:0]]
 244                 [0] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
 245
 246
 247 notice that the path via 10.0.0.2 is no longer flagged as resolved,
 248 and the forwarding chain does not contain this path as a
 249 choice. However, the key thing to note is the load-balance
 250 instance is still index 17, i.e. it has been modified not
 251 exchanged. In the FIB vernacular we say it has been 'in-place
 252 modified', a somewhat linguistically redundant expression, but one that serves
 253 to emphasise that it was changed whilst still being part of the graph, it
 254 was never at any point removed from the graph and re-added, and it was
 255 modified without the worker barrier lock being held.
 256
 257 Still don't see the significance? In order to converge around the
 258 failure of the IGP link it was not necessary to update load-balance
 259 object number 20! It was not necessary to update the recursive
 260 route, i.e. convergence is achieved without updating any recursive
 261 routes; it is only necessary to update the affected IGP routes. This is
 262 the definition of 'a few'. We call this 'prefix independent
 263 convergence' (PIC) which should really be called 'recursive prefix
 264 independent convergence' but it isn't...
 265
 266 How was the trick done? As with all problems in computer science, it
 267 was solved by a layer of misdirection, I mean indirection. The
 268 indirection is the load-balance that belongs to the IGP route. By
 269 keeping this object in the forwarding graph and updating it in place,
 270 we get PIC. The alternative design would be to collapse the two layers of
 271 load-balancing into one, which would improve forwarding performance
 272 but would come at the cost of prefix dependent convergence. No doubt
 273 there are situations where the VPP deployment would favour forwarding
 274 performance over convergence, you know the drill, contributions welcome.
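
The indirection can be shown in a few lines. The following is a sketch under
stated assumptions, not VPP's data structures: ``LoadBalance`` and
``forward`` are hypothetical, the recursive LB indices are invented, and the
same flow-hash is reused at both levels for simplicity.

.. code-block:: python

    class LoadBalance:
        """Hypothetical load-balance object: an index plus its buckets."""

        def __init__(self, index, buckets):
            self.index = index
            self.buckets = list(buckets)


    # IGP route 1.1.1.1/32: ECMP over two adjacencies (index 17 above).
    igp_lb = LoadBalance(17, ["10.0.0.2 GigEthernet0/0/0",
                              "10.0.1.2 GigEthernet0/0/1"])

    # Many recursive routes; each LB has one bucket that *references* igp_lb.
    recursive_lbs = [LoadBalance(20 + i, [igp_lb]) for i in range(1000)]


    def forward(lb, flow_hash):
        """Follow the chain of load-balances until an adjacency is reached."""
        while isinstance(lb, LoadBalance):
            lb = lb.buckets[flow_hash % len(lb.buckets)]
        return lb


    # Link failure: one in-place update of the IGP load-balance. None of the
    # 1000 recursive load-balances is touched.
    igp_lb.buckets[:] = ["10.0.1.2 GigEthernet0/0/1"]

    results = {forward(lb, h) for lb in recursive_lbs for h in range(4)}

Collapsing the two layers would mean each of the 1000 recursive objects holds
its own copy of the adjacencies, and each would need rewriting on failure.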
 275
 276 This failure scenario is known as PIC core, since it's one of the IGP's
 277 core links that has failed.
 278
 279 iBGP PIC Edge
 280 ^^^^^^^^^^^^^
 281
 282 Next, let's consider alternate paths in BGP, e.g.:
 283
 284 .. code-block:: console
 285
 286     ip route add 8.0.0.0/16 via 1.1.1.1
 287     ip route add 8.0.0.0/16 via 1.1.1.2
 288
 289 the 8.0.0.0/16 prefix is reachable via two BGP next-hops (two PEs).
 290
 291 Our FIB now also contains:
 292
 293 .. code-block:: console
 294
 295     DBGvpp# sh ip fib 8.0.0.0/16
 296     ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:2, default-route:1, ]
 297     8.0.0.0/16 fib:0 index:18 locks:2
 298       API refs:1 src-flags:added,contributing,active,
 299         path-list:[15] locks:2 flags:shared, uPRF-list:11 len:2 itfs:[1, 2, ]
 300           path:[17] pl-index:15 ip4 weight=1 pref=0 recursive:  oper-flags:resolved,
 301             via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
 302           path:[15] pl-index:15 ip4 weight=1 pref=0 recursive:  oper-flags:resolved,
 303             via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-load-balance:12]
 304
 305       forwarding:   unicast-ip4-chain
 306         [@0]: dpo-load-balance: [proto:ip4 index:20 buckets:2 uRPF:11 to:[0:0]]
 307            [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:1 uRPF:25 to:[0:0]]
 308              [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
 309              [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
 310            [1] [@12]: dpo-load-balance: [proto:ip4 index:12 buckets:1 uRPF:13 to:[0:0]]
 311              [0] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
 312
 313 The first load-balance (LB) in the forwarding graph is index 20 (the astute
 314 reader will note this is the same index as in the previous
 315 section, I am adding paths to the same route, the load-balance is
 316 in-place modified again). Each choice in LB 20 is another LB
 317 contributed by the IGP route through which the route's paths recurse.
 318
 319 So what's the equivalent in BGP to a link down in the IGP? An IGP link
 320 down means it loses its peering out of that link, so the equivalent in
 321 BGP is the loss of the peering and thus the loss of reachability to
 322 the peer. This is signaled by the IGP withdrawing the route to the
 323 peer. But "Wait wait wait", I hear you say ... "just because the IGP
 324 withdraws 1.1.1.1/32 doesn't mean I can't reach 1.1.1.1, perhaps there
 325 is a less specific route that gives reachability to 1.1.1.1". Indeed
 326 there may be. So a little more on BGP network design. I know it's like
 327 a bad detective novel where the author drip feeds you the plot... When
 328 describing iBGP peerings one 'always' describes the peer using one of
 329 its loopback addresses. Why? A loopback interface
 330 never goes down (unless you admin down it yourself), some muppet can't
 331 accidentally cut through the loopback cable whilst digging up the
 332 street. And what subnet mask length does a prefix have on a loopback
 333 interface? It's 'always' a /32. Why? Because there's no cable to connect
 334 any other devices. This choice justifies there 'always' being a /32
 335 route for the BGP peer. But what prevents there also being a less
 336 specific route? Nothing.
 337 Now clearly if the BGP peer crashes then the /32 for its loopback is
 338 going to be removed from the IGP, but what will withdraw the less
 339 specific? Nothing.
 340
 341 So in order to make use of this trick of relying on the withdrawal of
 342 the /32 for the peer to signal that the peer is down and thus the
 343 signal to converge the FIB, we need to force FIB to recurse only via
 344 the /32 and not via a less specific. This is called a 'recursion
 345 constraint'. In this case the constraint is 'recurse via host'
 346 i.e. for IPv4 use a /32.
 347 So we need to update our route additions from before:
 348
 349 .. code-block:: console
 350
 351     ip route add 8.0.0.0/16 via 1.1.1.1 resolve-via-host
 352     ip route add 8.0.0.0/16 via 1.1.1.2 resolve-via-host
 353
 354 Checking the FIB output is left as an exercise for the reader. I hope
 355 you're doing these configs as you read. There's little change in the
 356 output, you'll see some extra flags on the paths.
 357
 358 Now let's add the less specific, just for fun:
 359
 360
 361 .. code-block:: console
 362
 363     ip route add 1.1.1.0/28 via 10.0.0.2 GigEthernet0/0/0
 364
 365 nothing changes in resolution of 8.0.0.0/16.
 366
 367 Now withdraw the route to 1.1.1.2/32:
 368
 369 .. code-block:: console
 370
 371     ip route del 1.1.1.2/32 via 10.0.1.2 GigEthernet0/0/1
 372
 373 In the FIB we see:
 374
 375 .. code-block:: console
 376
 377     DBGvpp# sh ip fib 8.0.0.0/16
 378       ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:2, default-route:1, ]
 379       8.0.0.0/16 fib:0 index:18 locks:2
 380         API refs:1 src-flags:added,contributing,active,
 381           path-list:[15] locks:2 flags:shared, uPRF-list:13 len:2 itfs:[1, 2, ]
 382             path:[15] pl-index:15 ip4 weight=1 pref=0 recursive:  oper-flags:resolved, cfg-flags:resolve-host,
 383               via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
 384             path:[17] pl-index:15 ip4 weight=1 pref=0 recursive:  cfg-flags:resolve-host,
 385               via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-drop:0]
 386
 387         forwarding:   unicast-ip4-chain
 388           [@0]: dpo-load-balance: [proto:ip4 index:20 buckets:1 uRPF:13 to:[0:0]]
 389             [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:27 to:[0:0]]
 390               [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
 391               [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
 392
 393 the path via 1.1.1.2 is unresolved, because the recursion constraints
 394 are preventing the path from resolving via 1.1.1.0/28. The LB index 20
 395 has been updated to remove the unresolved path.
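
The effect of the 'recurse via host' constraint can be sketched as a filter on
the longest-prefix match. This is an illustration only; ``resolve_via`` and
its ``via_host`` parameter are hypothetical names, not the FIB's API.

.. code-block:: python

    import ipaddress


    def resolve_via(next_hop, prefixes, via_host=False):
        """Return the longest prefix covering next_hop, or None if unresolved.

        With via_host=True only a host route (/32 for IPv4, /128 for IPv6)
        is acceptable.
        """
        ip = ipaddress.ip_address(next_hop)
        nets = [ipaddress.ip_network(p) for p in prefixes]
        matches = [n for n in nets if ip in n]
        if via_host:
            matches = [n for n in matches if n.prefixlen == n.max_prefixlen]
        return max(matches, key=lambda n: n.prefixlen, default=None)


    # The table after 1.1.1.2/32 has been withdrawn.
    fib = ["1.1.1.0/28", "1.1.1.1/32"]

    unconstrained = resolve_via("1.1.1.2", fib)
    constrained = resolve_via("1.1.1.2", fib, via_host=True)

Without the constraint the next-hop 1.1.1.2 quietly falls back to the /28;
with it the path becomes unresolved, which is the signal we want.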
 396
 397 Job done? Not quite! Why not?
 398
 399 Let's re-examine the goals of this chapter. We wanted to update 'a
 400 few' objects, which we have defined as not all the millions of
 401 recursive routes. Did we update a recursive route here? We sure did,
 402 when we modified LB index 20. So WTF?? Where's the indirection object
 403 that can be modified so that the LBs for the recursive routes are not
 404 modified? It's not there.... WTF?
 405
 406 OK so the great detective has assembled all the suspects in the
 407 drawing room and only now does he drop the bomb; the FIB knows the
 408 scale, we talked above about what the scale **can** be, worst case
 409 scenario, but that's not necessarily what it is in this hypothetical
 410 (your) deployment. It knows how many recursive routes there are that
 411 depend on a /32, it can thus make its own determination of the
 412 definition of 'a few'. In other words, if there are only 'a few'
 413 recursive prefixes that depend on a /32 then it will update them
 414 synchronously (and we'll discuss what synchronously means a bit more later).
 415
 416 So what does FIB consider to be 'a few'? Let's add more routes and
 417 find out.
 418
 419 .. code-block:: console
 420
 421     DBGvpp# ip route add 8.1.0.0/16 via 1.1.1.2 resolve-via-host via 1.1.1.1 resolve-via-host
 422       ...
 423     DBGvpp# ip route add 8.63.0.0/16 via 1.1.1.2 resolve-via-host via 1.1.1.1 resolve-via-host
 424
 425 and we see:
 426
 427 .. code-block:: console
 428
 429     DBGvpp# sh ip fib 8.8.0.0
 430      ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:4, default-route:1, ]
 431      8.8.0.0/16 fib:0 index:77 locks:2
 432      API refs:1 src-flags:added,contributing,active,
 433        path-list:[15] locks:128 flags:shared,popular, uPRF-list:28 len:2 itfs:[1, 2, ]
 434          path:[17] pl-index:15 ip4 weight=1 pref=0 recursive:  oper-flags:resolved, cfg-flags:resolve-host,
 435            via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
 436          path:[15] pl-index:15 ip4 weight=1 pref=0 recursive:  oper-flags:resolved, cfg-flags:resolve-host,
 437            via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-load-balance:12]
 438
 439      forwarding:   unicast-ip4-chain
 440        [@0]: dpo-load-balance: [proto:ip4 index:79 buckets:2 uRPF:28 flags:[uses-map] to:[0:0]]
 441            load-balance-map: index:0 buckets:2
 442               index:    0    1
 443                 map:    0    1
 444          [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:27 to:[0:0]]
 445            [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
 446            [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
 447          [1] [@12]: dpo-load-balance: [proto:ip4 index:12 buckets:1 uRPF:18 to:[0:0]]
 448            [0] [@3]: arp-ipv4: via 10.0.1.2 GigEthernet0/0/0
 449
 450
 451 Two elements to note here; the path-list has the 'popular' flag and
 452 there is a load-balance map in the forwarding path.
 453
 454 'popular' in this case means that the path-list has passed the limit
 455 of 'a few' in the number of children it has.
 456
 457 Here are the children:
 458
 459 .. code-block:: console
 460
 461   DBGvpp# sh fib path-list 15
 462     path-list:[15] locks:128 flags:shared,popular, uPRF-list:28 len:2 itfs:[1, 2, ]
 463       path:[17] pl-index:15 ip4 weight=1 pref=0 recursive:  oper-flags:resolved, cfg-flags:resolve-host,
 464         via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
 465       path:[15] pl-index:15 ip4 weight=1 pref=0 recursive:  oper-flags:resolved, cfg-flags:resolve-host,
 466         via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-load-balance:12]
 467       children:{entry:18}{entry:21}{entry:22}{entry:23}{entry:25}{entry:26}{entry:27}{entry:28}{entry:29}{entry:30}{entry:31}{entry:32}{entry:33}{entry:34}{entry:35}{entry:36}{entry:37}{entry:38}{entry:39}{entry:40}{entry:41}{entry:42}{entry:43}{entry:44}{entry:45}{entry:46}{entry:47}{entry:48}{entry:49}{entry:50}{entry:51}{entry:52}{entry:53}{entry:54}{entry:55}{entry:56}{entry:57}{entry:58}{entry:59}{entry:60}{entry:61}{entry:62}{entry:63}{entry:64}{entry:65}{entry:66}{entry:67}{entry:68}{entry:69}{entry:70}{entry:71}{entry:72}{entry:73}{entry:74}{entry:75}{entry:76}{entry:77}{entry:78}{entry:79}{entry:80}{entry:81}{entry:82}{entry:83}{entry:84}
 468
 469 64 children makes it popular. The number is fixed (there is no API to
 470 change it). The value is an attempt to balance the forwarding
 471 performance cost of the indirection against the convergence
 472 gain.
 473
 474 Popular path-lists contribute the load-balance map; this is the
 475 missing indirection object. Its indirection happens when choosing the
 476 bucket in the LB. The packet's flow-hash is taken 'mod number of
 477 buckets' to give the 'candidate bucket' then the map will take this
 478 'index' and convert it into the 'map'. You can see in the example above
 479 that no change occurs, i.e. if the flow-hash mod n chooses bucket 1
 480 then it gets bucket 1.
 481
 482 Why is this useful? The path-list is shared (you can convince
 483 yourself of this if you look at each of the 8.x.0.0/16 routes we
 484 added) and all of these routes use the same load-balance map, therefore, to
 485 converge all the recursive routes, we need only change the map and
 486 we're good; we again get PIC.
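
Here is a minimal sketch of the map lookup and of what changing the map does.
It is a toy under assumptions: ``LoadBalanceMap``, ``path_down`` and
``choose_bucket`` are hypothetical names, and the way a replacement bucket is
picked is my own simplification.

.. code-block:: python

    class LoadBalanceMap:
        """Hypothetical shared map: candidate bucket -> usable bucket."""

        def __init__(self, n_buckets):
            self.map = list(range(n_buckets))   # identity: nothing has failed

        def path_down(self, failed):
            """Redirect the failed bucket to a surviving one, in place."""
            usable = [i for i, b in enumerate(self.map)
                      if b == i and i != failed]
            if not usable:
                return                          # nothing left to fall back on
            for i, b in enumerate(self.map):
                if b == failed:
                    self.map[i] = usable[i % len(usable)]


    def choose_bucket(flow_hash, lb_map, buckets):
        """flow-hash mod n gives the candidate; the map gives the result."""
        candidate = flow_hash % len(buckets)
        return buckets[lb_map.map[candidate]]


    shared = LoadBalanceMap(2)                  # one map for every route
    buckets = ["via 1.1.1.1", "via 1.1.1.2"]

    before = [choose_bucket(h, shared, buckets) for h in range(4)]
    shared.path_down(1)                         # 1.1.1.2 becomes unreachable
    after = [choose_bucket(h, shared, buckets) for h in range(4)]

One write to the shared map moves every route that uses it off the failed
path, however many routes that is.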
 487
 488 OK who's still awake... if you're thinking there's more to this story,
 489 you're right. Keep reading.
 490
 491 This failure scenario is called iBGP PIC edge. It's 'edge' because it
 492 refers to the loss of an edge device, and iBGP because the device was
 493 an iBGP peer (we learn iBGP peers in the IGP). There is a similar eBGP
 494 PIC edge scenario, but this is left as an exercise for the reader (hint:
 495 there are other recursion constraints - see the RFC).
 496
 497 Which Objects
 498 ^^^^^^^^^^^^^
 499
 500 The next topic on our list of how to converge quickly was to
 501 efficiently find the objects that need to be updated when a convergence
 502 event happens. If you haven't realised by now that the FIB is an
 503 object graph, then can I politely suggest you go back and start from
 504 the beginning ...
 505
 506 Finding the objects affected by a change is simply a matter of walking
 507 from the parent (the object affected) to its children. These
 508 dependencies are kept precisely for this reason.
 509
 510 So is fast convergence just a matter of walking the graph? Yes and
 511 no. The question to ask yourself is this, "in the case of iBGP PIC edge,
 512 when the /32 is withdrawn, what is the list of objects that need to be
 513 updated, and in particular in what order should they be updated to
 514 obtain the best convergence time?" Think breadth v. depth first.
 515
 516 ... ponder for a while ...
 517
 518 For iBGP PIC edge we said it's the path-list that provides the
 519 indirection through the load-balance map. Hence once all path-lists
 520 are updated we are converged, thereafter, at our leisure, we can
 521 update the child recursive prefixes. Is that breadth or depth first?
 522
 523 It's breadth first.
 524
 525 Breadth first walks are achieved by spawning an async walk of the
 526 branch of the graph that we don't want to traverse. Withdrawing the /32
 527 triggers a synchronous walk of the children of the /32 route, we want
 528 a synchronous walk because we want to converge ASAP. This synchronous
 529 walk will encounter path-lists in the /32 route's child dependent list.
 530 These path-lists (and their LB maps) will be updated. If a path-list is
 531 popular, then it will spawn an async walk of the path-list's child
 532 dependent routes; if not, it will walk those routes. So the walk
 533 effectively proceeds breadth first across the path-lists, then returns
 534 to the start to do the affected routes.
 535
 536 Now the story is complete. The murderer is revealed.
 537
 538 Let's withdraw one of the IGP routes.
 539
 540 .. code-block:: console
 541
 542   DBGvpp# ip route del 1.1.1.2/32 via 10.0.1.2 GigEthernet0/0/1
 543
 544   DBGvpp# sh ip fib 8.8.0.0
 545   ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:4, default-route:1, ]
 546   8.8.0.0/16 fib:0 index:77 locks:2
 547     API refs:1 src-flags:added,contributing,active,
 548       path-list:[15] locks:128 flags:shared,popular, uPRF-list:18 len:2 itfs:[1, 2, ]
 549         path:[17] pl-index:15 ip4 weight=1 pref=0 recursive:  oper-flags:resolved, cfg-flags:resolve-host,
 550           via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
 551         path:[15] pl-index:15 ip4 weight=1 pref=0 recursive:  cfg-flags:resolve-host,
 552           via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-drop:0]
 553
 554     forwarding:   unicast-ip4-chain
 555       [@0]: dpo-load-balance: [proto:ip4 index:79 buckets:1 uRPF:18 to:[0:0]]
 556         [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:27 to:[0:0]]
 557           [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
 558           [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
 559
 560 The LB map has gone, since the prefix now only has one path. You'll
 561 need to be a CLI ninja if you want to catch the output showing the LB
 562 map in its transient state of:
 563
 564 .. code-block:: console
 565
 566            load-balance-map: index:0 buckets:2
 567               index:    0    1
 568                 map:    0    0
 569
 570 but it happens. Trust me. I've got tests and everything.
 571
 572 On the final topic of how to converge quickly; 'make each update fast'
 573 there are no tricks.
 574
 575
 576