src/vnet/fib/fib.h

   1 /*
   2  * Copyright (c) 2016 Cisco and/or its affiliates.
   3  * Licensed under the Apache License, Version 2.0 (the "License");
   4  * you may not use this file except in compliance with the License.
   5  * You may obtain a copy of the License at:
   6  *
   7  *     http://www.apache.org/licenses/LICENSE-2.0
   8  *
   9  * Unless required by applicable law or agreed to in writing, software
  10  * distributed under the License is distributed on an "AS IS" BASIS,
  11  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  12  * See the License for the specific language governing permissions and
  13  * limitations under the License.
  14  */
  15 /**
  16  * \brief
  17  * A IP v4/6 independent FIB.
  18  *
  19  * The main functions provided by the FIB are as follows;
  20  *
  21  *  - source priorities
  22  *
  23  *   A route can be added to the FIB by more than entity or source. Sources
  24  * include, but are not limited to, API, CLI, LISP, MAP, etc (for the full list
  25  * see fib_entry.h). Each source provides the forwarding information (FI) that
  26  * is has determined as required for that route. Since each source determines the
  27  * FI using different best  path and loop prevention algorithms, it is not
  28  * correct for the FI of multiple sources to be combined. Instead the FIB must
  29  * choose to use the FI from only one source. This choose is based on a static
  30  * priority assignment. For example;
  31  * IF a prefix is added as a result of interface configuration:
  32  *    set interface address 192.168.1.1/24 GigE0
  33  * and then it is also added from the CLI
  34  *    ip route 192.168.1.1/32 via 2.2.2.2/32
  35  * then the 'interface' source will prevail, and the route will remain as
  36  * 'local'.
  37  * The requirement of the FIB is to always install the FI from the winning
  38  * source and thus to maintain the FI added by losing sources so it can be
  39  * installed should the winning source be withdrawn.
  40  *
  41  *  - adj-fib maintenance
  42  *
  43  *   When ARP or ND discover a neighbour on a link an adjacency forms for the
  44  * address of that neighbour. It is also required to insert a route in the
  45  * appropriate FIB table, corresponding to the VRF for the link, an entry for
  46  * that neighbour. This entry is often referred to as an adj-fib. Adj-fibs
  47  * have a dedicated source; 'ADJ'.
  48  * The priority of the ADJ source is lower than most. This is so the following
  49  * config;
  50  *    set interface address 192.168.1.1/32 GigE0
  51  *    ip arp 192.168.1.2 GigE0 dead.dead.dead
  52  *    ip route add 192.168.1.2 via 10.10.10.10 GigE1
  53  * will forward traffic for 192.168.1.2 via GigE1. That is the route added
  54  * by the control plane is favoured over the adjacency discovered by ARP.
  55  * The control plane, with its associated authentication, is considered the
  56  * authoritative source.
  57  * To counter the nefarious addition of adj-fib, through the nefarious injection
  58  * of adjacencies, the FIB is also required to ensure that only adj-fibs whose
  59  * less specific covering prefix is connected are installed in forwarding. This
  60  * requires the use of 'cover tracking', where a route maintains a dependency
  61  * relationship with the route that is its less specific cover. When this cover
  62  * changes (i.e. there is a new covering route) or the forwarding information
  63  * of the cover changes, then the covered route is notified.
  64  *
  65  * Overlapping sub-nets are not supported, so no adj-fib has multiple paths.
  66  * The control plane is expected to remove a prefix configured for an interface
  67  * before the interface changes VRF.
  68  * So while the following config is accepted:
  69  *    set interface address 192.168.1.1/32 GigE0
  70  *    ip arp 192.168.1.2 GigE0 dead.dead.dead
  71  *    set interface ip table GigE0 2
  72  * it does not result in the desired behaviour.
  73  *
  74  *  - attached export.
  75  *
  76  * Further to adj-fib maintenance above consider the following config:
  77  *    set interface address 192.168.1.1/24 GigE0
  78  *    ip route add table 2 192.168.1.0/24 GigE0
  79  * Traffic destined for 192.168.1.2 in table 2 will generate an ARP request
  80  * on GigE0. However, since GigE0 is in table 0, all adj-fibs will be added in
  81  * FIB 0. Hence all hosts in the sub-net are unreachable from table 2. To resolve
  82  * this, all adj-fib and local prefixes are exported (i.e. copied) from the
  83  * 'export' table 0, to the 'import' table 2. There can be many import tables
  84  * for a single export table.
  85  *
  86  *  - recursive route resolution
  87  *
  88  *   A recursive route is of the form:
  89  *       1.1.1.1/32 via 10.10.10.10
  90  * i.e. a route for which no egress interface is provided. In order to forward
  91  * traffic to 1.1.1.1/32 the FIB must therefore first determine how to forward
  92  * traffic to 10.10.10.10/32. This is recursive resolution.
  93  * Recursive resolution, just like normal resolution, proceeds via a longest
  94  * prefix match for the 'via-address' 10.10.10.10. Note it is only possible
  95  * to add routes via an address (i.e. a /32 or /128) not via a shorter mask
  96  * prefix. There is no use case for the latter.
  97  * Since recursive resolution proceeds via a longest prefix match, the entry
  98  * in the FIB that will resolve the recursive route, termed the via-entry, may
  99  * change as other routes are added to the FIB. Consider the recursive
 100  * route shown above, and this non-recursive route:
 101  *       10.10.10.0/24 via 192.168.16.1 GigE0
 102  * The entry for 10.10.10.0/24 is thus the resolving via-entry. If this entry is
 103  * modified, to say;
 104  *       10.10.10.0/24 via 192.16.1.3 GigE0
 105  * Then packet for 1.1.1.1/32 must also be sent to the new next-hop.
 106  * Now consider the addition of;
 107  *       10.10.10.0/28 via 192.168.16.2 GigE0
 108  * The more specific /28 is a better longest prefix match and thus becomes the
 109  * via-entry. Removal of the /28 means the resolution will revert to the /24.
 110  * The tracking to the changes in recursive resolution is the requirement of
 111  * the FIB. When the forwarding information of the via-entry changes a back-walk
 112  * is used to update dependent recursive routes. When new routes are added to
 113  * the table the cover tracking feature provides the necessary notifications to
 114  * the via-entry routes.
 115  * The adjacency constructed for 1.1.1.1/32 will be a recursive adjacency
 116  * whose next adjacency will be contributed from the via-entry. Maintaining
 117  * the validity of this recursive adjacency is a requirement of the FIB.
 118  *
 119  *  - recursive loop avoidance
 120  *
 121  * Consider this set of routes:
 122  *     1.1.1.1/32 via 2.2.2.2
 123  *     2.2.2.2/32 via 3.3.3.3
 124  *     3.3.3.3/32 via 1.1.1.1
 125  * this is termed a recursion loop - all of the routes in the loop are
 126  * unresolved in so far as they do not have a resolving adjacency, but each
 127  * is resolved because the via-entry is known. It is important here to note
 128  * the distinction between the control-plane objects and the data-plane objects
 129  * (more details in the implementation section). The control plane objects must
 130  * allow the loop to form (i.e. the graph becomes cyclic), however, the
 131  * data-plane absolutely must not allow the loop to form, otherwise the packet
 132  * would loop indefinitely and never egress the device - meltdown would follow.
 133  * The control plane must allow the loop to form, because when the loop breaks,
 134  * all members of the loop need to be updated. Forming the loop allows the
 135  * dependencies to be correctly setup to allow this to happen.
 136  * There is no limit to the depth of recursion supported by VPP so:
 137  *    9.9.9.100/32 via 9.9.9.99
 138  *    9.9.9.99/32  via 9.9.9.98
 139  *    9.9.9.98/32  via 9.9.9.97
 140  *      ... turtles, turtles, turtles ...
 141  *    9.9.9.1/32 via 10.10.10.10 Gig0
 142  * is supported to as many layers of turtles is desired, however, when
 143  * back-walking a graph (in this case from 9.9.9.1/32 up toward 9.9.9.100/32)
 144  * a FIB needs to differentiate the case where the recursion is deep versus
 145  * the case where the recursion is looped. A simple method, employed by VPP FIB,
 146  * is to limit the number of steps. VPP FIB limit is 16. Typical BGP scenarios
 147  * in the wild do not exceed 3 (BGP Inter-AS option C).
 148  *
 149  * - Fast Convergence
 150  *
 151  * After a network topology change, the 'convergence' time, is the time taken
 152  * for the router to complete a transition to forward traffic using the new
 153  * topology. The convergence time is therefore a summation of the time to;
 154  *  - detect the failure.
 155  *  - calculate the new 'best path' information
 156  *  - download the new best paths to the data-plane.
 157  *  - install those best best in data-plane forwarding.
 158  * The last two points are of relevance to VPP architecture. The download API is
 159  * binary and batch, details are not discussed here. There is no HW component to
 160  * programme, installation time is bounded by the memory allocation and table
 161  * lookup and insert access times.
 162  *
 163  * 'Fast' convergence refers to a set of technologies that a FIB can employ to
 164  * completely or partially restore forwarding whilst the convergence actions
 165  * listed above are ongoing. Fast convergence technologies are further
 166  * sub-divided into Prefix Independent Convergence (PIC) and Loop Free
 167  * Alternate path Fast re-route (LFA-FRR or sometimes called IP-FRR) which
 168  * affect recursive and non-recursive routes respectively.
 169  *
 170  * LFA-FRR
 171  *
 172  * Consider the network topology below:
 173  *
 174  *          C
 175  *        /   \
 176  *  X -- A --- B - Y
 177  *       |     |
 178  *       D     F
 179  *        \   /
 180  *          E
 181  *
 182  * all links are equal cost, traffic is passing from X to Y. the best path is
 183  * X-A-B-Y. There are two alternative paths, one via C and one via E. An
 184  * alternate path is considered to be loop free if no other router on that path
 185  * would forward the traffic back to the sender. Consider router C, its best
 186  * path to Y is via B, so if A were to send traffic destined to Y to C, then C
 187  * would forward that traffic to B - this is a loop-free alternate path. In
 188  * contrast consider router D. D's shortest path to Y is via A, so if A were to
 189  * send traffic destined to Y via D, then D would send it back to A; this is
 190  * not a loop-free alternate path. There are several points of note;
 191  *   - we are considering the pre-failure routing topology
 192  *   - any equal-cost multi-path between A and B is also a LFA path.
 193  *   - in order for A to calculate LFA paths it must be aware of the best-path
 194  *     to Y from the perspective of D. These calculations are thus limited to
 195  *     routing protocols that have a full view of the network topology, i.e.
 196  *     link-state DB protocols like OSPF or an SDN controller. LFA protected
 197  *     prefixes are thus non-recursive.
 198  *
 199  * LFA is specified as a 1 to 1 redundancy; a primary path has only one LFA
 200  * (a.k.a. backup) path. To my knowledge this limitation is one of complexity
 201  * in the calculation of and capacity planning using a 1-n redundancy.
 202  *
 203  * In the event that the link A-B fails, the alternate path via C can be used.
 204  * In order to provide 'fast' failover in the event of a failure, the control
 205  * plane will download both the primary and the backup path to the FIB. It is
 206  * then a requirement of the FIB to perform the failover (a.k.a cutover) from
 207  * the primary to the backup path as quickly as possible, and particularly
 208  * without any other control-plane intervention. The expectation is cutover is
 209  * less than 50 milli-seconds - a value allegedly from the VOIP QoS. Note that
 210  * cutover time still includes the fault detection time, which in a vitalised
 211  * environment could be the dominant factor. Failure detection can be either a
 212  * link down, which will affect multiple paths on a multi-access interface, or
 213  * via a specific path heartbeat (i.e. BFD).
 214  * At this time VPP does not support LFA, that is it does not support the
 215  * installation of a primary and backup path[s] for a route. However, it does
 216  * support ECMP, and VPP FIB is designed to quickly remove failed paths from
 217  * the ECMP set, however, it does not insert shared objects specific to the
 218  * protected resource into the forwarding object graph, since this would incur
 219  * a forwarding/performance cost. Failover time is thus route number dependent.
 220  * Details are provided in the implementation section below.
 221  *
 222  * PIC
 223  *
 224  * PIC refers to the concept that the converge time should be independent of
 225  * the number of prefixes/routes that are affected by the failure. PIC is
 226  * therefore most appropriate when considering networks with large number of
 227  * prefixes, i.e. BGP networks and thus recursive prefixes. There are several
 228  * flavours of PIC covering different locations of protection and failure
 229  * scenarios. An outline is given below, see the literature for more details:
 230  *
 231  * Y/16 - CE1 -- PE1---\
 232  *                | \   P1---\
 233  *                |  \        PE3 -- CE3 - X/16
 234  *                |   - P2---/
 235  * Y/16 - CE2 -- PE2---/
 236  *
 237  * CE = customer edge, PE = provider edge. external-BGP runs between customer
 238  * and provider, internal-BGP runs between provider and provider.
 239  *
 240  * 1) iBGP PIC-core: consider traffic from CE1 to X/16 via CE3. On PE1 there is
 241  *    are routes;
 242  *       X/16 (and hundreds of thousands of others like it)
 243  *         via PE3
 244  *    and
 245  *      PE3/32 (its loopback address)
 246  *        via 10.0.0.1 Link0 (this is P1)
 247  *        via 10.1.1.1 Link1 (this is P2)
 248  * the failure is the loss of link0 or link1
 249  * As in all PIC scenarios, in order to provide prefix independent convergence
 250  * it must be that the route for X/16 (and all other routes via PE3) do not
 251  * need to be updated in the FIB. The FIB therefore needs to update a single
 252  * object that is shared by all routes - once this shared object is updated,
 253  * then all routes using it will be instantly updated to use the new forwarding
 254  * information. In this case the shared object is the resolving route via PE3.
 255  * Once the route via PE3 is updated via IGP (OSPF) convergence, then all
 256  * recursive routes that resolve through it are also updated. VPP FIB
 257  * implements this scenario via a recursive-adjacency. the X/16 and it sibling
 258  * routes share a recursive-adjacency that links to/points at/stacks on the
 259  * normal adjacency contributed by the route for PE3. Once this shared
 260  * recursive adj is re-linked then all routes are switched to using the new
 261  * forwarding information. This is shown below;
 262  *
 263  * pre-failure;
 264  *   X/16 --> R-ADJ-1 --> ADJ-1-PE3 (multi-path via P1 and P2)
 265  *
 266  * post-failure:
 267  *   X/16 --> R-ADJ-1 --> ADJ-2-PE3 (single path via P1)
 268  *
 269  * note that R-ADJ-1 (the recursive adj) remains in the forwarding graph,
 270  * therefore X/16 (and all its siblings) is not updated.
 271  * X/16 and its siblings share the recursive adj since they share the same
 272  * path-list. It is the path-list object that contributes the recursive-adj
 273  * (see next section for more details)
 274  *
 275  *
 276  * 2) iBGP PIC-edge; Traffic from CE3 to Y/16. On PE3 there is are routes;
 277  *      Y/16  (and hundreds of thousands of others like it)
 278  *        via PE1
 279  *        via PE2
 280  *  and
 281  *     PE1/32 (PE1's loopback address)
 282  *       via 10.0.2.2 Link0 (this is P1)
 283  *     PE2/32 (PE2's loopback address)
 284  *       via 10.0.3.3 Link1 (this is P2)
 285  *
 286  * the failure is the loss of reachability to PE2. this could be either the
 287  * loss of the link P2-PE2 or the loss of the node PE2. This is detected either
 288  * by the withdrawal of the PE2's loopback route or by some form of failure
 289  * detection (i.e. BFD).
 290  * VPP FIB again provides PIC via the use of the shared recursive-adj. Y/16 and
 291  * its siblings will again share a path-list for the list {PE1,PE2}, this
 292  * path-list will contribute a multi-path-recursive-adj, i.e. a multi-path-adj
 293  * with each choice therein being another adj;
 294  *
 295  *  Y/16 -> RM-ADJ --> ADJ1 (for PE1)
 296  *                 --> ADJ2 (for PE2)
 297  *
 298  * when the route for PE1 is withdrawn then the multi-path-recursive-adjacency
 299  * is updated to be;
 300  *
 301  * Y/16 --> RM-ADJ --> ADJ1 (for PE1)
 302  *                 --> ADJ1 (for PE1)
 303  *
 304  * that is both choices in the ECMP set are the same and thus all traffic is
 305  * forwarded to PE1. Eventually the control plane will download a route update
 306  * for Y/16 to be via PE1 only. At that time the situation will be:
 307  *
 308  * Y/16 -> R-ADJ --> ADJ1 (for PE1)
 309  *
 310  * In the scenario above we assumed that PE1 and PE2 are ECMP for Y/16. eBGP
 311  * PIC core is also specified for the case were one PE is primary and the other
 312  * backup - VPP FIB does not support that case at this time.
 313  *
 314  * 3) eBGP PIC Edge; Traffic from CE3 to Y/16. On PE1 there is are routes;
 315  *      Y/16 (and hundreds of thousands of others like it)
 316  *         via CE1 (primary)
 317  *         via PE2 (backup)
 318  *   and
 319  *     CE1 (this is an adj-fib)
 320  *       via 11.0.0.1 Link0 (this is CE1) << this is an adj-fib
 321  *     PE2 (PE2's loopback address)
 322  *       via 10.0.5.5 Link1 (this is link PE1-PE2)
 323  * the failure is the loss of link0 to CE1. The failure can be detected by FIB
 324  * either as a link down event or by the control plane withdrawing the connected
 325  * prefix on the link0 (say 10.0.5.4/30). The latter works because the resolving
 326  * entry is an adj-fib, so removing the connected will withdraw the adj-fib, and
 327  * hence the recursive path becomes unresolved. The former is faster,
 328  * particularly in the case of Inter-AS option A where there are many VLAN
 329  * sub-interfaces on the PE-CE link, one for each VRF, and so the control plane
 330  * must remove the connected prefix for each sub-interface to trigger PIC in
 331  * each VRF. Note though that total PIC cutover time will depend on VRF scale
 332  * with either trigger.
 333  * Primary and backup paths in this eBGP PIC-edge scenario are calculated by
 334  * BGP. Each peer is configured to always advertise its best external path to
 335  * its iBGP peers. Backup paths therefore send traffic from the PE back into the
 336  * core to an alternate PE. A PE may have multiple external paths, i.e. multiple
 337  * directly connected CEs, it may also have multiple backup PEs, however there
 338  * is no correlation between the two, so unlike LFA-FRR, the redundancy model is
 339  * N-M; N primary paths are backed-up by M backup paths - only when all primary
 340  * paths fail, then the cutover is performed onto the M backup paths. Note that
 341  * PE2 must be suitably configured to forward traffic on its external path that
 342  * was received from PE1. VPP FIB does not support external-internal-BGP (eiBGP)
 343  * load-balancing.
 344  *
 345  * As with LFA-FRR the use of primary and backup paths is not currently
 346  * supported, however, the use of a recursive-multi-path-adj, and a suitably
 347  * constrained hashing algorithm to choose from the primary or backup path sets,
 348  * would again provide the necessary shared object and hence the prefix scale
 349  * independent cutover.
 350  *
 351  * Astute readers will recognise that both of the eBGP PIC scenarios refer only
 352  * to a BGP free core.
 353  *
 354  * Fast convergence implementation options come in two flavours:
 355  *  1) Insert switches into the data-path. The switch represents the protected
 356  *     resource. If the switch is 'on' the primary path is taken, otherwise
 357  *     the backup path is taken. Testing the switch in the data-path comes with
 358  *     an associated performance cost. A given packet may encounter more than
 359  *     one protected resource as it is forwarded. This approach minimises
 360  *     cutover times as packets will be forwarded on the backup path as soon
 361  *     as the protected resource is detected to be down and the single switch
 362  *     is tripped. However, it comes at a performance cost, which increases
 363  *     with each shared resource a packet encounters in the data-path.
 364  *     This approach is thus best suited to LFA-FRR where the protected routes
 365  *     are non-recursive (i.e. encounter few shared resources) and the
 366  *     expectation on cutover times is more stringent (<50msecs).
 367  *  2) Update shared objects. Identify objects in the data-path, that are
 368  *     required to be present whether or not fast convergence is required (i.e.
 369  *     adjacencies) that can be shared by multiple routes. Create a dependency
 370  *     between these objects at the protected resource. When the protected
 371  *     resource fails, each of the shared objects is updated in a way that all
 372  *     users of it see a consistent change. This approach incurs no performance
 373  *     penalty as the data-path structure is unchanged, however, the cutover
 374  *     times are longer as more work is required when the resource fails. This
 375  *     scheme is thus more appropriate to recursive prefixes (where the packet
 376  *     will encounter multiple protected resources) and to fast-convergence
 377  *     technologies where the cutover times are less stringent (i.e. PIC).
 378  *
 379  * Implementation:
 380  * ---------------
 381  *
 382  * Due to the requirements outlined above, not all routes known to FIB
 383  * (e.g. adj-fibs) are installed in forwarding. However, should circumstances
 384  * change, those routes will need to be added. This adds the requirement that
 385  * a FIB maintains two tables per-VRF, per-AF (where a 'table' is indexed by
 386  * prefix); the forwarding and non-forwarding tables.
 387  *
 388  * For DP speed in VPP we want the lookup in the forwarding table to directly
 389  * result in the ADJ. So the two tables; one contains all the routes (a
 390  * lookup therein yields a fib_entry_t), the other contains only the forwarding
 391  * routes (a lookup therein yields an ip_adjacency_t). The latter is used by the
 392  * DP.
 393  * This trades memory for forwarding performance. A good trade-off in VPP's
 394  * expected operating environments.
 395  *
 396  * Note these tables are keyed only by the prefix (and since there 2 two
 397  * per-VRF, implicitly by the VRF too). The key for an adjacency is the
 398  * tuple:{next-hop, address (and it's AF), interface, link/ether-type}.
 399  * consider this curious, but allowed, config;
 400  *
 401  *   set int ip addr 10.0.0.1/24 Gig0
 402  *   set ip arp Gig0 10.0.0.2 dead.dead.dead
 403  *   # a host in that sub-net is routed via a better next hop (say it avoids a
 404  *   # big L2 domain)
 405  *   ip route add 10.0.0.2 Gig1 192.168.1.1
 406  *   # this recursive should go via Gig1
 407  *   ip route add 1.1.1.1/32 via 10.0.0.2
 408  *   # this non-recursive should go via Gig0
 409  *   ip route add 2.2.2.2/32 via Gig0 10.0.0.2
 410  *
 411  * for the last route, the lookup for the path (via {Gig0, 10.0.0.2}) in the
 412  * prefix table would not yield the correct result. To fix this we need a
 413  * separate table for the adjacencies.
 414  *
 415  *  - FIB data structures;
 416  *
 417  * fib_entry_t:
 418  *   - a representation of a route.
 419  *     - has a prefix.
 420  *    - it maintains an array of path-lists that have been contributed by the
 421  *      different sources
 422  *    - install an adjacency in the forwarding table contributed by the best
 423  *      source's path-list.
 424  *
 425  * fib_path_list_t:
 426  *   - a list of paths
 427  *   - path-lists may be shared between FIB entries. The path-lists are thus
 428  *     kept in a DB. The key is the combined description of the paths. We share
 429  *     path-lists  when it will aid convergence to do so. Adding path-lists to
 430  *     this DB that are never shared, or are not shared by prefixes that are
 431  *     not subject to PIC, will increase the size of the DB unnecessarily and
 432  *     may lead to increased search times due to hash collisions.
 433  *   - the path-list contributes the appropriate adj for the entry in the
 434  *     forwarding table. The adj can be 'normal', multi-path or recursive,
 435  *     depending on the number of paths and their types.
 436  *   - since path-lists are shared there is only one instance of the multi-path
 437  *     adj that they [may] create. As such multi-path adjacencies do not need a
 438  *     separate DB.
 439  * The path-list with recursive paths and the recursive adjacency that it
 440  * contributes forms the backbone of the fast convergence architecture (as
 441  * described previously).
 442  *
 443  * fib_path_t:
 444  *   - a description of how to forward the traffic (i.e. via {Gig1, K}).
 445  *   - the path describes the intent on how to forward. This differs from how
 446  *     the path resolves. I.e. it might not be resolved at all (since the
 447  *     interface is deleted or down).
 448  *   - paths have different types, most notably recursive or non-recursive.
 449  *   - a fib_path_t will contribute the appropriate adjacency object. It is from
 450  *     these contributions that the DP graph/chain for the route is built.
 451  *   - if the path is recursive and a recursion loop is detected, then the path
 452  *     will contribute the special DROP adjacency. This way, whilst the control
 453  *     plane graph is looped, the data-plane graph does not.
 454  *
 455  * we build a graph of these objects;
 456  *
 457  *  fib_entry_t -> fib_path_list_t -> fib_path_t -> ...
 458  *
 459  * for recursive paths:
 460  *
 461  *   fib_path_t -> fib_entry_t -> ....
 462  *
 463  * for non-recursive paths
 464  *
 465  *  fib_path_t -> ip_adjacency_t -> interface
 466  *
 467  * These objects, which constitute the 'control plane' part of the FIB are used
 468  * to represent the resolution of a route. As a whole this is referred to as the
 469  * control plane graph. There is a separate DP graph to represent the forwarding
 470  * of a packet. In the DP graph each object represents an action that is applied
 471  * to a packet as it traverses the graph. For example, a lookup of a IP address
 472  * in the forwarding table could result in the following graph:
 473  *
 474  *    recursive-adj --> multi-path-adj --> interface_A
 475  *                                     --> interface_B
 476  *
 477  * A packet traversing this FIB DP graph would thus also traverse a VPP node
 478  * graph of:
 479  *
 480  *    ipX_recursive --> ipX_rewrite --> interface_A_tx --> etc
 481  *
 482  * The taxonomy of objects in a FIB graph is as follows, consider;
 483  *
 484  *   A -->
 485  *   B --> D
 486  *   C -->
 487  *
 488  * Where A,B and C are (for example) routes that resolve through D.
 489  *  parent; D is the parent of A, B, and C.
 490  *  children: A, B, and C are children of D.
 491  *  sibling: A, B and C are siblings of one another.
 492  *
 493  * All shared objects in the FIB are reference counted. Users of these objects
 494  * are thus expected to use the add_lock/unlock semantics (as one would
 495  * normally use malloc/free).
 496  *
 497  * WALKS
 498  *
 499  * It is necessary to walk/traverse the graph forwards (entry to interface) to
 500  * perform a collapse or build a recursive adj and backwards (interface
 501  * to entry) to perform updates, i.e. when interface state changes or when
 502  * recursive route resolution updates occur.
 503  * A forward walk follows simply by navigating an object's parent pointer to
 504  * access its parent object. For objects with multiple parents (e.g. a
 505  * path-list), each parent is walked in turn.
 506  * To support back-walks direct dependencies are maintained between objects,
 507  * i.e. in the relationship, {A, B, C} --> D, then object D will maintain a list
 508  * of 'pointers' to its children {A, B, C}. Bare C-language pointers are not
 509  * allowed, so a pointer is described in terms of an object type (i.e. entry,
 510  * path-list, etc) and index - this allows the object to be retrieved from the
 511  * appropriate pool. A list is maintained to achieve fast convergence at scale.
 512  * When there are millions or recursive prefixes, it is very inefficient to
 513  * blindly walk the tables looking for entries that were affected by a given
 514  * topology change. The lowest hanging fruit when optimising is to remove
 515  * actions that are not required, so all back-walks only traverse objects that
 516  * are directly affected by the change.
 517  *
 518  * PIC Core and fast-reroute rely on FIB reacting quickly to an interface
 519  * state change to update the multi-path-adjacencies that use this interface.
 520  * An example graph is shown below:
 521  *
 522  *    E_a -->
 523  *    E_b --> PL_2 --> P_a --> Interface_A
 524  *    ...          --> P_c -\
 525  *    E_k -->                \
 526  *                            Interface_K
 527  *                            /
 528  *    E_l -->                /
 529  *    E_m --> PL_1 --> P_d -/
 530  *    ...          --> P_f --> Interface_F
 531  *    E_z -->
 532  *
 533  * E  = fib_entry_t
 534  * PL = fib_path_list_t
 535  * P  = fib_path_t
 536  * The subscripts are arbitrary and serve only to distinguish object instances.
 537  * This CP graph result in the following DP graph:
 538  *
 539  *     M-ADJ-2 --> Interface_A
 540  *             \
 541  *              -> Interface_K
 542  *             /
 543  *     M-ADJ-1 --> Interface_F
 544  *
 545  * M-ADJ = multi-path-adjacency.
 546  *
 547  * When interface K goes down a back-walk is started over its dependants in the
 548  * control plane graph. This back-walk will reach PL_1 and PL_2 and result in
 549  * the calculation of new adjacencies that have interface K removed. The walk
 550  * will continue to the entry objects and thus the forwarding table is updated
 551  * for each prefix with the new adjacency. The DP graph then becomes:
 552  *
 553  *    ADJ-3 --> Interface_A
 554  *
 555  *    ADJ-4 --> Interface_F
 556  *
 557  * The eBGP PIC scenarios described above relied on the update of a path-list's
 558  * recursive-adjacency to provide the shared point of cutover. This is shown
 559  * below
 560  *
 561  *    E_a -->
 562  *    E_b --> PL_2 --> P_a --> E_44 --> PL_a --> P_b --> Interface_A
 563  *    ...          --> P_c -\
 564  *    E_k -->                \
 565  *                            \
 566  *                           E_1 --> PL_k -> P_k --> Interface_K
 567  *                            /
 568  *    E_l -->                /
 569  *    E_m --> PL_1 --> P_d -/
 570  *    ...          --> P_f --> E_55 --> PL_e --> P_e --> Interface_E
 571  *    E_z -->
 572  *
 573  * The failure scenario is the removal of entry E_1 and thus the paths P_c and
 574  * P_d become unresolved. To achieve PIC the two shared recursive path-lists,
 575  * PL_1 and PL_2 must be updated to remove E_1 from the recursive-multi-path-
 576  * adjacencies that they contribute, before any entry E_a to E_z is updated.
 577  * This means that as the update propagates backwards (right to left) in the
 578  * graph it must do so breadth first not depth first. Note this approach leads
 579  * to convergence times that are dependent on the number of path-list and so
 580  * the number of combinations of egress PEs - this is desirable as this
 581  * scale is considerably lower than the number of prefixes.
 582  *
 583  * If we consider another section of the graph that is similar to the one
 584  * shown above where there is another prefix E_2 in a similar position to E_1
 585  * and so also has many dependent children. It is reasonable to expect that a
 586  * particular network failure may simultaneously render E_1 and E_2 unreachable.
 587  * This means that the update to withdraw E_2 is download immediately after the
 588  * update to withdraw E_1. It is a requirement on the FIB to not spend large
 589  * amounts of time in a back-walk whilst processing the update for E_1, i.e. the
 590  * back-walk must not reach as far as E_a and its siblings. Therefore, after the
 591  * back-walk has traversed one generation (breadth first) to update all the
 592  * path-lists it should be suspended/back-ground and further updates allowed
 593  * to be handled. Once the update queue is empty, the suspended walks can be
 594  * resumed. Note that in the case that multiple updates affect the same entry
 595  * (say E_1) then this will trigger multiple similar walks, these are merged,
 596  * so each child is updated only once.
 597  * In the presence of more layers of recursion PIC is still a desirable
 598  * feature. Consider an extension to the diagram above, where more recursive
 599  * routes (E_100 -> E_200) are added as children of E_a:
 600  *
 601  * E_100 -->
 602  * E_101 --> PL_3 --> P_j-\
 603  * ...                     \
 604  * E_199 -->               E_a -->
 605  *                         E_b --> PL_2 --> P_a --> E_44 --> ...etc..
 606  *                         ...          --> P_c -\
 607  *                         E_k                    \
 608  *                                                E_1 --> ...etc..
 609  *                                                 /
 610  *                         E_l -->                /
 611  *                         E_m --> PL_1 --> P_d -/
 612  *                         ...          --> P_e --> E_55 --> ...etc..
 613  *                         E_z -->
 614  *
 615  * To achieve PIC for the routes E_100->E_199, PL_3 needs to be updated before
 616  * E_b -> E_z, a breadth first traversal at each level would not achieve this.
 617  * Instead the walk must proceed intelligently. Children on PL_2 are sorted so
 618  * those Entry objects that themselves have children appear first in the list,
 619  * those without later. When an entry object is walked that has children, a
 620  * walk of its children is pushed to the front background queue. The back
 621  * ground queue is a priority queue. As the breadth first traversal proceeds
 622  * across the dependent entry object E_a to E_k, when the first entry that does
 623  * not have children is reached (E_b), the walk is suspended and placed at the
 624  * back of the queue. Following this prioritisation method shared path-list
 625  * updates are performed before all non-resolving entry objects.
 626  * The CPU/core/thread that handles the updates is the same thread that handles
 627  * the back-walks. Handling updates has a higher priority than making walk
 628  * progress, so a walk is required to be interruptable/suspendable when new
 629  * updates are available.
 630  * !!! TODO - this section describes how walks should be not how they are !!!
 631  *
 632  * In the diagram above E_100 is an IP route, however, VPP has no restrictions
 633  * on the type of object that can be a dependent of a FIB entry. Children of
 634  * a FIB entry can be (and are) GRE & VXLAN tunnels endpoints, L2VPN LSPs etc.
 635  * By including all object types into the graph and extending the back-walk, we
 636  * can thus deliver fast convergence to technologies that overlay on an IP
 637  * network.
 638  *
 639  * If having read all the above carefully you are still thinking;  'i don't need
 640  * all this %&$* i have a route only I know about and I just need to jam it in',
 641  * then fib_table_entry_special_add() is your only friend.
 642  */
 643
 644 #ifndef __FIB_H__
 645 #define __FIB_H__
 646
 647 #include <vnet/fib/fib_table.h>
 648 #include <vnet/fib/fib_entry.h>
 649
 650 #endif