The load balancer plugin is currently in *beta* version. Both CLIs and
APIs are subject to *heavy* changes, which also means feedback is really
welcome regarding features, APIs, etc.
This plugin provides load balancing for VPP in a way that is largely
inspired by Google's MagLev:
http://research.google.com/pubs/pub44824.html
The load balancer is configured with a set of Virtual IPs (VIP, which
can be prefixes), and for each VIP, with a set of Application Server
addresses (ASs).
There are four encap types to steer traffic to different ASs:

1). IPv4+GRE and IPv6+GRE encap types: Traffic received for a given VIP (or
VIP prefix) is tunneled using GRE towards the different ASs in a way
that (tries to) ensure that a given session will always be tunneled to
the same AS.
2). IPv4+L3DSR encap type: L3DSR is used to overcome the Layer 2
limitations of Direct Server Return load balancing. It maps the VIP to
DSCP bits and reuses the TOS bits to carry the DSCP value to the server,
which then recovers the VIP from its DSCP-to-VIP mapping.
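The server-side half of this idea can be sketched as follows (a hedged illustration, not VPP or plugin code; the mapping and function names are made up):

```python
# Illustrative sketch of the L3DSR idea (not VPP code): the load balancer
# encodes the VIP as a DSCP value in the IPv4 TOS byte, and the server
# recovers the VIP from a static DSCP-to-VIP mapping.
DSCP_TO_VIP = {2: "100.0.0.0/8"}  # would mirror `lb vip ... encap l3dsr dscp 2`

def server_recover_vip(tos_byte):
    dscp = tos_byte >> 2          # DSCP is the upper 6 bits of the TOS byte
    return DSCP_TO_VIP.get(dscp)
```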
Both VIPs and ASs can be IPv4 or IPv6, but for a given VIP, all ASs must
use the same encap. type (i.e. IPv4+GRE, IPv6+GRE or IPv4+L3DSR),
meaning that for a given VIP, all AS addresses must be of the same
family.
3). IPv4/IPv6 + NAT4/NAT6 encap types: This type provides a user-space
kube-proxy data plane, which replaces the Linux kernel's iptables-based
kube-proxy.
Currently, the load balancer plugin supports three service types:
a) Cluster IP plus Port: supports any protocol, including TCP and UDP.
b) Node IP plus Node Port: currently only supports UDP.
c) External Load Balancer.
For the Cluster IP plus Port case: kube-proxy is configured with a set of
Virtual IPs (VIP, which can be prefixes), and for each VIP, with a set
of AS addresses (ASs).
For a specific session received for a given VIP (or VIP prefix), the
first packet selects an AS according to the internal load-balancing
algorithm, is DNATed, and is sent to the chosen AS. At the same time, a
session entry is created to record the chosen AS. Subsequent packets for
that session look up the session table first, which ensures that a given
session will always be routed to the same AS.

Returned packets from the AS are SNATed and sent out.
Please refer to the following for details:
https://schd.ws/hosted_files/ossna2017/1e/VPP_K8S_GTPU_OSSNA.pdf
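The per-session behaviour described above can be sketched as follows (a simplified illustration, not the plugin's implementation; all names are invented):

```python
# Sketch of the session-table logic described above (not VPP code):
# the first packet of a flow selects an AS and records the choice;
# subsequent packets find the entry and reuse the same AS.
sessions = {}  # flow 5-tuple -> chosen AS

def pick_as(flow, backends):
    # stand-in for the internal load-balancing algorithm
    return backends[hash(flow) % len(backends)]

def forward(flow, backends):
    as_addr = sessions.get(flow)
    if as_addr is None:
        as_addr = pick_as(flow, backends)  # first packet: select an AS
        sessions[flow] = as_addr           # create the session entry
    return as_addr                         # DNAT towards this AS
```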
The load balancer has been tested up to 1 million flows and still
forwards more than 3Mpps per core in such circumstances. Although 3Mpps
already seems good, it is likely that performance will be improved in
future versions.
The load balancer needs to be configured with some parameters:
    lb conf [ip4-src-address <addr>] [ip6-src-address <addr>]
            [buckets <n>] [timeout <s>]
ip4-src-address: the source address used to send encap. packets using
IPv4 in GRE4 mode, or the node's IPv4 address in NAT4 mode.

ip6-src-address: the source address used to send encap. packets using
IPv6 in GRE6 mode, or the node's IPv6 address in NAT6 mode.

buckets: the *per-thread* number of buckets in the
established-connections-table.

timeout: the number of seconds a connection will remain in the
established-connections-table while no packet for this flow is received.
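For instance, a plausible configuration could look like this (addresses and values below are only illustrative):

    lb conf ip4-src-address 10.0.0.2 ip6-src-address 2004::1 buckets 1024 timeout 3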
    lb vip <prefix> [encap (gre6|gre4|l3dsr|nat4|nat6)] \
      [dscp <n>] [port <n> target_port <n> node_port <n>] [new_len <n>] [del]
new_len is the size of the new-connection-table. It should be 1 or 2
orders of magnitude bigger than the number of ASs for the VIP in order
to ensure good load balancing. The l3dsr encap with the dscp option is
used to map the VIP to a DSCP value and rewrite the DSCP bits in
outgoing packets, so that the selected server can recover the VIP from
the DSCP bits and perform DSR. The nat4/nat6 encaps with the
port/target_port/node_port options are used to implement the kube-proxy
data plane.

Examples:
    lb vip 2002::/16 encap gre6 new_len 1024
    lb vip 2003::/16 encap gre4 new_len 2048
    lb vip 80.0.0.0/8 encap gre6 new_len 16
    lb vip 90.0.0.0/8 encap gre4 new_len 1024
    lb vip 100.0.0.0/8 encap l3dsr dscp 2 new_len 32
    lb vip 90.1.2.1/32 encap nat4 port 3306 target_port 3307 node_port 30964 new_len 1024
    lb vip 2004::/16 encap nat6 port 6306 target_port 6307 node_port 30966 new_len 1024
Configure the ASs (for each VIP)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    lb as <vip-prefix> [<address> [<address> [...]]] [del]
You can add (or delete) multiple ASs at a time (for a single VIP). Note
that the AS address family must correspond to the VIP encap. IP family.

Examples:
    lb as 2002::/16 2001::2 2001::3 2001::4
    lb as 2003::/16 10.0.0.1 10.0.0.2
    lb as 80.0.0.0/8 2001::2
    lb as 90.0.0.0/8 10.0.0.1
    lb set interface nat4 in <intfc> [del]

Sets the SNAT feature on a specific interface (applicable in NAT4 mode only).
    lb set interface nat6 in <intfc> [del]

Sets the SNAT feature on a specific interface (applicable in NAT6 mode only).
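For example (the interface name below is only illustrative):

    lb set interface nat4 in GigabitEthernet0/8/0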
The plugin provides quite a few counters and other information. These are
still subject to quite significant changes.
MagLev is a distributed system which pseudo-randomly generates a
new-connections-table based on AS names, such that each server configured
with the same set of ASs ends up with the same table. Connection
stickiness is then ensured with an established-connections-table. Using
ECMP, it is assumed (but not relied on) that servers will mostly receive
traffic for different flows.
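The table-population idea from the MagLev paper can be sketched as follows (a simplified, hedged illustration; the hash choices and table size are arbitrary, and both the paper and the plugin differ in details):

```python
import hashlib

def _h(name, salt):
    # deterministic hash; the real MagLev uses two distinct hash functions
    return int.from_bytes(hashlib.sha256(f"{salt}:{name}".encode()).digest()[:8], "big")

def maglev_table(backends, table_size=13):
    # table_size should be prime so that every `skip` value yields a full
    # permutation of the slots. Each backend gets a pseudo-random
    # permutation, and backends take turns claiming their next preferred
    # slot until the table is full. Identical backend sets always produce
    # identical tables, which is the property the text describes.
    perms = {}
    for b in backends:
        offset = _h(b, 0) % table_size
        skip = _h(b, 1) % (table_size - 1) + 1
        perms[b] = [(offset + i * skip) % table_size for i in range(table_size)]
    table = [None] * table_size
    nexts = {b: 0 for b in backends}
    filled = 0
    while filled < table_size:
        for b in backends:
            # advance to this backend's next unclaimed preferred slot
            while True:
                slot = perms[b][nexts[b]]
                nexts[b] += 1
                if table[slot] is None:
                    table[slot] = b
                    filled += 1
                    break
            if filled == table_size:
                break
    return table
```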
This implementation pushes the parallelism a little bit further by using
one established-connections-table per thread. This is equivalent to
assuming that RSS will do a job similar to ECMP, and is pretty useful as
threads don't need to take a lock in order to write to the table.
A load balancer requires a hash table that is efficient for both reads
and writes. The hash table used by ip6-forward is very read-efficient,
but not so much for writing. In addition, it is not a big deal if
writing into the hash table fails (again, MagLev uses a flow table but
does not heavily rely on it).

The plugin therefore uses a very specific (and stupid) hash table:
- Fixed (and power of 2) number of buckets (configured at runtime)
- Fixed (and power of 2) number of elements per bucket (configured at
  compilation time)
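A toy version of such a table might look like this (an assumption-laden sketch, not the plugin's C implementation; sizes and names are made up):

```python
# Toy fixed-size hash table (not the plugin's code): power-of-2 bucket
# count, fixed entries per bucket, and inserts that may simply fail.
BUCKETS = 8      # power of 2; the plugin configures this at runtime
PER_BUCKET = 4   # power of 2; the plugin fixes this at compilation time

table = [[] for _ in range(BUCKETS)]

def ht_insert(flow_hash, value):
    bucket = table[flow_hash & (BUCKETS - 1)]  # mask, since BUCKETS is 2^n
    if len(bucket) >= PER_BUCKET:
        return False  # full bucket: failing the write is acceptable here
    bucket.append((flow_hash, value))
    return True

def ht_lookup(flow_hash):
    for h, v in table[flow_hash & (BUCKETS - 1)]:
        if h == flow_hash:
            return v
    return None
```

Failing an insert is fine for the same reason it is fine in MagLev: the flow simply falls back to the new-connections-table.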
When an AS is removed, there are two possible ways to react:
- Keep using the AS for established connections
- Change the AS for established connections (likely to cause errors for TCP)

In the first case, although an AS is removed from the configuration, its
associated state needs to stay around as long as it is used by at least
one thread.
In order to avoid locks, a specific reference counter is used. The
design is quite similar to clib counters but:
- It is possible to decrease the value
- Summing will not zero the per-thread counters
- Only the thread can reallocate its own counters vector (to avoid
  concurrency issues)
This reference counter is lock-free, but reading a count of 0 does not
mean the value can be freed unless it is ensured by *other* means that
no other thread is concurrently referencing the object. In the case of
this plugin, it is assumed that no concurrent event will take place
after a few seconds.
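The counter design can be sketched as follows (an illustrative model, not the plugin's C code; the class and method names are invented):

```python
# Sketch of a per-thread reference counter (not the plugin's code): each
# thread only writes its own slot, so no lock is needed, and the true
# count is obtained by summing the slots without zeroing them.
class PerThreadRefcount:
    def __init__(self, n_threads):
        self.slots = [0] * n_threads  # one counter per thread

    def incr(self, thread_id):
        self.slots[thread_id] += 1    # only this thread touches its slot

    def decr(self, thread_id):
        self.slots[thread_id] -= 1    # unlike clib counters, decrements are allowed

    def total(self):
        # summing does not zero the per-thread counters; a total of 0 is
        # only meaningful if no thread can still be referencing the object
        return sum(self.slots)
```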