src/plugins/acl/acl_multicore_doc.rst

   1 Multicore support for ACL plugin
   2 ================================
   3
   4 This captures some considerations and design decisions that I have made,
   5 both for my own memory later on (“what the hell was I thinking?!?”), and
   6 for anyone interested to criticize/improve/hack on this code.
   7
   8 One of the factors taken into account while making these decisions, was
   9 the relative emphasis on the multi-thread vs. single-thread use cases:
  10 the latter is the vastly more prevalent. But, one can not optimize the
  11 single-thread performance without having a functioning code for
  12 multi-thread.
  13
  14 stateless ACLs
  15 --------------
  16
  17 The stateless trivially parallelizes, and the only potential for the
  18 race between the different threads is during the reconfiguration, at the
  19 time of replacing the old ACL being checked, with the new ACL.
  20
  21 In case an acl_add_replace is being used to replace the rules within the
  22 existing entry, a reallocation of ``am->acls[X].rules`` vector will
  23 happen and potentially a change in count.
  24
  25 acl_match_5tuple() has the following code:
  26
  27 .. code:: c
  28
  29      a = am->acls + acl_index;
  30      for (i = 0; i < a->count; i++)
  31        {
  32          r = a->rules + i;
  33         . . .
  34
  35 Ideally we should be immune from a->rules changing, but the problem
  36 arises if the count changes in flight, and the new ruleset is smaller -
  37 then we will attempt to “match” against the free memory.
  38
  39 This can(?) be solved by replacing the for() with while(), so the
  40 comparison happens at each iteration.
  41
  42 full_acl_match_5tuple(), which iterates over the list of ACLs, is a bit
  43 less immune, since it takes the pointer to the vector to iterate and
  44 keeps a local copy of that pointer.
  45
  46 This race can be solved by checking the current pointer to the vector
  47 with the source pointer, and seeing if there is an (unlikely) change,
  48 and if there is, return the “deny” action, or, better, restart the
  49 check.
  50
  51 Since the check reloads the ACL list on a per-packet basis, there is
  52 only a window of opportunity of one packet to “match” packet against an
  53 incorrect rule set. The workers also do not change anything, only read.
  54 Therefore, it looks like building special structures to ensure that it
  55 does not happen at all might be not worth it.
  56
  57 At least not until we have a unit-test able to reliably catch this
  58 condition and test that the measures applied are effective. Adding the
  59 code which is not possible to exercise is worse than not adding any code
  60 at all.
  61
  62 So, I opt for “do-nothing” here for the moment.
  63
  64 reflexive ACLs: single-thread
  65 -----------------------------
  66
  67 Before we talk multi-thread, is worth revisiting the design of the
  68 reflexive ACLs in the plugin, and the history of their evolution.
  69
  70 The very first version of the ACL plugin, shipped in 1701, mostly did
  71 the job using the existing components and gluing them together. Because
  72 it needed to work in bridged forwarding path only, using L2 classifier
  73 as an insertion point appeared natural, also L2 classifier, being a
  74 table with sessions, seemed like a good place to hold the sessions.
  75
  76 So, the original design had two conceptual nodes: one, pointed by the
  77 next_miss from the L2 classifier table, was checking the actual ACL, and
  78 inserting session into the L2 classifier table, and the other one,
  79 pointed to by the next_match within the specific session rule, was
  80 checking the existing session. The timing out of the existing
  81 connections was done in the datapath, by periodically calling the aging
  82 function.
  83
  84 This decision to use the existing components, with its attractiveness,
  85 did bring a few limitations as well:
  86
  87 -  L2 classifier is a simple mask-and-value match, with a fixed mask
  88    across the table. So, sanely supporting IPv6 packets with extension
  89    headers in that framework was impossible.
  90
  91 -  There is no way to get a backpressure from L2 classifier depending on
  92    memory usage. When it runs out of memory, it simply crashes the box.
  93    When it runs out of memory ? We don’t really know. Depends on how it
  94    allocates it.
  95
  96 -  Since we need to match the *reflected* traffic, we had to create
  97    *two* full session entries in two different directions, which is
  98    quite wasteful memory-wise.
  99
 100 -  (showstopper): the L2 classifier runs only in the bridged data path,
 101    so supporting routed data path would require creating something else
 102    entirely different, which would mean much more headaches support-wise
 103    going forward.
 104
 105 Because of that, I have moved to a different model of creating a
 106 session-5-tuple from the packet data - once, and then doing all the
 107 matching just on that 5-tuple.
 108
 109 This has allowed to add support for skipping IPv6 extension headers.
 110
 111 Also, this new version started to store the sessions in a dedicated
 112 bihash-per-interface, with the session key data being aligned for the
 113 ingress packets, and being mirrored for the egress packets. This allows
 114 of significant savings in memory, because now we need to keep only one
 115 copy of the session table per interface instead of two, and also to only
 116 have ONE node for all the lookups, (L2/L3 path, in/out, IPv4/IPv6) -
 117 significantly reducing the code complexity.
 118
 119 Unfortunately, bihash still has the “lack of backpressure” problem, in a
 120 sense that if you try to insert too many entries and run out of memory
 121 in the heap you supplied, you get a crash.
 122
 123 To somewhat workaround against that, there is a “maximum tested number
 124 of sessions” value, which tracks the currently inserted sessions in the
 125 bihash, and if this number is being approached, a more aggressive
 126 cleanup can happen. If this number is reached, two behaviors are
 127 possible:
 128
 129 -  attempt to do the stateless ACL matching and permit the packet if it
 130    succeeds
 131
 132 -  deny the packet
 133
 134 Currently I have opted for a second one, since it allows for a better
 135 defined behavior, and if you have to permit the traffic in both
 136 directions, why using stateful anyway ?
 137
 138 In order to be able to do the cleanup, we need to discriminate between
 139 the session types, with each session type having its own idle timeout.
 140 In order to do that, we keep three lists, defined in enum acl_timeout_e:
 141 ACL_TIMEOUT_UDP_IDLE, ACL_TIMEOUT_TCP_IDLE, ACL_TIMEOUT_TCP_TRANSIENT.
 142
 143 The first one is hopefully obvious - it is just all UDP connections.
 144 They have an idle timeout of 600 seconds.
 145
 146 The second and third is a bit more subtle. TCP is a complicated
 147 protocol, and we need to tread the fine line between doing too little
 148 and doing too much, and triggering the potential compatibility issues
 149 because of being a “middlebox”.
 150
 151 I decided to split the TCP connections into two classes: established,
 152 and everything else. “Established”, means we have seen the SYN and ACK
 153 from both sides (with PUSH obviously masked out). This is the “active”
 154 state of any TCP connection and we would like to ensure we do not screw
 155 it up. So, the connections in this state have the default idle timer of
 156 24 hours.
 157
 158 All the rest of the connections have the idle timeout of 2 minutes,
 159 (inspired by an old value of MSL) and based on the observation that the
 160 states this class represent are usually very short lived.
 161
 162 Once we have these three baskets of connections, it is trivial to
 163 imagine a simple cleanup mechanism to deal with this: take a TCP
 164 transient connection that has been hanging around.
 165
 166 It is debatable whether we want to do discrimination between the
 167 different TCP transient connections. Assuming we do FIFO (and the lists
 168 allow us to do just that), it means a given connection on the head of
 169 the list has been hanging around for longest. Thus, if we are short on
 170 resources, we might just go ahead and reuse it within the datapath.
 171
 172 This is where we are slowly approaching the question “Why in the world
 173 have not you used timer wheel or such ?”
 174
 175 The answer is simple: within the above constraints, it does not buy me
 176 much.
 177
 178 Also, timer wheel creates a leaky abstraction with a difficult to manage
 179 corner case. Which corner case ?
 180
 181 We have a set of objects (sessions) with an event that may or may not
 182 happen (idle timeout timer firing), and a necessity to reset the idle
 183 timeout when there is activity on the session.
 184
 185 In the worst case, where we had a 10000 of one-packet UDP sessions just
 186 created 10 minutes ago, we would need to deal with a spike of 10000
 187 expired timers.
 188
 189 Of course, if we have the active traffic on all of these 10000
 190 connections, then we will not have to deal with that ? Right, but we
 191 will still have to deal with canceling and requeueing the timers.
 192
 193 In the best possible case, requeueing a timer is going to be something
 194 along the lines of a linked-list removal and reinsertion.
 195
 196 However, keep in mind we already need to classify the connections for
 197 reuse, so therefore we already have the linked lists!
 198
 199 And if we just check these linked lists periodically in a FIFO fashion,
 200 we can get away with a very simple per-packet operation: writing back
 201 the timestamp of “now” into the connection structure.
 202
 203 Then rather than requeueing the list on a per-packet or per-frame basis,
 204 we can defer this action until the time this session appears on the head
 205 of the FIFO list, and the cleaning routine makes the decision about
 206 whether to discard the session (because the interval since last activity
 207 is bigger than the idle timeout), or to requeue the session back to the
 208 end of the list (because the last activity was less than idle timeout
 209 ago).
 210
 211 So, rather than using the timers, we can simply reuse our classification
 212 FIFOs, with the following heuristic: do not look at the session that was
 213 enqueued at time X until X+session_timeout. If we enqueue the sessions
 214 in the order of their initial activity, then we can simply use enqueue
 215 timestamp of the head session as a decision criterion for when we need
 216 to get back at looking at it for the timeout purposes.
 217
 218 Since the number of FIFOs is small, we get a slightly worse check
 219 performance than with timers, but still O(1).
 220
 221 We seemingly do quite a few “useless” operations of requeueing the items
 222 back to the tail of the list - but, these are the operations we do not
 223 have to do in the active data path, so overall it is a win.
 224
 225 (Diversion: I believe this problem is congruent to poll vs. epoll or
 226 events vs. threads, some reading on this subject:
 227 http://web.archive.org/web/20120225022154/http://sheddingbikes.com/posts/1280829388.html)
 228
 229 We can also can run a TCP-like scheme for adaptively changing the wait
 230 period in the routine that deals with the connection timeouts: we can
 231 attempt to check the connections a couple of times per second (same as
 232 we would advance the timer wheel), and then if we have requeued close to
 233 a max-per-quantum number of connections, we can half the waiting
 234 interval, and if we did not requeue any, we can slowly increment the
 235 waiting interval - which at a steady state should stabilize similar to
 236 what the TCP rate does.
 237
 238 reflexive ACLs: multi-thread
 239 ----------------------------
 240
 241 The single-threaded implementation in 1704 used a separate “cleaner”
 242 process to deal with the timing out of the connections. It is all good
 243 and great when you know that there is only a single core to run
 244 everything on, but the existence of the lists proves to be a massive
 245 difficulty when it comes to operating from multiple threads.
 246
 247 Initial study shows that with a few assumptions (e.g. that the cleaner
 248 running in main thread and the worker have a demarcation point in time
 249 where either one or the other one touches the session in the list) it
 250 might be possible to make it work, but the resulting trickiness of doing
 251 it neatly with all the corner cases is quite large.
 252
 253 So, for the multi-threaded scenario, we need to move the connection
 254 aging back to the same CPU as its creation.
 255
 256 Luckily we can do this with the help of the interrupts.
 257
 258 So, the design is as follows: the aging thread
 259 (acl_fa_session_cleaner_process) periodically fires the interrupts to
 260 the workers interrupt nodes
 261 (acl_fa_worker_session_cleaner_process_node.index), using
 262 vlib_node_set_interrupt_pending(), and the interrupt node
 263 acl_fa_worker_conn_cleaner_process() calls acl_fa_check_idle_sessions()
 264 which does the actual job of advancing the lists. And within the actual
 265 datapath the only thing we will be doing is putting the items onto FIFO,
 266 and updating the last active time on the existing connection.
 267
 268 The one “delicate” part is that the worker for one leg of the connection
 269 might be different from the worker of another leg of the connection -
 270 but, even if the “owner” tries to free the connection, nothing terrible
 271 can happen - worst case the element of the pool (which is nominally free
 272 for a short period) will get the timestamp updated - same thing about
 273 the TCP flags seen.
 274
 275 A slightly trickier issue arises when the packet initially seen by one
 276 worker (thus owned by that worker), and the return packet processed by
 277 another worker, and as a result changes the the class of the connection
 278 (e.g. becomes TCP_ESTABLISHED from TCP_TRANSIENT or vice versa). If the
 279 class changes from one with the shorter idle time to the one with the
 280 longer idle time, then unless we are in the starvation mode where the
 281 transient connections are recycled, we can simply do nothing and let the
 282 normal requeue mechanism kick in. If the class changes from the longer
 283 idle timer to the shorter idle timer, then we risk keeping the
 284 connection around for longer than needed, which will affect the resource
 285 usage.
 286
 287 One solution to that is to have NxN ring buffers (where N is the number
 288 of workers), such that the non-owner can signal to the owner the
 289 connection# that needs to be requeued out of order.
 290
 291 A simpler solution though, is to ensure that each FIFO’s period is equal
 292 to that of a shortest timer. This way the resource starvation problem is
 293 taken care of, at an expense of some additional work.
 294
 295 This all looks sufficiently nice and simple until a skeleton falls out
 296 of the closet: sometimes we want to clean the connections en masse
 297 before they expire.
 298
 299 There few potential scenarios: 1) removal of an ACL from the interface
 300 2) removal of an interface 3) manual action of an operator (in the
 301 future).
 302
 303 In order to tackle this, we need to modify the logic which decides
 304 whether to requeue the connection on the end of the list, or to delete
 305 it due to idle timeout:
 306
 307 We define a point in time, and have each worker thread fast-forward
 308 through its FIFO, in the process looking for sessions that satisfy the
 309 criteria, and either keeping them or requeueing them.
 310
 311 To keep the ease of appearance to the outside world, we still process
 312 this as an event within the connection cleaner thread, but this event
 313 handler does as follows: 1) it creates the bitmap of the sw_if_index
 314 values requested to be cleared 2) for each worker, it waits to ensure
 315 there is no cleanup operation in progress (and if there is one, it
 316 waits), and then makes a copy of the bitmap, sets the per-worker flag of
 317 a cleanup operation, and sends an interrupt. 3) wait until all cleanup
 318 operations have completed.
 319
 320 Within the worker interrupt node, we check if the “cleanup in progress”
 321 is set, and if it is, we check the “fast forward time” value. If unset,
 322 we initialize it to value now, and compare the requested bitmap of
 323 sw_if_index values (pending_clear_sw_if_index_bitmap) with the bitmap of
 324 sw_if_index that this worker deals with.
 325
 326 (we set the bit in the bitmap every time we enqueue the packet onto a
 327 FIFO - serviced_sw_if_index_bitmap in acl_fa_conn_list_add_session).
 328
 329 If the result of this AND operation is zero - then we can clear the flag
 330 of cleanup in progress and return. Else we kick off the quantum of
 331 cleanup, and make sure we get another interrupt ASAP if that cleanup
 332 operation returns non-zero, meaning there is more work to do. When that
 333 operation returns zero, everything has been processed, we can clear the
 334 “cleanup-in-progress” flag, and zeroize the bitmap of sw_if_index-es
 335 requested to be cleaned.
 336
 337 The interrupt node signals its wish to receive an interrupt ASAP by
 338 setting interrupt_is_needed flag within the per-worker structure. The
 339 main thread, while waiting for the cleanup operation to complete, checks
 340 if there is a request for interrupt, and if there is - it sends one.
 341
 342 This approach gives us a way to mass-clean the connections which is
 343 reusing the code of the regular idle connection cleanup.
 344
 345 One potential inefficiency is the bitmap values set by the session
 346 insertion in the data path - there is nothing to clear them.
 347
 348 So, if one rearranges the interface placement with the workers, then the
 349 cleanups will cause some unnecessary work. For now, we consider it an
 350 acceptable limitation. It can be resolved by having another per-worker
 351 bitmap, which, when set, would trigger the cleanup of the bits in the
 352 serviced_sw_if_index_bitmap).
 353
 354 === the end ===