src/plugins/acl/acl_multicore_doc.md

   1 Multicore support for ACL plugin    {#acl_multicore}
   2 ================================
   3
   4 This captures some considerations and design decisions that I have made,
   5 both for my own memory later on ("what the hell was I thinking?!?"),
   6 and for anyone interested to criticize/improve/hack on this code.
   7
   8 One of the factors taken into account while making these decisions,
   9 was the relative emphasis on the multi-thread vs. single-thread
  10 use cases: the latter is the vastly more prevalent. But,
  11 one can not optimize the single-thread performance without
  12 having a functioning code for multi-thread.
  13
  14 stateless ACLs
  15 ==============
  16
  17 The stateless trivially parallelizes, and the only potential for the
  18 race between the different threads is during the reconfiguration,
  19 at the time of replacing the old ACL being checked, with
  20 the new ACL.
  21
  22 In case an acl_add_replace is being used to replace the rules
  23 within the existing entry, a reallocation of `am->acls[X].rules`
  24 vector will happen and potentially a change in count.
  25
  26 acl_match_5tuple() has the following code:
  27
  28 ```{.c}
  29   a = am->acls + acl_index;
  30   for (i = 0; i < a->count; i++)
  31     {
  32       r = a->rules + i;
  33      . . .
  34 ```
  35
  36 Ideally we should be immune from a->rules changing,
  37 but the problem arises if the count changes in flight,
  38 and the new ruleset is smaller - then we will attempt
  39 to "match" against the free memory.
  40
  41 This can(?) be solved by replacing the for() with while(),
  42 so the comparison happens at each iteration.
  43
  44 full_acl_match_5tuple(), which iterates over the list
  45 of ACLs, is a bit less immune, since it takes the pointer
  46 to the vector to iterate and keeps a local copy of
  47 that pointer.
  48
  49 This race can be solved by checking the
  50 current pointer to the vector with the source pointer,
  51 and seeing if there is an (unlikely) change, and if
  52 there is, return the "deny" action, or, better,
  53 restart the check.
  54
  55 Since the check reloads the ACL list on a per-packet basis,
  56 there is only a window of opportunity of one packet to
  57 "match" packet against an incorrect rule set.
  58 The workers also do not change anything, only read.
  59 Therefore, it looks like building special structures
  60 to ensure that it does not happen at all might be not
  61 worth it.
  62
  63 At least not until we have a unit-test able to
  64 reliably catch this condition and test that
  65 the measures applied are effective. Adding the code
  66 which is not possible to exercise is worse than
  67 not adding any code at all.
  68
  69 So, I opt for "do-nothing" here for the moment.
  70
  71 reflexive ACLs: single-thread
  72 =============================
  73
  74 Before we talk multi-thread, is worth revisiting the
  75 design of the reflexive ACLs in the plugin, and
  76 the history of their evolution.
  77
  78 The very first version of the ACL plugin, shipped in
  79 1701, mostly did the job using the existing components
  80 and gluing them together. Because it needed to work
  81 in bridged forwarding path only, using L2 classifier
  82 as an insertion point appeared natural, also L2 classifier,
  83 being a table with sessions, seemed like a good place
  84 to hold the sessions.
  85
  86 So, the original design had two conceptual nodes:
  87 one, pointed by the next_miss from the L2 classifier table,
  88 was checking the actual ACL, and inserting session into
  89 the L2 classifier table, and the other one, pointed
  90 to by the next_match within the specific session rule,
  91 was checking the existing session. The timing out
  92 of the existing connections was done in the datapath,
  93 by periodically calling the aging function.
  94
  95 This decision to use the existing components,
  96 with its attractiveness, did bring a few limitations as well:
  97
  98 * L2 classifier is a simple mask-and-value match, with
  99 a fixed mask across the table. So, sanely supporting IPv6
 100 packets with extension headers in that framework was impossible.
 101
 102 * There is no way to get a backpressure from L2 classifier
 103 depending on memory usage. When it runs out of memory,
 104 it simply crashes the box. When it runs out of memory ?
 105 We don't really know. Depends on how it allocates it.
 106
 107 * Since we need to match the *reflected* traffic,
 108 we had to create *two* full session entries
 109 in two different directions, which is quite wasteful memory-wise.
 110
 111 * (showstopper): the L2 classifier runs only in
 112 the bridged data path, so supporting routed data path
 113 would require creating something else entirely different,
 114 which would mean much more headaches support-wise going forward.
 115
 116 Because of that, I have moved to a different model of
 117 creating a session-5-tuple from the packet data - once,
 118 and then doing all the matching just on that 5-tuple.
 119
 120 This has allowed to add support for skipping IPv6 extension headers.
 121
 122 Also, this new version started to store the sessions in a dedicated
 123 bihash-per-interface, with the session key data being
 124 aligned for the ingress packets, and being mirrored for the
 125 egress packets. This allows of significant savings in memory,
 126 because now we need to keep only one copy of the session table per
 127 interface instead of two, and also to only have ONE node for all the lookups,
 128 (L2/L3 path, in/out, IPv4/IPv6) - significantly reducing the code complexity.
 129
 130 Unfortunately, bihash still has the "lack of backpressure" problem,
 131 in a sense that if you try to insert too many entries and run out
 132 of memory in the heap you supplied, you get a crash.
 133
 134 To somewhat workaround against that, there is a "maximum tested number of sessions"
 135 value, which tracks the currently inserted sessions in the bihash,
 136 and if this number is being approached, a more aggressive cleanup
 137 can happen. If this number is reached, two behaviors are possible:
 138
 139 * attempt to do the stateless ACL matching and permit the packet
 140   if it succeeds
 141
 142 * deny the packet
 143
 144 Currently I have opted for a second one, since it allows for
 145 a better defined behavior, and if you have to permit
 146 the traffic in both directions, why using stateful anyway ?
 147
 148 In order to be able to do the cleanup, we need to discriminate between
 149 the session types, with each session type having its own idle timeout.
 150 In order to do that, we keep three lists, defined in enum acl_timeout_e:
 151 ACL_TIMEOUT_UDP_IDLE, ACL_TIMEOUT_TCP_IDLE, ACL_TIMEOUT_TCP_TRANSIENT.
 152
 153 The first one is hopefully obvious - it is just all UDP connections.
 154 They have an idle timeout of 600 seconds.
 155
 156 The second and third is a bit more subtle. TCP is a complicated protocol,
 157 and we need to tread the fine line between doing too little and doing
 158 too much, and triggering the potential compatibility issues because of
 159 being a "middlebox".
 160
 161 I decided to split the TCP connections into two classes:
 162 established, and everything else. "Established", means we have seen
 163 the SYN and ACK from both sides (with PUSH obviously masked out).
 164 This is the "active" state of any TCP connection and we would like
 165 to ensure we do not screw it up. So, the connections in this state
 166 have the default idle timer of 24 hours.
 167
 168 All the rest of the connections have the idle timeout of 2 minutes,
 169 (inspired by an old value of MSL) and based on the observation
 170 that the states this class represent are usually very short lived.
 171
 172 Once we have these three baskets of connections, it is trivial to
 173 imagine a simple cleanup mechanism to deal with this: take a
 174 TCP transient connection that has been hanging around.
 175
 176 It is debatable whether we want to do discrimination between the
 177 different TCP transient connections. Assuming we do FIFO (and
 178 the lists allow us to do just that), it means a given connection
 179 on the head of the list has been hanging around for longest.
 180 Thus, if we are short on resources, we might just go ahead and
 181 reuse it within the datapath.
 182
 183 This is where we are slowly approaching the question
 184 "Why in the world have not you used timer wheel or such ?"
 185
 186 The answer is simple: within the above constraints, it does
 187 not buy me much.
 188
 189 Also, timer wheel creates a leaky abstraction with a difficult
 190 to manage corner case. Which corner case ?
 191
 192 We have a set of objects (sessions) with an event that may
 193 or may not happen (idle timeout timer firing), and a
 194 necessity to reset the idle timeout when there is
 195 activity on the session.
 196
 197 In the worst case, where we had a 10000 of one-packet
 198 UDP sessions just created 10 minutes ago, we would need
 199 to deal with a spike of 10000 expired timers.
 200
 201 Of course, if we have the active traffic on all
 202 of these 10000 connections, then we will not have
 203 to deal with that ? Right, but we will still have to deal
 204 with canceling and requeueing the timers.
 205
 206 In the best possible case, requeueing a timer is
 207 going to be something along the lines of a linked-list
 208 removal and reinsertion.
 209
 210 However, keep in mind we already need to classify the
 211 connections for reuse, so therefore we already have
 212 the linked lists!
 213
 214 And if we just check these linked lists periodically in
 215 a FIFO fashion, we can get away with a very simple per-packet operation:
 216 writing back the timestamp of "now" into the connection structure.
 217
 218 Then rather than requeueing the list on a per-packet or per-frame
 219 basis, we can defer this action until the time this session
 220 appears on the head of the FIFO list, and the cleaning
 221 routine makes the decision about whether to discard
 222 the session (because the interval since last activity is bigger
 223 than the idle timeout), or to requeue the session back to
 224 the end of the list (because the last activity was less
 225 than idle timeout ago).
 226
 227 So, rather than using the timers, we can simply reuse our classification
 228 FIFOs, with the following heuristic: do not look at the session that was
 229 enqueued at time X until X+session_timeout. If we enqueue the sessions
 230 in the order of their initial activity, then we can simply use enqueue
 231 timestamp of the head session as a decision criterion for when we need
 232 to get back at looking at it for the timeout purposes.
 233
 234 Since the number of FIFOs is small, we get a slightly worse check
 235 performance than with timers, but still O(1).
 236
 237 We seemingly do quite a few "useless" operations of requeueing the items
 238 back to the tail of the list - but, these are the operations we do not
 239 have to do in the active data path, so overall it is a win.
 240
 241 (Diversion: I believe this problem is congruent to poll vs. epoll or
 242 events vs. threads, some reading on this subject:
 243 http://web.archive.org/web/20120225022154/http://sheddingbikes.com/posts/1280829388.html)
 244
 245 We can also can run a TCP-like scheme for adaptively changing
 246 the wait period in the routine that deals with the connection timeouts:
 247 we can attempt to check the connections a couple of times per second
 248 (same as we would advance the timer wheel), and then if we have requeued
 249 close to a max-per-quantum number of connections, we can half the waiting
 250 interval, and if we did not requeue any, we can slowly increment the waiting
 251 interval - which at a steady state should stabilize similar to what the TCP rate
 252 does.
 253
 254 reflexive ACLs: multi-thread
 255 =============================
 256
 257 The single-threaded implementation in 1704 used a separate "cleaner" process
 258 to deal with the timing out of the connections.
 259 It is all good and great when you know that there is only a single core
 260 to run everything on, but the existence of the lists proves to be
 261 a massive difficulty when it comes to operating from multiple threads.
 262
 263 Initial study shows that with a few assumptions (e.g. that the cleaner running in main thread
 264 and the worker have a demarcation point in time where either one or the other one touches
 265 the session in the list) it might be possible to make it work, but the resulting
 266 trickiness of doing it neatly with all the corner cases is quite large.
 267
 268 So, for the multi-threaded scenario, we need to move the connection
 269 aging back to the same CPU as its creation.
 270
 271 Luckily we can do this with the help of the interrupts.
 272
 273 So, the design is as follows: the aging thread (acl_fa_session_cleaner_process)
 274 periodically fires the interrupts to the workers interrupt nodes (acl_fa_worker_session_cleaner_process_node.index),
 275 using vlib_node_set_interrupt_pending(), and
 276 the interrupt node acl_fa_worker_conn_cleaner_process() calls acl_fa_check_idle_sessions()
 277 which does the actual job of advancing the lists. And within the actual datapath the only thing we will be
 278 doing is putting the items onto FIFO, and updating the last active time on the existing connection.
 279
 280 The one "delicate" part is that the worker for one leg of the connection might be different from
 281 the worker of another leg of the connection - but, even if the "owner" tries to free the connection,
 282 nothing terrible can happen - worst case the element of the pool (which is nominally free for a short period)
 283 will get the timestamp updated - same thing about the TCP flags seen.
 284
 285 A slightly trickier issue arises when the packet initially seen by one worker (thus owned by that worker),
 286 and the return packet processed by another worker, and as a result changes the
 287 the class of the connection (e.g. becomes TCP_ESTABLISHED from TCP_TRANSIENT or vice versa).
 288 If the class changes from one with the shorter idle time to the one with the longer idle time,
 289 then unless we are in the starvation mode where the transient connections are recycled,
 290 we can simply do nothing and let the normal requeue mechanism kick in. If the class changes from the longer idle
 291 timer to the shorter idle timer, then we risk keeping the connection around for longer than needed, which
 292 will affect the resource usage.
 293
 294 One solution to that is to have NxN ring buffers (where N is the number of workers), such that the non-owner
 295 can signal to the owner the connection# that needs to be requeued out of order.
 296
 297 A simpler solution though, is to ensure that each FIFO's period is equal to that of a shortest timer.
 298 This way the resource starvation problem is taken care of, at an expense of some additional work.
 299
 300 This all looks sufficiently nice and simple until a skeleton falls out of the closet:
 301 sometimes we want to clean the connections en masse before they expire.
 302
 303 There few potential scenarios:
 304 1) removal of an ACL from the interface
 305 2) removal of an interface
 306 3) manual action of an operator (in the future).
 307
 308 In order to tackle this, we need to modify the logic which decides whether to requeue the
 309 connection on the end of the list, or to delete it due to idle timeout:
 310
 311 We define a point in time, and have each worker thread fast-forward through its FIFO,
 312 in the process looking for sessions that satisfy the criteria, and either keeping them or requeueing them.
 313
 314 To keep the ease of appearance to the outside world, we still process this as an event
 315 within the connection cleaner thread, but this event handler does as follows:
 316 1) it creates the bitmap of the sw_if_index values requested to be cleared
 317 2) for each worker, it waits to ensure there is no cleanup operation in progress (and if there is one,
 318 it waits), and then makes a copy of the bitmap, sets the per-worker flag of a cleanup operation, and sends an interrupt.
 319 3) wait until all cleanup operations have completed.
 320
 321 Within the worker interrupt node, we check if the "cleanup in progress" is set,
 322 and if it is, we check the "fast forward time" value. If unset, we initialize it to value now, and compare the
 323 requested bitmap of sw_if_index values (pending_clear_sw_if_index_bitmap) with the bitmap of sw_if_index that this worker deals with.
 324
 325 (we set the bit in the bitmap every time we enqueue the packet onto a FIFO - serviced_sw_if_index_bitmap in acl_fa_conn_list_add_session).
 326
 327 If the result of this AND operation is zero - then we can clear the flag of cleanup in progress and return.
 328 Else we kick off the quantum of cleanup, and make sure we get another interrupt ASAP if that cleanup operation returns non-zero,
 329 meaning there is more work to do.
 330 When that operation returns zero, everything has been processed, we can clear the "cleanup-in-progress" flag, and
 331 zeroize the bitmap of sw_if_index-es requested to be cleaned.
 332
 333 The interrupt node signals its wish to receive an interrupt ASAP by setting interrupt_is_needed
 334 flag within the per-worker structure. The main thread, while waiting for the
 335 cleanup operation to complete, checks if there is a request for interrupt,
 336 and if there is - it sends one.
 337
 338 This approach gives us a way to mass-clean the connections which is reusing the code of the regular idle
 339 connection cleanup.
 340
 341 One potential inefficiency is the bitmap values set by the session insertion
 342 in the data path - there is nothing to clear them.
 343
 344 So, if one rearranges the interface placement with the workers, then the cleanups will cause some unnecessary work.
 345 For now, we consider it an acceptable limitation. It can be resolved by having another per-worker bitmap, which, when set,
 346 would trigger the cleanup of the bits in the serviced_sw_if_index_bitmap).
 347
 348 === the end ===
 349