src/plugins/acl/acl-plugin.md

   1 Multicore support for ACL plugin
   2 ================================
   3
   4 This captures some considerations and design decisions that I have made,
   5 both for my own memory later on ("what the hell was I thinking?!?"),
   6 and for anyone interested to criticize/improve/hack on this code.
   7
   8 One of the factors taken into account while making these decisions,
   9 was the relative emphasis on the multi-thread vs. single-thread
  10 use cases: the latter is the vastly more prevalent. But,
  11 one can not optimize the single-thread performance without
  12 having a functioning code for multi-thread.
  13
  14 stateless ACLs
  15 ==============
  16
  17 The stateless trivially parallelizes, and the only potential for the
  18 race between the different threads is during the reconfiguration,
  19 at the time of replacing the old ACL being checked, with
  20 the new ACL.
  21
  22 In case an acl_add_replace is being used to replace the rules
  23 within the existing entry, a reallocation of am->acls[X].rules
  24 vector will happen and potentially a change in count.
  25
  26 acl_match_5tuple() has the following code:
  27
  28   a = am->acls + acl_index;
  29   for (i = 0; i < a->count; i++)
  30     {
  31       r = a->rules + i;
  32      . . .
  33
  34 Ideally we should be immune from a->rules changing,
  35 but the problem arises if the count changes in flight,
  36 and the new ruleset is smaller - then we will attempt
  37 to "match" against the free memory.
  38
  39 This can(?) be solved by replacing the for() with while(),
  40 so the comparison happens at each iteration.
  41
  42 full_acl_match_5tuple(), which iterates over the list
  43 of ACLs, is a bit less immune, since it takes the pointer
  44 to the vector to iterate and keeps a local copy of
  45 that pointer.
  46
  47 This race can be solved by checking the
  48 current pointer to the vector with the source pointer,
  49 and seeing if there is an (unlikely) change, and if
  50 there is, return the "deny" action, or, better,
  51 restart the check.
  52
  53 Since the check reloads the ACL list on a per-packet basis,
  54 there is only a window of opportunity of one packet to
  55 "match" packet against an incorrect rule set.
  56 The workers also do not change anything, only read.
  57 Therefore, it looks like building special structures
  58 to ensure that it does not happen at all might be not
  59 worth it.
  60
  61 At least not until we have a unit-test able to
  62 reliably catch this condition and test that
  63 the measures applied are effective. Adding the code
  64 which is not possible to exercise is worse than
  65 not adding any code at all.
  66
  67 So, I opt for "do-nothing" here for the moment.
  68
  69 reflexive ACLs: single-thread
  70 =============================
  71
  72 Before we talk multi-thread, is worth revisiting the
  73 design of the reflexive ACLs in the plugin, and
  74 the history of their evolution.
  75
  76 The very first version of the ACL plugin, shipped in
  77 1701, mostly did the job using the existing components
  78 and gluing them together. Because it needed to work
  79 in bridged forwarding path only, using L2 classifier
  80 as an insertion point appeared natural, also L2 classifier,
  81 being a table with sessions, seemed like a good place
  82 to hold the sessions.
  83
  84 So, the original design had two conceptual nodes:
  85 one, pointed by the next_miss from the L2 classifier table,
  86 was checking the actual ACL, and inserting session into
  87 the L2 classifier table, and the other one, pointed
  88 to by the next_match within the specific session rule,
  89 was checking the existing session. The timing out
  90 of the existing connections was done in the datapath,
  91 by periodically calling the aging function.
  92
  93 This decision to use the existing components,
  94 with its attrativeness, did bring a few limitations as well:
  95
  96 * L2 classifier is a simple mask-and-value match, with
  97 a fixed mask across the table. So, sanely supporting IPv6
  98 packets with extension headers in that framework was impossible.
  99
 100 * There is no way to get a backpressure from L2 classifier
 101 depending on memory usage. When it runs out of memory,
 102 it simply crashes the box. When it runs out of memory ?
 103 We don't really know. Depends on how it allocates it.
 104
 105 * Since we need to match the *reflected* traffic,
 106 we had to create *two* full session entries
 107 in two different directions, which is quite wasteful memory-wise.
 108
 109 * (showstopper): the L2 classifier runs only in
 110 the bridged data path, so supporting routed data path
 111 would require creating something else entirely different,
 112 which would mean much more headaches support-wise going forward.
 113
 114 Because of that, I have moved to a different model of
 115 creating a session-5-tuple from the packet data - once,
 116 and then doing all the matching just on that 5-tuple.
 117
 118 This has allowed to add support for skipping IPv6 extension headers.
 119
 120 Also, this new version started to store the sessions in a dedicated
 121 bihash-per-interface, with the session key data being
 122 aligned for the ingress packets, and being mirrored for the
 123 egress packets. This allows of significant savings in memory,
 124 because now we need to keep only one copy of the session table per
 125 interface instead of two, and also to only have ONE node for all the lookups,
 126 (L2/L3 path, in/out, IPv4/IPv6) - significantly reducing the code complexity.
 127
 128 Unfortunately, bihash still has the "lack of backpressure" problem,
 129 in a sense that if you try to insert too many entries and run out
 130 of memory in the heap you supplied, you get a crash.
 131
 132 To somewhat workaround against that, there is a "maximum tested number of sessions"
 133 value, which tracks the currently inserted sessions in the bihash,
 134 and if this number is being approached, a more aggressive cleanup
 135 can happen. If this number is reached, two behaviors are possible:
 136
 137 * attempt to do the stateless ACL matching and permit the packet
 138   if it succeeds
 139
 140 * deny the packet
 141
 142 Currently I have opted for a second one, since it allows for
 143 a better defined behavior, and if you have to permit
 144 the traffic in both directions, why using stateful anyway ?
 145
 146 In order to be able to do the cleanup, we need to discriminate between
 147 the session types, with each session type having its own idle timeout.
 148 In order to do that, we keep three lists, defined in enum acl_timeout_e:
 149 ACL_TIMEOUT_UDP_IDLE, ACL_TIMEOUT_TCP_IDLE, ACL_TIMEOUT_TCP_TRANSIENT.
 150
 151 The first one is hopefully obvious - it is just all UDP connections.
 152 They have an idle timeout of 600 seconds.
 153
 154 The second and third is a bit more subtle. TCP is a complicated protocol,
 155 and we need to tread the fine line between doing too little and doing
 156 too much, and triggering the potential compatibility issues because of
 157 being a "middlebox".
 158
 159 I decided to split the TCP connections into two classes:
 160 established, and everything else. "Established", means we have seen
 161 the SYN and ACK from both sides (with PUSH obviously masked out).
 162 This is the "active" state of any TCP connection and we would like
 163 to ensure we do not screw it up. So, the connections in this state
 164 have the default idle timer of 24 hours.
 165
 166 All the rest of the connections have the idle timeout of 2 minutes,
 167 (inspired by an old value of MSL) and based on the observation
 168 that the states this class represent are usually very short lived.
 169
 170 Once we have these three baskets of connections, it is trivial to
 171 imagine a simple cleanup mechanism to deal with this: take a
 172 TCP transient connection that has been hanging around.
 173
 174 It is debatable whether we want to do discrimination between the
 175 different TCP transient connections. Assuming we do FIFO (and
 176 the lists allow us to do just that), it means a given connection
 177 on the head of the list has been hanging around for longest.
 178 Thus, if we are short on resources, we might just go ahead and
 179 reuse it within the datapath.
 180
 181 This is where we are slowly approaching the question
 182 "Why in the world have not you used timer wheel or such ?"
 183
 184 The answer is simple: within the above constraints, it does
 185 not buy me much.
 186
 187 Also, timer wheel creates a leaky abstraction with a difficult
 188 to manage corner case. Which corner case ?
 189
 190 We have a set of objects (sessions) with an event that may
 191 or may not happen (idle timeout timer firing), and a
 192 necessity to reset the idle timeout when there is
 193 activity on the session.
 194
 195 In the worst case, where we had a 10000 of one-packet
 196 UDP sessions just created 10 minutes ago, we would need
 197 to deal with a spike of 10000 expired timers.
 198
 199 Of course, if we have the active traffic on all
 200 of these 10000 connections, then we will not have
 201 to deal with that ? Right, but we will still have to deal
 202 with canceling and requeueing the timers.
 203
 204 In the best possible case, requeueing a timer is
 205 going to be something along the lines of a linked-list
 206 removal and reinsertion.
 207
 208 However, keep in mind we already need to classify the
 209 connections for reuse, so therefore we already have
 210 the linked lists!
 211
 212 And if we just check these linked lists periodically in
 213 a FIFO fashion, we can get away with a very simple per-packet operation:
 214 writing back the timestamp of "now" into the connection structure.
 215
 216 Then rather than requeueing the list on a per-packet or per-frame
 217 basis, we can defer this action until the time this session
 218 appears on the head of the FIFO list, and the cleaning
 219 routine makes the decision about whether to discard
 220 the session (because the interval since last activity is bigger
 221 than the idle timeout), or to requeue the session back to
 222 the end of the list (because the last activity was less
 223 than idle timeout ago).
 224
 225 So, rather than using the timers, we can simply reuse our classification
 226 FIFOs, with the following heuristic: do not look at the session that was
 227 enqueued at time X until X+session_timeout. If we enqueue the sessions
 228 in the order of their initial activity, then we can simply use enqueue
 229 timestamp of the head session as a decision criterion for when we need
 230 to get back at looking at it for the timeout purposes.
 231
 232 Since the number of FIFOs is small, we get a slightly worse check
 233 performance than with timers, but still O(1).
 234
 235 We seemingly do quite a few "useless" operations of requeueing the items
 236 back to the tail of the list - but, these are the operations we do not
 237 have to do in the active data path, so overall it is a win.
 238
 239 (Diversion: I believe this problem is congruent to poll vs. epoll or
 240 events vs. threads, some reading on this subject:
 241 http://web.archive.org/web/20120225022154/http://sheddingbikes.com/posts/1280829388.html)
 242
 243 We can also can run a TCP-like scheme for adaptively changing
 244 the wait period in the routine that deals with the connection timeouts:
 245 we can attempt to check the connections a couple of times per second
 246 (same as we would advance the timer wheel), and then if we have requeued
 247 close to a max-per-quantum number of connections, we can half the waiting
 248 interval, and if we did not requeue any, we can slowly increment the waiting
 249 interval - which at a steady state should stabilize similar to what the TCP rate
 250 does.
 251
 252 reflexive ACLs: multi-thread
 253 =============================
 254
 255 The single-threaded implementation in 1704 used a separate "cleaner" process
 256 to deal with the timing out of the connections.
 257 It is all good and great when you know that there is only a single core
 258 to run everything on, but the existence of the lists proves to be
 259 a massive difficulty when it comes to operating from multiple threads.
 260
 261 Initial study shows that with a few assumptions (e.g. that the cleaner running in main thread
 262 and the worker have a demarcation point in time where either one or the other one touches
 263 the session in the list) it might be possible to make it work, but the resulting
 264 trickiness of doing it neatly with all the corner cases is quite large.
 265
 266 So, for the multi-threaded scenario, we need to move the connection
 267 aging back to the same CPU as its creation.
 268
 269 Luckily we can do this with the help of the interrupts.
 270
 271 So, the design is as follows: the aging thread (acl_fa_session_cleaner_process)
 272 periodically fires the interrupts to the workers interrupt nodes (acl_fa_worker_session_cleaner_process_node.index),
 273 using vlib_node_set_interrupt_pending(), and
 274 the interrupt node acl_fa_worker_conn_cleaner_process() calls acl_fa_check_idle_sessions()
 275 which does the actual job of advancing the lists. And within the actual datapath the only thing we will be
 276 doing is putting the items onto FIFO, and updating the last active time on the existing connection.
 277
 278 The one "delicate" part is that the worker for one leg of the connection might be different from
 279 the worker of another leg of the connection - but, even if the "owner" tries to free the connection,
 280 nothing terrible can happen - worst case the element of the pool (which is nominally free for a short period)
 281 will get the timestamp updated - same thing about the TCP flags seen.
 282
 283 A slightly trickier issue arises when the packet initially seen by one worker (thus owned by that worker),
 284 and the return packet processed by another worker, and as a result changes the
 285 the class of the connection (e.g. becomes TCP_ESTABLISHED from TCP_TRANSIENT or vice versa).
 286 If the class changes from one with the shorter idle time to the one with the longer idle time,
 287 then unless we are in the starvation mode where the transient connections are recycled,
 288 we can simply do nothing and let the normal requeue mechanism kick in. If the class changes from the longer idle
 289 timer to the shorter idle timer, then we risk keeping the connection around for longer than needed, which
 290 will affect the resource usage.
 291
 292 One solution to that is to have NxN ring buffers (where N is the number of workers), such that the non-owner
 293 can signal to the owner the connection# that needs to be requeued out of order.
 294
 295 A simpler solution though, is to ensure that each FIFO's period is equal to that of a shortest timer.
 296 This way the resource starvation problem is taken care of, at an expense of some additional work.
 297
 298 This all looks sufficiently nice and simple until a skeleton falls out of the closet:
 299 sometimes we want to clean the connections en masse before they expire.
 300
 301 There few potential scenarios:
 302 1) removal of an ACL from the interface
 303 2) removal of an interface
 304 3) manual action of an operator (in the future).
 305
 306 In order to tackle this, we need to modify the logic which decides whether to requeue the
 307 connection on the end of the list, or to delete it due to idle timeout:
 308
 309 We define a point in time, and have each worker thread fast-forward through its FIFO,
 310 in the process looking for sessions that satisfy the criteria, and either keeping them or requeueing them.
 311
 312 To keep the ease of appearance to the outside world, we still process this as an event
 313 within the connection cleaner thread, but this event handler does as follows:
 314 1) it creates the bitmap of the sw_if_index values requested to be cleared
 315 2) for each worker, it waits to ensure there is no cleanup operation in progress (and if there is one,
 316 it waits), and then makes a copy of the bitmap, sets the per-worker flag of a cleanup operation, and sends an interrupt.
 317 3) wait until all cleanup operations have completed.
 318
 319 Within the worker interrupt node, we check if the "cleanup in progress" is set,
 320 and if it is, we check the "fast forward time" value. If unset, we initialize it to value now, and compare the
 321 requested bitmap of sw_if_index values (pending_clear_sw_if_index_bitmap) with the bitmap of sw_if_index that this worker deals with.
 322
 323 (we set the bit in the bitmap every time we enqueue the packet onto a FIFO - serviced_sw_if_index_bitmap in acl_fa_conn_list_add_session).
 324
 325 If the result of this AND operation is zero - then we can clear the flag of cleanup in progress and return.
 326 Else we kick off the quantum of cleanup, and make sure we get another interrupt ASAP if that cleanup operation returns non-zero,
 327 meaning there is more work to do.
 328 When that operation returns zero, everything has been processed, we can clear the "cleanup-in-progress" flag, and
 329 zeroize the bitmap of sw_if_index-es requested to be cleaned.
 330
 331 The interrupt node signals its wish to receive an interrupt ASAP by setting interrupt_is_needed
 332 flag within the per-worker structure. The main thread, while waiting for the
 333 cleanup operation to complete, checks if there is a request for interrupt,
 334 and if there is - it sends one.
 335
 336 This approach gives us a way to mass-clean the connections which is reusing the code of the regular idle
 337 connection cleanup.
 338
 339 One potential inefficiency is the bitmap values set by the session insertion
 340 in the data path - there is nothing to clear them.
 341
 342 So, if one rearranges the interface placement with the workers, then the cleanups will cause some unnecessary work.
 343 For now, we consider it an acceptable limitation. It can be resolved by having another per-worker bitmap, which, when set,
 344 would trigger the cleanup of the bits in the serviced_sw_if_index_bitmap).
 345
 346 === the end ===
 347