VNET (VPP Network Stack)
========================

The files associated with the VPP network stack layer are located in the
*./src/vnet* folder. The Network Stack Layer is basically an
instantiation of the code in the other layers. This layer has a vnet
library that provides vectorized layer-2 and 3 networking graph nodes, a
packet generator, and a packet tracer.

In terms of building a packet processing application, vnet provides a
platform-independent subgraph to which one connects a couple of
device-driver nodes.

Typical RX connections include "ethernet-input" \[full software
classification, feeds ipv4-input, ipv6-input, arp-input etc.\] and
"ipv4-input-no-checksum" \[if hardware can classify, perform ipv4 header
checksum\].

Effective graph dispatch function coding
----------------------------------------

Over the past 15 years, multiple coding styles have emerged: a
single/dual/quad loop coding model (with variations) and a
fully-pipelined coding model.

The single/dual/quad loop model variations conveniently solve problems
where the number of items to process is not known in advance: typical
hardware RX-ring processing. This coding style is also very effective
when a given node will not need to cover a complex set of dependent
reads.

Here is a quad/single loop which can leverage up-to-avx512 SIMD vector
units to convert buffer indices to buffer pointers:

```c
static uword
simulated_ethernet_interface_tx (vlib_main_t * vm,
                                 vlib_node_runtime_t *
                                 node, vlib_frame_t * frame)
{
  u32 n_left_from, *from;
  u32 thread_index = vm->thread_index;
  vnet_main_t *vnm = vnet_get_main ();
  vnet_interface_main_t *im = &vnm->interface_main;
  vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b;
  u16 nexts[VLIB_FRAME_SIZE], *next;

  n_left_from = frame->n_vectors;
  from = vlib_frame_vector_args (frame);

  /*
   * Convert up to VLIB_FRAME_SIZE indices in "from" to
   * buffer pointers in bufs[]
   */
  vlib_get_buffers (vm, from, bufs, n_left_from);
  b = bufs;
  next = nexts;

  /*
   * While we have at least 4 vector elements (pkts) to process..
   */
  while (n_left_from >= 4)
    {
      /* Prefetch next quad-loop iteration. */
      if (PREDICT_TRUE (n_left_from >= 8))
        {
          vlib_prefetch_buffer_header (b[4], STORE);
          vlib_prefetch_buffer_header (b[5], STORE);
          vlib_prefetch_buffer_header (b[6], STORE);
          vlib_prefetch_buffer_header (b[7], STORE);
        }

      /*
       * $$$ Process 4x packets right here...
       * set next[0..3] to send the packets where they need to go
       */
      do_something_to (b[0]);
      do_something_to (b[1]);
      do_something_to (b[2]);
      do_something_to (b[3]);

      /* Process the next 0..4 packets */
      b += 4;
      next += 4;
      n_left_from -= 4;
    }
  /*
   * Clean up 0...3 remaining packets at the end of the incoming frame
   */
  while (n_left_from > 0)
    {
      /*
       * $$$ Process one packet right here...
       * set next[0] to send the packet where it needs to go
       */
      do_something_to (b[0]);

      /* Process the next packet */
      b += 1;
      next += 1;
      n_left_from -= 1;
    }

  /*
   * Send the packets along their respective next-node graph arcs
   * Considerable locality of reference is expected, most if not all
   * packets in the inbound vector will traverse the same next-node
   * arc
   */
  vlib_buffer_enqueue_to_next (vm, node, from, nexts, frame->n_vectors);

  return frame->n_vectors;
}
```

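The control flow of the quad/single loop stands on its own without any
VPP types. Here is a minimal, self-contained sketch of the same
skeleton; `process_one` is a stand-in for real per-packet work, not a
VPP API:

```c
#include <stddef.h>

/* Hypothetical per-item work: here, just double the value. */
static int
process_one (int x)
{
  return 2 * x;
}

/* Quad/single loop over n items: handle 4 at a time for the bulk of
 * the vector, then clean up the 0..3 stragglers one at a time. */
static void
process_vector (int *v, size_t n)
{
  while (n >= 4)
    {
      v[0] = process_one (v[0]);
      v[1] = process_one (v[1]);
      v[2] = process_one (v[2]);
      v[3] = process_one (v[3]);
      v += 4;
      n -= 4;
    }
  while (n > 0)
    {
      v[0] = process_one (v[0]);
      v += 1;
      n -= 1;
    }
}
```

The structure mirrors the dispatch function above: the compiler can
unroll and vectorize the first loop, while the second loop handles any
remainder.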
Given a packet processing task to implement, it pays to scout around
looking for similar tasks, and think about using the same coding
pattern. It is not uncommon to recode a given graph node dispatch function
several times during performance optimization.

Creating Packets from Scratch
-----------------------------

At times, it's necessary to create packets from scratch and send
them. Tasks like sending keepalives or actively opening connections
come to mind. It's not difficult, but accurate buffer metadata setup is
required.

### Allocating Buffers

Use vlib_buffer_alloc, which allocates a set of buffer indices. For
low-performance applications, it's OK to allocate one buffer at a
time. Note that vlib_buffer_alloc(...) does NOT initialize buffer
metadata. See below.

In high-performance cases, allocate a vector of buffer indices,
and hand them out from the end of the vector; decrement _vec_len(..)
as buffer indices are allocated. See tcp_alloc_tx_buffers(...) and
tcp_get_free_buffer_index(...) for an example.

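The hand-out-from-the-end idea can be sketched without the VPP vec
library. The `buffer_cache_t` type and function below are illustrative
inventions, not VPP APIs; the `n_cached` field plays the role of
`_vec_len(..)`:

```c
#include <stddef.h>

/* Hypothetical cache of preallocated buffer indices (32-bit cookies). */
typedef struct
{
  unsigned int indices[256];
  size_t n_cached;              /* plays the role of _vec_len(..) */
} buffer_cache_t;

/* Hand out one index from the END of the vector; returns ~0 when the
 * cache is empty. tcp_get_free_buffer_index(...) follows the same
 * pattern, refilling the cache via vlib_buffer_alloc(...) when dry. */
static unsigned int
cache_get_index (buffer_cache_t * c)
{
  if (c->n_cached == 0)
    return ~0u;                 /* caller must refill the cache */
  c->n_cached -= 1;             /* decrement, as _vec_len(..) is decremented */
  return c->indices[c->n_cached];
}
```

Taking indices from the end keeps the operation O(1): no memmove of the
remaining elements is ever needed.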
### Buffer Initialization Example

The following example shows the **main points**, but is not to be
blindly cut-'n-pasted.

```c
u32 bi0;
vlib_buffer_t *b0;
ip4_header_t *ip;
udp_header_t *udp;
u8 *data_dst;
vlib_buffer_free_list_t *fl;

/* Allocate a buffer */
if (vlib_buffer_alloc (vm, &bi0, 1) != 1)
  return -1;

b0 = vlib_get_buffer (vm, bi0);

/* Initialize the buffer */
fl = vlib_buffer_get_free_list (vm, VLIB_BUFFER_DEFAULT_FREE_LIST_INDEX);
vlib_buffer_init_for_free_list (b0, fl);
VLIB_BUFFER_TRACE_TRAJECTORY_INIT (b0);

/* At this point b0->current_data = 0, b0->current_length = 0 */

/*
 * Copy data into the buffer. This example ASSUMES that data will fit
 * in a single buffer, and is e.g. an ip4 packet.
 */
if (have_packet_rewrite)
  {
    clib_memcpy (b0->data, data, vec_len (data));
    b0->current_length = vec_len (data);
  }
else
  {
    /* OR, build a udp-ip packet (for example) */
    ip = vlib_buffer_get_current (b0);
    udp = (udp_header_t *) (ip + 1);
    data_dst = (u8 *) (udp + 1);

    ip->ip_version_and_header_length = 0x45;
    ip->ttl = 254;
    ip->protocol = IP_PROTOCOL_UDP;
    ip->length = clib_host_to_net_u16 (sizeof (*ip) + sizeof (*udp) +
                                       vec_len (udp_data));
    ip->src_address.as_u32 = src_address->as_u32;
    ip->dst_address.as_u32 = dst_address->as_u32;
    udp->src_port = clib_host_to_net_u16 (src_port);
    udp->dst_port = clib_host_to_net_u16 (dst_port);
    udp->length = clib_host_to_net_u16 (vec_len (udp_data));
    clib_memcpy (data_dst, udp_data, vec_len (udp_data));

    if (compute_udp_checksum)
      {
        /* RFC 7011 section 10.3.2. */
        udp->checksum = ip4_tcp_udp_compute_checksum (vm, b0, ip);
        if (udp->checksum == 0)
          udp->checksum = 0xffff;
      }
    b0->current_length = sizeof (*ip) + sizeof (*udp) +
      vec_len (udp_data);
  }
b0->flags |= VLIB_BUFFER_TOTAL_LENGTH_VALID;

/* sw_if_index 0 is the "local" interface, which always exists */
vnet_buffer (b0)->sw_if_index[VLIB_RX] = 0;

/* Use the default FIB index for tx lookup. Set non-zero to use another fib */
vnet_buffer (b0)->sw_if_index[VLIB_TX] = 0;
```

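For context on the checksum substitution above, here is a
self-contained sketch of the classic 16-bit ones'-complement internet
checksum with the RFC 7011 zero-substitution applied. This is NOT
VPP's ip4_tcp_udp_compute_checksum(...), which also covers the ipv4
pseudo-header and walks buffer chains:

```c
#include <stdint.h>
#include <stddef.h>

/* 16-bit ones'-complement sum over len bytes, folded and inverted
 * (the classic internet checksum). */
static uint16_t
internet_checksum (const uint8_t * data, size_t len)
{
  uint32_t sum = 0;
  while (len > 1)
    {
      sum += (uint32_t) ((data[0] << 8) | data[1]);
      data += 2;
      len -= 2;
    }
  if (len)                      /* odd trailing byte, padded with zero */
    sum += (uint32_t) (data[0] << 8);
  while (sum >> 16)             /* fold carries back into 16 bits */
    sum = (sum & 0xffff) + (sum >> 16);
  return (uint16_t) ~sum;
}

/* A computed checksum of 0 must be transmitted as 0xffff, since 0
 * means "no checksum present" in UDP (RFC 7011 section 10.3.2). */
static uint16_t
udp_wire_checksum (const uint8_t * data, size_t len)
{
  uint16_t c = internet_checksum (data, len);
  return c == 0 ? 0xffff : c;
}
```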
If your use-case calls for large packet transmission, use
vlib_buffer_chain_append_data_with_alloc(...) to create the requisite
buffer chain.

### Enqueueing packets for lookup and transmission

The simplest way to send a set of packets is to use
vlib_get_frame_to_node(...) to allocate fresh frame(s) to
ip4_lookup_node or ip6_lookup_node, add the constructed buffer
indices, and dispatch the frame using vlib_put_frame_to_node(...).

```c
vlib_frame_t *f;
u32 *to_next;
int i;

f = vlib_get_frame_to_node (vm, ip4_lookup_node.index);
f->n_vectors = vec_len (buffer_indices_to_send);
to_next = vlib_frame_vector_args (f);

for (i = 0; i < vec_len (buffer_indices_to_send); i++)
  to_next[i] = buffer_indices_to_send[i];

vlib_put_frame_to_node (vm, ip4_lookup_node.index, f);
```

It is inefficient to allocate and schedule single-packet frames.
That's acceptable if you need to send one packet per second, but it
should **not** occur in a for-loop!

Packet tracer
-------------

Vlib includes a frame element \[packet\] trace facility, with a simple
debug CLI interface. The CLI is straightforward: "trace add
input-node-name count" to start capturing packet traces.

To trace 100 packets on a typical x86\_64 system running the dpdk
plugin: "trace add dpdk-input 100". When using the packet generator:
"trace add pg-input 100".

To display the packet trace: "show trace"

Each graph node has the opportunity to capture its own trace data. It is
almost always a good idea to do so. The trace capture APIs are simple.

The packet capture APIs snapshot binary data, to minimize processing at
capture time. Each participating graph node initialization provides a
vppinfra format-style user function to pretty-print data when required
by the VLIB "show trace" command.

Set the VLIB node registration ".format\_trace" member to the name of
the per-graph node format function.

Here's a simple example:

```c
u8 *
my_node_format_trace (u8 * s, va_list * args)
{
  vlib_main_t *vm = va_arg (*args, vlib_main_t *);
  vlib_node_t *node = va_arg (*args, vlib_node_t *);
  my_node_trace_t *t = va_arg (*args, my_node_trace_t *);

  s = format (s, "My trace data was: %d", t-><whatever>);

  return s;
}
```

The trace framework hands the per-node format function the data it
captured as the packet whizzed by. The format function pretty-prints the
data as desired.

Graph Dispatcher Pcap Tracing
-----------------------------

The vpp graph dispatcher knows how to capture vectors of packets in pcap
format as they're dispatched. The pcap captures are as follows:

```
VPP graph dispatch trace record description:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Major Version | Minor Version | NStrings      | ProtoHint     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Buffer index (big endian)                                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   + VPP graph node name ...     ...               | NULL octet    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Buffer Metadata ... ...                       | NULL octet    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Buffer Opaque ... ...                         | NULL octet    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Buffer Opaque 2 ... ...                       | NULL octet    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | VPP ASCII packet trace (if NStrings > 4)      | NULL octet    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Packet data (up to 16K)                                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
```

Graph dispatch records comprise a version stamp, an indication of how
many NULL-terminated strings will follow the record header and precede
packet data, and a protocol hint.

The buffer index is an opaque 32-bit cookie which allows consumers of
these data to easily filter/track single packets as they traverse the
forwarding graph.

Multiple records per packet are normal, and to be expected. Packets
will appear multiple times as they traverse the vpp forwarding
graph. In this way, vpp graph dispatch traces are significantly
different from regular network packet captures from an end-station.
This property complicates stateful packet analysis.

Restricting stateful analysis to records from a single vpp graph node
such as "ethernet-input" seems likely to improve the situation.

As of this writing: major version = 1, minor version = 0. NStrings
SHOULD be 4 or 5. Consumers SHOULD be wary of values less than 4 or
greater than 5. They MAY attempt to display the claimed number of
strings, or they MAY treat the condition as an error.

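A consumer might validate and parse the record header along these
lines. This is an illustrative sketch, not code from VPP or the
wireshark dissector; the struct and function names are invented:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct
{
  uint8_t major_version, minor_version, n_strings, proto_hint;
  uint32_t buffer_index;        /* big-endian on the wire */
  const char *strings[5];       /* node name, metadata, opaque, opaque2,
                                 * optional ASCII packet trace */
  const uint8_t *packet_data;   /* remainder of the record, up to 16K */
} dispatch_record_t;

/* Parse one dispatch trace record; returns 0 on success, -1 if the
 * version or NStrings falls outside what this sketch understands. */
static int
parse_dispatch_record (const uint8_t * p, size_t len, dispatch_record_t * r)
{
  size_t off = 8;
  if (len < 8 || p[0] != 1)     /* major version 1 only */
    return -1;
  r->major_version = p[0];
  r->minor_version = p[1];
  r->n_strings = p[2];
  r->proto_hint = p[3];
  if (r->n_strings < 4 || r->n_strings > 5)     /* SHOULD be 4 or 5 */
    return -1;
  r->buffer_index = ((uint32_t) p[4] << 24) | ((uint32_t) p[5] << 16)
    | ((uint32_t) p[6] << 8) | p[7];
  for (int i = 0; i < r->n_strings; i++)
    {
      /* each string is NULL-terminated within the record */
      const void *nul = memchr (p + off, 0, len - off);
      if (nul == NULL)
        return -1;
      r->strings[i] = (const char *) (p + off);
      off = (size_t) ((const uint8_t *) nul - p) + 1;
    }
  r->packet_data = p + off;
  return 0;
}
```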
Here is the current set of protocol hints:

```c
typedef enum
{
  VLIB_NODE_PROTO_HINT_NONE = 0,
  VLIB_NODE_PROTO_HINT_ETHERNET,
  VLIB_NODE_PROTO_HINT_IP4,
  VLIB_NODE_PROTO_HINT_IP6,
  VLIB_NODE_PROTO_HINT_TCP,
  VLIB_NODE_PROTO_HINT_UDP,
  VLIB_NODE_N_PROTO_HINTS,
} vlib_node_proto_hint_t;
```

Example: VLIB_NODE_PROTO_HINT_IP6 means that the first octet of packet
data SHOULD be 0x60, and should begin an ipv6 packet header.

Downstream consumers of these data SHOULD pay attention to the
protocol hint. They MUST tolerate inaccurate hints, which MAY occur
from time to time.

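The ip4/ip6 hints can be cheaply sanity-checked against the first
octet of packet data. The enum below mirrors the
vlib_node_proto_hint_t ordering with invented local names; the
function itself is an illustrative example, not part of any VPP API:

```c
#include <stdint.h>

/* Local names mirroring the vlib_node_proto_hint_t ordering. */
enum
{
  HINT_NONE = 0, HINT_ETHERNET, HINT_IP4, HINT_IP6, HINT_TCP, HINT_UDP
};

/* Return 1 if the first octet of packet data is consistent with the
 * hint, 0 if it contradicts it. Hints MAY be inaccurate, so a
 * consumer uses this to fall back to heuristics, not to reject the
 * record. Only ip4/ip6 are cheaply checkable from a single octet. */
static int
proto_hint_plausible (uint8_t hint, uint8_t first_octet)
{
  switch (hint)
    {
    case HINT_IP4:
      return (first_octet >> 4) == 4;   /* version nibble, e.g. 0x45 */
    case HINT_IP6:
      return (first_octet >> 4) == 6;   /* version nibble, e.g. 0x60 */
    default:
      return 1;                 /* nothing cheap to check */
    }
}
```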
### Dispatch Pcap Trace Debug CLI

To start a dispatch trace capture of up to 10,000 trace records:

```
pcap dispatch trace on max 10000 file dispatch.pcap
```

To start a dispatch trace which will also include standard vpp packet
tracing for packets which originate in dpdk-input:

```
pcap dispatch trace on max 10000 file dispatch.pcap buffer-trace dpdk-input 1000
```

To save the pcap trace, e.g. in /tmp/dispatch.pcap:

```
pcap dispatch trace off
```

### Wireshark dissection of dispatch pcap traces

It almost goes without saying that we built a companion wireshark
dissector to display these traces. As of this writing, we're in the
process of trying to upstream the wireshark dissector.

Until we manage to upstream the wireshark dissector, please see the
"How to build a vpp dispatch trace aware Wireshark" page for build
info, and/or take a look at .../extras/wireshark.

Here is a sample packet dissection, with some fields omitted for
clarity. The point is that the wireshark dissector accurately
displays **all** of the vpp buffer metadata, and the name of the graph
node that processed the packet.

```
Frame 1: 2216 bytes on wire (17728 bits), 2216 bytes captured (17728 bits)
    Encapsulation type: USER 13 (58)
    [Protocols in frame: vpp:vpp-metadata:vpp-opaque:vpp-opaque2:eth:ethertype:ip:tcp:data]
    BufferIndex: 0x00036663
    NodeName: ethernet-input
    Metadata: current_data: 0, current_length: 102
    Metadata: current_config_index: 0, flow_id: 0, next_buffer: 0
    Metadata: error: 0, n_add_refs: 0, buffer_pool_index: 0
    Metadata: trace_index: 0, recycle_count: 0, len_not_first_buf: 0
    Metadata: free_list_index: 0
    Opaque: raw: 00000007 ffffffff 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    Opaque: sw_if_index[VLIB_RX]: 7, sw_if_index[VLIB_TX]: -1
    Opaque: L2 offset 0, L3 offset 0, L4 offset 0, feature arc index 0
    Opaque: ip.adj_index[VLIB_RX]: 0, ip.adj_index[VLIB_TX]: 0
    Opaque: ip.flow_hash: 0x0, ip.save_protocol: 0x0, ip.fib_index: 0
    Opaque: ip.save_rewrite_length: 0, ip.rpf_id: 0
    Opaque: ip.icmp.type: 0 ip.icmp.code: 0, ip.icmp.data: 0x0
    Opaque: ip.reass.next_index: 0, ip.reass.estimated_mtu: 0
    Opaque: ip.reass.fragment_first: 0 ip.reass.fragment_last: 0
    Opaque: ip.reass.range_first: 0 ip.reass.range_last: 0
    Opaque: ip.reass.next_range_bi: 0x0, ip.reass.ip6_frag_hdr_offset: 0
    Opaque: mpls.ttl: 0, mpls.exp: 0, mpls.first: 0, mpls.save_rewrite_length: 0, mpls.bier.n_bytes: 0
    Opaque: l2.feature_bitmap: 00000000, l2.bd_index: 0, l2.l2_len: 0, l2.shg: 0, l2.l2fib_sn: 0, l2.bd_age: 0
    Opaque: l2.feature_bitmap_input: none configured, L2.feature_bitmap_output: none configured
    Opaque: l2t.next_index: 0, l2t.session_index: 0
    Opaque: l2_classify.table_index: 0, l2_classify.opaque_index: 0, l2_classify.hash: 0x0
    Opaque: policer.index: 0
    Opaque: ipsec.flags: 0x0, ipsec.sad_index: 0
    Opaque: map_t.v6.saddr: 0x0, map_t.v6.daddr: 0x0, map_t.v6.frag_offset: 0, map_t.v6.l4_offset: 0
    Opaque: map_t.v6.l4_protocol: 0, map_t.checksum_offset: 0, map_t.mtu: 0
    Opaque: ip_frag.mtu: 0, ip_frag.next_index: 0, ip_frag.flags: 0x0
    Opaque: cop.current_config_index: 0
    Opaque: lisp.overlay_afi: 0
    Opaque: tcp.connection_index: 0, tcp.seq_number: 0, tcp.seq_end: 0, tcp.ack_number: 0, tcp.hdr_offset: 0, tcp.data_offset: 0
    Opaque: tcp.data_len: 0, tcp.flags: 0x0
    Opaque: sctp.connection_index: 0, sctp.sid: 0, sctp.ssn: 0, sctp.tsn: 0, sctp.hdr_offset: 0
    Opaque: sctp.data_offset: 0, sctp.data_len: 0, sctp.subconn_idx: 0, sctp.flags: 0x0
    Opaque: snat.flags: 0x0
    Opaque2: raw: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    Opaque2: qos.bits: 0, qos.source: 0
    Opaque2: loop_counter: 0
    Opaque2: gbp.flags: 0, gbp.src_epg: 0
    Opaque2: pg_replay_timestamp: 0
Ethernet II, Src: 06:d6:01:41:3b:92 (06:d6:01:41:3b:92), Dst: IntelCor_3d:f6
Transmission Control Protocol, Src Port: 22432, Dst Port: 54084, Seq: 1, Ack: 1, Len: 36
    Destination Port: 54084
    TCP payload (36 bytes)
0000  cf aa 8b f5 53 14 d4 c7 29 75 3e 56 63 93 9d 11   ....S...)u>Vc...
0010  e5 f2 92 27 86 56 4c 21 ce c5 23 46 d7 eb ec 0d   ...'.VL!..#F....
0020  a8 98 36 5a                                       ..6Z
    Data: cfaa8bf55314d4c729753e5663939d11e5f2922786564c21…
```

It's a matter of a couple of mouse-clicks in Wireshark to filter the
trace to a specific buffer index. With that kind of filtering, one can
watch a packet walk through the forwarding graph, noting any/all
metadata changes, header checksum changes, and so forth.

This should be of significant value when developing new vpp graph
nodes. If new code mispositions b->current_data, it will be completely
obvious from looking at the dispatch trace in wireshark.