..  BSD LICENSE
    Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.
    * Neither the name of Intel Corporation nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Writing Efficient Code
======================

This chapter provides some tips for developing efficient code using the DPDK.
For additional and more general information,
please refer to the *Intel® 64 and IA-32 Architectures Optimization Reference Manual*,
which is a valuable reference for writing efficient code.

Memory
------

This section describes some key memory considerations when developing applications in the DPDK environment.

Memory Copy: Do not Use libc in the Data Plane
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Many libc functions are available in the DPDK, via the Linux* application environment.
This can ease the porting of applications and the development of the configuration plane.
However, many of these functions are not designed for performance.
Functions such as memcpy() or strcpy() should not be used in the data plane.
To copy small structures, prefer a simpler technique that the compiler can optimize.
Refer to the *VTune™ Performance Analyzer Essentials* publication from Intel Press for recommendations.

For specific functions that are called often,
it is also a good idea to provide a custom optimized function, which should be declared as static inline.
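
As a minimal sketch of the two points above, the following copies a small structure by plain assignment inside a static inline helper.
The structure, its fields and the function name are hypothetical and are not part of the DPDK API;
whether the compiler actually emits register moves depends on the compiler and its options.

```c
#include <stdint.h>

/* Hypothetical small structure; the name and fields are illustrative only. */
struct flow_key {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
};

/*
 * Copy by plain structure assignment. The size is known at compile time,
 * so the compiler is free to emit a few moves instead of a library call.
 */
static inline void
flow_key_copy(struct flow_key *dst, const struct flow_key *src)
{
    *dst = *src;
}
```

A caller would simply write flow_key_copy(&dst, &src) wherever a small key has to be duplicated in the data path.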

The DPDK API provides an optimized rte_memcpy() function.

Memory Allocation
~~~~~~~~~~~~~~~~~

Other libc functions, such as malloc(), provide a flexible way to allocate and free memory.
In some cases, using dynamic allocation is necessary,
but using malloc-like functions in the data plane is not advised because
managing a fragmented heap can be costly and the allocator may not be optimized for parallel allocation.

If dynamic allocation is really needed in the data plane, it is better to use a memory pool of fixed-size objects.
This API is provided by librte_mempool.
This data structure provides several services that increase performance, such as memory alignment of objects,
lockless access to objects, NUMA awareness, bulk get/put and per-lcore cache.
The rte_malloc() function uses a similar concept to mempools.
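
To illustrate why a pool of fixed-size objects is cheap, the following is a simplified, single-threaded sketch in plain C.
It is not the librte_mempool implementation and omits its lockless access, NUMA awareness and per-lcore cache;
all names and sizes are assumptions made for the example.

```c
#include <stdalign.h>
#include <stddef.h>

#define POOL_OBJS 4      /* illustrative pool size */
#define OBJ_SIZE  64     /* illustrative fixed object size in bytes */

/* Hypothetical single-threaded pool; not the librte_mempool layout. */
struct fixed_pool {
    alignas(64) unsigned char storage[POOL_OBJS][OBJ_SIZE];
    void *free_list[POOL_OBJS];  /* stack of free objects */
    unsigned free_count;
};

/* Put every object on the free stack. */
static void
fixed_pool_init(struct fixed_pool *p)
{
    for (unsigned i = 0; i < POOL_OBJS; i++)
        p->free_list[i] = p->storage[i];
    p->free_count = POOL_OBJS;
}

/* Constant-time get: pop from the free stack, NULL when empty. */
static inline void *
fixed_pool_get(struct fixed_pool *p)
{
    return p->free_count ? p->free_list[--p->free_count] : NULL;
}

/* Constant-time put: push back; the caller must only return pool objects. */
static inline void
fixed_pool_put(struct fixed_pool *p, void *obj)
{
    p->free_list[p->free_count++] = obj;
}
```

Because every object has the same size, get and put are a single array access each and the pool cannot fragment.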
  72
  73 Concurrent Access to the Same Memory Area
  74 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Read-Write (RW) access operations by several lcores to the same memory area can generate a lot of data cache misses,
which are very costly.
It is often possible to use per-lcore variables instead, for example, in the case of statistics.
There are at least two solutions for this:

*   Use RTE_PER_LCORE variables. Note that in this case, data on lcore X is not available to lcore Y.

*   Use a table of structures (one per lcore). In this case, each structure must be cache-aligned.

Read-mostly variables can be shared among lcores without performance losses if there are no RW variables in the same cache line.
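
The second solution, a table of cache-aligned structures, might be sketched as follows in standard C11.
The 64-byte cache line size, the MAX_LCORES value and all identifiers are assumptions made for the example;
DPDK code would normally use its own alignment macro and lcore limit instead.

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache line size */
#define MAX_LCORES 8    /* illustrative; not a DPDK constant */

/* One slot per lcore; the alignment pads each slot to a full cache line. */
struct lcore_stats {
    alignas(CACHE_LINE) uint64_t rx_packets;
    uint64_t tx_packets;
};

static struct lcore_stats stats[MAX_LCORES];

/* Called only by the owning lcore, so no atomic operation is needed. */
static inline void
stats_count_rx(unsigned lcore_id, uint64_t n)
{
    stats[lcore_id].rx_packets += n;
}

/* A reader (for example, a control thread) sums the slots. */
static inline uint64_t
stats_total_rx(void)
{
    uint64_t total = 0;

    for (unsigned i = 0; i < MAX_LCORES; i++)
        total += stats[i].rx_packets;
    return total;
}
```

Since no two lcores write to the same cache line, the counters do not bounce between caches.
The sketch assumes the reader can tolerate totals that are slightly out of date.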

NUMA
~~~~

On a NUMA system, it is preferable to access local memory since remote memory access is slower.
In the DPDK, the memzone, ring, rte_malloc and mempool APIs provide a way to allocate memory or create objects on a specific socket.

Sometimes, it can be a good idea to duplicate data to optimize speed.
For read-mostly variables that are often accessed,
it should not be a problem to keep them on one socket only, since the data will be present in cache.

Distribution Across Memory Channels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Modern memory controllers have several memory channels that can load or store data in parallel.
Depending on the memory controller and its configuration,
the number of channels and the way the memory is distributed across the channels vary.
Each channel has a bandwidth limit,
meaning that if all memory access operations are done on the first channel only, there is a potential bottleneck.

By default, the :ref:`Mempool Library <Mempool_Library>` spreads the addresses of objects among memory channels.

Communication Between lcores
----------------------------

To provide message-based communication between lcores,
it is advised to use the DPDK ring API, which provides a lockless ring implementation.

The ring supports bulk and burst access,
meaning that it is possible to read several elements from the ring with only one costly atomic operation
(see :doc:`ring_lib`).
Performance is greatly improved when using bulk access operations.

The code that dequeues messages may look similar to the following:

.. code-block:: c

    #define MAX_BULK 32

    void *obj_table[MAX_BULK];
    unsigned count;

    while (1) {
        /* Process as many elements as can be dequeued. */
        count = rte_ring_dequeue_burst(ring, obj_table, MAX_BULK);
        if (unlikely(count == 0))
            continue;

        my_process_bulk(obj_table, count);
    }
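
To show why burst access is cheaper, the following is a simplified single-producer/single-consumer ring in C11.
It is only a sketch of the idea and not the rte_ring implementation, which also supports multiple producers and consumers;
all names are hypothetical. The point to note is that the consumer publishes its index once per burst, not once per element.

```c
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 8u                 /* must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

/* Hypothetical single-producer/single-consumer ring, not rte_ring. */
struct spsc_ring {
    _Atomic unsigned head;           /* written by the producer */
    _Atomic unsigned tail;           /* written by the consumer */
    void *slots[RING_SIZE];
};

/* Enqueue one object; returns 0 on success, -1 when the ring is full. */
static inline int
spsc_enqueue(struct spsc_ring *r, void *obj)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return -1;
    r->slots[head & RING_MASK] = obj;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}

/* Dequeue up to max objects, publishing the new tail only once. */
static inline unsigned
spsc_dequeue_burst(struct spsc_ring *r, void **objs, unsigned max)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
    unsigned n = head - tail;

    if (n > max)
        n = max;
    for (unsigned i = 0; i < n; i++)
        objs[i] = r->slots[(tail + i) & RING_MASK];
    atomic_store_explicit(&r->tail, tail + n, memory_order_release);
    return n;
}
```

Dequeuing 32 elements one at a time would publish the tail index 32 times; the burst version publishes it once.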

PMD Driver
----------

The DPDK Poll Mode Driver (PMD) is also able to work in bulk/burst mode,
which allows some of the code in the send or receive function to be factored out and run once per call rather than once per packet.

Avoid partial writes.
When PCI devices write to system memory through DMA,
it costs less if the write operation is on a full cache line as opposed to part of it.
In the PMD code, actions have been taken to avoid partial writes as much as possible.

Lower Packet Latency
~~~~~~~~~~~~~~~~~~~~

Traditionally, there is a trade-off between throughput and latency.
An application can be tuned to achieve a high throughput,
but the end-to-end latency of an average packet will typically increase as a result.
Similarly, the application can be tuned to have, on average,
a low end-to-end latency, at the cost of lower throughput.

In order to achieve higher throughput,
the DPDK attempts to amortize the cost of processing each packet individually by processing packets in bursts.

Using the testpmd application as an example,
the burst size can be set on the command line to a value of 16 (also the default value).
This allows the application to request 16 packets at a time from the PMD.
The testpmd application then immediately attempts to transmit all the packets that were received,
in this case, all 16 packets.

The packets are not transmitted until the tail pointer is updated on the corresponding TX queue of the network port.
This behavior is desirable when tuning for high throughput because
the cost of tail pointer updates to both the RX and TX queues can be spread across 16 packets,
effectively hiding the relatively slow MMIO cost of writing to the PCIe* device.
However, this is not very desirable when tuning for low latency because
the first packet that was received must also wait for another 15 packets to be received.
It cannot be transmitted until the other 15 packets have also been processed because
the NIC will not know to transmit the packets until the TX tail pointer has been updated,
which is not done until all 16 packets have been processed for transmission.

To consistently achieve low latency, even under heavy system load,
the application developer should avoid processing packets in bursts.
The testpmd application can be configured from the command line to use a burst value of 1.
This will allow a single packet to be processed at a time, providing lower latency,
but at the cost of lower throughput.
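
The trade-off can be put into numbers with a small toy model.
It assumes exactly one tail pointer write per burst and full bursts, as in the testpmd description above;
the function names are hypothetical and the model ignores every other cost.

```c
/*
 * Number of TX tail pointer (MMIO) writes needed to send n_packets
 * when one write is issued per burst; burst must be non-zero.
 */
static inline unsigned
tail_updates(unsigned n_packets, unsigned burst)
{
    return (n_packets + burst - 1) / burst;
}

/*
 * Number of other packets that the first packet of a burst waits for
 * before the tail pointer write that releases the whole burst.
 */
static inline unsigned
max_packets_waited(unsigned burst)
{
    return burst - 1;
}
```

For 1024 packets, a burst size of 16 needs 64 tail pointer writes but makes the first packet of each burst wait for 15 others,
while a burst size of 1 needs 1024 writes and makes no packet wait.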

Locks and Atomic Operations
---------------------------

Atomic operations imply a lock prefix before the instruction,
causing the processor's LOCK# signal to be asserted during execution of the following instruction.
This has a big impact on performance in a multicore environment.

Performance can be improved by avoiding lock mechanisms in the data plane.
They can often be replaced by other solutions such as per-lcore variables.
Also, some locking techniques are more efficient than others.
For instance, the Read-Copy-Update (RCU) algorithm can frequently replace simple rwlocks.
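
When a shared counter cannot be avoided entirely, one way to reduce the number of atomic operations is to accumulate locally and publish once.
The sketch below uses C11 atomics, not the DPDK atomic API, and all names are hypothetical.
On x86, an atomic read-modify-write such as atomic_fetch_add typically compiles to a lock-prefixed instruction.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t shared_total;   /* hypothetical shared counter */

/* One atomic read-modify-write per event: contended on every iteration. */
static inline void
count_each(unsigned n_events)
{
    for (unsigned i = 0; i < n_events; i++)
        atomic_fetch_add_explicit(&shared_total, 1, memory_order_relaxed);
}

/* Accumulate locally, then publish with a single atomic operation. */
static inline void
count_batched(unsigned n_events)
{
    uint64_t local = 0;

    for (unsigned i = 0; i < n_events; i++)
        local++;            /* stands in for per-packet work */
    atomic_fetch_add_explicit(&shared_total, local, memory_order_relaxed);
}
```

Both functions leave the counter with the same value, but the second one touches the shared cache line only once per call.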

Coding Considerations
---------------------

Inline Functions
~~~~~~~~~~~~~~~~

Small functions can be declared as static inline in the header file.
This avoids the cost of a call instruction (and the associated context saving).
However, this technique is not always efficient; it depends on many factors including the compiler.
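
A header following this advice might look like the sketch below.
The header name and the helper are hypothetical; the compiler remains free to ignore the inline hint.

```c
/* my_util.h (hypothetical header name) */
#ifndef MY_UTIL_H
#define MY_UTIL_H

#include <stdint.h>

/* Small enough that a call could cost more than the body itself. */
static inline int
is_power_of_two(uint32_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

#endif /* MY_UTIL_H */
```

Each source file that includes the header gets its own copy of the function, which the compiler can expand at the call site.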

Branch Prediction
~~~~~~~~~~~~~~~~~

The likely() and unlikely() helper macros, built on Intel® C/C++ Compiler (icc) and gcc built-ins,
allow the developer to indicate if a code branch is likely to be taken or not.
For instance:

.. code-block:: c

    if (likely(x > 1))
        do_stuff();
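
Such macros are commonly defined on top of the gcc-compatible __builtin_expect() built-in.
The sketch below uses the illustrative names my_likely() and my_unlikely() so that it does not clash with the DPDK definitions,
and it assumes a compiler that provides __builtin_expect().

```c
/* Illustrative names chosen so they do not clash with the DPDK macros. */
#define my_likely(x)   __builtin_expect(!!(x), 1)
#define my_unlikely(x) __builtin_expect(!!(x), 0)

/* Returns the number of non-negative entries; negative ones are rare. */
static inline unsigned
count_valid(const int *vals, unsigned n)
{
    unsigned ok = 0;

    for (unsigned i = 0; i < n; i++) {
        if (my_unlikely(vals[i] < 0))
            continue;       /* cold path */
        ok++;
    }
    return ok;
}
```

The hint changes only how the compiler lays out the code; the result of the condition is unaffected.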

Setting the Target CPU Type
---------------------------

The DPDK supports CPU microarchitecture-specific optimizations by means of the CONFIG_RTE_MACHINE option
in the DPDK configuration file.
The degree of optimization depends on the compiler's ability to optimize for a specific microarchitecture;
therefore, it is preferable to use the latest compiler versions whenever possible.

If the compiler version does not support the specific feature set (for example, the Intel® AVX instruction set),
the build process gracefully degrades to the latest feature set that the compiler does support.

Since the build and runtime targets may not be the same,
the resulting binary also contains a platform check that runs before the
main() function and checks if the current machine is suitable for running the binary.

Along with compiler optimizations,
a set of preprocessor defines is automatically added to the build process (regardless of the compiler version).
These defines correspond to the instruction sets that the target CPU should be able to support.
For example, a binary compiled for any SSE4.2-capable processor will have RTE_MACHINE_CPUFLAG_SSE4_2 defined,
thus enabling compile-time code path selection for different platforms.
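
Compile-time code path selection with such a define could look like the following sketch,
which computes a CRC-32C either with the SSE4.2 CRC32 instruction or with a portable bitwise loop.
The function name is hypothetical and this is not the DPDK hash implementation;
only the RTE_MACHINE_CPUFLAG_SSE4_2 define is taken from the text above.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef RTE_MACHINE_CPUFLAG_SSE4_2
#include <nmmintrin.h>   /* SSE4.2 intrinsics */
#endif

/* Update a CRC-32C value over len bytes (hypothetical helper). */
static inline uint32_t
crc32c_update(uint32_t crc, const uint8_t *buf, size_t len)
{
#ifdef RTE_MACHINE_CPUFLAG_SSE4_2
    /* Hardware path: one CRC32 instruction per byte. */
    for (size_t i = 0; i < len; i++)
        crc = _mm_crc32_u8(crc, buf[i]);
#else
    /* Portable fallback: bitwise, reflected polynomial 0x82F63B78. */
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
#endif
    return crc;
}
```

Both paths are intended to return the same value, so callers do not need to know which one was compiled in.
The usual convention is to start from 0xFFFFFFFF and invert the final result.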