doc/guides/nics/mlx5.rst

   1 ..  SPDX-License-Identifier: BSD-3-Clause
   2     Copyright 2015 6WIND S.A.
   3     Copyright 2015 Mellanox Technologies, Ltd
   4
   5 MLX5 poll mode driver
   6 =====================
   7
   8 The MLX5 poll mode driver library (**librte_pmd_mlx5**) provides support
   9 for **Mellanox ConnectX-4**, **Mellanox ConnectX-4 Lx**, **Mellanox
  10 ConnectX-5** and **Mellanox Bluefield** families of 10/25/40/50/100 Gb/s
  11 adapters as well as their virtual functions (VF) in SR-IOV context.
  12
  13 Information and documentation about these adapters can be found on the
  14 `Mellanox website <http://www.mellanox.com>`__. Help is also provided by the
  15 `Mellanox community <http://community.mellanox.com/welcome>`__.
  16
  17 There is also a `section dedicated to this poll mode driver
  18 <http://www.mellanox.com/page/products_dyn?product_family=209&mtag=pmd_for_dpdk>`__.
  19
  20 .. note::
  21
  22    Due to external dependencies, this driver is disabled by default. It must
  23    be enabled manually by setting ``CONFIG_RTE_LIBRTE_MLX5_PMD=y`` and
  24    recompiling DPDK.
  25
  26 Implementation details
  27 ----------------------
  28
  29 Besides its dependency on libibverbs (that implies libmlx5 and associated
  30 kernel support), librte_pmd_mlx5 relies heavily on system calls for control
  31 operations such as querying/updating the MTU and flow control parameters.
  32
  33 For security reasons and robustness, this driver only deals with virtual
  34 memory addresses. The way resource allocations are handled by the kernel,
  35 combined with hardware specifications that allow it to handle virtual memory
  36 addresses directly, ensures that DPDK applications cannot access random
  37 physical memory (or memory that does not belong to the current process).
  38
  39 This capability allows the PMD to coexist with kernel network interfaces
  40 which remain functional, although they stop receiving unicast packets as
  41 long as they share the same MAC address.
  42 This means legacy Linux control tools (for example: ethtool, ifconfig and
  43 more) can operate on the same network interfaces that are owned by the DPDK
  44 application.
  45
  46 Enabling librte_pmd_mlx5 causes DPDK applications to be linked against
  47 libibverbs.
  48
  49 Features
  50 --------
  51
  52 - Multi arch support: x86_64, POWER8, ARMv8, i686.
  53 - Multiple TX and RX queues.
  54 - Support for scattered TX and RX frames.
  55 - IPv4, IPv6, TCPv4, TCPv6, UDPv4 and UDPv6 RSS on any number of queues.
  56 - Several RSS hash keys, one for each flow type.
  57 - Default RSS operation with no hash key specification.
  58 - Configurable RETA table.
  59 - Support for multiple MAC addresses.
  60 - VLAN filtering.
  61 - RX VLAN stripping.
  62 - TX VLAN insertion.
  63 - RX CRC stripping configuration.
  64 - Promiscuous mode.
  65 - Multicast promiscuous mode.
  66 - Hardware checksum offloads.
  67 - Flow director (RTE_FDIR_MODE_PERFECT, RTE_FDIR_MODE_PERFECT_MAC_VLAN and
  68   RTE_ETH_FDIR_REJECT).
  69 - Flow API.
  70 - Multiple process.
  71 - KVM and VMware ESX SR-IOV modes are supported.
  72 - RSS hash result is supported.
  73 - Hardware TSO for generic IP or UDP tunnel, including VXLAN and GRE.
  74 - Hardware checksum Tx offload for generic IP or UDP tunnel, including VXLAN and GRE.
  75 - RX interrupts.
  76 - Statistics query including Basic, Extended and per queue.
  77 - Rx HW timestamp.
  78 - Tunnel types: VXLAN, L3 VXLAN, VXLAN-GPE, GRE, MPLSoGRE, MPLSoUDP.
  79 - Tunnel HW offloads: packet type, inner/outer RSS, IP and UDP checksum verification.
  80
  81 Limitations
  82 -----------
  83
  84 - For secondary process:
  85
  86   - Forked secondary process not supported.
  87   - All mempools must be initialized before rte_eth_dev_start().
  88   - External memory unregistered in EAL memseg list cannot be used for DMA
  89     unless such memory has been registered by ``mlx5_mr_update_ext_mp()`` in
  90     primary process and remapped to the same virtual address in secondary
  91     process. If the external memory is registered by primary process but has
  92     different virtual address in secondary process, unexpected error may happen.
  93
  94 - A flow pattern without any specific VLAN will match VLAN packets as well:
  95
  96   When VLAN spec is not specified in the pattern, the matching rule will be created with VLAN as a wildcard.
  97   This means that the flow rule::
  98
  99         flow create 0 ingress pattern eth / vlan vid is 3 / ipv4 / end ...
 100
 101   Will only match VLAN packets with vid=3, and the flow rules::
 102
 103         flow create 0 ingress pattern eth / ipv4 / end ...
 104
 105   Or::
 106
 107         flow create 0 ingress pattern eth / vlan / ipv4 / end ...
 108
 109   Will match any ipv4 packet (VLAN included).
 110
 111 - A multi segment packet must have fewer than 6 segments when the Tx burst function
 112   is set to multi-packet send or Enhanced multi-packet send. Otherwise it must have
 113   fewer than 50 segments.
 114
 115 - Count action for RTE flow is **only supported in Mellanox OFED**.
 116
 117 - Flows with a VXLAN Network Identifier equal to 0 (or that ends up being
 118   equal to 0) are not supported.
 119
 120 - VXLAN TSO and checksum offloads are not supported on VM.
 121
 122 - L3 VXLAN and VXLAN-GPE tunnels cannot be supported together with MPLSoGRE and MPLSoUDP.
 123
 124 - VF: flow rules created on VF devices can only match traffic targeted at the
 125   configured MAC addresses (see ``rte_eth_dev_mac_addr_add()``).
 126
 127 .. note::
 128
 129    MAC addresses not already present in the bridge table of the associated
 130    kernel network device will be added and cleaned up by the PMD when closing
 131    the device. In case of ungraceful program termination, some entries may
 132    remain present and should be removed manually by other means.
 133
 134 - When Multi-Packet Rx queue is configured (``mprq_en``), an Rx packet can be
 135   externally attached to a user-provided mbuf, in which case EXT_ATTACHED_MBUF
 136   is set in ol_flags. As the mempool for the external buffer is managed by PMD, all the
 137   Rx mbufs must be freed before the device is closed. Otherwise, the mempool of
 138   the external buffers will be freed by PMD and the application which still
 139   holds the external buffers may be corrupted.
 140
 141 - If Multi-Packet Rx queue is configured (``mprq_en``) and Rx CQE compression is
 142   enabled (``rxq_cqe_comp_en``) at the same time, RSS hash result is not fully
 143   supported. Some Rx packets may not have PKT_RX_RSS_HASH.
 144
 145 - IPv6 Multicast messages are not supported on VM when promiscuous mode
 146   and allmulticast mode are both set to off.
 147   To receive IPv6 Multicast messages on VM, explicitly set the relevant
 148   MAC address using the ``rte_eth_dev_mac_addr_add()`` API.
 149
 150 - E-Switch VXLAN tunnel is not supported together with outer VLAN.
 151
 152 - E-Switch Flows with VNI pattern must include the VXLAN decapsulation action.
 153
 154 - E-Switch VXLAN decapsulation Flow:
 155
 156   - can be applied to PF port only.
 157   - must specify VF port action (packet redirection from PF to VF).
 158   - must specify tunnel outer UDP local (destination) port, wildcards not allowed.
 159   - must specify tunnel outer VNI, wildcards not allowed.
 160   - must specify tunnel outer local (destination) IPv4 or IPv6 address, wildcards not allowed.
 161   - optionally may specify tunnel outer remote (source) IPv4 or IPv6, wildcards or group IPs allowed.
 162   - optionally may specify tunnel inner source and destination MAC addresses.
 163
 164 - E-Switch VXLAN encapsulation Flow:
 165
 166   - can be applied to VF ports only.
 167   - must specify PF port action (packet redirection from VF to PF).
 168   - must specify the VXLAN item with tunnel outer parameters.
 169   - must specify the tunnel outer VNI in the VXLAN item.
 170   - must specify the tunnel outer remote (destination) UDP port in the VXLAN item.
 171   - must specify the tunnel outer local (source) IPv4 or IPv6 in the VXLAN item, this address will be locally (with scope link) assigned to the outer network interface, wildcards not allowed.
 172   - must specify the tunnel outer remote (destination) IPv4 or IPv6 in the VXLAN item, group IPs allowed.
 173   - must specify the tunnel outer destination MAC address in the VXLAN item, this address will be used to create neigh rule.
 174
 175 Statistics
 176 ----------
 177
 178 MLX5 supports various methods to report statistics:
 179
 180 Port statistics can be queried using ``rte_eth_stats_get()``. The port statistics are maintained in SW only and count the number of packets received or sent successfully by the PMD.
 181
 182 Extended statistics can be queried using ``rte_eth_xstats_get()``. The extended statistics expose a wider set of counters counted by the device. The extended port statistics count the number of packets received or sent successfully by the port. As Mellanox NICs use the :ref:`Bifurcated Linux Driver <linux_gsg_linux_drivers>`, those counters also count packets received or sent by the Linux kernel. The counters with the ``_phy`` suffix count the total events on the physical port and are therefore not valid for VF.
 183
 184 Finally, per-flow statistics can be queried using ``rte_flow_query`` when attaching a count action to a specific flow. The flow counter counts the number of packets received successfully by the port that match the specific flow.
 185
 186 Configuration
 187 -------------
 188
 189 Compilation options
 190 ~~~~~~~~~~~~~~~~~~~
 191
 192 These options can be modified in the ``.config`` file.
 193
 194 - ``CONFIG_RTE_LIBRTE_MLX5_PMD`` (default **n**)
 195
 196   Toggle compilation of librte_pmd_mlx5 itself.
 197
 198 - ``CONFIG_RTE_LIBRTE_MLX5_DLOPEN_DEPS`` (default **n**)
 199
 200   Build PMD with additional code to make it loadable without hard
 201   dependencies on **libibverbs** nor **libmlx5**, which may not be installed
 202   on the target system.
 203
 204   In this mode, their presence is still required for it to run properly,
 205   however their absence won't prevent a DPDK application from starting (with
 206   ``CONFIG_RTE_BUILD_SHARED_LIB`` disabled) and they won't show up as
 207   missing with ``ldd(1)``.
 208
 209   It works by moving these dependencies to a purpose-built rdma-core "glue"
 210   plug-in which must either be installed in a directory whose name is based
 211   on ``CONFIG_RTE_EAL_PMD_PATH`` suffixed with ``-glue`` if set, or in a
 212   standard location for the dynamic linker (e.g. ``/lib``) if left to the
 213   default empty string (``""``).
 214
 215   This option has no performance impact.
 216
 217 - ``CONFIG_RTE_LIBRTE_MLX5_DEBUG`` (default **n**)
 218
 219   Toggle debugging code and stricter compilation flags. Enabling this option
 220   adds additional run-time checks and debugging messages at the cost of
 221   lower performance.
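
     As a rough sketch of how one of these options might be toggled from a
     script, the helper below rewrites a single ``CONFIG_...`` line of a
     configuration file. The function name ``set_config_option`` and the
     ``build/.config`` path in the usage note are assumptions made for
     illustration and are not part of DPDK:

     .. code-block:: sh

        # Hypothetical helper: set OPTION to VALUE in a .config-style FILE.
        # Usage: set_config_option FILE OPTION VALUE
        set_config_option() {
            file=$1 opt=$2 val=$3
            tmp="${file}.tmp.$$"
            # Rewrite only the "OPTION=..." line; other lines pass through.
            sed "s/^${opt}=.*/${opt}=${val}/" "$file" > "$tmp" &&
                mv "$tmp" "$file"
        }

     For instance, ``set_config_option build/.config CONFIG_RTE_LIBRTE_MLX5_PMD y``
     would enable the PMD before recompiling, provided the option line is
     already present in that file.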
 222
 223 Environment variables
 224 ~~~~~~~~~~~~~~~~~~~~~
 225
 226 - ``MLX5_GLUE_PATH``
 227
 228   A list of directories in which to search for the rdma-core "glue" plug-in,
 229   separated by colons or semi-colons.
 230
 231   Only matters when compiled with ``CONFIG_RTE_LIBRTE_MLX5_DLOPEN_DEPS``
 232   enabled and most useful when ``CONFIG_RTE_EAL_PMD_PATH`` is also set,
 233   since ``LD_LIBRARY_PATH`` has no effect in this case.
 234
 235 - ``MLX5_SHUT_UP_BF``
 236
 237   Configures HW Tx doorbell register as IO-mapped.
 238
 239   By default, the HW Tx doorbell is configured as a write-combining register.
 240   The register would be flushed to HW usually when the write-combining buffer
 241   becomes full, but it depends on CPU design.
 242
 243   Except for vectorized Tx burst routines, a write memory barrier is enforced
 244   after updating the register so that the update can be immediately visible to
 245   HW.
 246
 247   When vectorized Tx burst is called, the barrier is set only if the burst size
 248   is not aligned to MLX5_VPMD_TX_MAX_BURST. However, setting this environmental
 249   variable will bring better latency even though the maximum throughput can
 250   slightly decline.
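
     The sketch below shows how these variables might be exported before
     launching a DPDK application, and how a ``MLX5_GLUE_PATH`` value breaks
     down into individual directories. The directory names and the
     ``list_glue_dirs`` helper are hypothetical, and the value ``1`` for
     ``MLX5_SHUT_UP_BF`` is an assumption about how the variable is enabled:

     .. code-block:: sh

        # Hypothetical directories holding the rdma-core glue plug-in.
        # Quoting matters: an unquoted semicolon would end the command.
        export MLX5_GLUE_PATH='/opt/dpdk/glue:/usr/local/lib/dpdk-glue'

        # Assumed value; trades some throughput for lower latency.
        export MLX5_SHUT_UP_BF=1

        # Print each directory of a glue path on its own line, splitting
        # on both colons and semicolons.
        list_glue_dirs() {
            printf '%s\n' "$1" | tr ':;' '\n\n'
        }

        list_glue_dirs "$MLX5_GLUE_PATH"

     Listing the directories this way can help when checking that each of
     them actually exists on the target system.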
 251
 252 Run-time configuration
 253 ~~~~~~~~~~~~~~~~~~~~~~
 254
 255 - librte_pmd_mlx5 brings kernel network interfaces up during initialization
 256   because it is affected by their state. Forcing them down prevents packet
 257   reception.
 258
 259 - **ethtool** operations on related kernel interfaces also affect the PMD.
 260
 261 - ``rxq_cqe_comp_en`` parameter [int]
 262
 263   A nonzero value enables the compression of CQE on RX side. This feature
 264   saves PCI bandwidth and improves performance. Enabled by default.
 265
 266   Supported on:
 267
 268   - x86_64 with ConnectX-4, ConnectX-4 LX, ConnectX-5 and Bluefield.
 269   - POWER8 and ARMv8 with ConnectX-4 LX, ConnectX-5 and Bluefield.
 270
 271 - ``rxq_cqe_pad_en`` parameter [int]
 272
 273   A nonzero value enables 128B padding of CQE on RX side. The size of CQE
 274   is aligned with the size of a cacheline of the core. If cacheline size is
 275   128B, the CQE size is configured to be 128B even though the device writes
 276   only 64B data on the cacheline. This is to avoid unnecessary cache
 277   invalidation by device's two consecutive writes on to one cacheline.
 278   However, on some architectures it is more beneficial to update the entire
 279   cacheline, padding the remaining 64B, rather than striding because
 280   read-modify-write could degrade performance considerably. On the other hand,
 281   writing extra data will consume more PCIe bandwidth and could also lower
 282   the maximum throughput. It is recommended to set this parameter
 283   empirically. Disabled by default.
 284
 285   Supported on:
 286
 287   - CPU having 128B cacheline with ConnectX-5 and Bluefield.
 288
 289 - ``rxq_pkt_pad_en`` parameter [int]
 290
 291   A nonzero value enables padding Rx packet to the size of cacheline on PCI
 292   transaction. This feature would waste PCI bandwidth but could improve
 293   performance by avoiding partial cacheline write which may cause costly
 294   read-modify-copy in memory transaction on some architectures. Disabled by
 295   default.
 296
 297   Supported on:
 298
 299   - x86_64 with ConnectX-4, ConnectX-4 LX, ConnectX-5, ConnectX-6 and Bluefield.
 300   - POWER8 and ARMv8 with ConnectX-4 LX, ConnectX-5, ConnectX-6 and Bluefield.
 301
 302 - ``mprq_en`` parameter [int]
 303
 304   A nonzero value enables configuring Multi-Packet Rx queues. Rx queue is
 305   configured as Multi-Packet RQ if the total number of Rx queues is
 306   ``rxqs_min_mprq`` or more and Rx scatter isn't configured. Disabled by
 307   default.
 308
 309   Multi-Packet Rx Queue (MPRQ a.k.a Striding RQ) can further save PCIe bandwidth
 310   by posting a single large buffer for multiple packets. Instead of posting one
 311   buffer per packet, one large buffer is posted in order to receive multiple
 312   packets on the buffer. An MPRQ buffer consists of multiple fixed-size strides
 313   and each stride receives one packet. MPRQ can improve throughput for
 314   small-packet traffic.
 315
 316   When MPRQ is enabled, max_rx_pkt_len can be larger than the size of
 317   user-provided mbuf even if DEV_RX_OFFLOAD_SCATTER isn't enabled. PMD will
 318   configure a stride size large enough to accommodate max_rx_pkt_len as long as
 319   the device allows. Note that this can waste system memory compared to enabling Rx
 320   scatter and multi-segment packet.
 321
 322 - ``mprq_log_stride_num`` parameter [int]
 323
 324   Log 2 of the number of strides for Multi-Packet Rx queue. Configuring more
 325   strides can reduce PCIe traffic further. If configured value is not in the
 326   range of device capability, the default value will be set with a warning
 327   message. The default value is 4, which is 16 strides per buffer, valid only
 328   if ``mprq_en`` is set.
 329
 330   The size of Rx queue should be bigger than the number of strides.
 331
 332 - ``mprq_max_memcpy_len`` parameter [int]
 333
 334   The maximum length of packet to memcpy in case of Multi-Packet Rx queue. Rx
 335   packet is mem-copied to a user-provided mbuf if the size of Rx packet is less
 336   than or equal to this parameter. Otherwise, PMD will attach the Rx packet to
 337   the mbuf by external buffer attachment - ``rte_pktmbuf_attach_extbuf()``.
 338   A mempool for external buffers will be allocated and managed by PMD. If Rx
 339   packet is externally attached, ol_flags field of the mbuf will have
 340   EXT_ATTACHED_MBUF and this flag must be preserved. ``RTE_MBUF_HAS_EXTBUF()``
 341   checks the flag. The default value is 128, valid only if ``mprq_en`` is set.
 342
 343 - ``rxqs_min_mprq`` parameter [int]
 344
 345   Configure Rx queues as Multi-Packet RQ if the total number of Rx queues is
 346   greater than or equal to this value. The default value is 12, valid only if
 347   ``mprq_en`` is set.
 348
 349 - ``txq_inline`` parameter [int]
 350
 351   Amount of data to be inlined during TX operations. Improves latency.
 352   Can improve PPS performance when PCI back pressure is detected and may be
 353   useful for scenarios involving heavy traffic on many queues.
 354
 355   Because additional software logic is necessary to handle this mode, this
 356   option should be used with care, as it can lower performance when back
 357   pressure is not expected.
 358
 359 - ``txqs_min_inline`` parameter [int]
 360
 361   Enable inline send only when the number of TX queues is greater than or equal
 362   to this value.
 363
 364   This option should be used in combination with ``txq_inline`` above.
 365
 366   On ConnectX-4, ConnectX-4 LX, ConnectX-5 and Bluefield without
 367   Enhanced MPW:
 368
 369         - Disabled by default.
 370         - In case ``txq_inline`` is set, the recommended value is 4.
 371
 372   On ConnectX-5 and Bluefield with Enhanced MPW:
 373
 374         - Set to 8 by default.
 375
 376 - ``txqs_max_vec`` parameter [int]
 377
 378   Enable vectorized Tx only when the number of TX queues is less than or
 379   equal to this value. Effective only when ``tx_vec_en`` is enabled.
 380
 381   On ConnectX-5:
 382
 383         - Set to 8 by default on ARMv8.
 384         - Set to 4 by default otherwise.
 385
 386   On Bluefield:
 387
 388         - Set to 16 by default.
 389
 390 - ``txq_mpw_en`` parameter [int]
 391
 392   A nonzero value enables multi-packet send (MPS) for ConnectX-4 Lx and
 393   enhanced multi-packet send (Enhanced MPS) for ConnectX-5 and Bluefield.
 394   MPS allows the TX burst function to pack up multiple packets in a
 395   single descriptor session in order to save PCI bandwidth and improve
 396   performance at the cost of a slightly higher CPU usage. When
 397   ``txq_inline`` is set along with ``txq_mpw_en``, TX burst function tries
 398   to copy entire packet data on to TX descriptor, instead of including only
 399   a pointer to the packet, if there is enough room remaining in the
 400   descriptor. ``txq_inline`` sets per-descriptor space for either pointers
 401   or inlined packets. In addition, Enhanced MPS supports hybrid mode -
 402   mixing inlined packets and pointers in the same descriptor.
 403
 404   This option cannot be used with certain offloads such as ``DEV_TX_OFFLOAD_TCP_TSO,
 405   DEV_TX_OFFLOAD_VXLAN_TNL_TSO, DEV_TX_OFFLOAD_GRE_TNL_TSO, DEV_TX_OFFLOAD_VLAN_INSERT``.
 406   When those offloads are requested the MPS send function will not be used.
 407
 408   It is currently only supported on the ConnectX-4 Lx, ConnectX-5 and Bluefield
 409   families of adapters.
 410   On ConnectX-4 Lx the MPW is considered insecure, hence disabled by default.
 411   Users who enable the MPW should be aware that an application which provides incorrect
 412   mbuf descriptors in the Tx burst can lead to serious errors in the host including, in some cases,
 413   causing the NIC to get stuck.
 414   On ConnectX-5 and Bluefield the MPW is secure and enabled by default.
 415
 416 - ``txq_mpw_hdr_dseg_en`` parameter [int]
 417
 418   A nonzero value enables including two pointers in the first block of TX
 419   descriptor. This can be used to lessen CPU load for memory copy.
 420
 421   Effective only when Enhanced MPS is supported. Disabled by default.
 422
 423 - ``txq_max_inline_len`` parameter [int]
 424
 425   Maximum size of packet to be inlined. If the size of a packet is larger
 426   than the configured value, the packet isn't inlined even though there's
 427   enough space remaining in the descriptor. Instead, the packet is included
 428   by pointer.
 429
 430   Effective only when Enhanced MPS is supported. The default value is 256.
 431
 432 - ``tx_vec_en`` parameter [int]
 433
 434   A nonzero value enables Tx vector on ConnectX-5 and Bluefield NICs if the number of
 435   global Tx queues on the port is less than or equal to ``txqs_max_vec``.
 436
 437   This option cannot be used with certain offloads such as ``DEV_TX_OFFLOAD_TCP_TSO,
 438   DEV_TX_OFFLOAD_VXLAN_TNL_TSO, DEV_TX_OFFLOAD_GRE_TNL_TSO, DEV_TX_OFFLOAD_VLAN_INSERT``.
 439   When those offloads are requested the MPS send function will not be used.
 440
 441   Enabled by default on ConnectX-5 and Bluefield.
 442
 443 - ``rx_vec_en`` parameter [int]
 444
 445   A nonzero value enables Rx vector if the port is not configured in
 446   multi-segment, otherwise this parameter is ignored.
 447
 448   Enabled by default.
 449
 450 - ``vf_nl_en`` parameter [int]
 451
 452   A nonzero value enables Netlink requests from the VF to add/remove MAC
 453   addresses and/or enable/disable promiscuous/all multicast on the Netdevice.
 454   Otherwise the relevant configuration must be run with Linux iproute2 tools.
 455   This is a prerequisite to receive this kind of traffic.
 456
 457   Enabled by default, valid only on VF devices, ignored otherwise.
 458
 459 - ``l3_vxlan_en`` parameter [int]
 460
 461   A nonzero value allows L3 VXLAN and VXLAN-GPE flow creation. To enable
 462   L3 VXLAN or VXLAN-GPE, users have to configure firmware and enable this
 463   parameter. This is a prerequisite to receive this kind of traffic.
 464
 465   Disabled by default.
 466
 467 - ``dv_flow_en`` parameter [int]
 468
 469   A nonzero value enables the DV flow steering assuming it is supported
 470   by the driver.
 471   The DV flow steering is not supported on switchdev mode.
 472
 473   Disabled by default.
 474
 475 - ``representor`` parameter [list]
 476
 477   This parameter can be used to instantiate DPDK Ethernet devices from
 478   existing port (or VF) representors configured on the device.
 479
 480   It is a standard parameter whose format is described in
 481   :ref:`ethernet_device_standard_device_arguments`.
 482
 483   For instance, to probe port representors 0 through 2::
 484
 485     representor=[0-2]
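
     Run-time parameters such as the ones above are supplied as comma-separated
     ``key=value`` pairs appended to the PCI address of the device. The sketch
     below assembles such a string and computes the number of strides implied
     by ``mprq_log_stride_num``. The PCI address ``0000:05:00.0``, the chosen
     values and both helper names are assumptions made for illustration:

     .. code-block:: sh

        # Join a PCI address and key=value parameters into one device string.
        # Usage: mlx5_devargs PCI_ADDRESS [KEY=VALUE]...
        mlx5_devargs() {
            out=$1
            shift
            for kv in "$@"; do
                out="${out},${kv}"
            done
            printf '%s\n' "$out"
        }

        # Number of strides per MPRQ buffer for a given mprq_log_stride_num.
        mprq_strides() {
            echo $((1 << $1))
        }

        # Hypothetical device enabling MPRQ with 2^6 strides per buffer.
        mlx5_devargs 0000:05:00.0 mprq_en=1 mprq_log_stride_num=6 rxqs_min_mprq=4
        mprq_strides 6

     The resulting string could then be handed to a DPDK application through
     the EAL PCI whitelist option, assumed here to be ``-w`` as in DPDK
     releases of this period. Whether a stride count other than the default
     is accepted depends on the device capability.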
 486
 487 Firmware configuration
 488 ~~~~~~~~~~~~~~~~~~~~~~
 489
 490 - L3 VXLAN and VXLAN-GPE destination UDP port
 491
 492    .. code-block:: console
 493
 494      mlxconfig -d <mst device> set IP_OVER_VXLAN_EN=1
 495      mlxconfig -d <mst device> set IP_OVER_VXLAN_PORT=<udp dport>
 496
 497   Verify configurations are set:
 498
 499    .. code-block:: console
 500
 501      mlxconfig -d <mst device> query | grep IP_OVER_VXLAN
 502      IP_OVER_VXLAN_EN                    True(1)
 503      IP_OVER_VXLAN_PORT                  <udp dport>
 504
 505 Prerequisites
 506 -------------
 507
 508 This driver relies on external libraries and kernel drivers for resources
 509 allocations and initialization. The following dependencies are not part of
 510 DPDK and must be installed separately:
 511
 512 - **libibverbs**
 513
 514   User space Verbs framework used by librte_pmd_mlx5. This library provides
 515   a generic interface between the kernel and low-level user space drivers
 516   such as libmlx5.
 517
 518   It allows slow and privileged operations (context initialization, hardware
 519   resources allocations) to be managed by the kernel and fast operations to
 520   never leave user space.
 521
 522 - **libmlx5**
 523
 524   Low-level user space driver library for Mellanox
 525   ConnectX-4/ConnectX-5/Bluefield devices, it is automatically loaded
 526   by libibverbs.
 527
 528   This library basically implements send/receive calls to the hardware
 529   queues.
 530
 531 - **libmnl**
 532
 533   Minimalistic Netlink library mainly relied on to manage E-Switch flow
 534   rules (i.e. those with the "transfer" attribute and typically involving
 535   port representors).
 536
 537 - **Kernel modules**
 538
 539   They provide the kernel-side Verbs API and low level device drivers that
 540   manage actual hardware initialization and resources sharing with user
 541   space processes.
 542
 543   Unlike most other PMDs, these modules must remain loaded and bound to
 544   their devices:
 545
 546   - mlx5_core: hardware driver managing Mellanox
 547     ConnectX-4/ConnectX-5/Bluefield devices and related Ethernet kernel
 548     network devices.
 549   - mlx5_ib: InfiniBand device driver.
 550   - ib_uverbs: user space driver for Verbs (entry point for libibverbs).
 551
 552 - **Firmware update**
 553
 554   Mellanox OFED releases include firmware updates for
 555   ConnectX-4/ConnectX-5/Bluefield adapters.
 556
 557   Because each release provides new features, these updates must be applied to
 558   match the kernel modules and libraries they come with.
 559
 560 .. note::
 561
 562    Both libraries are BSD and GPL licensed. Linux kernel modules are GPL
 563    licensed.
 564
 565 Installation
 566 ~~~~~~~~~~~~
 567
 568 Install either the RDMA Core library with a recent enough Linux kernel release
 569 (recommended) or Mellanox OFED, which provides compatibility with older
 570 releases.
 571
 572 RDMA Core with Linux Kernel
 573 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 574
 575 - Minimal kernel version: v4.14 or the most recent 4.14-rc (see `Linux installation documentation`_)
 576 - Minimal rdma-core version: v15+ commit 0c5f5765213a ("Merge pull request #227 from yishaih/tm")
 577   (see `RDMA Core installation documentation`_)
 578 - When building for i686 use:
 579
 580   - rdma-core version 18.0 or above built with 32bit support.
 581   - Kernel version 4.14.41 or above.
 582
 583 .. _`Linux installation documentation`: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/plain/Documentation/admin-guide/README.rst
 584 .. _`RDMA Core installation documentation`: https://raw.githubusercontent.com/linux-rdma/rdma-core/master/README.md
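
     A quick way to compare the running kernel against the minimum listed
     above is sketched below. The ``version_ge`` helper is hypothetical and
     relies on ``sort -V``, a GNU coreutils extension that may be missing on
     other systems:

     .. code-block:: sh

        # Succeed if version $1 is greater than or equal to version $2.
        version_ge() {
            lowest=$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)
            [ "$lowest" = "$2" ]
        }

        # Drop any distribution suffix such as "-generic" from the release.
        kernel=$(uname -r)
        kernel=${kernel%%-*}

        if version_ge "$kernel" 4.14; then
            echo "kernel $kernel meets the 4.14 minimum"
        else
            echo "kernel $kernel is older than 4.14"
        fi

     The same helper can be used with ``4.14.41`` as the second argument
     when building for i686.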
 585
 586 Mellanox OFED
 587 ^^^^^^^^^^^^^
 588
 589 - Mellanox OFED version: **4.4, 4.5**.
 590 - firmware version:
 591
 592   - ConnectX-4: **12.21.1000** and above.
 593   - ConnectX-4 Lx: **14.21.1000** and above.
 594   - ConnectX-5: **16.21.1000** and above.
 595   - ConnectX-5 Ex: **16.21.1000** and above.
 596   - Bluefield: **18.99.3950** and above.
 597
 598 While these libraries and kernel modules are available on OpenFabrics
 599 Alliance's `website <https://www.openfabrics.org/>`__ and provided by package
 600 managers on most distributions, this PMD requires Ethernet extensions that
 601 may not be supported at the moment (this is a work in progress).
 602
 603 `Mellanox OFED
 604 <http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux>`__
 605 includes the necessary support and should be used in the meantime. For DPDK,
 606 only libibverbs, libmlx5, mlnx-ofed-kernel packages and firmware updates are
 607 required from that distribution.
 608
 609 .. note::
 610
 611    Several versions of Mellanox OFED are available. Installing the version
 612    this DPDK release was developed and tested against is strongly
 613    recommended. Please check the `prerequisites`_.
 614
 615 Libmnl
 616 ^^^^^^
 617
 618 Minimal version for libmnl is **1.0.3**.
 619
 620 As a dependency of the **iproute2** suite, this library is often installed
 621 by default. It is otherwise readily available through standard system
 622 packages.
 623
 624 Its development headers must be installed in order to compile this PMD.
 625 These packages are usually named **libmnl-dev** or **libmnl-devel**
 626 depending on the Linux distribution.
 627
 628 Supported NICs
 629 --------------
 630
 631 * Mellanox(R) ConnectX(R)-4 10G MCX4111A-XCAT (1x10G)
 632 * Mellanox(R) ConnectX(R)-4 10G MCX4121A-XCAT (2x10G)
 633 * Mellanox(R) ConnectX(R)-4 25G MCX4111A-ACAT (1x25G)
 634 * Mellanox(R) ConnectX(R)-4 25G MCX4121A-ACAT (2x25G)
 635 * Mellanox(R) ConnectX(R)-4 40G MCX4131A-BCAT (1x40G)
 636 * Mellanox(R) ConnectX(R)-4 40G MCX413A-BCAT (1x40G)
 637 * Mellanox(R) ConnectX(R)-4 40G MCX415A-BCAT (1x40G)
 638 * Mellanox(R) ConnectX(R)-4 50G MCX4131A-GCAT (1x50G)
 639 * Mellanox(R) ConnectX(R)-4 50G MCX413A-GCAT (1x50G)
 640 * Mellanox(R) ConnectX(R)-4 50G MCX414A-BCAT (2x50G)
 641 * Mellanox(R) ConnectX(R)-4 50G MCX415A-GCAT (2x50G)
 642 * Mellanox(R) ConnectX(R)-4 50G MCX416A-BCAT (2x50G)
 643 * Mellanox(R) ConnectX(R)-4 50G MCX416A-GCAT (2x50G)
 644 * Mellanox(R) ConnectX(R)-4 100G MCX415A-CCAT (1x100G)
 645 * Mellanox(R) ConnectX(R)-4 100G MCX416A-CCAT (2x100G)
 646 * Mellanox(R) ConnectX(R)-4 Lx 10G MCX4121A-XCAT (2x10G)
 647 * Mellanox(R) ConnectX(R)-4 Lx 25G MCX4121A-ACAT (2x25G)
 648 * Mellanox(R) ConnectX(R)-5 100G MCX556A-ECAT (2x100G)
 649 * Mellanox(R) ConnectX(R)-5 Ex EN 100G MCX516A-CDAT (2x100G)
 650
 651 Quick Start Guide on OFED
 652 -------------------------
 653
 654 1. Download latest Mellanox OFED. For more info check the `prerequisites`_.
 655
 656
 657 2. Install the required libraries and kernel modules either by installing
 658    only the required set, or by installing the entire Mellanox OFED:
 659
 660    .. code-block:: console
 661
 662         ./mlnxofedinstall --upstream-libs --dpdk
 663
 664 3. Verify the firmware is the correct one:
 665
 666    .. code-block:: console
 667
 668         ibv_devinfo
 669
 670 4. Verify all ports links are set to Ethernet:
 671
 672    .. code-block:: console
 673
 674         mlxconfig -d <mst device> query | grep LINK_TYPE
 675         LINK_TYPE_P1                        ETH(2)
 676         LINK_TYPE_P2                        ETH(2)
 677
 678    Link types may have to be configured to Ethernet:
 679
 680    .. code-block:: console
 681
 682         mlxconfig -d <mst device> set LINK_TYPE_P1/2=1/2/3
 683
 684         * LINK_TYPE_P1=<1|2|3> , 1=Infiniband 2=Ethernet 3=VPI(auto-sense)
 685
 686    For hypervisors verify SR-IOV is enabled on the NIC:
 687
 688    .. code-block:: console
 689
 690         mlxconfig -d <mst device> query | grep SRIOV_EN
 691         SRIOV_EN                            True(1)
 692
 693    If needed, enable SR-IOV by setting the relevant fields:
 694
 695    .. code-block:: console
 696
 697         mlxconfig -d <mst device> set SRIOV_EN=1 NUM_OF_VFS=16
 698         mlxfwreset -d <mst device> reset
 699
 700 5. Restart the driver:
 701
 702    .. code-block:: console
 703
 704         /etc/init.d/openibd restart
 705
 706    or:
 707
 708    .. code-block:: console
 709
 710         service openibd restart
 711
 712    If link type was changed, firmware must be reset as well:
 713
 714    .. code-block:: console
 715
 716         mlxfwreset -d <mst device> reset
 717
 718    For hypervisors, after the reset, write the number of virtual functions
 719    needed for the PF to sysfs.
 720
 721    To dynamically instantiate a given number of virtual functions (VFs):
 722
 723    .. code-block:: console
 724
 725         echo [num_vfs] > /sys/class/infiniband/mlx5_0/device/sriov_numvfs
 726
 727 6. Compile DPDK and you are ready to go. See instructions on
 728    :ref:`Development Kit Build System <Development_Kit_Build_System>`
 729
 730 Performance tuning
 731 ------------------
 732
 733 1. Configure aggressive CQE Zipping for maximum performance:
 734
 735    .. code-block:: console
 736
 737         mlxconfig -d <mst device> s CQE_COMPRESSION=1
 738
 739    To set it back to the default CQE Zipping mode use:
 740
 741    .. code-block:: console
 742
 743         mlxconfig -d <mst device> s CQE_COMPRESSION=0
 744
 745 2. In case of virtualization:
 746
 747    - Make sure that the hypervisor kernel is 3.16 or newer.
 748    - Configure boot with ``iommu=pt``.
 749    - Use 1G huge pages.
 750    - Make sure to allocate a VM on huge pages.
 751    - Make sure to set CPU pinning.
 752
 753 3. For better performance, use CPUs on the NUMA node to which the PCIe
 754    adapter is connected. For VMs, verify that the right CPU
 755    and NUMA node are pinned accordingly. Run:
 756
 757    .. code-block:: console
 758
 759         lstopo-no-graphics
 760
 761    to identify the NUMA node to which the PCIe adapter is connected.
 762
 763 4. If more than one adapter is used, and the root complex capabilities allow
 764    placing both adapters on the same NUMA node without PCI bandwidth degradation,
 765    it is recommended to locate both adapters on the same NUMA node
 766    so that packets can be forwarded from one to the other without
 767    a NUMA performance penalty.
 768
 769 5. Disable pause frames:
 770
 771    .. code-block:: console
 772
 773         ethtool -A <netdev> rx off tx off
 774
 775 6. Verify that IO non-posted prefetch is disabled by default. This can be checked
 776    via the BIOS configuration. Please contact your server provider for more
 777    information about the settings.
 778
 779 .. note::
 780
 781         On some machines, depending on the machine integrator, it is beneficial
 782         to set the PCI max read request parameter to 1K. This can be
 783         done in the following way:
 784
 785         To query the read request size use:
 786
 787         .. code-block:: console
 788
 789                 setpci -s <NIC PCI address> 68.w
 790
 791         If the output is different than 3XXX, set it by:
 792
 793         .. code-block:: console
 794
 795                 setpci -s <NIC PCI address> 68.w=3XXX
 796
 797         The XXX can be different on different systems. Make sure to configure
 798         according to the setpci output.
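
The value to write in the note above keeps the last three hex digits reported
by ``setpci`` and only forces the leading digit to 3. A minimal sketch of that
substitution follows; ``mrrs_1k_value`` is a hypothetical helper name, and the
sketch assumes that ``setpci`` prints the register as four hex digits.

.. code-block:: sh

   # Hypothetical helper: given the current 16-bit register value as four
   # hex digits, print the value with the leading digit replaced by 3.
   mrrs_1k_value() {
       cur=$1
       case "$cur" in
           [0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]) ;;
           *) echo "unexpected setpci output: $cur" >&2; return 1 ;;
       esac
       # ${cur#?} drops the first character and keeps the remaining three.
       printf '3%s\n' "${cur#?}"
   }

   # Example use (requires setpci and root; <NIC PCI address> is a placeholder):
   #   cur=$(setpci -s <NIC PCI address> 68.w)
   #   setpci -s <NIC PCI address> 68.w="$(mrrs_1k_value "$cur")"

For instance, a current value of ``2936`` would yield ``3936``.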
 799
 800 7. To minimize the overhead of searching Memory Regions:
 801
 802    - ``--socket-mem`` is recommended to pin a predictable amount of memory.
 803    - Configure a per-lcore cache when creating Mempools for packet buffers.
 804    - Refrain from dynamically allocating/freeing memory at run time.
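
As a complement to ``lstopo-no-graphics`` in step 3, the NUMA node and the
local CPUs of an adapter can also be read from the standard PCI sysfs
attributes ``numa_node`` and ``local_cpulist``. The following is a minimal
sketch: ``nic_numa_info`` is a hypothetical helper name, and its optional
second argument (an alternative sysfs root) exists only to make the function
easy to try without hardware.

.. code-block:: sh

   # Hypothetical helper: print the NUMA node and local CPU list of a
   # network device.
   #   $1: interface name
   #   $2: sysfs root (optional, defaults to /sys)
   nic_numa_info() {
       dev="${2:-/sys}/class/net/$1/device"
       [ -r "$dev/numa_node" ] || { echo "no such device: $1" >&2; return 1; }
       printf 'numa_node=%s local_cpus=%s\n' \
           "$(cat "$dev/numa_node")" "$(cat "$dev/local_cpulist")"
   }

   # Example use (<netdev> is a placeholder):
   #   nic_numa_info <netdev>

The reported CPU list is a reasonable starting point for choosing the cores
passed to the application. A ``numa_node`` value of ``-1`` means that the
platform did not report a node for the device.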
 805
 806 Notes for testpmd
 807 -----------------
 808
 809 Compared to librte_pmd_mlx4 that implements a single RSS configuration per
 810 port, librte_pmd_mlx5 supports per-protocol RSS configuration.
 811
 812 Since ``testpmd`` defaults to IP RSS mode and there is currently no
 813 command-line parameter to enable additional protocols (UDP and TCP as well
 814 as IP), the following commands must be entered from its CLI to get the same
 815 behavior as librte_pmd_mlx4:
 816
 817 .. code-block:: console
 818
 819    > port stop all
 820    > port config all rss all
 821    > port start all
 822
 823 Usage example
 824 -------------
 825
 826 This section demonstrates how to launch **testpmd** with Mellanox
 827 ConnectX-4/ConnectX-5/Bluefield devices managed by librte_pmd_mlx5.
 828
 829 #. Load the kernel modules:
 830
 831    .. code-block:: console
 832
 833       modprobe -a ib_uverbs mlx5_core mlx5_ib
 834
 835    Alternatively, if MLNX_OFED is fully installed, the following script can
 836    be run:
 837
 838    .. code-block:: console
 839
 840       /etc/init.d/openibd restart
 841
 842    .. note::
 843
 844       User space I/O kernel modules (uio and igb_uio) are not used and do
 845       not have to be loaded.
 846
 847 #. Make sure Ethernet interfaces are in working order and linked to kernel
 848    verbs. Related sysfs entries should be present:
 849
 850    .. code-block:: console
 851
 852       ls -d /sys/class/net/*/device/infiniband_verbs/uverbs* | cut -d / -f 5
 853
 854    Example output:
 855
 856    .. code-block:: console
 857
 858       eth30
 859       eth31
 860       eth32
 861       eth33
 862
 863 #. Optionally, retrieve their PCI bus addresses for whitelisting:
 864
 865    .. code-block:: console
 866
 867       {
 868           for intf in eth30 eth31 eth32 eth33;
 869           do
 870               (cd "/sys/class/net/${intf}/device/" && pwd -P);
 871           done;
 872       } |
 873       sed -n 's,.*/\(.*\),-w \1,p'
 874
 875    Example output:
 876
 877    .. code-block:: console
 878
 879       -w 0000:05:00.1
 880       -w 0000:06:00.0
 881       -w 0000:06:00.1
 882       -w 0000:05:00.0
 883
 884 #. Request huge pages:
 885
 886    .. code-block:: console
 887
 888       echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
 889
 890 #. Start testpmd with basic parameters:
 891
 892    .. code-block:: console
 893
 894       testpmd -l 8-15 -n 4 -w 05:00.0 -w 05:00.1 -w 06:00.0 -w 06:00.1 -- --rxq=2 --txq=2 -i
 895
 896    Example output:
 897
 898    .. code-block:: console
 899
 900       [...]
 901       EAL: PCI device 0000:05:00.0 on NUMA socket 0
 902       EAL:   probe driver: 15b3:1013 librte_pmd_mlx5
 903       PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_0" (VF: false)
 904       PMD: librte_pmd_mlx5: 1 port(s) detected
 905       PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fe
 906       EAL: PCI device 0000:05:00.1 on NUMA socket 0
 907       EAL:   probe driver: 15b3:1013 librte_pmd_mlx5
 908       PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_1" (VF: false)
 909       PMD: librte_pmd_mlx5: 1 port(s) detected
 910       PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:ff
 911       EAL: PCI device 0000:06:00.0 on NUMA socket 0
 912       EAL:   probe driver: 15b3:1013 librte_pmd_mlx5
 913       PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_2" (VF: false)
 914       PMD: librte_pmd_mlx5: 1 port(s) detected
 915       PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fa
 916       EAL: PCI device 0000:06:00.1 on NUMA socket 0
 917       EAL:   probe driver: 15b3:1013 librte_pmd_mlx5
 918       PMD: librte_pmd_mlx5: PCI information matches, using device "mlx5_3" (VF: false)
 919       PMD: librte_pmd_mlx5: 1 port(s) detected
 920       PMD: librte_pmd_mlx5: port 1 MAC address is e4:1d:2d:e7:0c:fb
 921       Interactive-mode selected
 922       Configuring Port 0 (socket 0)
 923       PMD: librte_pmd_mlx5: 0x8cba80: TX queues number update: 0 -> 2
 924       PMD: librte_pmd_mlx5: 0x8cba80: RX queues number update: 0 -> 2
 925       Port 0: E4:1D:2D:E7:0C:FE
 926       Configuring Port 1 (socket 0)
 927       PMD: librte_pmd_mlx5: 0x8ccac8: TX queues number update: 0 -> 2
 928       PMD: librte_pmd_mlx5: 0x8ccac8: RX queues number update: 0 -> 2
 929       Port 1: E4:1D:2D:E7:0C:FF
 930       Configuring Port 2 (socket 0)
 931       PMD: librte_pmd_mlx5: 0x8cdb10: TX queues number update: 0 -> 2
 932       PMD: librte_pmd_mlx5: 0x8cdb10: RX queues number update: 0 -> 2
 933       Port 2: E4:1D:2D:E7:0C:FA
 934       Configuring Port 3 (socket 0)
 935       PMD: librte_pmd_mlx5: 0x8ceb58: TX queues number update: 0 -> 2
 936       PMD: librte_pmd_mlx5: 0x8ceb58: RX queues number update: 0 -> 2
 937       Port 3: E4:1D:2D:E7:0C:FB
 938       Checking link statuses...
 939       Port 0 Link Up - speed 40000 Mbps - full-duplex
 940       Port 1 Link Up - speed 40000 Mbps - full-duplex
 941       Port 2 Link Up - speed 10000 Mbps - full-duplex
 942       Port 3 Link Up - speed 10000 Mbps - full-duplex
 943       Done
 944       testpmd>