X-Git-Url: https://gerrit.fd.io/r/gitweb?p=csit.git;a=blobdiff_plain;f=docs%2Freport%2Fvpp_performance_tests%2Foverview.rst;h=d6c641ca1f0298af444c2655bb0b51e906923d68;hp=dbc1612f84a241e0f5b7630bf0e01d513513ee63;hb=516eb1c7840e037f53b02a8936c1e0fedeecd20a;hpb=02f2a2176ec92efdf63399fb7dba1eb586465f38 diff --git a/docs/report/vpp_performance_tests/overview.rst b/docs/report/vpp_performance_tests/overview.rst index dbc1612f84..d6c641ca1f 100644 --- a/docs/report/vpp_performance_tests/overview.rst +++ b/docs/report/vpp_performance_tests/overview.rst @@ -1,13 +1,14 @@ Overview ======== +.. _tested_physical_topologies: + Tested Physical Topologies -------------------------- CSIT VPP performance tests are executed on physical baremetal servers hosted by -LF FD.io project. Testbed physical topology is shown in the figure below. - -:: +:abbr:`LF (Linux Foundation)` FD.io project. Testbed physical topology is shown +in the figure below.:: +------------------------+ +------------------------+ | | | | @@ -52,20 +53,19 @@ Going forward CSIT project will be looking to add more hardware into FD.io performance labs to address larger scale multi-interface and multi-NIC performance testing scenarios. -For test cases that require DUT (VPP) to communicate with VM(s) over vhost-user -interfaces, N of VM instances are created on SUT1 and SUT2. For N=1 DUT (VPP) -forwards packets between vhostuser and physical interfaces. For N>1 DUT (VPP) a -logical service chain forwarding topology is created on DUT (VPP) by applying L2 -or IPv4/IPv6 configuration depending on the test suite. -DUT (VPP) test topology with N VM instances -is shown in the figure below including applicable packet flow thru the DUTs and -VMs (marked in the figure with ``***``). - -:: +For test cases that require DUT (VPP) to communicate with +VirtualMachines (VMs) / Linux or Docker Containers (Ctrs) over +vhost-user/memif interfaces, N of VM/Ctr instances are created on SUT1 +and SUT2. 
For N=1 DUT forwards packets between vhost/memif and physical +interfaces. For N>1 DUT a logical service chain forwarding topology is +created on DUT by applying L2 or IPv4/IPv6 configuration depending on +the test suite. DUT test topology with N VM/Ctr instances is shown in +the figure below including applicable packet flow thru the DUTs and +VMs/Ctrs (marked in the figure with ``***``).:: +-------------------------+ +-------------------------+ | +---------+ +---------+ | | +---------+ +---------+ | - | | VM[1] | | VM[N] | | | | VM[1] | | VM[N] | | + | |VM/Ctr[1]| |VM/Ctr[N]| | | |VM/Ctr[1]| |VM/Ctr[N]| | | | ***** | | ***** | | | | ***** | | ***** | | | +--^---^--+ +--^---^--+ | | +--^---^--+ +--^---^--+ | | *| |* *| |* | | *| |* *| |* | @@ -85,34 +85,41 @@ VMs (marked in the figure with ``***``). **********************| |********************** +-----------+ -For VM tests, packets are switched by DUT (VPP) multiple times: twice for a -single VM, three times for two VMs, N+1 times for N VMs. -Hence the external -throughput rates measured by TG and listed in this report must be multiplied -by (N+1) to represent the actual DUT aggregate packet forwarding rate. - -Note that reported VPP performance results are specific to the SUTs tested. -Current LF FD.io SUTs are based on Intel XEON E5-2699v3 2.3GHz CPUs. SUTs with -other CPUs are likely to yield different results. A good rule of thumb, that -can be applied to estimate VPP packet thoughput for Phy-to-Phy (NIC-to-NIC, -PCI-to-PCI) topology, is to expect the forwarding performance to be -proportional to CPU core frequency, assuming CPU is the only limiting factor -and all other SUT parameters equivalent to FD.io CSIT environment. The same rule -of thumb can be also applied for Phy-to-VM-to-Phy (NIC-to-VM-to-NIC) topology, -but due to much higher dependency on intensive memory operations and -sensitivity to Linux kernel scheduler settings and behaviour, this estimation -may not always yield good enough accuracy. 
- -For detailed LF FD.io test bed specification and physical topology please refer -to `LF FDio CSIT testbed wiki page `_. +For VM/Ctr tests, packets are switched by DUT multiple times: twice for +a single VM/Ctr, three times for two VMs/Ctrs, N+1 times for N VMs/Ctrs. +Hence the external throughput rates measured by TG and listed in this +report must be multiplied by (N+1) to represent the actual DUT aggregate +packet forwarding rate. + +Note that reported DUT (VPP) performance results are specific to the SUTs +tested. Current :abbr:`LF (Linux Foundation)` FD.io SUTs are based on Intel +XEON E5-2699v3 2.3GHz CPUs. SUTs with other CPUs are likely to yield different +results. A good rule of thumb, that can be applied to estimate VPP packet +throughput for Phy-to-Phy (NIC-to-NIC, PCI-to-PCI) topology, is to expect +the forwarding performance to be proportional to CPU core frequency, +assuming CPU is the only limiting factor and all other SUT parameters +equivalent to FD.io CSIT environment. The same rule of thumb can be also +applied for Phy-to-VM/Ctr-to-Phy (NIC-to-VM/Ctr-to-NIC) topology, but due to +much higher dependency on intensive memory operations and sensitivity to Linux +kernel scheduler settings and behaviour, this estimation may not always yield +good enough accuracy. + +For detailed FD.io CSIT testbed specification and topology, as well as +configuration and setup of SUTs and DUTs testbeds please refer to +:ref:`test_environment`. + +Similar SUT compute node and DUT VPP settings can be arrived at in a +standalone VPP setup by using a `vpp-config configuration tool +`_ developed within the +VPP project using CSIT recommended settings and scripts. Performance Tests Coverage -------------------------- -Performance tests are split into the two main categories: +Performance tests are split into two main categories: - Throughput discovery - discovery of packet forwarding rate using binary search - in accordance to RFC2544. + in accordance with :rfc:`2544`.
- NDR - discovery of Non Drop Rate packet throughput, at zero packet loss; followed by one-way packet latency measurements at 10%, 50% and 100% of @@ -133,6 +140,9 @@ CSIT |release| includes following performance test suites, listed per NIC type: VLAN tagged Ethernet frames. - **L2BD** - L2 Bridge-Domain switched-forwarding of untagged Ethernet frames with MAC learning; disabled MAC learning i.e. static MAC tests to be added. + - **L2BD Scale** - L2 Bridge-Domain switched-forwarding of untagged Ethernet + frames with MAC learning; disabled MAC learning i.e. static MAC tests to be + added with 20k, 200k and 2M FIB entries. - **IPv4** - IPv4 routed-forwarding. - **IPv6** - IPv6 routed-forwarding. - **IPv4 Scale** - IPv4 routed-forwarding with 20k, 200k and 2M FIB entries. @@ -141,12 +151,20 @@ CSIT |release| includes following performance test suites, listed per NIC type: of 2 VMs using vhost-user interfaces, with VPP forwarding modes incl. L2 Cross-Connect, L2 Bridge-Domain, VXLAN with L2BD, IPv4 routed-forwarding. - **COP** - IPv4 and IPv6 routed-forwarding with COP address security. - - **iACL** - IPv4 and IPv6 routed-forwarding with iACL address security. + - **ACL** - L2 Bridge-Domain switched-forwarding and IPv4 and IPv6 routed- + forwarding with iACL and oACL IP address, MAC address and L4 port security. - **LISP** - LISP overlay tunneling for IPv4-over-IPv4, IPv6-over-IPv4, IPv6-over-IPv6, IPv4-over-IPv6 in IPv4 and IPv6 routed-forwarding modes. - **VXLAN** - VXLAN overlay tunnelling integration with L2XC and L2BD. - **QoS Policer** - ingress packet rate measuring, marking and limiting (IPv4). + - **NAT** - (Source) Network Address Translation tests with varying + number of users and ports per user. + - **Container memif connections** - VPP memif virtual interface tests to + interconnect VPP instances with L2XC and L2BD. + - **Container K8s Orchestrated Topologies** - Container topologies connected over + the memif virtual interface. 
+ - **SRv6** - Segment Routing IPv6 tests. - 2port40GE XL710 Intel @@ -158,10 +176,14 @@ CSIT |release| includes following performance test suites, listed per NIC type: - **VMs with vhost-user** - virtual topologies with 1 VM and service chains of 2 VMs using vhost-user interfaces, with VPP forwarding modes incl. L2 Cross-Connect, L2 Bridge-Domain, VXLAN with L2BD, IPv4 routed-forwarding. - - **IPSec** - IPSec encryption with AES-GCM, CBC-SHA1 ciphers, in combination - with IPv4 routed-forwarding. + - **IPSecSW** - IPSec encryption with AES-GCM, CBC-SHA1 ciphers, in + combination with IPv4 routed-forwarding. + - **IPSecHW** - IPSec encryption with AES-GCM, CBC-SHA1 ciphers, in + combination with IPv4 routed-forwarding. Intel QAT HW acceleration. - **IPSec+LISP** - IPSec encryption with CBC-SHA1 ciphers, in combination with LISP-GPE overlay tunneling for IPv4-over-IPv4. + - **VPP TCP/IP stack** - tests of VPP TCP/IP stack used with VPP built-in HTTP + server. - 2port10GE X710 Intel @@ -182,92 +204,57 @@ CSIT |release| includes following performance test suites, listed per NIC type: Execution of performance tests takes time, especially the throughput discovery tests. Due to limited HW testbed resources available within FD.io labs hosted -by Linux Foundation, the number of tests for NICs other than X520 (a.k.a. -Niantic) has been limited to few baseline tests. Over time we expect the HW -testbed resources to grow, and will be adding complete set of performance -tests for all models of hardware to be executed regularly and(or) -continuously. +by :abbr:`LF (Linux Foundation)`, the number of tests for NICs other than X520 +(a.k.a. Niantic) has been limited to a few baseline tests. The CSIT team expects the +HW testbed resources to grow over time, so that a complete set of performance +tests can be regularly and/or continuously executed against all models of +hardware present in FD.io labs.
Performance Tests Naming ------------------------ -CSIT |release| follows a common structured naming convention for all -performance and system functional tests, introduced in CSIT rls1701. +CSIT |release| follows a common structured naming convention for all performance +and system functional tests, introduced in CSIT |release-1|. -The naming should be intuitive for majority of the tests. Complete -description of CSIT test naming convention is provided on `CSIT test naming wiki +The naming should be intuitive for majority of the tests. Complete description +of CSIT test naming convention is provided on `CSIT test naming wiki `_. -Here few illustrative examples of the new naming usage for performance test -suites: - -#. **Physical port to physical port - a.k.a. NIC-to-NIC, Phy-to-Phy, P2P** - - - *PortNICConfig-WireEncapsulation-PacketForwardingFunction- - PacketProcessingFunction1-...-PacketProcessingFunctionN-TestType* - - *10ge2p1x520-dot1q-l2bdbasemaclrn-ndrdisc.robot* => 2 ports of 10GE on - Intel x520 NIC, dot1q tagged Ethernet, L2 bridge-domain baseline switching - with MAC learning, NDR throughput discovery. - - *10ge2p1x520-ethip4vxlan-l2bdbasemaclrn-ndrchk.robot* => 2 ports of 10GE - on Intel x520 NIC, IPv4 VXLAN Ethernet, L2 bridge-domain baseline - switching with MAC learning, NDR throughput discovery. - - *10ge2p1x520-ethip4-ip4base-ndrdisc.robot* => 2 ports of 10GE on Intel - x520 NIC, IPv4 baseline routed forwarding, NDR throughput discovery. - - *10ge2p1x520-ethip6-ip6scale200k-ndrdisc.robot* => 2 ports of 10GE on - Intel x520 NIC, IPv6 scaled up routed forwarding, NDR throughput - discovery. - -#. **Physical port to VM (or VM chain) to physical port - a.k.a. 
NIC2VM2NIC, - P2V2P, NIC2VMchain2NIC, P2V2V2P** - - - *PortNICConfig-WireEncapsulation-PacketForwardingFunction- - PacketProcessingFunction1-...-PacketProcessingFunctionN-VirtEncapsulation- - VirtPortConfig-VMconfig-TestType* - - *10ge2p1x520-dot1q-l2bdbasemaclrn-eth-2vhost-1vm-ndrdisc.robot* => 2 ports - of 10GE on Intel x520 NIC, dot1q tagged Ethernet, L2 bridge-domain - switching to/from two vhost interfaces and one VM, NDR throughput - discovery. - - *10ge2p1x520-ethip4vxlan-l2bdbasemaclrn-eth-2vhost-1vm-ndrdisc.robot* => 2 - ports of 10GE on Intel x520 NIC, IPv4 VXLAN Ethernet, L2 bridge-domain - switching to/from two vhost interfaces and one VM, NDR throughput - discovery. - - *10ge2p1x520-ethip4vxlan-l2bdbasemaclrn-eth-4vhost-2vm-ndrdisc.robot* => 2 - ports of 10GE on Intel x520 NIC, IPv4 VXLAN Ethernet, L2 bridge-domain - switching to/from four vhost interfaces and two VMs, NDR throughput - discovery. - -Methodology: Multi-Thread and Multi-Core ----------------------------------------- - -**HyperThreading** - CSIT |release| performance tests are executed with SUT -servers' Intel XEON CPUs configured in HyperThreading Disabled mode (BIOS -settings). This is the simplest configuration used to establish baseline -single-thread single-core SW packet processing and forwarding performance. -Subsequent releases of CSIT will add performance tests with Intel -HyperThreading Enabled (requires BIOS settings change and hard reboot). - -**Multi-core Test** - CSIT |release| multi-core tests are executed in the +Methodology: Multi-Core and Multi-Threading +------------------------------------------- + +**Intel Hyper-Threading** - CSIT |release| performance tests are executed with +SUT servers' Intel XEON processors configured in Intel Hyper-Threading Disabled +mode (BIOS setting). This is the simplest configuration used to establish +baseline single-thread single-core application packet processing and forwarding +performance. 
Subsequent releases of CSIT will add performance tests with Intel +Hyper-Threading Enabled (requires BIOS settings change and hard reboot of +server). + +**Multi-core Tests** - CSIT |release| multi-core tests are executed in the following VPP thread and core configurations: #. 1t1c - 1 VPP worker thread on 1 CPU physical core. #. 2t2c - 2 VPP worker threads on 2 CPU physical cores. -Note that in quite a few test cases running VPP on 2 physical cores hits -the tested NIC I/O bandwidth or packets-per-second limit. +VPP worker threads are the data plane threads. VPP control thread is running on +a separate non-isolated core together with other Linux processes. Note that in +quite a few test cases running VPP workers on 2 physical cores hits the tested +NIC I/O bandwidth or packets-per-second limit. Methodology: Packet Throughput ------------------------------ Following values are measured and reported for packet throughput tests: -- NDR binary search per RFC2544: +- NDR binary search per :rfc:`2544`: - Packet rate: "RATE: pps (2x )" - Aggregate bandwidth: "BANDWIDTH: Gbps (untagged)" -- PDR binary search per RFC2544: +- PDR binary search per :rfc:`2544`: - Packet rate: "RATE: pps (2x )" @@ -281,6 +268,7 @@ Following values are measured and reported for packet throughput tests: - IPv4: 64B, IMIX_v4_1 (28x64B,16x570B,4x1518B), 1518B, 9000B. - IPv6: 78B, 1518B, 9000B. +All rates are reported from external Traffic Generator perspective. Methodology: Packet Latency --------------------------- @@ -310,20 +298,171 @@ latency values are measured using following methodology: Methodology: KVM VM vhost ------------------------- -CSIT |release| introduced environment configuration changes to KVM Qemu vhost- -user tests in order to more representatively measure VPP-17.04 performance in -configurations with vhost-user interfaces and VMs. 
+CSIT |release| introduced test environment configuration changes to KVM Qemu +vhost-user tests in order to more representatively measure |vpp-release| +performance in configurations with vhost-user interfaces and different Qemu +settings. + +FD.io CSIT performance lab is testing VPP vhost with KVM VMs using following +environment settings: + +- Tests with varying Qemu virtio queue (a.k.a. vring) sizes: [vr256] default 256 + descriptors, [vr1024] 1024 descriptors to optimize for packet throughput; + +- Tests with varying Linux :abbr:`CFS (Completely Fair Scheduler)` settings: + [cfs] default settings, [cfsrr1] CFS RoundRobin(1) policy applied to all data + plane threads handling test packet path including all VPP worker threads and + all Qemu testpmd poll-mode threads; + +- Resulting test cases are all combinations with [vr256,vr1024] and + [cfs,cfsrr1] settings; + +- Adjusted Linux kernel :abbr:`CFS (Completely Fair Scheduler)` scheduler policy + for data plane threads used in CSIT is documented in + `CSIT Performance Environment Tuning wiki `_. + The purpose is to verify performance impact (NDR, PDR throughput) and + same test measurements repeatability, by making VPP and VM data plane + threads less susceptible to other Linux OS system tasks hijacking CPU + cores running those data plane threads. + +Methodology: LXC and Docker Containers memif +-------------------------------------------- + +CSIT |release| introduced additional tests taking advantage of VPP memif +virtual interface (shared memory interface) tests to interconnect VPP +instances. VPP vswitch instance runs in bare-metal user-mode handling +Intel x520 NIC 10GbE interfaces and connecting over memif (Master side) +virtual interfaces to more instances of VPP running in :abbr:`LXC (Linux +Container)` or in Docker Containers, both with memif virtual interfaces +(Slave side). 
LXCs and Docker Containers run in a privileged mode with +VPP data plane worker threads pinned to dedicated physical CPU cores per +usual CSIT practice. All VPP instances run the same version of software. +This test topology is equivalent to existing tests with vhost-user and +VMs as described earlier in :ref:`tested_physical_topologies`. + +More information about CSIT LXC and Docker Container setup and control +is available in :ref:`containter_orchestration_in_csit`. + +Methodology: Container Topologies Orchestrated by K8s +----------------------------------------------------- + +CSIT |release| introduced new tests of Container topologies connected +over the memif virtual interface (shared memory interface). In order to +provide simple topology coding flexibility and extensibility container +orchestration is done with `Kubernetes `_ +using `Docker `_ images for all container +applications including VPP. `Ligato `_ is +used to address the container networking orchestration that is +integrated with K8s, including memif support. + +For these tests VPP vswitch instance runs in a Docker Container handling +Intel x520 NIC 10GbE interfaces and connecting over memif (Master side) +virtual interfaces to more instances of VPP running in Docker Containers +with memif virtual interfaces (Slave side). All Docker Containers run in +a privileged mode with VPP data plane worker threads pinned to dedicated +physical CPU cores per usual CSIT practice. All VPP instances run the +same version of software. This test topology is equivalent to existing +tests with vhost-user and VMs as described earlier in +:ref:`tested_physical_topologies`. + +More information about CSIT Container Topologies Orchestrated by K8s is +available in :ref:`containter_orchestration_in_csit`.
+ +Methodology: IPSec with Intel QAT HW cards +------------------------------------------ + +VPP IPSec performance tests are using DPDK cryptodev device driver in +combination with HW cryptodev devices - Intel QAT 8950 50G - present in +LF FD.io physical testbeds. DPDK cryptodev can be used for all IPSec +data plane functions supported by VPP. + +Currently CSIT |release| implements following IPSec test cases: + +- AES-GCM, CBC-SHA1 ciphers, in combination with IPv4 routed-forwarding + with Intel xl710 NIC. +- CBC-SHA1 ciphers, in combination with LISP-GPE overlay tunneling for + IPv4-over-IPv4 with Intel xl710 NIC. + +Methodology: TRex Traffic Generator Usage +----------------------------------------- + +`TRex traffic generator `_ is used for all +CSIT performance tests. TRex stateless mode is used to measure NDR and PDR +throughputs using binary search (NDR and PDR discovery tests) and for quick +checks of DUT performance against the reference NDRs (NDR check tests) for +specific configuration. + +TRex is installed and run on the TG compute node. The typical procedure is: + +- If the TRex is not already installed on TG, it is installed in the + suite setup phase - see `TRex intallation`_. +- TRex configuration is set in its configuration file + :: + + /etc/trex_cfg.yaml + +- TRex is started in the background mode + :: + + $ sh -c 'cd /scripts/ && sudo nohup ./t-rex-64 -i -c 7 --iom 0 > /tmp/trex.log 2>&1 &' > /dev/null + +- There are traffic streams dynamically prepared for each test, based on traffic + profiles. The traffic is sent and the statistics obtained using + :command:`trex_stl_lib.api.STLClient`. + +**Measuring packet loss** + +- Create an instance of STLClient +- Connect to the client +- Add all streams +- Clear statistics +- Send the traffic for defined time +- Get the statistics + +If there is a warm-up phase required, the traffic is sent also before test and +the statistics are ignored. 
+ +**Measuring latency** + +If measurement of latency is requested, two more packet streams are created (one +for each direction) with TRex flow_stats parameter set to STLFlowLatencyStats. In +that case, returned statistics will also include min/avg/max latency values. + +Methodology: TCP/IP tests with WRK tool +--------------------------------------- + +`WRK HTTP benchmarking tool `_ is used for +experimental TCP/IP and HTTP tests of VPP TCP/IP stack and built-in +static HTTP server. WRK has been chosen as it is capable of generating +significant TCP/IP and HTTP loads by scaling number of threads across +multi-core processors. + +This in turn enables quite high scale benchmarking of the main TCP/IP +and HTTP service including HTTP TCP/IP Connections-Per-Second (CPS), +HTTP Requests-Per-Second and HTTP Bandwidth Throughput. + +The initial tests are designed as follows: + +- HTTP and TCP/IP Connections-Per-Second (CPS) + + - WRK configured to use 8 threads across 8 cores, 1 thread per core. + - Maximum of 50 concurrent connections across all WRK threads. + - Timeout for server responses set to 5 seconds. + - Test duration is 30 seconds. + - Expected HTTP test sequence: + + - Single HTTP GET Request sent per open connection. + - Connection close after valid HTTP reply. + - Resulting flow sequence - 8 packets: >S,A,>Req,F, A. -Current setup of CSIT FD.io performance lab is using tuned settings for more -optimal performance of KVM Qemu: +- HTTP Requests-Per-Second -- Qemu virtio queue size has been increased from default value of 256 to 1024 - descriptors. -- Adjusted Linux kernel CFS scheduler settings, as detailed on this CSIT wiki - page: https://wiki.fd.io/view/CSIT/csit-perf-env-tuning-ubuntu1604. + - WRK configured to use 8 threads across 8 cores, 1 thread per core. + - Maximum of 50 concurrent connections across all WRK threads. + - Timeout for server responses set to 5 seconds. + - Test duration is 30 seconds. 
+ - Expected HTTP test sequence: -Adjusted Linux kernel CFS settings make the NDR and PDR throughput performance -of VPP+VM system less sensitive to other Linux OS system tasks by reducing -their interference on CPU cores that are designated for critical software -tasks under test, namely VPP worker threads in host and Testpmd threads in -guest dealing with data plan. + - Multiple HTTP GET Requests sent in sequence per open connection. + - Connection close after set test duration time. + - Resulting flow sequence: >S,A,>Req[1],Req[n],F,A.
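
The WRK settings listed for the two initial tests (8 threads, 50 concurrent connections, 5 second timeout, 30 second duration) could be turned into a command line roughly as sketched below. This is only an illustration, not the invocation CSIT actually uses: the helper name ``build_wrk_command``, the server address ``http://192.0.2.1/`` and the use of a ``Connection: close`` request header to force one request per connection in the Connections-Per-Second case are all assumptions. Only the standard ``wrk`` options ``-t``, ``-c``, ``-d``, ``--timeout`` and ``-H`` are relied upon.

```python
def build_wrk_command(url, threads=8, connections=50, timeout_s=5,
                      duration_s=30, close_each_connection=False):
    """Return the wrk argument list for one test run (hypothetical helper)."""
    command = [
        "wrk",
        "-t", str(threads),            # number of WRK threads, 1 per core
        "-c", str(connections),        # concurrent connections across all threads
        "--timeout", f"{timeout_s}s",  # timeout for server responses
        "-d", f"{duration_s}s",        # test duration
    ]
    if close_each_connection:
        # Assumption: asking the server to close after each reply approximates
        # the "single GET per open connection" Connections-Per-Second sequence.
        command += ["-H", "Connection: close"]
    command.append(url)
    return command


# Hypothetical address of the VPP built-in HTTP server under test.
SERVER_URL = "http://192.0.2.1/"

cps_command = build_wrk_command(SERVER_URL, close_each_connection=True)
rps_command = build_wrk_command(SERVER_URL)

print(" ".join(cps_command))
print(" ".join(rps_command))
```

The two argument lists differ only in the extra request header, which mirrors the text: both tests share thread, connection, timeout and duration settings and differ in whether a connection carries one request or many.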