X-Git-Url: https://gerrit.fd.io/r/gitweb?p=csit.git;a=blobdiff_plain;f=docs%2Freport%2Fvpp_performance_tests%2Foverview.rst;h=98a4d066817c443deef6741debebecf9af80f93a;hp=dbc1612f84a241e0f5b7630bf0e01d513513ee63;hb=db98afadb9664144386c1642d217a21e5f223b00;hpb=02f2a2176ec92efdf63399fb7dba1eb586465f38 diff --git a/docs/report/vpp_performance_tests/overview.rst b/docs/report/vpp_performance_tests/overview.rst index dbc1612f84..98a4d06681 100644 --- a/docs/report/vpp_performance_tests/overview.rst +++ b/docs/report/vpp_performance_tests/overview.rst @@ -52,20 +52,21 @@ Going forward CSIT project will be looking to add more hardware into FD.io performance labs to address larger scale multi-interface and multi-NIC performance testing scenarios. -For test cases that require DUT (VPP) to communicate with VM(s) over vhost-user -interfaces, N of VM instances are created on SUT1 and SUT2. For N=1 DUT (VPP) -forwards packets between vhostuser and physical interfaces. For N>1 DUT (VPP) a -logical service chain forwarding topology is created on DUT (VPP) by applying L2 -or IPv4/IPv6 configuration depending on the test suite. -DUT (VPP) test topology with N VM instances -is shown in the figure below including applicable packet flow thru the DUTs and -VMs (marked in the figure with ``***``). +For test cases that require DUT (VPP) to communicate with +VirtualMachines(VMs)/LinuxContainers(LXCs) over vhost-user/memif +interfaces, N of VM/LXC instances are created on SUT1 and SUT2. For N=1 +DUT forwards packets between vhost/memif and physical interfaces. For +N>1 DUT a logical service chain forwarding topology is created on DUT by +applying L2 or IPv4/IPv6 configuration depending on the test suite. DUT +test topology with N VM/LXC instances is shown in the figure below +including applicable packet flow thru the DUTs and VMs/LXCs (marked in +the figure with ``***``). 
:: +-------------------------+ +-------------------------+ | +---------+ +---------+ | | +---------+ +---------+ | - | | VM[1] | | VM[N] | | | | VM[1] | | VM[N] | | + | |VM/LXC[1]| |VM/LXC[N]| | | |VM/LXC[1]| |VM/LXC[N]| | | | ***** | | ***** | | | | ***** | | ***** | | | +--^---^--+ +--^---^--+ | | +--^---^--+ +--^---^--+ | | *| |* *| |* | | *| |* *| |* | @@ -85,31 +86,33 @@ VMs (marked in the figure with ``***``). **********************| |********************** +-----------+ -For VM tests, packets are switched by DUT (VPP) multiple times: twice for a -single VM, three times for two VMs, N+1 times for N VMs. -Hence the external -throughput rates measured by TG and listed in this report must be multiplied -by (N+1) to represent the actual DUT aggregate packet forwarding rate. - -Note that reported VPP performance results are specific to the SUTs tested. -Current LF FD.io SUTs are based on Intel XEON E5-2699v3 2.3GHz CPUs. SUTs with -other CPUs are likely to yield different results. A good rule of thumb, that -can be applied to estimate VPP packet thoughput for Phy-to-Phy (NIC-to-NIC, -PCI-to-PCI) topology, is to expect the forwarding performance to be -proportional to CPU core frequency, assuming CPU is the only limiting factor -and all other SUT parameters equivalent to FD.io CSIT environment. The same rule -of thumb can be also applied for Phy-to-VM-to-Phy (NIC-to-VM-to-NIC) topology, -but due to much higher dependency on intensive memory operations and -sensitivity to Linux kernel scheduler settings and behaviour, this estimation -may not always yield good enough accuracy. - -For detailed LF FD.io test bed specification and physical topology please refer -to `LF FDio CSIT testbed wiki page `_. +For VM/LXC tests, packets are switched by DUT multiple times: twice for +a single VM/LXC, three times for two VMs/LXCs, N+1 times for N VMs/LXCs. 
+Hence the external throughput rates measured by TG and listed in this +report must be multiplied by (N+1) to represent the actual DUT aggregate +packet forwarding rate. + +Note that reported DUT (VPP) performance results are specific to the +SUTs tested. Current LF FD.io SUTs are based on Intel XEON E5-2699v3 +2.3GHz CPUs. SUTs with other CPUs are likely to yield different results. +A good rule of thumb, that can be applied to estimate VPP packet +throughput for Phy-to-Phy (NIC-to-NIC, PCI-to-PCI) topology, is to expect +the forwarding performance to be proportional to CPU core frequency, +assuming CPU is the only limiting factor and all other SUT parameters +equivalent to FD.io CSIT environment. The same rule of thumb can be also +applied for Phy-to-VM/LXC-to-Phy (NIC-to-VM/LXC-to-NIC) topology, but +due to much higher dependency on intensive memory operations and +sensitivity to Linux kernel scheduler settings and behaviour, this +estimation may not always yield good enough accuracy. + +For detailed LF FD.io test bed specification and physical topology +please refer to +`LF FD.io CSIT testbed wiki page `_. Performance Tests Coverage -------------------------- -Performance tests are split into the two main categories: +Performance tests are split into two main categories: - Throughput discovery - discovery of packet forwarding rate using binary search in accordance to RFC2544. @@ -147,6 +150,8 @@ CSIT |release| includes following performance test suites, listed per NIC type: - **VXLAN** - VXLAN overlay tunnelling integration with L2XC and L2BD. - **QoS Policer** - ingress packet rate measuring, marking and limiting (IPv4). + - **CGNAT** - Carrier Grade Network Address Translation tests with varying + number of users and ports per user. - 2port40GE XL710 Intel @@ -180,80 +185,46 @@ CSIT |release| includes following performance test suites, listed per NIC type: - **L2BD** - L2 Bridge-Domain switched-forwarding of untagged Ethernet frames with MAC learning. 
-Execution of performance tests takes time, especially the throughput discovery -tests. Due to limited HW testbed resources available within FD.io labs hosted -by Linux Foundation, the number of tests for NICs other than X520 (a.k.a. -Niantic) has been limited to few baseline tests. Over time we expect the HW -testbed resources to grow, and will be adding complete set of performance -tests for all models of hardware to be executed regularly and(or) -continuously. +Execution of performance tests takes time, especially the throughput +discovery tests. Due to limited HW testbed resources available within +FD.io labs hosted by Linux Foundation, the number of tests for NICs +other than X520 (a.k.a. Niantic) has been limited to a few baseline tests. +The CSIT team expects the HW testbed resources to grow over time, so that +a complete set of performance tests can be regularly and/or continuously +executed against all models of hardware present in FD.io labs. Performance Tests Naming ------------------------ CSIT |release| follows a common structured naming convention for all -performance and system functional tests, introduced in CSIT rls1701. +performance and system functional tests, introduced in CSIT |release-1|. The naming should be intuitive for majority of the tests. Complete description of CSIT test naming convention is provided on `CSIT test naming wiki `_. -Here few illustrative examples of the new naming usage for performance test -suites: - -#. **Physical port to physical port - a.k.a. NIC-to-NIC, Phy-to-Phy, P2P** - - - *PortNICConfig-WireEncapsulation-PacketForwardingFunction- - PacketProcessingFunction1-...-PacketProcessingFunctionN-TestType* - - *10ge2p1x520-dot1q-l2bdbasemaclrn-ndrdisc.robot* => 2 ports of 10GE on - Intel x520 NIC, dot1q tagged Ethernet, L2 bridge-domain baseline switching - with MAC learning, NDR throughput discovery. 
- - *10ge2p1x520-ethip4vxlan-l2bdbasemaclrn-ndrchk.robot* => 2 ports of 10GE - on Intel x520 NIC, IPv4 VXLAN Ethernet, L2 bridge-domain baseline - switching with MAC learning, NDR throughput discovery. - - *10ge2p1x520-ethip4-ip4base-ndrdisc.robot* => 2 ports of 10GE on Intel - x520 NIC, IPv4 baseline routed forwarding, NDR throughput discovery. - - *10ge2p1x520-ethip6-ip6scale200k-ndrdisc.robot* => 2 ports of 10GE on - Intel x520 NIC, IPv6 scaled up routed forwarding, NDR throughput - discovery. - -#. **Physical port to VM (or VM chain) to physical port - a.k.a. NIC2VM2NIC, - P2V2P, NIC2VMchain2NIC, P2V2V2P** - - - *PortNICConfig-WireEncapsulation-PacketForwardingFunction- - PacketProcessingFunction1-...-PacketProcessingFunctionN-VirtEncapsulation- - VirtPortConfig-VMconfig-TestType* - - *10ge2p1x520-dot1q-l2bdbasemaclrn-eth-2vhost-1vm-ndrdisc.robot* => 2 ports - of 10GE on Intel x520 NIC, dot1q tagged Ethernet, L2 bridge-domain - switching to/from two vhost interfaces and one VM, NDR throughput - discovery. - - *10ge2p1x520-ethip4vxlan-l2bdbasemaclrn-eth-2vhost-1vm-ndrdisc.robot* => 2 - ports of 10GE on Intel x520 NIC, IPv4 VXLAN Ethernet, L2 bridge-domain - switching to/from two vhost interfaces and one VM, NDR throughput - discovery. - - *10ge2p1x520-ethip4vxlan-l2bdbasemaclrn-eth-4vhost-2vm-ndrdisc.robot* => 2 - ports of 10GE on Intel x520 NIC, IPv4 VXLAN Ethernet, L2 bridge-domain - switching to/from four vhost interfaces and two VMs, NDR throughput - discovery. - -Methodology: Multi-Thread and Multi-Core ----------------------------------------- - -**HyperThreading** - CSIT |release| performance tests are executed with SUT -servers' Intel XEON CPUs configured in HyperThreading Disabled mode (BIOS -settings). This is the simplest configuration used to establish baseline -single-thread single-core SW packet processing and forwarding performance. 
-Subsequent releases of CSIT will add performance tests with Intel -HyperThreading Enabled (requires BIOS settings change and hard reboot). - -**Multi-core Test** - CSIT |release| multi-core tests are executed in the -following VPP thread and core configurations: +Methodology: Multi-Core and Multi-Threading +------------------------------------------- + +**Intel Hyper-Threading** - CSIT |release| performance tests are +executed with SUT servers' Intel XEON processors configured in Intel +Hyper-Threading Disabled mode (BIOS setting). This is the simplest +configuration used to establish baseline single-thread single-core +application packet processing and forwarding performance. Subsequent +releases of CSIT will add performance tests with Intel Hyper-Threading +Enabled (requires BIOS settings change and hard reboot of server). + +**Multi-core Tests** - CSIT |release| multi-core tests are executed in +the following VPP thread and core configurations: #. 1t1c - 1 VPP worker thread on 1 CPU physical core. #. 2t2c - 2 VPP worker threads on 2 CPU physical cores. -Note that in quite a few test cases running VPP on 2 physical cores hits -the tested NIC I/O bandwidth or packets-per-second limit. +VPP worker threads are the data plane threads. VPP control thread is +running on a separate non-isolated core together with other Linux +processes. Note that in quite a few test cases running VPP workers on 2 +physical cores hits the tested NIC I/O bandwidth or packets-per-second +limit. Methodology: Packet Throughput ------------------------------ @@ -281,6 +252,7 @@ Following values are measured and reported for packet throughput tests: - IPv4: 64B, IMIX_v4_1 (28x64B,16x570B,4x1518B), 1518B, 9000B. - IPv6: 78B, 1518B, 9000B. +All rates are reported from external Traffic Generator perspective. 
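Because the reported rates are taken from the external Traffic Generator perspective, a reader may want to convert them into DUT-side figures. The following is a minimal sketch of the two pieces of arithmetic described in this overview: the (N+1) multiplier for tests with N VM/LXC instances, and the CPU core frequency rule of thumb for Phy-to-Phy topologies. The function names and the example numbers are hypothetical and are not part of CSIT; the frequency scaling assumes the CPU is the only limiting factor.

```python
def dut_aggregate_rate(tg_rate_pps: float, n_instances: int = 0) -> float:
    """Convert a TG-reported rate into the DUT aggregate forwarding rate.

    With N VM/LXC instances in the service chain, each packet is switched
    N+1 times by the DUT, so the external rate is multiplied by (N+1).
    N=0 corresponds to a plain Phy-to-Phy test (multiplier of 1).
    """
    if n_instances < 0:
        raise ValueError("n_instances must be zero or positive")
    return tg_rate_pps * (n_instances + 1)


def scale_by_cpu_frequency(measured_pps: float,
                           measured_ghz: float,
                           target_ghz: float) -> float:
    """Rule-of-thumb estimate: throughput proportional to core frequency.

    Only a rough estimate, and only under the assumption that the CPU is
    the sole bottleneck and all other SUT parameters are equivalent.
    """
    if measured_ghz <= 0 or target_ghz <= 0:
        raise ValueError("frequencies must be positive")
    return measured_pps * target_ghz / measured_ghz


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    external = 2_000_000  # pps as seen by the TG, test with 2 VMs
    print(dut_aggregate_rate(external, n_instances=2))
    print(scale_by_cpu_frequency(10_000_000, measured_ghz=2.3, target_ghz=3.0))
```

As noted in the overview, the frequency estimate is less reliable for Phy-to-VM/LXC-to-Phy topologies, where memory operations and kernel scheduler behaviour also affect the result.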
Methodology: Packet Latency --------------------------- @@ -310,20 +282,102 @@ latency values are measured using following methodology: Methodology: KVM VM vhost ------------------------- -CSIT |release| introduced environment configuration changes to KVM Qemu vhost- -user tests in order to more representatively measure VPP-17.04 performance in -configurations with vhost-user interfaces and VMs. +CSIT |release| introduced test environment configuration changes to KVM Qemu vhost- +user tests in order to more representatively measure |vpp-release| performance +in configurations with vhost-user interfaces and different Qemu settings. + +FD.io CSIT performance lab is testing VPP vhost with KVM VMs using the following environment settings: + +- Tests with varying Qemu virtio queue (a.k.a. vring) sizes: + [vr256] default 256 descriptors, [vr1024] 1024 descriptors to + optimize for packet throughput; + +- Tests with varying Linux CFS (Completely Fair Scheduler) + settings: [cfs] default settings, [cfsrr1] CFS RoundRobin(1) + policy applied to all data plane threads handling test packet + path including all VPP worker threads and all Qemu testpmd + poll-mode threads; + +- Resulting test cases are all combinations with [vr256,vr1024] and + [cfs,cfsrr1] settings; + +- Adjusted Linux kernel CFS scheduler policy for data plane threads used + in CSIT is documented in + `CSIT Performance Environment Tuning wiki `_. + The purpose is to verify performance impact (NDR, PDR throughput) and + the repeatability of test measurements, by making VPP and VM data plane + threads less susceptible to other Linux OS system tasks hijacking CPU + cores running those data plane threads. + +Methodology: LXC Container memif +-------------------------------- + +CSIT |release| introduced new tests - VPP Memif virtual interface +(shared memory interface) tests interconnecting VPP instances over +memif. 
VPP vswitch instance runs in bare-metal user-mode handling Intel +x520 NIC 10GbE interfaces and connecting over memif (Master side) +virtual interfaces to another instance of VPP running in bare-metal +Linux Container (LXC) with memif virtual interfaces (Slave side). LXC +runs in a privileged mode with VPP data plane worker threads pinned to +dedicated physical CPU cores per usual CSIT practice. Both VPP instances run the +same version of software. This test topology is equivalent to existing +tests with vhost-user and VMs. + +Methodology: IPSec with Intel QAT HW cards +------------------------------------------ + +VPP IPSec performance tests are using DPDK cryptodev device driver in +combination with HW cryptodev devices - Intel QAT 8950 50G - present in +LF FD.io physical testbeds. DPDK cryptodev can be used for all IPSec +data plane functions supported by VPP. + +Currently CSIT |release| implements following IPSec test cases: + +- AES-GCM, CBC-SHA1 ciphers, in combination with IPv4 routed-forwarding + with Intel xl710 NIC. +- CBC-SHA1 ciphers, in combination with LISP-GPE overlay tunneling for + IPv4-over-IPv4 with Intel xl710 NIC. + +Methodology: TRex Traffic Generator Usage +----------------------------------------- + +The `TRex traffic generator `_ is used for all +CSIT performance tests. TRex stateless mode is used to measure NDR and PDR +throughputs using binary search (NDR and PDR discovery tests) and for quick +checks of DUT performance against the reference NDRs (NDR check tests) for +specific configuration. + +TRex is installed and run on the TG compute node. The typical procedure is: + +- If the TRex is not already installed on TG, it is installed in the + suite setup phase - see `TRex intallation`_. 
+- TRex configuration is set in its configuration file + :: + + /etc/trex_cfg.yaml + +- TRex is started in the background mode + :: + + $ sh -c 'cd /opt/trex-core-2.25/scripts/ && sudo nohup ./t-rex-64 -i -c 7 --iom 0 > /dev/null 2>&1 &' > /dev/null + +- There are traffic streams dynamically prepared for each test. The traffic + is sent and the statistics obtained using trex_stl_lib.api.STLClient. + +**Measuring packet loss** + +- Create an instance of STLClient +- Connect to the client +- Add all streams +- Clear statistics +- Send the traffic for defined time +- Get the statistics -Current setup of CSIT FD.io performance lab is using tuned settings for more -optimal performance of KVM Qemu: +If there is a warm-up phase required, the traffic is sent also before test and +the statistics are ignored. -- Qemu virtio queue size has been increased from default value of 256 to 1024 - descriptors. -- Adjusted Linux kernel CFS scheduler settings, as detailed on this CSIT wiki - page: https://wiki.fd.io/view/CSIT/csit-perf-env-tuning-ubuntu1604. +**Measuring latency** -Adjusted Linux kernel CFS settings make the NDR and PDR throughput performance -of VPP+VM system less sensitive to other Linux OS system tasks by reducing -their interference on CPU cores that are designated for critical software -tasks under test, namely VPP worker threads in host and Testpmd threads in -guest dealing with data plan. +If measurement of latency is requested, two more packet streams are created (one +for each direction) with TRex flow_stats parameter set to STLFlowLatencyStats. In +that case, returned statistics will also include min/avg/max latency values.
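The packet loss steps listed above can be sketched roughly as follows. This is an illustration only, not the CSIT implementation. The function takes any client object that exposes the method names used here, which are assumed to mirror ``trex_stl_lib.api.STLClient`` (``connect``, ``reset``, ``add_streams``, ``clear_stats``, ``start``, ``wait_on_traffic``, ``get_stats``, ``disconnect``). The ``reset``, ``wait_on_traffic`` and ``disconnect`` calls are additions not named in the list above. ``StubClient`` is a hypothetical stand-in with canned counters so that the sketch runs without a TRex installation.

```python
def measure_loss(client, streams_by_port, rate, duration_s):
    """Send traffic for a fixed time and return packet loss counters.

    client          -- object with an STLClient-like interface (assumption)
    streams_by_port -- dict mapping port id to a list of stream objects
    rate            -- multiplier passed to start(), e.g. "100%"
    duration_s      -- traffic duration in seconds
    """
    ports = sorted(streams_by_port)
    client.connect()
    try:
        client.reset(ports=ports)              # acquire ports, drop old state
        for port, streams in streams_by_port.items():
            client.add_streams(streams, ports=[port])
        client.clear_stats()
        client.start(ports=ports, mult=rate, duration=duration_s)
        client.wait_on_traffic(ports=ports)    # block until sending is done
        stats = client.get_stats()
    finally:
        client.disconnect()

    sent = sum(stats[p]["opackets"] for p in ports)
    received = sum(stats[p]["ipackets"] for p in ports)
    lost = sent - received
    return {
        "sent": sent,
        "received": received,
        "lost": lost,
        "loss_ratio": lost / sent if sent else 0.0,
    }


class StubClient:
    """Hypothetical stand-in for a TRex client; returns canned statistics."""

    def __init__(self, canned_stats):
        self._stats = canned_stats
        self.calls = []                        # records the call order

    def connect(self):
        self.calls.append("connect")

    def reset(self, ports=None):
        self.calls.append("reset")

    def add_streams(self, streams, ports=None):
        self.calls.append("add_streams")

    def clear_stats(self):
        self.calls.append("clear_stats")

    def start(self, ports=None, mult="1", duration=-1):
        self.calls.append("start")

    def wait_on_traffic(self, ports=None):
        self.calls.append("wait_on_traffic")

    def get_stats(self):
        self.calls.append("get_stats")
        return self._stats

    def disconnect(self):
        self.calls.append("disconnect")


if __name__ == "__main__":
    stub = StubClient({0: {"opackets": 1000, "ipackets": 990},
                       1: {"opackets": 1000, "ipackets": 1000}})
    print(measure_loss(stub, {0: ["stream-a"], 1: ["stream-b"]}, "100%", 10))
```

For a warm-up phase, the same function could be called once before the measured run and its return value discarded, which matches the note above that warm-up statistics are ignored.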