docs/content/methodology/per_patch_testing.md

   1 ---
   2 title: "Per-patch Testing"
   3 weight: 5
   4 ---
   5
   6 # Per-patch Testing
   7
   8 Updated for CSIT git commit id: 72b45cfe662107c8e1bb549df71ba51352a898ee.
   9
  10 A methodology similar to trending analysis is used for comparing performance
  11 before a DUT code change is merged. This can act as a verify job to disallow
  12 changes which would decrease performance without a good reason.
  13
  14 ## Existing jobs
  15
  16 VPP is the only project currently using such jobs.
  17 They are not started automatically; they must be triggered on demand.
  18 They allow full tag expressions, but some tags are enforced (such as MRR).
  19
  20 There are jobs available for multiple types of testbeds,
  21 based on various processors.
  22 Their Gerrit trigger words are of the form "perftest-{node_arch}"
  23 where the node_arch combinations currently supported are:
  24 2n-clx, 2n-tx2, 2n-zn2, 3n-tsh.
  25
  26 ## Test selection
  27
  28 A Gerrit trigger line without any additional arguments selects
  29 a small set of test cases to run.
  30 If additional arguments are added to the Gerrit trigger, they are treated
  31 as Robot tag expressions to select tests to run.
  32 While very flexible, this method of test selection also allows the user
  33 to accidentally select too many tests, blocking the testbed for days.
  34
  35 What follows is a list of explanations and recommendations
  36 to help users select a minimal set of test cases.
  37
  38 ### Verify cycles
  39
  40 When Gerrit schedules multiple jobs to run for the same patch set,
  41 it waits until all runs are complete.
  42 While it is waiting, it is possible to trigger more jobs
  43 (adding runs to the set Gerrit is waiting for), but it is not possible
  44 to trigger more runs for the same job, until Gerrit is done waiting.
  45 After Gerrit is done waiting, it becomes possible to trigger
  46 the same job again.
  47
  48 Example: A user triggers one set of tests on 2n-icx and immediately
  49 also triggers another set of tests on 3n-icx. Then the user notices that
  50 the 2n-icx run ended early because of a typo in the tag expression.
  51 When the user tries to re-trigger 2n-icx (with a fixed tag expression),
  52 that comment gets ignored by Jenkins.
  53 Only when the 3n-icx job finishes can the user trigger 2n-icx again.
  54
  55 ### One comment many jobs
  56
  57 In the past, the CSIT code which parses perftest trigger comments
  58 was buggy, which led to bad behavior (such as selecting all performance tests,
  59 because "perftest" is also a Robot tag) when a user included multiple
  60 perftest trigger words in the same comment.
  61
  62 The worst bugs have been fixed since then, but it is still recommended
  63 to use just one trigger word per Gerrit comment, just to be safe.
  64
  65 ### Multiple test cases in run
  66
  67 While Robot supports the OR operator, it does not support parentheses,
  68 so the OR operator is not very useful. It is recommended
  69 to separate tag expressions with a space instead of using the OR operator.
  70
  71 Example template:
  72 perftest-2n-icx {tag_expression_1} {tag_expression_2}
  73
  74 See below for more concrete examples.
  75
  76 ### Suite tags
  77
  78 Traditionally, CSIT maintains broad Robot tags that can be used to select tests.
  79
  80 But it is not recommended to use them for test selection,
  81 as it is not that easy to determine how many test cases are selected.
  82
  83 The recommended way is to look into CSIT repository first,
  84 and locate a specific suite the user is interested in,
  85 and use its suite tag. For example, "ethip4-ip4base" is a suite tag
  86 selecting just one suite in CSIT git repository,
  87 avoiding all scale, container, and other similar variants.
  88
  89 Note that CSIT uses the "autogen" code generator,
  90 so the Robot instance running in Jenkins has access to more suites
  91 than are visible just by looking into the CSIT git repository.
  92 A suite tag alone is therefore not enough to select only the intended suite,
  93 and the user still probably wants to narrow the selection down
  94 to a single test case within a suite.
  95
  96 ### Fully specified tag expressions
  97
  98 Here is one template to select a single test case:
  99 {test_type}AND{nic_model}AND{nic_driver}AND{cores}AND{frame_size}AND{suite_tag}
 100 where the variables are all lower case (so AND operator stands out).
 101
 102 Currently only one test type is supported by the performance comparison jobs:
 103 "mrr".
 104 The nic_driver options depend on nic_model. For Intel cards "drv_avf"
 105 (AVF plugin) and "drv_vfio_pci" (DPDK plugin) are popular, for Mellanox
 106 "drv_rdma_core". Currently, the performance using "drv_af_xdp" is not reliable
 107 enough, so do not use it unless you are specifically testing for AF_XDP.
 108
 109 The most popular nic_model is "nic_intel-xxv710", but that is not available
 110 on all testbed types.
 111 It is safe to use "1c" for cores (unless you suspect that multi-core
 112 performance is affected differently) and "64b" for frame size ("78b" for ip6
 113 and more for dot1q and other encapsulated traffic;
 114 "1518b" is popular for ipsec and other payload-bound tests).
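
The template above can be assembled mechanically. The following is a minimal Python sketch, not part of CSIT: the helper names `build_tag_expression` and `build_trigger_comment` are hypothetical, and the example values are only illustrative picks from the options mentioned in this document (check that the chosen NIC model is actually available on the chosen testbed type).

```python
def build_tag_expression(test_type, nic_model, nic_driver, cores, frame_size, suite_tag):
    """Join the six lower-case parts with the upper-case AND operator."""
    parts = [test_type, nic_model, nic_driver, cores, frame_size, suite_tag]
    for part in parts:
        # Lower-case parts make the AND operator stand out, as recommended.
        if not part or part != part.lower() or " " in part:
            raise ValueError(f"expected a non-empty lower-case tag, got {part!r}")
    return "AND".join(parts)


def build_trigger_comment(node_arch, *tag_expressions):
    """Compose one trigger word followed by space-separated tag expressions."""
    return " ".join([f"perftest-{node_arch}", *tag_expressions])


# Illustrative values only (hypothetical selection).
expression = build_tag_expression(
    "mrr", "nic_intel-xxv710", "drv_avf", "1c", "64b", "ethip4-ip4base"
)
comment = build_trigger_comment("2n-icx", expression)
print(comment)
# perftest-2n-icx mrrANDnic_intel-xxv710ANDdrv_avfAND1cAND64bANDethip4-ip4base
```

Passing several expressions to `build_trigger_comment` yields the space-separated form recommended instead of the OR operator.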
 115
 116 As there are more test cases than CSIT can periodically test,
 117 it is possible to encounter an old test case that currently fails.
 118 To avoid that, you can look at "job spec" files we use for periodic testing,
 119 for example
 120 [this one](https://github.com/FDio/csit/blob/master/resources/job_specs/report_iterative/2n-icx/vpp-mrr-00.md).
 121
 122 ### Shortening triggers
 123
 124 Advanced users may use the following tricks to avoid writing long trigger
 125 comments.
 126
 127 Robot supports glob matching, which can be used to select multiple suite tags at
 128 once.
 129
 130 Not specifying one of the 6 parts of the recommended expression pattern
 131 will select all available options for that part. For example, not specifying nic_driver
 132 for nic_intel-xxv710 will select all 3 applicable drivers.
 133 You can use the NOT operator to reject some options (e.g. NOTdrv_af_xdp),
 134 but beware, with NOT the order matters:
 135 tag1ANDtag2NOTtag3 is not the same as tag1NOTtag3ANDtag2;
 136 the latter is evaluated as tag1AND(NOT(tag3ANDtag2)).
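
To see why the order matters, here is a small Python sketch that models the precedence described above (NOT binds loosest, then OR, then AND, with glob matching on single tags). It is a simplified illustration and not Robot Framework's own code; the function name `matches` is hypothetical, and tag normalization is reduced to lower-casing.

```python
from fnmatch import fnmatchcase


def matches(pattern, tags):
    """Simplified model of tag pattern matching (assumption, not Robot's code)."""
    tags = {tag.lower() for tag in tags}
    if "NOT" in pattern:
        # Everything after each NOT is rejected as a whole sub-pattern.
        head, *rejected = pattern.split("NOT")
        accepted = matches(head, tags) if head else True
        return accepted and not any(matches(item, tags) for item in rejected)
    if "OR" in pattern:
        return any(matches(item, tags) for item in pattern.split("OR"))
    if "AND" in pattern:
        return all(matches(item, tags) for item in pattern.split("AND"))
    # A single tag may be a glob, e.g. "ethip4-*".
    return any(fnmatchcase(tag, pattern.lower()) for tag in tags)


test_tags = {"tag1", "tag3"}
print(matches("tag1ANDtag2NOTtag3", test_tags))  # False: tag2 is missing
print(matches("tag1NOTtag3ANDtag2", test_tags))  # True: tag3ANDtag2 does not match
```

The second pattern accepts a test that has tag3 but lacks tag2, which is usually not what the user intended.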
 137
 138 Beware when not specifying nic_model. As a precaution,
 139 CSIT code will insert the default NIC model for the testbed used.
 140 Example: Specifying drv_rdma_core without specifying nic_model
 141 will fail, as the default nic_model is nic_intel-xxv710
 142 which does not support RDMA core driver.
 143
 144 ### Complete example
 145
 146 A user wants to test a VPP change which may affect load balancing with bonding.
 147 Searching tag documentation for "bonding" finds LBOND tag and its variants.
 148 Searching CSIT git repository (directory tests/) finds 8 suite files,
 149 all suited only for 3-node testbeds.
 150 All suites are using vhost, but differ by the forwarding app inside VM
 151 (DPDK or VPP), by the forwarding mode of VPP acting as host level vswitch
 152 (MAC learning or cross connect), and by the number of DUT1-DUT2 links
 153 available (1 or 2).
 154
 155 As not all NICs and testbeds offer enough ports for 2 parallel DUT-DUT links,
 156 the user looks at
 157 [testbed specifications](https://github.com/FDio/csit/tree/master/topologies/available)
 158 and finds that only xxv710 NIC on 3n-icx testbed matches the requirements.
 159 A quick look into the suites confirms the smallest frame size is 64 bytes
 160 (despite DOT1Q robot tag, as the encapsulation does not happen on TG-DUT links).
 161 It is ok to use just 1 physical core, as 3n-icx has hyperthreading enabled,
 162 so VPP vswitch will use 2 worker threads.
 163
 164 The user decides the vswitch forwarding mode is not important
 165 (and so chooses cross connect, as that has less CPU overhead),
 166 but wants to test both NIC drivers (not AF_XDP), both apps in VM,
 167 and both 1 and 2 parallel links.
 168
 169 After shortening, this is the trigger comment finally used:
 170 perftest-3n-icx mrrANDnic_intel-x710AND1cAND64bAND?lbvpplacp-dot1q-l2xcbase-eth-2vhostvr1024-1vm*NOTdrv_af_xdp
 171
 172 ## Basic operation
 173
 174 The job builds VPP .deb packages for both the patch under test
 175 (called "current") and its parent patch (called "parent").
 176
 177 For each test (from a set defined by tag expression),
 178 both builds are subjected to several trial measurements (BMRR).
 179 Measured samples are grouped into a "parent" sequence,
 180 followed by a "current" sequence. The same Minimal Description Length
 181 algorithm as in trending is used to decide whether it is one big group,
 182 or two smaller groups. If it is one group, a "normal" result
 183 is declared for the test. If it is two groups, and the current average
 184 is less than the parent average, the test is declared a regression.
 185 If it is two groups and the current average is larger or equal,
 186 the test is declared a progression.
 187
 188 The whole job fails (giving -1) if some trial measurement failed,
 189 or if any test was declared a regression.
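
The decision rules above can be summarized in a short Python sketch. It assumes the two description lengths (one combined group versus two separate groups) have already been computed elsewhere; the function names, parameter names, and the tie-breaking choice are hypothetical and do not come from the CSIT code.

```python
from statistics import fmean


def classify(parent_samples, current_samples, bits_one_group, bits_two_groups):
    """Classify one test from precomputed description lengths (in bits)."""
    # Assumption: a tie is treated as one group, i.e. a "normal" result.
    if bits_one_group <= bits_two_groups:
        return "normal"
    if fmean(current_samples) < fmean(parent_samples):
        return "regression"
    return "progression"


def job_failed(classifications, any_trial_failed=False):
    """The whole job fails on any failed trial or any regression."""
    return any_trial_failed or "regression" in classifications


# Arbitrary illustrative numbers, not real measurements.
result = classify([8.0] * 5, [6.0] * 5, bits_one_group=120.0, bits_two_groups=90.0)
print(result, job_failed([result]))
```

Note that a progression alone does not fail the job in this sketch; only regressions and failed trials do.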
 190
 191 ## Temporary specifics
 192
 193 The Minimal Description Length analysis is performed by
 194 CSIT code equivalent to jumpavg-0.1.3 library available on PyPI.
 195
 196 In the hope of strengthening the signal (code performance) compared to the noise
 197 (all other factors influencing the measured values), several workarounds
 198 are applied.
 199
 200 In contrast to trending, trial duration is set to 10 seconds,
 201 and only 5 samples are measured for each build.
 202 Both parameters are set in ci-management.
 203
 204 This decreases the sensitivity to regressions, but also decreases
 205 the probability of false positives.
 206
 207 ## Console output
 208
 209 The following information is visible towards the end of the Jenkins console output,
 210 repeated for each analyzed test.
 211
 212 The original 5 values are visible in the order they were measured.
 213 The 5 values after processing are also visible in the output,
 214 this time sorted by value (so people can see the minimum and maximum).
 215
 216 The next output is the difference of averages. It is the current average
 217 minus the parent average, expressed as a percentage of the parent average.
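
As a worked illustration of that definition, the following Python sketch computes the same quantity. The function name is hypothetical and the sample values are arbitrary, not real measurements.

```python
from statistics import fmean


def difference_of_averages_percent(parent_samples, current_samples):
    """Current average minus parent average, as a percentage of the parent average."""
    parent_avg = fmean(parent_samples)
    current_avg = fmean(current_samples)
    return (current_avg - parent_avg) / parent_avg * 100.0


# Arbitrary illustrative values: the current build is a quarter slower.
parent = [8.0, 8.0, 8.0, 8.0, 8.0]
current = [6.0, 6.0, 6.0, 6.0, 6.0]
print(difference_of_averages_percent(parent, current))  # -25.0
```

A negative value means the current average is lower than the parent average.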
 218
 219 The next three outputs contain the jumpavg representation
 220 of the two groups and a combined group.
 221 Here, "bits" is the description length; for the "current" sequence
 222 it includes the effect of the "parent" average value
 223 (jumpavg-0.1.3 penalizes sequences with too close averages).
 224
 225 Next comes a sentence describing which grouping description is shorter,
 226 and by how many bits.
 227 Finally, the test result classification is visible.
 228
 229 The algorithm does not track test case names,
 230 so test cases are indexed (from 0).