docs/cpta/methodology/perpatch_performance_tests.rst

   1 Per-patch performance tests
   2 ---------------------------
   3
   4 Updated for CSIT git commit id: 661035ac4ce6e51649f302fe2b7a8218257c0587.
   5
   6 A methodology similar to trending analysis is used for comparing performance
   7 before a DUT code change is merged. This can act as a verify job to disallow
   8 changes which would decrease performance without a good reason.
   9
  10 Existing jobs
  11 `````````````
  12
  13 VPP is the only project currently using such jobs.
  14 They are not started automatically; they must be triggered on demand.
  15 They allow full tag expressions, but some tags are enforced (such as MRR).
  16
  17 Only the three types of testbed based on Xeon processors have jobs created.
  18 Their Gerrit trigger words are "perftest-3n-skx"
  19 and "perftest-2n-skx".
  20
  21 If additional arguments are added to the Gerrit trigger, they are treated
  22 as Robot tag expressions to select tests to run. For more details
  23 on existing tags, see
  24 `CSIT Tags <https://github.com/FDio/csit/blob/master/docs/tag_documentation.rst>`_.
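
As an illustration of the paragraph above, a trigger comment can be viewed as
a trigger word followed by an optional tag expression. The following is a
minimal Python sketch of that split. The helper name ``split_trigger`` and the
tag names in the usage note are hypothetical placeholders, not taken from CSIT
or ci-management; how the actual jobs parse the comment is not described here.

```python
# Trigger words quoted in this document.
KNOWN_TRIGGERS = ("perftest-3n-skx", "perftest-2n-skx")


def split_trigger(comment):
    """Split a Gerrit comment into (trigger word, tag expression).

    Hypothetical helper, for illustration only.
    Returns None if the comment does not start with a known trigger word.
    """
    words = comment.split()
    if not words or words[0] not in KNOWN_TRIGGERS:
        return None
    # Everything after the trigger word is treated as a Robot tag expression.
    return words[0], " ".join(words[1:])
```

For example, ``split_trigger("perftest-2n-skx TAG_AANDTAG_B")`` yields the
trigger word and the placeholder expression ``TAG_AANDTAG_B``, while a comment
with an unknown first word yields ``None``.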
  25
  26 Basic operation
  27 ```````````````
  28
  29 The job builds VPP .deb packages for both the patch under test
  30 (called "current") and its parent patch (called "parent").
  31
  32 For each test (from a set defined by tag expression),
  33 both builds are subjected to several trial measurements (BMRR).
  34 The measured samples are arranged into a "parent" sequence
  35 followed by a "current" sequence. The same Minimal Description Length
  36 algorithm as in trending is used to decide whether they form one big group
  37 or two smaller groups. If it is one group, a "normal" result
  38 is declared for the test. If it is two groups and the current average
  39 is less than the parent average, the test is declared a regression.
  40 If it is two groups and the current average is larger or equal,
  41 the test is declared a progression.
  42
  43 The whole job fails (giving -1) if some trial measurement failed,
  44 or if any test was declared a regression.
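
The decision rules above can be sketched in Python as follows. This is a
simplified illustration, not the job's actual code: the two description
lengths are assumed to be supplied by an MDL implementation (the sketch does
not compute them), the function names are hypothetical, and resolving a tie
in favour of one group is an assumption.

```python
from statistics import mean


def classify(parent, current, bits_one_group, bits_two_groups):
    """Classify one test from its samples and two description lengths.

    bits_one_group: description length when all samples form one group.
    bits_two_groups: description length when parent and current samples
    are described as two separate groups.
    """
    # Assumption: a tie is treated as one group.
    if bits_one_group <= bits_two_groups:
        return "normal"
    if mean(current) < mean(parent):
        return "regression"
    return "progression"


def job_exit_code(classifications, any_trial_failed):
    """Return -1 if the whole job should fail, 0 otherwise."""
    if any_trial_failed or "regression" in classifications:
        return -1
    return 0
```

With this sketch, a run whose tests are classified as "normal" and
"progression" passes, while a single "regression" or a single failed trial
measurement makes the whole job fail.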
  45
  46 Temporary specifics
  47 ```````````````````
  48
  49 The Minimal Description Length analysis is performed by
  50 jumpavg-0.1.3 available on PyPI.
  51
  52 In the hope of strengthening the signal (code performance) compared to noise
  53 (all other factors influencing the measured values), several workarounds
  54 are applied.
  55
  56 In contrast to trending, trial duration is set to 10 seconds,
  57 and only 5 samples are measured for each build.
  58 Both parameters are set in ci-management.
  59
  60 This decreases sensitivity to regressions, but also decreases
  61 the probability of false positives.
  62
  63 Console output
  64 ``````````````
  65
  66 The following information is visible towards the end of the Jenkins console output,
  67 repeated for each analyzed test.
  68
  69 The original 5 values are visible in the order they were measured.
  70 The 5 values after processing are also visible in the output,
  71 this time sorted by value (so people can see the minimum and maximum).
  72
  73 The next output is the difference of averages. It is the current average
  74 minus the parent average, expressed as a percentage of the parent average.
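
The difference of averages described above corresponds to the following
arithmetic, shown here as a minimal Python sketch. The function name is
hypothetical, and the sample values in the usage note are invented for
illustration; they are not measured results.

```python
from statistics import mean


def relative_difference_percent(parent, current):
    """Return (current average - parent average) as a percentage
    of the parent average."""
    avg_parent = mean(parent)
    avg_current = mean(current)
    return (avg_current - avg_parent) / avg_parent * 100.0
```

For instance, with five invented parent samples of 10.0 and five invented
current samples of 9.0, the function returns approximately -10, meaning the
current average is about 10 percent below the parent average.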
  75
  76 The next three outputs contain the jumpavg representation
  77 of the two groups and a combined group.
  78 Here, "bits" is the description length; for the "current" sequence
  79 it includes the effect of the "parent" average value
  80 (jumpavg-0.1.3 penalizes sequences whose averages are too close).
  81
  82 Next comes a sentence stating which grouping description is shorter,
  83 and by how many bits.
  84 Finally, the test result classification is visible.
  85
  86 The algorithm does not track test case names,
  87 so test cases are indexed (from 0).