docs/cpta/methodology/perpatch_performance_tests.rst

Per-patch performance tests
---------------------------

Updated for CSIT git commit id: 661035ac4ce6e51649f302fe2b7a8218257c0587.

A methodology similar to trending analysis is used for comparing performance
before a DUT code change is merged. This can act as a verify job to disallow
changes which would decrease performance without a good reason.

Existing jobs
`````````````

VPP is the only project currently using such jobs.
They are not started automatically; they must be triggered on demand.
They allow full tag expressions, but some tags are enforced (such as MRR).

Only testbed types based on Xeon processors have jobs created.
Their Gerrit trigger words are "perftest-3n-skx"
and "perftest-2n-skx".

If additional arguments are added to the Gerrit trigger, they are treated
as Robot tag expressions to select tests to run. For more details
on existing tags, see `tag documentation rst file`_.
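
As a minimal sketch of how a trigger comment splits into a trigger word
and a tag expression: the helper name ``split_trigger_comment`` and the
example tags are illustrative assumptions, not the actual job implementation.

```python
# The two trigger words named in this document.
KNOWN_TRIGGERS = ("perftest-3n-skx", "perftest-2n-skx")


def split_trigger_comment(comment):
    """Split a Gerrit comment into (trigger word, tag expression).

    Hypothetical helper: the real jobs parse the comment elsewhere.
    """
    words = comment.split()
    if not words or words[0] not in KNOWN_TRIGGERS:
        raise ValueError(f"not a per-patch trigger: {comment!r}")
    # Anything after the trigger word is treated as a Robot tag expression.
    return words[0], " ".join(words[1:])


# Illustrative tag expression; the enforced tags (such as MRR) are added by the job.
trigger, tags = split_trigger_comment("perftest-2n-skx 64bAND1c")
print(trigger, "->", tags)
```

A comment consisting of the trigger word alone yields an empty tag expression
in this sketch.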

Basic operation
```````````````

The job builds VPP .deb packages for both the patch under test
(called "current") and its parent patch (called "parent").

For each test (from a set defined by tag expression),
both builds are subjected to several trial measurements (BMRR).
Measured samples are grouped into a "parent" sequence,
followed by a "current" sequence. The same Minimal Description Length
algorithm as in trending is used to decide whether it is one big group,
or two smaller groups. If it is one group, a "normal" result
is declared for the test. If it is two groups, and the current average
is less than the parent average, the test is declared a regression.
If it is two groups and the current average is larger or equal,
the test is declared a progression.

The whole job fails (giving -1) if some trial measurement failed,
or if any test was declared a regression.
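
The classification and the job verdict described above can be sketched
as follows. This is only an illustration under stated assumptions:
the function names are hypothetical, the description lengths are passed in
as plain numbers instead of being computed by jumpavg, and a tie between
the two description lengths is assumed to count as one group.

```python
from statistics import mean


def classify(parent, current, bits_one_group, bits_two_groups):
    """Classify one test from its samples and two description lengths.

    The description lengths (in bits) are inputs here; in the real job
    they come from the Minimal Description Length analysis.
    """
    # Assumption: when neither description is shorter, keep one group.
    if bits_one_group <= bits_two_groups:
        return "normal"
    if mean(current) < mean(parent):
        return "regression"
    # Two groups, current average larger or equal.
    return "progression"


def job_passes(classifications, any_trial_failed):
    """Return True if the whole job should pass."""
    return not any_trial_failed and "regression" not in classifications


# Made-up sample values and bit counts, for illustration only.
results = [
    classify([10.0, 10.1, 9.9], [10.0, 9.9, 10.1], 20.0, 25.0),
    classify([10.0, 10.1, 9.9], [8.0, 8.1, 7.9], 40.0, 30.0),
]
print(results, job_passes(results, any_trial_failed=False))
```

In this sketch a single regression is enough to fail the job, regardless of
how many other tests were classified as normal or progression.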

Temporary specifics
```````````````````

The Minimal Description Length analysis is performed by
jumpavg-0.1.3 available on PyPI.

In the hope of strengthening the signal (code performance) compared to noise
(all other factors influencing the measured values), several workarounds
are applied.

In contrast to trending, trial duration is set to 10 seconds,
and only 5 samples are measured for each build.
Both parameters are set in ci-management.

This decreases sensitivity to regressions, but also decreases
the probability of false positives.

Console output
``````````````

The following information is visible towards the end of Jenkins console output,
repeated for each analyzed test.

The original 5 values are visible in the order they were measured.
The 5 values after processing are also visible in the output,
this time sorted by value (so people can see minimum and maximum).

The next output is the difference of averages. It is the current average
minus the parent average, expressed as a percentage of the parent average.
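
The difference of averages can be reproduced by hand with a few lines.
This is a sketch of the arithmetic only; the function name and the sample
values are made up for illustration and are not taken from the job's code.

```python
from statistics import mean


def relative_difference_percent(parent, current):
    """Return (current average - parent average) as percent of parent average."""
    parent_avg = mean(parent)
    return (mean(current) - parent_avg) * 100.0 / parent_avg


# Made-up samples: five trial results per build, sorted for display.
parent_samples = sorted([10.0, 10.0, 10.0, 10.0, 10.0])
current_samples = sorted([9.0, 9.0, 9.0, 9.0, 9.0])
diff = relative_difference_percent(parent_samples, current_samples)
print(f"Difference of averages: {diff:.2f}%")
```

A negative value means the current build measured lower than its parent;
whether that counts as a regression is still decided by the grouping analysis,
not by this percentage.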

The next three outputs contain the jumpavg representation
of the two groups and a combined group.
Here, "bits" is the description length; for the "current" sequence
it includes the effect of the "parent" average value
(jumpavg-0.1.3 penalizes sequences with too close averages).

Next comes a sentence stating which grouping description is shorter,
and by how many bits.
Finally, the test result classification is visible.

The algorithm does not track test case names,
so test cases are indexed (from 0).