docs/content/methodology/bisecting.md

   1 ---
   2 title: "Bisecting"
   3 weight: 5
   4 ---
   5
   6 # Bisecting
   7
   8 Updated for CSIT git commit hash: 153c9e1215f27ad166df0ce4bd2541d9f37a7afa.
   9
  10 When trending (or report release comparison) detects a performance anomaly,
  11 it is possible to narrow down its cause in VPP repository.
  12 This document explains how.
  13
  14 ## Naming
  15
  16 Bisect is a binary search, it relies on "git bisect" command.
  17 At the start, two commits need to be marked. One as "old", the other as "new".
  18 Upon second mark, "git bisect" checks out a commit in the middle,
  19 and waits for the user to mark it either old or new.
  20 This effectively replaces the previous mark of the same type,
  21 so "new middle" is checked out, halving the search interval.
  22
  23 But, "old" and "new" frequently refers to the time order bisect chooses commits,
  24 so in this document we use different adjectives:
  25 Early, mid, late. Early commit and late commit are the current
  26 boundaries of the search interval, mid commit is the next one
  27 to test and classify.
  28 The initial boundaries, as input parameters to the whole search process
  29 are called the earliest commit and the latest commit.
  30
  31 ## Bisect jobs
  32
  33 VPP is the only project currently using such jobs.
  34 They are not started automatically, they must be triggered on demand.
  35 They allow full tag expressions, but only some result types are supported.
  36 Currently it is all perf types in UTI model:
  37 "mrr", "ndrpdr", "soak", "reconf" and "hoststack".
  38 Device tests (pass/fail) are not supported yet.
  39 If a test fails, a low fake value is used instead of results,
  40 so the bisect procedure can also find breakages (and fixes).
  41
  42 The trigger word contains the intended testbed type,
  43 e.g. "bisecttest-2n-spr".
  44
  45 The next word needs to be a commit hash of the intended earliest VPP build.
  46 The latest VPP build is the change the comment is added to.
  47
  48 If additional arguments are added to the Gerrit trigger, they are treated
  49 as Robot tag expressions to select tests to run.
  50
  51 ## Basic operation
  52
  53 The job builds VPP .deb packages for both the earliest and latest VPP commit,
  54 then runs the selected tests on both (using CSIT code at HEAD
  55 of the newest CSIT oper branch, or CSIT_REF if started manually on Jenkins).
  56 In archived logs, the results of earliest VPP build are in "earliest" directory,
  57 and results of latest VPP build are in "latest" directory.
  58
  59 Then the job follows VPP mid commits selected by "git bisect".
  60 They are built and tested, results appear in "middle" directory,
  61 numbered in order "git bisect" has chosen them.
  62
  63 When classifying the newly measured performance of the current mid commit,
  64 the three sets of current results (early, mid, late) are grouped
  65 in three ways. The mid is either added to early group, or to late group,
  66 or kept as a separate group.
  67 The same Minimal Description Length algorithm as in trend analysis
  68 is used to select the grouping with smallest information content.
  69 If the grouping with the mid results added to the early group
  70 is the smallest, the mid commit becomes the new early.
  71 If the grouping with the mid results added to the late group
  72 is the smallest, the mid commit becomes the new late.
  73 If the grouping with the mid results separate is the smallest,
  74 the mid commit becomes that boundary which keeps larger difference
  75 of average performances (relative to the larger value, pairwise).
  76
  77 ## Temporary specifics
  78
  79 The Minimal Description Length analysis is performed by
  80 jumpavg-0.4.1 (available on PyPI).
  81
  82 In contrast to trending, MRR trial duration is kept at 1 second,
  83 but trial multiplicity is set to 60 samples.
  84 Both parameters are set in ci-management,
  85 overridable when triggerring manually on Jenkins.
  86
  87 The 60x1s setting increases probability of false anomalies,
  88 but bisect always converges to a commit;
  89 it is up to humans to decide if that is a real anomaly.
  90 On upside, stdev is estimated better, making the bisection less sensitive
  91 to randomness. Systematic errors are still possible,
  92 but overall this choice leads to more human-like search decisions.
  93
  94 As test failures are tolerated, the bisect job usually succeeds
  95 (unless there is a fatal infrastructure issue).
  96 Human investigation is needed to confirm the identified commit is the real cause.
  97 For example, if the cause is in CSIT and all builds lead to failed tests,
  98 the bisect will converge to the earliest commit, which is probably innocent.
  99
 100 ## Console output
 101
 102 After each mid build is tested, the tree sets of relevant results
 103 are visible in the console output, the prefixes are (without quotes)
 104 "Read csit_early: "
 105 "Read csit_late: "
 106 "Read csit_mid: ".
 107 Each prefix is followed by a list of float values extracted from the tests,
 108 the meaning and units depend on tests chosen
 109 (but do not matter as the same set of tests is executed for each build).
 110 There are also lines starting with "Stats: AvgStdevStats"
 111 which give and overview for average and standard deviation.
 112 Then, the information content in bits for the three possible groupings is listed,
 113 followed by the decision, bits saving and new performance difference.
 114 After the last iteration, the commit message of the offending commit is listed.