From f2430562835ded8aeb66db2c36379cf3ea54c748 Mon Sep 17 00:00:00 2001
From: Vratko Polak <vrpolak@cisco.com>
Date: Thu, 14 Jun 2018 14:03:44 +0200
Subject: [PATCH] CSIT-1110: Improve new detection methodology doc

Change-Id: I068fd4e9418f232ee1e1f13994e9c5c431478ec8
Signed-off-by: Vratko Polak <vrpolak@cisco.com>
---
 docs/new/cpta/introduction/index.rst |  14 ++---
 docs/new/cpta/methodology/index.rst  | 105 +++++++++++++++++++++++++----------
 2 files changed, 82 insertions(+), 37 deletions(-)

diff --git a/docs/new/cpta/introduction/index.rst b/docs/new/cpta/introduction/index.rst
index 991181aff4..229e9e3da9 100644
--- a/docs/new/cpta/introduction/index.rst
+++ b/docs/new/cpta/introduction/index.rst
@@ -8,17 +8,18 @@ Performance dashboard tables provide the latest VPP throughput trend,
 trend compliance and detected anomalies, all on a per VPP test case
 basis.  Linked trendline graphs enable further drill-down into the
 trendline compliance, sequence and nature of anomalies, as well as
-pointers to performance test builds/logs and VPP builds. Performance
-trending is currently based on the Maximum Receive Rate (MRR) tests. MRR
-tests measure the packet forwarding rate under the maximum load offered
+pointers to performance test builds/logs and VPP (or DPDK) builds.
+Performance trending is currently based on the Maximum Receive Rate (MRR) tests.
+MRR tests measure the packet forwarding rate under the maximum load offered
 by traffic generator over a set trial duration, regardless of packet
 loss. See :ref:`trending_methodology` section for more detail including
 trend and anomaly calculations.
 
-Data samples are generated by the CSIT VPP performance trending jobs
+Data samples are generated by the CSIT VPP (and DPDK) performance trending jobs
 executed twice a day (target start: every 12 hrs, 02:00, 14:00 UTC). All
-trend and anomaly evaluation is based on a rolling window of <N=14> data
-samples, covering last 7 days.
+trend and anomaly evaluation is based on an algorithm which divides test runs
+into groups according to minimum description length principle.
+The trend value is the population average of the results within a group.
 
 Failed tests
 ------------
@@ -53,7 +54,6 @@ Legend to the tables:
       maximum of trend values over the last quarter except last week.
     - **Regressions [#]**: Number of regressions detected.
     - **Progressions [#]**: Number of progressions detected.
-    - **Outliers [#]**: Number of outliers detected.
 
 Tested VPP worker-thread-core combinations (1t1c, 2t2c, 4t4c) are listed
 in separate tables in section 1.x. Followed by trending methodology in
diff --git a/docs/new/cpta/methodology/index.rst b/docs/new/cpta/methodology/index.rst
index ff69eb1f9a..612f6b32db 100644
--- a/docs/new/cpta/methodology/index.rst
+++ b/docs/new/cpta/methodology/index.rst
@@ -66,7 +66,63 @@ are far from being distributed normally, we do not have a better tractable model
 
 The group boundaries are selected based on `Minimum Description Length`_.
 
-TODO: Decide the level of detail for describing group selection.
+Minimum Description Length
+--------------------------
+
+`Minimum Description Length`_ (MDL) is a particular formalization
+of `Occam's razor`_ principle.
+
+The general formulation mandates to evaluate a large set of models,
+but for anomaly detection purposes, it is usefuls to consider
+a smaller set of models, so that scoring and comparing them is easier.
+
+For each candidate model, the data should be compressed losslessly,
+which includes model definitions, encoded model parameters,
+and the raw data encoded based on probabilities computed by the model.
+The model resulting in shortest compressed message is the "the" correct model.
+
+For our model set (groups of normally distributed samples),
+we need to encode group length (which penalizes too many groups),
+group average (more on that later), group stdev and then all the samples.
+
+Luckily, the "all the samples" part turns out to be quite easy to compute.
+If sample values are considered as coordinates in (multi-dimensional)
+Euclidean space, fixing stdev means the point with allowed coordinates
+lays on a sphere. Fixing average intersects the sphere with a (hyper)-plane,
+and Gaussian probability density on the resulting sphere is constant.
+So the only contribution is the "area" of the sphere, which only depends
+on the number of samples and stdev.
+
+A somehow ambiguous part is in choosing which encoding
+is used for group size, average and stdev.
+Diferent encodings cause different biases to large or small values.
+In our implementation we have chosen probability density
+corresponding to uniform distribution (from zero to maximal sample value)
+for stdev and average of the first group,
+but for averages of subsequent groups we have chosen a distribution
+which disourages deliminating groups with averages close together.
+
+One part of our implementation which is not precise enough
+is handling of measurement precision.
+The minimal difference in MRR values is currently 0.1 pps
+(the difference of one packet over 10 second trial),
+but the code assumes the precision is 1.0.
+Also, all the calculations assume 1.0 is totally negligible,
+compared to stdev value.
+
+The group selection algorithm currently has no parameters,
+all the aforementioned encodings and handling of precision is hardcoded.
+In principle, every group selection is examined, and the one encodable
+with least amount of bits is selected.
+As the bit amount for a selection is just sum of bits for every group,
+finding the best selection takes number of comparisons
+quadratically increasing with the size of data,
+the overall time complexity being probably cubic.
+
+The resulting group distribution looks good
+if samples are distributed normally enough within a group.
+But for obviously different distributions (for example `bimodal distribution`_)
+the groups tend to focus on less relevant factors (such as "outlier" density).
 
 Anomaly Detection
 `````````````````
@@ -75,37 +131,23 @@ Once the trend data is divided into groups, each group has its population averag
 The start of the following group is marked as a regression (or progression)
 if the new group's average is lower (higher) then the previous group's.
 
-Metrics
-```````
-
-TODO: Only needed for current trend compliance.
-
-Following statistical metrics are used as performance trend indicators
-over the rolling window of last <N> sets of historical measurement data:
-
-- **TMM** : **Trimmed Moving Median**, median across the data set of <N>
-  values excluding the outliers. Used as a trending value and as a
-  reference for anomaly detection.
-
 Trend Compliance
 ````````````````
 
-TODO: Apply new detection algorithm also to trend compliance.
-
 Trend compliance metrics are targeted to provide an indication of trend
 changes over a short-term (i.e. weekly) and a long-term (i.e.
-quarterly), comparing the last trend value, TMM[last], to one from week
-ago, TMM[last - 1week] and to the maximum of trend values over last
-quarter except last week, max(TMM[(last - 3mths)..(last - 1week)]),
+quarterly), comparing the last group average AVG[last], to the one from week
+ago, AVG[last - 1week] and to the maximum of trend values over last
+quarter except last week, max(AVG[last - 3mths]..ANV[last - 1week]),
 respectively. This results in following trend compliance calculations:
 
-+-------------------------+---------------------------------+-----------+------------------------------------------+
-| Trend Compliance Metric | Trend Change Formula            | Value     | Reference                                |
-+=========================+=================================+===========+==========================================+
-| Short-Term Change       | (Value - Reference) / Reference | TMM[last] | TMM[last - 1week]                        |
-+-------------------------+---------------------------------+-----------+------------------------------------------+
-| Long-Term Change        | (Value - Reference) / Reference | TMM[last] | max(TMM[(last - 3mths)..(last - 1week)]) |
-+-------------------------+---------------------------------+-----------+------------------------------------------+
++-------------------------+---------------------------------+-----------+-------------------------------------------+
+| Trend Compliance Metric | Trend Change Formula            | Value     | Reference                                 |
++=========================+=================================+===========+===========================================+
+| Short-Term Change       | (Value - Reference) / Reference | AVG[last] | AVG[last - 1week]                         |
++-------------------------+---------------------------------+-----------+-------------------------------------------+
+| Long-Term Change        | (Value - Reference) / Reference | AVG[last] | max(AVG[last - 3mths]..AVG[last - 1week]) |
++-------------------------+---------------------------------+-----------+-------------------------------------------+
 
 Trend Presentation
 ------------------
@@ -131,10 +173,11 @@ associated gruop averages. The graphs are constructed as follows:
 - Y-axis represents MRR throughput in Mpps.
 - Markers to indicate anomaly classification:
 
-  - Outlier - gray circle around MRR value point.
   - Regression - red circle.
   - Progression - green circle.
 
+- The line shows average of each group.
+
 In addition the graphs show dynamic labels while hovering over graph
 data points, representing (trend job build Id, MRR value) and the actual
 vpp build number (b<XXX>) tested.
@@ -184,12 +227,12 @@ PA is defined as follows:
 
 3. Re-calculate new groups and their averages.
 
-4. Evaluate new test data (TODO: Update.):
+4. Evaluate new test data:
 
-   a) If within the range of (TMA +/- 3*TMSD) => Result = Pass,
+   a) If the existing group is prolonged => Result = Pass,
       Reason = Normal. (to be updated base on the final Jenkins code).
-   b) If below the range => Result = Fail, Reason = Regression.
-   c) If above the range => Result = Pass, Reason = Progression.
+   b) If a new group is detected with lower average => Result = Fail, Reason = Regression.
+   c) If a new group is detected with higher average => Result = Pass, Reason = Progression.
 
 5. Generate and publish results
 
@@ -206,3 +249,5 @@ The testbed HW configuration is described on
 `this FD.IO wiki page <https://wiki.fd.io/view/CSIT/CSIT_LF_testbed#FD.IO_CSIT_testbed_-_Server_HW_Configuration>`_.
 
 .. _Minimum Description Length: https://en.wikipedia.org/wiki/Minimum_description_length
+.. _Occam's razor: https://en.wikipedia.org/wiki/Occam%27s_razor
+.. _bimodal distribution: https://en.wikipedia.org/wiki/Bimodal_distribution
-- 
2.16.6