CSIT-913: Continuous Trending, Analysis and Change Detection

diff --git a/docs/cpta/introduction/index.rst b/docs/cpta/introduction/index.rst

new file mode 100644

index 0000000..aad683b
--- /dev/null
+++ b/docs/cpta/introduction/index.rst
@@ -0,0 +1,182 @@
+Introduction
+============
+
+Purpose
+-------
+
+With the increasing number of features and code changes in the FD.io VPP data
+plane codebase, it is increasingly difficult to measure and detect VPP data
+plane performance changes. Similarly, once a degradation is detected, it is
+getting harder to bisect the source code in search of the bad code change or
+addition. The problem is compounded by the large number of compute platforms
+that VPP runs on, including Intel Xeon, Intel Atom and ARM AArch64.
+
+Existing FD.io CSIT continuous performance trending test jobs help, but they
+rely on humans for anomaly detection. As such they are error-prone and
+unreliable, all the more so as the volume of data generated by these jobs is
+growing exponentially.
+
+The proposed solution is to eliminate the human factor and fully automate
+performance trending, regression and progression detection, as well as
+bisecting.
+
+This document describes a high-level design of a system for continuous
+measuring, trending and performance change detection for the FD.io VPP SW data
+plane. It builds upon the existing CSIT framework with extensions to its
+throughput testing methodology, the CSIT data analytics engine
+(PAL – Presentation-and-Analytics-Layer) and the associated Jenkins job
+definitions.
+
+Continuous Performance Trending and Analysis
+--------------------------------------------
+
+The proposed design replaces the existing CSIT performance trending jobs and
+tests with a new Performance Trending (PT) CSIT module and a separate
+Performance Analysis (PA) module. PA ingests results from PT, then analyses,
+detects and reports any performance anomalies using historical trending data
+and statistical metrics. PA also produces trending graphs with summary and
+drill-down views across all specified tests, which can be reviewed and
+inspected regularly by the FD.io developer and user community.
+
+Trend Analysis
+``````````````
+
+All measured performance trend data is treated as time-series data that can be
+modelled using a normal distribution. After trimming the outliers, the average
+and deviations from the average are used for detecting performance change
+anomalies following the three-sigma rule of thumb (a.k.a. 68-95-99.7 rule).
+
+Analysis Metrics
+````````````````
+
+The following statistical metrics are proposed as performance trend indicators
+over the rolling window of the last <N> sets of historical measurement data:
+
+    #. Quartiles Q1, Q2, Q3 – three points dividing a ranked data set into
+       four equal parts; Q2 is the median of the data.
+    #. Interquartile Range IQR=Q3-Q1 – measure of variability, used here to
+       eliminate outliers.
+    #. Outliers – extreme values that are at least 1.5*IQR below Q1, or at
+       least 1.5*IQR above Q3.
+    #. Trimmed Moving Average (TMA) – average across the data set of the rolling
+       window of <N> values without the outliers. Used here to calculate TMSD.
+    #. Trimmed Moving Standard Deviation (TMSD) – standard deviation over the
+       data set of the rolling window of <N> values without the outliers,
+       requires calculating TMA. Used here for anomaly detection.
+    #. Moving Median (MM) – median across the data set of the rolling window of
+       <N> values with all data points, including the outliers. Used here for
+       anomaly detection.
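
The metrics listed above can be sketched in a few lines of Python. This is a minimal illustration, not the PAL implementation. The function name `trend_metrics`, the "inclusive" quartile method, the use of the population standard deviation and the choice to keep values lying exactly on an outlier fence (so that a window with IQR = 0 is not trimmed to nothing) are all assumptions that the text above does not specify.

```python
import statistics


def trend_metrics(window):
    """Compute the proposed trend indicators for one rolling window.

    Hypothetical helper: quartile method, population standard deviation
    and boundary handling are assumptions, not taken from the design.
    """
    q1, q2, q3 = statistics.quantiles(window, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Assumption: values exactly on a fence are kept rather than trimmed.
    trimmed = [x for x in window if low_fence <= x <= high_fence]
    outliers = [x for x in window if x < low_fence or x > high_fence]
    tma = statistics.fmean(trimmed)
    tmsd = statistics.pstdev(trimmed, mu=tma)
    return {
        "q1": q1, "q2": q2, "q3": q3, "iqr": iqr,
        "outliers": outliers,
        "tma": tma,                       # Trimmed Moving Average
        "tmsd": tmsd,                     # Trimmed Moving Standard Deviation
        "mm": statistics.median(window),  # Moving Median, outliers included
    }


# Made-up throughput samples (Mpps) with one obvious low outlier.
metrics = trend_metrics([10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 4.0, 10.3])
print(metrics["outliers"], metrics["mm"])
```

With these made-up samples the single low value is trimmed before TMA and TMSD are calculated, while MM is still computed over all eight points.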
+
+Anomaly Detection
+`````````````````
+
+Based on the assumption that all performance measurements can be modelled using
+a normal distribution, a three-sigma rule of thumb is proposed as the main
+criterion for anomaly detection.
+
+The three-sigma rule of thumb, a.k.a. the 68–95–99.7 rule, is a shorthand for
+the percentage of values in a normal distribution that lie within a band
+around the average (mean) with a width of two, four and six standard
+deviations. More precisely, 68.27%, 95.45% and 99.73% of the result values
+should lie within one, two and three standard deviations of the mean
+respectively, see figure below.
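
The percentages quoted above follow from the normal distribution itself: the fraction of values within k standard deviations of the mean is erf(k / sqrt(2)). The short check below uses only the Python standard library; the function name `coverage_within` is illustrative.

```python
import math


def coverage_within(k):
    """Fraction of a normal distribution within k standard deviations."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"{k} sigma: {coverage_within(k):.2%}")
# 1 sigma: 68.27%
# 2 sigma: 95.45%
# 3 sigma: 99.73%
```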
+
+To verify compliance of a test result with value X against the defined trend
+analysis metrics and detect anomalies, three simple evaluation criteria are
+proposed:
+
+::
+
+    Test Result Evaluation      Reported Result     Reported Reason     Trending Graph Markers
+    ==========================================================================================
+          Normal                      Pass              Normal            Part of plot line
+          Regression                  Fail              Regression        Red circle
+          Progression                 Pass              Progression       Green circle
+
+Jenkins job cumulative results:
+
+    #. Pass - if all detection results are Pass or Warning.
+    #. Fail - if any detection result is Fail.
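
As a sketch of how the table and the cumulative rule fit together, the mapping and the aggregation could be written as below. The names `EVALUATION`, `reported_result` and `job_result` are hypothetical. Note that the table itself never produces a Warning; the aggregation simply treats it like Pass, as the rule above states.

```python
# Evaluation -> (reported result, trending graph marker), copied from the table.
EVALUATION = {
    "Normal": ("Pass", "Part of plot line"),
    "Regression": ("Fail", "Red circle"),
    "Progression": ("Pass", "Green circle"),
}


def reported_result(evaluation):
    """Reported result ("Pass" or "Fail") for one test evaluation."""
    return EVALUATION[evaluation][0]


def job_result(results):
    """Cumulative job result: Fail if any detection result is Fail."""
    return "Fail" if "Fail" in results else "Pass"


per_test = [reported_result(e) for e in ("Normal", "Progression", "Regression")]
print(job_result(per_test))  # Fail
```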
+
+Performance Trending (PT)
+`````````````````````````
+
+CSIT PT runs regular performance test jobs finding MRR, PDR and NDR per test
+case. PT is designed as follows:
+
+    #. PT job triggers:
+
+        #. Periodic e.g. daily.
+        #. On-demand, Gerrit-triggered.
+        #. Other periodic TBD.
+
+    #. Measurements and calculations per test case:
+
+        #. MRR – Max Received Rate
+
+            #. Measured: Unlimited tolerance of packet loss.
+            #. Send packets at link rate, count total received packets, divide
+               by test trial period.
+
+        #. Optimized binary search bounds for PDR and NDR tests:
+
+            #. Calculated: High and low bounds for binary search based on MRR
+               and pre-defined Packet Loss Ratio (PLR).
+            #. HighBound=MRR, LowBound=to-be-determined.
+            #. PLR – acceptable loss ratio for PDR tests, currently set to 0.5%
+               for all performance tests.
+
+        #. PDR and NDR:
+
+            #. Run binary search within the calculated bounds, find PDR and NDR.
+            #. Measured: PDR – Partial Drop Rate, limited non-zero tolerance of
+               packet loss.
+            #. Measured: NDR – Non Drop Rate, zero packet loss.
+
+    #. Archive MRR, PDR and NDR per test case.
+    #. Archive counters collected at MRR, PDR and NDR.
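
The measurement steps above can be illustrated with a small sketch. It is not CSIT code: `max_received_rate`, `search_rate` and `fake_loss_ratio` are hypothetical names, the loss model stands in for real traffic generator trials, and the low bound of 0 is an assumption because the design leaves LowBound to be determined.

```python
def max_received_rate(received_packets, trial_duration_s):
    """MRR: packets received while sending at link rate, per second."""
    return received_packets / trial_duration_s


def search_rate(measure_loss_ratio, low, high, allowed_loss_ratio, precision):
    """Binary search for the highest rate whose loss ratio is acceptable.

    measure_loss_ratio(rate) is a stand-in for one measurement trial.
    Returns None if no tested rate meets the loss limit.
    """
    best = None
    while high - low > precision:
        rate = (low + high) / 2.0
        if measure_loss_ratio(rate) <= allowed_loss_ratio:
            best, low = rate, rate   # acceptable loss: search higher
        else:
            high = rate              # too much loss: search lower
    return best


def fake_loss_ratio(rate_pps):
    """Made-up device model: loss-free up to 6 Mpps, then growing loss."""
    return max(0.0, (rate_pps - 6e6) / 2e8)


mrr = max_received_rate(received_packets=98_000_000, trial_duration_s=10.0)
# HighBound=MRR as in the design; LowBound=0 is an assumption.
ndr = search_rate(fake_loss_ratio, 0.0, mrr, allowed_loss_ratio=0.0,
                  precision=1e4)
pdr = search_rate(fake_loss_ratio, 0.0, mrr, allowed_loss_ratio=0.005,
                  precision=1e4)
print(f"MRR={mrr:.0f} NDR={ndr:.0f} PDR={pdr:.0f}")
```

Using MRR as the upper bound shortens the search, since neither PDR nor NDR is searched for above the rate the device actually forwarded at link rate.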
+
+Performance Analysis (PA)
+`````````````````````````
+
+CSIT PA runs performance analysis, change detection and trending using specified
+trend analysis metrics over the rolling window of the last <N> sets of
+historical measurement data. PA is defined as follows:
+
+    #. PA job triggers:
+
+        #. By PT job at its completion.
+        #. On-demand, Gerrit-triggered.
+        #. Other periodic TBD.
+
+    #. Download and parse archived historical data and the new data:
+
+        #. New data from the latest PT job is evaluated against the rolling
+           window of <N> sets of historical data.
+        #. Download Robot Framework (RF) output.xml files and compressed
+           archived data.
+        #. Parse out the data, filtering for the test cases listed in the PA
+           specification (part of the CSIT PAL specification file).
+
+    #. Calculate trend metrics for the rolling window of <N> sets of historical data:
+
+        #. Calculate quartiles Q1, Q2, Q3.
+        #. Trim outliers using IQR.
+        #. Calculate TMA and TMSD.
+        #. Calculate normal trending range per test case based on TMA and TMSD.
+
+    #. Evaluate new test data against trend metrics:
+
+        #. If within the range of (TMA +/- 3*TMSD) => Result = Pass,
+           Reason = Normal.
+        #. If below the range => Result = Fail, Reason = Regression.
+        #. If above the range => Result = Pass, Reason = Progression.
+
+    #. Generate and publish results:
+
+        #. Relay evaluation result to job result.
+        #. Generate a new set of trend analysis summary graphs and drill-down
+           graphs.
+
+            #. Summary graphs to include measured values with Normal,
+               Progression and Regression markers. MM shown in the background if
+               possible.
+            #. Drill-down graphs to include MM, TMA and TMSD.
+
+        #. Publish trend analysis graphs in HTML format.
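
The "calculate trend metrics" and "evaluate new test data" steps of PA can be sketched as a single function. This is an illustration under stated assumptions rather than the PAL code: the name `evaluate`, the "inclusive" quartile method, the population standard deviation and the inclusive outlier fences are not specified by the design.

```python
import statistics


def evaluate(history, new_value, window_size):
    """Classify a new result against the last <N> historical results.

    Returns (result, reason) as defined in the evaluation steps above.
    Hypothetical helper; statistical details are assumptions.
    """
    window = history[-window_size:]
    q1, _, q3 = statistics.quantiles(window, n=4, method="inclusive")
    iqr = q3 - q1
    # Trim outliers using IQR (values exactly on a fence are kept).
    trimmed = [x for x in window
               if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]
    tma = statistics.fmean(trimmed)
    tmsd = statistics.pstdev(trimmed, mu=tma)
    # Normal trending range is TMA +/- 3*TMSD.
    if new_value < tma - 3 * tmsd:
        return ("Fail", "Regression")
    if new_value > tma + 3 * tmsd:
        return ("Pass", "Progression")
    return ("Pass", "Normal")


# Made-up throughput history (Mpps), one outlier at 4.0.
history = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 4.0, 10.3]
print(evaluate(history, 10.1, window_size=8))  # ('Pass', 'Normal')
print(evaluate(history, 9.0, window_size=8))   # ('Fail', 'Regression')
print(evaluate(history, 11.0, window_size=8))  # ('Pass', 'Progression')
```

Because the outlier is trimmed before TMA and TMSD are computed, a single bad historical sample does not widen the normal range enough to hide a later regression.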