Overview
--------

This document describes a high-level design of a system for continuous
performance measuring, trending and change detection for the FD.io VPP SW
data plane (and other performance tests run within the CSIT sub-project).

The system consists of a Performance Trending (PT) CSIT module and a
separate Performance Analysis (PA) module that ingests results from PT
and analyses, detects and reports any performance anomalies using
historical data and statistical metrics. PA also produces a trending
dashboard, a list of failed tests and graphs with summary and
drill-down views across all specified tests, which can be reviewed and
inspected regularly by the FD.io developer and user community.

Performance Tests
-----------------

Performance trending relies on Maximum Receive Rate (MRR) tests.
MRR tests measure the packet forwarding rate, in multiple trials of set
duration, under the maximum load offered by the traffic generator,
regardless of packet loss. The maximum load for the specified Ethernet
frame size is set to the bi-directional link rate.

Current parameters for performance trending MRR tests:

- For 10GE NICs the maximum packet rate load is 2 * 14.88 Mpps for 64B,
  a 10GE bi-directional link rate.
- For 40GE NICs the maximum packet rate load is 2 * 18.75 Mpps for 64B,
  a 40GE bi-directional link sub-rate limited by the packet forwarding
  capacity of the 2-port 40GE NIC model (XL710) used on the T-Rex
  Traffic Generator.
- **Trial duration**: 1 sec.
- **Number of trials per test**: 10.
- **Test execution frequency**: twice a day, every 12 hrs (02:00,
  14:00 UTC).

Note: MRR tests should be reporting the bi-directional link rate (or
NIC rate, if lower) if the tested VPP configuration can handle the
offered packet rate.
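The maximum offered loads above follow directly from the Ethernet
framing overhead (7 B preamble, 1 B start-of-frame delimiter and 12 B
inter-frame gap on top of the L2 frame). Below is a minimal arithmetic
sketch, not part of the CSIT code; the function name is illustrative
and the 18.75 Mpps per-direction cap is the XL710 limit quoted above:

.. code-block:: python

    # Maximum offered packet rate per direction, derived from link rate
    # and L2 frame size; 20 B per frame covers preamble, SFD and the
    # inter-frame gap.
    L1_OVERHEAD_BYTES = 20

    def max_packet_rate(link_rate_bps, frame_size_bytes, nic_cap_pps=None):
        """Per-direction packet rate of a fully loaded link, optionally
        capped by the forwarding capacity of the TG NIC."""
        rate = link_rate_bps / ((frame_size_bytes + L1_OVERHEAD_BYTES) * 8)
        return min(rate, nic_cap_pps) if nic_cap_pps else rate

    # 10GE, 64B: ~14.88 Mpps per direction, 2 * 14.88 Mpps bi-directionally.
    print(2 * max_packet_rate(10e9, 64) / 1e6)                       # ~29.76
    # 40GE, 64B: capped at 18.75 Mpps per direction by the 2-port XL710.
    print(2 * max_packet_rate(40e9, 64, nic_cap_pps=18.75e6) / 1e6)  # 37.5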
Trend Analysis
--------------

All measured performance trend data is treated as time-series data that
can be modelled as a concatenation of groups, each group modelled using
a normal distribution. While sometimes the samples within a group are
far from being distributed normally, currently we do not have a better
tractable model.

Here, a "sample" should be the result of a single trial measurement,
with group boundaries set only at test run granularity. But in order to
avoid detecting causes unrelated to VPP performance, the default
presentation (without /new/ in the URL) takes the average of all trials
within the run as the sample. Effectively, this acts as a single trial
with aggregate duration.

Performance graphs always show the run average (not all trial results).
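A minimal sketch of the run-average convention described above
(hypothetical trial values, not CSIT code):

.. code-block:: python

    # One test run = 10 MRR trials of 1 s each (values in Mpps, made up).
    trials = [9.81, 9.85, 9.79, 9.88, 9.83, 9.80, 9.86, 9.82, 9.84, 9.81]

    # The default presentation uses the average of all trials within the
    # run as a single sample, acting like one trial of aggregate duration.
    sample = sum(trials) / len(trials)
    print(f"run-average sample: {sample:.3f} Mpps")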
The group boundaries are selected based on the `Minimum Description
Length`_ principle.

Minimum Description Length
--------------------------

`Minimum Description Length`_ (MDL) is a particular formalization of
the `Occam's razor`_ principle.

The general formulation mandates evaluating a large set of models, but
for anomaly detection purposes it is useful to consider a smaller set
of models, so that scoring and comparing them is easier.

For each candidate model, the data should be compressed losslessly; the
compressed message includes the model definition, the encoded model
parameters, and the raw data encoded based on probabilities computed by
the model. The model resulting in the shortest compressed message is
taken to be "the" correct model.

For our model set (groups of normally distributed samples), we need to
encode the group length (which penalizes too many groups), the group
average (more on that later), the group stdev, and then all the
samples.

Luckily, the "all the samples" part turns out to be quite easy to
compute. If sample values are considered as coordinates in
(multi-dimensional) Euclidean space, fixing the stdev means the point
with allowed coordinates lies on a sphere. Fixing the average
intersects the sphere with a (hyper-)plane, and the Gaussian
probability density on the resulting sphere is constant. So the only
contribution is the "area" of the sphere, which depends only on the
number of samples and the stdev.

A somewhat ambiguous part is choosing which encoding is used for the
group size, average and stdev. Different encodings cause different
biases towards large or small values. In our implementation we have
chosen a probability density corresponding to a uniform distribution
(from zero to the maximal sample value) for the stdev and the average
of the first group, but for averages of subsequent groups we have
chosen a distribution which discourages delimiting groups with averages
close together.

Our implementation assumes that the measurement precision is 1.0 pps,
so it is slightly wrong for trial durations other than 1.0 seconds.
Also, all the calculations assume that 1.0 pps is totally negligible
compared to the stdev value.

The group selection algorithm currently has no parameters; all the
aforementioned encodings and the handling of precision are hardcoded.
In principle, every group selection is examined, and the one encodable
with the least amount of bits is selected. As the bit amount for a
selection is just the sum of bits for every group, finding the best
selection takes a number of comparisons that increases quadratically
with the size of the data, the overall time complexity probably being
cubic.

The resulting group distribution looks good if the samples are
distributed normally enough within a group. But for obviously different
distributions (for example a `bimodal distribution`_) the groups tend
to focus on less relevant factors (such as "outlier" density).
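The following is a simplified, self-contained sketch of the group
selection idea, not the actual CSIT implementation: the per-group bit
cost for the samples follows the normal-model reasoning above (with the
1.0 pps precision as a floor), while the fixed 32-bit parameter penalty
is an invented stand-in for the real encodings of group length, average
and stdev, and the search is written as dynamic programming over split
points:

.. code-block:: python

    import math

    def group_bits(samples, param_penalty_bits=32.0):
        """Approximate bits to encode one group: a crude fixed penalty for
        the group parameters plus the cost of the samples themselves under
        a normal model, assuming 1.0 pps measurement precision."""
        n = len(samples)
        avg = sum(samples) / n
        var = sum((x - avg) ** 2 for x in samples) / n
        var = max(var, 1.0)  # 1.0 pps precision acts as a lower bound
        return param_penalty_bits + 0.5 * n * math.log2(2 * math.pi * math.e * var)

    def best_grouping(samples):
        """Pick the group selection encodable with the least amount of bits;
        the total is just the sum of per-group bits, as described above."""
        n = len(samples)
        best = [0.0] + [float("inf")] * n   # best[i]: min bits for samples[:i]
        cut = [0] * (n + 1)
        for i in range(1, n + 1):
            for j in range(i):
                bits = best[j] + group_bits(samples[j:i])
                if bits < best[i]:
                    best[i], cut[i] = bits, j
        groups, i = [], n
        while i > 0:                        # recover the chosen boundaries
            groups.append(samples[cut[i]:i])
            i = cut[i]
        return list(reversed(groups))

    # Hypothetical trend samples (pps): a visible drop after the first six runs.
    data = [9.9e6, 9.8e6, 9.9e6, 9.85e6, 9.9e6, 9.87e6, 8.1e6, 8.0e6, 8.05e6]
    print([round(sum(g) / len(g) / 1e6, 2) for g in best_grouping(data)])
    # -> [9.87, 8.05]

The real implementation differs in how it encodes the group parameters,
but the "selection with the smallest total bit count wins" rule is the
same idea.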
Anomaly Detection
`````````````````

Once the trend data is divided into groups, each group has its
population average. The start of the following group is marked as a
regression (or progression) if the new group's average is lower
(higher) than the previous group's.

In the text below, "average at time <t>", shorthand "AVG[t]", means
"the group average of the group that the sample at time <t> belongs
to".

Trend Compliance
````````````````

Trend compliance metrics are targeted to provide an indication of trend
changes over a short term (i.e. weekly) and a long term (i.e.
quarterly), comparing the last group average, AVG[last], to the one
from a week ago, AVG[last - 1week], and to the maximum of group
averages over the last quarter except the last week,
max(AVG[last - 3mths]..AVG[last - 1week]), respectively. This results
in the following trend compliance calculations:

+-------------------------+---------------------------------+-----------+-------------------------------------------+
| Trend Compliance Metric | Trend Change Formula            | Value     | Reference                                 |
+=========================+=================================+===========+===========================================+
| Short-Term Change       | (Value - Reference) / Reference | AVG[last] | AVG[last - 1week]                         |
+-------------------------+---------------------------------+-----------+-------------------------------------------+
| Long-Term Change        | (Value - Reference) / Reference | AVG[last] | max(AVG[last - 3mths]..AVG[last - 1week]) |
+-------------------------+---------------------------------+-----------+-------------------------------------------+
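A minimal sketch of the two compliance calculations (illustrative
values and variable names only, assuming one group average recorded per
day, newest last):

.. code-block:: python

    # AVG[t] per day over roughly one quarter (hypothetical values, Mpps).
    avg_by_day = [10.0] * 60 + [9.9] * 23 + [9.6] * 7

    last = avg_by_day[-1]               # AVG[last]
    week_ago = avg_by_day[-8]           # AVG[last - 1week]
    quarter_max = max(avg_by_day[:-7])  # max over last quarter except last week

    def relative_change(value, reference):
        """(Value - Reference) / Reference, the trend change formula above."""
        return (value - reference) / reference

    print(relative_change(last, week_ago))     # short-term change, ~ -0.03
    print(relative_change(last, quarter_max))  # long-term change,  -0.04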
Trend Presentation
------------------

Performance Dashboard
`````````````````````

Dashboard tables list a summary of per-test-case VPP MRR performance
trend and trend compliance metrics and the detected number of
anomalies.

Separate tables are generated for each testbed and each tested number
of physical cores for VPP workers (1c, 2c, 4c). Test case names are
linked to the respective trending graphs for ease of navigation through
the test data.

Failed tests
````````````

The Failed tests tables list the tests which failed over the specified
seven-day period, together with the number of failures over the period
and the last failure details - Time, VPP-Build-Id and CSIT-Job-Build-Id.

Separate tables are generated for each testbed. Test case names are
linked to the respective trending graphs for ease of navigation through
the test data.

Trendline Graphs
````````````````

Trendline graphs show the measured per-run averages of MRR values, the
group average values, and the detected anomalies. The graphs are
constructed as follows:

- X-axis represents the date in the format MMDD.
- Y-axis represents the run-average MRR value in Mpps.
- Markers to indicate anomaly classification:

  - Regression - red circle.
  - Progression - green circle.

- The line shows the average MRR value of each group.

In addition, the graphs show dynamic labels while hovering over graph
data points, presenting the CSIT build date, the measured MRR value,
the VPP reference, the trend job build ID and the LF testbed ID.

Jenkins Jobs
------------

Performance Trending (PT)
`````````````````````````

CSIT PT runs regular performance test jobs measuring and collecting MRR
data per test case. PT is designed as follows:

1. PT job triggers:

   a) Periodic, e.g. twice a day.
   b) On-demand, gerrit triggered.

2. Measurements and data calculations per test case:

   a) Max Received Rate (MRR) - for each trial measurement, send packets
      at link rate for the trial duration, count the total received
      packets and divide by the trial duration (see the sketch below
      this list).

3. Archive MRR values per test case.

4. Archive all counters collected at MRR.
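Item 2a above amounts to the following per-trial calculation (a minimal
sketch with hypothetical counter values, not the CSIT test code):

.. code-block:: python

    # Received-packet counters for the 10 trials of one test run,
    # each trial lasting 1 second (hypothetical values).
    trial_duration_s = 1.0
    received = [9_870_000, 9_810_000, 9_920_000, 9_850_000, 9_880_000,
                9_830_000, 9_900_000, 9_860_000, 9_840_000, 9_890_000]

    # Max Received Rate per trial: total received packets / trial duration.
    mrr_per_trial_pps = [count / trial_duration_s for count in received]

    # These per-test-case MRR values are what gets archived (step 3 above).
    print([round(rate / 1e6, 2) for rate in mrr_per_trial_pps])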
Performance Analysis (PA)
`````````````````````````

CSIT PA runs the performance analysis, including the anomaly detection
described above. PA is defined as follows:

1. PA job triggers:

   a) By PT jobs at their completion.
   b) On-demand, gerrit triggered.

2. Download and parse the archived historical data and the new data:

   a) Download RF output.xml files from the latest PT job and the
      compressed archived data from Nexus.
   b) Parse out the data, filtering test cases listed in the PA
      specification (part of the CSIT PAL specification file).

3. Re-calculate the new groups and their averages.

4. Evaluate the new test data:

   a) If the existing group is prolonged => Result = Pass,
      Reason = Normal.
   b) If a new group is detected with a lower average => Result = Fail,
      Reason = Regression.
   c) If a new group is detected with a higher average => Result = Pass,
      Reason = Progression.

5. Generate and publish results:

   a) Relay the evaluation result to the job result.
   b) Generate a new set of trend summary dashboards, the list of
      failed tests and the graphs.
   c) Publish the trend dashboard and graphs in html format on
      https://docs.fd.io/.
   d) Generate an alerting email. This email is sent by Jenkins to
      csit-report@lists.fd.io.

Testbed HW configuration
------------------------

The testbed HW configuration is described on
`this FD.IO wiki page `_.

.. _Minimum Description Length: https://en.wikipedia.org/wiki/Minimum_description_length
.. _Occam's razor: https://en.wikipedia.org/wiki/Occam%27s_razor
.. _bimodal distribution: https://en.wikipedia.org/wiki/Bimodal_distribution