From: Vratko Polak Date: Tue, 16 Apr 2019 16:59:33 +0000 (+0200) Subject: Add perpatch info to cpta methodology X-Git-Url: https://gerrit.fd.io/r/gitweb?p=csit.git;a=commitdiff_plain;h=4f2d0c379b50b66e70d9615fc8425cd4772f7738 Add perpatch info to cpta methodology Also, split methodology file into multiple, per section. Change-Id: I973b93d1a99205d7adb80996a3657215e05b8985 Signed-off-by: Vratko Polak --- diff --git a/docs/cpta/methodology/index.rst b/docs/cpta/methodology/index.rst index 7d7604bee8..cbcfcb50cb 100644 --- a/docs/cpta/methodology/index.rst +++ b/docs/cpta/methodology/index.rst @@ -3,273 +3,12 @@ Trending Methodology ==================== -Overview --------- - -This document describes a high-level design of a system for continuous -performance measuring, trending and change detection for FD.io VPP SW -data plane (and other performance tests run within CSIT sub-project). - -There is a Performance Trending (PT) CSIT module, and a separate -Performance Analysis (PA) module ingesting results from PT and -analysing, detecting and reporting any performance anomalies using -historical data and statistical metrics. PA does also produce -trending dashboard, list of failed tests and graphs with summary and -drill-down views across all specified tests that can be reviewed and -inspected regularly by FD.io developers and users community. - -Performance Tests ------------------ - -Performance trending relies on Maximum Receive Rate (MRR) tests. -MRR tests measure the packet forwarding rate, in multiple trials of set -duration, under the maximum load offered by traffic generator -regardless of packet loss. Maximum load for specified Ethernet frame -size is set to the bi-directional link rate. 
- -Current parameters for performance trending MRR tests: - -- **Ethernet frame sizes**: 64B (78B for IPv6 tests) for all tests, IMIX for - selected tests (vhost, memif); all quoted sizes include frame CRC, but - exclude per frame transmission overhead of 20B (preamble, inter frame - gap). -- **Maximum load offered**: 10GE and 40GE link (sub-)rates depending on NIC - tested, with the actual packet rate depending on frame size, - transmission overhead and traffic generator NIC forwarding capacity. - - - For 10GE NICs the maximum packet rate load is 2* 14.88 Mpps for 64B, - a 10GE bi-directional link rate. - - For 40GE NICs the maximum packet rate load is 2* 18.75 Mpps for 64B, - a 40GE bi-directional link sub-rate limited by the packet forwarding - capacity of 2-port 40GE NIC model (XL710) used on T-Rex Traffic - Generator. - -- **Trial duration**: 1 sec. -- **Number of trials per test**: 10. -- **Test execution frequency**: twice a day, every 12 hrs (02:00, - 14:00 UTC). - -Note: MRR tests should be reporting bi-directional link rate (or NIC -rate, if lower) if tested VPP configuration can handle the packet rate -higher than bi-directional link rate, e.g. large packet tests and/or -multi-core tests. In other words MRR = min(VPP rate, bi-dir link rate, -NIC rate). - -Trend Analysis --------------- - -All measured performance trend data is treated as time-series data that -can be modelled as concatenation of groups, each group modelled -using normal distribution. While sometimes the samples within a group -are far from being distributed normally, currently we do not have a -better tractable model. - -Here, "sample" should be the result of single trial measurement, -with group boundaries set only at test run granularity. -But in order to avoid detecting causes unrelated to VPP performance, -the default presentation (without /new/ in URL) -takes average of all trials within the run as the sample. -Effectively, this acts as a single trial with aggregate duration. 
- -Performance graphs always show the run average (not all trial results). - -The group boundaries are selected based on `Minimum Description Length`_. - -Minimum Description Length --------------------------- - -`Minimum Description Length`_ (MDL) is a particular formalization -of `Occam's razor`_ principle. - -The general formulation mandates to evaluate a large set of models, -but for anomaly detection purposes, it is useful to consider -a smaller set of models, so that scoring and comparing them is easier. - -For each candidate model, the data should be compressed losslessly, -which includes model definitions, encoded model parameters, -and the raw data encoded based on probabilities computed by the model. -The model resulting in shortest compressed message is the "the" correct model. - -For our model set (groups of normally distributed samples), -we need to encode group length (which penalizes too many groups), -group average (more on that later), group stdev and then all the samples. - -Luckily, the "all the samples" part turns out to be quite easy to compute. -If sample values are considered as coordinates in (multi-dimensional) -Euclidean space, fixing stdev means the point with allowed coordinates -lays on a sphere. Fixing average intersects the sphere with a (hyper)-plane, -and Gaussian probability density on the resulting sphere is constant. -So the only contribution is the "area" of the sphere, which only depends -on the number of samples and stdev. - -A somehow ambiguous part is in choosing which encoding -is used for group size, average and stdev. -Different encodings cause different biases to large or small values. -In our implementation we have chosen probability density -corresponding to uniform distribution (from zero to maximal sample value) -for stdev and average of the first group, -but for averages of subsequent groups we have chosen a distribution -which disourages delimiting groups with averages close together. 
- -Our implementation assumes that measurement precision is 1.0 pps. -Thus it is slightly wrong for trial durations other than 1.0 seconds. -Also, all the calculations assume 1.0 pps is totally negligible, -compared to stdev value. - -The group selection algorithm currently has no parameters, -all the aforementioned encodings and handling of precision is hardcoded. -In principle, every group selection is examined, and the one encodable -with least amount of bits is selected. -As the bit amount for a selection is just sum of bits for every group, -finding the best selection takes number of comparisons -quadratically increasing with the size of data, -the overall time complexity being probably cubic. - -The resulting group distribution looks good -if samples are distributed normally enough within a group. -But for obviously different distributions (for example `bimodal distribution`_) -the groups tend to focus on less relevant factors (such as "outlier" density). - -Anomaly Detection -````````````````` - -Once the trend data is divided into groups, each group has its population average. -The start of the following group is marked as a regression (or progression) -if the new group's average is lower (higher) then the previous group's. - -In the text below, "average at time ", shorthand "AVG[t]" -means "the group average of the group the sample at time belongs to". - -Trend Compliance -```````````````` - -Trend compliance metrics are targeted to provide an indication of trend -changes over a short-term (i.e. weekly) and a long-term (i.e. -quarterly), comparing the last group average AVG[last], to the one from week -ago, AVG[last - 1week] and to the maximum of trend values over last -quarter except last week, max(AVG[last - 3mths]..ANV[last - 1week]), -respectively. 
This results in following trend compliance calculations: - -+-------------------------+---------------------------------+-----------+-------------------------------------------+ -| Trend Compliance Metric | Trend Change Formula | Value | Reference | -+=========================+=================================+===========+===========================================+ -| Short-Term Change | (Value - Reference) / Reference | AVG[last] | AVG[last - 1week] | -+-------------------------+---------------------------------+-----------+-------------------------------------------+ -| Long-Term Change | (Value - Reference) / Reference | AVG[last] | max(AVG[last - 3mths]..AVG[last - 1week]) | -+-------------------------+---------------------------------+-----------+-------------------------------------------+ - -Trend Presentation ------------------- - -Performance Dashboard -````````````````````` - -Dashboard tables list a summary of per test-case VPP MRR performance -trend and trend compliance metrics and detected number of anomalies. - -Separate tables are generated for each testbed and each tested number of -physical cores for VPP workers (1c, 2c, 4c). Test case names are linked to -respective trending graphs for ease of navigation through the test data. - -Failed tests -```````````` - -The Failed tests tables list the tests which failed over the specified seven- -day period together with the number of fails over the period and last failure -details - Time, VPP-Build-Id and CSIT-Job-Build-Id. - -Separate tables are generated for each testbed. Test case names are linked to -respective trending graphs for ease of navigation through the test data. - -Trendline Graphs -```````````````` - -Trendline graphs show measured per run averages of MRR values, -group average values, and detected anomalies. -The graphs are constructed as follows: - -- X-axis represents the date in the format MMDD. -- Y-axis represents run-average MRR value in Mpps. 
-- Markers to indicate anomaly classification: - - - Regression - red circle. - - Progression - green circle. - -- The line shows average MRR value of each group. - -In addition the graphs show dynamic labels while hovering over graph -data points, presenting the CSIT build date, measured MRR value, VPP -reference, trend job build ID and the LF testbed ID. - -Jenkins Jobs ------------- - -Performance Trending (PT) -````````````````````````` - -CSIT PT runs regular performance test jobs measuring and collecting MRR -data per test case. PT is designed as follows: - -1. PT job triggers: - - a) Periodic e.g. twice a day. - b) On-demand gerrit triggered. - -2. Measurements and data calculations per test case: - - a) Max Received Rate (MRR) - for each trial measurement, - send packets at link rate for trial duration, - count total received packets, divide by trial duration. - -3. Archive MRR values per test case. -4. Archive all counters collected at MRR. - -Performance Analysis (PA) -````````````````````````` - -CSIT PA runs performance analysis -including anomaly detection as described above. -PA is defined as follows: - -1. PA job triggers: - - a) By PT jobs at their completion. - b) On-demand gerrit triggered. - -2. Download and parse archived historical data and the new data: - - a) Download RF output.xml files from latest PT job and compressed - archived data from nexus. - b) Parse out the data filtering test cases listed in PA specification - (part of CSIT PAL specification file). - -3. Re-calculate new groups and their averages. - -4. Evaluate new test data: - - a) If the existing group is prolonged => Result = Pass, - Reason = Normal. - b) If a new group is detected with lower average => - Result = Fail, Reason = Regression. - c) If a new group is detected with higher average => - Result = Pass, Reason = Progression. - -5. Generate and publish results - - a) Relay evaluation result to job result. 
- b) Generate a new set of trend summary dashboard, list of failed - tests and graphs. - c) Publish trend dashboard and graphs in html format on - https://docs.fd.io/. - d) Generate an alerting email. This email is sent by Jenkins to - csit-report@lists.fd.io - -Testbed HW configuration ------------------------- - -The testbed HW configuration is described on -`this FD.IO wiki page `_. - -.. _Minimum Description Length: https://en.wikipedia.org/wiki/Minimum_description_length -.. _Occam's razor: https://en.wikipedia.org/wiki/Occam%27s_razor -.. _bimodal distribution: https://en.wikipedia.org/wiki/Bimodal_distribution +.. toctree:: + + overview + performance_tests + trend_analysis + trend_presentation + jenkins_jobs + testbed_hw_configuration + perpatch_performance_tests diff --git a/docs/cpta/methodology/jenkins_jobs.rst b/docs/cpta/methodology/jenkins_jobs.rst new file mode 100644 index 0000000000..677e0bc748 --- /dev/null +++ b/docs/cpta/methodology/jenkins_jobs.rst @@ -0,0 +1,62 @@ +Jenkins Jobs +------------ + +Performance Trending (PT) +````````````````````````` + +CSIT PT runs regular performance test jobs measuring and collecting MRR +data per test case. PT is designed as follows: + +1. PT job triggers: + + a) Periodic e.g. twice a day. + b) On-demand gerrit triggered. + +2. Measurements and data calculations per test case: + + a) Max Received Rate (MRR) - for each trial measurement, + send packets at link rate for trial duration, + count total received packets, divide by trial duration. + +3. Archive MRR values per test case. +4. Archive all counters collected at MRR. + +Performance Analysis (PA) +````````````````````````` + +CSIT PA runs performance analysis +including anomaly detection as described above. +PA is defined as follows: + +1. PA job triggers: + + a) By PT jobs at their completion. + b) On-demand gerrit triggered. + +2. 
Download and parse archived historical data and the new data: + + a) Download RF output.xml files from latest PT job and compressed + archived data from nexus. + b) Parse out the data filtering test cases listed in PA specification + (part of CSIT PAL specification file). + +3. Re-calculate new groups and their averages. + +4. Evaluate new test data: + + a) If the existing group is prolonged => Result = Pass, + Reason = Normal. + b) If a new group is detected with lower average => + Result = Fail, Reason = Regression. + c) If a new group is detected with higher average => + Result = Pass, Reason = Progression. + +5. Generate and publish results + + a) Relay evaluation result to job result. + b) Generate a new set of trend summary dashboard, list of failed + tests and graphs. + c) Publish trend dashboard and graphs in html format on + https://docs.fd.io/. + d) Generate an alerting email. This email is sent by Jenkins to + csit-report@lists.fd.io diff --git a/docs/cpta/methodology/overview.rst b/docs/cpta/methodology/overview.rst new file mode 100644 index 0000000000..ecea051116 --- /dev/null +++ b/docs/cpta/methodology/overview.rst @@ -0,0 +1,14 @@ +Overview +-------- + +This document describes a high-level design of a system for continuous +performance measuring, trending and change detection for FD.io VPP SW +data plane (and other performance tests run within CSIT sub-project). + +There is a Performance Trending (PT) CSIT module, and a separate +Performance Analysis (PA) module ingesting results from PT and +analysing, detecting and reporting any performance anomalies using +historical data and statistical metrics. PA does also produce +trending dashboard, list of failed tests and graphs with summary and +drill-down views across all specified tests that can be reviewed and +inspected regularly by FD.io developers and users community. 
diff --git a/docs/cpta/methodology/performance_tests.rst b/docs/cpta/methodology/performance_tests.rst new file mode 100644 index 0000000000..82e64f870a --- /dev/null +++ b/docs/cpta/methodology/performance_tests.rst @@ -0,0 +1,36 @@ +Performance Tests +----------------- + +Performance trending relies on Maximum Receive Rate (MRR) tests. +MRR tests measure the packet forwarding rate, in multiple trials of set +duration, under the maximum load offered by traffic generator +regardless of packet loss. Maximum load for specified Ethernet frame +size is set to the bi-directional link rate. + +Current parameters for performance trending MRR tests: + +- **Ethernet frame sizes**: 64B (78B for IPv6 tests) for all tests, IMIX for + selected tests (vhost, memif); all quoted sizes include frame CRC, but + exclude per frame transmission overhead of 20B (preamble, inter frame + gap). +- **Maximum load offered**: 10GE and 40GE link (sub-)rates depending on NIC + tested, with the actual packet rate depending on frame size, + transmission overhead and traffic generator NIC forwarding capacity. + + - For 10GE NICs the maximum packet rate load is 2* 14.88 Mpps for 64B, + a 10GE bi-directional link rate. + - For 40GE NICs the maximum packet rate load is 2* 18.75 Mpps for 64B, + a 40GE bi-directional link sub-rate limited by the packet forwarding + capacity of 2-port 40GE NIC model (XL710) used on T-Rex Traffic + Generator. + +- **Trial duration**: 1 sec. +- **Number of trials per test**: 10. +- **Test execution frequency**: twice a day, every 12 hrs (02:00, + 14:00 UTC). + +Note: MRR tests should be reporting bi-directional link rate (or NIC +rate, if lower) if tested VPP configuration can handle the packet rate +higher than bi-directional link rate, e.g. large packet tests and/or +multi-core tests. In other words MRR = min(VPP rate, bi-dir link rate, +NIC rate). 
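The load figures quoted in performance_tests.rst can be reproduced with a little arithmetic. The following is a minimal sketch, not CSIT code: the function and constant names are hypothetical, and it assumes only what the text states (20B per-frame overhead, bi-directional load, MRR capped by the lowest of the three rates).

```python
# Illustrative sketch only; names are hypothetical and not taken from CSIT.

PER_FRAME_OVERHEAD_BYTES = 20  # preamble + inter-frame gap, as quoted in the text


def link_rate_pps(link_bps: float, frame_bytes: int) -> float:
    """Unidirectional packet rate at full link utilisation."""
    bits_on_wire = (frame_bytes + PER_FRAME_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def expected_mrr_pps(vpp_rate_pps: float, bidir_link_rate_pps: float,
                     nic_rate_pps: float) -> float:
    """MRR = min(VPP rate, bi-dir link rate, NIC rate)."""
    return min(vpp_rate_pps, bidir_link_rate_pps, nic_rate_pps)


# 10GE, 64B frames: 2 * 14.88 Mpps bi-directional.
bidir_10ge_64b = 2 * link_rate_pps(10e9, 64)
print(f"{bidir_10ge_64b / 1e6:.2f} Mpps")  # prints 29.76 Mpps

# 40GE, 64B frames: the 2 * 18.75 Mpps NIC limit is below the link rate,
# so the NIC rate is the cap (the 50 Mpps VPP rate is a made-up input).
capped = expected_mrr_pps(50e6, 2 * link_rate_pps(40e9, 64), 2 * 18.75e6)
```

With these inputs the 40GE case is bounded by the NIC forwarding capacity rather than by the link, which matches the "sub-rate" wording in the parameter list.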
diff --git a/docs/cpta/methodology/perpatch_performance_tests.rst b/docs/cpta/methodology/perpatch_performance_tests.rst
new file mode 100644
index 0000000000..c1d3d669b1
--- /dev/null
+++ b/docs/cpta/methodology/perpatch_performance_tests.rst
@@ -0,0 +1,86 @@
+Per-patch performance tests
+---------------------------
+
+Updated for CSIT git commit id: 661035ac4ce6e51649f302fe2b7a8218257c0587.
+
+A methodology similar to trending analysis is used for comparing performance
+before a DUT code change is merged. This can act as a verify job to disallow
+changes which would decrease performance without a good reason.
+
+Existing jobs
+`````````````
+
+VPP is the only project currently using such jobs.
+They are not started automatically; they must be triggered on demand.
+They allow full tag expressions, but some tags are enforced (such as MRR).
+
+Only the three types of testbed based on Xeon processors have jobs created.
+Their Gerrit trigger words are "perftest-3n-hsw", "perftest-3n-skx"
+and "perftest-2n-skx".
+
+If additional arguments are added to the Gerrit trigger, they are treated
+as Robot tag expressions to select tests to run. For more details
+on existing tags, see `tag documentation rst file`_.
+
+Basic operation
+```````````````
+
+The job builds VPP .deb packages for both the patch under test
+(called "current") and its parent patch (called "parent").
+
+For each test (from a set defined by tag expression),
+both builds are subjected to several trial measurements (BMRR).
+Measured samples are grouped into a "parent" sequence,
+followed by a "current" sequence. The same Minimum Description Length
+algorithm as in trending is used to decide whether it is one big group,
+or two smaller groups. If it is one group, a "normal" result
+is declared for the test. If it is two groups, and current average
+is less than parent average, the test is declared a regression.
+If it is two groups and current average is larger or equal,
+the test is declared a progression.
+
+The whole job fails (giving -1) if some trial measurement failed,
+or if any test was declared a regression.
+
+Temporary specifics
+```````````````````
+
+The Minimum Description Length analysis is performed by
+jumpavg-0.1.3 available on PyPI.
+
+In hopes of strengthening the signal (code performance) compared to noise
+(all other factors influencing the measured values), several workarounds
+are applied.
+
+In contrast to trending, trial duration is set to 10 seconds,
+and only 5 samples are measured for each build.
+Both parameters are set in ci-management.
+
+This decreases sensitivity to regressions, but also decreases
+probability of false positives.
+
+Console output
+``````````````
+
+The following information is visible towards the end of Jenkins console output,
+repeated for each analyzed test.
+
+The original 5 values are visible in the order they were measured.
+The 5 values after processing are also visible in output,
+this time sorted by value (so people can see minimum and maximum).
+
+The next output is difference of averages. It is the current average
+minus the parent average, expressed as percentage of the parent average.
+
+The next three outputs contain the jumpavg representation
+of the two groups and a combined group.
+Here, "bits" is the description length; for "current" sequence
+it includes effect from "parent" average value
+(jumpavg-0.1.3 penalizes sequences with too close averages).
+
+Next, a sentence describing which grouping description is shorter,
+and by how many bits.
+Finally, the test result classification is visible.
+
+The algorithm does not track test case names,
+so test cases are indexed (from 0).
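The per-patch classification rule and the "difference of averages" output described in perpatch_performance_tests.rst can be sketched as follows. This is a hedged illustration, not the job's implementation: the function names are hypothetical, the sample values are made up, and the grouping decision itself (one group versus two) is passed in as a boolean because the jumpavg API is not reproduced here.

```python
# Illustrative sketch only; names and sample values are hypothetical.
from statistics import mean


def difference_of_averages_percent(parent, current):
    """Current average minus parent average, as a percentage of the parent average."""
    parent_avg = mean(parent)
    return (mean(current) - parent_avg) / parent_avg * 100.0


def classify(parent, current, two_groups_is_shorter):
    """Apply the rule from the text.

    two_groups_is_shorter is assumed to come from the description length
    comparison (two groups versus one combined group) done elsewhere.
    """
    if not two_groups_is_shorter:
        return "normal"
    if mean(current) < mean(parent):
        return "regression"
    return "progression"


def job_failed(classifications, any_trial_failed=False):
    """The whole job fails on a failed trial or on any regression."""
    return any_trial_failed or "regression" in classifications


parent_samples = [10.0e6, 10.0e6, 10.0e6, 10.0e6, 10.0e6]
current_samples = [9.0e6, 9.0e6, 9.0e6, 9.0e6, 9.0e6]
diff_percent = difference_of_averages_percent(parent_samples, current_samples)
result = classify(parent_samples, current_samples, two_groups_is_shorter=True)
```

Note that a lower current average alone does not produce a regression in this rule; the two-group description has to be the shorter one first.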
diff --git a/docs/cpta/methodology/testbed_hw_configuration.rst b/docs/cpta/methodology/testbed_hw_configuration.rst new file mode 100644 index 0000000000..7914de5674 --- /dev/null +++ b/docs/cpta/methodology/testbed_hw_configuration.rst @@ -0,0 +1,5 @@ +Testbed HW configuration +------------------------ + +The testbed HW configuration is described on +`this FD.IO wiki page `_. diff --git a/docs/cpta/methodology/trend_analysis.rst b/docs/cpta/methodology/trend_analysis.rst new file mode 100644 index 0000000000..9916f20350 --- /dev/null +++ b/docs/cpta/methodology/trend_analysis.rst @@ -0,0 +1,106 @@ +Trend Analysis +-------------- + +All measured performance trend data is treated as time-series data that +can be modelled as concatenation of groups, each group modelled +using normal distribution. While sometimes the samples within a group +are far from being distributed normally, currently we do not have a +better tractable model. + +Here, "sample" should be the result of single trial measurement, +with group boundaries set only at test run granularity. +But in order to avoid detecting causes unrelated to VPP performance, +the default presentation (without /new/ in URL) +takes average of all trials within the run as the sample. +Effectively, this acts as a single trial with aggregate duration. + +Performance graphs always show the run average (not all trial results). + +The group boundaries are selected based on `Minimum Description Length`_. + +Minimum Description Length +`````````````````````````` + +`Minimum Description Length`_ (MDL) is a particular formalization +of `Occam's razor`_ principle. + +The general formulation mandates to evaluate a large set of models, +but for anomaly detection purposes, it is useful to consider +a smaller set of models, so that scoring and comparing them is easier. 
+
+For each candidate model, the data should be compressed losslessly,
+which includes model definitions, encoded model parameters,
+and the raw data encoded based on probabilities computed by the model.
+The model resulting in shortest compressed message is "the" correct model.
+
+For our model set (groups of normally distributed samples),
+we need to encode group length (which penalizes too many groups),
+group average (more on that later), group stdev and then all the samples.
+
+Luckily, the "all the samples" part turns out to be quite easy to compute.
+If sample values are considered as coordinates in (multi-dimensional)
+Euclidean space, fixing stdev means the point with allowed coordinates
+lies on a sphere. Fixing average intersects the sphere with a (hyper)-plane,
+and Gaussian probability density on the resulting sphere is constant.
+So the only contribution is the "area" of the sphere, which only depends
+on the number of samples and stdev.
+
+A somewhat ambiguous part is in choosing which encoding
+is used for group size, average and stdev.
+Different encodings cause different biases to large or small values.
+In our implementation we have chosen probability density
+corresponding to uniform distribution (from zero to maximal sample value)
+for stdev and average of the first group,
+but for averages of subsequent groups we have chosen a distribution
+which discourages delimiting groups with averages close together.
+
+Our implementation assumes that measurement precision is 1.0 pps.
+Thus it is slightly wrong for trial durations other than 1.0 seconds.
+Also, all the calculations assume 1.0 pps is totally negligible,
+compared to stdev value.
+
+The group selection algorithm currently has no parameters,
+all the aforementioned encodings and handling of precision are hardcoded.
+In principle, every group selection is examined, and the one encodable
+with the fewest bits is selected.
+As the bit amount for a selection is just sum of bits for every group,
+finding the best selection takes a number of comparisons
+quadratically increasing with the size of data,
+the overall time complexity being probably cubic.
+
+The resulting group distribution looks good
+if samples are distributed normally enough within a group.
+But for obviously different distributions (for example `bimodal distribution`_)
+the groups tend to focus on less relevant factors (such as "outlier" density).
+
+Anomaly Detection
+`````````````````
+
+Once the trend data is divided into groups, each group has its population average.
+The start of the following group is marked as a regression (or progression)
+if the new group's average is lower (higher) than the previous group's.
+
+In the text below, "average at time <t>", shorthand "AVG[t]"
+means "the group average of the group the sample at time <t> belongs to".
+
+Trend Compliance
+````````````````
+
+Trend compliance metrics are targeted to provide an indication of trend
+changes over a short-term (i.e. weekly) and a long-term (i.e.
+quarterly), comparing the last group average AVG[last], to the one from a week
+ago, AVG[last - 1week] and to the maximum of trend values over last
+quarter except last week, max(AVG[last - 3mths]..AVG[last - 1week]),
+respectively. This results in following trend compliance calculations:
+
++-------------------------+---------------------------------+-----------+-------------------------------------------+
+| Trend Compliance Metric | Trend Change Formula            | Value     | Reference                                 |
++=========================+=================================+===========+===========================================+
+| Short-Term Change       | (Value - Reference) / Reference | AVG[last] | AVG[last - 1week]                         |
++-------------------------+---------------------------------+-----------+-------------------------------------------+
+| Long-Term Change        | (Value - Reference) / Reference | AVG[last] | max(AVG[last - 3mths]..AVG[last - 1week]) |
++-------------------------+---------------------------------+-----------+-------------------------------------------+
+
+.. _Minimum Description Length: https://en.wikipedia.org/wiki/Minimum_description_length
+.. _Occam's razor: https://en.wikipedia.org/wiki/Occam%27s_razor
+.. _bimodal distribution: https://en.wikipedia.org/wiki/Bimodal_distribution
diff --git a/docs/cpta/methodology/trend_presentation.rst b/docs/cpta/methodology/trend_presentation.rst
new file mode 100644
index 0000000000..e9918020c5
--- /dev/null
+++ b/docs/cpta/methodology/trend_presentation.rst
@@ -0,0 +1,42 @@
+Trend Presentation
+------------------
+
+Performance Dashboard
+`````````````````````
+
+Dashboard tables list a summary of per test-case VPP MRR performance
+trend and trend compliance metrics and detected number of anomalies.
+
+Separate tables are generated for each testbed and each tested number of
+physical cores for VPP workers (1c, 2c, 4c). Test case names are linked to
+respective trending graphs for ease of navigation through the test data.
+
+Failed tests
+````````````
+
+The Failed tests tables list the tests which failed over the specified
+seven-day period together with the number of fails over the period and last
+failure details - Time, VPP-Build-Id and CSIT-Job-Build-Id.
+ +Separate tables are generated for each testbed. Test case names are linked to +respective trending graphs for ease of navigation through the test data. + +Trendline Graphs +```````````````` + +Trendline graphs show measured per run averages of MRR values, +group average values, and detected anomalies. +The graphs are constructed as follows: + +- X-axis represents the date in the format MMDD. +- Y-axis represents run-average MRR value in Mpps. +- Markers to indicate anomaly classification: + + - Regression - red circle. + - Progression - green circle. + +- The line shows average MRR value of each group. + +In addition the graphs show dynamic labels while hovering over graph +data points, presenting the CSIT build date, measured MRR value, VPP +reference, trend job build ID and the LF testbed ID.
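The trend compliance metrics listed in the dashboard use the (Value - Reference) / Reference formula from the Trend Compliance table in trend_analysis.rst. A minimal sketch of that arithmetic follows; the function names and the sample group averages are hypothetical and are not taken from the CSIT presentation layer.

```python
# Illustrative sketch only; names and numbers are hypothetical.

def trend_change(value, reference):
    """Relative change: (Value - Reference) / Reference."""
    return (value - reference) / reference


def short_term_change(avg_last, avg_week_ago):
    """Compare AVG[last] with AVG[last - 1week]."""
    return trend_change(avg_last, avg_week_ago)


def long_term_change(avg_last, avgs_quarter_except_last_week):
    """Compare AVG[last] with max(AVG[last - 3mths]..AVG[last - 1week])."""
    return trend_change(avg_last, max(avgs_quarter_except_last_week))


# Group averages in Mpps, made up for the example.
short = short_term_change(9.0, 9.5)
long_ = long_term_change(9.0, [8.0, 10.0, 9.5])
```

Because the long-term reference is the maximum over the quarter, the long-term change is never above the change computed against any single group average from that window.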