docs/ietf/draft-ietf-bmwg-mlrsearch-05.md

   1 ---
   2 title: Multiple Loss Ratio Search
   3 abbrev: MLRsearch
   4 docname: draft-ietf-bmwg-mlrsearch-05
   5 date: 2023-10-23
   6
   7 ipr: trust200902
   8 area: ops
   9 wg: Benchmarking Working Group
  10 kw: Internet-Draft
  11 cat: info
  12
  13 coding: us-ascii
  14 pi:    # can use array (if all yes) or hash here
  15   toc: yes
  16   sortrefs:   # defaults to yes
  17   symrefs: yes
  18
  19 author:
  20       -
  21         ins: M. Konstantynowicz
  22         name: Maciek Konstantynowicz
  23         org: Cisco Systems
  24         email: mkonstan@cisco.com
  25       -
  26         ins: V. Polak
  27         name: Vratko Polak
  28         org: Cisco Systems
  29         email: vrpolak@cisco.com
  30
  31 normative:
  32   RFC1242:
  33   RFC2285:
  34   RFC2544:
  35   RFC9004:
  36
  37 informative:
  38   TST009:
  39     target: https://www.etsi.org/deliver/etsi_gs/NFV-TST/001_099/009/03.04.01_60/gs_NFV-TST009v030401p.pdf
  40     title: "TST 009"
  41   FDio-CSIT-MLRsearch:
  42     target: https://csit.fd.io/cdocs/methodology/measurements/data_plane_throughput/mlr_search/
  43     title: "FD.io CSIT Test Methodology - MLRsearch"
  44     date: 2023-10
  45   PyPI-MLRsearch:
  46     target: https://pypi.org/project/MLRsearch/0.4.1/
  47     title: "MLRsearch 0.4.1, Python Package Index"
  48     date: 2021-07
  49
  50 --- abstract
  51
  52 This document proposes extensions to [RFC2544] throughput search by
  53 defining a new methodology called Multiple Loss Ratio search
  54 (MLRsearch). The main objectives of MLRsearch are to minimize the
  55 total search duration, to support searching for multiple loss ratios
  56 and to improve results repeatability and comparability.
  57
  58 The main motivation behind extending [RFC2544] is the new set of challenges
  59 and requirements posed by evaluating and testing software based networking
  60 systems, specifically their data planes.
  61
  62 MLRsearch offers several ways to address these challenges, giving user
  63 configuration options to select their preferred way.
  64
  65 --- middle
  66
  67 {::comment}
  68     As we use kramdown to convert from markdown,
  69     we use this way of marking comments not to be visible in rendered draft.
  70     https://stackoverflow.com/a/42323390
  71     If other engine is used, convert to this way:
  72     https://stackoverflow.com/a/20885980
  73 {:/comment}
  74
  75 # Purpose and Scope
  76
  77 The purpose of this document is to describe Multiple Loss Ratio search
  78 (MLRsearch), a data plane throughput search methodology optimized for software
  79 networking DUTs.
  80
  81 Applying vanilla [RFC2544] throughput bisection to software DUTs
  82 results in a number of problems:
  83
  84 - Binary search takes too long as most of trials are done far from the
  85   eventually found throughput.
  86 - The required final trial duration and pauses between trials
  87   prolong the overall search duration.
  88 - Software DUTs show noisy trial results,
  89   leading to a big spread of possible discovered throughput values.
  90 - Throughput requires loss of exactly zero frames, but the industry
  91   frequently allows for small but non-zero losses.
  92 - The definition of throughput is not clear when trial results are inconsistent.
  93
  94 MLRsearch library aims to address these problems by applying the following set
  95 of enhancements:
  96
  97 - Allow multiple shorter trials instead of one big trial per load.
  98   - Optionally, tolerate a percentage of trial result with higher loss.
  99 - Allow searching for multiple search goals, with differing loss ratios.
 100   - Any trial result can affect each search goal in principle.
 101 - Insert multiple coarse targets for each search goal, earlier ones need
 102   to spend less time on trials.
 103   - Earlier targets also aim for lesser precision.
 104   - Use Forwarding Rate (FR) at maximum offered load
 105     [RFC2285] (section 3.6.2) to initialize the initial targets.
 106 - Take care when dealing with inconsistent trial results.
 107   - Reported throughput is smaller than smallest load with high loss.
 108   - Smaller load candidates are measured first.
 109 - Apply several load selection heuristics to save even more time
 110   by trying hard to avoid unnecessarily narrow bounds.
 111
 112 Some of these enhancements are formalized as MLRsearch specification,
 113 the remaining enhancements are treated as implementation details,
 114 thus achieving high comparability without limiting future improvements.
 115
 116 MLRsearch configuration options are flexible enough to
 117 support both conservative settings and aggressive settings.
 118 Where the conservative settings lead to results
 119 unconditionally compliant with [RFC2544],
 120 but longer search duration and worse repeatability.
 121 Conversely, aggressive settings lead to shorter search duration
 122 and better repeatability, but the results are not compliant with [RFC2544].
 123
 124 No part of [RFC2544] is intended to be obsoleted by this document.
 125
 126 # Identified Problems
 127
 128 This chapter describes the problems affecting usability
 129 of various preformance testing methodologies,
 130 mainly a binary search for [RFC2544] unconditionally compliant throughput.
 131
 132 The last chapter will summarize how the problems are addressed,
 133 the middle chapters provide explanations and definitions needed for that.
 134
 135 ## Long Search Duration
 136
 137 Emergence of software DUTs, with frequent software updates and a
 138 number of different frame processing modes and configurations,
 139 has increased both the number of performance tests requred to verify DUT update
 140 and the frequency of running those tests.
 141 This makes the overall test execution time even more important than before.
 142
 143 In the context of characterising particular DUT's network performance,
 144 this calls for improving the time efficiency of throughput search.
 145 A vanilla bisection (at 60sec trial duration for unconditional [RFC2544]
 146 compliance) is slow, because most trials spend time quite far from the
 147 eventual throughput.
 148
 149 [RFC2544] does not specify any stopping condition for throughput search,
 150 so users can trade-off between search duration and achieved precision.
 151 But, due to logarithmic nature of bisection, even small improvement
 152 in search duration needs relatively big sacrifice in the precision of the
 153 discovered throughput.
 154
 155 ## DUT in SUT
 156
 157 [RFC2285] defines:
 158 - DUT as
 159   - The network forwarding device to which stimulus is offered and
 160     response measured [RFC2285] (section 3.1.1).
 161 - SUT as
 162   - The collective set of network devices to which stimulus is offered
 163     as a single entity and response measured [RFC2285] (section 3.1.2).
 164
 165 [RFC2544] specifies a test setup with an external tester stimulating the
 166 networking system, treating it either as a single DUT, or as a system
 167 of devices, an SUT.
 168
 169 In case of software networking, the SUT consists nt only of the DUT
 170 as a software program processing frames, but also of
 171 a server hardware and operating system functions,
 172 with server hardware resources shared across all programs
 173 and the operating system running on the same server.
 174
 175 The DUT is effectively nested within the rest of the SUT.
 176
 177 Due to a shared multi-tenant nature of SUT, DUT is subject to
 178 possible interference coming from the operating system and any other
 179 software running on the same server. Some sources of such interference
 180 can be to some degree eliminated, e.g. by pinning DUT program threads
 181 to specific CPU cores and isolating those cores to avoid context switching.
 182 But some level of adverse effects may remain even after
 183 all such reasonable precautions are applied.
 184 These effects affect DUT's network performance negatively.
 185 As the effects are hard to predict in general, they have impact similar to
 186 what other engineering disciplines define as a noise.
 187 Thus, all such effects are called an SUT noise.
 188
 189 DUT can also exhibit fluctuating performance itself, for reasons
 190 not related to the rest of SUT, for example due to pauses in execution
 191 as needed for internal stateful processing. In many cases this
 192 may be an expected per-design behavior, as it would be observable even
 193 in a hypothetical scenario where all sources of SUT noise are eliminated.
 194 Such behavior affects trial results in a way similar to SUT noise.
 195 As the two phenomenons are hard to destinguish,
 196 this document uses the word noise as a shorthand covering both
 197 this internal DUT performance fluctuations and genuine SUT noise.
 198
 199 A simple model of SUT performance consists of an idealized noiseless performance,
 200 and additional noise effects. The noiseless performance is assumed to be constant,
 201 all observed performance variations are due to noise.
 202 The impact of the noise can vary in time, sometimes wildly,
 203 even within a single trial.
 204 The noise can sometimes be negligible, but frequently
 205 it lowers the observed SUT performance as observed in trial results.
 206
 207 In this model, SUT does not have a single performance value, it has a spectrum.
 208 One end of the spectrum is the idealized noiseless performance value,
 209 the other end can be called a noiseful performance. In practice, trial results
 210 close to the noiseful end of the spectrum happen only rarely.
 211 The worse the performance value is, the more rarely it is seen in a trial.
 212 Therefore, the extreme noiseful end of SUT spectrum is not really observable
 213 among trial results. Also, the extreme noiseless end of SUT spectrum
 214 is unlikely to be observable, this time because some small noise effects
 215 are likely to occur multiple times during a trial.
 216
 217 Unless specified otherwise, this document talks about potentially observable
 218 ends of the SUT performance spectrum, not about the extreme ones.
 219
 220 Focusing on DUT, the benchmarking effort should aim
 221 at eliminating only the SUT noise from SUT measurements.
 222 In practice that is not really possible, as based on authors experience
 223 and available literature, there are no realistic enough models
 224 able to distinguish SUT noise from DUT fluctuations.
 225
 226 However, assuming that a well-constructed SUT has the DUT as its
 227 performance bottleneck, the DUT ideal noiseless performance can be defined
 228 as the noiseless end of SUT performance spectrum. At least for
 229 throughput. For other performance quantities such as latency there may be an
 230 additive difference.
 231
 232 Note that by this definition, DUT noiseless performance
 233 also minimizes the impact of DUT fluctuations, as much as realistically possible
 234 for a given trial duration.
 235
 236 In this document, we reduce the DUT in SUT problem to estimating
 237 the noiseless end of SUT performance spectrum from a limited number of
 238 trial results.
 239
 240 Any improvements to throughput search algorithm, aimed for better
 241 dealing with software networking SUT and DUT setup, should employ
 242 strategies recognizing the presence of SUT noise, and allow discovery of
 243 (proxies for) DUT noiseless performance
 244 at different levels of sensitivity to SUT noise.
 245
 246 ## Repeatability and Comparability
 247
 248 [RFC2544] does not suggest to repeat throughput search. And from just one
 249 discovered throughput value, it cannot be determined how repeatable that value is.
 250 In practice, poor repeatability is also the main cause of poor
 251 comparability, that is different benchmarking teams can test the same SUT
 252 but get throughput values differing more than expected from search precision.
 253
 254 [RFC2544] throughput requirements (60 seconds trial and
 255 no tolerance of a single frame loss) affect the throughput results
 256 in the following way.
 257 The SUT behavior close to the noiseful end of its performance spectrum
 258 consists of rare occasions of significantly low performance,
 259 but the long trial duration makes those occasions not so rare on the trial level.
 260 Therefore, the binary search results tend to wander away from the noiseless end
 261 of SUT performance spectrum, more frequently and more widely than shorter
 262 trials would, thus resulting in poor throughput repeatability.
 263
 264 The repeatability problem can be addressed by defining a search procedure
 265 which reports more stable results,
 266 even if they can no longer be called throughput in [RFC2544] sense.
 267 According to the SUT performance spectrum model, better repeatability
 268 will be at the noiseless end of the spectrum.
 269 Therefore, solutions to the DUT in SUT problem
 270 will help also with the repeatability problem.
 271
 272 Conversely, any alteration to [RFC2544] throughput search
 273 that improves repeatability should be considered
 274 as less dependent on the SUT noise.
 275
 276 An alternative option is to simply run a search multiple times, and report some
 277 statistics (e.g. average and standard deviation). This can be used
 278 for a subset of tests deemed more important,
 279 but it makes the search duration problem even more pronounced.
 280
 281 ## Throughput with Non-Zero Loss
 282
 283 [RFC1242] (section 3.17) defines throughput as:
 284     The maximum rate at which none of the offered frames
 285     are dropped by the device.
 286
 287 Then, it says:
 288     Since even the loss of one frame in a
 289     data stream can cause significant delays while
 290     waiting for the higher level protocols to time out,
 291     it is useful to know the actual maximum data
 292     rate that the device can support.
 293
 294 Contrary to that, many benchmarking teams settle with small, non-zero
 295 loss ratio as the goal for a their load search.
 296
 297 Motivations are many:
 298
 299 - Modern protocols tolerate frame loss better,
 300   compared to the time when [RFC1242] and [RFC2544] were specified.
 301
 302 - Trials nowadays send way more frames within the same duration,
 303   increasing the chance small SUT performance fluctuatios
 304   is enough to cause frame loss.
 305
 306 - Small bursts of frame loss caused by noise have otherwise smaller impact
 307   on the average frame loss ratio ovserved in the trial,
 308   as during other parts of the same trial the SUT may work mroe closely
 309   to its noiseless performance, thus perhaps lowering the trial loss ratio
 310   below the goal loss ratio value.
 311
 312 - If an approximation of the SUT noise impact on the trial loss ratio is known,
 313   it can be set as the goal loss ratio.
 314
 315 Regardless of validity of any and all similar motivations,
 316 support for non-zero loss goals makes any search algorithm more user friendly.
 317 [RFC2544] throughput is not user friendly in this regard.
 318
 319 Assuming users are allowed to specify the goal loss ratio value,
 320 the usefulness is enhanced even more if users can specify multiple
 321 loss ratio values, especially when a single search can find all relevant bounds.
 322
 323 Searching for multiple search goals also helps to describe the SUT performance
 324 spectrum better than a single search goal result.
 325 For example, repeated wide gap between zero and non-zero loss loads
 326 indicates the noise has a large impact on the observed performance,
 327 which is not evident from a single goal load search procedure result.
 328
 329 It is easy to modify the vanilla bisection to find a lower bound
 330 for intended load that satisfies a non-zero goal loss ratio.
 331 But it is not that obvious how to search for multiple goals at once,
 332 hence the support for multiple search goals remains a problem.
 333
 334 ## Inconsistent Trial Results
 335
 336 While performing throughput search by executing a sequence of
 337 measurement trials, there is a risk of encountering inconsistencies
 338 between trial results.
 339
 340 The plain bisection never encounters inconsistent trials.
 341 But [RFC2544] hints about possibility of inconsistent trial results,
 342 in two places in its text.
 343 The first place is section 24, where full trial durations are required,
 344 presumably because they can be inconsistent with results
 345 from shorter trial durations.
 346 The second place is section 26.3, where two successive zero-loss trials
 347 are recommended, presumably because after one zero-loss trial
 348 there can be subsequent inconsistent non-zero-loss trial.
 349
 350 Examples include:
 351
 352 - A trial at the same load (same or different trial duration) results
 353   in a different trial loss ratio.
 354 - A trial at higher load (same or different trial duration) results
 355   in a smaller trial loss ratio.
 356
 357 Any robust throughput search algorithm needs to decide how to continue
 358 the search in presence of such inconsistencies.
 359 Definitions of throughput in [RFC1242] and [RFC2544] are not specific enough
 360 to imply a unique way of handling such inconsistencies.
 361
 362 Ideally, there will be a definition of a new quantity which both generalizes
 363 throughput for non-zero-loss (and other possible repeatibility enhancements),
 364 while being precise enough to force a specific way to resolve trial result
 365 inconsistencies.
 366 But until such definition is agreed upon, the correct way to handle
 367 inconsistent trial results remains an open problem.
 368
 369 # MLRsearch Specification
 370
 371 This chapter focuses on technical definitions needed for evaluating
 372 whether a particular test procedure adheres to MLRsearch specification.
 373
 374 For motivations, explanations, and other comments see other chapters.
 375
 376 ## MLRsearch Architecture
 377
 378 MLRsearch architecture consists of three main components:
 379 the manager, the controller and the measurer.
 380 Presence of other components (mainly the SUT) is also implied.
 381
 382 While the manager and the measurer can be seen a abstractions
 383 present in any testing procedure, the behavior of the controller
 384 is what distinguishes MLRsearch algorithms from other search procedures.
 385
 386 ### Measurer
 387
 388 The measurer is the component which performs one trial
 389 as described in [RFC2544] section 23, when requested by the controller.
 390
 391 Specifically, one call to the measurer accepts a trial load value
 392 and trial duration value, performs the trial, and returns
 393 the measured trial loss ratio, and optionally a different duration value.
 394
 395 It is responsibility of the measurer to uphold any requirements
 396 and assumptions present in MLRsearch specification
 397 (e.g. trial forwarding ratio not being larger than one).
 398 Implementers have some freedom, for example in the way they deal with
 399 duplicated frames, or what to return if tester sent zero frames towards SUT.
 400 Implementations are RECOMMENDED to document their behavior
 401 related to such freedoms in as detailed way as possible
 402
 403 Implemenations MUST document any deviations from RFC documents,
 404 for example if the wait time around traffic
 405 is shorter than what [RFC2544] section 23 specifies.
 406
 407 ### Controller
 408
 409 The controller is the component of MLRsearch architecture
 410 that is called by the manager (just once), calls the measurer
 411 (usually multiple times in succession),
 412 and returns the result of the search to the manager.
 413
 414 The only required argument in the call to the controller
 415 is a list of search goals. For the structure of the search result,
 416 see subsection Search Result.
 417
 418 ### Manager
 419
 420 The manager is the component that initializes SUT, traffic generator
 421 (tester in [RFC2544]), the measurer and the controller
 422 with intended configurations. It then hands over the execution
 423 to the controller and receives its result.
 424
 425 Creation of reports of appropriate format can also be understood
 426 as the responsibility of the manager.
 427
 428 ## Units
 429
 430 The specification deals with physical quantities, so it is assumed
 431 each numeric value is accompanied by an appropriate physical unit.
 432
 433 The specification does not state which unit is appropriate,
 434 but implementations MUST make it explicit which unit is used
 435 for each value provided or received by the user.
 436
 437 For example, load quantities (including the conditional throughput)
 438 returned by the controller are defined to be based on single-interface
 439 (unidirectional) loads. For bidirectional traffic, users are likely
 440 to expect bidirectional throughput quantities, so the manager is responsible
 441 for making its report clear.
 442
 443 ## SUT
 444
 445 As defined in [RFC2285]:
 446 The collective set of network devices to which stimulus is offered
 447 as a single entity and response measured.
 448
 449 ## Trial
 450
 451 A trial is the part of test described in [RFC2544] section 23.
 452
 453 ### Trial Load
 454
 455 Trial load is the intended constant load for a trial.
 456
 457 Load is the quantity implied by Constant Load of [RFC1242],
 458 Data Rate of [RFC2544] and Intended Load of [RFC2285].
 459 All three specify this value applies to one (input or output) interface.
 460
 461 ### Trial Duration
 462
 463 Trial duration is the intended duration of the traffic for a trial.
 464
 465 In general, this general quantity does not include any preparation nor waiting
 466 described in section 23 of [RFC2544].
 467
 468 However, the measurer MAY return a duration value different
 469 from the intended duration. This may be useful for users
 470 who want to control the overal search duration, not just the traffic part of it.
 471 The manager MUST report how does the measurer computes the returned duration
 472 values in that case.
 473
 474 ### Trial Forwarding Ratio
 475
 476 Trial forwarding ratio is dimensionless floating point value,
 477 assumed to be between 0.0 and 1.0, both including.
 478 It is computed as the number of frames forwarded by SUT, divided by
 479 the number of frames that should have been forwarded during the trial.
 480
 481 Note that, contrary to load, frame counts used to compute
 482 trial forwarding ratio are aggregates over all SUT ports.
 483
 484 Questions around what is the correct number of frames
 485 that should have been forwarded it outside of the scope of this document.
 486 E.g. what should the measurer return when it detects
 487 that the offered load differs significantly from the intended load.
 488
 489 It is RECOMMENDED implementations return an irregular goal result
 490 if they detect questionable (in comparability sense) trial results
 491 affecting their goal result.
 492
 493 ### Trial Loss Ratio
 494
 495 Trial loss ratio is equal to one minus the trial forwarding ratio.
 496
 497 ### Trial Forwarding Rate
 498
 499 The trial forwarding rate is the trial load multiplied by
 500 the trial forwarding ratio.
 501
 502 Note that this is very similar, but not identical to Forwarding Rate
 503 as defined in [RFC2285] section 3.6.1, as that definition
 504 is specific to one output interface, while trial forwarding ratio
 505 is based on frame counts aggregated over all SUT interfaces.
 506
 507 ## Traffic profile
 508
 509 Any other specifics (besides trial load and trial duration)
 510 the measurer needs to perform the trial are understood as a composite
 511 called the traffic profile.
 512 All its attributes are assumed to be constant during the search,
 513 and the composite is configured on the measurer by the manager
 514 before the search starts.
 515
 516 Traffic profile is REQUIRED by [RFC2544] to contain some specific quantities,
 517 for example frame size.
 518 Several more specific quantities may be RECOMMENDED.
 519
 520 Depending on SUT configuration, e.g. when testing specific protocols,
 521 additional values need to be included in the traffic profile
 522 and in the test report. See other IETF documents.
 523
 524 ## Search Goal
 525
 526 A search goal is one item of the list required as an argument
 527 when the manager calls the controller.
 528
 529 Each search goal is composite consisting of several attributes,
 530 some of them are required.
 531 Implementations are free to add their own attributes.
 532
 533 Subsections list all required attributes and one recommended attribute.
 534
 535 The meaning of the attributes is formally given only by their effect
 536 on the computation of the attributes of the goal result.
 537 The subsections do contain a short informal description,
 538 but see other chapters for more in-depth explanations.
 539
 540 ### Goal Final Trial Duration
 541
 542 A threshold value for trial durations.
 543 REQUIRED attribute, MUST be positive.
 544
 545 Informally, the conditional throughput for this goal will be computed
 546 only from trial results from trials as long as this.
 547
 548 ### Goal Duration Sum
 549
 550 A threshold value for a particular sum of trial durations.
 551 REQUIRED attribute, MUST be positive.
 552
 553 This uses the duration values returned by the measurer,
 554 in case it returns something else than the intended durations
 555 for the traffic part of the search.
 556
 557 Informally, even at looking only at trials done at this goal's
 558 final trial duration, MLRsearch may spend up to this time measuring
 559 the same load value.
 560
 561 ### Goal Loss Ratio
 562
 563 A threshold value for trial loss ratios.
 564 REQUIRED attribute, MUST be non-negative and smaller than one.
 565
 566 Informally, if a load causes too many trials with trial results larger than this,
 567 the conditional throughput for this goal will be smaller than that load.
 568
 569 ### Goal Exceed Ratio
 570
 571 A threshold value for particular ratio of duration sums.
 572 REQUIRED attribute, MUST be non-negative and smaller than one.
 573
 574 This uses the duration values returned by the measurer,
 575 in case it returns something else than the intended durations
 576 for the traffic part of the search.
 577
 578 Informally, this acts as the q-value for a quantile when selecting
 579 the forwarding rate of which trial result becomes the conditional throughput.
 580 For example, when the goal exceed ratio is 0.5 and MLRsearch
 581 happened to use the whole goal duration sum when determining conditional
 582 throughput, it means the conditional throughput is the median
 583 of trial forwarding rates.
 584 In practice, MLRsearch may stop measuring a load before the goal duration sum
 585 is reached, and the conditional throughput in that case frequently
 586 is the worst trial still not exceeding the goal loss ratio.
 587 If the goal duration sum is no larger than the goal fina trial duration,
 588 MLRsearch performs only one trial per load (unless other goals need more)
 589 and the goal exceed ratio has no effect on the search result.
 590
 591 ### Goal Width
 592
 593 A value used as a threshold for telling when two trial load values
 594 are close enough.
 595
 596 RECOMMENDED attribute, positive. Implementations without this attribute
 597 MUST give the manager other ways to control the search exit condition.
 598
 599 Absolute load difference and relative load difference are two popular choices,
 600 but implementations may choose a different way to specify width.
 601
 602 Informally, this acts as a stopping condition, controling the precision
 603 of the search. The search stops if every goal has reached its precision.
 604
 605 ## Search Result
 606
 607 Search result is a single composite object returned from the controller
 608 to the manager.
 609 It is a mapping from the search goals (see section Search Goal) into goal results
 610 (see section Goal Result).
 611 The mapping MUST map from all the search goals present in the controller input.
 612
 613 Each search goal instance is mapped to a goal result instance.
 614 Multiple search goal instances may map to the same goal result instance.
 615
 616 ## Goal Result
 617
 618 Goal result is a composite object consisting of several attributes.
 619 All attributes are related to the same search goal, the one search goal instance
 620 the Search Result is mapping into this instance of the Goal Result.
 621
 622 Some of the attributes are required, some are recommended,
 623 implementations are free to add their own.
 624
 625 The subsections define attributes for regular goal result.
 626 Implementations are free to define their own irregular goal results,
 627 but the manager MUST report them clearly as not regular according to this section.
 628
 629 A typical irregular result is when all trials at maximum offered load
 630 have zero loss, as the relevant upper bound does not exist in that case.
 631
 632 ### Relevant Upper Bound
 633
 634 Relevant upper bound is the intended load value that is classified
 635 at the end of the search as the relevant upper bound (see Appendix A)
 636 for this goal.
 637 This is a REQUIRED attribute.
 638
 639 Informally, this is the smallest intended load that failed to uphold
 640 all the requirements of this search goal, mainly the goal loss ratio
 641 in combination with the goal exceed ratio.
 642
 643 ### Relevant Lower Bound
 644
 645 Relevant lower bound is the intended load value that got classified
 646 (after all trials) as the relevant lower bound (see Appendix A) for this goal.
 647 This is a REQUIRED attribute.
 648
 649 The distance between the relevant lower bound and the relevant upper bound
 650 MUST NOT be larger than the goal width, for a regular goal result,
 651 if the implementation offers width as a goal attribute.
 652
 653 Informally, this is the smallest intended load that managed to uphold
 654 all the requirements of this search goal, mainly the goal loss ratio
 655 in combination with the goal exceed ratio, while not being larger
 656 than the relevant upper bound.
 657
 658 ### Conditional Throughput
 659
 660 The conditional throughput (see Appendix B) as evaluated
 661 at the relevant lower bound.
 662 This is a RECOMMENDED attribute.
 663
 664 Informally, a typical forwarding rate expected to be seen
 665 at the relevant lower bound. But frequently just a conservative estimate thereof,
 666 as MLRsearch implementations tend to stop gathering more data
 667 as soon as they confirm this estimate cannot get worse within
 668 the goal duration sum.
 669
 670 # MLRsearch Explanations
 671
 672 This chapter focuses on intuitions and motivations
 673 and skips over some important details.
 674
 675 Familiarity with the MLRsearch specification is not required here,
 676 so this chapter can act as an introduction.
 677 For example, this chapter start talking about tightest lower bounds
 678 before it is ready to talk about the relevant lower bound from the specification.
 679
 680 ## MLRsearch Versions
 681
 682 The MLRsearch algorithm has been developed in a code-first approach,
 683 a Python library has been created, debugged and used in production
 684 before first descriptions (even informal) were published.
 685 In fact, multiple versions of the library were used in production
 686 over past few years, and later code was usually not compatible
 687 with earlier descriptions.
 688
 689 The code in (any version of) MLRsearch library fully determines
 690 the search process (for given configuration parameters),
 691 leaving no space for deviations.
 692 MLRsearch as a name for a broad class of possible algorithms
 693 leaves plenty of space for future improvements, at the cost
 694 of poor comparability of results of different MLRsearch implementations.
 695
 696 This document aspires to prescribe a MLRsearch specification
 697 in a way that restricts the important parts related to comparability,
 698 while leaving other parts vague enough so implementations can improve freely.
 699
 700 ## Exit Condition
 701
 702 [RFC2544] prescribes that after performing one trial at a specific offered load,
 703 the next offered load should be larger or smaller, based on frame loss.
 704
 705 The usual implementation uses binary search. Here a lossy trial becomes
 706 a new upper bound, a lossless trial becomes a new lower bound.
 707 The span of values between the tightest lower bound and the tightest upper bound
 708 forms an interval of possible results,
 709 and after each trial the width of that interval halves.
 710
 711 Usually the binary search implementation tracks only the two tightest bounds,
 712 simply calling them bounds, but the old values still remain valid bounds,
 713 just not as tight as the new ones.
 714
 715 After some number of trials, the tightest lower bound becomes the throughput.
 716 [RFC2544] does not specify when (if ever) should the search stop.
 717
 718 MLRsearch library introduces a concept of goal width. The search stops
 719 when the distance between the tightest upper bound and the tightest lower bound
 720 is smaller than a user-configured value called goal width from now on.
 721 In other words, interval width has to be smaller than goal width
 722 at the end of the search.
 723
 724 This goal width value therefore determines the precision of the result.
 725 As MLRsearch specification requires a particular structure of the result,
 726 the result itself does contain enough information to determine its precision,
 727 thus it is not required to report the goal width value.
 728
 729 This allows MLRsearch implementations to use exit conditions
 730 different from goal width.
 731 The MLRsearch specification only REQUIRES the search procedure
 732 to always finish in a finite time, regardless of possible trial results.
 733
 734 ## Load Classification
 735
 736 MLRsearch keeps the basic logic of binary search (tracking tightest bounds,
 737 measuring at the middle), perhaps with minor technical clarifications.
 738 The algorithm chooses an intended load (as opposed to offered load),
 739 the interval between bounds does not need to be split exactly in two equal halves,
 740 and the final reported structure specifies both bounds
 741 (optionally also the conditional throughput at the lower bound, defined later).
 742
 743 The biggest difference is that in order to classify a load
 744 as an upper or lower bound, MLRsearch may need more than one trial
 745 (depending on configuration options) to be performed at the same intended load.
 746
 747 As a consequence, even if a load already does have few trial results,
 748 it still may be classified as undecided, neither a lower bound nor an upper bound.
 749
 750 Explanation of the classification logic is given in the next chatper,
 751 as it relies heavily on other sections of this chapter.
 752
 753 For repeatability and comparability reasons, it is important that
 754 given a set of trial results, all implementations of MLRsearch
 755 classify the load in an equivalent way.
 756
 757 ## Loss Ratios
 758
 759 Next difference is in goals of the search. [RFC2544] has a single goal,
 760 based on classifying full-length trials as either loss-less or lossy.
 761
 762 As the name suggests, MLRsearch can seach for multiple goals, differing in their
 763 loss ratios. Precise definition of goal loss ratio will be given later.
 764 The [RFC2544] throughput goal then simply becomes a zero goal loss ratio.
 765 Different goals also may have different goal width.
 766
 767 A set of trial results for one specific intended load value
 768 can classify the load as an upper bound for some goals, but a lower bound
 769 for some other goals, and undecided for the rest of the goals.
 770
 771 Therefore, the load classification depends not only on ttrial results,
 772 but also on the goal. The overall search procedure becomes more complicated
 773 (compared to binary search with a single goal),
 774 but most of the complications do not affect the final result,
 775 except for one phenomenon, loss inversion.
 776
 777 ## Loss Inversion
 778
 779 In [RFC2544] throuhput search using bisection, any load with lossy trial
 780 becomes a hard upper bound, meaning every subsequent trial has smaller
 781 intended load.
 782
 783 But in MLRsearch, a load that is classified as an upper bound for one goal
 784 may still be a lower bound for another goal, and due to that other goal
 785 MLRsearch will probably perform trials at even higher loads.
 786 What to do when all such higher load trials happen to have zero loss?
 787 Does it mean the earlier upper bound was not real?
 788 Does it mean the later lossless trials are not considered a lower bound?
 789 Surely we do not want to have an upper bound at a load smaller than a lower bound.
 790
 791 MLRsearch is conservative in these situations.
 792 The upper bound is considered real, and the lossless trials at higher loads
 793 are considered to be a coincidence, at least when computing the final result.
 794
 795 This is formalized using new notions, the relevant upper bound and
 796 the relevant lower bound.
 797 Load classification is still based just on the set of trial results
 798 at a given intended load (trials at other loads are ignored),
 799 making it possible to have a lower load classified as an upper bound
 800 and a higher load classified as a lower bound (for the same goal).
 801 The relevant upper bound (for a goal) is the smallest load classified
 802 as an upper bound. But the relevant lower bound is not simply
 803 the largest among lower bounds. It is the largest load among loads
 804 that are lower bounds while also being smaller than the relevant upper bound.
 805
 806 With these definitions, the relevant lower bound is always smaller
 807 than the relevant upper bound (if both exist), and the two relevant bounds
 808 are used analogously as the two tightest bounds in the binary search.
 809 When they are less than goal width apart, the relevant bounds are used in output.
 810
 811 One consequence is that every trial result can have an impact on the search result.
 812 That means if your SUT (or your traffic generator) needs a warmup,
 813 be sure to warm it up before starting the search.
 814
 815 ## Exceed Ratio
 816
 817 The idea of performing multiple trials at the same load comes from
 818 a model where some trial results (those with high loss) are affected
 819 by infrequent effects, causing poor repeatability of [RFC2544] throughput results.
 820 See the discussion about noiseful and noiseless ends of SUT performance spectrum.
 821 Stable results are closer to the noiseless end of SUT preformance spectrum,
 822 so MLRsearch may need to allow some frequency of high-loss trials
 823 to ignore the reare but big effects near the noisefull end.
 824
 825 MLRsearch is able to do such trial result filtering, but it needs
 826 a configuration option to tell it how much frequent can the infrequent big loss be.
 827 This option is called exceed ratio. It tells MLRsearch what ratio of trials
 828 (more exactly what ratio of trial seconds) can have trial loss ratio
 829 larger than goal loss ratio and still be classified as a lower bound.
 830 Zero exceed ratio means all trials have to have trial loss ratio
 831 equal to or smaller than the goal loss ratio.
 832
 833 For explainability reasons, the RECOMMENDED value for exceed ratio is 0.5,
 834 as it simplifies some later concepts by relating them to the concept of median.
 835
 836 ## Duration Sum
 837
 838 When more than one trial is needed to classify a load,
 839 MLRsearch also needs something that controlls the number of trials needed.
 840 Therefore, each goal also has an attribute called duration sum.
 841
 842 The meaning of a goal duration sum is that when a load has trials
 843 (at full trial duration, details later)
 844 whose trial durations when summed up give a value at least this,
 845 the load is guaranteed to be classified as an upper bound or a lower bound
 846 for the goal.
 847
 848 As the duration sum has a big impact on the overall search duration,
 849 and [RFC2544] prescibes wait intervals around trial traffic,
 850 the MLRsearch algorithm may sum durations that are different
 851 from the actual trial traffic durations.
 852
 853 ## Short Trials
 854
 855 Section 24 of [RFC2544] already anticipates possible time savings
 856 when short trials (shorter than full length trials) are used.
 857
 858 MLRsearch requires each goal to specify its final trial duration.
 859 Full-length trial is the short name for a trial whose intended trial duration
 860 is equal to the goal final trial duration.
 861
 862 Any MLRsearch implementation may include its own configuration options
 863 which control when and how MLRsearch chooses to use shorter trial durations.
 864
 865 For explainability reasons, when exceed ratio of 0.5 is used,
 866 it is recommended for the goal duration sum to be an odd multiple
 867 of the full trial durations, so conditional throughput becomes identical to
 868 a median of a particular set of forwarding rates.
 869
 870 Presence of shorter trial results complicates the load classification logic.
 871 Full details are given later. In short, results from short trials
 872 may cause a load to be classified as an upper bound.
 873 This may cause loss inversion, and thus lower the relevant lower bound
 874 (below what would classification say when considering full-length trials only).
 875
 876 For explainability reasons, it is RECOMMENDED users use such configurations
 877 that guarantee all trials have the same length.
 878 Alas, such configurations are usually not compliant with [RFC2544] requirements,
 879 or not time-saving enough.
 880
 881 ## Conditional Throughput
 882
 883 As testing equipment takes intended load as input parameter
 884 for a trial measurement, any load search algorithm needs to deal
 885 with intended load values internally.
 886
 887 But in presence of goals with non-zero loss ratio, the intended load
 888 usually does not match the user intuition of what a throughput is.
 889 The forwarding rate (as defined in [RFC2285] section 3.6.1) is better,
 890 but it is not obvious how to generalize it
 891 for loads with multiple trial results,
 892 especially with non-zero goal exceed ratio.
 893
 894 MLRsearch defines one such generalization, called the conditional throughput.
 895 It is the forwarding rate from one of the trials performed at the load
 896 in question. Specification of which trial exactly is quite technical.
 897 More detailed explanations are given in the next chapter.
 898
 899 Conditional throughput is partially related to load classification.
 900 If a load is classified as a lower bound for a goal,
 901 the conditional throughpt can be calculated,
 902 and guaranteed to show effective loss ratio no larger than goal loss ratio.
 903
 904 While the conditional throughput gives more intuitive-looking values
 905 than the relevant lower bound, especially for non-zero goal loss ratio values,
 906 the actual definition is more complicated than the definition of the relevant
 907 lower bound. In future, other intuitive values may become popular,
 908 but they are unlikely to supersede the definition of the relevant lower bound
 909 as the most fitting value for comparability purposes,
 910 therefore the relevant lower bound remains a required attribute
 911 of the goal result structure.
 912
 913 Note that comparing best and worst case, the same relevant lower bound value
 914 may result in the conditional throughput differing up to the goal loss ratio.
 915 Therefore it is rarely needed to set the goal width (if expressed
 916 as relative difference of loads) below the goal loss ratio.
 917 In other words, setting the goal width below the goal loss ratio
 918 may cause the conditional throughput for a larger loss ratio to became smaller
 919 than a conditional throughput for a goal with a smaller goal loss ratio,
 920 which is counter-intuitive, considering they come from the same search.
 921 Therefore it is RECOMMENDED to set the goal width to a value no smaller
 922 than the goal loss ratio.
 923
 924 ## Search Time
 925
 926 The main motivation for MLRsearch was to have an algorithm that spends less time
 927 finding a throughput, either the [RFC2544] compliant one,
 928 or some generalization thereof. The art of achieving short search times
 929 is mainly in smart selection of intended loads (and intended durations)
 930 for the next trial to perform.
 931
 932 While there is an indirect impact of the load selection on the reported values,
 933 in practice such impact tends to be small,
 934 even for SUTs with quite broad performance spectrum.
 935
 936 A typical example of two approaches to load selection leading to different
 937 relevant lower bounds is when the interval is split in a very uneven way.
 938 An implementation chosing loads very close to the current relevant lower bound
 939 are quite likely to eventually stumble upon a trial result
 940 with poor performance (due to SUT noise).
 941 For an implementation chosing load very close to the current relevant upper bound
 942 this is unlikely, as it examines more loads that can see a performance
 943 close to the noiseless end of the SUT performance spectrum.
 944 The reason why it is unlikely to have two MLRsearch implementation showing
 945 this kind of preference in load selection is precisely
 946 in the desire to have short searches.
 947 Even splits are the best way to achive the desired precision,
 948 so the more optimized a search algorithm is for the overall search duration,
 949 the better the repeatability and comparability
 950 of its results will be, assuming the user configuration remains the same.
 951
 952 Therefore, this document remains quite vague on load selection
 953 and other optimisation details, and configuration attributes related to them.
 954 Assuming users prefer libreries that achieve short overall search time,
 955 the definition of the relevant lower bound
 956 should be strict enough to ensure result repeatability
 957 and comparability between different implementations,
 958 while not restricting future implementations much.
 959
 960 Sadly, different implementations may exhibit their sweet spot of
 961 best repeatability at given search duration at different goals attribute values,
 962 especially with respect to optional goal attributes
 963 such as initial trial duration.
 964 Thus, this document does not comment much on which configurations
 965 are good for comparability between different implementations.
 966 For comparability between different SUTs using the same implementation,
 967 refer to configurations recommended by that particular implementation.
 968
 969 ## [RFC2544] compliance
 970
 971 The following search goal ensures unconditional compliance with
 972 [RFC2544] throughput search procedure:
 973
 974 - Goal loss ratio: zero.
 975
 976 - Goal final trial duration: 60 seconds.
 977
 978 - Goal duration sum: 60 seconds.
 979
 980 - Goal exceed ratio: zero.
 981
 982 Presence of other search goals does not affect compliance of this goal result.
 983 The relevant lower bound and the conditional throughput are in this case
 984 equal to each other, and the value is the [RFC2544] throughput.
 985
 986 If the 60 second quantity is replaced by a smaller quantity in both attributes,
 987 the conditional throughput is still conditionally compliant with
 988 [RFC2544] throughput.
 989
 990 # Selected Functional Details
 991
 992 This chapter continues with explanations,
 993 but this time more precise definitions are needed
 994 for readers to follow the explanations.
 995 The definitions here are wordy, implementers can look into the next chapter
 996 for more concise definitions.
 997
 998 The two areas of focus in this chapter are the load classification
 999 and the conditional throughput, starting with the latter.
1000
1001 ## Performance Spectrum
1002
1003 There are several equivalent ways to define the conditional throughput computation.
1004 One of the ways relies on an object called the performance spectrum.
1005 First, two heavy definitions are needed.
1006
1007 Take an intended load value, and a finite set of trial results, all trials
1008 measured at that load value. The performance spectrum is the function that maps
1009 any non-negative real number into a sum of trial durations among all trials
1010 in the set that have that number as their forwarding rate,
1011 e.g. map to zero if no trial has that particular forwarding rate.
1012
1013 A related function, defined if there is at least one trial in the set,
1014 is the performance spectrum divided by sum of durations of all trials in the set.
1015 That function is called the performance probability function, as it satisfies
1016 all the requirements for probability mass function function
1017 of a discrete probability distribution,
1018 the one-dimensional random variable being the trial forwarding rate.
1019
1020 These functions are related to the SUT performance spectrum,
1021 as sampled by the trials in the set.
1022
1023 As for any other probability function, we can talk about percentiles,
1024 of the performance probability function, and bout other quantiles
1025 such as the median. The conditional throughput will be
1026 one such quantile value for a specifically chosen set of trials.
1027
1028 Take a set of all full-length trials performed at the load in question.
1029 The sum of durations of those trials may be less than goal duration sum, or not.
1030 If it is less, add an imaginary trial result with zero forwarding rate
1031 such that the new sum of durations is equal to the goal duration sum.
1032 This is the set of trials to use. The q-value for the quantile
1033 is the goal exceed ratio. If the quantile touches two trials,
1034 the larger forwarding rate is used.
1035
1036 First example. For zero exceed ratio when goal duration sum
1037 has been reached. The conditional throughput is the smallest forwarding
1038 rate among the trials.
1039
1040 Second example. For zero exceed ratio when goal duration sum
1041 has not been reached yet. Due to the missing duration sum,
1042 the worst case may still happen, so the conditional througput
1043 is zero. This is not reported to the user, as this load
1044 cannot become the relevant lower bound yet.
1045
1046 Third example. Exceed ratio 50%, goal duration sum two seconds,
1047 one trial present with duration one second and zero loss.
1048 An imaginary trial is added with duration one second and zero forwarding rate.
1049 Median would touch both trials, so the conditional throughput
1050 is the forwarding rate of the one non-imaginary trial.
1051 As that had zero loss, the value is equal to the offered load.
1052
1053 The classification does not need the whole performance spectrum,
1054 only few duration sums.
1055
1056 A trial is called bad (according to a goal) if its trial loss ratio
1057 is larger than the goal loss ratio. Trial that is not bad is called good.
1058
1059 ## Single Trial Duration
1060
1061 When goal attributes are chosen in such a way that every trial has the same
1062 intended duration, the load classification is sipler.
1063
1064 The following description looks technical, but it follows the motivation
1065 of goal loss ratio, goal exceed ratio and goal duration sum.
1066 If sum of durations of all trials (at given load) is less than the goal
1067 duration sum, imagine best case scenario (all subsequent trials having zero loss)
1068 and worst case scenario (all subsequent trials having 100% loss).
1069 Here we assume there is as many subsequent trials as needed
1070 to make the sum of all trials to become equal to the goal duration sum.
1071 As the exceed ratio is defined just using sums of durations
1072 (number of trials does not matter), it does not matter whether
1073 the "subsequent trials" can consist of integer number of full-length trials.
1074
1075 If even in the best case scenario the load exceed ratio would be larger
1076 than the goal exceed ratio, the load is an upper bound.
1077 If even in the worst case scenario the load exceed ratio would not be larger
1078 than the goal exceed ratio, the load is a lower bound.
1079
1080 Even more specifically.
1081 Take all trials measured at a given load.
1082 Sum of durations of all bad full-length trials is called the bad sum.
1083 Sum of durations of all good full-length trials is called the good sum.
1084 The result of adding bad sum plus the good sum is called the measured sum.
1085 The larger of the measured sum and the goal duration sum is called the whole sum.
1086 The whole sum minus the measured sum is called the missing sum.
1087 Optimistic exceed ratio is the bad sum divided by the whole sum.
1088 Pessimistic exceed ratio is the bad sum plus the missing sum, that divided by
1089 the whole sum.
1090 If optimistic exceed ratio is larger than the goal exceed ratio,
1091 the load is classified as an upper bound.
1092 If pessimistic exceed ratio is not larger than the goal exceed ratio,
1093 the load is classified as a lower bound.
1094 Else, the load is classified as undecided.
1095
1096 The definition of pessimistic exceed ratio matches the logic in
1097 the conditional throughput computation, so a load is a lower bound
1098 if and only if the conditional throughput effective loss ratio
1099 is not larger than the goal loss ratio.
1100 If it is larger, the load is either an upper bound or undecided.
1101
1102 ## Short Trial Scenarios
1103
1104 Trials with intended duration smaller than the goal final trial duration
1105 are called short trials.
1106 The motivation for load classification logic in presence of short trials
1107 is based around a counter-factual case: What would the trial result be
1108 if a short trial has been measured as a full-length trial instead?
1109
1110 There are three main scenarios where human intuition guides
1111 the intended behavior of load classification.
1112
1113 Scenario one. The user had their reason for not configuring shorter goal
1114 final trial duration. Perhaps SUT has buffers that may get full at longer
1115 trial durations. Perhaps SUT shows periodic decreases of performance
1116 the user does not want to treat as noise. In any case, many good short trials
1117 may became bad full-length trial in the counter-factual case.
1118 In extreme case, there are no bad short trials.
1119 In this scenario, we want the load classification NOT to classify the load
1120 as a lower bound, despite the abundance of good short trials.
1121 Effectively, we want the good short trials to be ignored, so they
1122 do not contribute to comparisons with the goal duration sum.
1123
1124 Scenario two. When there is a frame loss in a short trial,
1125 the counter-factual full-length trial is expected to lose at least as many
1126 frames. And in practice, bad short trials are rarely turning into
1127 good full-length trials. In extreme case, there are no good short trials.
1128 In this scenario, we want the load classification
1129 to classify the load as an upper bound just based on abundance
1130 of short bad trials. Effectively we want the bad short trials
1131 to contribute to comparisons with the goal duration sum,
1132 so the load can be classified sooner.
1133
1134 Scenario three. Some SUTs are quite indifferent to trial duration.
1135 Performance probability function constructed from short trial results
1136 is likely to be similar to performance probability function constructed
1137 from full-length trial results (perhaps with smaller dispersion,
1138 but overall without big impact on the median quantiles).
1139 For moderate goal exceed ratio values, this may mean there are both
1140 good short trials and bad short trials.
1141 This scenario is there just to invalidate a simple heuristic
1142 of always ignoring good short trials and never ignoring bad short trials.
1143 That simple heuristic would be too biased. Yes, the short bad trials
1144 are likely to turn into full-length bad trials in the counter-factual case,
1145 but there is no information on what would the good short trials turn into.
1146 The only way to decide is to do more trials at full length,
1147 the same as in scenario one.
1148
1149 ## Short Trial Logic
1150
1151 MLRsearch picks a particular logic for load classification
1152 in presence of short trials, but it is still RECOMMENDED to use configurations
1153 that imply no short trials, so the possible inefficiencies in that logic
1154 do not affect the result, and the result has better explainability.
1155
1156 With thas said, the logic differs from the single trial duration case
1157 only in different definition of bad sum.
1158 Good sum is still the sum across all good full-length trials.
1159
1160 Few more notions are needed for definig the new bad sum.
1161 Sum of durations of all bad full-length trials is called the bad long sum.
1162 Sum of durations of all bad short trials is called the bad short sum.
1163 Sum of durations of all good short trials is called the good short sum.
1164 One minus the goal exceed ratio is called the inceed ratio.
1165 The goal exceed ratio divided by the inceed ratio is called the exceed coefficient.
1166 The good short sum multiplied by the exceed coefficient is called the balancing sum.
1167 The bad short sum minus the balancing sum is called the excess sum.
1168 If the excess sum is negative, the bad sum is equal to the bad long sum.
1169 Else, the bad sum is equal to the bad long sum plus the excess sum.
1170
1171 Here is how the new definition of the bad sum fares in the three scenarios,
1172 where the load is close to what would relevant bounds be
1173 if only full-length trials were used for the search.
1174
1175 Scenario one. If duration is too short, we expect to see higher frequency
1176 of good short trials. This could lead to negative excess sum,
1177 which has no impact, hence the load classification is given just by
1178 full-length trials.
1179 Thus, MLRsearch using too short trials has no detrimental effect
1180 on result comparability in this scenario.
1181 But also using short trials does not help with overall search duration,
1182 proably making it worse.
1183
1184 Scenario two. Settings with small exceed ratio have small exceed coefficient,
1185 so the impact of good short sum is small and the bad short sum
1186 is almost wholly converted into excess sum, thus bad short trials
1187 have almost as big impact as full-length bad trials.
1188 The same conclusion applies for moderate exceed ratio values
1189 when the good short sum is small.
1190 Thus, short trials can cause a load to get classified as an upper bound earlier
1191 bringing time savings (while not affecting comparability).
1192
1193 Scenario three. Here excess sum is small in absolute value, as balancing sum
1194 is expected to be be similar to the bad short sum.
1195 Once again, full-length trials are needed for final load classification,
1196 but usage of short trials probably means MLRsearch needed shorter search time
1197 before selecting this load for measurement, bringing time savings
1198 (while not affecting comparability).
1199
1200 ## Longer Trial Durations
1201
1202 If there are trial results with intended duration larger
1203 than the goal trial duration, the classification logic is intentionally undefined.
1204
1205 The implementations MAY treat such longer trials as if they were full-length.
1206 In any case, presence of such longer trials in either the relevant lower bound
1207 or the relevant upper bound SHOULD be mentioned, as for sume SUTs
1208 it is likely to affect comparability.
1209
1210
1211 TODO: Here will be a chapter summarizing how MLRsearch library
1212 adresses the problems from the Identified Problems chapter.
1213
1214 {::comment}
1215
1216     # Problems after MLRsearch
1217
1218     Now when MLRsearch is clearly specified and explained,
1219     it is possible to summarize how does MLRsearch specification help with problems.
1220
1221     Here, "multiple trials" is a shorthand for having the goal final trial duration
1222     significantly smaller than the goal duration sum.
1223     This results in MLRsearch performing multiple trials at the same load,
1224     which may not be the case with other configurations.
1225
1226     ## Long Test Duration
1227
1228     As shortening the overall search duration is the main motivation
1229     of MLRsearch library development, the library implements
1230     multiple improvements on this front, both big and small.
1231     Most of implementation details are not part of the MLRsearch specification,
1232     so that future implementations may keep shortening the search duration even more.
1233
1234     TODO: The rest is about attributes.
1235
1236     From the required goal attributes, the goal duration sum
1237     remains the best way to get even shorter searches.
1238
1239     Usage of multiple trials can also save time,
1240     depending on wait times around trial traffic.
1241
1242     The farther the goal exceed ratio is from 0.5 towards either side,
1243     the less predictable the overal search duration becomes in practice.
1244
1245     Width parameter does not change search duration much in practice
1246     (compared to other, mainly optional goal attributes).
1247
1248     ## DUT in SUT
1249
1250     Practice shows big improvements when multiple trials
1251     and moderate exceed ratios are used. Mainly when it comes to result
1252     repeatability, as sometimes it is not easy to distinguish
1253     SUT noise from DUT instability.
1254
1255     Conditional throughput has intuitive meaning when described
1256     using the performance spectrum, so this is an improvement,
1257     especially when compared to search procedures which use non-zero
1258     goal loss ratio but return only the intended load at a lower bound.
1259
1260     Multiple trials can save time also when the noisy end of
1261     the preformance spectrum needs to be examined, e.g. for [RFC9004].
1262
1263     Under some circumstances, testing the same DUT and SUT setup with different
1264     DUT configurations can give some hints on what part of noise us SUT noise
1265     and what part is DUT performance fluctuations.
1266     In practice, both types of noise tend to be too complicated for that analysis.
1267     MLRsearch does not offer additional tools in this regard,
1268     apart of giving users ability to search for more goals,
1269     hoping to get more insight at the cost of longer overall search time.
1270
1271     ## Repeatability and Comparability
1272
1273     Multiple trials improve repeatability, depending on exceed ratio.
1274
1275     In practice, 1s goal final trial duration with exceed ratio 0.5
1276     is good enough for modern SUTs (but that usually requires
1277     smaller wait times around the traffic part of the trial,
1278     otherwise too much search time is wasted waiting).
1279
1280     It is not clear whether exceed ratios higher than 0.5 are better
1281     for repeatability.
1282     The 0.5 value is still preferred due to explainability using median.
1283     TODO: Stress single value is for comparability, which one is due explainability.
1284
1285     It is possible that the conditional throughput values (with non-zero
1286     goal loss ratio) are better for repeatability than the relevant
1287     lower bound values, especially for implementations
1288     which pick load from a small set of discrete values.
1289
1290     Implementations focusing on shortening the overall search time
1291     are automatically forced to avoid comparability issues
1292     due to load selection, as they must prefer even splits wherever possible.
1293     But this conclusion only holds when the same goals are used.
1294     Larger adoption is needed before any further claims on comparability
1295     between MLRsearch implementations can be made.
1296
1297     ## Throughput with Non-Zero Loss
1298
1299     Suported by the goal loss ratio attribute.
1300     Improves repeatability as expected.
1301
1302     ## Inconsistent Trial Results
1303
1304     MLRsearch is conservative wherever possible,
1305     this is built into the definition of conditional throughput,
1306     and into the treatment of short trial results for load classification.
1307
1308     This is consistent with [RFC2544] zero loss tolerance motivation.
1309
1310     If the very best (noiseless) part of the SUT performance spectrum
1311     is of interest, it should be enough to set small value for
1312     the goal final trial duration, and perhaps also a large value
1313     for the goal exceed ratio.
1314
1315     Implementations may offer other (optional) configuration attributes
1316     (and optional goal result attributes)
1317     to become less conservative, but currently it is not clear
1318     what impact would that have on repeatability.
1319
1320 {:/comment}
1321
1322 # IANA Considerations
1323
1324 No requests of IANA.
1325
1326 # Security Considerations
1327
1328 Benchmarking activities as described in this memo are limited to
1329 technology characterization of a DUT/SUT using controlled stimuli in a
1330 laboratory environment, with dedicated address space and the constraints
1331 specified in the sections above.
1332
1333 The benchmarking network topology will be an independent test setup and
1334 MUST NOT be connected to devices that may forward the test traffic into
1335 a production network or misroute traffic to the test management network.
1336
1337 Further, benchmarking is performed on a "black-box" basis, relying
1338 solely on measurements observable external to the DUT/SUT.
1339
1340 Special capabilities SHOULD NOT exist in the DUT/SUT specifically for
1341 benchmarking purposes. Any implications for network security arising
1342 from the DUT/SUT SHOULD be identical in the lab and in production
1343 networks.
1344
1345 # Acknowledgements
1346
1347 Many thanks to Alec Hothan of OPNFV NFVbench project for thorough
1348 review and numerous useful comments and suggestions.
1349
1350 Special wholehearted gratitude and thanks to late Al Morton for his
1351 thorough reviews filled with very specific feedback and constructive
1352 guidelines. Thank you Al for the close collaboration over the years,
1353 for your continuous unwavering encouragements full of empathy and
1354 positive attitude.
1355 Al, you are dearly missed.
1356
1357 # Appendix A
1358
1359 This is a specification of load classification.
1360
1361 The block at the end of this appendix holds pseudocode
1362 which computes two values, stored in variables named optimistic and pessimistic.
1363 The pseudocode happens to be a valid Python code.
1364
1365 If both values are computed to be true, the load in question
1366 is classified as a lower bound according to the goal in question.
1367 If both values are false, the load is classified as an upper bound.
1368 Otherwise, the load is classifies as undecided.
1369
1370 The pseudocole expects the following variables hold values as follows:
1371
1372 - goal_duration_sum: The goal duration sum value.
1373
1374 - goal_exceed_ratio: The goal exceed ratio value.
1375
1376 - good_long_sum: Sum of durations across trials with trial duration
1377   at least equal to the goal final trial duration and with trial loss ratio
1378   not higher than the goal loss ratio.
1379
1380 - bad_long_sum: Sum of durations across trials with trial duration
1381   at least equal to the goal final trial duration and with trial loss ratio
1382   higher than the goal loss ratio.
1383
1384 - good_short_sum: Sum of durations across trials with trial duration
1385   shorter than the goal final trial duration and with trial loss ratio
1386   not higher than the goal loss ratio.
1387
1388 - bad_short_sum: Sum of durations across trials with trial duration
1389   shorter than the goal final trial duration and with trial loss ratio
1390   higher than the goal loss ratio.
1391
1392 Here the implicit set of available trial results consists of all trials
1393 measured at given intended load at the end of search.
1394
1395 The code works correctly also when there are no trial results at given load.
1396
1397 ```
1398 balancing_sum = good_short_sum * goal_exceed_ratio / (1.0 - goal_exceed_ratio)
1399 effective_bad_sum = bad_long_sum + max(0.0, bad_short_sum - balancing_sum)
1400 effective_whole_sum = max(good_long_sum + effective_bad_sum, goal_duration_sum)
1401 quantile_duration_sum = effective_whole_sum * goal_exceed_ratio
1402 optimistic = effective_bad_sum <= quantile_duration_sum
1403 pessimistic = (effective_whole_sum - good_long_sum) <= quantile_duration_sum
1404 ```
1405
1406 # Appendix B
1407
1408 This is a specification of conditional throughput.
1409
1410 The block at the end of this appendix holds pseudocode
1411 which computes a value stored as variable conditional_throughput.
1412 The pseudocode happens to be a valid Python code.
1413
1414 The pseudocole expects the following variables hold values as follows:
1415
1416 - goal_duration_sum: The goal duration sum value.
1417
1418 - goal_exceed_ratio: The goal exceed ratio value.
1419
1420 - good_long_sum: Sum of durations across trials with trial duration
1421   at least equal to the goal final trial duration and with trial loss ratio
1422   not higher than the goal loss ratio.
1423
1424 - bad_long_sum: Sum of durations across trials with trial duration
1425   at least equal to the goal final trial duration and with trial loss ratio
1426   higher than the goal loss ratio.
1427
1428 - long_trials: An iterable of all trial results from trials with trial duration
1429   at least equal to the goal final trial duration,
1430   sorted by increasing trial loss ratio.
1431   A trial result is a composite with the following two attributes available:
1432
1433   - trial.loss_ratio: The trial loss ratio as measured for this trial.
1434
1435   - trial.duration: The trial duration of this trial.
1436
1437 Here the implicit set of available trial results consists of all trials
1438 measured at given intended load at the end of search.
1439
1440 The code works correctly only when there if there is at leas one
1441 trial result measured at given load.
1442
1443 ```
1444 all_long_sum = max(goal_duration_sum, good_long_sum + bad_long_sum)
1445 remaining = all_long_sum * (1.0 - goal_exceed_ratio)
1446 quantile_loss_ratio = None
1447 for trial in long_trials:
1448     if quantile_loss_ratio is None or remaining > 0.0:
1449         quantile_loss_ratio = trial.loss_ratio
1450         remaining -= trial.duration
1451     else:
1452         break
1453 else:
1454     if remaining > 0.0:
1455         quantile_loss_ratio = 1.0
1456
1457 conditional_throughput = intended_load * (1.0 - quantile_loss_ratio)
1458 ```
1459
1460 --- back