docs/ietf/draft-ietf-bmwg-mlrsearch-06.md

   1 ---
   2
   3 title: Multiple Loss Ratio Search
   4 abbrev: MLRsearch
   5 docname: draft-ietf-bmwg-mlrsearch-06
   6 date: 2024-03-04
   7
   8 ipr: trust200902
   9 area: ops
  10 wg: Benchmarking Working Group
  11 kw: Internet-Draft
  12 cat: info
  13
  14 coding: us-ascii
  15 pi:    # can use array (if all yes) or hash here
  16   toc: yes
  17   sortrefs:   # defaults to yes
  18   symrefs: yes
  19
  20 author:
  21       -
  22         ins: M. Konstantynowicz
  23         name: Maciek Konstantynowicz
  24         org: Cisco Systems
  25         email: mkonstan@cisco.com
  26       -
  27         ins: V. Polak
  28         name: Vratko Polak
  29         org: Cisco Systems
  30         email: vrpolak@cisco.com
  31
  32 normative:
  33   RFC1242:
  34   RFC2285:
  35   RFC2544:
  36   RFC9004:
  37
  38 informative:
  39   TST009:
  40     target: https://www.etsi.org/deliver/etsi_gs/NFV-TST/001_099/009/03.04.01_60/gs_NFV-TST009v030401p.pdf
  41     title: "TST 009"
  42   FDio-CSIT-MLRsearch:
  43     target: https://csit.fd.io/cdocs/methodology/measurements/data_plane_throughput/mlr_search/
  44     title: "FD.io CSIT Test Methodology - MLRsearch"
  45     date: 2023-10
  46   PyPI-MLRsearch:
  47     target: https://pypi.org/project/MLRsearch/1.2.1/
  48     title: "MLRsearch 1.2.1, Python Package Index"
  49     date: 2023-10
  50
  51 --- abstract
  52
  53 This document proposes extensions to [RFC2544] throughput search by
  54 defining a new methodology called Multiple Loss Ratio search
  55 (MLRsearch). MLRsearch aims to minimize search duration,
  56 support multiple loss ratio searches,
  57 and enhance result repeatability and comparability.
  58
  59 The primary reason for extending [RFC2544] is to address the challenges
  60 and requirements presented by the evaluation and testing
  61 of software-based networking systems' data planes.
  62
  63 To give users more freedom, MLRsearch provides additional configuration options
  64 such as allowing multiple shorter trials per load instead of one large trial,
  65 tolerating a certain percentage of trial results with higher loss,
  66 and supporting the search for multiple goals with varying loss ratios.
  67
  68 --- middle
  69
  70 {::comment}
  71     As we use Kramdown to convert from Markdown,
  72     we use this way of marking comments not to be visible in the rendered draft.
  73     https://stackoverflow.com/a/42323390
  74     If another engine is used, convert to this way:
  75     https://stackoverflow.com/a/20885980
  76 {:/comment}
  77
  78 # Purpose and Scope
  79
  80 The purpose of this document is to describe Multiple Loss Ratio search
  81 (MLRsearch), a data plane throughput search methodology optimized for software
  82 networking DUTs.
  83
  84 Applying vanilla [RFC2544] throughput bisection to software DUTs
  85 results in several problems:
  86
  87 - Binary search takes too long as most trials are done far from the
  88   eventually found throughput.
  89 - The required final trial duration and pauses between trials
  90   prolong the overall search duration.
  91 - Software DUTs show noisy trial results,
  92   leading to a big spread of possible discovered throughput values.
  93 - Throughput requires a loss of exactly zero frames, but the industry
  94   frequently allows for small but non-zero losses.
  95 - The definition of throughput is not clear when trial results are inconsistent.
  96
  97
  98 To address the problems mentioned above,
  99 the MLRsearch library employs the following enhancements:
 100
 101 - Allow multiple shorter trials instead of one big trial per load.
 102   - Optionally, tolerate a percentage of trial results with higher loss.
 103 - Allow searching for multiple search goals, with differing loss ratios.
 104   - Any trial result can affect each search goal in principle.
 105 - Insert multiple coarse targets for each search goal, earlier ones need
 106   to spend less time on trials.
 107   - Earlier targets also aim for lesser precision.
 108   - Use Forwarding Rate (FR) at maximum offered load
 109     [RFC2285] (section 3.6.2) to initialize the initial targets.
 110 - Take care when dealing with inconsistent trial results.
 111   - Reported throughput is smaller than the smallest load with high loss.
 112   - Smaller load candidates are measured first.
 113 - Apply several load selection heuristics to save even more time
 114   by trying hard to avoid unnecessarily narrow bounds.
 115
 116 Some of these enhancements are formalized as MLRsearch specification,
 117 the remaining enhancements are treated as implementation details,
 118 thus achieving high comparability without limiting future improvements.
 119
 120 MLRsearch configuration options are flexible enough to
 121 support both conservative settings and aggressive settings.
 122 Where the conservative settings lead to results
 123 unconditionally compliant with [RFC2544],
 124 but longer search duration and worse repeatability.
 125 Conversely, aggressive settings lead to shorter search duration
 126 and better repeatability, but the results are not compliant with [RFC2544].
 127
 128 No part of [RFC2544] is intended to be obsoleted by this document.
 129
 130 # Identified Problems
 131
 132 This chapter describes the problems affecting usability
 133 of various performance testing methodologies,
 134 mainly a binary search for [RFC2544] unconditionally compliant throughput.
 135
 136 ## Long Search Duration
 137
 138 The emergence of software DUTs, with frequent software updates and a
 139 number of different frame processing modes and configurations,
 140 has increased both the number of performance tests
 141 required to verify the DUT update and the frequency of running those tests.
 142 This makes the overall test execution time even more important than before.
 143
 144 The current [RFC2544] throughput definition restricts the potential
 145 for time-efficiency improvements.
 146 A more generalized throughput concept could enable further enhancements
 147 while maintaining the precision of simpler methods.
 148
 149 The bisection method, when unconditionally compliant with [RFC2544],
 150 is excessively slow.
 151 This is because a significant amount of time is spent on trials
 152 with loads that, in retrospect, are far from the final determined throughput.
 153
 154 [RFC2544] does not specify any stopping condition for throughput search,
 155 so users already have an access to a limited trade-off
 156 between search duration and achieved precision.
 157 However, each full 60-second trials doubles the precision,
 158 so not many trials can be removed without a substantial loss of precision.
 159
 160 ## DUT in SUT
 161
 162 [RFC2285] defines:
 163 - DUT as
 164   - The network forwarding device to which stimulus is offered and
 165     response measured [RFC2285] (section 3.1.1).
 166 - SUT as
 167   - The collective set of network devices to which stimulus is offered
 168     as a single entity and response measured [RFC2285] (section 3.1.2).
 169
 170 [RFC2544] specifies a test setup with an external tester stimulating the
 171 networking system, treating it either as a single DUT, or as a system
 172 of devices, an SUT.
 173
 174 In the case of software networking, the SUT consists of not only the DUT
 175 as a software program processing frames, but also of
 176 a server hardware and operating system functions,
 177 with server hardware resources shared across all programs
 178 and the operating system running on the same server.
 179
 180 Given that the SUT is a shared multi-tenant environment
 181 encompassing the DUT and other components, the DUT might inadvertently
 182 experience interference from the operating system
 183 or other software operating on the same server.
 184
 185 Some of this interference can be mitigated.
 186 For instance,
 187 pinning DUT program threads to specific CPU cores
 188 and isolating those cores can prevent context switching.
 189
 190 Despite taking all feasible precautions, some adverse effects may still impact
 191 the DUT's network performance.
 192 In this document, these effects are collectively
 193 referred to as SUT noise, even if the effects are not as unpredictable
 194 as what other engineering disciplines call noise.
 195
 196 DUT can also exhibit fluctuating performance itself, for reasons
 197 not related to the rest of SUT; for example due to pauses in execution
 198 as needed for internal stateful processing.
 199 In many cases this
 200 may be an expected per-design behavior, as it would be observable even
 201 in a hypothetical scenario where all sources of SUT noise are eliminated.
 202 Such behavior affects trial results in a way similar to SUT noise.
 203 As the two phenomenons are hard to distinguish,
 204 in this document the term 'noise' is used to encompass
 205 both the internal performance fluctuations of the DUT
 206 and the genuine noise of the SUT.
 207
 208 A simple model of SUT performance consists of an idealized noiseless performance,
 209 and additional noise effects.
 210 For a specific SUT, the noiseless performance is assumed to be constant,
 211 with all observed performance variations being attributed to noise.
 212 The impact of the noise can vary in time, sometimes wildly,
 213 even within a single trial.
 214 The noise can sometimes be negligible, but frequently
 215 it lowers the observed SUT performance as observed in trial results.
 216
 217 In this model, SUT does not have a single performance value, it has a spectrum.
 218 One end of the spectrum is the idealized noiseless performance value,
 219 the other end can be called a noiseful performance.
 220 In practice, trial result
 221 close to the noiseful end of the spectrum happens only rarely.
 222 The worse the performance value is, the more rarely it is seen in a trial.
 223 Therefore, the extreme noiseful end of the SUT spectrum is not observable
 224 among trial results.
 225 Also, the extreme noiseless end of the SUT spectrum
 226 is unlikely to be observable, this time because some small noise effects
 227 are likely to occur multiple times during a trial.
 228
 229 Unless specified otherwise, this document's focus is
 230 on the potentially observable ends of the SUT performance spectrum,
 231 as opposed to the extreme ones.
 232
 233 When focusing on the DUT, the benchmarking effort should ideally aim
 234 to eliminate only the SUT noise from SUT measurements.
 235 However,
 236 this is currently not feasible in practice, as there are no realistic enough
 237 models available to distinguish SUT noise from DUT fluctuations,
 238 based on the author's experience and available literature.
 239
 240 Assuming a well-constructed SUT, the DUT is likely its
 241 primary performance bottleneck.
 242 In this case, we can define the DUT's
 243 ideal noiseless performance as the noiseless end of the SUT performance spectrum,
 244 especially for throughput.
 245 However, other performance metrics, such as latency,
 246 may require additional considerations.
 247
 248 Note that by this definition, DUT noiseless performance
 249 also minimizes the impact of DUT fluctuations, as much as realistically possible
 250 for a given trial duration.
 251
 252 This document aims to solve the DUT in SUT problem
 253 by estimating the noiseless end of the SUT performance spectrum
 254 using a limited number of trial results.
 255
 256 Any improvements to the throughput search algorithm, aimed at better
 257 dealing with software networking SUT and DUT setup, should employ
 258 strategies recognizing the presence of SUT noise, allowing the discovery of
 259 (proxies for) DUT noiseless performance
 260 at different levels of sensitivity to SUT noise.
 261
 262 ## Repeatability and Comparability
 263
 264 [RFC2544] does not suggest to repeat throughput search.
 265 And from just one
 266 discovered throughput value, it cannot be determined how repeatable that value is.
 267 Poor repeatability then leads to poor comparability,
 268 as different benchmarking teams may obtain varying throughput values
 269 for the same SUT, exceeding the expected differences from search precision.
 270
 271 [RFC2544] throughput requirements (60 seconds trial and
 272 no tolerance of a single frame loss) affect the throughput results
 273 in the following way.
 274 The SUT behavior close to the noiseful end of its performance spectrum
 275 consists of rare occasions of significantly low performance,
 276 but the long trial duration makes those occasions not so rare on the trial level.
 277 Therefore, the binary search results tend to wander away from the noiseless end
 278 of SUT performance spectrum, more frequently and more widely than shorter
 279 trials would, thus causing poor throughput repeatability.
 280
 281 The repeatability problem can be addressed by defining a search procedure
 282 that identifies a consistent level of performance,
 283 even if it does not meet the strict definition of throughput in [RFC2544].
 284
 285 According to the SUT performance spectrum model, better repeatability
 286 will be at the noiseless end of the spectrum.
 287 Therefore, solutions to the DUT in SUT problem
 288 will help also with the repeatability problem.
 289
 290 Conversely, any alteration to [RFC2544] throughput search
 291 that improves repeatability should be considered
 292 as less dependent on the SUT noise.
 293
 294 An alternative option is to simply run a search multiple times, and report some
 295 statistics (e.g. average and standard deviation).
 296 This can be used
 297 for a subset of tests deemed more important,
 298 but it makes the search duration problem even more pronounced.
 299
 300 ## Throughput with Non-Zero Loss
 301
 302 [RFC1242] (section 3.17) defines throughput as:
 303     The maximum rate at which none of the offered frames
 304     are dropped by the device.
 305
 306 Then, it says:
 307     Since even the loss of one frame in a
 308     data stream can cause significant delays while
 309     waiting for the higher level protocols to time out,
 310     it is useful to know the actual maximum data
 311     rate that the device can support.
 312
 313 However, many benchmarking teams accept a small,
 314 non-zero loss ratio as the goal for their load search.
 315
 316 Motivations are many:
 317
 318 - Modern protocols tolerate frame loss better,
 319   compared to the time when [RFC1242] and [RFC2544] were specified.
 320
 321 - Trials nowadays send way more frames within the same duration,
 322   increasing the chance of a small SUT performance fluctuation
 323   being enough to cause frame loss.
 324
 325 - Small bursts of frame loss caused by noise have otherwise smaller impact
 326   on the average frame loss ratio observed in the trial,
 327   as during other parts of the same trial the SUT may work more closely
 328   to its noiseless performance, thus perhaps lowering the trial loss ratio
 329   below the goal loss ratio value.
 330
 331 - If an approximation of the SUT noise impact on the trial loss ratio is known,
 332   it can be set as the goal loss ratio.
 333
 334 Regardless of the validity of all similar motivations,
 335 support for non-zero loss goals makes any search algorithm more user-friendly.
 336 [RFC2544] throughput is not user-friendly in this regard.
 337
 338 Furthermore, allowing users to specify multiple loss ratio values,
 339 and enabling a single search to find all relevant bounds,
 340 significantly enhances the usefulness of the search algorithm.
 341
 342 Searching for multiple search goals also helps to describe the SUT performance
 343 spectrum better than the result of a single search goal.
 344 For example, the repeated wide gap between zero and non-zero loss loads
 345 indicates the noise has a large impact on the observed performance,
 346 which is not evident from a single goal load search procedure result.
 347
 348 It is easy to modify the vanilla bisection to find a lower bound
 349 for the intended load that satisfies a non-zero goal loss ratio.
 350 But it is not that obvious how to search for multiple goals at once,
 351 hence the support for multiple search goals remains a problem.
 352
 353 ## Inconsistent Trial Results
 354
 355 While performing throughput search by executing a sequence of
 356 measurement trials, there is a risk of encountering inconsistencies
 357 between trial results.
 358
 359 The plain bisection never encounters inconsistent trials.
 360 But [RFC2544] hints about the possibility of inconsistent trial results,
 361 in two places in its text.
 362 The first place is section 24, where full trial durations are required,
 363 presumably because they can be inconsistent with the results
 364 from shorter trial durations.
 365 The second place is section 26.3, where two successive zero-loss trials
 366 are recommended, presumably because after one zero-loss trial
 367 there can be a subsequent inconsistent non-zero-loss trial.
 368
 369 Examples include:
 370
 371 - A trial at the same load (same or different trial duration) results
 372   in a different trial loss ratio.
 373 - A trial at a higher load (same or different trial duration) results
 374   in a smaller trial loss ratio.
 375
 376 Any robust throughput search algorithm needs to decide how to continue
 377 the search in the presence of such inconsistencies.
 378 Definitions of throughput in [RFC1242] and [RFC2544] are not specific enough
 379 to imply a unique way of handling such inconsistencies.
 380
 381 Ideally, there will be a definition of a new quantity which both generalizes
 382 throughput for non-zero-loss (and other possible repeatability enhancements),
 383 while being precise enough to force a specific way to resolve trial result
 384 inconsistencies.
 385 But until such a definition is agreed upon, the correct way to handle
 386 inconsistent trial results remains an open problem.
 387
 388 # MLRsearch Specification
 389
 390 This chapter focuses on technical definitions needed for evaluating
 391 whether a particular test procedure adheres to MLRsearch specification.
 392
 393 For motivations, explanations, and other comments see other chapters.
 394
 395 ## MLRsearch Architecture
 396
 397 MLRsearch architecture consists of three main components:
 398 the manager, the controller, and the measurer.
 399 For definitions of the components, see the following sections.
 400
 401 The architecture also implies the presence of other components, such as the SUT.
 402
 403 These components can be seen as abstractions present in any testing procedure.
 404
 405 ### Measurer
 406
 407 The measurer is the component that performs one trial
 408 as described in [RFC2544] section 23.
 409
 410 Specifically, one call to the measurer accepts a trial load value
 411 and trial duration value, performs the trial, and returns
 412 the measured trial loss ratio, and optionally a different duration value.
 413
 414 It is the responsibility of the measurer to uphold any requirements
 415 and assumptions present in MLRsearch specification
 416 (e.g. trial forwarding ratio not being larger than one).
 417 Implementers have some freedom, for example in the way they deal with
 418 duplicated frames, or what to return if the tester sent zero frames towards SUT.
 419 Implementations are RECOMMENDED to document their behavior
 420 related to such freedoms in as detailed a way as possible.
 421
 422 Implementations MUST document any deviations from RFC documents,
 423 for example if the wait time around traffic
 424 is shorter than what [RFC2544] section 23 specifies.
 425
 426 ### Controller
 427
 428 The controller selects trial load and duration values
 429 to achieve the search goals in the shortest expected time.
 430
 431 The controller calls the measurer multiple times,
 432 receiving the trial result from each call.
 433 After exit condition is met, the controller returns
 434 the overall search results.
 435
 436 The controller's role in optimizing trial load and duration selection
 437 distinguishes MLRsearch algorithms from simpler search procedures.
 438
 439 For controller inputs, see later section Controller Inputs.
 440 For controller outputs, see later section Controller Outputs.
 441
 442 ### Manager
 443
 444 The controller gets initiated by the manager once, and subsequently calls
 445
 446 The manager is the component that initializes SUT, the traffic generator
 447 (tester in [RFC2544] terminology), the measurer and the controller
 448 with intended configurations.
 449 It then calls the controller once, and receives its outputs.
 450
 451 The manager is also responsible for creating reports in the appropriate format,
 452 based on information in controller outputs.
 453
 454 ## Units
 455
 456 The specification deals with physical quantities, so it is assumed
 457 each numeric value is accompanied by an appropriate physical unit.
 458
 459 The specification does not state which unit is appropriate,
 460 but implementations MUST make it explicit which unit is used
 461 for each value provided or received by the user.
 462
 463 For example, load quantities (including the conditional throughput)
 464 returned by the controller are defined to be based on a single-interface
 465 (unidirectional) loads.
 466 For bidirectional traffic, users are likely
 467 to expect bidirectional throughput quantities, so the manager is responsible
 468 for making its report clear.
 469
 470 ## SUT
 471
 472 As defined in [RFC2285]:
 473 The collective set of network devices to which stimulus is offered
 474 as a single entity and response measured.
 475
 476 ## Trial
 477
 478 A trial is the part of the test described in [RFC2544] section 23.
 479
 480 ### Trial Load
 481
 482 The trial load is the intended constant load for a trial.
 483
 484 Load is the quantity implied by Constant Load of [RFC1242],
 485 Data Rate of [RFC2544] and Intended Load of [RFC2285].
 486 All three specify this value applies to one (input or output) interface.
 487
 488 ### Trial Duration
 489
 490 Trial duration is the intended duration of the traffic for a trial.
 491
 492 In general, this quantity does not include any preparation nor waiting
 493 described in section 23 of [RFC2544].
 494
 495 However, the measurer MAY return a duration value that deviates
 496 from the intended duration.
 497 This feature can be beneficial for users
 498 who wish to manage the overall search duration,
 499 rather than solely the traffic portion of it.
 500 The manager MUST report
 501 how the measurer computes the returned duration values in that case.
 502
 503 ### Trial Forwarding Ratio
 504
 505 The trial forwarding ratio is a dimensionless floating point value
 506 that ranges from 0.0 to 1.0, inclusive.
 507 It is calculated by dividing the number of frames
 508 successfully forwarded by the SUT
 509 by the total number of frames expected to be forwarded during the trial.
 510
 511 Note that, contrary to loads, frame counts used to compute
 512 trial forwarding ratio are aggregates over all SUT output ports.
 513
 514 Questions around what is the correct number of frames
 515 that should have been forwarded is outside of the scope of this document.
 516 E.g. what should the measurer return when it detects
 517 that the offered load differs significantly from the intended load.
 518
 519 ### Trial Loss Ratio
 520
 521 The trial loss ratio is equal to one minus the trial forwarding ratio.
 522
 523 ### Trial Forwarding Rate
 524
 525 The trial forwarding rate is a derived quantity, calculated by
 526 multiplying the trial load by the trial forwarding ratio.
 527
 528 It is important to note that while similar, this quantity is not identical
 529 to the Forwarding Rate as defined in [RFC2285] section 3.6.1,
 530 as the latter is specific to one output interface,
 531 whereas the trial forwarding ratio is based
 532 on frame counts aggregated over all SUT output interfaces.
 533
 534 ## Traffic profile
 535
 536 Any other specifics (besides trial load and trial duration)
 537 the measurer needs in order to perform the trial
 538 are understood as a composite called the traffic profile.
 539 All its attributes are assumed to be constant during the search,
 540 and the composite is configured on the measurer by the manager
 541 before the search starts.
 542
 543 The traffic profile is REQUIRED by [RFC2544]
 544 to contain some specific quantities, for example frame size.
 545 Several more specific quantities may be RECOMMENDED.
 546
 547 Depending on SUT configuration, e.g. when testing specific protocols,
 548 additional values need to be included in the traffic profile
 549 and in the test report.
 550 See other IETF documents.
 551
 552 ## Search Goal
 553
 554 The search goal is a composite consisting of several attributes,
 555 some of them are required.
 556 Implementations are free to add their own attributes.
 557
 558 A particular set of attribute values is called a search goal instance.
 559
 560 Subsections list all required attributes and one recommended attribute.
 561 Each subsection contains a short informal description,
 562 but see other chapters for more in-depth explanations.
 563
 564 The meaning of the attributes is formally given only by their effect
 565 on the controller output attributes (defined in later in section Search Result).
 566
 567 Informally, later chapters give additional intuitions and examples
 568 to the search goal attribute values.
 569 Later chapters also give motivation to formulas of computation of the outputs.
 570
 571 ### Goal Final Trial Duration
 572
 573 A threshold value for trial durations.
 574 This attribute is REQUIRED, and the value MUST be positive.
 575
 576 Informally, while MLRsearch is allowed to perform trials shorter than this,
 577 but results from such short trials have only limited impact on search results.
 578
 579 The full relation needs definitions is later subsections.
 580 But for example, the conditional throughput
 581 (definition in subsection Conditional Throughput)
 582 for this goal will be computed only from trial results
 583 from trials at least as long as this.
 584
 585 ### Goal Duration Sum
 586
 587 A threshold value for a particular sum of trial durations.
 588 This attribute is REQUIRED, and the value MUST be positive.
 589
 590 This uses the duration values returned by the measurer.
 591
 592 Informally, even when looking only at trials done at this goal's
 593 final trial duration, MLRsearch may spend up to this time measuring
 594 the same load value.
 595 If the goal duration sum is larger than
 596 the goal final trial duration, it means multiple trials need to be measured
 597 at the same load.
 598
 599 ### Goal Loss Ratio
 600
 601 A threshold value for trial loss ratios.
 602 REQUIRED attribute, MUST be non-negative and smaller than one.
 603
 604 Informally, if a load causes too many trials with trial loss ratios
 605 larger than this, the conditional throughput for this goal
 606 will be smaller than that load.
 607
 608 ### Goal Exceed Ratio
 609
 610 A threshold value for a particular ratio of duration sums.
 611 REQUIRED attribute, MUST be non-negative and smaller than one.
 612
 613 The duration sum values come from the duration values returned by the measurer.
 614
 615 Informally, the impact of lossy trials is controlled by this value.
 616 The full relation needs definitions is later subsections.
 617
 618 But for example, the definition of the conditional throughput
 619 (given later in subsection Conditional Throughput)
 620 refers to a q-value for a quantile when selecting
 621 which trial result gives the conditional throughput.
 622 The goal exceed ratio acts as the q-value to use there.
 623
 624 Specifically, when the goal exceed ratio is 0.5 and MLRsearch happened
 625 to use the whole goal duration sum (using full-length trials),
 626 it means the conditional throughput is the median of trial forwarding rates.
 627
 628 ### Goal Width
 629
 630 A value used as a threshold for telling when two trial load values
 631 are close enough.
 632
 633 RECOMMENDED attribute, positive.
 634 Implementations without this attribute
 635 MUST give the manager other ways to control the search exit condition.
 636
 637 Absolute load difference and relative load difference are two popular choices,
 638 but implementations may choose a different way to specify width.
 639
 640 Informally, this acts as a stopping condition, controlling the precision
 641 of the search.
 642 The search stops if every goal has reached its precision.
 643
 644 ## Controller Inputs
 645
 646 The only REQUIRED input for controller is a set of search goal instances.
 647 MLRsearch implementations MAY use additional input parameters for the controller.
 648
 649 The order of instances SHOULD NOT have a big impact on controller outputs,
 650 but MLRsearch implementations MAY base their behavior on the order
 651 of search goal instances.
 652
 653 The search goal instances SHOULD NOT be identical.
 654 MLRsearch implementation MAY allow identical instances.
 655
 656 ## Goal Result
 657
 658 Before defining the output of the controller,
 659 it is useful to define what the goal result is.
 660
 661 The goal result is a composite object consisting of several attributes.
 662 A particular set of attribute values is called a goal result instance.
 663
 664 Any goal result instance can be either regular or irregular.
 665 MLRsearch specification puts requirements on regular goal result instances.
 666 Any instance that does not meet the requirements is deemed irregular.
 667
 668 Implementations are free to define their own irregular goal results,
 669 but the manager MUST report them clearly as not regular according to this section.
 670
 671 All attribute values in one goal result instance
 672 are related to a single search goal instance,
 673 referred to as the given search goal.
 674
 675 Some of the attributes of a regular goal result instance are required,
 676 some are recommended, implementations are free to add their own.
 677
 678 The subsections define two required and one optional attribute
 679 for a regular goal result.
 680
 681 A typical irregular result is when all trials at the maximal offered load
 682 have zero loss, as the relevant upper bound does not exist in that case.
 683
 684 ### Relevant Upper Bound
 685
 686 The relevant upper bound is the smallest intended load value that is classified
 687 at the end of the search as an upper bound (see Appendix A)
 688 for the given search goal.
 689 This is a REQUIRED attribute.
 690
 691 Informally, this is the smallest intended load that failed to uphold
 692 all the requirements of the given search goal, mainly the goal loss ratio
 693 in combination with the goal exceed ratio.
 694
 695 ### Relevant Lower Bound
 696
 697 The relevant lower bound is the largest intended load value
 698 among those smaller than the relevant upper bound
 699 that got classified at the end of the search
 700 as a lower bound (see Appendix A) for the given search goal.
 701 This is a REQUIRED attribute.
 702
 703 For a regular goal result, the distance between the relevant lower bound
 704 and the relevant upper bound MUST NOT be larger than the goal width,
 705 if the implementation offers width as a goal attribute.
 706
 707 Informally, this is the largest intended load that managed to uphold
 708 all the requirements of the given search goal, mainly the goal loss ratio
 709 in combination with the goal exceed ratio, while not being larger
 710 than the relevant upper bound.
 711
 712 ### Conditional Throughput
 713
 714 The conditional throughput (see Appendix B)
 715 as evaluated at the relevant lower bound of the given search goal
 716 at the end of the search.
 717 This is a RECOMMENDED attribute.
 718
 719 Informally, this is a typical forwarding rate expected to be seen
 720 at the relevant lower bound of the given search goal.
 721 But frequently just a conservative estimate thereof,
 722 as MLRsearch implementations tend to stop gathering more data
 723 as soon as they confirm the result cannot get worse than this estimate
 724 within the goal duration sum.
 725
 726 ## Search Result
 727
 728 The search result is a single composite object
 729 that maps each search goal to a corresponding goal result.
 730
 731 In other words, search result is an unordered list of key-value pairs,
 732 where no two pairs contain equal keys.
 733 The key is a search goal instance, acting as the given search goal
 734 for the goal result instance in the value portion of the key-value pair.
 735
 736 The search result (as a mapping)
 737 MUST map from all the search goals present in the controller input.
 738
 739 ## Controller Outputs
 740
 741 The search result is the only REQUIRED output
 742 returned from the controller to the manager.
 743
 744 MLRsearch implementation MAY return additional data in the controller output.
 745
 746 # Further Explanations
 747
 748 This chapter focuses on intuitions and motivations
 749 and skips over some important details.
 750
 751 Familiarity with the MLRsearch specification is not required here,
 752 so this chapter can act as an introduction.
 753 For example, this chapter starts talking about the tightest lower bounds
 754 before it is ready to talk about the relevant lower bound from the specification.
 755
 756 ## MLRsearch Versions
 757
 758 The MLRsearch algorithm has been developed in a code-first approach,
 759 a Python library has been created, debugged, and used in production
 760 before the first descriptions (even informal) were published.
 761 In fact, multiple versions of the library were used in the production
 762 over the past few years, and later code was usually not compatible
 763 with earlier descriptions.
 764
 765 The code in (any version of) MLRsearch library fully determines
 766 the search process (for given configuration parameters),
 767 leaving no space for deviations.
 768 MLRsearch, as a name for a broad class of possible algorithms,
 769 leaves plenty of space for future improvements, at the cost
 770 of poor comparability of results of different MLRsearch implementations.
 771
 772 There are two competing needs.
 773 There is the need for standardization in areas critical to comparability.
 774 There is also the need to allow flexibility for implementations
 775 to innovate and improve in other areas.
 776 This document defines the MLRsearch specification
 777 in a manner that aims to fairly balances both needs.
 778
 779 ## Exit Condition
 780
 781 [RFC2544] prescribes that after performing one trial at a specific offered load,
 782 the next offered load should be larger or smaller, based on frame loss.
 783
 784 The usual implementation uses binary search.
 785 Here a lossy trial becomes
 786 a new upper bound, a lossless trial becomes a new lower bound.
 787 The span of values between (including both) the tightest lower bound
 788 and the tightest upper bound forms an interval of possible results,
 789 and after each trial the width of that interval halves.
 790
 791 Usually the binary search implementation tracks only the two tightest bounds,
 792 simply calling them bounds.
 793 But the old values still B remain valid bounds,
 794 just not as tight as the new ones.
 795
 796 After some number of trials, the tightest lower bound becomes the throughput.
 797 [RFC2544] does not specify when (if ever) should the search stop.
 798
 799 MLRsearch library introduces a concept of goal width.
 800 The search stops
 801 when the distance between the tightest upper bound and the tightest lower bound
 802 is smaller than a user-configured value, called goal width from now on.
 803 In other words, the interval width at the end of the search
 804 has to be no larger than the goal width.
 805
 806 This goal width value therefore determines the precision of the result.
 807 As MLRsearch specification requires a particular structure of the result,
 808 the result itself does contain enough information to determine its precision,
 809 thus it is not required to report the goal width value.
 810
 811 This allows MLRsearch implementations to use exit conditions
 812 different from goal width.
 813
 814 ## Load Classification
 815
 816 MLRsearch keeps the basic logic of binary search (tracking tightest bounds,
 817 measuring at the middle), perhaps with minor technical clarifications.
 818 The algorithm chooses an intended load (as opposed to the offered load),
 819 the interval between bounds does not need to be split
 820 exactly into two equal halves,
 821 and the final reported structure specifies both bounds.
 822
 823 The biggest difference is that to classify a load
 824 as an upper or lower bound, MLRsearch may need more than one trial
 825 (depending on configuration options) to be performed at the same intended load.
 826
 827 As a consequence, even if a load already does have few trial results,
 828 it still may be classified as undecided, neither a lower bound nor an upper bound.
 829
 830 An explanation of the classification logic is given in the next chapter,
 831 as it relies heavily on other sections of this chapter.
 832
 833 For repeatability and comparability reasons, it is important that
 834 given a set of trial results, all implementations of MLRsearch
 835 classify the load equivalently.
 836
 837 ## Loss Ratios
 838
 839 The next difference is in the goals of the search.
 840 [RFC2544] has a single goal,
 841 based on classifying full-length trials as either lossless or lossy.
 842
 843 As the name suggests, MLRsearch can search for multiple goals,
 844 differing in their loss ratios.
 845 The precise definition of the goal loss ratio will be given later.
 846 The [RFC2544] throughput goal then simply becomes a zero goal loss ratio.
 847 Different goals also may have different goal widths.
 848
 849 A set of trial results for one specific intended load value
 850 can classify the load as an upper bound for some goals, but a lower bound
 851 for some other goals, and undecided for the rest of the goals.
 852
 853 Therefore, the load classification depends not only on trial results,
 854 but also on the goal.
 855 The overall search procedure becomes more complicated
 856 (compared to binary search with a single goal),
 857 but most of the complications do not affect the final result,
 858 except for one phenomenon, loss inversion.
 859
 860 ## Loss Inversion
 861
 862 In [RFC2544] throughput search using bisection, any load with a lossy trial
 863 becomes a hard upper bound, meaning every subsequent trial has a smaller
 864 intended load.
 865
 866 But in MLRsearch, a load that is classified as an upper bound for one goal
 867 may still be a lower bound for another goal, and due to the other goal
 868 MLRsearch will probably perform trials at even higher loads.
 869 What to do when all such higher load trials happen to have zero loss?
 870 Does it mean the earlier upper bound was not real?
 871 Does it mean the later lossless trials are not considered a lower bound?
 872 Surely we do not want to have an upper bound at a load smaller than a lower bound.
 873
 874 MLRsearch is conservative in these situations.
 875 The upper bound is considered real, and the lossless trials at higher loads
 876 are considered to be a coincidence, at least when computing the final result.
 877
 878 This is formalized using new notions, the relevant upper bound and
 879 the relevant lower bound.
 880 Load classification is still based just on the set of trial results
 881 at a given intended load (trials at other loads are ignored),
 882 making it possible to have a lower load classified as an upper bound,
 883 and a higher load classified as a lower bound (for the same goal).
 884 The relevant upper bound (for a goal) is the smallest load classified
 885 as an upper bound.
 886 But the relevant lower bound is not simply
 887 the largest among lower bounds.
 888 It is the largest load among loads
 889 that are lower bounds while also being smaller than the relevant upper bound.
 890
 891 With these definitions, the relevant lower bound is always smaller
 892 than the relevant upper bound (if both exist), and the two relevant bounds
 893 are used analogously as the two tightest bounds in the binary search.
 894 When they are less than the goal width apart,
 895 the relevant bounds are used in the output.
 896
 897 One consequence is that every trial result can have an impact on the search result.
 898 That means if your SUT (or your traffic generator) needs a warmup,
 899 be sure to warm it up before starting the search.
 900
 901 ## Exceed Ratio
 902
 903 The idea of performing multiple trials at the same load comes from
 904 a model where some trial results (those with high loss) are affected
 905 by infrequent effects, causing poor repeatability of [RFC2544] throughput results.
 906 See the discussion about noiseful and noiseless ends
 907 of the SUT performance spectrum.
 908 Stable results are closer to the noiseless end of the SUT performance spectrum,
 909 so MLRsearch may need to allow some frequency of high-loss trials
 910 to ignore the rare but big effects near the noiseful end.
 911
 912 MLRsearch can do such trial result filtering, but it needs
 913 a configuration option to tell it how frequent can the infrequent big loss be.
 914 This option is called the exceed ratio.
 915 It tells MLRsearch what ratio of trials
 916 (more exactly what ratio of trial seconds) can have a trial loss ratio
 917 larger than the goal loss ratio and still be classified as a lower bound.
 918 Zero exceed ratio means all trials have to have a trial loss ratio
 919 equal to or smaller than the goal loss ratio.
 920
 921 For explainability reasons, the RECOMMENDED value for exceed ratio is 0.5,
 922 as it simplifies some later concepts by relating them to the concept of median.
 923
 924 ## Duration Sum
 925
 926 When more than one trial is needed to classify a load,
 927 MLRsearch also needs something that controls the number of trials needed.
 928 Therefore, each goal also has an attribute called duration sum.
 929
 930 The meaning of a goal duration sum is that when a load has trials
 931 (at full trial duration, details later)
 932 whose trial durations when summed up give a value at least this long,
 933 the load is guaranteed to be classified as an upper bound or a lower bound
 934 for the goal.
 935
 936 As the duration sum has a big impact on the overall search duration,
 937 and [RFC2544] prescribes wait intervals around trial traffic,
 938 the MLRsearch algorithm is allowed to sum durations that are different
 939 from the actual trial traffic durations.
 940
 941 ## Short Trials
 942
 943 MLRsearch requires each goal to specify its final trial duration.
 944 Full-length trial is a shorter name for a trial whose intended trial duration
 945 is equal to (or longer than) the goal final trial duration.
 946
 947 Section 24 of [RFC2544] already anticipates possible time savings
 948 when short trials (shorter than full-length trials) are used.
 949 Full-length trials are the opposite of short trials,
 950 so they may also be called long trials.
 951
 952 Any MLRsearch implementation may include its own configuration options
 953 which control when and how MLRsearch chooses to use shorter trial durations.
 954
 955 For explainability reasons, when exceed ratio of 0.5 is used,
 956 it is recommended for the goal duration sum to be an odd multiple
 957 of the full trial durations, so conditional throughput becomes identical to
 958 a median of a particular set of forwarding rates.
 959
 960 The presence of shorter trial results complicates the load classification logic.
 961 Full details are given later.
 962 In short, results from short trials
 963 may cause a load to be classified as an upper bound.
 964 This may cause loss inversion, and thus lower the relevant lower bound
 965 (below what would classification say when considering full-length trials only).
 966
 967 For explainability reasons, it is RECOMMENDED users use such configurations
 968 that guarantee all trials have the same length.
 969 Alas, such configurations are usually not compliant with [RFC2544] requirements,
 970 or not time-saving enough.
 971
 972 ## Conditional Throughput
 973
 974 As testing equipment takes the intended load as an input parameter
 975 for a trial measurement, any load search algorithm needs to deal
 976 with intended load values internally.
 977
 978 But in the presence of goals with a non-zero loss ratio, the intended load
 979 usually does not match the user's intuition of what a throughput is.
 980 The forwarding rate (as defined in [RFC2285] section 3.6.1) is better,
 981 but it is not obvious how to generalize it
 982 for loads with multiple trial results and a non-zero goal loss ratio.
 983
 984 MLRsearch defines one such generalization, called the conditional throughput.
 985 It is the forwarding rate from one of the trials performed at the load
 986 in question.
 987 Specification of which trial exactly is quite technical,
 988 see the specification and Appendix B.
 989
 990 Conditional throughput is partially related to load classification.
 991 If a load is classified as a lower bound for a goal,
 992 the conditional throughput can be calculated,
 993 and guaranteed to show an effective loss ratio
 994 no larger than the goal loss ratio.
 995
 996 While the conditional throughput gives more intuitive-looking values
 997 than the relevant lower bound, especially for non-zero goal loss ratio values,
 998 the actual definition is more complicated than the definition of the relevant
 999 lower bound.
1000 In the future, other intuitive values may become popular,
1001 but they are unlikely to supersede the definition of the relevant lower bound
1002 as the most fitting value for comparability purposes,
1003 therefore the relevant lower bound remains a required attribute
1004 of the goal result structure, while the conditional throughput is only optional.
1005
1006 Note that comparing the best and worst case, the same relevant lower bound value
1007 may result in the conditional throughput differing up to the goal loss ratio.
1008 Therefore it is rarely needed to set the goal width (if expressed
1009 as the relative difference of loads) below the goal loss ratio.
1010 In other words, setting the goal width below the goal loss ratio
1011 may cause the conditional throughput for a larger loss ratio to become smaller
1012 than a conditional throughput for a goal with a smaller goal loss ratio,
1013 which is counter-intuitive, considering they come from the same search.
1014 Therefore it is RECOMMENDED to set the goal width to a value no smaller
1015 than the goal loss ratio.
1016
1017 ## Search Time
1018
1019 MLRsearch was primarily developed to reduce the time
1020 required to determine a throughput, either the [RFC2544] compliant one,
1021 or some generalization thereof.
1022 The art of achieving short search times
1023 is mainly in the smart selection of intended loads (and intended durations)
1024 for the next trial to perform.
1025
1026 While there is an indirect impact of the load selection on the reported values,
1027 in practice such impact tends to be small,
1028 even for SUTs with quite a broad performance spectrum.
1029
1030 A typical example of two approaches to load selection leading to different
1031 relevant lower bounds is when the interval is split in a very uneven way.
1032 Any implementation choosing loads very close to the current relevant lower bound
1033 is quite likely to eventually stumble upon a trial result
1034 with poor performance (due to SUT noise).
1035 For an implementation choosing loads very close
1036 to the current relevant upper bound, this is unlikely,
1037 as it examines more loads that can see a performance
1038 close to the noiseless end of the SUT performance spectrum.
1039
1040 However, as even splits optimize search duration at give precision,
1041 MLRsearch implementations that prioritize minimizing search time
1042 are unlikely to suffer from any such bias.
1043
1044 Therefore, this document remains quite vague on load selection
1045 and other optimization details, and configuration attributes related to them.
1046 Assuming users prefer libraries that achieve short overall search time,
1047 the definition of the relevant lower bound
1048 should be strict enough to ensure result repeatability
1049 and comparability between different implementations,
1050 while not restricting future implementations much.
1051
1052 Sadly, different implementations may exhibit their sweet spot of
1053 the best repeatability for a given search duration
1054 at different goals attribute values, especially concerning
1055 any optional goal attributes such as the initial trial duration.
1056 Thus, this document does not comment much on which configurations
1057 are good for comparability between different implementations.
1058 For comparability between different SUTs using the same implementation,
1059 refer to configurations recommended by that particular implementation.
1060
1061 ## [RFC2544] compliance
1062
1063 The following search goal ensures unconditional compliance with
1064 [RFC2544] throughput search procedure:
1065
1066 - Goal loss ratio: zero.
1067
1068 - Goal final trial duration: 60 seconds.
1069
1070 - Goal duration sum: 60 seconds.
1071
1072 - Goal exceed ratio: zero.
1073
1074 The presence of other search goals does not affect the compliance
1075 of this goal result.
1076 The relevant lower bound and the conditional throughput are in this case
1077 equal to each other, and the value is the [RFC2544] throughput.
1078
1079 If the 60 second quantity is replaced by a smaller quantity in both attributes,
1080 the conditional throughput is still conditionally compliant with
1081 [RFC2544] throughput.
1082
1083 # Logic of Load Classification
1084
1085 This chapter continues with explanations,
1086 but this time more precise definitions are needed
1087 for readers to follow the explanations.
1088 The definitions here are wordy, implementers should read the specification
1089 chapter and appendices for more concise definitions.
1090
1091 The two related areas of focus in this chapter are load classification
1092 and the conditional throughput, starting with the latter.
1093
1094 The section Performance Spectrum contains definitions
1095 needed to gain insight into what conditional throughput means.
1096 The rest of the subsections discuss load classification,
1097 they do not refer to Performance Spectrum, only to a few duration sums.
1098
1099 For load classification, it is useful to define good and bad trials.
1100 A trial is called bad (according to a goal) if its trial loss ratio
1101 is larger than the goal loss ratio.
1102 The trial that is not bad is called good.
1103
1104 ## Performance Spectrum
1105
1106 There are several equivalent ways to explain
1107 the conditional throughput computation.
1108 One of the ways relies on an object called the performance spectrum.
1109 First, two heavy definitions are needed.
1110
1111 Take an intended load value, a trial duration value, and a finite set
1112 of trial results, all trials measured at that load value and duration value.
1113 The performance spectrum is the function that maps
1114 any non-negative real number into a sum of trial durations among all trials
1115 in the set that has that number as their forwarding rate,
1116 e.g. map to zero if no trial has that particular forwarding rate.
1117
1118 A related function, defined if there is at least one trial in the set,
1119 is the performance spectrum divided by the sum of the durations
1120 of all trials in the set.
1121 That function is called the performance probability function, as it satisfies
1122 all the requirements for probability mass function function
1123 of a discrete probability distribution,
1124 the one-dimensional random variable being the trial forwarding rate.
1125
1126 These functions are related to the SUT performance spectrum,
1127 as sampled by the trials in the set.
1128
1129 As for any other probability function, we can talk about percentiles
1130 of the performance probability function, including the median.
1131 The conditional throughput will be one such quantile value
1132 for a specifically chosen set of trials.
1133
1134 Take a set of all full-length trials performed at the relevant lower bound,
1135 sorted by decreasing forwarding rate.
1136 The sum of the durations of those trials
1137 may be less than the goal duration sum, or not.
1138 If it is less, add an imaginary trial result with zero forwarding rate,
1139 such that the new sum of durations is equal to the goal duration sum.
1140 This is the set of trials to use.
1141 The q-value for the quantile
1142 is the goal exceed ratio.
1143 If the quantile touches two trials,
1144 the larger forwarding rate (from the trial result sorted earlier) is used.
1145 The resulting quantity is the conditional throughput of the goal in question.
1146
1147 First example.
1148 For zero exceed ratio, when goal duration sum has been reached.
1149 The conditional throughput is the smallest forwarding rate among the trials.
1150
1151 Second example.
1152 For zero exceed ratio, when goal duration sum has not been reached yet.
1153 Due to the missing duration sum, the worst case may still happen,
1154 so the conditional throughput is zero.
1155 This is not reported to the user,
1156 as this load cannot become the relevant lower bound yet.
1157
1158 Third example.
1159 Exceed ratio 50%, goal duration sum two seconds,
1160 one trial present with the duration of one second and zero loss.
1161 The imaginary trial is added with the duration
1162 of one second and zero forwarding rate.
1163 The median would touch both trials, so the conditional throughput
1164 is the forwarding rate of the one non-imaginary trial.
1165 As that had zero loss, the value is equal to the offered load.
1166
1167 Note that Appendix B does not take into account short trial results.
1168
1169 ### Summary
1170
1171 While the conditional throughput is a generalization of the forwarding rate,
1172 its definition is not an obvious one.
1173
1174 Other than the forwarding rate, the other source of intuition
1175 is the quantile in general, and the median the the recommended case.
1176
1177 In future, different quantities may prove more useful,
1178 especially when applying to specific problems,
1179 but currently the conditional throughput is the recommended compromise,
1180 especially for repeatability and comparability reasons.
1181
1182 ## Single Trial Duration
1183
1184 When goal attributes are chosen in such a way that every trial has the same
1185 intended duration, the load classification is simpler.
1186
1187 The following description looks technical, but it follows the motivation
1188 of goal loss ratio, goal exceed ratio, and goal duration sum.
1189 If the sum of the durations of all trials (at the given load)
1190 is less than the goal duration sum, imagine best case scenario
1191 (all subsequent trials having zero loss) and worst case scenario
1192 (all subsequent trials having 100% loss).
1193 Here we assume there are as many subsequent trials as needed
1194 to make the sum of all trials equal to the goal duration sum.
1195 As the exceed ratio is defined just using sums of durations
1196 (number of trials does not matter), it does not matter whether
1197 the "subsequent trials" can consist of an integer number of full-length trials.
1198
1199 In any of the two scenarios, we can compute the load exceed ratio,
1200 As the duration sum of good trials divided by the duration sum of all trials,
1201 in both cases including the assumed trials.
1202
1203 If even in the best case scenario the load exceed ratio would be larger
1204 than the goal exceed ratio, the load is an upper bound.
1205 If even in the worst case scenario the load exceed ratio would not be larger
1206 than the goal exceed ratio, the load is a lower bound.
1207
1208 Even more specifically.
1209 Take all trials measured at a given load.
1210 The sum of the durations of all bad full-length trials is called the bad sum.
1211 The sum of the durations of all good full-length trials is called the good sum.
1212 The result of adding the bad sum plus the good sum is called the measured sum.
1213 The larger of the measured sum and the goal duration sum is called the whole sum.
1214 The whole sum minus the measured sum is called the missing sum.
1215 The optimistic exceed ratio is the bad sum divided by the whole sum.
1216 The pessimistic exceed ratio is the bad sum plus the missing sum,
1217 that divided by the whole sum.
1218 If the optimistic exceed ratio is larger than the goal exceed ratio,
1219 the load is classified as an upper bound.
1220 If the pessimistic exceed ratio is not larger than the goal exceed ratio,
1221 the load is classified as a lower bound.
1222 Else, the load is classified as undecided.
1223
1224 The definition of pessimistic exceed ratio is compatible with the logic in
1225 the conditional throughput computation, so in this single trial duration case,
1226 a load is a lower bound if and only if the conditional throughput
1227 effective loss ratio is not larger than the goal loss ratio.
1228 If it is larger, the load is either an upper bound or undecided.
1229
1230 ## Short Trial Scenarios
1231
1232 Trials with intended duration smaller than the goal final trial duration
1233 are called short trials.
1234 The motivation for load classification logic in the presence of short trials
1235 is based around a counter-factual case: What would the trial result be
1236 if a short trial has been measured as a full-length trial instead?
1237
1238 There are three main scenarios where human intuition guides
1239 the intended behavior of load classification.
1240
1241 False good scenario.
1242 The user had their reason for not configuring a shorter goal
1243 final trial duration.
1244 Perhaps SUT has buffers that may get full at longer
1245 trial durations.
1246 Perhaps SUT shows periodic decreases in performance
1247 the user does not want to be treated as noise.
1248 In any case, many good short trials may become bad full-length trials
1249 in the counter-factual case.
1250 In extreme cases, there are plenty of good short trials and no bad short trials.
1251 In this scenario, we want the load classification NOT to classify the load
1252 as a lower bound, despite the abundance of good short trials.
1253 Effectively, we want the good short trials to be ignored, so they
1254 do not contribute to comparisons with the goal duration sum.
1255
1256 True bad scenario.
1257 When there is a frame loss in a short trial,
1258 the counter-factual full-length trial is expected to lose at least as many
1259 frames.
1260 And in practice, bad short trials are rarely turning into
1261 good full-length trials.
1262 In extreme cases, there are no good short trials.
1263 In this scenario, we want the load classification
1264 to classify the load as an upper bound just based on the abundance
1265 of short bad trials.
1266 Effectively, we want the bad short trials
1267 to contribute to comparisons with the goal duration sum,
1268 so the load can be classified sooner.
1269
1270 Balanced scenario.
1271 Some SUTs are quite indifferent to trial duration.
1272 Performance probability function constructed from short trial results
1273 is likely to be similar to the performance probability function constructed
1274 from full-length trial results (perhaps with larger dispersion,
1275 but without a big impact on the median quantiles overall).
1276 For a moderate goal exceed ratio value, this may mean there are both
1277 good short trials and bad short trials.
1278 This scenario is there just to invalidate a simple heuristic
1279 of always ignoring good short trials and never ignoring bad short trials.
1280 That simple heuristic would be too biased.
1281 Yes, the short bad trials
1282 are likely to turn into full-length bad trials in the counter-factual case,
1283 but there is no information on what would the good short trials turn into.
1284 The only way to decide safely is to do more trials at full length,
1285 the same as in scenario one.
1286
1287 ## Short Trial Logic
1288
1289 MLRsearch picks a particular logic for load classification
1290 in the presence of short trials, but it is still RECOMMENDED
1291 to use configurations that imply no short trials,
1292 so the possible inefficiencies in that logic
1293 do not affect the result, and the result has better explainability.
1294
1295 With that said, the logic differs from the single trial duration case
1296 only in different definition of the bad sum.
1297 The good sum is still the sum across all good full-length trials.
1298
1299 Few more notions are needed for defining the new bad sum.
1300 The sum of durations of all bad full-length trials is called the bad long sum.
1301 The sum of durations of all bad short trials is called the bad short sum.
1302 The sum of durations of all good short trials is called the good short sum.
1303 One minus the goal exceed ratio is called the inceed ratio.
1304 The goal exceed ratio divided by the inceed ratio is called the exceed coefficient.
1305 The good short sum multiplied by the exceed coefficient is called the balancing sum.
1306 The bad short sum minus the balancing sum is called the excess sum.
1307 If the excess sum is negative, the bad sum is equal to the bad long sum.
1308 Otherwise, the bad sum is equal to the bad long sum plus the excess sum.
1309
1310 Here is how the new definition of the bad sum fares in the three scenarios,
1311 where the load is close to what would the relevant bounds be
1312 if only full-length trials were used for the search.
1313
1314 False good scenario.
1315 If the duration is too short, we expect to see a higher frequency
1316 of good short trials.
1317 This could lead to a negative excess sum,
1318 which has no impact, hence the load classification is given just by
1319 full-length trials.
1320 Thus, MLRsearch using too short trials has no detrimental effect
1321 on result comparability in this scenario.
1322 But also using short trials does not help with overall search duration,
1323 probably making it worse.
1324
1325 True bad cenario.
1326 Settings with a small exceed ratio
1327 have a small exceed coefficient, so the impact of the good short sum is small,
1328 and the bad short sum is almost wholly converted into excess sum,
1329 thus bad short trials have almost as big an impact as full-length bad trials.
1330 The same conclusion applies to moderate exceed ratio values
1331 when the good short sum is small.
1332 Thus, short trials can cause a load to get classified as an upper bound earlier,
1333 bringing time savings (while not affecting comparability).
1334
1335 Balanced scenario.
1336 Here excess sum is small in absolute value, as the balancing sum
1337 is expected to be similar to the bad short sum.
1338 Once again, full-length trials are needed for final load classification;
1339 but usage of short trials probably means MLRsearch needed
1340 a shorter overall search time before selecting this load for measurement,
1341 thus bringing time savings (while not affecting comparability).
1342
1343 Note that in presence of short trial results,
1344 the comparibility between the load classification
1345 and the conditional throughput is only partial.
1346 The conditional throughput still comes from a good long trial,
1347 but a load higher than the relevant lower bound may also compute to a good value.
1348
1349 ## Longer Trial Durations
1350
1351 If there are trial results with an intended duration larger
1352 than the goal trial duration, the precise definitions
1353 in Appendix A and Appendix B treat them in exactly the same way
1354 as trials with duration equal to the goal trial duration.
1355
1356 But in configurations with moderate (including 0.5) or small
1357 goal exceed ratio and small goal loss ratio (especially zero),
1358 bad trials with longer than goal durations may bias the search
1359 towards the lower load values, as the noiseful end of the spectrum
1360 gets a larger probability of causing the loss within the longer trials.
1361
1362 For some users, this is an acceptable price
1363 for increased configuration flexibility
1364 (perhaps saving time for the related goals),
1365 so implementations SHOULD allow such configurations.
1366 Still, users are encouraged to avoid such configurations
1367 by making all goals use the same final trial duration,
1368 so their results remain comparable across implementations.
1369
1370 # Addressed Problems
1371
1372 Now when MLRsearch is clearly specified and explained,
1373 it is possible to summarize how does MLRsearch specification help with problems.
1374
1375 Here, "multiple trials" is a shorthand for having the goal final trial duration
1376 significantly smaller than the goal duration sum.
1377 This results in MLRsearch performing multiple trials at the same load,
1378 which may not be the case with other configurations.
1379
1380 ## Long Test Duration
1381
1382 As shortening the overall search duration is the main motivation
1383 of MLRsearch library development, the library implements
1384 multiple improvements on this front, both big and small.
1385
1386 Most of implementation details are not constrained by the MLRsearch specification,
1387 so that future implementations may keep shortening the search duration even more.
1388
1389 One exception is the impact of short trial results on the relevant lower bound.
1390 While motivated by human intuition, the logic is not straightforward.
1391 In practice, configurations with only one common trial duration value
1392 are capable of achieving good overal search time and result repeatability
1393 without the need to consider short trials.
1394
1395 ### Impact of goal attribute values
1396
1397 From the required goal attributes, the goal duration sum
1398 remains the best way to get even shorter searches.
1399
1400 Usage of multiple trials can also save time,
1401 depending on wait times around trial traffic.
1402
1403 The farther the goal exceed ratio is from 0.5 (towards zero or one),
1404 the less predictable the overal search duration becomes in practice.
1405
1406 Width parameter does not change search duration much in practice
1407 (compared to other, mainly optional goal attributes).
1408
1409 ## DUT in SUT
1410
1411 In practice, using multiple trials and moderate exceed ratios
1412 often improves result repeatability without increasing the overall search time,
1413 depending on the specific SUT and DUT characteristics.
1414 Benefits for separating SUT noise are less clear though,
1415 as it is not easy to distinguish SUT noise from DUT instability in general.
1416
1417 Conditional throughput has an intuitive meaning when described
1418 using the performance spectrum, so this is an improvement
1419 over existing simple (less configurable) search procedures.
1420
1421 Multiple trials can save time also when the noisy end of
1422 the preformance spectrum needs to be examined, e.g. for [RFC9004].
1423
1424 Under some circumstances, testing the same DUT and SUT setup with different
1425 DUT configurations can give some hints on what part of noise is SUT noise
1426 and what part is DUT performance fluctuations.
1427 In practice, both types of noise tend to be too complicated for that analysis.
1428
1429 MLRsearch enables users to search for multiple goals,
1430 potentially providing more insight at the cost of a longer overall search time.
1431 However, for a thorough and reliable examination of DUT-SUT interactions,
1432 it is necessary to employ additional methods beyond black-box benchmarking,
1433 such as collecting and analyzing DUT and SUT telemetry.
1434
1435 ## Repeatability and Comparability
1436
1437 Multiple trials improve repeatability, depending on exceed ratio.
1438
1439 In practice, one-second goal final trial duration with exceed ratio 0.5
1440 is good enough for modern SUTs.
1441 However, unless smaller wait times around the traffic part of the trial
1442 are allowed, too much of overal search time would be wasted on waiting.
1443
1444 It is not clear whether exceed ratios higher than 0.5 are better
1445 for repeatability.
1446 The 0.5 value is still preferred due to explainability using median.
1447
1448 It is possible that the conditional throughput values (with non-zero goal
1449 loss ratio) are better for repeatability than the relevant lower bound values.
1450 This is especially for implementations
1451 which pick load from a small set of discrete values,
1452 as that hides small variances in relevant lower bound values
1453 other implementations may find.
1454
1455 Implementations focusing on shortening the overall search time
1456 are automatically forced to avoid comparability issues due to load selection,
1457 as they must prefer even splits wherever possible.
1458 But this conclusion only holds when the same goals are used.
1459 Larger adoption is needed before any further claims on comparability
1460 between MLRsearch implementations can be made.
1461
1462 ## Throughput with Non-Zero Loss
1463
1464 Trivially suported by the goal loss ratio attribute.
1465
1466 In practice, usage of non-zero loss ratio values
1467 improves the result repeatability
1468 (exactly as expected based on results from simpler search methods).
1469
1470 ## Inconsistent Trial Results
1471
1472 MLRsearch is conservative wherever possible.
1473 This is built into the definition of conditional throughput,
1474 and into the treatment of short trial results for load classification.
1475
1476 This is consistent with [RFC2544] zero loss tolerance motivation.
1477
1478 If the noiseless part of the SUT performance spectrum is of interest,
1479 it should be enough to set small value for the goal final trial duration,
1480 and perhaps also a large value for the goal exceed ratio.
1481
1482 Implementations may offer other (optional) configuration attributes
1483 to become less conservative, but currently it is not clear
1484 what impact would that have on repeatability.
1485
1486 # IANA Considerations
1487
1488 No requests of IANA.
1489
1490 # Security Considerations
1491
1492 Benchmarking activities as described in this memo are limited to
1493 technology characterization of a DUT/SUT using controlled stimuli in a
1494 laboratory environment, with dedicated address space and the constraints
1495 specified in the sections above.
1496
1497 The benchmarking network topology will be an independent test setup and
1498 MUST NOT be connected to devices that may forward the test traffic into
1499 a production network or misroute traffic to the test management network.
1500
1501 Further, benchmarking is performed on a "black-box" basis, relying
1502 solely on measurements observable external to the DUT/SUT.
1503
1504 Special capabilities SHOULD NOT exist in the DUT/SUT specifically for
1505 benchmarking purposes. Any implications for network security arising
1506 from the DUT/SUT SHOULD be identical in the lab and in production
1507 networks.
1508
1509 # Acknowledgements
1510
1511 Some phrases and statements in this document were created
1512 with help of Mistral AI (mistral.ai).
1513
1514 Many thanks to Alec Hothan of the OPNFV NFVbench project for thorough
1515 review and numerous useful comments and suggestions.
1516
1517 Special wholehearted gratitude and thanks to the late Al Morton for his
1518 thorough reviews filled with very specific feedback and constructive
1519 guidelines. Thank you Al for the close collaboration over the years,
1520 for your continuous unwavering encouragement full of empathy and
1521 positive attitude.
1522 Al, you are dearly missed.
1523
1524 # Appendix A: Load Classification
1525
1526 This is the specification of how to perform the load classification.
1527
1528 Any intended load value can be classified, according to the given search goal.
1529
1530 The algorithm uses (some subsets of) the set of all available trial results
1531 from trials measured at a given intended load at the end of the search.
1532 All durations are those returned by the measurer.
1533
1534 The block at the end of this appendix holds pseudocode
1535 which computes two values, stored in variables named optimistic and pessimistic.
1536 The pseudocode happens to be a valid Python code.
1537
1538 If both values are computed to be true, the load in question
1539 is classified as a lower bound according to the given search goal.
1540 If both values are false, the load is classified as an upper bound.
1541 Otherwise, the load is classified as undecided.
1542
1543 The pseudocode expects the following variables to hold values as follows:
1544
1545 - goal_duration_sum: The duration sum value of the given search goal.
1546
1547 - goal_exceed_ratio: The exceed ratio value of the given search goal.
1548
1549 - good_long_sum: Sum of durations across trials with trial duration
1550   at least equal to the goal final trial duration and with a trial loss ratio
1551   not higher than the goal loss ratio.
1552
1553 - bad_long_sum: Sum of durations across trials with trial duration
1554   at least equal to the goal final trial duration and with a trial loss ratio
1555   higher than the goal loss ratio.
1556
1557 - good_short_sum: Sum of durations across trials with trial duration
1558   shorter than the goal final trial duration and with a trial loss ratio
1559   not higher than the goal loss ratio.
1560
1561 - bad_short_sum: Sum of durations across trials with trial duration
1562   shorter than the goal final trial duration and with a trial loss ratio
1563   higher than the goal loss ratio.
1564
1565 The code works correctly also when there are no trial results at the given load.
1566
1567 ~~~ python
1568 balancing_sum = good_short_sum * goal_exceed_ratio / (1.0 - goal_exceed_ratio)
1569 effective_bad_sum = bad_long_sum + max(0.0, bad_short_sum - balancing_sum)
1570 effective_whole_sum = max(good_long_sum + effective_bad_sum, goal_duration_sum)
1571 quantile_duration_sum = effective_whole_sum * goal_exceed_ratio
1572 optimistic = effective_bad_sum <= quantile_duration_sum
1573 pessimistic = (effective_whole_sum - good_long_sum) <= quantile_duration_sum
1574 ~~~
1575
1576 # Appendix B: Conditional Throughput
1577
1578 This is the specification of how to compute conditional throughput.
1579
1580 Any intended load value can be used as the basis for the following computation,
1581 but only the relevant lower bound (at the end of the search)
1582 leads to the value called the conditional throughput for a given search goal.
1583
1584 The algorithm uses (some subsets of) the set of all available trial results
1585 from trials measured at a given intended load at the end of the search.
1586 All durations are those returned by the measurer.
1587
1588 The block at the end of this appendix holds pseudocode
1589 which computes a value stored as variable conditional_throughput.
1590 The pseudocode happens to be a valid Python code.
1591
1592 The pseudocode expects the following variables to hold values as follows:
1593
1594 - goal_duration_sum: The duration sum value of the given search goal.
1595
1596 - goal_exceed_ratio: The exceed ratio value of the given search goal.
1597
1598 - good_long_sum: Sum of durations across trials with trial duration
1599   at least equal to the goal final trial duration and with a trial loss ratio
1600   not higher than the goal loss ratio.
1601
1602 - bad_long_sum: Sum of durations across trials with trial duration
1603   at least equal to the goal final trial duration and with a trial loss ratio
1604   higher than the goal loss ratio.
1605
1606 - long_trials: An iterable of all trial results from trials with trial duration
1607   at least equal to the goal final trial duration,
1608   sorted by increasing the trial loss ratio.
1609   A trial result is a composite with the following two attributes available:
1610
1611   - trial.loss_ratio: The trial loss ratio as measured for this trial.
1612
1613   - trial.duration: The trial duration of this trial.
1614
1615 The code works correctly only when there if there is at least one
1616 trial result measured at a given load.
1617
1618 ~~~ python
1619 all_long_sum = max(goal_duration_sum, good_long_sum + bad_long_sum)
1620 remaining = all_long_sum * (1.0 - goal_exceed_ratio)
1621 quantile_loss_ratio = None
1622 for trial in long_trials:
1623     if quantile_loss_ratio is None or remaining > 0.0:
1624         quantile_loss_ratio = trial.loss_ratio
1625         remaining -= trial.duration
1626     else:
1627         break
1628 else:
1629     if remaining > 0.0:
1630         quantile_loss_ratio = 1.0
1631 conditional_throughput = intended_load * (1.0 - quantile_loss_ratio)
1632 ~~~
1633
1634 --- back