+ than the trial loss ratio at that lower bound, but still not larger
+ than the target loss ratio.
+* TODO: Search algorithm.
+* TODO: Precision goal.
+* TODO: Define a "benchmarking group".
+* TODO: Upper and lower bound.
+* TODO: Valid and invalid bound?
+* TODO: Interval and interval width?
+
+TODO: Mention NIC/PCI bandwidth/pps limits can be lower than bandwidth of medium.
+
+# Intentions of this document
+
+{::comment}
+ Instead of talking about DUTs being non-deterministic
+ and vendors "gaming" in order to get better Throughput results,
+ Maciek and Vratko currently prefer to talk about result repeatability.
+{:/comment}
+
+The intention of this document is to provide recommendations for:
+* optimizing search for multiple target loss ratios at once,
+* speeding up the overall search time,
+* improve search results repeatability and comparability.
+
+No part of RFC2544 is intended to be obsoleted by this document.
+
+{::comment}
+ This document may contain examples which contradict RFC2544 requirements
+ and suggestions.
+ That is not an ecouragement for benchmarking groups
+ to stop being compliant with RFC2544.
+{:/comment}
+
+# RFC2544
+
+## Throughput search
+
+It is useful to restate the key requirements of RFC2544
+using the new terminology (see section Terminology).
+
+The following sections of RFC2544 are of interest for this document.
+
+* https://datatracker.ietf.org/doc/html/rfc2544#section-20
+ Mentions the max load SHOULD not be larget than the theoretical
+ maximum rate for the frame size on the media.
+
+* https://datatracker.ietf.org/doc/html/rfc2544#section-23
+ Lists the actions to be done for each trial measurement,
+ it also mentions loss rate as an example of trial measurement results.
+ This document uses loss count instead, as that is the quantity
+ that is easier for the current test equipment to measure,
+ e.g. it is not affected by the real traffic duration.
+ TODO: Time uncertainty again.
+
+* https://datatracker.ietf.org/doc/html/rfc2544#section-24
+ Mentions "full length trials" leading to the Throughput found,
+ as opposed to shorter trial durations, allowed in an attempt
+ to "minimize the length of search procedure".
+ This document talks about "final trial duration" and aims to
+ "optimize overal search time".
+
+* https://datatracker.ietf.org/doc/html/rfc2544#section-26.1
+ with https://www.rfc-editor.org/errata/eid422
+ finaly states requirements for the search procedure.
+ It boils down to "increase intended load upon zero trial loss
+ and decrease intended load upon non-zero trial loss".
+
+No additional constraints are placed on the load selection,
+and there is no mention of an exit condition, e.g. when there is enough
+trial measurements to proclaim the largest load with zero trial loss
+(and final trial duration) to be the Throughput found.
+
+{::comment}
+ The following section is probably not useful enough.
+
+ ## Generalized search
+
+ Note that the Throughput search can be restated as a "conditional
+ load search" with a specific condition.
+
+ "increase intended load upon trial result satisfying the condition
+ and decrease intended load upon trial result not satisfying the condition"
+ where the Throughput condition is "trial loss count is zero".
+
+ This works for any condition that can be evaluated from a single
+ trial measurement result, and is likely to be true at low loads
+ and false at high loads.
+
+ MLRsearch can incorporate multiple different conditions,
+ as long as there is total ligical ordering between them
+ (e.g. if a condition for a target loss ratio is not satisfied,
+ it is also not satisfied for any other codition which uses
+ larger target loss ratio).
+
+ TODO: How to call a "load associated with this particular condition"?
+{:/comment}
+
+{::comment}
+
+ TODO: Not sure if this subsection is needed an where.
+
+ ## Simple bisection
+
+ There is one obvious and simple search algorithm which conforms
+ to throughput search requirements: simple bijection.
+
+ Input: target precision, in frames per second.
+
+ Procedure:
+
+ 1. Chose min load to be zero.
+ 1. No need to measure, loss count has to be zero.
+ 2. Use the zero load as the current lower bound.
+ 2. Chose max load to be the max value allowed by bandwidth of the medium.
+ 1. Perform a trial measurement (at the full length duration) at max load.
+ 2. If there is zero trial loss count, return max load as Throughput.
+ 3. Use max load as the current upper bound.
+ 3. Repeat until the difference between lower bound and upper bound is
+ smaller or equal to the precision goal.
+ 1. If it is not larget, return the current lower bound as Throughput.
+ 2. Else: Chose new load as the arithmetic average of lower and upper bound.
+ 3. Perform a trial measurement (at the full length duration) at this load.
+ 4. If the trial loss rate is zero, consider the load as new lower bound.
+ 5. Else consider the load as the new upper bound.
+ 6. Jump back to the repeat at 3.
+
+ Another possible stop condition is the overal search time so far,
+ but that is not really a different condition, as the time for search to reach
+ the precision goal is just a function of precision goal, trial duration
+ and the difference between max and min load.
+
+ While this algorithm can be accomodated to search for multiple
+ target loss ratios "at the same time (see somewhere below),
+ it is still missing multiple improvement which give MLRsearch
+ considerably better overal search time in practice.
+
+{:/comment}
+
+# Problems
+
+## Repeatability and Comparability
+
+RFC2544 does not suggest to repeat Throughput search,
+{::comment}probably because the full set of tests already takes long{:/comment}
+and from just one Throughput value, it cannot be determined
+how repeatable that value is (how likely it is for a repeated Throughput search
+to end up with a value less then the precision goal away from the first value).
+
+Depending on SUT behavior, different benchmark groups
+can report significantly different Througput values,
+even when using identical SUT and test equipment,
+just because of minor differences in their search algorithm
+(e.g. different max load value).
+
+While repeatability can be addressed by repeating the search several times,
+the differences in the comparability scenario may be systematic,
+e.g. seeming like a bias in one or both benchmark groups.
+
+MLRsearch algorithm does not really help with the repeatability problem.
+This document RECOMMENDS to repeat a selection of "important" tests
+ten times, so users can ascertain the repeatability of the results.
+
+TODO: How to report? Average and standard deviation?
+
+Following MLRsearch algorithm leaves less freedom for the benchmark groups
+to encounter the comparability problem,
+alghough more research is needed to determine the effect
+of MLRsearch's tweakable parameters.
+
+{::comment}
+ Possibly, the old DUTs were quite sharply consistent in their performance,
+ and/or precision goals were quite large in order to save overal search time.
+
+ With software DUTs and with time-efficient search algorithms,
+ nowadays the repeatability of Throughput can be quite low,
+ as in standard deviation of repeated Througput results
+ is considerably higher than the precision goal.
+{:/comment}
+
+{::comment}
+ TODO: Unify with PLRsearch draft.
+ TODO: No-loss region, random region, lossy region.
+ TODO: Tweaks with respect to non-zero loss ratio goal.
+ TODO: Duration dependence?
+
+ Both RFC2544 and MLRsearch return Throughput somewhere inside the random region,
+ or at most the precision goal below it.
+{:/comment}
+
+{::comment}
+ TODO: Make sure this is covered elsewhere, then delete.
+
+ ## Search repeatability
+
+ The goal of RFC1242 and RFC2544 is to limit how vendors benchmark their DUTs,
+ in order to force them to report values that have higher chance
+ to be confirmed by independent benchmarking groups following the same RFCs.
+
+ This works well for deterministic DUTs.
+
+ But for non-deterministic DUTs, the RFC2544 Throughput value
+ is only guaranteed to fall somewhere below the lossy region (TODO define).
+ It is possible to arrive at a value positioned likely high in the random region
+ at the cost of increased overall search duration,
+ simply by lowering the load by very small amounts (instead of exact halving)
+ upon lossy trial and increasing by large amounts upon lossless trial.
+
+ Prescribing an exact search algorithm (bisection or MLRsearch or other)
+ will force vendors to report less "gamey" Throughput values.
+{:/comment}
+
+{::comment}
+ ## Extensions
+
+ The following two sections are probably out of scope,
+ as they does not affect MLRsearch design choices.
+
+ ### Direct and inverse measurements
+
+ TODO expand: Direct measurement is single trial measurement,
+ with predescribed inputs and outputs turned directly into the quality of interest
+ Examples:
+ Latency https://datatracker.ietf.org/doc/html/rfc2544#section-26.2
+ is a single direct measurement.
+ Frame loss rate https://datatracker.ietf.org/doc/html/rfc2544#section-26.3
+ is a sequence of direct measurements.
+
+ TODO expand: Indirect measurement aims to solve an "inverse function problem",
+ meaning (a part of) trial measurement output is prescribed, and the quantity
+ of interest is (derived from) the input parameters of trial measurement
+ that achieves the prescribed output.
+ In general this is a hard problem, but if the unknown input parameter
+ is just one-dimensional quantity, algorithms such as bisection
+ do converge regardless of outputs seen.
+ We call any such algorithm examining one-dimensional input as "search".
+ Of course, some exit condition is needed for the search to end.
+ In case of Throughput, bisection algorithm tracks both upper bound
+ and lower bound, with lower bound at the end of search is the quantity
+ satisfying the definition of Throughput.
+
+ ### Metrics other than frames
+
+ TODO expand: Small TCP transaction can succeed even if some frames are lost.
+
+ TODO expand: It is possible for loss ratio to use different metric than load.
+ E.g. pps loss ratio when traffic profile uses higher level transactions per second.
+
+ ### TODO: Stateful DUT
+
+ ### TODO: Stateful traffic
+{:/comment}
+
+## Non-Zero Target Loss Ratios
+
+https://datatracker.ietf.org/doc/html/rfc1242#section-3.17
+defines Throughput as:
+ The maximum rate at which none of the offered frames
+ are dropped by the device.
+
+and then it says:
+ Since even the loss of one frame in a
+ data stream can cause significant delays while
+ waiting for the higher level protocols to time out,
+ it is useful to know the actual maximum data
+ rate that the device can support.
+
+{::comment}
+
+ While this may still be true for some protocols,
+ research has been performed...
+
+ TODO: Add this link properly: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-Y.1541-201112-I!!PDF-E&type=items
+ TODO: List values from that document, from 10^-3 to 4*10^-6.
+
+ ...on other protocols and use cases,
+ resulting in some small but non-zero loss ratios being considered
+ as acceptable. Unfortunately, the acceptable value depends on use case
+ and properties such as TCP window size and round trip time,
+ so no single value of target loss rate (other than zero)
+ is considered to be universally applicable.
+
+{:/comment}
+
+New "software DUTs" (traffic forwarding programs running on
+commercial-off-the-shelf compute server hardware) frequently exhibit quite
+low repeatability of Throughput results per above definition.
+
+This is due to, in general, throughput rates of software DUTs (programs)
+being sensitive to server resource allocation by OS during runtime,
+as well as any interrupts or blocking of software threads involved
+in packet processing.
+
+To deal with this, this document recommends discovery of multiple throughput rates of interest for software DUTs that run on general purpose COTS servers (with x86, AArch64 Instruction Set Architectures):
+* throughput rate with target of zero packet loss ratio.
+* at least one throughput rate with target of non-zero packet loss ratio.
+
+
+In our experience, the higher the target loss ratio is,
+the better is the repeatability of the corresponding load found.
+
+TODO: Define a good name for a load corresponding to a specific non-zero
+target loss ration, while keeping Throughput for the load corresponding
+to zero target loss ratio.
+
+This document RECOMMENDS the benchmark groups to search for corresponding loads
+to at least one non-zero target loss ratio.
+This document does not suggest any particular non-zero target loss ratio value
+to search the corresponding load for.
+
+{::comment}
+ What is worse, some benchmark groups (which groups?; citation needed)
+ started reporting loads that achieved only "approximate zero loss",
+ while still calling that a Throughput (and thus becoming non-compliant
+ with RFC2544).
+{:/comment}
+
+# Solution ideas
+
+This document gives several independent ideas on how to lower the (average)
+overall search time, while remaining unconditionally compliant with RFC2544
+(and adding some of extensions).
+
+This document also specifies one particular way to combine all the ideas
+into a single search algorithm class (single logic with few tweakable parameters).
+
+Little to no research has been done into the question of which combination
+of ideas achieves the best compromise with respect to overal search time,
+high repeatability and high comparability.
+
+TODO: How important it is to discuss particular implementation choices,
+especially when motivated by non-deterministic SUT behavior?
+
+## Short duration trials
+
+https://datatracker.ietf.org/doc/html/rfc2544#section-24
+already mentions the possibity of using shorter duration
+for trials that are not part of "final determination".
+
+Obviously, the upper and lower bound from a smaller duration trial
+can be used as the initial upper and lower bound for the final determination.
+
+MLRsearch makes it clear a re-measurement is always needed
+(new trial measurement with the same load but longer duration).
+It also specifes what to do if the longer trial is no longer a valid bound
+(TODO define?), e.g. start an external search.
+Additionaly one halving can be saved during the shorter duration search.
+
+## FRMOL as reasonable start
+
+TODO expand: Overal search ends with "final determination" search,
+preceded by "shorter duration search" preceded by "bound initialization",
+where the bounds can be considerably different from min and max load.
+
+For SUTs with high repeatability, the FRMOL is usually a good approximation
+of Throughput. But for less repeatable SUTs, forwarding rate (TODO define)
+is frequently a bad approximation to Throughput, therefore halving
+and other robust-to-worst-case approaches have to be used.
+Still, forwarding rate at FRMOL load can be a good initial bound.
+
+## Non-zero loss ratios
+
+See the "Popularity of non-zero target loss ratios" section above.
+
+TODO: Define "trial measurement result classification criteria",
+or keep reusing long phrases without definitions?
+
+A search for a load corresponding to a non-zero target loss rate
+is very similar to a search for Throughput,
+just the criterion when to increase or decrease the intended load
+for the next trial measurement uses the comparison of trial loss ratio
+to the target loss ratio (instead of comparing loss count to zero)
+Any search algorithm that works for Throughput can be easily used also for
+non-zero target loss rates, perhaps with small modifications
+in places where the measured forwarding rate is used.
+
+Note that it is possible to search for multiple loss ratio goals if needed.
+
+## Concurrent ratio search
+
+A single trial measurement result can act as an upper bound for a lower
+target loss ratio, and as a lower bound for a higher target loss ratio
+at the same time. This is an example of how
+it can be advantageous to search for all loss ratio goals "at once",
+or at least "reuse" trial measurement result done so far.
+
+Even when a search algorithm is fully deterministic in load selection
+while focusing on a single loss ratio and trial duration,
+the choice of iteration order between target loss ratios and trial durations
+can affect the obtained results in subtle ways.
+MLRsearch offers one particular ordering.
+
+{::comment}
+ It is not clear if the current ordering is "best",
+ it is not even clear how to measure how good an ordering is.
+ We would need several models for bad SUT behaviors,
+ bug-free implementations of different orderings,
+ simulator to show the distribution of rates found,
+ distribution of overall durations,
+ and a criterion of which rate distribution is "bad"
+ and whether it is worth the time saved.
+{:/comment}
+{::comment}
+{:/comment}
+
+## Load selection heuristics and shortcuts
+
+Aside of the two heuristics already mentioned (FRMOL based initial bounds
+and saving one halving when increasing trial duration),
+there are other tricks that can save some overall search time
+at the cost of keeping the difference between final lower and upper bound
+intentionally large (but still within the precision goal).
+
+TODO: Refer implementation subsections on:
+* Uneven splits.
+* Rounding the interval width up.
+* Using old invalid bounds for interval width guessing.
+
+The impact on overall duration is probably small,
+and the effect on result distribution maybe even smaller.
+TODO: Is the two-liner above useful at all?
+
+# Non-compliance with RFC2544
+
+It is possible to achieve even faster search times by abandoning
+some requirements and suggestions of RFC2544,
+mainly by reducing the wait times at start and end of trial.
+
+Such results are therefore no longer compliant with RFC2544
+(or at least not unconditionally),
+but they may still be useful for internal usage, or for comparing
+results of different DUTs achieved with an identical non-compliant algorithm.
+
+TODO: Refer to the subsection with CSIT customizations.
+
+# Additional Requirements
+
+RFC2544 can be understood as having a number of implicit requirements.
+They are made explicit in this section
+(as requirements for this document, not for RFC2544).
+
+Recommendations on how to properly address the implicit requirements
+are out of scope of this document.
+
+{::comment}
+
+ Although some (insufficient) ideas are proposed.
+
+{:/comment}
+
+## TODO: Search Stop Criteria
+
+TODO: Mention the timeout parameter?
+
+{::comment}
+
+ TODO: highlight importance of results consistency
+ for SUT performance trending and anomaly detection.
+
+{:/comment}
+
+## Reliability of Test Equipment
+
+Both TG and TA MUST be able to handle correctly
+every intended load used during the search.
+
+On TG side, the difference between Intended Load and Offered Load
+MUST be small.
+
+TODO: How small? Difference of one packet may not be measurable
+due to time uncertainties.
+
+{::comment}
+
+ Maciek: 1 packet out of 10M, that's 10**-7 accuracy.
+
+ Vratko: For example, TRex uses several "worker" threads, each doing its own
+ rounding on how many packets to send, separately per each traffic stream.
+ For high loads and durations, the observed number of frames transmitted
+ can differ from the expected (fractional) value by tens of frames.
+
+{:/comment}
+
+TODO expand: time uncertainty.
+
+To ensure that, max load (see Terminology) has to be set to low enough value.
+Benchmark groups MAY list the max load value used,
+especially if the Throughput value is equal (or close) to the max load.
+
+{::comment}
+
+ The following is probably out of scope of this document,
+ but can be useful when put into a separate document.
+
+ TODO expand: If it results in smaller Throughput reported,
+ it is not a big issue. Treat similarly to bandwidth and PPS limits of NICs.
+
+ TODO expand: TA dropping packets when loaded only lowers Throughput,
+ so not an issue.
+
+ TODO expand: TG sending less packets but stopping at target duration
+ is also fine, as long as the forwarding rate is used as Throughput value,
+ not the higher intended load.
+
+ TODO expand: Duration stretching is not fine.
+ Neither "check for actual duration" nor "start+sleep+stop"
+ are reliable solutions due to time overheads and uncertainty
+ of TG starting/stopping traffic (and TA stopping counting packets).
+
+{:/comment}
+
+Solutions (even problem formulations) for the following open problems
+are outside of the scope of this document:
+* Detecting when the test equipment operates above its safe load.
+* Finding a large but safe load value.
+* Correcting any result affected by max load value not being a safe load.
+
+{::comment}
+
+ TODO: Mention 90% of self-test as an idea:
+ https://datatracker.ietf.org/doc/html/rfc8219#section-9.2.1
+
+ This is pointing to DNS testing, nothing to do with throughput,
+ so how is it relevant here?
+
+{:/comment}
+
+{::comment}
+
+ Part of discussion on BMWG mailing list (with small edits):
+
+ This is a hard issue.
+ The algorithm as described has no way of knowing
+ which part of the whole system is limiting the performance.
+
+ It could be SUT only (no problem, testing SUT as expected),
+ it could be TG only (can be mitigated by TG self-test
+ and using small enough loads).
+
+ But it could also be an interaction between DUT and TG.
+ Imagine a TG (the Traffic Analyzer part) which is only able
+ to handle incoming traffic up to some rate,
+ but passes the self-test as the Generator part has maximal rate
+ not larger than that. But what if SUT turns that steady rate
+ into long-enough bursts of a higher rate (with delays between bursts
+ large enough, so average forwarding rate matches the load).
+ This way TA will see some packets as missing (when its buffers
+ fill up), even though SUT has processed them correctly.
+
+{:/comment}
+
+### Very late frames
+
+{::comment}
+
+ In CSIT we are aggressive at skipping all wait times around trial,
+ but few of DUTs have large enough buffers.
+ Or there is another reason why we are seeing negative loss counts.
+
+{:/comment}
+
+
+RFC2544 requires quite conservative time delays
+see https://datatracker.ietf.org/doc/html/rfc2544#section-23
+to prevent frames buffered in one trial measurement
+to be counted as received in a subsequent trial measurement.
+
+However, for some SUTs it may still be possible to buffer enough frames,
+so they are still sending them (perhaps in bursts)
+when the next trial measurement starts.
+Sometimes, this can be detected as a negative trial loss count, e.g. TA receiving
+more frames than TG has sent during this trial measurement. Frame duplication
+is another way of causing the negative trial loss count.
+
+https://datatracker.ietf.org/doc/html/rfc2544#section-10
+recommends to use sequence numbers in frame payloads,
+but generating and verifying them requires test equipment resources,
+which may be not plenty enough to suport at high loads.
+(Using low enough max load would work, but frequently that would be
+smaller than SUT's sctual Throughput.)
+
+RFC2544 does not offer any solution to the negative loss problem,
+except implicitly treating negative trial loss counts
+the same way as positive trial loss counts.
+
+This document also does not offer any practical solution.
+
+Instead, this document SUGGESTS the search algorithm to take any precaution
+necessary to avoid very late frames.
+
+This document also REQUIRES any detected duplicate frames to be counted
+as additional lost frames.
+This document also REQUIRES, any negative trial loss ratio
+to be treated as positive trial loss ratio of the same absolute value.
+
+{::comment}
+
+ !!! Make sure this is covered elsewere, at least in better comments. !!!
+
+ ## TODO: Bad behavior of SUT
+
+ (Highest load with always zero loss can be quite far from lowest load
+ with always nonzero loss.)
+ (Non-determinism: warm up, periodic "stalls", perf decrease over time, ...)
+
+ Big buffers:
+ http://www.hit.bme.hu/~lencse/publications/ECC-2017-B-M-DNS64-revised.pdf
+ See page 8 and search for the word "gaming".
+
+{:/comment}
+
+!!! Nothing below is up-to-date with draft v02. !!!