doc/guides/sample_app_ug/performance_thread.rst

   1 ..  BSD LICENSE
   2     Copyright(c) 2015 Intel Corporation. All rights reserved.
   3     All rights reserved.
   4
   5     Redistribution and use in source and binary forms, with or without
   6     modification, are permitted provided that the following conditions
   7     are met:
   8
   9     * Re-distributions of source code must retain the above copyright
  10     notice, this list of conditions and the following disclaimer.
  11     * Redistributions in binary form must reproduce the above copyright
  12     notice, this list of conditions and the following disclaimer in
  13     the documentation and/or other materials provided with the
  14     distribution.
  15     * Neither the name of Intel Corporation nor the names of its
  16     contributors may be used to endorse or promote products derived
  17     from this software without specific prior written permission.
  18
  19     THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  20     "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  21     LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  22     A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  23     OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  24     SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
  25     LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  26     DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  27     THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
  28     (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  29     OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  30
  31
  32 Performance Thread Sample Application
  33 =====================================
  34
  35 The performance thread sample application is a derivative of the standard L3
  36 forwarding application that demonstrates different threading models.
  37
  38 Overview
  39 --------
  40 For a general description of the L3 forwarding application's capabilities,
  41 please refer to the documentation of the standard application in
  42 :doc:`l3_forward`.
  43
  44 The performance thread sample application differs from the standard L3
  45 forwarding example in that it divides the TX and RX processing between
  46 different threads, and makes it possible to assign individual threads to
  47 different cores.
  48
  49 Three threading models are considered:
  50
  51 #. When there is one EAL thread per physical core.
  52 #. When there are multiple EAL threads per physical core.
  53 #. When there are multiple lightweight threads per EAL thread.
  54
  55 Since DPDK release 2.0 it is possible to launch applications using the
  56 ``--lcores`` EAL parameter, specifying cpu-sets for a physical core. With the
  57 performance thread sample application it is now also possible to assign
  58 individual RX and TX functions to different cores.
  59
  60 As an alternative to dividing the L3 forwarding work between different EAL
  61 threads the performance thread sample introduces the possibility to run the
  62 application threads as lightweight threads (L-threads) within one or
  63 more EAL threads.
  64
  65 In order to facilitate this threading model the example includes a primitive
  66 cooperative scheduler (L-thread) subsystem. More details of the L-thread
  67 subsystem can be found in :ref:`lthread_subsystem`.
  68
  69 **Note:** Whilst theoretically possible, it is not anticipated that multiple
  70 L-thread schedulers would be run on the same physical core. This mode of
  71 operation should not be expected to yield useful performance and is considered
  72 invalid.
  73
  74 Compiling the Application
  75 -------------------------
  76 The application is located in the ``performance-thread`` sub-directory of the
  77 sample applications folder.
  78
  79 #.  Go to the example applications folder:
  80
  81     .. code-block:: console
  82
  83        export RTE_SDK=/path/to/rte_sdk
  84        cd ${RTE_SDK}/examples/performance-thread/l3fwd-thread
  85
  86 #.  Set the target (a default target is used if not specified). For example:
  87
  88     .. code-block:: console
  89
  90        export RTE_TARGET=x86_64-native-linuxapp-gcc
  91
  92     See the *DPDK Linux Getting Started Guide* for possible RTE_TARGET values.
  93
  94 #.  Build the application::
  95
  96         make
  97
  98
  99 Running the Application
 100 -----------------------
 101
 102 The application has a number of command line options::
 103
 104     ./build/l3fwd-thread [EAL options] --
 105         -p PORTMASK [-P]
 106         --rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]
 107         --tx (lcore,thread)[,(lcore,thread)]
 108         [--enable-jumbo [--max-pkt-len PKTLEN]] [--no-numa]
 109         [--hash-entry-num] [--ipv6] [--no-lthreads] [--stat-lcore lcore]
 110         [--parse-ptype]
 111
 112 Where:
 113
 114 * ``-p PORTMASK``: Hexadecimal bitmask of ports to configure.
 115
 116 * ``-P``: optional, sets all ports to promiscuous mode so that packets are
 117   accepted regardless of the packet's Ethernet MAC destination address.
 118   Without this option, only packets with the Ethernet MAC destination address
 119   set to the Ethernet address of the port are accepted.
 120
 121 * ``--rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]``: the list of
 122   NIC RX ports and queues handled by the RX lcores and threads. The parameters
 123   are explained below.
 124
 125 * ``--tx (lcore,thread)[,(lcore,thread)]``: the list of TX threads identifying
 126   the lcore the thread runs on, and the id of the RX thread with which it is
 127   associated. The parameters are explained below.
 128
 129 * ``--enable-jumbo``: optional, enables jumbo frames.
 130
 131 * ``--max-pkt-len``: optional, maximum packet length in decimal (64-9600).
 132
 133 * ``--no-numa``: optional, disables NUMA awareness.
 134
 135 * ``--hash-entry-num``: optional, specifies the hash entry number in hex to be
 136   set up.
 137
 138 * ``--ipv6``: optional, set it if running IPv6 packets.
 139
 140 * ``--no-lthreads``: optional, disables the L-thread model and uses the EAL
 141   threading model. See below.
 142
 143 * ``--stat-lcore``: optional, run CPU load stats collector on the specified
 144   lcore.
 145
 146 * ``--parse-ptype``: optional, set to use software to analyze packet type.
 147   Without this option, hardware will check the packet type.
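
As an illustration of the ``-p PORTMASK`` format, the following Python sketch
decodes a hexadecimal mask into port numbers, with bit *n* of the mask
selecting port *n*. It is not part of the application, and the helper name
``ports_from_mask`` is hypothetical.

```python
def ports_from_mask(portmask_hex: str) -> list[int]:
    """Return the port ids whose bits are set in a hexadecimal port mask."""
    mask = int(portmask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


print(ports_from_mask("3"))    # [0, 1], as selected by "-p 3" in the examples
print(ports_from_mask("0x5"))  # [0, 2]
```

With ``-p 3``, as used throughout this guide, ports 0 and 1 are configured.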
 148
 149 The parameters of the ``--rx`` and ``--tx`` options are:
 150
 151 * ``--rx`` parameters
 152
 153    .. _table_l3fwd_rx_parameters:
 154
 155    +--------+------------------------------------------------------+
 156    | port   | RX port                                              |
 157    +--------+------------------------------------------------------+
 158    | queue  | RX queue that will be read on the specified RX port  |
 159    +--------+------------------------------------------------------+
 160    | lcore  | Core to use for the thread                           |
 161    +--------+------------------------------------------------------+
 162    | thread | Thread id (continuously from 0 to N)                 |
 163    +--------+------------------------------------------------------+
 164
 165
 166 * ``--tx`` parameters
 167
 168    .. _table_l3fwd_tx_parameters:
 169
 170    +--------+------------------------------------------------------+
 171    | lcore  | Core to use for L3 route match and transmit          |
 172    +--------+------------------------------------------------------+
 173    | thread | Id of RX thread to be associated with this TX thread |
 174    +--------+------------------------------------------------------+
 175
 176 The ``l3fwd-thread`` application allows you to start packet processing in two
 177 threading models: L-Threads (default) and EAL Threads (when the
 178 ``--no-lthreads`` parameter is used). For consistency all parameters are used
 179 in the same way for both models.
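
To make the relationship between the ``--rx`` and ``--tx`` tuples concrete,
the sketch below parses both option values and pairs each TX entry with the RX
thread id it names. It is an illustrative Python model only, not the
application's own C parser; ``parse_tuples`` and the dictionary keys are names
chosen for this example.

```python
import re


def parse_tuples(arg: str, fields: tuple[str, ...]) -> list[dict[str, int]]:
    """Parse "(a,b,...)(c,d,...)" groups into one dict per group."""
    result = []
    for group in re.findall(r"\(([^()]*)\)", arg):
        values = [int(v) for v in group.split(",")]
        if len(values) != len(fields):
            raise ValueError(f"expected {len(fields)} values in ({group})")
        result.append(dict(zip(fields, values)))
    return result


rx = parse_tuples("(0,0,0,0)(1,0,1,1)", ("port", "queue", "lcore", "thread"))
tx = parse_tuples("(2,0)(3,1)", ("lcore", "thread"))

# Pair each TX entry with the RX thread whose id it names.
rx_by_thread = {r["thread"]: r for r in rx}
for t in tx:
    r = rx_by_thread[t["thread"]]
    print(f"RX thread {r['thread']} (port {r['port']}, queue {r['queue']}, "
          f"lcore {r['lcore']}) -> TX on lcore {t['lcore']}")
```

For the values shown, RX thread 0 on lcore 0 feeds the TX thread on lcore 2,
and RX thread 1 on lcore 1 feeds the TX thread on lcore 3.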
 180
 181
 182 Running with L-threads
 183 ~~~~~~~~~~~~~~~~~~~~~~
 184
 185 When the L-thread model is used (default option), lcore and thread parameters
 186 in ``--rx/--tx`` are used to affinitize threads to the selected scheduler.
 187
 188 For example, the following places each L-thread on a different lcore::
 189
 190    l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
 191                 --rx="(0,0,0,0)(1,0,1,1)" \
 192                 --tx="(2,0)(3,1)"
 193
 194 The following places the RX L-threads on lcore 0 and the TX L-threads on
 195 lcores 1 and 2::
 196
 197    l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
 198                 --rx="(0,0,0,0)(1,0,0,1)" \
 199                 --tx="(1,0)(2,1)"
 200
 201
 202 Running with EAL threads
 203 ~~~~~~~~~~~~~~~~~~~~~~~~
 204
 205 When the ``--no-lthreads`` parameter is used, the L-threading model is turned
 206 off and EAL threads are used for all processing. EAL threads are enumerated in
 207 the same way as L-threads, but the ``--lcores`` EAL parameter is used to
 208 affinitize threads to the selected cpu-set (scheduler). Thus it is possible to
 209 place every RX and TX thread on different lcores.
 210
 211 For example, the following places each EAL thread on a different lcore::
 212
 213    l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
 214                 --rx="(0,0,0,0)(1,0,1,1)" \
 215                 --tx="(2,0)(3,1)" \
 216                 --no-lthreads
 217
 218
 219 To affinitize two or more EAL threads to one cpu-set, the EAL ``--lcores``
 220 parameter is used.
 221
 222 The following places the RX EAL threads (lcores 0 and 1) on cpu-set 0 and the
 223 TX EAL threads (lcores 2 and 3) on cpu-set 1::
 224
 225    l3fwd-thread -l 0-7 -n 2 --lcores="(0,1)@0,(2,3)@1" -- -P -p 3 \
 226                 --rx="(0,0,0,0)(1,0,1,1)" \
 227                 --tx="(2,0)(3,1)" \
 228                 --no-lthreads
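
The ``--lcores`` value used above can be read as a map from lcore id to CPU.
The Python sketch below expands only the ``(lcores)@cpu`` forms that appear in
this guide. It is a simplified illustration, not the EAL parser, and does not
cover the full ``--lcores`` syntax. The function names are hypothetical.

```python
import re


def expand(spec: str) -> list[int]:
    """Expand "0,1", "0-3" or a mix such as "0,2-3" into a list of ids."""
    ids = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids


def parse_lcores(arg: str) -> dict[int, int]:
    """Map each lcore id to its CPU for entries of the form "(lcores)@cpu"."""
    mapping = {}
    for lcore_spec, cpu in re.findall(r"\(([^()]*)\)@(\d+)", arg):
        for lcore in expand(lcore_spec):
            mapping[lcore] = int(cpu)
    return mapping


print(parse_lcores("(0,1)@0,(2,3)@1"))  # {0: 0, 1: 0, 2: 1, 3: 1}
print(parse_lcores("(0-3)@0"))          # {0: 0, 1: 0, 2: 0, 3: 0}
```

Read this way, ``(0,1)@0,(2,3)@1`` runs lcores 0 and 1 on CPU 0 and lcores 2
and 3 on CPU 1.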
 229
 230
 231 Examples
 232 ~~~~~~~~
 233
 234 For selected scenarios, the L-thread command line and the equivalent EAL thread
 235 command line are shown below:
 236
 237 a) Start every thread on a different scheduler (1:1)::
 238
 239       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
 240                    --rx="(0,0,0,0)(1,0,1,1)" \
 241                    --tx="(2,0)(3,1)"
 242
 243    EAL thread equivalent::
 244
 245       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
 246                    --rx="(0,0,0,0)(1,0,1,1)" \
 247                    --tx="(2,0)(3,1)" \
 248                    --no-lthreads
 249
 250 b) Start all threads on one core (N:1).
 251
 252    Start 4 L-threads on lcore 0::
 253
 254       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
 255                    --rx="(0,0,0,0)(1,0,0,1)" \
 256                    --tx="(0,0)(0,1)"
 257
 258    Start 4 EAL threads on cpu-set 0::
 259
 260       l3fwd-thread -l 0-7 -n 2 --lcores="(0-3)@0" -- -P -p 3 \
 261                    --rx="(0,0,0,0)(1,0,1,1)" \
 262                    --tx="(2,0)(3,1)" \
 263                    --no-lthreads
 264
 265 c) Start threads on different cores (N:M).
 266
 267    Start 2 L-threads for RX on lcore 0, and 2 L-threads for TX on lcore 1::
 268
 269       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
 270                    --rx="(0,0,0,0)(1,0,0,1)" \
 271                    --tx="(1,0)(1,1)"
 272
 273    Start 2 EAL threads for RX on cpu-set 0, and 2 EAL threads for TX on
 274    cpu-set 1::
 275
 276       l3fwd-thread -l 0-7 -n 2 --lcores="(0-1)@0,(2-3)@1" -- -P -p 3 \
 277                    --rx="(0,0,0,0)(1,0,1,1)" \
 278                    --tx="(2,0)(3,1)" \
 279                    --no-lthreads
 280
 281 Explanation
 282 -----------
 283
 284 To a great extent the sample application differs little from the standard L3
 285 forwarding application, and readers are advised to familiarize themselves with
 286 the material covered in the :doc:`l3_forward` documentation before proceeding.
 287
 288 The following explanation is focused on the way threading is handled in the
 289 performance thread example.
 290
 291
 292 Mode of operation with EAL threads
 293 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 294
 295 The performance thread sample application has split the RX and TX functionality
 296 into two different threads, and the RX and TX threads are
 297 interconnected via software rings. With respect to these rings the RX threads
 298 are producers and the TX threads are consumers.
 299
 300 On initialization the TX and RX threads are started according to the command
 301 line parameters.
 302
 303 The RX threads poll the network interface queues and post received packets to a
 304 TX thread via a corresponding software ring.
 305
 306 The TX threads poll software rings, perform the L3 forwarding hash/LPM match,
 307 and assemble packet bursts before performing burst transmit on the network
 308 interface.
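
The producer/consumer relationship described above can be modelled in a few
lines. The following Python sketch is an analogy only: a ``deque`` stands in
for the software ring, a dictionary lookup stands in for the hash/LPM match,
and ``BURST_SIZE`` and all function names are assumptions made for the
example.

```python
from collections import deque

BURST_SIZE = 4       # hypothetical burst size for this sketch

ring = deque()       # stand-in for the RX -> TX software ring
tx_buffer = []       # packets waiting to be sent as one burst
sent_bursts = []     # records every burst handed to "the NIC"


def rx_poll(nic_queue: deque) -> None:
    """RX thread body: move everything received from the NIC onto the ring."""
    while nic_queue:
        ring.append(nic_queue.popleft())


def tx_poll(routes: dict[str, int]) -> None:
    """TX thread body: dequeue, look up the output port, send full bursts."""
    while ring:
        dst = ring.popleft()
        tx_buffer.append((dst, routes.get(dst, 0)))  # placeholder route match
        if len(tx_buffer) == BURST_SIZE:
            sent_bursts.append(list(tx_buffer))
            tx_buffer.clear()


nic = deque(["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.2"])
rx_poll(nic)
tx_poll({"10.0.0.1": 0, "10.0.0.2": 1})
print(len(sent_bursts), "burst sent;", len(tx_buffer), "packet left in buffer")
```

Five packets yield one full burst and one residual packet that stays in the
transmit buffer until it is drained.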
 309
 310 As with the standard L3 forward application, burst draining of residual packets
 311 is performed periodically, with the period calculated from elapsed time using
 312 the timestamp counter (TSC).
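
A minimal sketch of such a time-based drain check is given below. It is
written in Python for illustration and uses a nanosecond clock value in place
of the timestamp counter. ``DRAIN_PERIOD_NS`` and the ``TxBuffer`` class are
hypothetical and do not reflect the values or structures used by the
application.

```python
import time

DRAIN_PERIOD_NS = 100_000  # hypothetical drain period (100 microseconds)


class TxBuffer:
    """Holds residual packets and remembers when it was last drained."""

    def __init__(self, now_ns: int):
        self.packets = []
        self.last_drain_ns = now_ns

    def maybe_drain(self, now_ns: int) -> list:
        """Flush residual packets if the drain period has elapsed."""
        if now_ns - self.last_drain_ns < DRAIN_PERIOD_NS:
            return []
        flushed, self.packets = self.packets, []
        self.last_drain_ns = now_ns
        return flushed


buf = TxBuffer(now_ns=time.monotonic_ns())
buf.packets.append("pkt0")
start = buf.last_drain_ns
print(buf.maybe_drain(start + 10_000))   # too early: []
print(buf.maybe_drain(start + 150_000))  # period elapsed: ['pkt0']
```

Passing the current time in as an argument keeps the check cheap: the polling
loop reads the clock once per iteration and compares it with the stored value.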
 313
 314 The diagram below illustrates a case with two RX threads and three TX threads.
 315
 316 .. _figure_performance_thread_1:
 317
 318 .. figure:: img/performance_thread_1.*
 319
 320
 321 Mode of operation with L-threads
 322 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 323
 324 Like the EAL thread configuration, the application has split the RX and TX
 325 functionality into different threads, and the pairs of RX and TX threads are
 326 interconnected via software rings.
 327
 328 On initialization an L-thread scheduler is started on every EAL thread. On all
 329 but the master EAL thread only a dummy L-thread is initially started.
 330 The L-thread started on the master EAL thread then spawns other L-threads on
 331 different L-thread schedulers according to the command line parameters.
 332
 333 The RX threads poll the network interface queues and post received packets
 334 to a TX thread via the corresponding software ring.
 335
 336 The ring interface is augmented by means of an L-thread condition variable that
 337 enables the TX thread to be suspended when the TX ring is empty. The RX thread
 338 signals the condition whenever it posts to the TX ring, causing the TX thread
 339 to be resumed.
 340
 341 Additionally the TX L-thread spawns a worker L-thread to take care of
 342 polling the software rings, whilst it handles burst draining of the transmit
 343 buffer.
 344
 345 The worker threads poll the software rings, perform L3 route lookup and
 346 assemble packet bursts. If the TX ring is empty the worker thread suspends
 347 itself by waiting on the condition variable associated with the ring.
 348
 349 Burst draining of residual packets, less than the burst size, is performed by
 350 the TX thread which sleeps (using an L-thread sleep function) and resumes
 351 periodically to flush the TX buffer.
 352
 353 This design means that L-threads that have no work can yield the CPU to other
 354 L-threads and avoid having to constantly poll the software rings.
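
The suspend-and-signal behaviour can be sketched with Python generators
standing in for L-threads. This is a conceptual model under simplifying
assumptions (one ring, one worker, one scheduler) and does not use the
L-thread API. The ``"wait"`` and ``"yield"`` request strings and all function
names are invented for the example.

```python
from collections import deque

ring = deque()     # stand-in for the RX -> TX software ring
ready = deque()    # FIFO ready queue of runnable threads
waiters = deque()  # threads suspended on the ring's condition
processed = []


def signal() -> None:
    """Condition signal: make one waiting thread ready again."""
    if waiters:
        ready.append(waiters.popleft())


def rx_thread(packets):
    for pkt in packets:
        ring.append(pkt)
        signal()        # wake the worker whenever a packet is posted
        yield "yield"   # rescheduling point


def tx_worker():
    while True:
        if not ring:
            yield "wait"  # ring empty: suspend until signalled
            continue
        processed.append(ring.popleft())
        yield "yield"


def run() -> None:
    while ready:
        thread = ready.popleft()
        try:
            request = next(thread)
        except StopIteration:
            continue      # thread finished
        (waiters if request == "wait" else ready).append(thread)


ready.extend([tx_worker(), rx_thread(["a", "b", "c"])])
run()
print(processed, "worker suspended:", len(waiters) == 1)
```

Once the RX side finishes, the worker ends up parked on the wait list and the
ready queue is empty, so no polling of the ring takes place.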
 355
 356 The diagram below illustrates a case with two RX threads and three TX functions
 357 (each comprising a thread that processes forwarding and a thread that
 358 periodically drains the output buffer of residual packets).
 359
 360 .. _figure_performance_thread_2:
 361
 362 .. figure:: img/performance_thread_2.*
 363
 364
 365 CPU load statistics
 366 ~~~~~~~~~~~~~~~~~~~
 367
 368 It is possible to display statistics showing estimated CPU load on each core.
 369 The statistics indicate the percentage of CPU time spent: processing
 370 received packets (forwarding), polling queues/rings (waiting for work),
 371 and doing any other processing (context switch and other overhead).
 372
 373 When enabled, statistics are gathered by having the application threads set and
 374 clear flags when they enter and exit pertinent code sections. The flags are
 375 then sampled in real time by a statistics collector thread running on another
 376 core. This thread displays the data in real time on the console.
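
The sampling approach can be illustrated as follows: if the collector records
a thread's state flag at regular intervals, the share of samples in each state
estimates the share of CPU time spent there. The Python sketch below assumes
three hypothetical state names and is not the application's collector code.

```python
from collections import Counter

# Hypothetical state flags a thread would set on entering a code section.
PROCESSING, POLLING, OTHER = "processing", "polling", "other"


def load_percentages(samples: list[str]) -> dict[str, float]:
    """Turn periodic samples of a state flag into CPU load estimates."""
    counts = Counter(samples)
    total = len(samples)
    if total == 0:
        return {state: 0.0 for state in (PROCESSING, POLLING, OTHER)}
    return {state: 100.0 * counts[state] / total
            for state in (PROCESSING, POLLING, OTHER)}


# Ten samples taken by the collector at regular intervals.
samples = [PROCESSING] * 6 + [POLLING] * 3 + [OTHER]
print(load_percentages(samples))
```

The accuracy of such an estimate depends on the sampling rate relative to how
quickly the threads change state.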
 377
 378 This feature is enabled by designating a statistics collector core, using the
 379 ``--stat-lcore`` parameter.
 380
 381
 382 .. _lthread_subsystem:
 383
 384 The L-thread subsystem
 385 ----------------------
 386
 387 The L-thread subsystem resides in the examples/performance-thread/common
 388 directory and is built and linked automatically when building the
 389 ``l3fwd-thread`` example.
 390
 391 The subsystem provides a simple cooperative scheduler to enable arbitrary
 392 functions to run as cooperative threads within a single EAL thread.
 393 The subsystem provides a pthread-like API that is intended to assist in
 394 reuse of legacy code written for POSIX pthreads.
 395
 396 The following sections provide some detail on the features, constraints,
 397 performance and porting considerations when using L-threads.
 398
 399
 400 .. _comparison_between_lthreads_and_pthreads:
 401
 402 Comparison between L-threads and POSIX pthreads
 403 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 404
 405 The fundamental difference between the L-thread and pthread models is the
 406 way in which threads are scheduled. The simplest way to think about this is to
 407 consider the case of a processor with a single CPU. To run multiple threads
 408 on a single CPU, the scheduler must frequently switch between the threads,
 409 in order that each thread is able to make timely progress.
 410 This is the basis of any multitasking operating system.
 411
 412 This section explores the differences between the pthread model and the
 413 L-thread model as implemented in the provided L-thread subsystem. If needed, a
 414 theoretical discussion of preemptive vs cooperative multi-threading can be
 415 found in any good text on operating system design.
 416
 417
 418 Scheduling and context switching
 419 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 420
 421 The POSIX pthread library provides an application programming interface to
 422 create and synchronize threads. Scheduling policy is determined by the host OS,
 423 and may be configurable. The OS may use sophisticated rules to determine which
 424 thread should be run next. Threads may suspend themselves or make other threads
 425 ready, and the scheduler may employ a time slice giving each thread a maximum
 426 time quantum after which it will be preempted in favor of another thread that
 427 is ready to run. To complicate matters further, threads may be assigned
 428 different scheduling priorities.
 429
 430 By contrast the L-thread subsystem is considerably simpler. Logically the
 431 L-thread scheduler performs the same multiplexing function for L-threads
 432 within a single pthread as the OS scheduler does for pthreads within an
 433 application process. The L-thread scheduler is simply the main loop of a
 434 pthread, and in so far as the host OS is concerned it is a regular pthread
 435 just like any other. The host OS is oblivious to the existence of L-threads
 436 and is not involved in scheduling them.
 437
 438 The other and most significant difference between the two models is that
 439 L-threads are scheduled cooperatively. L-threads cannot preempt each
 440 other, nor can the L-thread scheduler preempt a running L-thread (i.e.
 441 there is no time slicing). The consequence is that programs implemented with
 442 L-threads must possess frequent rescheduling points, meaning that they must
 443 explicitly and of their own volition return to the scheduler at frequent
 444 intervals, in order to allow other L-threads an opportunity to proceed.
 445
 446 In both models switching between threads requires that the current CPU
 447 context is saved and a new context (belonging to the next thread ready to run)
 448 is restored. With pthreads this context switching is handled transparently
 449 and the set of CPU registers that must be preserved between context switches
 450 is as per an interrupt handler.
 451
 452 An L-thread context switch is achieved by the thread itself making a function
 453 call to the L-thread scheduler. Thus it is only necessary to preserve the
 454 callee-saved registers. The caller is responsible for saving any other
 455 registers it is using before a function call and restoring them on return,
 456 and this is handled by the compiler. For ``X86_64`` on both Linux and BSD the
 457 System V calling convention is used. This defines registers RBX, RSP, RBP, and
 458 R12-R15 as callee-saved registers (for a more detailed discussion a good reference
 459 is `X86 Calling Conventions <https://en.wikipedia.org/wiki/X86_calling_conventions>`_).
 460
 461 Taking advantage of this, and due to the absence of preemption, an L-thread
 462 context switch is achieved with fewer than 20 load/store instructions.
 463
 464 The scheduling policy for L-threads is fixed. There is no prioritization of
 465 L-threads; all L-threads are equal and scheduling is based on a FIFO
 466 ready queue.
 467
 468 An L-thread is a struct containing the CPU context of the thread
 469 (saved on context switch) and other useful items. The ready queue contains
 470 pointers to threads that are ready to run. The L-thread scheduler is a simple
 471 loop that polls the ready queue, reads from it the next thread ready to run,
 472 which it resumes by saving the current context (the current position in the
 473 scheduler loop) and restoring the context of the next thread from its thread
 474 struct. Thus an L-thread is always resumed at the last place it yielded.
 475
 476 A well behaved L-thread will call the context switch regularly (at least once
 477 in its main loop) thus returning to the scheduler's own main loop. Yielding
 478 inserts the current thread at the back of the ready queue, and the process of
 479 servicing the ready queue is repeated. Thus the system runs by flipping back
 480 and forth between the L-threads and the scheduler loop.
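
The scheduler loop described above can be approximated with Python generators,
where ``yield`` plays the role of the context switch back to the scheduler.
This is a conceptual sketch of FIFO cooperative scheduling, not the L-thread
implementation. The names ``lthread`` and ``scheduler`` are chosen for the
example.

```python
from collections import deque


def lthread(name, iterations, trace):
    """A cooperative thread: does one unit of work, then yields."""
    for i in range(iterations):
        trace.append(f"{name}{i}")
        yield  # context switch back to the scheduler


def scheduler(threads):
    """Run threads from a FIFO ready queue until all have finished."""
    ready = deque(threads)
    while ready:
        thread = ready.popleft()  # next thread ready to run
        try:
            next(thread)          # resume at the place it last yielded
        except StopIteration:
            continue              # thread finished: do not requeue it
        ready.append(thread)      # yielding puts it at the back of the queue


trace = []
scheduler([lthread("A", 2, trace), lthread("B", 3, trace)])
print(trace)  # ['A0', 'B0', 'A1', 'B1', 'B2']
```

The interleaved trace shows control alternating between the threads and the
scheduler loop, with each thread resuming exactly where it yielded.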
 481
 482 In the case of pthreads, preemptive scheduling, time slicing, and support
 483 for thread prioritization mean that progress is normally possible for any
 484 thread that is ready to run. This comes at the price of a relatively heavier
 485 context switch and scheduling overhead.
 486
 487 With L-threads the progress of any particular thread is determined by the
 488 frequency of rescheduling opportunities in the other L-threads. This means that
 489 an errant L-thread monopolizing the CPU might cause scheduling of other threads
 490 to be stalled. Due to the lower cost of context switching, however, voluntary
 491 rescheduling to ensure progress of other threads, if managed sensibly, is not
 492 a prohibitive overhead, and overall performance can exceed that of an
 493 application using pthreads.
 494
 495
 496 Mutual exclusion
 497 ^^^^^^^^^^^^^^^^
 498
 499 With pthreads, preemption means that threads that share data must observe
 500 some form of mutual exclusion protocol.
 501
 502 The fact that L-threads cannot preempt each other means that in many cases
 503 mutual exclusion devices can be completely avoided.
 504
 505 Locking to protect shared data can be a significant bottleneck in
 506 multi-threaded applications so a carefully designed cooperatively scheduled
 507 program can enjoy significant performance advantages.
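
The following Python sketch illustrates why locks can often be omitted under
cooperative scheduling on a single scheduler: a read-modify-write sequence
that contains no rescheduling point cannot be interleaved with another
thread. Generators stand in for L-threads, and the example is an analogy
rather than L-thread code.

```python
from collections import deque

counter = {"value": 0}  # shared data, deliberately unprotected


def incrementer(times):
    for _ in range(times):
        current = counter["value"]      # read
        counter["value"] = current + 1  # modify and write, no yield in between
        yield                           # the only point where others can run


ready = deque(incrementer(1000) for _ in range(4))
while ready:
    thread = ready.popleft()
    try:
        next(thread)
        ready.append(thread)
    except StopIteration:
        pass

print(counter["value"])  # 4000
```

The guarantee holds only while the threads share one scheduler and the
critical section contains no yield. Neither condition is enforced by the
sketch.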
 508
 509 So far we have considered only the simplistic case of a single core CPU.
 510 When multiple CPUs are considered, things are somewhat more complex.
 511
 512 First of all it is inevitable that there must be multiple L-thread schedulers,
 513 one running on each EAL thread. So long as these schedulers remain isolated
 514 from each other the above assertions about the potential advantages of
 515 cooperative scheduling hold true.
 516
 517 A configuration with isolated cooperative schedulers is less flexible than the
 518 pthread model where threads can be affinitized to run on any CPU. With isolated
 519 schedulers scaling of applications to utilize fewer or more CPUs according to
 520 system demand is very difficult to achieve.
 521
 522 The L-thread subsystem makes it possible for L-threads to migrate between
 523 schedulers running on different CPUs. Needless to say if the migration means
 524 that threads that share data end up running on different CPUs then this will
 525 introduce the need for some kind of mutual exclusion system.
 526
 527 Of course ``rte_ring`` software rings can always be used to interconnect
 528 threads running on different cores. However, to protect other kinds of shared
 529 data structures, lock-free constructs or else explicit locking will be
 530 required. This is a consideration for the application design.
 531
 532 In support of this extended functionality, the L-thread subsystem implements
 533 thread safe mutexes and condition variables.
 534
 535 The cost of affinitizing and of condition variable signaling is significantly
 536 lower than the equivalent pthread operations, and so applications using these
 537 features will see a performance benefit.
 538
 539
 540 Thread local storage
 541 ^^^^^^^^^^^^^^^^^^^^
 542
 543 As with applications written for pthreads, an application written for L-threads
 544 can take advantage of thread local storage, in this case local to an L-thread.
 545 An application may save and retrieve a single pointer to application data in
 546 the L-thread struct.
 547
 548 For legacy and backward compatibility reasons two alternative methods are also
 549 offered. The first is modelled directly on the pthread get/set specific APIs.
 550 The second is modelled on the ``RTE_PER_LCORE`` macros, whereby
 551 ``PER_LTHREAD`` macros are introduced. In both cases the storage is local to
 552 the L-thread.
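
The key-based method can be pictured as a per-thread table indexed by a key
that all threads share. The Python sketch below models that idea only. The
function names mirror the pthread get/set specific pattern but are stand-ins
written for this example, not bindings to the L-thread API, and ``LThread``
and ``set_current`` are hypothetical.

```python
import itertools

_next_key = itertools.count()
_current = None  # the thread the (omitted) scheduler is currently running


class LThread:
    """Stand-in for the L-thread struct, holding per-thread key/value slots."""

    def __init__(self, name):
        self.name = name
        self.specific = {}


def set_current(thread: LThread) -> None:
    """Simulate the scheduler resuming a thread."""
    global _current
    _current = thread


def key_create() -> int:
    return next(_next_key)


def setspecific(key: int, value) -> None:
    _current.specific[key] = value


def getspecific(key: int):
    return _current.specific.get(key)


key = key_create()
a, b = LThread("a"), LThread("b")

set_current(a)
setspecific(key, "state of a")
set_current(b)
setspecific(key, "state of b")

set_current(a)
print(getspecific(key))  # state of a
```

The same key returns a different value depending on which thread is running,
which is the property that lets code written against the pthread pattern be
reused.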
 553
 554
 555 .. _constraints_and_performance_implications:
 556
 557 Constraints and performance implications when using L-threads
 558 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 559
 560
 561 .. _API_compatibility:
 562
 563 API compatibility
 564 ^^^^^^^^^^^^^^^^^
 565
 566 The L-thread subsystem provides a set of functions that are logically equivalent
 567 to the corresponding functions offered by the POSIX pthread library. However, not
 568 all pthread functions have a corresponding L-thread equivalent, and not all
 569 features available to pthreads are implemented for L-threads.
 570
 571 The pthread library offers considerable flexibility via programmable attributes
 572 that can be associated with threads, mutexes, and condition variables.
 573
 574 By contrast the L-thread subsystem has fixed functionality: the scheduler policy
 575 cannot be varied, and L-threads cannot be prioritized. There are no variable
 576 attributes associated with any L-thread objects. L-threads, mutexes and
 577 condition variables all have fixed functionality. (Note: reserved parameters
 578 are included in the APIs to facilitate possible future support for attributes).
 579
 580 The table below lists the pthread and equivalent L-thread APIs with notes on
 581 differences and/or constraints. Where there is no L-thread entry in the table,
 582 the L-thread subsystem provides no equivalent function.
 583
 584 .. _table_lthread_pthread:
 585
 586 .. table:: Pthread and equivalent L-thread APIs.
 587
 588    +----------------------------+------------------------+-------------------+
 589    | **Pthread function**       | **L-thread function**  | **Notes**         |
 590    +============================+========================+===================+
 591    | pthread_barrier_destroy    |                        |                   |
 592    +----------------------------+------------------------+-------------------+
 593    | pthread_barrier_init       |                        |                   |
 594    +----------------------------+------------------------+-------------------+
 595    | pthread_barrier_wait       |                        |                   |
 596    +----------------------------+------------------------+-------------------+
 597    | pthread_cond_broadcast     | lthread_cond_broadcast | See note 1        |
 598    +----------------------------+------------------------+-------------------+
 599    | pthread_cond_destroy       | lthread_cond_destroy   |                   |
 600    +----------------------------+------------------------+-------------------+
 601    | pthread_cond_init          | lthread_cond_init      |                   |
 602    +----------------------------+------------------------+-------------------+
 603    | pthread_cond_signal        | lthread_cond_signal    | See note 1        |
 604    +----------------------------+------------------------+-------------------+
 605    | pthread_cond_timedwait     |                        |                   |
 606    +----------------------------+------------------------+-------------------+
 607    | pthread_cond_wait          | lthread_cond_wait      | See note 5        |
 608    +----------------------------+------------------------+-------------------+
 609    | pthread_create             | lthread_create         | See notes 2, 3    |
 610    +----------------------------+------------------------+-------------------+
 611    | pthread_detach             | lthread_detach         | See note 4        |
 612    +----------------------------+------------------------+-------------------+
 613    | pthread_equal              |                        |                   |
 614    +----------------------------+------------------------+-------------------+
 615    | pthread_exit               | lthread_exit           |                   |
 616    +----------------------------+------------------------+-------------------+
 617    | pthread_getspecific        | lthread_getspecific    |                   |
 618    +----------------------------+------------------------+-------------------+
 619    | pthread_getcpuclockid      |                        |                   |
 620    +----------------------------+------------------------+-------------------+
 621    | pthread_join               | lthread_join           |                   |
 622    +----------------------------+------------------------+-------------------+
 623    | pthread_key_create         | lthread_key_create     |                   |
 624    +----------------------------+------------------------+-------------------+
 625    | pthread_key_delete         | lthread_key_delete     |                   |
 626    +----------------------------+------------------------+-------------------+
 627    | pthread_mutex_destroy      | lthread_mutex_destroy  |                   |
 628    +----------------------------+------------------------+-------------------+
 629    | pthread_mutex_init         | lthread_mutex_init     |                   |
 630    +----------------------------+------------------------+-------------------+
 631    | pthread_mutex_lock         | lthread_mutex_lock     | See note 6        |
 632    +----------------------------+------------------------+-------------------+
 633    | pthread_mutex_trylock      | lthread_mutex_trylock  | See note 6        |
 634    +----------------------------+------------------------+-------------------+
 635    | pthread_mutex_timedlock    |                        |                   |
 636    +----------------------------+------------------------+-------------------+
 637    | pthread_mutex_unlock       | lthread_mutex_unlock   |                   |
 638    +----------------------------+------------------------+-------------------+
 639    | pthread_once               |                        |                   |
 640    +----------------------------+------------------------+-------------------+
 641    | pthread_rwlock_destroy     |                        |                   |
 642    +----------------------------+------------------------+-------------------+
 643    | pthread_rwlock_init        |                        |                   |
 644    +----------------------------+------------------------+-------------------+
 645    | pthread_rwlock_rdlock      |                        |                   |
 646    +----------------------------+------------------------+-------------------+
 647    | pthread_rwlock_timedrdlock |                        |                   |
 648    +----------------------------+------------------------+-------------------+
 649    | pthread_rwlock_timedwrlock |                        |                   |
 650    +----------------------------+------------------------+-------------------+
 651    | pthread_rwlock_tryrdlock   |                        |                   |
 652    +----------------------------+------------------------+-------------------+
 653    | pthread_rwlock_trywrlock   |                        |                   |
 654    +----------------------------+------------------------+-------------------+
 655    | pthread_rwlock_unlock      |                        |                   |
 656    +----------------------------+------------------------+-------------------+
 657    | pthread_rwlock_wrlock      |                        |                   |
 658    +----------------------------+------------------------+-------------------+
 659    | pthread_self               | lthread_current        |                   |
 660    +----------------------------+------------------------+-------------------+
 661    | pthread_setspecific        | lthread_setspecific    |                   |
 662    +----------------------------+------------------------+-------------------+
 663    | pthread_spin_init          |                        | See note 10       |
 664    +----------------------------+------------------------+-------------------+
 665    | pthread_spin_destroy       |                        | See note 10       |
 666    +----------------------------+------------------------+-------------------+
 667    | pthread_spin_lock          |                        | See note 10       |
 668    +----------------------------+------------------------+-------------------+
 669    | pthread_spin_trylock       |                        | See note 10       |
 670    +----------------------------+------------------------+-------------------+
 671    | pthread_spin_unlock        |                        | See note 10       |
 672    +----------------------------+------------------------+-------------------+
 673    | pthread_cancel             | lthread_cancel         |                   |
 674    +----------------------------+------------------------+-------------------+
 675    | pthread_setcancelstate     |                        |                   |
 676    +----------------------------+------------------------+-------------------+
 677    | pthread_setcanceltype      |                        |                   |
 678    +----------------------------+------------------------+-------------------+
 679    | pthread_testcancel         |                        |                   |
 680    +----------------------------+------------------------+-------------------+
 681    | pthread_getschedparam      |                        |                   |
 682    +----------------------------+------------------------+-------------------+
 683    | pthread_setschedparam      |                        |                   |
 684    +----------------------------+------------------------+-------------------+
 685    | pthread_yield              | lthread_yield          | See note 7        |
 686    +----------------------------+------------------------+-------------------+
 687    | pthread_setaffinity_np     | lthread_set_affinity   | See notes 2, 3, 8 |
 688    +----------------------------+------------------------+-------------------+
 689    |                            | lthread_sleep          | See note 9        |
 690    +----------------------------+------------------------+-------------------+
 691    |                            | lthread_sleep_clks     | See note 9        |
 692    +----------------------------+------------------------+-------------------+
 693
 694
 695 **Note 1**:
 696
 697 Neither lthread signal nor broadcast may be called concurrently by L-threads
 698 running on different schedulers, although multiple L-threads running in the
 699 same scheduler may freely perform signal or broadcast operations. L-threads
 700 running on the same or different schedulers may always safely wait on a
 701 condition variable.
 702
 703
 704 **Note 2**:
 705
 706 Pthread attributes may be used to affinitize a pthread with a cpu-set. The
 707 L-thread subsystem does not support a cpu-set. An L-thread may be affinitized
 708 only with a single CPU at any time.
 709
 710
 711 **Note 3**:
 712
 713 If an L-thread is intended to run on a different NUMA node than the node that
 714 creates the thread, it is advantageous to specify the destination core as a
 715 parameter when calling ``lthread_create()``. See
 716 :ref:`memory_allocation_and_NUMA_awareness` for details.
 717
 718
 719 **Note 4**:
 720
 721 An L-thread can only detach itself, and cannot detach other L-threads.
 722
 723
 724 **Note 5**:
 725
 726 A wait operation on a pthread condition variable is always associated with and
 727 protected by a mutex which must be owned by the thread at the time it invokes
 728 ``pthread_cond_wait()``. By contrast L-thread condition variables are thread
 729 safe (for waiters) and do not use an associated mutex. Multiple L-threads
 730 (including L-threads running on other schedulers) can safely wait on an
 731 L-thread condition variable. As a consequence an L-thread condition variable
 732 is typically an order of magnitude faster than its pthread counterpart.
 733
 734
 735 **Note 6**:
 736
 737 Recursive locking is not supported with L-threads; attempts to take a lock
 738 recursively will be detected and rejected.
 739
 740
 741 **Note 7**:
 742
 743 ``lthread_yield()`` will save the current context, insert the current thread
 744 at the back of the ready queue, and resume the next ready thread. Yielding
 745 increases ready queue backlog; see :ref:`ready_queue_backlog` for more details
 746 about the implications of this.
 747
 748
 749 N.B. The context switch time as measured from immediately before the call to
 750 ``lthread_yield()`` to the point at which the next ready thread is resumed,
 751 can be an order of magnitude faster than the same measurement for
 752 ``pthread_yield()``.
 753
 754
 755 **Note 8**:
 756
 757 ``lthread_set_affinity()`` is similar to a yield apart from the fact that the
 758 yielding thread is inserted into a peer ready queue of another scheduler.
 759 The peer ready queue is actually a separate thread safe queue, which means that
 760 threads appearing in the peer ready queue can jump any backlog in the local
 761 ready queue on the destination scheduler.
 762
 763 The context switch time as measured from the time just before the call to
 764 ``lthread_set_affinity()`` to just after the same thread is resumed on the new
 765 scheduler can be orders of magnitude faster than the same measurement for
 766 ``pthread_setaffinity_np()``.
 767
 768
 769 **Note 9**:
 770
 771 Although there is no ``pthread_sleep()`` function, ``lthread_sleep()`` and
 772 ``lthread_sleep_clks()`` can be used wherever ``sleep()``, ``usleep()`` or
 773 ``nanosleep()`` might ordinarily be used. The L-thread sleep functions suspend
 774 the current thread, start an ``rte_timer`` and resume the thread when the
 775 timer matures. The ``rte_timer_manage()`` entry point is called on every pass
 776 of the scheduler loop. This means that the worst case jitter on timer expiry
 777 is determined by the longest period between context switches of any running
 778 L-threads.
 779
 780 In a synthetic test with many threads sleeping and resuming, the measured
 781 jitter is typically orders of magnitude lower than the same measurement made
 782 for ``nanosleep()``.
 783
 784
 785 **Note 10**:
 786
 787 Spin locks are not provided because they are problematic in a cooperative
 788 environment. See :ref:`porting_locks_and_spinlocks` for a more detailed
 789 discussion on how to avoid spin locks.
 790
 791
 792 .. _Thread_local_storage_performance:
 793
 794 Thread local storage
 795 ^^^^^^^^^^^^^^^^^^^^
 796
 797 Of the three L-thread local storage options the simplest and most efficient is
 798 storing a single application data pointer in the L-thread struct.
 799
 800 The ``PER_LTHREAD`` macros involve a run time computation to obtain the address
 801 of the variable being saved/retrieved and also require that the accesses are
 802 de-referenced via a pointer. This means that code that uses the
 803 ``RTE_PER_LCORE`` macros might need some slight adjustment when it is ported
 804 to L-threads (see :ref:`porting_thread_local_storage` for hints about porting
 805 code that makes use of thread local storage).
 806
 807 The get/set specific APIs are consistent with their pthread counterparts both
 808 in use and in performance.
 809
 810
 811 .. _memory_allocation_and_NUMA_awareness:
 812
 813 Memory allocation and NUMA awareness
 814 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 815
 816 All memory allocation is from DPDK huge pages, and is NUMA aware. Each
 817 scheduler maintains its own caches of objects: lthreads, their stacks, TLS,
 818 mutexes and condition variables. These caches are implemented as unbounded lock
 819 free MPSC queues. When objects are created they are always allocated from the
 820 caches on the local core (current EAL thread).
 821
 822 If an L-thread has been affinitized to a different scheduler, then it can
 823 always safely free resources to the caches from which they originated (because
 824 the caches are MPSC queues).
 825
 826 If the L-thread has been affinitized to a different NUMA node then the memory
 827 resources associated with it may incur longer access latency.
 828
 829 The commonly used pattern of setting affinity on entry to a thread after it has
 830 started means that memory allocation for both the stack and TLS will have been
 831 made from caches on the NUMA node on which the thread's creator is running.
 832 This has the side effect that access latency will be sub-optimal after
 833 affinitizing.
 834
 835 This side effect can be mitigated to some extent (although not completely) by
 836 specifying the destination CPU as a parameter of ``lthread_create()``. This
 837 causes the L-thread's stack and TLS to be allocated when it is first scheduled
 838 on the destination scheduler; if the destination is on another NUMA node this
 839 results in a more optimal memory allocation.
 840
 841 Note that the lthread struct itself remains allocated from memory on the
 842 creating node. This is unavoidable because an L-thread is known everywhere by
 843 the address of this struct.
 844
 845
 846 .. _object_cache_sizing:
 847
 848 Object cache sizing
 849 ^^^^^^^^^^^^^^^^^^^
 850
 851 The per lcore object caches pre-allocate objects in bulk whenever a request to
 852 allocate an object finds a cache empty. By default 100 objects are
 853 pre-allocated; this is defined by ``LTHREAD_PREALLOC`` in the public API
 854 header file ``lthread_api.h``. This means that the caches constantly grow to
 855 meet system demand.
 856
 857 In the present implementation there is no mechanism to reduce the cache sizes
 858 if system demand reduces. Thus the caches will remain at their maximum extent
 859 indefinitely.
 860
 861 A consequence of the bulk pre-allocation of objects is that every 100 (default
 862 value) additional new object create operations result in a call to
 863 ``rte_malloc()``. For creation of objects such as L-threads, which trigger the
 864 allocation of even more objects (i.e. their stacks and TLS), this can
 865 cause outliers in scheduling performance.
 866
 867 If this is a problem the simplest mitigation strategy is to dimension the
 868 system, by setting the bulk object pre-allocation size to some large number
 869 that you do not expect to be exceeded. This means the caches will be populated
 870 once only, the very first time a thread is created.
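
     The effect of the pre-allocation size can be illustrated with a toy model.
     The sketch below is not the lthread implementation: ``toy_cache``,
     ``toy_cache_alloc()`` and ``refills_for()`` are hypothetical names, a refill
     stands in for a call to ``rte_malloc()``, and it is assumed that no objects
     are freed back to the cache.

     .. code-block:: c

         #include <assert.h>
         #include <stddef.h>

         /* Toy stand-in for one per-lcore object cache (hypothetical). */
         struct toy_cache {
             size_t prealloc;   /* objects added per refill, cf. LTHREAD_PREALLOC */
             size_t available;  /* objects currently held in the cache */
             size_t refills;    /* bulk refills so far (simulated rte_malloc calls) */
         };

         /* Take one object from the cache, refilling in bulk if it is empty. */
         static void toy_cache_alloc(struct toy_cache *c)
         {
             assert(c != NULL && c->prealloc > 0);
             if (c->available == 0) {
                 c->available = c->prealloc;
                 c->refills++;
             }
             c->available--;
         }

         /* Count the refills caused by 'creates' allocations with nothing freed. */
         static size_t refills_for(size_t creates, size_t prealloc)
         {
             struct toy_cache c = { .prealloc = prealloc };

             for (size_t i = 0; i < creates; i++)
                 toy_cache_alloc(&c);
             return c.refills;
         }

     In this model ``refills_for(1000, 100)`` counts 10 refills, whereas a
     pre-allocation size of 4096 gives a single refill on the first allocation.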
 871
 872
 873 .. _Ready_queue_backlog:
 874
 875 Ready queue backlog
 876 ^^^^^^^^^^^^^^^^^^^
 877
 878 One of the more subtle performance considerations is managing the ready queue
 879 backlog. The fewer threads that are waiting in the ready queue, the faster
 880 any particular thread will get serviced.
 881
 882 In a naive L-thread application with N L-threads simply looping and yielding,
 883 this backlog will always be equal to the number of L-threads, thus the cost of
 884 a yield to a particular L-thread will be N times the context switch time.
 885
 886 This side effect can be mitigated by arranging for threads to be suspended and
 887 wait to be resumed, rather than polling for work by constantly yielding.
 888 Blocking on a mutex or condition variable or even more obviously having a
 889 thread sleep if it has a low frequency workload are all mechanisms by which a
 890 thread can be excluded from the ready queue until it really does need to be
 891 run. This can have a significant positive impact on performance.
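
     As a rough illustration of the arithmetic above, the following sketch
     estimates how long a yielding L-thread waits before it runs again. The
     function name is hypothetical, and the 50 ns context switch figure used in
     the note after the code is an assumption for illustration, not a measurement.

     .. code-block:: c

         #include <assert.h>
         #include <stdint.h>

         /*
          * Estimate the time until a yielding L-thread runs again.
          * total_threads: L-threads owned by the scheduler
          * suspended:     those blocked or sleeping, so not in the ready queue
          * ctx_switch_ns: assumed cost of one context switch, in nanoseconds
          */
         static uint64_t yield_round_trip_ns(uint32_t total_threads,
                                             uint32_t suspended,
                                             uint64_t ctx_switch_ns)
         {
             assert(suspended <= total_threads);
             /* Every thread left in the ready queue costs one context switch. */
             return (uint64_t)(total_threads - suspended) * ctx_switch_ns;
         }

     With 100 L-threads that all poll and yield, and an assumed 50 ns switch, the
     estimate is 5000 ns. If 90 of them block on a condition variable or sleep
     instead, it falls to 500 ns.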
 892
 893
 894 .. _Initialization_and_shutdown_dependencies:
 895
 896 Initialization, shutdown and dependencies
 897 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 898
 899 The L-thread subsystem depends on DPDK for huge page allocation and depends on
 900 the ``rte_timer`` subsystem. The DPDK EAL initialization and
 901 ``rte_timer_subsystem_init()`` **MUST** be completed before the L-thread
 902 subsystem can be used.
 903
 904 Thereafter initialization of the L-thread subsystem is largely transparent to
 905 the application. Constructor functions ensure that global variables are properly
 906 initialized. Other than global variables each scheduler is initialized
 907 independently the first time that an L-thread is created by a particular EAL
 908 thread.
 909
 910 If the schedulers are to be run as isolated and independent schedulers, with
 911 no intention that L-threads running on different schedulers will migrate between
 912 schedulers or synchronize with L-threads running on other schedulers, then
 913 initialization consists simply of creating an L-thread, and then running the
 914 L-thread scheduler.
 915
 916 If there will be interaction between L-threads running on different schedulers,
 917 then it is important that the starting of schedulers on different EAL threads
 918 is synchronized.
 919
 920 To achieve this an additional initialization step is necessary: setting the
 921 number of schedulers by calling the API function
 922 ``lthread_num_schedulers_set(n)``, where ``n`` is the number of EAL threads
 923 that will run L-thread schedulers. Setting the number of schedulers to a
 924 number greater than 0 will cause all schedulers to wait until the others have
 925 started before beginning to schedule L-threads.
 926
 927 The L-thread scheduler is started by calling the function ``lthread_run()``,
 928 which should be called from the EAL thread and thus becomes the main loop of
 929 the EAL thread.
 930
 931 The function ``lthread_run()`` will not return until all threads running on
 932 the scheduler have exited, and the scheduler has been explicitly stopped by
 933 calling ``lthread_scheduler_shutdown(lcore)`` or
 934 ``lthread_scheduler_shutdown_all()``.
 935
 936 All these functions do is tell the scheduler that it can exit when there are no
 937 longer any running L-threads; neither function forces any running L-thread to
 938 terminate. Any desired application shutdown behavior must be designed and
 939 built into the application to ensure that L-threads complete in a timely
 940 manner.
 941
 942 **Important Note:** It is assumed when the scheduler exits that the application
 943 is terminating for good. The scheduler does not free resources before exiting
 944 and running the scheduler a subsequent time will result in undefined behavior.
 945
 946
 947 .. _porting_legacy_code_to_run_on_lthreads:
 948
 949 Porting legacy code to run on L-threads
 950 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 951
 952 Legacy code originally written for a pthread environment may be ported to
 953 L-threads if the considerations about differences in scheduling policy, and
 954 constraints discussed in the previous sections can be accommodated.
 955
 956 This section looks in more detail at some of the issues that may have to be
 957 resolved when porting code.
 958
 959
 960 .. _pthread_API_compatibility:
 961
 962 pthread API compatibility
 963 ^^^^^^^^^^^^^^^^^^^^^^^^^
 964
 965 The first step is to establish exactly which pthread APIs the legacy
 966 application uses, and to understand the requirements of those APIs. If there
 967 are corresponding L-thread APIs, and where the default pthread functionality
 968 is used by the application then, notwithstanding the other issues discussed
 969 here, it should be feasible to run the application with L-threads. If the
 970 legacy code modifies the default behavior using attributes then it may be
 971 necessary to make some adjustments to eliminate those requirements.
 972
 973
 974 .. _blocking_system_calls:
 975
 976 Blocking system API calls
 977 ^^^^^^^^^^^^^^^^^^^^^^^^^
 978
 979 It is important to understand what other system services the application may be
 980 using, bearing in mind that in a cooperatively scheduled environment a thread
 981 cannot block without stalling the scheduler and with it all other cooperative
 982 threads. Any kind of blocking system call, for example file or socket IO, is a
 983 potential problem. A good tool to analyze the application for this purpose is
 984 the ``strace`` utility.
 985
 986 There are many strategies to resolve these kinds of issues, each with its
 987 merits. Possible solutions include:
 988
 989 * Adopting a polled mode of the system API concerned (if available).
 990
 991 * Arranging for another core to perform the function and synchronizing with
 992   that core via constructs that will not block the L-thread.
 993
 994 * Affinitizing the thread to another scheduler devoted (as a matter of policy)
 995   to handling threads wishing to make blocking calls, and then back again when
 996   finished.
 997
 998
 999 .. _porting_locks_and_spinlocks:
1000
1001 Locks and spinlocks
1002 ^^^^^^^^^^^^^^^^^^^
1003
1004 Locks and spinlocks are another source of blocking behavior that for the same
1005 reasons as system calls will need to be addressed.
1006
1007 If the application design ensures that the contending L-threads will always
1008 run on the same scheduler then it is probably safe to remove locks and spin
1009 locks completely.
1010
1011 The only exception to the above rule is if for some reason the
1012 code performs any kind of context switch whilst holding the lock
1013 (e.g. yield, sleep, or block on a different lock, or on a condition variable).
1014 This will need to be determined before deciding to eliminate a lock.
1015
1016 If a lock cannot be eliminated then an L-thread mutex can be substituted for
1017 either kind of lock.
1018
1019 An L-thread blocking on an L-thread mutex will be suspended and will cause
1020 another ready L-thread to be resumed, thus not blocking the scheduler. When
1021 default behavior is required, it can be used as a direct replacement for a
1022 pthread mutex lock.
1023
1024 Spin locks are typically used when lock contention is likely to be rare and
1025 where the period during which the lock may be held is relatively short.
1026 When the contending L-threads are running on the same scheduler then an
1027 L-thread blocking on a spin lock will enter an infinite loop stopping the
1028 scheduler completely (see :ref:`porting_infinite_loops` below).
1029
1030 If the application design ensures that contending L-threads will always run
1031 on different schedulers then it might be reasonable to leave a short spin lock
1032 that rarely experiences contention in place.
1033
1034 If after all considerations it appears that a spin lock cannot be
1035 eliminated completely, replaced with an L-thread mutex, or left in place as
1036 is, then an alternative is to loop on a flag, with a call to
1037 ``lthread_yield()`` inside the loop (n.b. if the contending L-threads might
1038 ever run on different schedulers the flag will need to be manipulated
1039 atomically).
1040
1041 Spinning and yielding is the least preferred solution since it introduces
1042 ready queue backlog (see also :ref:`ready_queue_backlog`).
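
     A minimal sketch of the flag-and-yield fallback described above, using C11
     atomics so that the flag is also safe across schedulers. The names
     ``yield_lock`` and ``yield_standin()`` are hypothetical, and ``sched_yield()``
     is used only so that the sketch builds without DPDK; in an L-thread the loop
     would call ``lthread_yield()`` instead.

     .. code-block:: c

         #include <assert.h>
         #include <sched.h>
         #include <stdatomic.h>
         #include <stdbool.h>
         #include <stddef.h>

         /* Stand-in for lthread_yield() so that the sketch builds without DPDK. */
         static void yield_standin(void)
         {
             sched_yield();
         }

         /* A flag that is tested and set atomically. */
         struct yield_lock {
             atomic_flag held;
         };

         #define YIELD_LOCK_INIT { ATOMIC_FLAG_INIT }

         /* Try once; returns true if the caller now owns the lock. */
         static bool yield_lock_try(struct yield_lock *l)
         {
             assert(l != NULL);
             return !atomic_flag_test_and_set_explicit(&l->held,
                                                       memory_order_acquire);
         }

         /* Loop on the flag, yielding to other ready threads between attempts. */
         static void yield_lock_acquire(struct yield_lock *l)
         {
             while (!yield_lock_try(l))
                 yield_standin();
         }

         /* Clear the flag so that the next attempt by a waiter succeeds. */
         static void yield_lock_release(struct yield_lock *l)
         {
             assert(l != NULL);
             atomic_flag_clear_explicit(&l->held, memory_order_release);
         }

     Each failed attempt puts the waiter back in the ready queue, which is why
     this pattern adds to the backlog while the lock is contended.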
1043
1044
1045 .. _porting_sleeps_and_delays:
1046
1047 Sleeps and delays
1048 ^^^^^^^^^^^^^^^^^
1049
1050 Yet another kind of blocking behavior (albeit momentary) is the use of delay
1051 functions like ``sleep()``, ``usleep()``, ``nanosleep()`` etc. All will have
1052 the consequence of stalling the L-thread scheduler and unless the delay is very
1053 short (e.g. a very short nanosleep) calls to these functions will need to be
1054 eliminated.
1055
1056 The simplest mitigation strategy is to use the L-thread sleep API functions,
1057 of which two variants exist, ``lthread_sleep()`` and ``lthread_sleep_clks()``.
1058 These functions start an rte_timer against the L-thread, suspend the L-thread
1059 and cause another ready L-thread to be resumed. The suspended L-thread is
1060 resumed when the rte_timer matures.
1061
1062
1063 .. _porting_infinite_loops:
1064
1065 Infinite loops
1066 ^^^^^^^^^^^^^^
1067
1068 Some applications have threads with loops that contain no inherent
1069 rescheduling opportunity, and rely solely on the OS time slicing to share
1070 the CPU. In a cooperative environment this will stop everything dead. These
1071 kinds of loops are not hard to identify: in a debug session you will find the
1072 debugger is always stopping in the same loop.
1073
1074 The simplest solution to this kind of problem is to insert an explicit
1075 ``lthread_yield()`` or ``lthread_sleep()`` into the loop. Another solution
1076 might be to include the function performed by the loop into the execution path
1077 of some other loop that does in fact yield, if this is possible.
1078
1079
1080 .. _porting_thread_local_storage:
1081
1082 Thread local storage
1083 ^^^^^^^^^^^^^^^^^^^^
1084
1085 If the application uses thread local storage, the use case should be
1086 studied carefully.
1087
1088 In a legacy pthread application either or both the ``__thread`` prefix, or the
1089 pthread set/get specific APIs may have been used to define storage local to a
1090 pthread.
1091
1092 In some applications it may be a reasonable assumption that the data could
1093 or in fact most likely should be placed in L-thread local storage.
1094
1095 If the application (like many DPDK applications) has assumed a certain
1096 relationship between a pthread and the CPU to which it is affinitized, there
1097 is a risk that thread local storage may have been used to save some data items
1098 that are correctly logically associated with the CPU, and other items which
1099 relate to application context for the thread. Only a good understanding of the
1100 application will reveal such cases.
1101
1102 If the application requires that an L-thread be able to move between
1103 schedulers then care should be taken to separate these kinds of data into per
1104 lcore and per L-thread storage. In this way a migrating thread will bring with
1105 it the local data it needs, and pick up the new logical core specific values
1106 from pthread local storage at its new home.
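
     The separation can be sketched as follows. This is a hypothetical example
     rather than code from the sample application: ``lcore_tx_queue`` stands for a
     value that belongs to the logical core and stays in pthread local storage,
     while ``struct flow_ctx`` stands for application context that would be placed
     in L-thread local storage and travel with the thread.

     .. code-block:: c

         #include <assert.h>
         #include <stddef.h>
         #include <stdint.h>

         /* Belongs to the logical core: kept in pthread local storage. */
         static __thread uint16_t lcore_tx_queue;

         /* Belongs to the L-thread: reached through L-thread local storage. */
         struct flow_ctx {
             uint64_t packets_seen;   /* application state that must migrate */
             uint16_t last_tx_queue;  /* queue used for the most recent packet */
         };

         /* Handle one packet using both kinds of data. */
         static void handle_packet(struct flow_ctx *ctx)
         {
             assert(ctx != NULL);
             ctx->packets_seen++;                  /* carried by the thread */
             ctx->last_tx_queue = lcore_tx_queue;  /* read from the current core */
         }

     After a migration ``packets_seen`` keeps its value because it lives in the
     context the L-thread carries, whereas ``last_tx_queue`` reflects whatever
     ``lcore_tx_queue`` holds on the new core.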
1107
1108
1109 .. _pthread_shim:
1110
1111 Pthread shim
1112 ~~~~~~~~~~~~
1113
1114 A convenient way to get something working with legacy code can be to use a
1115 shim that adapts pthread API calls to the corresponding L-thread ones.
1116 This approach will not mitigate any of the porting considerations mentioned
1117 in the previous sections, but it will reduce the amount of code churn that
1118 would otherwise be involved. It is a reasonable approach to evaluate
1119 L-threads, before investing effort in porting to the native L-thread APIs.
1120
1121
1122 Overview
1123 ^^^^^^^^
1124 The L-thread subsystem includes an example pthread shim. This is a partial
1125 implementation but does contain the API stubs needed to get basic applications
1126 running. There is a simple "hello world" application that demonstrates the
1127 use of the pthread shim.
1128
1129 A subtlety of working with a shim is that the application will still need
1130 to make use of the genuine pthread library functions, at the very least in
1131 order to create the EAL threads in which the L-thread schedulers will run.
1132 This is the case with DPDK initialization, and exit.
1133
1134 To deal with the initialization and shutdown scenarios, the shim is capable of
1135 switching on or off its adaptor functionality. An application can control this
1136 behavior by calling the function ``pt_override_set()``. The default state
1137 is disabled.
1138
1139 The pthread shim uses the dynamic linker loader and saves the loaded addresses
1140 of the genuine pthread API functions in an internal table. When the shim
1141 functionality is enabled it performs the adaptor function; when disabled it
1142 invokes the genuine pthread function.
1143
1144 The function ``pthread_exit()`` has additional special handling. The standard
1145 system header file ``pthread.h`` declares ``pthread_exit()`` with
1146 ``__attribute__((noreturn))``. This is an optimization that is possible because
1147 the pthread is terminating, and it enables the compiler to omit the normal
1148 handling of stack and protection of registers since the function is not
1149 expected to return, and in fact the thread is being destroyed. These
1150 optimizations are applied in both the callee and the caller of the
1151 ``pthread_exit()`` function.
1152
1153 In our cooperative scheduling environment this behavior is inadmissible. The
1154 pthread is the L-thread scheduler thread, and, although an L-thread is
1155 terminating, there must be a return to the scheduler in order that the system
1156 can continue to run. Further, returning from a function with attribute
1157 ``noreturn`` is invalid and may result in undefined behavior.
1158
1159 The solution is to redefine the ``pthread_exit`` function with a macro,
1160 causing it to be mapped to a stub function in the shim that does not have the
1161 ``noreturn`` attribute. This macro is defined in the file
1162 ``pthread_shim.h``. The stub function is otherwise no different than any of
1163 the other stub functions in the shim, and will switch between the real
1164 ``pthread_exit()`` function or the ``lthread_exit()`` function as
1165 required. The only difference is that the mapping to the stub is by macro
1166 substitution.
1167
1168 A consequence of this is that the file ``pthread_shim.h`` must be included in
1169 legacy code wishing to make use of the shim. It also means that dynamic
1170 linkage of a pre-compiled binary that did not include ``pthread_shim.h`` is not
1171 supported.
1172
1173 Given the requirements for porting legacy code outlined in
1174 :ref:`porting_legacy_code_to_run_on_lthreads` most applications will require at
1175 least some minimal adjustment and recompilation to run on L-threads so
1176 pre-compiled binaries are unlikely to be met in practice.
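
     The macro substitution used for ``pthread_exit()`` can be sketched as below.
     The identifiers ``shim_exit_stub``, ``shim_override`` and ``last_route`` are
     hypothetical and are not taken from ``pthread_shim.h``. The stub only records
     which path it would take, where a real shim would call ``lthread_exit()`` or
     the genuine ``pthread_exit()``.

     .. code-block:: c

         #include <pthread.h>   /* declares the genuine, noreturn pthread_exit() */
         #include <stdbool.h>
         #include <stddef.h>

         enum exit_route { ROUTE_NONE, ROUTE_PTHREAD, ROUTE_LTHREAD };

         static bool shim_override;                       /* disabled by default */
         static enum exit_route last_route = ROUTE_NONE;  /* path of the last call */

         /*
          * Stub declared without the noreturn attribute, so callers are compiled
          * on the assumption that control comes back to them.
          */
         static void shim_exit_stub(void *retval)
         {
             (void)retval;
             if (shim_override)
                 last_route = ROUTE_LTHREAD;  /* real shim: call lthread_exit() */
             else
                 last_route = ROUTE_PTHREAD;  /* real shim: call genuine function */
         }

         /* From here on, calls written as pthread_exit(x) map to the stub. */
         #define pthread_exit(retval) shim_exit_stub(retval)

     Because the mapping happens in the preprocessor, only code compiled after
     this definition is redirected, which is why the header has to be included by
     the legacy code.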
1177
1178 In summary the shim approach adds some overhead but can be a useful tool to help
1179 establish the feasibility of a code reuse project. It is also a fairly
1180 straightforward task to extend the shim if necessary.
1181
1182 **Note:** Bearing in mind the preceding discussions about the impact of making
1183 blocking calls then switching the shim in and out on the fly to invoke any
1184 pthread API that might block is something that should typically be avoided.
1185
1186
1187 Building and running the pthread shim
1188 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1189
1190 The shim example application is located in the sample application
1191 in the performance-thread folder.
1192
1193 To build and run the pthread shim example:
1194
1195 #. Go to the example applications folder:
1196
1197    .. code-block:: console
1198
1199        export RTE_SDK=/path/to/rte_sdk
1200        cd ${RTE_SDK}/examples/performance-thread/pthread_shim
1201
1202
1203 #. Set the target (a default target is used if not specified). For example:
1204
1205    .. code-block:: console
1206
1207        export RTE_TARGET=x86_64-native-linuxapp-gcc
1208
1209    See the DPDK Getting Started Guide for possible RTE_TARGET values.
1210
1211 #. Build the application:
1212
1213    .. code-block:: console
1214
1215        make
1216
1217 #. Run the pthread_shim example:
1218
1219    .. code-block:: console
1220
1221        lthread-pthread-shim -c core_mask -n number_of_channels
1222
1223 .. _lthread_diagnostics:
1224
1225 L-thread Diagnostics
1226 ~~~~~~~~~~~~~~~~~~~~
1227
1228 When debugging you must take account of the fact that the L-threads are run in
1229 a single pthread. The current scheduler is defined by
1230 ``RTE_PER_LCORE(this_sched)``, and the current lthread is stored at
1231 ``RTE_PER_LCORE(this_sched)->current_lthread``. Thus on a breakpoint in a GDB
1232 session the current lthread can be obtained by displaying the pthread local
1233 variable ``per_lcore_this_sched->current_lthread``.
1234
1235 Another useful diagnostic feature is the possibility to trace significant
1236 events in the life of an L-thread. This feature is enabled by changing the
1237 value of ``LTHREAD_DIAG`` from 0 to 1 in the file ``lthread_diag_api.h``.
1238
1239 Tracing of events can be individually masked, and the mask may be programmed
1240 at run time. An unmasked event results in a callback that provides information
1241 about the event. The default callback simply prints trace information. The
1242 default mask is 0 (all events off); the mask can be modified by calling the
1243 function ``lthread_diagnostic_set_mask()``.
1244
1245 It is possible to register a user callback function to implement more
1246 sophisticated diagnostic functions.
1247 Object creation events (lthread, mutex, and condition variable) accept, and
1248 store in the created object, a user supplied reference value returned by the
1249 callback function.
1250
1251 The lthread reference value is passed back in all subsequent event callbacks,
1252 and APIs are provided to retrieve the reference value from
1253 mutexes and condition variables. This enables a user to monitor, count, or
1254 filter for specific events, on specific objects, for example to monitor for a
1255 specific thread signaling a specific condition variable, or to monitor
1256 all timer events. The possibilities and combinations are endless.
1257
1258 The callback function can be set by calling the function
1259 ``lthread_diagnostic_enable()`` supplying a callback function pointer and an
1260 event mask.
1261
1262 Setting ``LTHREAD_DIAG`` also enables counting of statistics about cache and
1263 queue usage, and these statistics can be displayed by calling the function
1264 ``lthread_diag_stats_display()``. This function also performs a consistency
1265 check on the caches and queues. The function should only be called from the
1266 master EAL thread after all slave threads have stopped and returned to the C
1267 main program, otherwise the consistency check will fail.