doc/guides/sample_app_ug/performance_thread.rst

   1 ..  SPDX-License-Identifier: BSD-3-Clause
   2     Copyright(c) 2015 Intel Corporation.
   3
   4 Performance Thread Sample Application
   5 =====================================
   6
   7 The performance thread sample application is a derivative of the standard L3
   8 forwarding application that demonstrates different threading models.
   9
  10 Overview
  11 --------
  12 For a general description of the L3 forwarding application's capabilities
  13 please refer to the documentation of the standard application in
  14 :doc:`l3_forward`.
  15
  16 The performance thread sample application differs from the standard L3
  17 forwarding example in that it divides the TX and RX processing between
  18 different threads, and makes it possible to assign individual threads to
  19 different cores.
  20
  21 Three threading models are considered:
  22
  23 #. When there is one EAL thread per physical core.
  24 #. When there are multiple EAL threads per physical core.
  25 #. When there are multiple lightweight threads per EAL thread.
  26
  27 Since DPDK release 2.0 it is possible to launch applications using the
  28 ``--lcores`` EAL parameter, specifying cpu-sets for a physical core. With the
  29 performance thread sample application it is now also possible to assign
  30 individual RX and TX functions to different cores.
  31
  32 As an alternative to dividing the L3 forwarding work between different EAL
  33 threads the performance thread sample introduces the possibility to run the
  34 application threads as lightweight threads (L-threads) within one or
  35 more EAL threads.
  36
  37 In order to facilitate this threading model the example includes a primitive
  38 cooperative scheduler (L-thread) subsystem. More details of the L-thread
  39 subsystem can be found in :ref:`lthread_subsystem`.
  40
  41 **Note:** Whilst theoretically possible, it is not anticipated that multiple
  42 L-thread schedulers would be run on the same physical core. This mode of
  43 operation should not be expected to yield useful performance and is considered
  44 invalid.
  45
  46 Compiling the Application
  47 -------------------------
  48
  49 To compile the sample application see :doc:`compiling`.
  50
  51 The application is located in the ``performance-thread/l3fwd-thread`` sub-directory.
  52
  53 Running the Application
  54 -----------------------
  55
  56 The application has a number of command line options::
  57
  58     ./build/l3fwd-thread [EAL options] --
  59         -p PORTMASK [-P]
  60         --rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]
  61         --tx (lcore,thread)[,(lcore,thread)]
  62         [--enable-jumbo [--max-pkt-len PKTLEN]] [--no-numa]
  63         [--hash-entry-num] [--ipv6] [--no-lthreads] [--stat-lcore lcore]
  64         [--parse-ptype]
  65
  66 Where:
  67
  68 * ``-p PORTMASK``: Hexadecimal bitmask of ports to configure.
  69
  70 * ``-P``: optional, sets all ports to promiscuous mode so that packets are
  71   accepted regardless of the packet's Ethernet MAC destination address.
  72   Without this option, only packets with the Ethernet MAC destination address
  73   set to the Ethernet address of the port are accepted.
  74
  75 * ``--rx (port,queue,lcore,thread)[,(port,queue,lcore,thread)]``: the list of
  76   NIC RX ports and queues handled by the RX lcores and threads. The parameters
  77   are explained below.
  78
  79 * ``--tx (lcore,thread)[,(lcore,thread)]``: the list of TX threads identifying
  80   the lcore the thread runs on, and the id of the RX thread with which it is
  81   associated. The parameters are explained below.
  82
  83 * ``--enable-jumbo``: optional, enables jumbo frames.
  84
  85 * ``--max-pkt-len``: optional, maximum packet length in decimal (64-9600).
  86
  87 * ``--no-numa``: optional, disables NUMA awareness.
  88
  89 * ``--hash-entry-num``: optional, specifies the hash entry number in hex to be
  90   set up.
  91
  92 * ``--ipv6``: optional, set it when forwarding IPv6 packets.
  93
  94 * ``--no-lthreads``: optional, disables the L-thread model and uses the EAL threading
  95   model. See below.
  96
  97 * ``--stat-lcore``: optional, run CPU load stats collector on the specified
  98   lcore.
  99
 100 * ``--parse-ptype``: optional, set to use software to analyze packet type.
 101   Without this option, hardware will check the packet type.
 102
 103 The parameters of the ``--rx`` and ``--tx`` options are:
 104
 105 * ``--rx`` parameters
 106
 107    .. _table_l3fwd_rx_parameters:
 108
 109    +--------+------------------------------------------------------+
 110    | port   | RX port                                              |
 111    +--------+------------------------------------------------------+
 112    | queue  | RX queue that will be read on the specified RX port  |
 113    +--------+------------------------------------------------------+
 114    | lcore  | Core to use for the thread                           |
 115    +--------+------------------------------------------------------+
 116    | thread | Thread id (consecutive, from 0 to N)                 |
 117    +--------+------------------------------------------------------+
 118
 119
 120 * ``--tx`` parameters
 121
 122    .. _table_l3fwd_tx_parameters:
 123
 124    +--------+------------------------------------------------------+
 125    | lcore  | Core to use for L3 route match and transmit          |
 126    +--------+------------------------------------------------------+
 127    | thread | Id of RX thread to be associated with this TX thread |
 128    +--------+------------------------------------------------------+
 129
 130 The ``l3fwd-thread`` application allows you to start packet processing in two
 131 threading models: L-Threads (default) and EAL Threads (when the
 132 ``--no-lthreads`` parameter is used). For consistency all parameters are used
 133 in the same way for both models.
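
     The option formats described above can be illustrated with a short Python
     sketch. It is not the application's own parser (which is written in C);
     the helper names ``parse_portmask`` and ``parse_tuples`` are hypothetical
     and only show how the ``-p``, ``--rx`` and ``--tx`` values break down.

     .. code-block:: python

        import re

        def parse_portmask(mask_hex):
            """Return the port ids whose bits are set in a hexadecimal portmask."""
            mask = int(mask_hex, 16)
            return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]

        def parse_tuples(arg, width):
            """Split "(a,b,..)(c,d,..)" into integer tuples of the given width."""
            tuples = []
            for group in re.findall(r"\(([^()]*)\)", arg):
                fields = tuple(int(f) for f in group.split(","))
                if len(fields) != width:
                    raise ValueError("expected %d fields in (%s)" % (width, group))
                tuples.append(fields)
            return tuples

        ports = parse_portmask("3")
        rx = parse_tuples("(0,0,0,0)(1,0,1,1)", 4)   # (port, queue, lcore, thread)
        tx = parse_tuples("(2,0)(3,1)", 2)           # (lcore, thread)

        print(ports)  # [0, 1]
        print(rx)     # [(0, 0, 0, 0), (1, 0, 1, 1)]
        print(tx)     # [(2, 0), (3, 1)]

     Here ``-p 3`` selects ports 0 and 1, each ``--rx`` tuple binds one port and
     queue to an lcore and RX thread id, and each ``--tx`` tuple names the lcore
     of a TX thread and the RX thread id it serves.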
 134
 135
 136 Running with L-threads
 137 ~~~~~~~~~~~~~~~~~~~~~~
 138
 139 When the L-thread model is used (default option), lcore and thread parameters
 140 in ``--rx/--tx`` are used to affinitize threads to the selected scheduler.
 141
 142 For example, the following places each L-thread on a different lcore::
 143
 144    l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
 145                 --rx="(0,0,0,0)(1,0,1,1)" \
 146                 --tx="(2,0)(3,1)"
 147
 148 The following places the RX L-threads on lcore 0 and the TX L-threads on lcores 1
 149 and 2::
 150
 151    l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
 152                 --rx="(0,0,0,0)(1,0,0,1)" \
 153                 --tx="(1,0)(2,1)"
 154
 155
 156 Running with EAL threads
 157 ~~~~~~~~~~~~~~~~~~~~~~~~
 158
 159 When the ``--no-lthreads`` parameter is used, the L-threading model is turned
 160 off and EAL threads are used for all processing. EAL threads are enumerated in
 161 the same way as L-threads, but the ``--lcores`` EAL parameter is used to
 162 affinitize threads to the selected cpu-set (scheduler). Thus it is possible to
 163 place every RX and TX thread on different lcores.
 164
 165 For example, the following places each EAL thread on a different lcore::
 166
 167    l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
 168                 --rx="(0,0,0,0)(1,0,1,1)" \
 169                 --tx="(2,0)(3,1)" \
 170                 --no-lthreads
 171
 172
 173 To affinitize two or more EAL threads to one cpu-set, the EAL ``--lcores``
 174 parameter is used.
 175
 176 The following places the RX EAL threads (lcores 0 and 1) on CPU 0 and the TX EAL
 177 threads (lcores 2 and 3) on CPU 1::
 178
 179    l3fwd-thread -l 0-7 -n 2 --lcores="(0,1)@0,(2,3)@1" -- -P -p 3 \
 180                 --rx="(0,0,0,0)(1,0,1,1)" \
 181                 --tx="(2,0)(3,1)" \
 182                 --no-lthreads
 183
 184
 185 Examples
 186 ~~~~~~~~
 187
 188 For selected scenarios the command line configuration of the application for L-threads
 189 and its corresponding EAL threads command line can be realized as follows:
 190
 191 a) Start every thread on a different scheduler (1:1)::
 192
 193       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
 194                    --rx="(0,0,0,0)(1,0,1,1)" \
 195                    --tx="(2,0)(3,1)"
 196
 197    EAL thread equivalent::
 198
 199       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
 200                    --rx="(0,0,0,0)(1,0,1,1)" \
 201                    --tx="(2,0)(3,1)" \
 202                    --no-lthreads
 203
 204 b) Start all threads on one core (N:1).
 205
 206    Start 4 L-threads on lcore 0::
 207
 208       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
 209                    --rx="(0,0,0,0)(1,0,0,1)" \
 210                    --tx="(0,0)(0,1)"
 211
 212    Start 4 EAL threads on cpu-set 0::
 213
 214       l3fwd-thread -l 0-7 -n 2 --lcores="(0-3)@0" -- -P -p 3 \
 215                    --rx="(0,0,0,0)(1,0,0,1)" \
 216                    --tx="(2,0)(3,1)" \
 217                    --no-lthreads
 218
 219 c) Start threads on different cores (N:M).
 220
 221    Start 2 L-threads for RX on lcore 0, and 2 L-threads for TX on lcore 1::
 222
 223       l3fwd-thread -l 0-7 -n 2 -- -P -p 3 \
 224                    --rx="(0,0,0,0)(1,0,0,1)" \
 225                    --tx="(1,0)(1,1)"
 226
 227    Start 2 EAL threads for RX on cpu-set 0, and 2 EAL threads for TX on
 228    cpu-set 1::
 229
 230       l3fwd-thread -l 0-7 -n 2 --lcores="(0-1)@0,(2-3)@1" -- -P -p 3 \
 231                    --rx="(0,0,0,0)(1,0,1,1)" \
 232                    --tx="(2,0)(3,1)" \
 233                    --no-lthreads
 234
 235 Explanation
 236 -----------
 237
 238 To a great extent the sample application differs little from the standard L3
 239 forwarding application, and readers are advised to familiarize themselves with
 240 the material covered in the :doc:`l3_forward` documentation before proceeding.
 241
 242 The following explanation is focused on the way threading is handled in the
 243 performance thread example.
 244
 245
 246 Mode of operation with EAL threads
 247 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 248
 249 The performance thread sample application has split the RX and TX functionality
 250 into two different threads, and the RX and TX threads are
 251 interconnected via software rings. With respect to these rings the RX threads
 252 are producers and the TX threads are consumers.
 253
 254 On initialization the TX and RX threads are started according to the command
 255 line parameters.
 256
 257 The RX threads poll the network interface queues and post received packets to a
 258 TX thread via a corresponding software ring.
 259
 260 The TX threads poll software rings, perform the L3 forwarding hash/LPM match,
 261 and assemble packet bursts before performing burst transmit on the network
 262 interface.
 263
 264 As with the standard L3 forward application, burst draining of residual packets
 265 is performed periodically with the period calculated from elapsed time using
 266 the timestamp counter.
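
     The producer/consumer relationship described above can be modelled in a few
     lines of Python. This is only an illustrative sketch: a ``deque`` stands in
     for the software ring, ``BURST_SIZE`` is an arbitrary value, and the
     function names are hypothetical rather than taken from the application.

     .. code-block:: python

        from collections import deque

        BURST_SIZE = 4          # illustrative value, not the application's setting

        ring = deque()          # stands in for the software ring between RX and TX
        tx_buffer = []          # packets waiting to be sent as one burst
        sent_bursts = []        # record of what was "transmitted"

        def rx_poll(packets):
            """RX side (producer): post received packets to the ring."""
            ring.extend(packets)

        def tx_poll():
            """TX side (consumer): take packets from the ring, send full bursts."""
            while ring:
                tx_buffer.append(ring.popleft())   # route lookup would happen here
                if len(tx_buffer) == BURST_SIZE:
                    sent_bursts.append(list(tx_buffer))
                    tx_buffer.clear()

        def tx_drain():
            """Periodic drain: flush a partial burst so packets are not held back."""
            if tx_buffer:
                sent_bursts.append(list(tx_buffer))
                tx_buffer.clear()

        rx_poll(range(6))
        tx_poll()
        tx_drain()
        print(sent_bursts)  # [[0, 1, 2, 3], [4, 5]]

     The last two packets do not fill a burst, so they leave only when the drain
     step runs. This is the role of the periodic burst draining.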
 267
 268 The diagram below illustrates a case with two RX threads and three TX threads.
 269
 270 .. _figure_performance_thread_1:
 271
 272 .. figure:: img/performance_thread_1.*
 273
 274
 275 Mode of operation with L-threads
 276 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 277
 278 Like the EAL thread configuration the application has split the RX and TX
 279 functionality into different threads, and the pairs of RX and TX threads are
 280 interconnected via software rings.
 281
 282 On initialization an L-thread scheduler is started on every EAL thread. On all
 283 but the master EAL thread only a dummy L-thread is initially started.
 284 The L-thread started on the master EAL thread then spawns other L-threads on
 285 different L-thread schedulers according to the command line parameters.
 286
 287 The RX threads poll the network interface queues and post received packets
 288 to a TX thread via the corresponding software ring.
 289
 290 The ring interface is augmented by means of an L-thread condition variable that
 291 enables the TX thread to be suspended when the TX ring is empty. The RX thread
 292 signals the condition whenever it posts to the TX ring, causing the TX thread
 293 to be resumed.
 294
 295 Additionally the TX L-thread spawns a worker L-thread to take care of
 296 polling the software rings, whilst it handles burst draining of the transmit
 297 buffer.
 298
 299 The worker threads poll the software rings, perform L3 route lookup and
 300 assemble packet bursts. If the TX ring is empty the worker thread suspends
 301 itself by waiting on the condition variable associated with the ring.
 302
 303 Burst draining of residual packets, less than the burst size, is performed by
 304 the TX thread which sleeps (using an L-thread sleep function) and resumes
 305 periodically to flush the TX buffer.
 306
 307 This design means that L-threads that have no work can yield the CPU to other
 308 L-threads and avoid having to constantly poll the software rings.
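
     The suspend-and-signal behaviour can be sketched with Python generators
     standing in for L-threads. This is a simplified model under stated
     assumptions, not the L-thread API: the ``Cond`` class, ``signal`` helper and
     task names are hypothetical, and a task suspends by yielding the condition
     it wants to wait on.

     .. code-block:: python

        from collections import deque

        class Cond:
            """Condition variable for cooperative tasks."""
            def __init__(self):
                self.waiters = deque()

        ready = deque()     # FIFO ready queue of runnable tasks
        ring = deque()      # software ring from RX to TX
        ring_cond = Cond()
        log = []

        def signal(cond):
            """Make one waiting task runnable again."""
            if cond.waiters:
                ready.append(cond.waiters.popleft())

        def rx_task(packets):
            for pkt in packets:
                ring.append(pkt)
                signal(ring_cond)   # wake the TX worker, if it is suspended
                yield               # rescheduling point

        def tx_worker(expected):
            done = 0
            while done < expected:
                if not ring:
                    log.append("tx suspended")
                    yield ring_cond     # suspend until the RX task signals
                    continue
                log.append("tx forwards %d" % ring.popleft())
                done += 1
                yield

        def run():
            while ready:
                task = ready.popleft()
                try:
                    request = next(task)
                except StopIteration:
                    continue                        # task finished
                if isinstance(request, Cond):
                    request.waiters.append(task)    # parked, not polled
                else:
                    ready.append(task)              # back of the ready queue

        ready.append(tx_worker(2))
        ready.append(rx_task([10, 11]))
        run()
        print(log)  # ['tx suspended', 'tx forwards 10', 'tx forwards 11']

     While the ring is empty the TX worker sits on the condition's wait list and
     consumes no scheduler time; it becomes runnable again only when the RX task
     posts a packet.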
 309
 310 The diagram below illustrates a case with two RX threads and three TX functions
 311 (each comprising a thread that processes forwarding and a thread that
 312 periodically drains the output buffer of residual packets).
 313
 314 .. _figure_performance_thread_2:
 315
 316 .. figure:: img/performance_thread_2.*
 317
 318
 319 CPU load statistics
 320 ~~~~~~~~~~~~~~~~~~~
 321
 322 It is possible to display statistics showing estimated CPU load on each core.
 323 The statistics indicate the percentage of CPU time spent: processing
 324 received packets (forwarding), polling queues/rings (waiting for work),
 325 and doing any other processing (context switch and other overhead).
 326
 327 When enabled, statistics are gathered by having the application threads set and
 328 clear flags when they enter and exit pertinent code sections. The flags are
 329 then sampled in real time by a statistics collector thread running on another
 330 core. This thread displays the data in real time on the console.
 331
 332 This feature is enabled by designating a statistics collector core, using the
 333 ``--stat-lcore`` parameter.
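
     As a rough illustration of how sampled flags become load figures, the sketch
     below converts a sequence of sampled states into percentages. The state
     names and the ``load_percentages`` helper are assumptions made for this
     example; the real collector samples flags set by the worker threads.

     .. code-block:: python

        from collections import Counter

        # States a worker might flag; the names are illustrative.
        samples = ["forwarding", "polling", "polling", "forwarding",
                   "other", "polling", "forwarding", "forwarding"]

        def load_percentages(samples):
            """Turn sampled states into the percentage of samples per state."""
            counts = Counter(samples)
            total = len(samples)
            return {state: 100.0 * n / total for state, n in counts.items()}

        print(load_percentages(samples))
        # {'forwarding': 50.0, 'polling': 37.5, 'other': 12.5}

     Because the figures come from sampling, they are estimates whose accuracy
     depends on how often the collector reads the flags.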
 334
 335
 336 .. _lthread_subsystem:
 337
 338 The L-thread subsystem
 339 ----------------------
 340
 341 The L-thread subsystem resides in the ``examples/performance-thread/common``
 342 directory and is built and linked automatically when building the
 343 ``l3fwd-thread`` example.
 344
 345 The subsystem provides a simple cooperative scheduler to enable arbitrary
 346 functions to run as cooperative threads within a single EAL thread.
 347 The subsystem provides a pthread-like API that is intended to assist in
 348 reuse of legacy code written for POSIX pthreads.
 349
 350 The following sections provide some detail on the features, constraints,
 351 performance and porting considerations when using L-threads.
 352
 353
 354 .. _comparison_between_lthreads_and_pthreads:
 355
 356 Comparison between L-threads and POSIX pthreads
 357 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 358
 359 The fundamental difference between the L-thread and pthread models is the
 360 way in which threads are scheduled. The simplest way to think about this is to
 361 consider the case of a processor with a single CPU. To run multiple threads
 362 on a single CPU, the scheduler must frequently switch between the threads,
 363 in order that each thread is able to make timely progress.
 364 This is the basis of any multitasking operating system.
 365
 366 This section explores the differences between the pthread model and the
 367 L-thread model as implemented in the provided L-thread subsystem. If needed a
 368 theoretical discussion of preemptive vs cooperative multi-threading can be
 369 found in any good text on operating system design.
 370
 371
 372 Scheduling and context switching
 373 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 374
 375 The POSIX pthread library provides an application programming interface to
 376 create and synchronize threads. Scheduling policy is determined by the host OS,
 377 and may be configurable. The OS may use sophisticated rules to determine which
 378 thread should be run next, threads may suspend themselves or make other threads
 379 ready, and the scheduler may employ a time slice giving each thread a maximum
 380 time quantum after which it will be preempted in favor of another thread that
 381 is ready to run. To complicate matters further threads may be assigned
 382 different scheduling priorities.
 383
 384 By contrast the L-thread subsystem is considerably simpler. Logically the
 385 L-thread scheduler performs the same multiplexing function for L-threads
 386 within a single pthread as the OS scheduler does for pthreads within an
 387 application process. The L-thread scheduler is simply the main loop of a
 388 pthread, and in so far as the host OS is concerned it is a regular pthread
 389 just like any other. The host OS is unaware of the existence of L-threads and
 390 is not involved in scheduling them.
 391
 392 The other and most significant difference between the two models is that
 393 L-threads are scheduled cooperatively. L-threads cannot preempt each
 394 other, nor can the L-thread scheduler preempt a running L-thread (i.e.
 395 there is no time slicing). The consequence is that programs implemented with
 396 L-threads must possess frequent rescheduling points, meaning that they must
 397 explicitly and of their own volition return to the scheduler at frequent
 398 intervals, in order to allow other L-threads an opportunity to proceed.
 399
 400 In both models switching between threads requires that the current CPU
 401 context is saved and a new context (belonging to the next thread ready to run)
 402 is restored. With pthreads this context switching is handled transparently
 403 and the set of CPU registers that must be preserved between context switches
 404 is as per an interrupt handler.
 405
 406 An L-thread context switch is achieved by the thread itself making a function
 407 call to the L-thread scheduler. Thus it is only necessary to preserve the
 408 callee-saved registers. The caller is responsible for saving any other
 409 registers it is using before a function call and restoring them on return,
 410 and this is handled by the compiler. For ``X86_64`` on both Linux and BSD the
 411 System V calling convention is used, which defines registers RBX, RSP, RBP, and
 412 R12-R15 as callee-saved registers (for a more detailed discussion a good reference
 413 is `X86 Calling Conventions <https://en.wikipedia.org/wiki/X86_calling_conventions>`_).
 414
 415 Taking advantage of this, and due to the absence of preemption, an L-thread
 416 context switch is achieved with fewer than 20 load/store instructions.
 417
 418 The scheduling policy for L-threads is fixed: there is no prioritization of
 419 L-threads, all L-threads are equal, and scheduling is based on a FIFO
 420 ready queue.
 421
 422 An L-thread is a struct containing the CPU context of the thread
 423 (saved on context switch) and other useful items. The ready queue contains
 424 pointers to threads that are ready to run. The L-thread scheduler is a simple
 425 loop that polls the ready queue, reads from it the next thread ready to run,
 426 which it resumes by saving the current context (the current position in the
 427 scheduler loop) and restoring the context of the next thread from its thread
 428 struct. Thus an L-thread is always resumed at the last place it yielded.
 429
 430 A well behaved L-thread will call the context switch regularly (at least once
 431 in its main loop) thus returning to the scheduler's own main loop. Yielding
 432 inserts the current thread at the back of the ready queue, and the process of
 433 servicing the ready queue is repeated. Thus the system runs by flipping back
 434 and forth between the L-threads and the scheduler loop.
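
     The FIFO ready queue and the back-and-forth between threads and scheduler
     loop can be modelled with Python generators. This is a conceptual sketch
     only: a generator's ``yield`` stands in for the register-level context
     switch, and the names ``lthread`` and ``scheduler`` are hypothetical.

     .. code-block:: python

        from collections import deque

        trace = []

        def lthread(name, iterations):
            """A well-behaved task: does a unit of work, then yields."""
            for i in range(iterations):
                trace.append("%s%d" % (name, i))
                yield   # voluntary return to the scheduler loop

        def scheduler(ready):
            """Resume tasks in FIFO order until none are left."""
            while ready:
                task = ready.popleft()
                try:
                    next(task)          # resume at the last yield point
                except StopIteration:
                    continue            # task exited; do not requeue it
                ready.append(task)      # yielding goes to the back of the queue

        scheduler(deque([lthread("A", 2), lthread("B", 3)]))
        print(trace)  # ['A0', 'B0', 'A1', 'B1', 'B2']

     The interleaved trace shows that each task advances only when the other
     yields. A task that looped without yielding would hold the scheduler
     indefinitely.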
 435
 436 In the case of pthreads, the preemptive scheduling, time slicing, and support
 437 for thread prioritization mean that progress is normally possible for any
 438 thread that is ready to run. This comes at the price of a relatively heavier
 439 context switch and scheduling overhead.
 440
 441 With L-threads the progress of any particular thread is determined by the
 442 frequency of rescheduling opportunities in the other L-threads. This means that
 443 an errant L-thread monopolizing the CPU might cause scheduling of other threads
 444 to be stalled. Due to the lower cost of context switching, however, voluntary
 445 rescheduling to ensure progress of other threads, if managed sensibly, is not
 446 a prohibitive overhead, and overall performance can exceed that of an
 447 application using pthreads.
 448
 449
 450 Mutual exclusion
 451 ^^^^^^^^^^^^^^^^
 452
 453 With pthreads preemption means that threads that share data must observe
 454 some form of mutual exclusion protocol.
 455
 456 The fact that L-threads cannot preempt each other means that in many cases
 457 mutual exclusion devices can be completely avoided.
 458
 459 Locking to protect shared data can be a significant bottleneck in
 460 multi-threaded applications so a carefully designed cooperatively scheduled
 461 program can enjoy significant performance advantages.
 462
 463 So far we have considered only the simplistic case of a single core CPU.
 464 When multiple CPUs are considered, things are somewhat more complex.
 465
 466 First of all it is inevitable that there must be multiple L-thread schedulers,
 467 one running on each EAL thread. So long as these schedulers remain isolated
 468 from each other the above assertions about the potential advantages of
 469 cooperative scheduling hold true.
 470
 471 A configuration with isolated cooperative schedulers is less flexible than the
 472 pthread model where threads can be affinitized to run on any CPU. With isolated
 473 schedulers scaling of applications to utilize fewer or more CPUs according to
 474 system demand is very difficult to achieve.
 475
 476 The L-thread subsystem makes it possible for L-threads to migrate between
 477 schedulers running on different CPUs. Needless to say if the migration means
 478 that threads that share data end up running on different CPUs then this will
 479 introduce the need for some kind of mutual exclusion system.
 480
 481 Of course ``rte_ring`` software rings can always be used to interconnect
 482 threads running on different cores; however, to protect other kinds of shared
 483 data structures, lock-free constructs or else explicit locking will be
 484 required. This is a consideration for the application design.
 485
 486 In support of this extended functionality, the L-thread subsystem implements
 487 thread safe mutexes and condition variables.
 488
 489 The cost of affinitizing and of condition variable signaling is significantly
 490 lower than the equivalent pthread operations, and so applications using these
 491 features will see a performance benefit.
 492
 493
 494 Thread local storage
 495 ^^^^^^^^^^^^^^^^^^^^
 496
 497 As with applications written for pthreads an application written for L-threads
 498 can take advantage of thread local storage, in this case local to an L-thread.
 499 An application may save and retrieve a single pointer to application data in
 500 the L-thread struct.
 501
 502 For legacy and backward compatibility reasons two alternative methods are also
 503 offered. The first is modelled directly on the pthread get/set specific APIs;
 504 the second is modelled on the ``RTE_PER_LCORE`` macros, whereby
 505 ``PER_LTHREAD`` macros are introduced. In both cases the storage is local to
 506 the L-thread.
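
     The idea of key-based storage that is private to each lightweight thread
     can be sketched as follows. This is a conceptual Python model, not the
     L-thread API: the ``Task`` class and the ``key_create``, ``set_specific``
     and ``get_specific`` helpers are hypothetical names chosen to echo the
     pthread get/set specific style.

     .. code-block:: python

        import itertools

        _key_ids = itertools.count()

        class Task:
            """Stand-in for a lightweight thread with its own local storage."""
            def __init__(self):
                self.data = None      # the single application pointer
                self.specific = {}    # values stored per key

        def key_create():
            """Allocate a new key that every task can use independently."""
            return next(_key_ids)

        def set_specific(task, key, value):
            task.specific[key] = value

        def get_specific(task, key):
            return task.specific.get(key)

        key = key_create()
        a, b = Task(), Task()
        set_specific(a, key, "state of a")
        set_specific(b, key, "state of b")
        print(get_specific(a, key), "|", get_specific(b, key))
        # state of a | state of b

     The same key yields a different value in each task, which is the property
     all three storage methods share.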
 507
 508
 509 .. _constraints_and_performance_implications:
 510
 511 Constraints and performance implications when using L-threads
 512 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 513
 514
 515 .. _API_compatibility:
 516
 517 API compatibility
 518 ^^^^^^^^^^^^^^^^^
 519
 520 The L-thread subsystem provides a set of functions that are logically equivalent
 521 to the corresponding functions offered by the POSIX pthread library; however, not
 522 all pthread functions have a corresponding L-thread equivalent, and not all
 523 features available to pthreads are implemented for L-threads.
 524
 525 The pthread library offers considerable flexibility via programmable attributes
 526 that can be associated with threads, mutexes, and condition variables.
 527
 528 By contrast the L-thread subsystem has fixed functionality: the scheduler policy
 529 cannot be varied, and L-threads cannot be prioritized. There are no variable
 530 attributes associated with any L-thread objects. L-threads, mutexes and
 531 condition variables all have fixed functionality. (Note: reserved parameters
 532 are included in the APIs to facilitate possible future support for attributes.)
 533
 534 The table below lists the pthread and equivalent L-thread APIs with notes on
 535 differences and/or constraints. Where there is no L-thread entry in the table,
 536 the L-thread subsystem provides no equivalent function.
 537
 538 .. _table_lthread_pthread:
 539
 540 .. table:: Pthread and equivalent L-thread APIs.
 541
 542    +----------------------------+------------------------+-------------------+
 543    | **Pthread function**       | **L-thread function**  | **Notes**         |
 544    +============================+========================+===================+
 545    | pthread_barrier_destroy    |                        |                   |
 546    +----------------------------+------------------------+-------------------+
 547    | pthread_barrier_init       |                        |                   |
 548    +----------------------------+------------------------+-------------------+
 549    | pthread_barrier_wait       |                        |                   |
 550    +----------------------------+------------------------+-------------------+
 551    | pthread_cond_broadcast     | lthread_cond_broadcast | See note 1        |
 552    +----------------------------+------------------------+-------------------+
 553    | pthread_cond_destroy       | lthread_cond_destroy   |                   |
 554    +----------------------------+------------------------+-------------------+
 555    | pthread_cond_init          | lthread_cond_init      |                   |
 556    +----------------------------+------------------------+-------------------+
 557    | pthread_cond_signal        | lthread_cond_signal    | See note 1        |
 558    +----------------------------+------------------------+-------------------+
 559    | pthread_cond_timedwait     |                        |                   |
 560    +----------------------------+------------------------+-------------------+
 561    | pthread_cond_wait          | lthread_cond_wait      | See note 5        |
 562    +----------------------------+------------------------+-------------------+
 563    | pthread_create             | lthread_create         | See notes 2, 3    |
 564    +----------------------------+------------------------+-------------------+
 565    | pthread_detach             | lthread_detach         | See note 4        |
 566    +----------------------------+------------------------+-------------------+
 567    | pthread_equal              |                        |                   |
 568    +----------------------------+------------------------+-------------------+
 569    | pthread_exit               | lthread_exit           |                   |
 570    +----------------------------+------------------------+-------------------+
 571    | pthread_getspecific        | lthread_getspecific    |                   |
 572    +----------------------------+------------------------+-------------------+
 573    | pthread_getcpuclockid      |                        |                   |
 574    +----------------------------+------------------------+-------------------+
 575    | pthread_join               | lthread_join           |                   |
 576    +----------------------------+------------------------+-------------------+
 577    | pthread_key_create         | lthread_key_create     |                   |
 578    +----------------------------+------------------------+-------------------+
 579    | pthread_key_delete         | lthread_key_delete     |                   |
 580    +----------------------------+------------------------+-------------------+
 581    | pthread_mutex_destroy      | lthread_mutex_destroy  |                   |
 582    +----------------------------+------------------------+-------------------+
 583    | pthread_mutex_init         | lthread_mutex_init     |                   |
 584    +----------------------------+------------------------+-------------------+
 585    | pthread_mutex_lock         | lthread_mutex_lock     | See note 6        |
 586    +----------------------------+------------------------+-------------------+
 587    | pthread_mutex_trylock      | lthread_mutex_trylock  | See note 6        |
 588    +----------------------------+------------------------+-------------------+
 589    | pthread_mutex_timedlock    |                        |                   |
 590    +----------------------------+------------------------+-------------------+
 591    | pthread_mutex_unlock       | lthread_mutex_unlock   |                   |
 592    +----------------------------+------------------------+-------------------+
 593    | pthread_once               |                        |                   |
 594    +----------------------------+------------------------+-------------------+
 595    | pthread_rwlock_destroy     |                        |                   |
 596    +----------------------------+------------------------+-------------------+
 597    | pthread_rwlock_init        |                        |                   |
 598    +----------------------------+------------------------+-------------------+
 599    | pthread_rwlock_rdlock      |                        |                   |
 600    +----------------------------+------------------------+-------------------+
 601    | pthread_rwlock_timedrdlock |                        |                   |
 602    +----------------------------+------------------------+-------------------+
 603    | pthread_rwlock_timedwrlock |                        |                   |
 604    +----------------------------+------------------------+-------------------+
 605    | pthread_rwlock_tryrdlock   |                        |                   |
 606    +----------------------------+------------------------+-------------------+
 607    | pthread_rwlock_trywrlock   |                        |                   |
 608    +----------------------------+------------------------+-------------------+
 609    | pthread_rwlock_unlock      |                        |                   |
 610    +----------------------------+------------------------+-------------------+
 611    | pthread_rwlock_wrlock      |                        |                   |
 612    +----------------------------+------------------------+-------------------+
 613    | pthread_self               | lthread_current        |                   |
 614    +----------------------------+------------------------+-------------------+
 615    | pthread_setspecific        | lthread_setspecific    |                   |
 616    +----------------------------+------------------------+-------------------+
 617    | pthread_spin_init          |                        | See note 10       |
 618    +----------------------------+------------------------+-------------------+
 619    | pthread_spin_destroy       |                        | See note 10       |
 620    +----------------------------+------------------------+-------------------+
 621    | pthread_spin_lock          |                        | See note 10       |
 622    +----------------------------+------------------------+-------------------+
 623    | pthread_spin_trylock       |                        | See note 10       |
 624    +----------------------------+------------------------+-------------------+
 625    | pthread_spin_unlock        |                        | See note 10       |
 626    +----------------------------+------------------------+-------------------+
 627    | pthread_cancel             | lthread_cancel         |                   |
 628    +----------------------------+------------------------+-------------------+
 629    | pthread_setcancelstate     |                        |                   |
 630    +----------------------------+------------------------+-------------------+
 631    | pthread_setcanceltype      |                        |                   |
 632    +----------------------------+------------------------+-------------------+
 633    | pthread_testcancel         |                        |                   |
 634    +----------------------------+------------------------+-------------------+
 635    | pthread_getschedparam      |                        |                   |
 636    +----------------------------+------------------------+-------------------+
 637    | pthread_setschedparam      |                        |                   |
 638    +----------------------------+------------------------+-------------------+
 639    | pthread_yield              | lthread_yield          | See note 7        |
 640    +----------------------------+------------------------+-------------------+
 641    | pthread_setaffinity_np     | lthread_set_affinity   | See notes 2, 3, 8 |
 642    +----------------------------+------------------------+-------------------+
 643    |                            | lthread_sleep          | See note 9        |
 644    +----------------------------+------------------------+-------------------+
 645    |                            | lthread_sleep_clks     | See note 9        |
 646    +----------------------------+------------------------+-------------------+
 647
 648
 649 **Note 1**:
 650
 651 Neither lthread signal nor broadcast may be called concurrently by L-threads
 652 running on different schedulers, although multiple L-threads running in the
 653 same scheduler may freely perform signal or broadcast operations. L-threads
 654 running on the same or different schedulers may always safely wait on a
 655 condition variable.
 656
 657
 658 **Note 2**:
 659
 660 Pthread attributes may be used to affinitize a pthread with a cpu-set. The
 661 L-thread subsystem does not support a cpu-set. An L-thread may be affinitized
 662 only with a single CPU at any time.
 663
 664
 665 **Note 3**:
 666
If an L-thread is intended to run on a different NUMA node from the one on
which it is created, it is advantageous to specify the destination core as a
parameter of ``lthread_create()``. See
:ref:`memory_allocation_and_NUMA_awareness` for details.
 671
 672
 673 **Note 4**:
 674
 675 An L-thread can only detach itself, and cannot detach other L-threads.
 676
 677
 678 **Note 5**:
 679
A wait operation on a pthread condition variable is always associated with and
protected by a mutex which must be owned by the thread at the time it invokes
``pthread_cond_wait()``. By contrast, L-thread condition variables are thread
safe (for waiters) and do not use an associated mutex. Multiple L-threads
(including L-threads running on other schedulers) can safely wait on an
L-thread condition variable. As a consequence, an L-thread condition variable
is typically an order of magnitude faster than its pthread counterpart.
 687
 688
 689 **Note 6**:
 690
Recursive locking is not supported with L-threads; attempts to take a lock
recursively will be detected and rejected.
 693
 694
 695 **Note 7**:
 696
``lthread_yield()`` will save the current context, insert the current thread
at the back of the ready queue, and resume the next ready thread. Yielding
increases ready queue backlog; see :ref:`ready_queue_backlog` for more details
about the implications of this.
 701
 702
N.B. The context switch time as measured from immediately before the call to
``lthread_yield()`` to the point at which the next ready thread is resumed
can be an order of magnitude faster than the same measurement for
``pthread_yield()``.
 707
 708
 709 **Note 8**:
 710
 711 ``lthread_set_affinity()`` is similar to a yield apart from the fact that the
 712 yielding thread is inserted into a peer ready queue of another scheduler.
 713 The peer ready queue is actually a separate thread safe queue, which means that
 714 threads appearing in the peer ready queue can jump any backlog in the local
 715 ready queue on the destination scheduler.
 716
 717 The context switch time as measured from the time just before the call to
 718 ``lthread_set_affinity()`` to just after the same thread is resumed on the new
 719 scheduler can be orders of magnitude faster than the same measurement for
 720 ``pthread_setaffinity_np()``.
 721
 722
 723 **Note 9**:
 724
 725 Although there is no ``pthread_sleep()`` function, ``lthread_sleep()`` and
 726 ``lthread_sleep_clks()`` can be used wherever ``sleep()``, ``usleep()`` or
 727 ``nanosleep()`` might ordinarily be used. The L-thread sleep functions suspend
 728 the current thread, start an ``rte_timer`` and resume the thread when the
 729 timer matures. The ``rte_timer_manage()`` entry point is called on every pass
 730 of the scheduler loop. This means that the worst case jitter on timer expiry
 731 is determined by the longest period between context switches of any running
 732 L-threads.
 733
In a synthetic test with many threads sleeping and resuming, the measured
jitter is typically orders of magnitude lower than the same measurement made
for ``nanosleep()``.
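
The two sleep variants differ in the unit of their argument. As a rough
sketch, assuming ``lthread_sleep()`` takes nanoseconds and
``lthread_sleep_clks()`` takes timer clock ticks, a delay can be converted
from one unit to the other as shown below. The helper ``ns_to_clks()`` and its
explicit ``hz`` parameter are illustrative and not part of the L-thread API;
in a DPDK application the frequency would presumably come from a query such as
``rte_get_tsc_hz()``.

.. code-block:: c

    #include <stdint.h>

    #define NS_PER_SEC UINT64_C(1000000000)

    /*
     * Convert a delay in nanoseconds to ticks of a clock running at hz.
     * Whole seconds and the sub-second remainder are converted separately
     * so the intermediate product fits in 64 bits for any hz below about
     * 18 GHz. Illustrative helper, not part of the L-thread API.
     */
    static uint64_t ns_to_clks(uint64_t ns, uint64_t hz)
    {
        uint64_t secs = ns / NS_PER_SEC;
        uint64_t rem_ns = ns % NS_PER_SEC;

        return secs * hz + (rem_ns * hz) / NS_PER_SEC;
    }

For a hypothetical 2 GHz timer, a delay of 1000 ns corresponds to 2000 ticks.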
 737
 738
 739 **Note 10**:
 740
Spin locks are not provided because they are problematic in a cooperative
environment; see :ref:`porting_locks_and_spinlocks` for a more detailed
discussion on how to avoid spin locks.
 744
 745
 746 .. _Thread_local_storage_performance:
 747
 748 Thread local storage
 749 ^^^^^^^^^^^^^^^^^^^^
 750
 751 Of the three L-thread local storage options the simplest and most efficient is
 752 storing a single application data pointer in the L-thread struct.
 753
The ``PER_LTHREAD`` macros involve a run time computation to obtain the address
of the variable being saved/retrieved and also require that accesses are
dereferenced via a pointer. This means that code that has used
``RTE_PER_LCORE`` macros and is being ported to L-threads might need some
slight adjustment (see :ref:`porting_thread_local_storage` for hints about
porting code that makes use of thread local storage).
 760
 761 The get/set specific APIs are consistent with their pthread counterparts both
 762 in use and in performance.
 763
 764
 765 .. _memory_allocation_and_NUMA_awareness:
 766
 767 Memory allocation and NUMA awareness
 768 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 769
 770 All memory allocation is from DPDK huge pages, and is NUMA aware. Each
 771 scheduler maintains its own caches of objects: lthreads, their stacks, TLS,
 772 mutexes and condition variables. These caches are implemented as unbounded lock
 773 free MPSC queues. When objects are created they are always allocated from the
 774 caches on the local core (current EAL thread).
 775
 776 If an L-thread has been affinitized to a different scheduler, then it can
 777 always safely free resources to the caches from which they originated (because
 778 the caches are MPSC queues).
 779
 780 If the L-thread has been affinitized to a different NUMA node then the memory
 781 resources associated with it may incur longer access latency.
 782
The commonly used pattern of setting affinity on entry to a thread, after it
has started, means that memory allocation for both the stack and TLS will have
been made from caches on the NUMA node on which the thread's creator is
running. This has the side effect that access latency will be sub-optimal
after affinitizing.
 788
This side effect can be mitigated to some extent (although not completely) by
specifying the destination CPU as a parameter of ``lthread_create()``. This
causes the L-thread's stack and TLS to be allocated when it is first scheduled
on the destination scheduler; if the destination is on another NUMA node, this
results in a more optimal memory allocation.
 794
Note that the lthread struct itself remains allocated from memory on the
creating node. This is unavoidable because an L-thread is known everywhere by
the address of this struct.
 798
 799
 800 .. _object_cache_sizing:
 801
 802 Object cache sizing
 803 ^^^^^^^^^^^^^^^^^^^
 804
The per lcore object caches pre-allocate objects in bulk whenever a request to
allocate an object finds a cache empty. By default 100 objects are
pre-allocated; this value is defined by ``LTHREAD_PREALLOC`` in the public API
header file ``lthread_api.h``. This means that the caches constantly grow to
meet system demand.
 810
 811 In the present implementation there is no mechanism to reduce the cache sizes
 812 if system demand reduces. Thus the caches will remain at their maximum extent
 813 indefinitely.
 814
A consequence of the bulk pre-allocation of objects is that every 100 (default
value) additional object create operations result in a call to
``rte_malloc()``. For objects such as L-threads, whose creation triggers the
allocation of even more objects (i.e. their stacks and TLS), this can
cause outliers in scheduling performance.
 820
 821 If this is a problem the simplest mitigation strategy is to dimension the
 822 system, by setting the bulk object pre-allocation size to some large number
 823 that you do not expect to be exceeded. This means the caches will be populated
 824 once only, the very first time a thread is created.
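
The relationship between the pre-allocation size and the number of bulk
allocations can be illustrated with a small model. This is a simplified sketch
of a single cache that is never given objects back, not the L-thread
implementation; ``struct obj_cache``, ``cache_alloc()`` and ``refills_for()``
are hypothetical names, and the refill counter stands in for the call to
``rte_malloc()``. The pre-allocation size is assumed to be greater than zero.

.. code-block:: c

    #include <stddef.h>

    /* Simplified model of one per-lcore object cache. */
    struct obj_cache {
        size_t prealloc;   /* objects added per refill (cf. LTHREAD_PREALLOC) */
        size_t available;  /* objects currently held in the cache */
        size_t refills;    /* bulk allocations performed so far */
    };

    /* Take one object, refilling in bulk first if the cache is empty. */
    static void cache_alloc(struct obj_cache *c)
    {
        if (c->available == 0) {
            c->available = c->prealloc;  /* stands in for rte_malloc() */
            c->refills++;
        }
        c->available--;
    }

    /* Count the bulk refills needed to create n objects (prealloc > 0). */
    static size_t refills_for(size_t prealloc, size_t n)
    {
        struct obj_cache c = { .prealloc = prealloc };

        for (size_t i = 0; i < n; i++)
            cache_alloc(&c);
        return c.refills;
    }

In this model, creating 250 objects with a pre-allocation size of 100 triggers
three refills, whereas a pre-allocation size of 1000 triggers only the initial
one.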
 825
 826
 827 .. _Ready_queue_backlog:
 828
 829 Ready queue backlog
 830 ^^^^^^^^^^^^^^^^^^^
 831
One of the more subtle performance considerations is managing the ready queue
backlog. The fewer threads that are waiting in the ready queue, the faster
any particular thread will get serviced.
 835
 836 In a naive L-thread application with N L-threads simply looping and yielding,
 837 this backlog will always be equal to the number of L-threads, thus the cost of
 838 a yield to a particular L-thread will be N times the context switch time.
 839
 840 This side effect can be mitigated by arranging for threads to be suspended and
 841 wait to be resumed, rather than polling for work by constantly yielding.
 842 Blocking on a mutex or condition variable or even more obviously having a
 843 thread sleep if it has a low frequency workload are all mechanisms by which a
 844 thread can be excluded from the ready queue until it really does need to be
 845 run. This can have a significant positive impact on performance.
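
The effect of backlog on latency can be estimated with simple arithmetic. The
sketch below assumes that every thread in the ready queue runs for only one
context switch time before yielding again; the function name and the figures
used afterwards are hypothetical.

.. code-block:: c

    #include <stdint.h>

    /*
     * Estimate how long a yielding L-thread waits before it runs again.
     * Each thread in the ready queue (including the yielder, which rejoins
     * at the back) costs one context switch.
     */
    static uint64_t yield_round_trip_ns(unsigned int ready_threads,
                                        uint64_t ctx_switch_ns)
    {
        return (uint64_t)ready_threads * ctx_switch_ns;
    }

With 100 polling threads and an assumed 50 ns context switch, a thread waits
about 5000 ns between runs. If 90 of those threads block instead of polling,
the estimate drops to 500 ns.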
 846
 847
 848 .. _Initialization_and_shutdown_dependencies:
 849
 850 Initialization, shutdown and dependencies
 851 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 852
The L-thread subsystem depends on DPDK for huge page allocation and depends on
the ``rte_timer`` subsystem. The DPDK EAL initialization and
``rte_timer_subsystem_init()`` **MUST** be completed before the L-thread
subsystem can be used.
 857
 858 Thereafter initialization of the L-thread subsystem is largely transparent to
 859 the application. Constructor functions ensure that global variables are properly
 860 initialized. Other than global variables each scheduler is initialized
 861 independently the first time that an L-thread is created by a particular EAL
 862 thread.
 863
 864 If the schedulers are to be run as isolated and independent schedulers, with
 865 no intention that L-threads running on different schedulers will migrate between
 866 schedulers or synchronize with L-threads running on other schedulers, then
 867 initialization consists simply of creating an L-thread, and then running the
 868 L-thread scheduler.
 869
 870 If there will be interaction between L-threads running on different schedulers,
 871 then it is important that the starting of schedulers on different EAL threads
 872 is synchronized.
 873
To achieve this an additional initialization step is necessary: set the number
of schedulers by calling the API function
``lthread_num_schedulers_set(n)``, where ``n`` is the number of EAL threads
that will run L-thread schedulers. Setting the number of schedulers to a
number greater than 0 will cause all schedulers to wait until the others have
started before beginning to schedule L-threads.
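
The synchronized start can be pictured as a simple counting barrier. The
following is a minimal model using C11 atomics and is not the L-thread
implementation; ``struct sched_barrier``, ``barrier_arrive()`` and
``barrier_open()`` are hypothetical names.

.. code-block:: c

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Model of a start barrier shared by all schedulers. */
    struct sched_barrier {
        atomic_int expected;  /* cf. value given to lthread_num_schedulers_set() */
        atomic_int started;   /* schedulers that have entered their run loop */
    };

    /* Called once by each scheduler as it starts. */
    static void barrier_arrive(struct sched_barrier *b)
    {
        atomic_fetch_add(&b->started, 1);
    }

    /*
     * True once every expected scheduler has arrived. With expected == 0
     * the barrier is always open, so isolated schedulers start at once.
     */
    static bool barrier_open(struct sched_barrier *b)
    {
        return atomic_load(&b->started) >= atomic_load(&b->expected);
    }

In such a model each scheduler would call ``barrier_arrive()`` on entry and
then wait until ``barrier_open()`` returns true before scheduling its first
L-thread.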
 880
The L-thread scheduler is started by calling the function ``lthread_run()``,
which should be called from the EAL thread and thus becomes the main loop of
that EAL thread.
 884
The function ``lthread_run()`` will not return until all threads running on
 886 the scheduler have exited, and the scheduler has been explicitly stopped by
 887 calling ``lthread_scheduler_shutdown(lcore)`` or
 888 ``lthread_scheduler_shutdown_all()``.
 889
All these functions do is tell the scheduler that it can exit when there are no
longer any running L-threads; neither function forces any running L-thread to
 892 terminate. Any desired application shutdown behavior must be designed and
 893 built into the application to ensure that L-threads complete in a timely
 894 manner.
 895
**Important Note:** It is assumed when the scheduler exits that the application
is terminating for good. The scheduler does not free resources before exiting,
and running the scheduler a subsequent time will result in undefined behavior.
 899
 900
 901 .. _porting_legacy_code_to_run_on_lthreads:
 902
 903 Porting legacy code to run on L-threads
 904 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 905
 906 Legacy code originally written for a pthread environment may be ported to
 907 L-threads if the considerations about differences in scheduling policy, and
 908 constraints discussed in the previous sections can be accommodated.
 909
 910 This section looks in more detail at some of the issues that may have to be
 911 resolved when porting code.
 912
 913
 914 .. _pthread_API_compatibility:
 915
 916 pthread API compatibility
 917 ^^^^^^^^^^^^^^^^^^^^^^^^^
 918
The first step is to establish exactly which pthread APIs the legacy
application uses, and to understand the requirements of those APIs. If there
are corresponding L-thread APIs, and where the default pthread functionality
is used by the application then, notwithstanding the other issues discussed
here, it should be feasible to run the application with L-threads. If the
legacy code modifies the default behavior using attributes then it may be
necessary to make some adjustments to eliminate those requirements.
 926
 927
 928 .. _blocking_system_calls:
 929
 930 Blocking system API calls
 931 ^^^^^^^^^^^^^^^^^^^^^^^^^
 932
 933 It is important to understand what other system services the application may be
 934 using, bearing in mind that in a cooperatively scheduled environment a thread
 935 cannot block without stalling the scheduler and with it all other cooperative
threads. Any kind of blocking system call, for example file or socket IO, is a
potential problem. A good tool for analyzing the application for this purpose
is the ``strace`` utility.
 939
There are many strategies to resolve these kinds of issues, each with its
merits. Possible solutions include:
 942
 943 * Adopting a polled mode of the system API concerned (if available).
 944
 945 * Arranging for another core to perform the function and synchronizing with
 946   that core via constructs that will not block the L-thread.
 947
 948 * Affinitizing the thread to another scheduler devoted (as a matter of policy)
 949   to handling threads wishing to make blocking calls, and then back again when
 950   finished.
 951
 952
 953 .. _porting_locks_and_spinlocks:
 954
 955 Locks and spinlocks
 956 ^^^^^^^^^^^^^^^^^^^
 957
 958 Locks and spinlocks are another source of blocking behavior that for the same
 959 reasons as system calls will need to be addressed.
 960
 961 If the application design ensures that the contending L-threads will always
run on the same scheduler then it is probably safe to remove locks and spin
 963 locks completely.
 964
 965 The only exception to the above rule is if for some reason the
 966 code performs any kind of context switch whilst holding the lock
 967 (e.g. yield, sleep, or block on a different lock, or on a condition variable).
This will need to be determined before deciding to eliminate a lock.
 969
 970 If a lock cannot be eliminated then an L-thread mutex can be substituted for
 971 either kind of lock.
 972
 973 An L-thread blocking on an L-thread mutex will be suspended and will cause
another ready L-thread to be resumed, thus not blocking the scheduler. When
default behavior is required, an L-thread mutex can be used as a direct
replacement for a pthread mutex lock.
 977
 978 Spin locks are typically used when lock contention is likely to be rare and
 979 where the period during which the lock may be held is relatively short.
 980 When the contending L-threads are running on the same scheduler then an
 981 L-thread blocking on a spin lock will enter an infinite loop stopping the
 982 scheduler completely (see :ref:`porting_infinite_loops` below).
 983
 984 If the application design ensures that contending L-threads will always run
 985 on different schedulers then it might be reasonable to leave a short spin lock
 986 that rarely experiences contention in place.
 987
If after all considerations it appears that a spin lock cannot be
 989 eliminated completely, replaced with an L-thread mutex, or left in place as
 990 is, then an alternative is to loop on a flag, with a call to
 991 ``lthread_yield()`` inside the loop (n.b. if the contending L-threads might
 992 ever run on different schedulers the flag will need to be manipulated
 993 atomically).
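
The flag-and-yield pattern might look like the sketch below, which uses C11
atomics so that the flag is safe across schedulers. ``coop_yield()`` is a
hypothetical stand-in that only counts calls, so that the example is
self-contained; in an L-thread application this is where ``lthread_yield()``
would be called. The ``flag_*`` names are likewise illustrative.

.. code-block:: c

    #include <stdatomic.h>
    #include <stdbool.h>

    static unsigned long yield_calls;

    /* Stand-in for lthread_yield(); here it only counts how often we yield. */
    static void coop_yield(void)
    {
        yield_calls++;
    }

    /* Try to take the flag once; returns true on success. */
    static bool flag_trylock(atomic_flag *f)
    {
        return !atomic_flag_test_and_set_explicit(f, memory_order_acquire);
    }

    /* Loop on the flag, yielding to other ready threads between attempts. */
    static void flag_lock(atomic_flag *f)
    {
        while (!flag_trylock(f))
            coop_yield();
    }

    /* Release the flag so that a waiting thread can take it. */
    static void flag_unlock(atomic_flag *f)
    {
        atomic_flag_clear_explicit(f, memory_order_release);
    }

Because the stand-in does not actually switch threads, calling ``flag_lock()``
on a held flag in this self-contained form would spin forever. With a real
yield in its place the holder gets a chance to run and release the flag.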
 994
 995 Spinning and yielding is the least preferred solution since it introduces
 996 ready queue backlog (see also :ref:`ready_queue_backlog`).
 997
 998
 999 .. _porting_sleeps_and_delays:
1000
1001 Sleeps and delays
1002 ^^^^^^^^^^^^^^^^^
1003
Yet another kind of blocking behavior (albeit momentary) is the use of delay
functions such as ``sleep()``, ``usleep()`` and ``nanosleep()``. All will have
the consequence of stalling the L-thread scheduler, and unless the delay is
very short (e.g. a very short nanosleep) calls to these functions will need to
be eliminated.
1009
1010 The simplest mitigation strategy is to use the L-thread sleep API functions,
1011 of which two variants exist, ``lthread_sleep()`` and ``lthread_sleep_clks()``.
1012 These functions start an rte_timer against the L-thread, suspend the L-thread
1013 and cause another ready L-thread to be resumed. The suspended L-thread is
1014 resumed when the rte_timer matures.
1015
1016
1017 .. _porting_infinite_loops:
1018
1019 Infinite loops
1020 ^^^^^^^^^^^^^^
1021
1022 Some applications have threads with loops that contain no inherent
1023 rescheduling opportunity, and rely solely on the OS time slicing to share
the CPU. In a cooperative environment this will stop everything dead. These
kinds of loops are not hard to identify: in a debug session you will find that
the debugger always stops in the same loop.
1027
1028 The simplest solution to this kind of problem is to insert an explicit
1029 ``lthread_yield()`` or ``lthread_sleep()`` into the loop. Another solution
1030 might be to include the function performed by the loop into the execution path
1031 of some other loop that does in fact yield, if this is possible.
1032
1033
1034 .. _porting_thread_local_storage:
1035
1036 Thread local storage
1037 ^^^^^^^^^^^^^^^^^^^^
1038
1039 If the application uses thread local storage, the use case should be
1040 studied carefully.
1041
1042 In a legacy pthread application either or both the ``__thread`` prefix, or the
1043 pthread set/get specific APIs may have been used to define storage local to a
1044 pthread.
1045
1046 In some applications it may be a reasonable assumption that the data could
1047 or in fact most likely should be placed in L-thread local storage.
1048
1049 If the application (like many DPDK applications) has assumed a certain
1050 relationship between a pthread and the CPU to which it is affinitized, there
1051 is a risk that thread local storage may have been used to save some data items
that are correctly logically associated with the CPU, and other items which
1053 relate to application context for the thread. Only a good understanding of the
1054 application will reveal such cases.
1055
If the application requires that an L-thread be able to move between
schedulers then care should be taken to separate these kinds of data into per
lcore and per L-thread storage. In this way a migrating thread will bring with
it the local data it needs, and pick up the new logical core specific values
from pthread local storage at its new home.
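
The separation can be sketched as two distinct structures. In the model below,
plain arrays and structs stand in for per lcore variables and L-thread local
storage; all names (``struct core_state``, ``struct thread_ctx``,
``migrate()`` and so on) are hypothetical and chosen only for illustration.

.. code-block:: c

    #include <stdint.h>

    #define MAX_CORES 4

    /* Data that belongs to a core: stays behind when a thread migrates. */
    struct core_state {
        uint64_t rx_packets;
    };

    /* Data that belongs to a thread: travels with it between schedulers. */
    struct thread_ctx {
        unsigned int core_id;  /* core the thread currently runs on */
        uint64_t session_id;   /* application context for this thread */
    };

    static struct core_state per_core[MAX_CORES];

    /* Look up the state of whichever core the thread is on right now. */
    static struct core_state *current_core_state(const struct thread_ctx *t)
    {
        return &per_core[t->core_id];
    }

    /* Model a migration: only the core association changes. */
    static int migrate(struct thread_ctx *t, unsigned int new_core)
    {
        if (new_core >= MAX_CORES)
            return -1;
        t->core_id = new_core;
        return 0;
    }

After a migration the thread still sees its own ``session_id``, while
``current_core_state()`` returns the counters of the new core rather than
those of the core it left.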
1061
1062
1063 .. _pthread_shim:
1064
1065 Pthread shim
1066 ~~~~~~~~~~~~
1067
1068 A convenient way to get something working with legacy code can be to use a
1069 shim that adapts pthread API calls to the corresponding L-thread ones.
1070 This approach will not mitigate any of the porting considerations mentioned
1071 in the previous sections, but it will reduce the amount of code churn that
would otherwise be involved. It is a reasonable approach to evaluate
1073 L-threads, before investing effort in porting to the native L-thread APIs.
1074
1075
1076 Overview
1077 ^^^^^^^^
1078 The L-thread subsystem includes an example pthread shim. This is a partial
1079 implementation but does contain the API stubs needed to get basic applications
1080 running. There is a simple "hello world" application that demonstrates the
1081 use of the pthread shim.
1082
1083 A subtlety of working with a shim is that the application will still need
1084 to make use of the genuine pthread library functions, at the very least in
1085 order to create the EAL threads in which the L-thread schedulers will run.
1086 This is the case with DPDK initialization, and exit.
1087
To deal with the initialization and shutdown scenarios, the shim is capable of
switching its adaptor functionality on or off. An application can control this
behavior by calling the function ``pt_override_set()``. The default state
is disabled.
1092
The pthread shim uses the dynamic linker loader and saves the loaded addresses
of the genuine pthread API functions in an internal table. When the shim
functionality is enabled it performs the adaptor function; when disabled it
invokes the genuine pthread function.
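
The dispatch performed by each stub can be modelled in a few lines. This is a
simplified sketch, not the shim source: plain function pointers stand in for
the addresses that the shim obtains from the dynamic loader, and every name
(``shim_yield()``, ``override_set()`` and so on) is hypothetical.

.. code-block:: c

    #include <stdbool.h>

    typedef int (*yield_fn)(void);

    static int genuine_calls, adaptor_calls;

    /* Stand-in for the genuine pthread function found by the loader. */
    static int genuine_yield(void)
    {
        genuine_calls++;
        return 0;
    }

    /* Stand-in for the adaptor that would forward to the L-thread API. */
    static int adaptor_yield(void)
    {
        adaptor_calls++;
        return 0;
    }

    /* Table entry holding the saved address of the genuine function. */
    static yield_fn saved_genuine = genuine_yield;

    /* Override state; disabled by default, as in the shim. */
    static bool override_enabled;

    static void override_set(bool on)
    {
        override_enabled = on;
    }

    /* The stub that the application calls in place of the pthread function. */
    static int shim_yield(void)
    {
        return override_enabled ? adaptor_yield() : saved_genuine();
    }

An application following this model would leave the override disabled while
creating EAL threads and enable it only once the L-thread schedulers are
running.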
1097
The function ``pthread_exit()`` has additional special handling. The standard
system header file ``pthread.h`` declares ``pthread_exit()`` with
``__attribute__((noreturn))``. This is an optimization that is possible because
the pthread is terminating; it enables the compiler to omit the normal
handling of stack and protection of registers since the function is not
expected to return, and in fact the thread is being destroyed. These
optimizations are applied in both the callee and the caller of the
``pthread_exit()`` function.
1106
1107 In our cooperative scheduling environment this behavior is inadmissible. The
1108 pthread is the L-thread scheduler thread, and, although an L-thread is
1109 terminating, there must be a return to the scheduler in order that the system
1110 can continue to run. Further, returning from a function with attribute
1111 ``noreturn`` is invalid and may result in undefined behavior.
1112
1113 The solution is to redefine the ``pthread_exit`` function with a macro,
1114 causing it to be mapped to a stub function in the shim that does not have the
1115 ``noreturn`` attribute. This macro is defined in the file
1116 ``pthread_shim.h``. The stub function is otherwise no different than any of
1117 the other stub functions in the shim, and will switch between the real
1118 ``pthread_exit()`` function or the ``lthread_exit()`` function as
required. The only difference is that the mapping to the stub is made by macro
substitution.
1121
1122 A consequence of this is that the file ``pthread_shim.h`` must be included in
1123 legacy code wishing to make use of the shim. It also means that dynamic
linkage of a pre-compiled binary that did not include ``pthread_shim.h`` is not
supported.
1126
1127 Given the requirements for porting legacy code outlined in
1128 :ref:`porting_legacy_code_to_run_on_lthreads` most applications will require at
1129 least some minimal adjustment and recompilation to run on L-threads so
1130 pre-compiled binaries are unlikely to be met in practice.
1131
1132 In summary the shim approach adds some overhead but can be a useful tool to help
1133 establish the feasibility of a code reuse project. It is also a fairly
1134 straightforward task to extend the shim if necessary.
1135
**Note:** Bearing in mind the preceding discussions about the impact of making
blocking calls, switching the shim in and out on the fly to invoke any
pthread API that might block is something that should typically be avoided.
1139
1140
1141 Building and running the pthread shim
1142 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1143
The shim example application is located in the sample application
in the performance-thread folder.

To build and run the pthread shim example:
1148
#. Go to the example applications folder:
1150
1151    .. code-block:: console
1152
1153        export RTE_SDK=/path/to/rte_sdk
1154        cd ${RTE_SDK}/examples/performance-thread/pthread_shim
1155
1156
1157 #. Set the target (a default target is used if not specified). For example:
1158
1159    .. code-block:: console
1160
1161        export RTE_TARGET=x86_64-native-linuxapp-gcc
1162
1163    See the DPDK Getting Started Guide for possible RTE_TARGET values.
1164
1165 #. Build the application:
1166
1167    .. code-block:: console
1168
1169        make
1170
#. Run the pthread_shim example:
1172
1173    .. code-block:: console
1174
1175        lthread-pthread-shim -c core_mask -n number_of_channels
1176
1177 .. _lthread_diagnostics:
1178
1179 L-thread Diagnostics
1180 ~~~~~~~~~~~~~~~~~~~~
1181
1182 When debugging you must take account of the fact that the L-threads are run in
1183 a single pthread. The current scheduler is defined by
1184 ``RTE_PER_LCORE(this_sched)``, and the current lthread is stored at
1185 ``RTE_PER_LCORE(this_sched)->current_lthread``. Thus on a breakpoint in a GDB
1186 session the current lthread can be obtained by displaying the pthread local
1187 variable ``per_lcore_this_sched->current_lthread``.
1188
Another useful diagnostic feature is the possibility to trace significant
events in the life of an L-thread. This feature is enabled by changing the
value of ``LTHREAD_DIAG`` from 0 to 1 in the file ``lthread_diag_api.h``.
1192
1193 Tracing of events can be individually masked, and the mask may be programmed
1194 at run time. An unmasked event results in a callback that provides information
1195 about the event. The default callback simply prints trace information. The
default mask is 0 (all events off); the mask can be modified by calling the
function ``lthread_diagnostic_set_mask()``.
1198
It is possible to register a user callback function to implement more
1200 sophisticated diagnostic functions.
1201 Object creation events (lthread, mutex, and condition variable) accept, and
1202 store in the created object, a user supplied reference value returned by the
1203 callback function.
1204
The lthread reference value is passed back in all subsequent event callbacks,
and APIs are provided to retrieve the reference value from
mutexes and condition variables. This enables a user to monitor, count, or
filter for specific events on specific objects, for example to monitor for a
specific thread signaling a specific condition variable, or to monitor
all timer events. The possibilities and combinations are endless.
1211
1212 The callback function can be set by calling the function
1213 ``lthread_diagnostic_enable()`` supplying a callback function pointer and an
1214 event mask.
1215
1216 Setting ``LTHREAD_DIAG`` also enables counting of statistics about cache and
1217 queue usage, and these statistics can be displayed by calling the function
1218 ``lthread_diag_stats_display()``. This function also performs a consistency
1219 check on the caches and queues. The function should only be called from the
1220 master EAL thread after all slave threads have stopped and returned to the C
1221 main program, otherwise the consistency check will fail.