src/testbed/barriers.README.org

* Description
The testbed's barriers API facilitates coordination among the peers run by the
testbed and the experiment driver.  The concept is similar to the barrier
synchronisation mechanism found in parallel programming or multithreading
paradigms: a peer waits at a barrier upon reaching it until the barrier is
crossed, i.e., until the barrier has been reached by a predefined number of
peers.  This predefined number of peers required to cross a barrier is also
called the quorum.  We say a peer has reached a barrier if the peer is waiting
for the barrier to be crossed.  Similarly, a barrier is said to be reached if
the required quorum of peers has reached it.
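
The quorum rule can be illustrated with a small in-memory model.  This is only
a sketch: the struct and the functions quorum_barrier_setup() and
quorum_barrier_peer_reached() are hypothetical names chosen for illustration
and are not part of the barriers API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a barrier; not the testbed's data structure. */
struct quorum_barrier
{
  unsigned int quorum;   /* number of peers required to cross the barrier */
  unsigned int reached;  /* number of peers currently waiting at it */
  bool crossed;          /* set once the quorum has been attained */
};

/* Reset the model so that `quorum` peers are required. */
void
quorum_barrier_setup (struct quorum_barrier *b, unsigned int quorum)
{
  b->quorum = quorum;
  b->reached = 0;
  b->crossed = false;
}

/* Record that one more peer has reached the barrier.
   Returns true once the quorum has been attained. */
bool
quorum_barrier_peer_reached (struct quorum_barrier *b)
{
  if (b->crossed)
    return true;
  b->reached++;
  if (b->reached >= b->quorum)
    b->crossed = true;
  return b->crossed;
}
```

With a quorum of 3, the first two calls to quorum_barrier_peer_reached()
return false and the third returns true.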

The barriers API provides the following functions:

1) barrier_init(): function to initialise a barrier in the experiment
2) barrier_cancel(): function to cancel a barrier which has been initialised
   before
3) barrier_wait(): function to signal the barrier service that the caller has
   reached a barrier and is waiting for it to be crossed
4) barrier_wait_cancel(): function to stop waiting for a barrier to be crossed

Among the above functions, the first two, namely barrier_init() and
barrier_cancel(), are used by experiment drivers.  All barriers should be
initialised by the experiment driver by calling barrier_init().  This function
takes a name to identify the barrier, the quorum required for the barrier to be
crossed, and a notification callback for notifying the experiment driver when
the barrier is crossed.  The function barrier_cancel() cancels an initialised
barrier and frees the resources allocated for it.  This function can be called
on an initialised barrier before it is crossed.
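
The driver-side life cycle described above can be sketched as follows.  The
names sketch_barrier_init(), sketch_barrier_cancel(), sketch_barrier_crossed()
and count_crossing() are hypothetical stand-ins that work on a local table;
they are not the signatures of the real API.  Rejecting duplicate names and
the fixed table size are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_BARRIERS 8

/* Hypothetical callback type: invoked when the named barrier is crossed. */
typedef void (*barrier_crossed_cb) (void *cls, const char *name);

/* Hypothetical driver-side record of one barrier. */
struct barrier_entry
{
  const char *name;
  unsigned int quorum;
  barrier_crossed_cb cb;
  void *cb_cls;
  bool in_use;
  bool crossed;
};

static struct barrier_entry registry[MAX_BARRIERS];

/* Register a barrier; returns NULL if the name is taken or the table is full. */
struct barrier_entry *
sketch_barrier_init (const char *name, unsigned int quorum,
                     barrier_crossed_cb cb, void *cb_cls)
{
  struct barrier_entry *slot = NULL;

  for (size_t i = 0; i < MAX_BARRIERS; i++)
  {
    if (registry[i].in_use && (0 == strcmp (registry[i].name, name)))
      return NULL;              /* assumption: names must be unique */
    if ((! registry[i].in_use) && (NULL == slot))
      slot = &registry[i];
  }
  if (NULL == slot)
    return NULL;                /* table full */
  slot->name = name;
  slot->quorum = quorum;
  slot->cb = cb;
  slot->cb_cls = cb_cls;
  slot->in_use = true;
  slot->crossed = false;
  return slot;
}

/* Cancel a barrier; only allowed before it has been crossed. */
bool
sketch_barrier_cancel (struct barrier_entry *b)
{
  if ((! b->in_use) || b->crossed)
    return false;
  memset (b, 0, sizeof (*b));   /* free the slot for reuse */
  return true;
}

/* Mark a barrier as crossed and notify the driver through its callback. */
void
sketch_barrier_crossed (struct barrier_entry *b)
{
  b->crossed = true;
  if (NULL != b->cb)
    b->cb (b->cb_cls, b->name);
}

/* Example callback: counts crossings in the unsigned int passed as cls. */
void
count_crossing (void *cls, const char *name)
{
  (void) name;
  (*(unsigned int *) cls)++;
}
```

In this sketch, cancelling a barrier after sketch_barrier_crossed() has run
returns false, mirroring the rule that barrier_cancel() applies to barriers
that have not yet been crossed.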

The remaining two functions, barrier_wait() and barrier_wait_cancel(), are used
in the peers' processes.  barrier_wait() connects to the local barrier service
running on the same host as the peer and registers that the caller has reached
the barrier and is waiting for it to be crossed.  Note that this function can
only be used by peers which are started by the testbed, as it tries to access
the local barrier service, which is part of the testbed controller service.
Calling barrier_wait() on an uninitialised barrier results in failure.
barrier_wait_cancel() cancels the notification registered by barrier_wait().

* Implementation
Since barriers involve coordination between the experiment driver and peers,
the barrier service in the testbed controller is split into two components.
The first component responds to the messages generated by the barrier API used
by the experiment driver (functions barrier_init() and barrier_cancel()), and
the second component responds to the messages generated by the barrier API used
by peers (functions barrier_wait() and barrier_wait_cancel()).

Calling barrier_init() sends a BARRIER_INIT message to the master controller.
The master controller then registers a barrier and calls barrier_init() for each
of its subcontrollers.  In this way barrier initialisation is propagated through
the controller hierarchy.  While propagating initialisation, any errors at a
subcontroller, such as a timeout during further propagation, are reported up the
hierarchy back to the experiment driver.
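
The propagation of initialisation can be pictured as a recursive walk over the
controller tree.  The sketch below is a simplified, synchronous model: struct
controller, its `reachable' flag (which simulates a failure such as a timeout)
and propagate_init() are hypothetical, and the real service exchanges messages
asynchronously.  The sketch also does not roll back barriers that were already
registered when an error occurs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical controller node in the hierarchy. */
struct controller
{
  struct controller *children[MAX_CHILDREN];
  size_t num_children;
  bool reachable;            /* false simulates e.g. a timeout */
  bool barrier_registered;
};

/* Register the barrier at `c` and forward the request to its subcontrollers.
   Returns false if registration failed anywhere in the subtree, so that the
   error travels up to the caller (ultimately the experiment driver). */
bool
propagate_init (struct controller *c)
{
  if (! c->reachable)
    return false;
  c->barrier_registered = true;
  for (size_t i = 0; i < c->num_children; i++)
    if (! propagate_init (c->children[i]))
      return false;          /* report the failure up the hierarchy */
  return true;
}
```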

Similar to barrier_init(), barrier_cancel() propagates a BARRIER_CANCEL message
which causes controllers to remove an initialised barrier.

The second component is implemented as a separate service in the binary
`gnunet-service-testbed', which already contains the testbed controller
service.  Although this deviates from the GNUnet process architecture of having
one service per binary, it is needed in this case because this component needs
access to the barrier data created by the first component.  This component
responds to BARRIER_WAIT messages from local peers when they call
barrier_wait().  Upon receiving a BARRIER_WAIT message, the service checks
whether the requested barrier has been initialised.  If it has not, an error
status is sent to the local peer through a BARRIER_STATUS message and the
connection from the peer is terminated.  If the barrier has been initialised,
the barrier's counter for reached peers is incremented and a notification is
registered to notify the peer when the barrier is crossed.  The connection from
the peer is left open.
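
The decision taken on a BARRIER_WAIT message can be sketched as a lookup
followed by a counter update.  The types and the function
handle_barrier_wait() below are hypothetical; the sketch models only the
decision, not the message exchange or the connection handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome of handling one BARRIER_WAIT request. */
enum wait_result
{
  WAIT_ERROR,       /* unknown barrier: send error status, close connection */
  WAIT_REGISTERED   /* counter incremented, connection stays open */
};

/* Hypothetical per-barrier state kept by the service. */
struct barrier
{
  const char *name;
  unsigned int num_reached;   /* peers that have reached the barrier */
};

/* Look up `name` among the `n` initialised barriers and count the peer. */
enum wait_result
handle_barrier_wait (struct barrier *barriers, size_t n, const char *name)
{
  for (size_t i = 0; i < n; i++)
  {
    if (0 == strcmp (barriers[i].name, name))
    {
      barriers[i].num_reached++;
      return WAIT_REGISTERED;
    }
  }
  return WAIT_ERROR;          /* barrier was never initialised */
}
```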

When enough peers to attain the quorum have sent BARRIER_WAIT messages, the
controller sends a BARRIER_STATUS message to its parent, informing it that the
barrier is crossed.  If the controller has started further subcontrollers, it
delays this message until it has received a notification from each of those
subcontrollers that the barrier is crossed.  Finally, the barriers API at the
experiment driver receives the BARRIER_STATUS message once the barrier has been
reached at all the controllers.
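
The condition under which a controller reports upwards can be written as a
single predicate.  This is a sketch under the assumption that each controller
tracks a quorum for its own local peers; struct controller_state and
ready_to_report() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-barrier bookkeeping at one controller. */
struct controller_state
{
  unsigned int quorum;              /* local peers required (assumption) */
  unsigned int num_reached;         /* local peers waiting at the barrier */
  unsigned int num_subcontrollers;  /* subcontrollers started by this one */
  unsigned int num_sub_reported;    /* subcontrollers that reported crossing */
};

/* True when the controller may send BARRIER_STATUS to its parent: the local
   quorum is attained and every subcontroller has reported. */
bool
ready_to_report (const struct controller_state *s)
{
  return (s->num_reached >= s->quorum)
         && (s->num_sub_reported == s->num_subcontrollers);
}
```

A controller without subcontrollers has num_subcontrollers set to 0, so the
second condition holds trivially and only the local quorum matters.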

The barriers API at the experiment driver responds to the BARRIER_STATUS message
by echoing it back to the master controller and notifying the experiment
driver through the notification callback that a barrier has been crossed.
The echoed BARRIER_STATUS message is propagated by the master controller through
the controller hierarchy.  This propagation triggers the notifications
registered by peers at each of the controllers in the hierarchy.  Note the
difference between the downward propagation of the BARRIER_STATUS message and
its upward propagation: the upward propagation ensures that the barrier has been
reached at all the controllers, and the downward propagation signals to the
waiting peers that the barrier is crossed.
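
The downward phase can be modelled as a second walk over the controller tree
that releases the peers waiting at each controller.  As with the other
sketches, struct controller and propagate_crossed() are hypothetical, and the
walk is synchronous here for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical controller node holding its waiting local peers. */
struct controller
{
  struct controller *children[MAX_CHILDREN];
  size_t num_children;
  unsigned int waiting_peers;    /* peers with a registered notification */
  unsigned int notified_peers;   /* peers already told the barrier is crossed */
};

/* Forward the echoed BARRIER_STATUS down the hierarchy, notifying all peers
   waiting at each controller.  Returns the number of peers notified. */
unsigned int
propagate_crossed (struct controller *c)
{
  unsigned int total = c->waiting_peers;

  c->notified_peers += c->waiting_peers;
  c->waiting_peers = 0;
  for (size_t i = 0; i < c->num_children; i++)
    total += propagate_crossed (c->children[i]);
  return total;
}
```

For a master controller with one waiting peer and two subcontrollers with two
and three waiting peers, propagate_crossed() on the master notifies six peers
in total.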