RATIONALE

This document is a summary of why we're moving to GNUnet NG and what
this major redesign tries to address.

First of all, the redesign does not (intentionally) change anything
fundamental about the application-level protocols or how files are
encoded and shared.  However, it is not protocol-compatible due to
other changes that do not relate to the essence of the application
protocols.


The redesign tries to address the following major problem groups
describing issues that apply more or less to all GNUnet versions
prior to 0.9.x:


PROBLEM GROUP 1 (scalability):
* The code was modular, but bugs were not.  Memory corruption
  in one plugin could cause crashes in others and it was not
  always easy to identify the culprit.  This approach
  fundamentally does not scale (in the sense of GNUnet being
  a framework and a GNUnet server running hundreds of
  different application protocols -- and the result still
  being debuggable, secure and stable).
* The code was heavily multi-threaded resulting in complex
  locking operations.  GNUnet 0.8.x had over 70 different
  mutexes and almost 1000 lines of lock/unlock operations.
  It is challenging for even good programmers to program or
  maintain good multi-threaded code with this complexity.
  The excessive locking essentially prevents GNUnet from
  actually doing much in parallel on multicores.
* Despite efforts like Freeway, it was virtually
  impossible to contribute code to GNUnet that was not
  written in C/C++.
* Changes to the configuration almost always required restarts
  of gnunetd; the existence of change-notifications does not
  really change that (how many users are even aware of SIGHUP,
  and how few options worked with that -- and at what expense
  in code complexity!).
* Valgrinding could only be done for the entire gnunetd
  process.  Given that gnunetd does quite a bit of
  CPU-intensive crypto, this could not be done for a system
  under heavy (or even moderate) load.
* Stack overflows with threads, while rare under Linux these
  days, result in really nasty and hard-to-find crashes.
* Structs of function pointers in service APIs were
  needlessly adding complexity, especially since in
  most cases there was no polymorphism.

SOLUTION:
* Use multiple, loosely-coupled processes and one big select
  loop in each (supported by a powerful library to eliminate
  code duplication for each process).
* Eliminate all threads, manage the processes with a
  master-process (gnunet-arm, for automatic restart manager)
  which also ensures that configuration changes trigger the
  necessary restarts.
* Use continuations (with timeouts) as a way to unify
  cron-jobs and other event-based code (such as waiting
  on network IO).
  => Using multiple processes ensures that memory corruption
     stays localized.
  => Using multiple processes will make it easy to contribute
     services written in other language(s).
  => Individual services can now be subjected to valgrind
  => Process priorities can be used to schedule the CPU better
  Note that we cannot just use one process with a big
  select loop because we have blocking operations (and the
  blocking is outside of our control, thanks MySQL,
  sqlite, gethostbyaddr, etc.).  So in order to perform
  reasonably well, we need some construct for parallel
  execution.

  RULE: If your service contains blocking functions, it
        MUST be a process by itself.
* Eliminate structs with function pointers for service APIs;
  instead, provide a library (still ending in _service.h) API
  that transmits the requests nicely to the respective
  process (easier to use, no need to "request" service
  in the first place; API can cause process to be started/stopped
  via ARM if necessary).
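
The select-loop and continuation points above can be sketched
roughly as follows.  This is only an illustration, not GNUnet's
scheduler: all names (sched_add, sched_run, log_task) are
hypothetical, POSIX select() is assumed, and the fixed-size task
table is there for brevity.  Each task is a continuation that runs
once its descriptor becomes readable or its timeout expires, so
cron-style jobs and network IO share a single loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/select.h>
#include <sys/time.h>

/* All names here are hypothetical; this is not the GNUnet scheduler API. */
typedef void (*task_fn) (void *cls);

struct task
{
  task_fn fn;        /* continuation to call */
  void *cls;         /* closure handed to the continuation */
  int fd;            /* descriptor to wait on, or -1 for a pure timer */
  uint64_t deadline; /* absolute time (ms) at which we run regardless */
  int used;          /* is this slot occupied? */
};

#define MAX_TASKS 16
static struct task tasks[MAX_TASKS];

/* Example continuation: appends the int behind cls to a log. */
int run_log[MAX_TASKS];
int run_count;

void
log_task (void *cls)
{
  if (run_count < MAX_TASKS)
    run_log[run_count++] = *(int *) cls;
}

static uint64_t
now_ms (void)
{
  struct timeval tv;

  gettimeofday (&tv, NULL);
  return (uint64_t) tv.tv_sec * 1000 + (uint64_t) tv.tv_usec / 1000;
}

/* Run fn(cls) once fd is readable or delay_ms has passed.
   Returns 0 on success, -1 if the table is full. */
int
sched_add (int fd, uint64_t delay_ms, task_fn fn, void *cls)
{
  for (int i = 0; i < MAX_TASKS; i++)
    if (! tasks[i].used)
    {
      tasks[i] = (struct task) { fn, cls, fd, now_ms () + delay_ms, 1 };
      return 0;
    }
  return -1;
}

/* The one big select loop; returns when no tasks are left. */
void
sched_run (void)
{
  for (;;)
  {
    fd_set rs;
    struct timeval tv;
    uint64_t now = now_ms ();
    uint64_t next = UINT64_MAX;
    int maxfd = -1;
    int pending = 0;

    FD_ZERO (&rs);
    for (int i = 0; i < MAX_TASKS; i++)
    {
      if (! tasks[i].used)
        continue;
      pending = 1;
      if (tasks[i].deadline < next)
        next = tasks[i].deadline;
      if (tasks[i].fd >= 0)
      {
        FD_SET (tasks[i].fd, &rs);
        if (tasks[i].fd > maxfd)
          maxfd = tasks[i].fd;
      }
    }
    if (! pending)
      return;
    /* Sleep until the earliest deadline or until a descriptor is ready. */
    tv.tv_sec = (next > now) ? (next - now) / 1000 : 0;
    tv.tv_usec = (next > now) ? ((next - now) % 1000) * 1000 : 0;
    if (select (maxfd + 1, &rs, NULL, NULL, &tv) < 0)
      FD_ZERO (&rs); /* e.g. EINTR: treat as "nothing readable" */
    now = now_ms ();
    for (int i = 0; i < MAX_TASKS; i++)
    {
      struct task t = tasks[i];

      if (! t.used)
        continue;
      if ((t.fd >= 0 && FD_ISSET (t.fd, &rs)) || t.deadline <= now)
      {
        tasks[i].used = 0; /* free the slot so t.fn may reschedule */
        t.fn (t.cls);
      }
    }
  }
}
```

A service would register its initial tasks with sched_add() and
then call sched_run() once from main().  A continuation that wants
to keep running (say, a periodic job) simply calls sched_add()
again before it returns.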


PROBLEM GROUP 2 (UTIL-APIs causing bugs):
* The existing logging functions were awkward to use and
  their expressive power was never really used for much.
* While we had some rules for naming functions, there
  were still plenty of inconsistencies.
* Specification of default values in configuration could
  result in inconsistencies between defaults in
  config.scm and defaults used by the program; also,
  different defaults might have been specified for the
  same option in different parts of the program.
* The TIME API did not distinguish between absolute
  and relative time, requiring users to know which
  type of value some variable contained and to
  manually convert properly.  Combined with the
  possibility of integer overflows this is a major
  source of bugs.
* The TIME API for seconds has a theoretical problem
  with a 32-bit overflow on some platforms which is
  only partially fixed by the old code with some
  hackery.

SOLUTION:
* Logging was radically simplified.
* Functions are now more consistently named.
* Configuration has no more defaults; instead,
  we load a global default configuration file
  before the user-specific configuration (which
  can be used to override defaults); the global
  default configuration file will be generated
  from config.scm.
* Time now distinguishes between
  struct GNUNET_TIME_Absolute and
  struct GNUNET_TIME_Relative.  We use structs
  so that the compiler won't coerce for us
  (forcing the use of specific conversion
  functions which have checks for overflows, etc.).
  Naturally the need to use these functions makes
  the code a bit more verbose, but that's a good
  thing given the potential for bugs.
* There is no more TIME API function to do anything
  with 32-bit seconds.
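
To illustrate the absolute/relative split, here is a small sketch.
The type and function names below are hypothetical stand-ins, not
the real GNUNET_TIME declarations, and the millisecond unit and the
"forever" sentinel are assumptions.  The point is only that wrapping
each kind of time in its own struct makes the compiler reject
accidental mixing, and that all arithmetic goes through helpers
which check for overflow.

```c
#include <stdint.h>

/* Hypothetical types modelled on the split described above. */
struct time_relative { uint64_t ms; }; /* a duration */
struct time_absolute { uint64_t ms; }; /* a point in time */

#define TIME_FOREVER UINT64_MAX

/* start + duration, saturating at "forever" instead of wrapping. */
struct time_absolute
time_absolute_add (struct time_absolute start, struct time_relative d)
{
  struct time_absolute r;

  if (start.ms > TIME_FOREVER - d.ms)
    r.ms = TIME_FOREVER; /* the sum would overflow */
  else
    r.ms = start.ms + d.ms;
  return r;
}

/* Time left from now until deadline; zero if the deadline has passed. */
struct time_relative
time_remaining (struct time_absolute now, struct time_absolute deadline)
{
  struct time_relative r;

  if (TIME_FOREVER == deadline.ms)
    r.ms = TIME_FOREVER;
  else if (deadline.ms <= now.ms)
    r.ms = 0;
  else
    r.ms = deadline.ms - now.ms;
  return r;
}
```

With these types, passing a struct time_relative where a
struct time_absolute is expected is a compile error, whereas two
plain integers would have been silently accepted.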


PROBLEM GROUP 3 (statistics):
* Databases and others needed to store capacity values
  similar to what stats was already doing, but
  across process lifetimes ("state"-API was a partial
  solution for that, but using it was clunky)
* Only gnunetd could use statistics, but other
  processes in the GNUnet system might have had
  good uses for it as well

SOLUTION:
* New statistics library and service that offer
  an API to inspect and modify statistics
* Statistics are distinguished by service name
  in addition to the name of the value
* Statistics can be marked as persistent, in
  which case they are written to disk when
  the statistics service shuts down.
  => One solution for existing stats uses,
     application stats, database stats and
     versioning information!
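
The data model described above might look roughly like this.  It
is an in-memory sketch under assumed names (stat_set, stat_get,
stat_save); the actual statistics service talks to its clients
over IPC and its API is not shown here.  Values are keyed by the
pair (service, name), and only entries flagged as persistent are
written out at shutdown.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory model; not the GNUnet statistics API. */
struct stat_entry
{
  char service[32]; /* subsystem that owns the value, e.g. "fs" */
  char name[64];    /* name of the value within that subsystem */
  uint64_t value;
  int persistent;   /* nonzero: written to disk at shutdown */
};

#define MAX_STATS 64
static struct stat_entry stats[MAX_STATS];
static size_t stats_len;

/* Look up (service, name); returns NULL if there is no such entry. */
static struct stat_entry *
stat_find (const char *service, const char *name)
{
  for (size_t i = 0; i < stats_len; i++)
    if (0 == strcmp (stats[i].service, service) &&
        0 == strcmp (stats[i].name, name))
      return &stats[i];
  return NULL;
}

/* Set a value, creating the entry if needed.  Names longer than the
   buffers are truncated in this sketch.  Returns 0, or -1 if full. */
int
stat_set (const char *service, const char *name,
          uint64_t value, int persistent)
{
  struct stat_entry *e = stat_find (service, name);

  if (NULL == e)
  {
    if (MAX_STATS == stats_len)
      return -1;
    e = &stats[stats_len++];
    snprintf (e->service, sizeof (e->service), "%s", service);
    snprintf (e->name, sizeof (e->name), "%s", name);
  }
  e->value = value;
  e->persistent = persistent;
  return 0;
}

/* Read a value; unknown entries read as 0. */
uint64_t
stat_get (const char *service, const char *name)
{
  struct stat_entry *e = stat_find (service, name);

  return (NULL == e) ? 0 : e->value;
}

/* Write all persistent entries to out, one per line.
   Returns the number of entries written. */
int
stat_save (FILE *out)
{
  int n = 0;

  for (size_t i = 0; i < stats_len; i++)
    if (stats[i].persistent)
    {
      fprintf (out, "%s\t%s\t%" PRIu64 "\n",
               stats[i].service, stats[i].name, stats[i].value);
      n++;
    }
  return n;
}
```

Because the service name is part of the key, "fs" and "core" can
each keep a value with the same name without clashing.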


PROBLEM GROUP 4 (Testing):
* The existing structure of the code with modules
  stored in places far away from the test code
  resulted in tools like lcov not giving good results.
* The codebase had evolved into a complex, deeply
  nested hierarchy often with directories that
  then only contained a single file.  Some of these
  files had the same name making it hard to find
  the source corresponding to a crash based on
  the reported filename/line information.
* Non-trivial portions of the code lacked good testcases,
  and it was not always obvious which parts of the code
  were not well-tested.

SOLUTION:
* Code that should be tested together is now
  in the same directory.
* The hierarchy is now essentially flat, each
  major service having one directory under src/;
  naming conventions help to make sure that
  files have globally-unique names
* All code added to the new repository must
  come with testcases with reasonable coverage.


PROBLEM GROUP 5 (core/transports):
* The new DV service requires session key exchange
  between DV-neighbours, but the existing
  session key code cannot be used to achieve this.
* The core requires certain services
  (such as identity, pingpong, fragmentation,
   transport, traffic, session) which makes it
  meaningless to have these as modules
  (especially since there is really only one
  way to implement these)
* HELLOs are larger than necessary since we need
  one for each transport (and hence often have
  to pick a subset of our HELLOs to transmit)
* Fragmentation is done at the core level but only
  required for a few transports; future versions of
  these transports might want to be aware of fragments
  and do things like retransmission
* Autoconfiguration is hard since we have no good
  way to detect (and then use securely) our external IP address
* It is currently not possible for multiple transports
  between the same pair of peers to be used concurrently
  in the same direction(s)
* We're using lots of cron-based jobs to periodically
  try (and fail) to build and transmit

SOLUTION:
* Rewrite core to integrate most of these services
  into one "core" service.
* Redesign HELLO to contain the addresses for
  all enabled transports in one message (avoiding
  having to transmit the public key and signature
  many, many times)
* With discovery being part of the transport service,
  it is now also possible to "learn" our external
  IP address from other peers (we just add plausible
  addresses to the list; other peers will discard
  those addresses that don't work for them!)
* New DV will consist of a "transport" and a
  high-level service (to handle encrypted DV
  control- and data-messages).
* Move expiration from one field per HELLO to one
  per address
* Require signature in PONG, not in HELLO (and confirm
  one address at a time)
* Move fragmentation into helper library linked
  against by UDP (and others that might need it)
* Link-to-link advertising of our HELLO is transport
  responsibility; global advertising/bootstrap remains
  responsibility of higher layers
* Change APIs to be event-based (transports pull for
  transmission data instead of core pushing and failing)
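
The HELLO changes above (one message carrying the addresses of
all transports, with expiration tracked per address) can be
pictured with the following sketch.  The struct layout, the field
names, the address limit and the string representation of keys
and addresses are all assumptions made for illustration; this is
not the actual wire format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for illustration; not the GNUnet wire format. */
struct hello_address
{
  const char *transport; /* e.g. "tcp" or "udp" */
  const char *address;   /* transport-specific address string */
  uint64_t expires_ms;   /* every address expires on its own */
};

#define HELLO_MAX_ADDRESSES 8

struct hello
{
  const char *public_key; /* carried once, however many addresses follow */
  size_t num_addresses;
  struct hello_address addresses[HELLO_MAX_ADDRESSES];
};

/* Append an address; returns 0 on success, -1 if the HELLO is full. */
int
hello_add_address (struct hello *h, const char *transport,
                   const char *address, uint64_t expires_ms)
{
  if (HELLO_MAX_ADDRESSES == h->num_addresses)
    return -1;
  h->addresses[h->num_addresses].transport = transport;
  h->addresses[h->num_addresses].address = address;
  h->addresses[h->num_addresses].expires_ms = expires_ms;
  h->num_addresses++;
  return 0;
}

/* Drop every address that has expired by now_ms, keeping the order
   of the rest.  Returns the number of addresses remaining. */
size_t
hello_expire (struct hello *h, uint64_t now_ms)
{
  size_t kept = 0;

  for (size_t i = 0; i < h->num_addresses; i++)
    if (h->addresses[i].expires_ms > now_ms)
      h->addresses[kept++] = h->addresses[i];
  h->num_addresses = kept;
  return kept;
}
```

With per-address expiration, a stale address can be dropped
without invalidating the rest of the HELLO, which fits the idea
of adding plausible addresses and letting those that do not work
fall away.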


PROBLEM GROUP 6 (FS-APIs):
* As with gnunetd, the FS-APIs are heavily threaded,
  resulting in hard-to-understand code (slightly
  better than gnunetd, but not much).
* GTK in particular does not like this, resulting
  in complicated code to switch to the GTK event
  thread when needed (which may still be causing
  problems on Gnome, not sure).
* If GUIs die (or are not properly shut down), state
  of current transactions is lost (FSUI only
  saves to disk on shutdown)

SOLUTION (draft, not done yet, details missing...):
* Eliminate threads from FS-APIs
  => Open question: how to best write the APIs to
     allow integration with diverse event loops
     of GUI libraries?
* Store FS-state always also on disk
  => Open question: how to do this without
     compromising state/scalability?

PROBLEM GROUP 7 (User experience):
* Searches often do not return a sufficient / significant number of
  results
* Sharing a directory with thousands of similar files (image/jpeg)
  creates thousands of search results for the mime-type keyword
  (problem with DB performance, network transmission, caching,
   end-user display, etc.)

SOLUTION (draft, not done yet, details missing...):
* Canonicalize keywords (see suggestion on mailing list end of
  June 2009: keep consonants and sort those alphabetically);
  while I think we must have an option to disable this feature
  (for more private sharing), I do think it would make a reasonable
  default
* When sharing directories, extract keywords first and then
  push keywords that are common in all files up to the
  directory level; when processing an AND-ed query and a directory
  is found to match the result, do an inspection on the metadata
  of the files in the directory to possibly produce further results
  (requires downloading of the directory in the background)
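
One possible reading of the canonicalization idea ("keep
consonants and sort those alphabetically") is sketched below.
The function name keyword_canonicalize is hypothetical, and the
details are assumptions since the proposal is only summarized
here: input is treated as ASCII, letters are lower-cased, the
vowels a, e, i, o, u and all non-letters are dropped, and 'y'
counts as a consonant.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Order two characters by their byte value (for qsort). */
static int
cmp_char (const void *a, const void *b)
{
  return *(const unsigned char *) a - *(const unsigned char *) b;
}

/* Hypothetical helper: write the canonical form of kw into out
   (at most out_size bytes including the terminator) and return
   its length.  Assumes ASCII input. */
size_t
keyword_canonicalize (const char *kw, char *out, size_t out_size)
{
  size_t n = 0;

  if (0 == out_size)
    return 0;
  for (; '\0' != *kw && n + 1 < out_size; kw++)
  {
    int c = tolower ((unsigned char) *kw);

    /* keep lower-case consonants only */
    if (c >= 'a' && c <= 'z' && NULL == strchr ("aeiou", c))
      out[n++] = (char) c;
  }
  out[n] = '\0';
  qsort (out, n, 1, cmp_char); /* sort the kept letters */
  return n;
}
```

Under these rules spelling variants that differ only in vowels,
case or letter order map to the same canonical keyword, which
should increase the number of matches at the cost of precision
(hence the option to disable it).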