fix/improve shared library ctor/dtor handling, allow recursive dlopen
some libraries call dlopen from their constructors, resulting in
recursive calls to dlopen. previously, this resulted in deadlock. I'm
now unlocking the dlopen lock before running constructors (this is
especially important since the lock also blocked pthread_create and
was being held while application code runs!) and using a separate
recursive mutex protecting the ctor/dtor state instead.
in order to prevent the same ctor from being called more than once, a
module is considered "constructed" just before the ctor runs.
also, switch from using atexit to register each dtor to using a single
atexit call to register the dynamic linker's dtor processing as just
one handler. this is necessary because atexit performs allocation and
may fail, but the library has already been loaded and cannot be
backed-out at the time dtor registration is performed. this change
also ensures that all dtors run after all atexit functions, rather
than in mixed order.