bignum: allow concurrent BN_MONT_CTX_set_locked()
The lazy-initialisation of BN_MONT_CTX was serialising all threads, as
noted by Daniel Sands and co at Sandia. This was to handle the case that
2 or more threads race to lazy-init the same context, but stunted all
scalability in the case where 2 or more threads are doing unrelated
things! We favour the latter case by punishing the former. The init work
gets done by each thread that finds the context to be uninitialised, and
we then lock the "set" logic after that work is done - the winning
thread's work gets used, the losing threads throw away what they've done.
Signed-off-by: Geoff Thorpe <geoff@openssl.org>