Totally remove the supposedly 'faster' variant in
BN_mod_mul_montgomery, which calls bn_sqr_recursive
without much preparation.
bn_sqr_recursive requires the length of its argument to be
a power of 2, which is not always the case here.
There's no reason for not using BN_sqr -- if a simpler
approach to squaring made sense, then why not change
BN_sqr? (Using BN_sqr should also speed up DH where g is chosen
such that it becomes small [e.g., 2] when converted
to Montgomery representation.)
Case closed :-)