further simplify and optimize new cond var
the main idea of the changes made is to have waiters wait directly on
the "barrier" lock that was used to prevent them from making forward
progress too early rather than first waiting on the atomic state value
and then attempting to lock the barrier.
in addition, adjustments to the mutex waiter count are optimized.
previously, each waking waiter decremented the count (unless it was
the first) then immediately incremented it again for the next waiter
(unless it was the last). this was a roundabout was of achieving the
equivalent of incrementing it once for the first waiter and
decrementing it once for the last.