improve aarch64 atomics
aarch64 provides ll/sc variants with acquire/release memory order,
freeing us from the need to have full barriers both before and after
the ll/sc operation. previously they were not used because the a_cas
can fail without performing a_sc, in which case half of the barrier
would be omitted. instead, define a custom version of a_cas for
aarch64 which uses a_barrier explicitly when aborting the cas
operation. aside from cas, other operations built on top of ll/sc are
not affected since they never abort but rather loop until they
succeed.
a split ll/sc version of the pointer-sized a_cas_p is also introduced
using the same technique.
patch by Szabolcs Nagy.