overhaul mbsrtowcs
these changes fix at least two bugs:
- misaligned access to the input as uint32_t for vectorized ASCII test
- incorrect src pointer after stopping on EILSEQ
in addition, the text of the standard makes it unclear whether the
mbstate_t object is to be modified when the destination pointer is
null; previously it was cleared either way; now, it's only cleared
when the destination is non-null. this change may need revisiting, but
it should not affect most applications, since calling mbsrtowcs with
non-zero state can only happen when the head of the string was already
processed with mbrtowc.
finally, these changes shave about 20% size off the function and seem
to improve performance by 1-5%.