From: Rich Felker
Date: Mon, 20 Nov 2017 21:25:54 +0000 (-0500)
Subject: make fgetwc handling of encoding errors consistent with/without buffer
X-Git-Tag: v1.1.19~43
X-Git-Url: https://git.librecmc.org/?a=commitdiff_plain;h=4000b0107ddd7fe733fa31d4f078c6fcd35851d6;p=oweals%2Fmusl.git

make fgetwc handling of encoding errors consistent with/without buffer

previously, fgetwc left all but the first byte of an illegal sequence
unread (available for subsequent calls) when reading out of the FILE
buffer, but dropped all bytes contributing to the error when falling
back to reading a byte at a time. neither behavior was ideal. in the
buffered case, each malformed character produced one error per byte,
rather than one per character. in the unbuffered case, consuming the
last byte that caused the transition from "incomplete" to "invalid"
state potentially dropped (and produced additional spurious encoding
errors for) the next valid character.

to handle both cases uniformly without duplicate code, revise the
buffered case to only cover situations where a complete and valid
character is present in the buffer, and fall back to byte-at-a-time
for all other cases. this allows using mbtowc (stateless) instead of
mbrtowc, which may slightly improve performance too.

when an encoding error has been hit in the byte-at-a-time case, leave
the final byte that produced the error unread (via ungetc) except in
the case of single-byte errors (for UTF-8, bytes c0, c1, f5-ff, and
continuation bytes with no lead byte). single-byte errors are fully
consumed so as not to leave the caller in an infinite loop repeating
the same error.

none of these changes are distinguished from a conformance standpoint,
since the file position is unspecified after encoding errors. they are
intended merely as QoI/consistency improvements.
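the set of single-byte errors named above can be written out as a small
classifier. this is only an illustrative sketch: the patch itself has no
such function (it detects the single-byte case with a `first` flag), and
the name `utf8_single_byte_error` is hypothetical.

```c
#include <stdbool.h>

/* Sketch (hypothetical helper, not part of the patch): returns true if
 * byte b can never begin a valid UTF-8 character, so that it is an
 * encoding error entirely on its own. */
static bool utf8_single_byte_error(unsigned char b)
{
	if (b < 0x80) return false; /* 00-7f: ASCII, always valid */
	if (b < 0xC0) return true;  /* 80-bf: continuation byte with no lead byte */
	if (b < 0xC2) return true;  /* c0, c1: could only start overlong forms */
	if (b < 0xF5) return false; /* c2-f4: valid lead bytes */
	return true;                /* f5-ff: would encode past U+10FFFF */
}
```

a byte for which this returns true has to be consumed when it is reported
as an error; if it were left unread, the next call would see the same byte
and report the same error again.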
---

diff --git a/src/stdio/fgetwc.c b/src/stdio/fgetwc.c
index a00c1a86..07fb6d7c 100644
--- a/src/stdio/fgetwc.c
+++ b/src/stdio/fgetwc.c
@@ -5,36 +5,36 @@
 
 static wint_t __fgetwc_unlocked_internal(FILE *f)
 {
-	mbstate_t st = { 0 };
 	wchar_t wc;
 	int c;
-	unsigned char b;
 	size_t l;
 
 	/* Convert character from buffer if possible */
 	if (f->rpos < f->rend) {
-		l = mbrtowc(&wc, (void *)f->rpos, f->rend - f->rpos, &st);
-		if (l+2 >= 2) {
+		l = mbtowc(&wc, (void *)f->rpos, f->rend - f->rpos);
+		if (l+1 >= 1) {
 			f->rpos += l + !l; /* l==0 means 1 byte, null */
 			return wc;
 		}
-		if (l == -1) {
-			f->rpos++;
-			return WEOF;
-		}
-		f->rpos = f->rend;
-	} else l = -2;
+	}
 
 	/* Convert character byte-by-byte */
-	while (l == -2) {
+	mbstate_t st = { 0 };
+	unsigned char b;
+	int first = 1;
+	do {
 		b = c = getc_unlocked(f);
 		if (c < 0) {
-			if (!mbsinit(&st)) errno = EILSEQ;
+			if (!first) errno = EILSEQ;
 			return WEOF;
 		}
 		l = mbrtowc(&wc, (void *)&b, 1, &st);
-		if (l == -1) return WEOF;
-	}
+		if (l == -1) {
+			if (!first) ungetc(b, f);
+			return WEOF;
+		}
+		first = 0;
+	} while (l == -2);
 
 	return wc;
 }
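the effect of the revised byte-at-a-time loop can be tried without a FILE
or a UTF-8 locale using the simulation below. it is a sketch under stated
assumptions, not musl code: `struct src`, `struct dec`, `step` and
`next_char` are hypothetical names, and the small decoder stands in for
mbrtowc. unlike musl's decoder, it reports overlong and surrogate forms
only once the last byte of the sequence has arrived.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for a FILE: bytes plus a read position. */
struct src { const unsigned char *p; size_t pos, len; };

/* Decoder state: bits gathered so far, continuation bytes still needed,
 * and the smallest code point allowed for this sequence length. */
struct dec { uint32_t cp; int need; uint32_t min; };

enum { DEC_OK = 0, DEC_BAD = -1, DEC_MORE = -2 };

/* Feed one byte to a simplified UTF-8 decoder (assumption: stand-in for
 * mbrtowc called with n == 1). */
static int step(struct dec *d, unsigned char b, uint32_t *out)
{
	if (d->need == 0) {
		if (b < 0x80) { *out = b; return DEC_OK; }
		if (b >= 0xC2 && b <= 0xDF) { d->cp = b & 0x1F; d->need = 1; d->min = 0x80; return DEC_MORE; }
		if (b >= 0xE0 && b <= 0xEF) { d->cp = b & 0x0F; d->need = 2; d->min = 0x800; return DEC_MORE; }
		if (b >= 0xF0 && b <= 0xF4) { d->cp = b & 0x07; d->need = 3; d->min = 0x10000; return DEC_MORE; }
		return DEC_BAD; /* single-byte error */
	}
	if ((b & 0xC0) != 0x80) { d->need = 0; return DEC_BAD; }
	d->cp = d->cp << 6 | (b & 0x3F);
	if (--d->need) return DEC_MORE;
	if (d->cp < d->min || d->cp > 0x10FFFF || (d->cp >= 0xD800 && d->cp <= 0xDFFF))
		return DEC_BAD;
	*out = d->cp;
	return DEC_OK;
}

/* Mirrors the structure of the patched loop. Returns 1 and stores a code
 * point on success, 0 at clean end of input, -1 on an encoding error. */
static int next_char(struct src *s, uint32_t *out)
{
	struct dec d = { 0, 0, 0 };
	int first = 1, r;
	do {
		if (s->pos == s->len)
			return first ? 0 : -1; /* input ended mid-character */
		unsigned char b = s->p[s->pos++];
		r = step(&d, b, out);
		if (r == DEC_BAD) {
			if (!first) s->pos--; /* like ungetc: leave the byte unread */
			return -1;
		}
		first = 0;
	} while (r == DEC_MORE);
	return 1;
}
```

with the input bytes e2 82 41 (a truncated three-byte sequence followed by
'A'), the first call reports one error and the second call still returns
'A', since the byte that revealed the error was left unread. with c0 41,
the c0 byte is consumed by the failing call, so the caller cannot loop on
the same error.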