diff mbox

[1/4] unicode: New mod_utf8_codepoint()

Message ID 1363283360-26220-2-git-send-email-armbru@redhat.com
State New
Headers show

Commit Message

Markus Armbruster March 14, 2013, 5:49 p.m. UTC
Signed-off-by: Markus Armbruster <armbru@redhat.com>
---
 include/qemu-common.h |  3 ++
 util/Makefile.objs    |  1 +
 util/unicode.c        | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 100 insertions(+)
 create mode 100644 util/unicode.c

Comments

Laszlo Ersek March 21, 2013, 7:37 p.m. UTC | #1
I've (re-)read the UTF-8 article in wikipedia now... comments below

On 03/14/13 18:49, Markus Armbruster wrote:

> diff --git a/util/unicode.c b/util/unicode.c
> new file mode 100644
> index 0000000..954bcf7

> +/**
> + * mod_utf8_codepoint:
> + * @s: string encoded in modified UTF-8
> + * @n: maximum number of bytes to read from @s, if less than 6
> + * @end: set to end of sequence on return
> + *
> + * Convert the modified UTF-8 sequence at the start of @s.  Modified
> + * UTF-8 is exactly like UTF-8, except U+0000 is encoded as
> + * "\xC0\x80".
> + *
> + * If @s points to an impossible byte (0xFE or 0xFF) or a continuation
> + * byte, the sequence is invalid, and @end is set to @s + 1
> + *
> + * Else, the first byte determines how many continuation bytes are
> + * expected.  If there are fewer, the sequence is invalid, and @end is
> + * set to @s + 1 + actual number of continuation bytes.  Else, the
> + * sequence is well-formed, and @end is set to @s + 1 + expected
> + * number of continuation bytes.

The wording also covers (number of cont. bytes == 0), OK.

... One point that wasn't clear to me until I read the code too is that
"there are fewer" covers both "out of bytes" and "byte available, but
it's not a cont. byte". Subtle :)

The way "*end" is set ensures progress.

> + *
> + * A well-formed sequence is valid unless it encodes a code point
> + * outside the Unicode range U+0000..U+10FFFF, one of Unicode's 66
> + * noncharacters, a surrogate code point, or is overlong.  Except the
> + * overlong sequence "\xC0\x80" is valid.
> + *
> + * Conversion succeeds if and only if the sequence is valid.
> + *
> + * Returns: the Unicode code point on success, -1 on failure.
> + */
> +int mod_utf8_codepoint(const char *s, size_t n, char **end)

The type of "end" follows the strtol() prototype. I guess I'd prefer
"const char **end", and "unsigned char" all around actually. (The input
is binary garbage.) Anyway this is irrelevant. ("const char **end" would
be probably quite inconvenient for the caller,
<http://www.c-faq.com/ansi/constmismatch.html>.)

> +{
> +    static int min_cp[5] = { 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
> +    const unsigned char *p;
> +    unsigned byte, mask, len, i;
> +    int cp;
> +
> +    if (n == 0) {
> +        *end = (char *)s;
> +        return 0;
> +    }

This is a special case (we return the code point U+0000 after looking at
zero bytes); we can probably expect the caller to know about n==0.

> +
> +    p = (const unsigned char *)s;
> +    byte = *p++;

We have n>=1 bytes here, and pointing one past the last one (in case
n==1) is allowed.

> +    if (byte < 0x80) {
> +        cp = byte;              /* one byte sequence */

So, since this is modified UTF-8, the U+0000 code point is represented
with the overlong "\xC0\x80" sequence, and a bare '\0' can never be part
of the Modified UTF-8 byte stream. (According to wp, '\0' is used in C
and similar to zero-terminate the string, but I think that's outside the
scope of the wire format.)

If we find a '\0' here, we report it as code point U+0000 instead of
rejecting it. One could maybe argue that it's a violation of the
interface contract (namely, due to '\0' being at offset #0, the caller
should have passed in n==0), but I assume from the description that the
caller need not have any idea about the contents, knowing the size
should be enough.

IOW I'm not sure about the intended use of this function, but if the
caller is allowed to throw any binary data at it (with correctly
communicated size of course), then we can misreport U+0000 here. (*)

> +    } else if (byte >= 0xFE) {
> +        cp = -1;                /* impossible bytes 0xFE, 0xFF */

OK, binary 1111111x is as first byte misformed.

> +    } else if ((byte & 0x40) == 0) {
> +        cp = -1;                /* unexpected continuation byte */

We know here that byte >= 0x80, and continuation bytes look like
10xxxxxx, OK.

> +    } else {
> +        /* multi-byte sequence */

We have one of:

110xxxxx
1110xxxx
11110xxx
111110xx
1111110x

> +        len = 0;
> +        for (mask = 0x80; byte & mask; mask >>= 1) {
> +            len++;
> +        }
> +        assert(len > 1 && len < 7);

OK, [2..6].

(Maybe use g_assert()? :))

> +        cp = byte & (mask - 1);

OK, the only bit set in "mask" matches the leftmost clear bit in "byte".
(mask-1) selects the xxxx bits in "byte".

> +        for (i = 1; i < len; i++) {

Runs at least once, and as many times as we need continuation bytes. OK.

> +            byte = i < n ? *p : 0;

"p" is valid to evaluate, "*p" may not be, so we check first. OK.

> +            if ((byte & 0xC0) != 0x80) {
> +                cp = -1;        /* continuation byte missing */
> +                goto out;
> +            }

Right, if a byte is available, it must look 10xxxxxx (binary). If we're
out of bytes, we also take this branch. *end will be left one past the
actual cont. bytes.

> +            p++;
> +            cp <<= 6;
> +            cp |= byte & 0x3F;
> +        }

We consume the cont. byte: p++ is valid to evaluate, and we shift in the
low six bits from the cont byte into the codepoint.

The loop runs at most 5 times (for len==6), shifting in (len-1)*6 bits
at most, in addition to the initial cp assignment. The initial cp
assignment clusters the masked-in bits at the LSB end:

    mask == (0x80 >> len) - 1

hence the number of masked-in bits in the original cp assignment is
7-len (which difference falls into [1..5]). So the total number of bits
shifted into "cp" is

    7-len + (len-1)*6 == 1 + 5 * len

Since len <= 6, the above is <= 31, which is perfect for our "int"
(int32_t in practice).

> +        if (cp > 0x10FFFF) {
> +            cp = -1;            /* beyond Unicode range */

OK.

> +        } else if ((cp >= 0xFDD0 && cp <= 0xFDEF)
> +                   || (cp & 0xFFFE) == 0xFFFE) {
> +            cp = -1;            /* noncharacter */

http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters

Interesting. OK.

> +        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
> +            cp = -1;            /* surrogate code point */

OK.

> +        } else if (cp < min_cp[len - 2] && !(cp == 0 && len == 2)) {
> +            cp = -1;            /* overlong, not \xC0\x80 */
> +        }
> +    }
> +
> +out:
> +    *end = (char *)p;
> +    return cp;
> +}
> 

I think if we want to adhere to Modified UTF-8 strictly, then (*) should
be fixed. If not, I can give my Rb; this is nice code.

Laszlo
Markus Armbruster March 22, 2013, 9:23 a.m. UTC | #2
Laszlo Ersek <lersek@redhat.com> writes:

> I've (re-)read the UTF-8 article in wikipedia now... comments below
>
> On 03/14/13 18:49, Markus Armbruster wrote:
>
>> diff --git a/util/unicode.c b/util/unicode.c
>> new file mode 100644
>> index 0000000..954bcf7
>
>> +/**
>> + * mod_utf8_codepoint:
>> + * @s: string encoded in modified UTF-8
>> + * @n: maximum number of bytes to read from @s, if less than 6
>> + * @end: set to end of sequence on return
>> + *
>> + * Convert the modified UTF-8 sequence at the start of @s.  Modified
>> + * UTF-8 is exactly like UTF-8, except U+0000 is encoded as
>> + * "\xC0\x80".
>> + *
>> + * If @s points to an impossible byte (0xFE or 0xFF) or a continuation
>> + * byte, the sequence is invalid, and @end is set to @s + 1
>> + *
>> + * Else, the first byte determines how many continuation bytes are
>> + * expected.  If there are fewer, the sequence is invalid, and @end is
>> + * set to @s + 1 + actual number of continuation bytes.  Else, the
>> + * sequence is well-formed, and @end is set to @s + 1 + expected
>> + * number of continuation bytes.
>
> The wording also covers (number of cont. bytes == 0), OK.
>
> ... One point that wasn't clear to me until I read the code too is that
> "there are fewer" covers both "out of bytes" and "byte available, but
> it's not a cont. byte". Subtle :)

Guilty as charged :)

> The way "*end" is set ensures progress.

Yes.

>> + *
>> + * A well-formed sequence is valid unless it encodes a code point
>> + * outside the Unicode range U+0000..U+10FFFF, one of Unicode's 66
>> + * noncharacters, a surrogate code point, or is overlong.  Except the
>> + * overlong sequence "\xC0\x80" is valid.
>> + *
>> + * Conversion succeeds if and only if the sequence is valid.
>> + *
>> + * Returns: the Unicode code point on success, -1 on failure.
>> + */
>> +int mod_utf8_codepoint(const char *s, size_t n, char **end)
>
> The type of "end" follows the strtol() prototype. I guess I'd prefer
> "const char **end", and "unsigned char" all around actually. (The input
> is binary garbage.) Anyway this is irrelevant. ("const char **end" would
> be probably quite inconvenient for the caller,
> <http://www.c-faq.com/ansi/constmismatch.html>.)

C's type system is far too inexpressive to do const properly.  Best we
could do within the constraints, I guess.  Results in tension between
"declare unchanging things const" (to help compilers as well as human
readers) and "avoid type casts" (because they defeat the type checker
and reduce legibility).

I prefer to err on the side of avoiding casts.  In particular, I try to
make my interfaces so that they can be used without casts as much as
possible.

Thus, I made the first argument const, but not the last.  I guess
similar thinking prevailed when the library part of C was constified;
see strtol() precedence you quoted.

>> +{
>> +    static int min_cp[5] = { 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
>> +    const unsigned char *p;
>> +    unsigned byte, mask, len, i;
>> +    int cp;
>> +
>> +    if (n == 0) {
>> +        *end = (char *)s;
>> +        return 0;
>> +    }
>
> This is a special case (we return the code point U+0000 after looking at
> zero bytes); we can probably expect the caller to know about n==0.

We could make it an error instead.  What's your gut feeling?

>> +
>> +    p = (const unsigned char *)s;
>> +    byte = *p++;
>
> We have n>=1 bytes here, and pointing one past the last one (in case
> n==1) is allowed.
>
>> +    if (byte < 0x80) {
>> +        cp = byte;              /* one byte sequence */
>
> So, since this is modified UTF-8, the U+0000 code point is represented
> with the overlong "\xC0\x80" sequence, and a bare '\0' can never be part
> of the Modified UTF-8 byte stream. (According to wp, '\0' is used in C
> and similar to zero-terminate the string, but I think that's outside the
> scope of the wire format.)
>
> If we find a '\0' here, we report it as code point U+0000 instead of
> rejecting it. One could maybe argue that it's a violation of the
> interface contract (namely, due to '\0' being at offset #0, the caller
> should have passed in n==0), but I assume from the description that the
> caller need not have any idea about the contents, knowing the size
> should be enough.
>
> IOW I'm not sure about the intended use of this function, but if the
> caller is allowed to throw any binary data at it (with correctly
> communicated size of course), then we can misreport U+0000 here. (*)

Good point.

'\0' should be treated as string terminator, exactly like reaching @n.

Let's check what my current code does:

    n == 0:
        set @end to @s, and return 0

    n > 0 && *s == 0:
        set @end to @s + 1, and return 0

Conclusion: you caught a bug.  Thanks, will fix!

>> +    } else if (byte >= 0xFE) {
>> +        cp = -1;                /* impossible bytes 0xFE, 0xFF */
>
> OK, binary 1111111x is as first byte misformed.

Correct.

>> +    } else if ((byte & 0x40) == 0) {
>> +        cp = -1;                /* unexpected continuation byte */
>
> We know here that byte >= 0x80, and continuation bytes look like
> 10xxxxxx, OK.

Correct.

>> +    } else {
>> +        /* multi-byte sequence */
>
> We have one of:
>
> 110xxxxx
> 1110xxxx
> 11110xxx
> 111110xx
> 1111110x

Correct.

>> +        len = 0;
>> +        for (mask = 0x80; byte & mask; mask >>= 1) {
>> +            len++;
>> +        }
>> +        assert(len > 1 && len < 7);
>
> OK, [2..6].
>
> (Maybe use g_assert()? :))

Now you're teasing me!

>> +        cp = byte & (mask - 1);
>
> OK, the only bit set in "mask" matches the leftmost clear bit in "byte".
> (mask-1) selects the xxxx bits in "byte".

Correct.

>> +        for (i = 1; i < len; i++) {
>
> Runs at least once, and as many times as we need continuation bytes. OK.

Correct.

>> +            byte = i < n ? *p : 0;
>
> "p" is valid to evaluate, "*p" may not be, so we check first. OK.
>
>> +            if ((byte & 0xC0) != 0x80) {
>> +                cp = -1;        /* continuation byte missing */
>> +                goto out;
>> +            }
>
> Right, if a byte is available, it must look 10xxxxxx (binary). If we're
> out of bytes, we also take this branch. *end will be left one past the
> actual cont. bytes.

Correct.

>> +            p++;
>> +            cp <<= 6;
>> +            cp |= byte & 0x3F;
>> +        }
>
> We consume the cont. byte: p++ is valid to evaluate, and we shift in the
> low six bits from the cont byte into the codepoint.
>
> The loop runs at most 5 times (for len==6), shifting in (len-1)*6 bits
> at most, in addition to the initial cp assignment. The initial cp
> assignment clusters the masked-in bits at the LSB end:
>
>     mask == (0x80 >> len) - 1
>
> hence the number of masked-in bits in the original cp assignment is
> 7-len (which difference falls into [1..5]). So the total number of bits
> shifted into "cp" is
>
>     7-len + (len-1)*6 == 1 + 5 * len
>
> Since len <= 6, the above is <= 31, which is perfect for our "int"
> (int32_t in practice).

Exactly!

>> +        if (cp > 0x10FFFF) {
>> +            cp = -1;            /* beyond Unicode range */
>
> OK.
>
>> +        } else if ((cp >= 0xFDD0 && cp <= 0xFDEF)
>> +                   || (cp & 0xFFFE) == 0xFFFE) {
>> +            cp = -1;            /* noncharacter */
>
> http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Noncharacters
>
> Interesting. OK.
>
>> +        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
>> +            cp = -1;            /* surrogate code point */
>
> OK.
>
>> +        } else if (cp < min_cp[len - 2] && !(cp == 0 && len == 2)) {
>> +            cp = -1;            /* overlong, not \xC0\x80 */
>> +        }
>> +    }
>> +
>> +out:
>> +    *end = (char *)p;
>> +    return cp;
>> +}
>> 
>
> I think if we want to adhere to Modified UTF-8 strictly, then (*) should
> be fixed. If not, I can give my Rb; this is nice code.

I think it needs fixing.

Thanks for the thorough review!
Laszlo Ersek March 22, 2013, 11:46 a.m. UTC | #3
On 03/22/13 10:23, Markus Armbruster wrote:
> Laszlo Ersek <lersek@redhat.com> writes:
>> On 03/14/13 18:49, Markus Armbruster wrote:

>>> +{
>>> +    static int min_cp[5] = { 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
>>> +    const unsigned char *p;
>>> +    unsigned byte, mask, len, i;
>>> +    int cp;
>>> +
>>> +    if (n == 0) {
>>> +        *end = (char *)s;
>>> +        return 0;
>>> +    }
>>
>> This is a special case (we return the code point U+0000 after looking at
>> zero bytes); we can probably expect the caller to know about n==0.
> 
> We could make it an error instead.  What's your gut feeling?

(If the question still stands -- maybe it doesn't any more, considering
future handling of '\0':) I guess this function would be called in a
loop, with increasing "s" and decreasing "n" values. "end" can only be
checked after the first call. If you write a loop that checks "end" in
the controlling expression, then accepting n==0 without error is useful.
If you write a loop that checks "n" in the controlling expression, then
refusing n==0 is OK. I'd probably write the latter kind of loop (I like
pre-testing more), but I can't say in general :)

Laszlo
diff mbox

Patch

diff --git a/include/qemu-common.h b/include/qemu-common.h
index 5e13708..28c10a0 100644
--- a/include/qemu-common.h
+++ b/include/qemu-common.h
@@ -442,4 +442,7 @@  int64_t pow2floor(int64_t value);
 int uleb128_encode_small(uint8_t *out, uint32_t n);
 int uleb128_decode_small(const uint8_t *in, uint32_t *n);
 
+/* unicode.c */
+int mod_utf8_codepoint(const char *s, size_t n, char **end);
+
 #endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index cad5ce8..65c8c12 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -9,3 +9,4 @@  util-obj-y += error.o qemu-error.o
 util-obj-$(CONFIG_POSIX) += compatfd.o
 util-obj-y += iov.o aes.o qemu-config.o qemu-sockets.o uri.o notify.o
 util-obj-y += qemu-option.o qemu-progress.o
+util-obj-y += unicode.o
diff --git a/util/unicode.c b/util/unicode.c
new file mode 100644
index 0000000..954bcf7
--- /dev/null
+++ b/util/unicode.c
@@ -0,0 +1,96 @@ 
+/*
+ * Dealing with Unicode
+ *
+ * Copyright (C) 2013 Red Hat, Inc.
+ *
+ * Authors:
+ *  Markus Armbruster <armbru@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or
+ * later.  See the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+
+/**
+ * mod_utf8_codepoint:
+ * @s: string encoded in modified UTF-8
+ * @n: maximum number of bytes to read from @s, if less than 6
+ * @end: set to end of sequence on return
+ *
+ * Convert the modified UTF-8 sequence at the start of @s.  Modified
+ * UTF-8 is exactly like UTF-8, except U+0000 is encoded as
+ * "\xC0\x80".
+ *
+ * If @s points to an impossible byte (0xFE or 0xFF) or a continuation
+ * byte, the sequence is invalid, and @end is set to @s + 1
+ *
+ * Else, the first byte determines how many continuation bytes are
+ * expected.  If there are fewer, the sequence is invalid, and @end is
+ * set to @s + 1 + actual number of continuation bytes.  Else, the
+ * sequence is well-formed, and @end is set to @s + 1 + expected
+ * number of continuation bytes.
+ *
+ * A well-formed sequence is valid unless it encodes a code point
+ * outside the Unicode range U+0000..U+10FFFF, one of Unicode's 66
+ * noncharacters, a surrogate code point, or is overlong.  Except the
+ * overlong sequence "\xC0\x80" is valid.
+ *
+ * Conversion succeeds if and only if the sequence is valid.
+ *
+ * Returns: the Unicode code point on success, -1 on failure.
+ */
+int mod_utf8_codepoint(const char *s, size_t n, char **end)
+{
+    static int min_cp[5] = { 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
+    const unsigned char *p;
+    unsigned byte, mask, len, i;
+    int cp;
+
+    if (n == 0) {
+        *end = (char *)s;
+        return 0;
+    }
+
+    p = (const unsigned char *)s;
+    byte = *p++;
+    if (byte < 0x80) {
+        cp = byte;              /* one byte sequence */
+    } else if (byte >= 0xFE) {
+        cp = -1;                /* impossible bytes 0xFE, 0xFF */
+    } else if ((byte & 0x40) == 0) {
+        cp = -1;                /* unexpected continuation byte */
+    } else {
+        /* multi-byte sequence */
+        len = 0;
+        for (mask = 0x80; byte & mask; mask >>= 1) {
+            len++;
+        }
+        assert(len > 1 && len < 7);
+        cp = byte & (mask - 1);
+        for (i = 1; i < len; i++) {
+            byte = i < n ? *p : 0;
+            if ((byte & 0xC0) != 0x80) {
+                cp = -1;        /* continuation byte missing */
+                goto out;
+            }
+            p++;
+            cp <<= 6;
+            cp |= byte & 0x3F;
+        }
+        if (cp > 0x10FFFF) {
+            cp = -1;            /* beyond Unicode range */
+        } else if ((cp >= 0xFDD0 && cp <= 0xFDEF)
+                   || (cp & 0xFFFE) == 0xFFFE) {
+            cp = -1;            /* noncharacter */
+        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
+            cp = -1;            /* surrogate code point */
+        } else if (cp < min_cp[len - 2] && !(cp == 0 && len == 2)) {
+            cp = -1;            /* overlong, not \xC0\x80 */
+        }
+    }
+
+out:
+    *end = (char *)p;
+    return cp;
+}