Message ID | 20181120203628.2367003-1-eblake@redhat.com |
---|---|
State | New |
Headers | show |
Series | misc: Avoid UTF-8 in error messages | expand |
On 11/20/18 3:36 PM, Eric Blake wrote: > While most developers are now using UTF-8 environments, it's > harder to guarantee that error messages will be output to > a multibyte locale. Rather than risking error messages that > get corrupted into mojibake when the user runs qemu in a > non-multibyte locale, let's stick to straight ASCII error > messages, rather than assuming that our use of UTF-8 in source > code string constants will work unchanged in other locales. > > Reported-by: Markus Armbruster <armbru@redhat.com> > Signed-off-by: Eric Blake <eblake@redhat.com> > --- > hw/misc/tmp105.c | 2 +- > hw/misc/tmp421.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/hw/misc/tmp105.c b/hw/misc/tmp105.c > index 0918f3a6ea2..f6d7163273a 100644 > --- a/hw/misc/tmp105.c > +++ b/hw/misc/tmp105.c > @@ -79,7 +79,7 @@ static void tmp105_set_temperature(Object *obj, Visitor *v, const char *name, > return; > } > if (temp >= 128000 || temp < -128000) { > - error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range", > + error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range", > temp / 1000, temp % 1000); > return; > } > diff --git a/hw/misc/tmp421.c b/hw/misc/tmp421.c > index c234044305d..eeb11000f0f 100644 > --- a/hw/misc/tmp421.c > +++ b/hw/misc/tmp421.c > @@ -153,7 +153,7 @@ static void tmp421_set_temperature(Object *obj, Visitor *v, const char *name, > } > > if (temp >= maxs[ext_range] || temp < mins[ext_range]) { > - error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range", > + error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range", > temp / 1000, temp % 1000); > return; > } > Do we have any policy in place to prohibit this in the future? (Presumably a policy that is automatic and won't interfere with QEMU localization efforts which may rightly attempt to use UTF-8 for those locales.) Do you have a script or trick to find utf-8 containing strings in our source? Only curious, don't hold this patch up on my account. I'm not raising a challenge. --js
[adding Markus in CC, since git didn't do it automatically from the 'Reported-by'] On 11/20/18 3:28 PM, John Snow wrote: > > > On 11/20/18 3:36 PM, Eric Blake wrote: >> While most developers are now using UTF-8 environments, it's >> harder to guarantee that error messages will be output to >> a multibyte locale. Rather than risking error messages that >> get corrupted into mojibake when the user runs qemu in a >> non-multibyte locale, let's stick to straight ASCII error >> messages, rather than assuming that our use of UTF-8 in source >> code string constants will work unchanged in other locales. >> >> Reported-by: Markus Armbruster <armbru@redhat.com> >> Signed-off-by: Eric Blake <eblake@redhat.com> >> --- >> hw/misc/tmp105.c | 2 +- >> hw/misc/tmp421.c | 2 +- >> 2 files changed, 2 insertions(+), 2 deletions(-) > > Do we have any policy in place to prohibit this in the future? > (Presumably a policy that is automatic and won't interfere with QEMU > localization efforts which may rightly attempt to use UTF-8 for those > locales.) Not that I know of. > > Do you have a script or trick to find utf-8 containing strings in our > source? Markus found these two, probably by reading over a list resulting from his claim of finding 217 out of 6455 files (53 of them binary, which don't count): https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg04017.html My quick and dirty attempt, which does not quite reproduce his numbers: $ LC_ALL=C git grep -l $'[\x80-\xff]' | wc 279 279 7490 Thus, by forcing a unibyte locale (where encoding errors are impossible) with sane range expressions (POSIX says only the C locale is required to interpret regex ranges according to byte value - all bets are off in other locales) and using $'' to type non-UTF-8 bytes into my search, I found 279 files with at least one byte outside of ASCII. But the use of -l has no easy way to filter which of those files are binary; while dropping -l claims 2138 "lines" with non-ASCII, which gets tedious to scroll through, especially considering there ARE binary files in the mix. Narrowing the search to a more specific pattern: $ LC_ALL=C git grep $'".*[\x80-\xff].*"' | grep -v 'Binary file' | wc 129 685 8808 is a bit more manageable, with MOST of the hits in pc-bios/qemu.rsrc (false positive hits, due to interesting? comments), in po/ (which doesn't count), or in scripts/ for python. And the proof for THIS patch: $ LC_ALL=C git grep -l $'".*[\x80-\xff].*"' origin -- '**/*.[ch]' | cat origin:hw/misc/tmp105.c origin:hw/misc/tmp421.c > > Only curious, don't hold this patch up on my account. I'm not raising a > challenge. Maybe checkpatch.pl could be taught to do a similar check?
On 11/20/18 5:01 PM, Eric Blake wrote: > [adding Markus in CC, since git didn't do it automatically from the > 'Reported-by'] > > On 11/20/18 3:28 PM, John Snow wrote: >> >> >> On 11/20/18 3:36 PM, Eric Blake wrote: >>> While most developers are now using UTF-8 environments, it's >>> harder to guarantee that error messages will be output to >>> a multibyte locale. Rather than risking error messages that >>> get corrupted into mojibake when the user runs qemu in a >>> non-multibyte locale, let's stick to straight ASCII error >>> messages, rather than assuming that our use of UTF-8 in source >>> code string constants will work unchanged in other locales. >>> >>> Reported-by: Markus Armbruster <armbru@redhat.com> >>> Signed-off-by: Eric Blake <eblake@redhat.com> >>> --- >>> hw/misc/tmp105.c | 2 +- >>> hw/misc/tmp421.c | 2 +- >>> 2 files changed, 2 insertions(+), 2 deletions(-) > >> >> Do we have any policy in place to prohibit this in the future? >> (Presumably a policy that is automatic and won't interfere with QEMU >> localization efforts which may rightly attempt to use UTF-8 for those >> locales.) > > Not that I know of. > >> >> Do you have a script or trick to find utf-8 containing strings in our >> source? > > Markus found these two, probably by reading over a list resulting from > his claim of finding 217 out of 6455 files (53 of them binary, which > don't count): > https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg04017.html > > My quick and dirty attempt, which does not quite reproduce his numbers: > > $ LC_ALL=C git grep -l $'[\x80-\xff]' | wc > 279 279 7490 > > Thus, by forcing a unibyte locale (where encoding errors are impossible) > with sane range expressions (POSIX says only the C locale is required to > interpret regex ranges according to byte value - all bets are off in > other locales) and using $'' to type non-UTF-8 bytes into my search, I > found 279 files with at least one byte outside of ASCII. But the use of > -l has no easy way to filter which of those files are binary; while > dropping -l claims 2138 "lines" with non-ASCII, which gets tedious to > scroll through, especially considering there ARE binary files in the mix. > > Narrowing the search to a more specific pattern: > > $ LC_ALL=C git grep $'".*[\x80-\xff].*"' | grep -v 'Binary file' | wc > 129 685 8808 > > is a bit more manageable, with MOST of the hits in pc-bios/qemu.rsrc > (false positive hits, due to interesting? comments), in po/ (which > doesn't count), or in scripts/ for python. And the proof for THIS patch: > > $ LC_ALL=C git grep -l $'".*[\x80-\xff].*"' origin -- '**/*.[ch]' | cat > origin:hw/misc/tmp105.c > origin:hw/misc/tmp421.c > Aha! Thanks. I was surprised we've managed to introduce only two cases of this so far. Your patch seems complete in this case. Since I took the time to read along, I may as well: Reviewed-by: John Snow <jsnow@redhat.com> >> >> Only curious, don't hold this patch up on my account. I'm not raising a >> challenge. > > Maybe checkpatch.pl could be taught to do a similar check? >
On 2018-11-20 21:36, Eric Blake wrote: > While most developers are now using UTF-8 environments, it's > harder to guarantee that error messages will be output to > a multibyte locale. Rather than risking error messages that > get corrupted into mojibake when the user runs qemu in a > non-multibyte locale, let's stick to straight ASCII error > messages, rather than assuming that our use of UTF-8 in source > code string constants will work unchanged in other locales. > > Reported-by: Markus Armbruster <armbru@redhat.com> > Signed-off-by: Eric Blake <eblake@redhat.com> > --- > hw/misc/tmp105.c | 2 +- > hw/misc/tmp421.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/hw/misc/tmp105.c b/hw/misc/tmp105.c > index 0918f3a6ea2..f6d7163273a 100644 > --- a/hw/misc/tmp105.c > +++ b/hw/misc/tmp105.c > @@ -79,7 +79,7 @@ static void tmp105_set_temperature(Object *obj, Visitor *v, const char *name, > return; > } > if (temp >= 128000 || temp < -128000) { > - error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range", > + error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range", > temp / 1000, temp % 1000); > return; > } > diff --git a/hw/misc/tmp421.c b/hw/misc/tmp421.c > index c234044305d..eeb11000f0f 100644 > --- a/hw/misc/tmp421.c > +++ b/hw/misc/tmp421.c > @@ -153,7 +153,7 @@ static void tmp421_set_temperature(Object *obj, Visitor *v, const char *name, > } > > if (temp >= maxs[ext_range] || temp < mins[ext_range]) { > - error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range", > + error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range", > temp / 1000, temp % 1000); > return; > } > Reviewed-by: Thomas Huth <thuth@redhat.com>
On 20/11/18 23:01, Eric Blake wrote: > [adding Markus in CC, since git didn't do it automatically from the > 'Reported-by'] > > On 11/20/18 3:28 PM, John Snow wrote: >> >> >> On 11/20/18 3:36 PM, Eric Blake wrote: >>> While most developers are now using UTF-8 environments, it's >>> harder to guarantee that error messages will be output to >>> a multibyte locale. Rather than risking error messages that >>> get corrupted into mojibake when the user runs qemu in a >>> non-multibyte locale, let's stick to straight ASCII error >>> messages, rather than assuming that our use of UTF-8 in source >>> code string constants will work unchanged in other locales. >>> >>> Reported-by: Markus Armbruster <armbru@redhat.com> >>> Signed-off-by: Eric Blake <eblake@redhat.com> >>> --- >>> hw/misc/tmp105.c | 2 +- >>> hw/misc/tmp421.c | 2 +- >>> 2 files changed, 2 insertions(+), 2 deletions(-) > >> >> Do we have any policy in place to prohibit this in the future? >> (Presumably a policy that is automatic and won't interfere with QEMU >> localization efforts which may rightly attempt to use UTF-8 for those >> locales.) > > Not that I know of. > >> >> Do you have a script or trick to find utf-8 containing strings in our >> source? > > Markus found these two, probably by reading over a list resulting from > his claim of finding 217 out of 6455 files (53 of them binary, which > don't count): > https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg04017.html > > My quick and dirty attempt, which does not quite reproduce his numbers: > > $ LC_ALL=C git grep -l $'[\x80-\xff]' | wc > 279 279 7490 > > Thus, by forcing a unibyte locale (where encoding errors are impossible) > with sane range expressions (POSIX says only the C locale is required to > interpret regex ranges according to byte value - all bets are off in > other locales) and using $'' to type non-UTF-8 bytes into my search, I > found 279 files with at least one byte outside of ASCII. But the use of > -l has no easy way to filter which of those files are binary; while > dropping -l claims 2138 "lines" with non-ASCII, which gets tedious to > scroll through, especially considering there ARE binary files in the mix. > > Narrowing the search to a more specific pattern: > > $ LC_ALL=C git grep $'".*[\x80-\xff].*"' | grep -v 'Binary file' | wc > 129 685 8808 > > is a bit more manageable, with MOST of the hits in pc-bios/qemu.rsrc > (false positive hits, due to interesting? comments), in po/ (which > doesn't count), or in scripts/ for python. And the proof for THIS patch: > > $ LC_ALL=C git grep -l $'".*[\x80-\xff].*"' origin -- '**/*.[ch]' | cat > origin:hw/misc/tmp105.c > origin:hw/misc/tmp421.c Can we add the last 3 lines in the commit message? > >> >> Only curious, don't hold this patch up on my account. I'm not raising a >> challenge. > > Maybe checkpatch.pl could be taught to do a similar check? It looks easier in shell than perl... We could add a checkpatch.sh which finally call 'exec -l checkpatch.pl $@' or similar? Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Philippe Mathieu-Daudé <philmd@redhat.com> writes: > On 20/11/18 23:01, Eric Blake wrote: >> [adding Markus in CC, since git didn't do it automatically from the >> 'Reported-by'] >> >> On 11/20/18 3:28 PM, John Snow wrote: >>> >>> >>> On 11/20/18 3:36 PM, Eric Blake wrote: >>>> While most developers are now using UTF-8 environments, it's >>>> harder to guarantee that error messages will be output to >>>> a multibyte locale. Rather than risking error messages that >>>> get corrupted into mojibake when the user runs qemu in a >>>> non-multibyte locale, let's stick to straight ASCII error >>>> messages, rather than assuming that our use of UTF-8 in source >>>> code string constants will work unchanged in other locales. >>>> >>>> Reported-by: Markus Armbruster <armbru@redhat.com> >>>> Signed-off-by: Eric Blake <eblake@redhat.com> >>>> --- >>>> hw/misc/tmp105.c | 2 +- >>>> hw/misc/tmp421.c | 2 +- >>>> 2 files changed, 2 insertions(+), 2 deletions(-) >> >>> >>> Do we have any policy in place to prohibit this in the future? >>> (Presumably a policy that is automatic and won't interfere with QEMU >>> localization efforts which may rightly attempt to use UTF-8 for those >>> locales.) >> >> Not that I know of. We already outlaw newline and trailing punctuation, we could amend that to outlaw non-ASCII. $ git-grep 'no newline' include/qapi/error.h: * The resulting message should be a single phrase, with no newline or util/qemu-error.c: * a single phrase, with no newline or trailing punctuation. util/qemu-error.c: * a single phrase, with no newline or trailing punctuation. util/qemu-error.c: * a single phrase, with no newline or trailing punctuation. util/qemu-error.c: * a single phrase, with no newline or trailing punctuation. util/qemu-error.c: * a single phrase, with no newline or trailing punctuation. util/qemu-error.c: * single phrase, with no newline or trailing punctuation. util/qemu-error.c: * single phrase, with no newline or trailing punctuation. [...] >> Maybe checkpatch.pl could be taught to do a similar check? > > It looks easier in shell than perl... > > We could add a checkpatch.sh which finally call 'exec -l checkpatch.pl > $@' or similar? Congratulations, you found yet another way to make our checkpatch program less readable. Back to serious. checkpatch.pl already flags error messages containing newlines (search for "should not contain newlines"). Extending that to flag non-ASCII characters shouldn't be hard. > Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com>
On 20/11/2018 21:36, Eric Blake wrote: > While most developers are now using UTF-8 environments, it's > harder to guarantee that error messages will be output to > a multibyte locale. Rather than risking error messages that > get corrupted into mojibake when the user runs qemu in a > non-multibyte locale, let's stick to straight ASCII error > messages, rather than assuming that our use of UTF-8 in source > code string constants will work unchanged in other locales. > > Reported-by: Markus Armbruster <armbru@redhat.com> > Signed-off-by: Eric Blake <eblake@redhat.com> > --- > hw/misc/tmp105.c | 2 +- > hw/misc/tmp421.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/hw/misc/tmp105.c b/hw/misc/tmp105.c > index 0918f3a6ea2..f6d7163273a 100644 > --- a/hw/misc/tmp105.c > +++ b/hw/misc/tmp105.c > @@ -79,7 +79,7 @@ static void tmp105_set_temperature(Object *obj, Visitor *v, const char *name, > return; > } > if (temp >= 128000 || temp < -128000) { > - error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range", > + error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range", > temp / 1000, temp % 1000); > return; > } > diff --git a/hw/misc/tmp421.c b/hw/misc/tmp421.c > index c234044305d..eeb11000f0f 100644 > --- a/hw/misc/tmp421.c > +++ b/hw/misc/tmp421.c > @@ -153,7 +153,7 @@ static void tmp421_set_temperature(Object *obj, Visitor *v, const char *name, > } > > if (temp >= maxs[ext_range] || temp < mins[ext_range]) { > - error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range", > + error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range", > temp / 1000, temp % 1000); > return; > } > Applied to qemu-trivial with updated message log (command to find the non-ASCII characters). Thanks, Laurent
diff --git a/hw/misc/tmp105.c b/hw/misc/tmp105.c index 0918f3a6ea2..f6d7163273a 100644 --- a/hw/misc/tmp105.c +++ b/hw/misc/tmp105.c @@ -79,7 +79,7 @@ static void tmp105_set_temperature(Object *obj, Visitor *v, const char *name, return; } if (temp >= 128000 || temp < -128000) { - error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range", + error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range", temp / 1000, temp % 1000); return; } diff --git a/hw/misc/tmp421.c b/hw/misc/tmp421.c index c234044305d..eeb11000f0f 100644 --- a/hw/misc/tmp421.c +++ b/hw/misc/tmp421.c @@ -153,7 +153,7 @@ static void tmp421_set_temperature(Object *obj, Visitor *v, const char *name, } if (temp >= maxs[ext_range] || temp < mins[ext_range]) { - error_setg(errp, "value %" PRId64 ".%03" PRIu64 " °C is out of range", + error_setg(errp, "value %" PRId64 ".%03" PRIu64 " C is out of range", temp / 1000, temp % 1000); return; }
While most developers are now using UTF-8 environments, it's harder to guarantee that error messages will be output to a multibyte locale. Rather than risking error messages that get corrupted into mojibake when the user runs qemu in a non-multibyte locale, let's stick to straight ASCII error messages, rather than assuming that our use of UTF-8 in source code string constants will work unchanged in other locales. Reported-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com> --- hw/misc/tmp105.c | 2 +- hw/misc/tmp421.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)