diff mbox series

[v4,07/13] qapi: force a UTF-8 locale for running Python

Message ID 20180115170243.24578-8-berrange@redhat.com
State New
Headers show
Series Support building with py2 or py3 | expand

Commit Message

Daniel P. Berrangé Jan. 15, 2018, 5:02 p.m. UTC
Python2 did not validate locale correctness when reading input data, so
would happily read UTF-8 data in non-UTF-8 locales. Python3 is strict so
if you try to read UTF-8 data in the C locale, it will raise an error
for any UTF-8 bytes that aren't representable in 7-bit ascii encoding.
e.g.

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 54: ordinal not in range(128)
Traceback (most recent call last):
  File "/tmp/qemu-test/src/scripts/qapi-commands.py", line 317, in <module>
    schema = QAPISchema(input_file)
  File "/tmp/qemu-test/src/scripts/qapi.py", line 1468, in __init__
    parser = QAPISchemaParser(open(fname, 'r'))
  File "/tmp/qemu-test/src/scripts/qapi.py", line 301, in __init__
    previously_included)
  File "/tmp/qemu-test/src/scripts/qapi.py", line 348, in _include
    exprs_include = QAPISchemaParser(fobj, previously_included, info)
  File "/tmp/qemu-test/src/scripts/qapi.py", line 271, in __init__
    self.src = fp.read()
  File "/usr/lib64/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]

Many distros support a new C.UTF-8 locale that is like the C locale,
but with UTF-8 instead of 7-bit ASCII. That is not entirely portable
though, so this patch instead forces the en_US.UTF-8 locale, which
is pretty similar but more widely available.

We set LANG, rather than only LC_CTYPE, since generated source ought
to be independant of all of the user's locale settings.

This patch only forces UTF-8 for QAPI scripts, since that is the one
showing the immediate error under Python3 with C locale, but potentially
we ought to force this for all python scripts used in the build process.

Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
---
 Makefile | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

Comments

Eric Blake Jan. 15, 2018, 5:15 p.m. UTC | #1
On 01/15/2018 11:02 AM, Daniel P. Berrange wrote:
> Python2 did not validate locale correctness when reading input data, so
> would happily read UTF-8 data in non-UTF-8 locales. Python3 is strict so
> if you try to read UTF-8 data in the C locale, it will raise an error
> for any UTF-8 bytes that aren't representable in 7-bit ascii encoding.

Urgh, that sounds like a Python bug. The C locale is defined by POSIX to
be 8-bit clean (ie. a superset of ascii with 256 characters, not strict
ascii with only 128 characters and 128 bytes that form encoding errors).
 But that doesn't change the fact that we have to work around python's
braindead misinterpretation of reality.

> e.g.
> 
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 54: ordinal not in range(128)
> Traceback (most recent call last):
>   File "/tmp/qemu-test/src/scripts/qapi-commands.py", line 317, in <module>
>     schema = QAPISchema(input_file)
>   File "/tmp/qemu-test/src/scripts/qapi.py", line 1468, in __init__
>     parser = QAPISchemaParser(open(fname, 'r'))
>   File "/tmp/qemu-test/src/scripts/qapi.py", line 301, in __init__
>     previously_included)
>   File "/tmp/qemu-test/src/scripts/qapi.py", line 348, in _include
>     exprs_include = QAPISchemaParser(fobj, previously_included, info)
>   File "/tmp/qemu-test/src/scripts/qapi.py", line 271, in __init__
>     self.src = fp.read()
>   File "/usr/lib64/python3.5/encodings/ascii.py", line 26, in decode
>     return codecs.ascii_decode(input, self.errors)[0]
> 
> Many distros support a new C.UTF-8 locale that is like the C locale,
> but with UTF-8 instead of 7-bit ASCII. That is not entirely portable
> though, so this patch instead forces the en_US.UTF-8 locale, which
> is pretty similar but more widely available.
> 
> We set LANG, rather than only LC_CTYPE, since generated source ought
> to be independant of all of the user's locale settings.

s/independant/independent/

LANG is the lowest-priority setting - if the user has explicitly set
LC_CTYPE or LC_ALL, their settings override what is in LANG.

> 
> This patch only forces UTF-8 for QAPI scripts, since that is the one
> showing the immediate error under Python3 with C locale, but potentially
> we ought to force this for all python scripts used in the build process.
> 
> Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
> ---
>  Makefile | 22 ++++++++++++----------
>  1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/Makefile b/Makefile
> index d86ecd2dd4..fde91cc42d 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -17,6 +17,8 @@ ifneq ($(wildcard config-host.mak),)
>  all:
>  include config-host.mak
>  
> +PYTHON_UTF8 = LANG=en_US.UTF-8 $(PYTHON)

I'm worried that this is not reproducible in the face of a user that
explicitly sets different locale env-vars with higher priority than LANG.

> +
>  git-submodule-update:
>  
>  .PHONY: git-submodule-update
> @@ -471,17 +473,17 @@ qapi-py = $(SRC_PATH)/scripts/qapi.py $(SRC_PATH)/scripts/ordereddict.py
>  
>  qga/qapi-generated/qga-qapi-types.c qga/qapi-generated/qga-qapi-types.h :\
>  $(SRC_PATH)/qga/qapi-schema.json $(SRC_PATH)/scripts/qapi-types.py $(qapi-py)
> -	$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-types.py \
> +	$(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-types.py \

But once we agree on the right override to stuff into PYTHON_UTF8, the
rest of the patch converting invocations to PYTHON_UTF8 makes sense.
Daniel P. Berrangé Jan. 15, 2018, 5:28 p.m. UTC | #2
On Mon, Jan 15, 2018 at 11:15:01AM -0600, Eric Blake wrote:
> On 01/15/2018 11:02 AM, Daniel P. Berrange wrote:
> > Python2 did not validate locale correctness when reading input data, so
> > would happily read UTF-8 data in non-UTF-8 locales. Python3 is strict so
> > if you try to read UTF-8 data in the C locale, it will raise an error
> > for any UTF-8 bytes that aren't representable in 7-bit ascii encoding.
> 
> Urgh, that sounds like a Python bug. The C locale is defined by POSIX to
> be 8-bit clean (ie. a superset of ascii with 256 characters, not strict
> ascii with only 128 characters and 128 bytes that form encoding errors).
>  But that doesn't change the fact that we have to work around python's
> braindead misinterpretation of reality.

FYI there is some background on this behaviour here:

 https://www.python.org/dev/peps/pep-0538/

NB that doc says the new C-is-UTF-8 assumpion is for Python 3.7 or later,
but Fedora backported it to F27's  Python 3.6 :-)

The failure can be seen on Fedora with 3.0 -> 3.5 only. (BTW you can
install many Python 3.x versions concurrently on Fedora which is handy
for testing)

> > e.g.
> > 
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 54: ordinal not in range(128)
> > Traceback (most recent call last):
> >   File "/tmp/qemu-test/src/scripts/qapi-commands.py", line 317, in <module>
> >     schema = QAPISchema(input_file)
> >   File "/tmp/qemu-test/src/scripts/qapi.py", line 1468, in __init__
> >     parser = QAPISchemaParser(open(fname, 'r'))
> >   File "/tmp/qemu-test/src/scripts/qapi.py", line 301, in __init__
> >     previously_included)
> >   File "/tmp/qemu-test/src/scripts/qapi.py", line 348, in _include
> >     exprs_include = QAPISchemaParser(fobj, previously_included, info)
> >   File "/tmp/qemu-test/src/scripts/qapi.py", line 271, in __init__
> >     self.src = fp.read()
> >   File "/usr/lib64/python3.5/encodings/ascii.py", line 26, in decode
> >     return codecs.ascii_decode(input, self.errors)[0]
> > 
> > Many distros support a new C.UTF-8 locale that is like the C locale,
> > but with UTF-8 instead of 7-bit ASCII. That is not entirely portable
> > though, so this patch instead forces the en_US.UTF-8 locale, which
> > is pretty similar but more widely available.
> > 
> > We set LANG, rather than only LC_CTYPE, since generated source ought
> > to be independant of all of the user's locale settings.
> 
> s/independant/independent/
> 
> LANG is the lowest-priority setting - if the user has explicitly set
> LC_CTYPE or LC_ALL, their settings override what is in LANG.
> 
> > 
> > This patch only forces UTF-8 for QAPI scripts, since that is the one
> > showing the immediate error under Python3 with C locale, but potentially
> > we ought to force this for all python scripts used in the build process.
> > 
> > Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
> > ---
> >  Makefile | 22 ++++++++++++----------
> >  1 file changed, 12 insertions(+), 10 deletions(-)
> > 
> > diff --git a/Makefile b/Makefile
> > index d86ecd2dd4..fde91cc42d 100644
> > --- a/Makefile
> > +++ b/Makefile
> > @@ -17,6 +17,8 @@ ifneq ($(wildcard config-host.mak),)
> >  all:
> >  include config-host.mak
> >  
> > +PYTHON_UTF8 = LANG=en_US.UTF-8 $(PYTHON)
> 
> I'm worried that this is not reproducible in the face of a user that
> explicitly sets different locale env-vars with higher priority than LANG.

You might remember a similar issue affecting libvirt-glib/libosinfo when
glib-mkenums was rewritten to use Python instead of Perl. For that I ended
up doing

   LC_ALL= LANG=C LC_CTYPE=en_US.UTF-8 


> > +
> >  git-submodule-update:
> >  
> >  .PHONY: git-submodule-update
> > @@ -471,17 +473,17 @@ qapi-py = $(SRC_PATH)/scripts/qapi.py $(SRC_PATH)/scripts/ordereddict.py
> >  
> >  qga/qapi-generated/qga-qapi-types.c qga/qapi-generated/qga-qapi-types.h :\
> >  $(SRC_PATH)/qga/qapi-schema.json $(SRC_PATH)/scripts/qapi-types.py $(qapi-py)
> > -	$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-types.py \
> > +	$(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-types.py \
> 
> But once we agree on the right override to stuff into PYTHON_UTF8, the
> rest of the patch converting invocations to PYTHON_UTF8 makes sense.

Any thoughts on whether we should apply this more widely to our build
to make its output predictable regardless of user's locale ?

Regards,
Daniel
diff mbox series

Patch

diff --git a/Makefile b/Makefile
index d86ecd2dd4..fde91cc42d 100644
--- a/Makefile
+++ b/Makefile
@@ -17,6 +17,8 @@  ifneq ($(wildcard config-host.mak),)
 all:
 include config-host.mak
 
+PYTHON_UTF8 = LANG=en_US.UTF-8 $(PYTHON)
+
 git-submodule-update:
 
 .PHONY: git-submodule-update
@@ -471,17 +473,17 @@  qapi-py = $(SRC_PATH)/scripts/qapi.py $(SRC_PATH)/scripts/ordereddict.py
 
 qga/qapi-generated/qga-qapi-types.c qga/qapi-generated/qga-qapi-types.h :\
 $(SRC_PATH)/qga/qapi-schema.json $(SRC_PATH)/scripts/qapi-types.py $(qapi-py)
-	$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-types.py \
+	$(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-types.py \
 		$(gen-out-type) -o qga/qapi-generated -p "qga-" $<, \
 		"GEN","$@")
 qga/qapi-generated/qga-qapi-visit.c qga/qapi-generated/qga-qapi-visit.h :\
 $(SRC_PATH)/qga/qapi-schema.json $(SRC_PATH)/scripts/qapi-visit.py $(qapi-py)
-	$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-visit.py \
+	$(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-visit.py \
 		$(gen-out-type) -o qga/qapi-generated -p "qga-" $<, \
 		"GEN","$@")
 qga/qapi-generated/qga-qmp-commands.h qga/qapi-generated/qga-qmp-marshal.c :\
 $(SRC_PATH)/qga/qapi-schema.json $(SRC_PATH)/scripts/qapi-commands.py $(qapi-py)
-	$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-commands.py \
+	$(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-commands.py \
 		$(gen-out-type) -o qga/qapi-generated -p "qga-" $<, \
 		"GEN","$@")
 
@@ -502,27 +504,27 @@  qapi-modules = $(SRC_PATH)/qapi-schema.json $(SRC_PATH)/qapi/common.json \
 
 qapi-types.c qapi-types.h :\
 $(qapi-modules) $(SRC_PATH)/scripts/qapi-types.py $(qapi-py)
-	$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-types.py \
+	$(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-types.py \
 		$(gen-out-type) -o "." -b $<, \
 		"GEN","$@")
 qapi-visit.c qapi-visit.h :\
 $(qapi-modules) $(SRC_PATH)/scripts/qapi-visit.py $(qapi-py)
-	$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-visit.py \
+	$(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-visit.py \
 		$(gen-out-type) -o "." -b $<, \
 		"GEN","$@")
 qapi-event.c qapi-event.h :\
 $(qapi-modules) $(SRC_PATH)/scripts/qapi-event.py $(qapi-py)
-	$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-event.py \
+	$(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-event.py \
 		$(gen-out-type) -o "." $<, \
 		"GEN","$@")
 qmp-commands.h qmp-marshal.c :\
 $(qapi-modules) $(SRC_PATH)/scripts/qapi-commands.py $(qapi-py)
-	$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-commands.py \
+	$(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-commands.py \
 		$(gen-out-type) -o "." $<, \
 		"GEN","$@")
 qmp-introspect.h qmp-introspect.c :\
 $(qapi-modules) $(SRC_PATH)/scripts/qapi-introspect.py $(qapi-py)
-	$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-introspect.py \
+	$(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-introspect.py \
 		$(gen-out-type) -o "." $<, \
 		"GEN","$@")
 
@@ -792,10 +794,10 @@  qemu-img-cmds.texi: $(SRC_PATH)/qemu-img-cmds.hx $(SRC_PATH)/scripts/hxtool
 docs/interop/qemu-qmp-qapi.texi docs/interop/qemu-ga-qapi.texi: $(SRC_PATH)/scripts/qapi2texi.py $(qapi-py)
 
 docs/interop/qemu-qmp-qapi.texi: $(qapi-modules)
-	$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi2texi.py $< > $@,"GEN","$@")
+	$(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi2texi.py $< > $@,"GEN","$@")
 
 docs/interop/qemu-ga-qapi.texi: $(SRC_PATH)/qga/qapi-schema.json
-	$(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi2texi.py $< > $@,"GEN","$@")
+	$(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi2texi.py $< > $@,"GEN","$@")
 
 qemu.1: qemu-doc.texi qemu-options.texi qemu-monitor.texi qemu-monitor-info.texi
 qemu.1: qemu-option-trace.texi