Message ID | 20180115170243.24578-8-berrange@redhat.com |
---|---|
State | New |
Headers | show |
Series | Support building with py2 or py3 | expand |
On 01/15/2018 11:02 AM, Daniel P. Berrange wrote: > Python2 did not validate locale correctness when reading input data, so > would happily read UTF-8 data in non-UTF-8 locales. Python3 is strict so > if you try to read UTF-8 data in the C locale, it will raise an error > for any UTF-8 bytes that aren't representable in 7-bit ascii encoding. Urgh, that sounds like a Python bug. The C locale is defined by POSIX to be 8-bit clean (ie. a superset of ascii with 256 characters, not strict ascii with only 128 characters and 128 bytes that form encoding errors). But that doesn't change the fact that we have to work around python's braindead misinterpretation of reality. > e.g. > > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 54: ordinal not in range(128) > Traceback (most recent call last): > File "/tmp/qemu-test/src/scripts/qapi-commands.py", line 317, in <module> > schema = QAPISchema(input_file) > File "/tmp/qemu-test/src/scripts/qapi.py", line 1468, in __init__ > parser = QAPISchemaParser(open(fname, 'r')) > File "/tmp/qemu-test/src/scripts/qapi.py", line 301, in __init__ > previously_included) > File "/tmp/qemu-test/src/scripts/qapi.py", line 348, in _include > exprs_include = QAPISchemaParser(fobj, previously_included, info) > File "/tmp/qemu-test/src/scripts/qapi.py", line 271, in __init__ > self.src = fp.read() > File "/usr/lib64/python3.5/encodings/ascii.py", line 26, in decode > return codecs.ascii_decode(input, self.errors)[0] > > Many distros support a new C.UTF-8 locale that is like the C locale, > but with UTF-8 instead of 7-bit ASCII. That is not entirely portable > though, so this patch instead forces the en_US.UTF-8 locale, which > is pretty similar but more widely available. > > We set LANG, rather than only LC_CTYPE, since generated source ought > to be independant of all of the user's locale settings. s/independant/independent/ LANG is the lowest-priority setting - if the user has explicitly set LC_CTYPE or LC_ALL, their settings override what is in LANG. > > This patch only forces UTF-8 for QAPI scripts, since that is the one > showing the immediate error under Python3 with C locale, but potentially > we ought to force this for all python scripts used in the build process. > > Signed-off-by: Daniel P. Berrange <berrange@redhat.com> > --- > Makefile | 22 ++++++++++++---------- > 1 file changed, 12 insertions(+), 10 deletions(-) > > diff --git a/Makefile b/Makefile > index d86ecd2dd4..fde91cc42d 100644 > --- a/Makefile > +++ b/Makefile > @@ -17,6 +17,8 @@ ifneq ($(wildcard config-host.mak),) > all: > include config-host.mak > > +PYTHON_UTF8 = LANG=en_US.UTF-8 $(PYTHON) I'm worried that this is not reproducible in the face of a user that explicitly sets different locale env-vars with higher priority than LANG. > + > git-submodule-update: > > .PHONY: git-submodule-update > @@ -471,17 +473,17 @@ qapi-py = $(SRC_PATH)/scripts/qapi.py $(SRC_PATH)/scripts/ordereddict.py > > qga/qapi-generated/qga-qapi-types.c qga/qapi-generated/qga-qapi-types.h :\ > $(SRC_PATH)/qga/qapi-schema.json $(SRC_PATH)/scripts/qapi-types.py $(qapi-py) > - $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-types.py \ > + $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-types.py \ But once we agree on the right override to stuff into PYTHON_UTF8, the rest of the patch converting invocations to PYTHON_UTF8 makes sense.
On Mon, Jan 15, 2018 at 11:15:01AM -0600, Eric Blake wrote: > On 01/15/2018 11:02 AM, Daniel P. Berrange wrote: > > Python2 did not validate locale correctness when reading input data, so > > would happily read UTF-8 data in non-UTF-8 locales. Python3 is strict so > > if you try to read UTF-8 data in the C locale, it will raise an error > > for any UTF-8 bytes that aren't representable in 7-bit ascii encoding. > > Urgh, that sounds like a Python bug. The C locale is defined by POSIX to > be 8-bit clean (ie. a superset of ascii with 256 characters, not strict > ascii with only 128 characters and 128 bytes that form encoding errors). > But that doesn't change the fact that we have to work around python's > braindead misinterpretation of reality. FYI there is some background on this behaviour here: https://www.python.org/dev/peps/pep-0538/ NB that doc says the new C-is-UTF-8 assumpion is for Python 3.7 or later, but Fedora backported it to F27's Python 3.6 :-) The failure can be seen on Fedora with 3.0 -> 3.5 only. (BTW you can install many Python 3.x versions concurrently on Fedora which is handy for testing) > > e.g. > > > > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 54: ordinal not in range(128) > > Traceback (most recent call last): > > File "/tmp/qemu-test/src/scripts/qapi-commands.py", line 317, in <module> > > schema = QAPISchema(input_file) > > File "/tmp/qemu-test/src/scripts/qapi.py", line 1468, in __init__ > > parser = QAPISchemaParser(open(fname, 'r')) > > File "/tmp/qemu-test/src/scripts/qapi.py", line 301, in __init__ > > previously_included) > > File "/tmp/qemu-test/src/scripts/qapi.py", line 348, in _include > > exprs_include = QAPISchemaParser(fobj, previously_included, info) > > File "/tmp/qemu-test/src/scripts/qapi.py", line 271, in __init__ > > self.src = fp.read() > > File "/usr/lib64/python3.5/encodings/ascii.py", line 26, in decode > > return codecs.ascii_decode(input, self.errors)[0] > > > > Many distros support a new C.UTF-8 locale that is like the C locale, > > but with UTF-8 instead of 7-bit ASCII. That is not entirely portable > > though, so this patch instead forces the en_US.UTF-8 locale, which > > is pretty similar but more widely available. > > > > We set LANG, rather than only LC_CTYPE, since generated source ought > > to be independant of all of the user's locale settings. > > s/independant/independent/ > > LANG is the lowest-priority setting - if the user has explicitly set > LC_CTYPE or LC_ALL, their settings override what is in LANG. > > > > > This patch only forces UTF-8 for QAPI scripts, since that is the one > > showing the immediate error under Python3 with C locale, but potentially > > we ought to force this for all python scripts used in the build process. > > > > Signed-off-by: Daniel P. Berrange <berrange@redhat.com> > > --- > > Makefile | 22 ++++++++++++---------- > > 1 file changed, 12 insertions(+), 10 deletions(-) > > > > diff --git a/Makefile b/Makefile > > index d86ecd2dd4..fde91cc42d 100644 > > --- a/Makefile > > +++ b/Makefile > > @@ -17,6 +17,8 @@ ifneq ($(wildcard config-host.mak),) > > all: > > include config-host.mak > > > > +PYTHON_UTF8 = LANG=en_US.UTF-8 $(PYTHON) > > I'm worried that this is not reproducible in the face of a user that > explicitly sets different locale env-vars with higher priority than LANG. You might remember a similar issue affecting libvirt-glib/libosinfo when glib-mkenums was rewritten to use Python instead of Perl. For that I ended up doing LC_ALL= LANG=C LC_CTYPE=en_US.UTF-8 > > + > > git-submodule-update: > > > > .PHONY: git-submodule-update > > @@ -471,17 +473,17 @@ qapi-py = $(SRC_PATH)/scripts/qapi.py $(SRC_PATH)/scripts/ordereddict.py > > > > qga/qapi-generated/qga-qapi-types.c qga/qapi-generated/qga-qapi-types.h :\ > > $(SRC_PATH)/qga/qapi-schema.json $(SRC_PATH)/scripts/qapi-types.py $(qapi-py) > > - $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-types.py \ > > + $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-types.py \ > > But once we agree on the right override to stuff into PYTHON_UTF8, the > rest of the patch converting invocations to PYTHON_UTF8 makes sense. Any thoughts on whether we should apply this more widely to our build to make its output predictable regardless of user's locale ? Regards, Daniel
diff --git a/Makefile b/Makefile index d86ecd2dd4..fde91cc42d 100644 --- a/Makefile +++ b/Makefile @@ -17,6 +17,8 @@ ifneq ($(wildcard config-host.mak),) all: include config-host.mak +PYTHON_UTF8 = LANG=en_US.UTF-8 $(PYTHON) + git-submodule-update: .PHONY: git-submodule-update @@ -471,17 +473,17 @@ qapi-py = $(SRC_PATH)/scripts/qapi.py $(SRC_PATH)/scripts/ordereddict.py qga/qapi-generated/qga-qapi-types.c qga/qapi-generated/qga-qapi-types.h :\ $(SRC_PATH)/qga/qapi-schema.json $(SRC_PATH)/scripts/qapi-types.py $(qapi-py) - $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-types.py \ + $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-types.py \ $(gen-out-type) -o qga/qapi-generated -p "qga-" $<, \ "GEN","$@") qga/qapi-generated/qga-qapi-visit.c qga/qapi-generated/qga-qapi-visit.h :\ $(SRC_PATH)/qga/qapi-schema.json $(SRC_PATH)/scripts/qapi-visit.py $(qapi-py) - $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-visit.py \ + $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-visit.py \ $(gen-out-type) -o qga/qapi-generated -p "qga-" $<, \ "GEN","$@") qga/qapi-generated/qga-qmp-commands.h qga/qapi-generated/qga-qmp-marshal.c :\ $(SRC_PATH)/qga/qapi-schema.json $(SRC_PATH)/scripts/qapi-commands.py $(qapi-py) - $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-commands.py \ + $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-commands.py \ $(gen-out-type) -o qga/qapi-generated -p "qga-" $<, \ "GEN","$@") @@ -502,27 +504,27 @@ qapi-modules = $(SRC_PATH)/qapi-schema.json $(SRC_PATH)/qapi/common.json \ qapi-types.c qapi-types.h :\ $(qapi-modules) $(SRC_PATH)/scripts/qapi-types.py $(qapi-py) - $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-types.py \ + $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-types.py \ $(gen-out-type) -o "." -b $<, \ "GEN","$@") qapi-visit.c qapi-visit.h :\ $(qapi-modules) $(SRC_PATH)/scripts/qapi-visit.py $(qapi-py) - $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-visit.py \ + $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-visit.py \ $(gen-out-type) -o "." -b $<, \ "GEN","$@") qapi-event.c qapi-event.h :\ $(qapi-modules) $(SRC_PATH)/scripts/qapi-event.py $(qapi-py) - $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-event.py \ + $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-event.py \ $(gen-out-type) -o "." $<, \ "GEN","$@") qmp-commands.h qmp-marshal.c :\ $(qapi-modules) $(SRC_PATH)/scripts/qapi-commands.py $(qapi-py) - $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-commands.py \ + $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-commands.py \ $(gen-out-type) -o "." $<, \ "GEN","$@") qmp-introspect.h qmp-introspect.c :\ $(qapi-modules) $(SRC_PATH)/scripts/qapi-introspect.py $(qapi-py) - $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi-introspect.py \ + $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi-introspect.py \ $(gen-out-type) -o "." $<, \ "GEN","$@") @@ -792,10 +794,10 @@ qemu-img-cmds.texi: $(SRC_PATH)/qemu-img-cmds.hx $(SRC_PATH)/scripts/hxtool docs/interop/qemu-qmp-qapi.texi docs/interop/qemu-ga-qapi.texi: $(SRC_PATH)/scripts/qapi2texi.py $(qapi-py) docs/interop/qemu-qmp-qapi.texi: $(qapi-modules) - $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi2texi.py $< > $@,"GEN","$@") + $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi2texi.py $< > $@,"GEN","$@") docs/interop/qemu-ga-qapi.texi: $(SRC_PATH)/qga/qapi-schema.json - $(call quiet-command,$(PYTHON) $(SRC_PATH)/scripts/qapi2texi.py $< > $@,"GEN","$@") + $(call quiet-command,$(PYTHON_UTF8) $(SRC_PATH)/scripts/qapi2texi.py $< > $@,"GEN","$@") qemu.1: qemu-doc.texi qemu-options.texi qemu-monitor.texi qemu-monitor-info.texi qemu.1: qemu-option-trace.texi
Python2 did not validate locale correctness when reading input data, so would happily read UTF-8 data in non-UTF-8 locales. Python3 is strict so if you try to read UTF-8 data in the C locale, it will raise an error for any UTF-8 bytes that aren't representable in 7-bit ascii encoding. e.g. UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 54: ordinal not in range(128) Traceback (most recent call last): File "/tmp/qemu-test/src/scripts/qapi-commands.py", line 317, in <module> schema = QAPISchema(input_file) File "/tmp/qemu-test/src/scripts/qapi.py", line 1468, in __init__ parser = QAPISchemaParser(open(fname, 'r')) File "/tmp/qemu-test/src/scripts/qapi.py", line 301, in __init__ previously_included) File "/tmp/qemu-test/src/scripts/qapi.py", line 348, in _include exprs_include = QAPISchemaParser(fobj, previously_included, info) File "/tmp/qemu-test/src/scripts/qapi.py", line 271, in __init__ self.src = fp.read() File "/usr/lib64/python3.5/encodings/ascii.py", line 26, in decode return codecs.ascii_decode(input, self.errors)[0] Many distros support a new C.UTF-8 locale that is like the C locale, but with UTF-8 instead of 7-bit ASCII. That is not entirely portable though, so this patch instead forces the en_US.UTF-8 locale, which is pretty similar but more widely available. We set LANG, rather than only LC_CTYPE, since generated source ought to be independant of all of the user's locale settings. This patch only forces UTF-8 for QAPI scripts, since that is the one showing the immediate error under Python3 with C locale, but potentially we ought to force this for all python scripts used in the build process. Signed-off-by: Daniel P. Berrange <berrange@redhat.com> --- Makefile | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-)