[08/19] support/check-uniq-files: decode as many strings as possible

Message ID fcf8eccf9d275bd95c7e61099722c1a3b22a6da9.1546898693.git.yann.morin.1998@free.fr
State Changes Requested
Series [01/19] infra/pkg-generic: display MESSAGE before running PRE_HOOKS

Commit Message

Yann E. MORIN Jan. 7, 2019, 10:05 p.m. UTC
Currently, when there is at least one string we can't decode when
reporting the file and the packages that touched it, we fall back to not
decoding any string at all, which generates a report like:

    Warning: target file "b'/some/file'" is touched by more than one package: [b'toolchain', b'busybox']

This is not very nice, though, so we introduce a decoder that returns
the decoded string if possible, and falls back to returning the repr() of
the un-decoded string.

Also, passing a set as argument to format() yields a not-so-nice
output, even when the decoding was OK:

    [u'toolchain', u'busybox']

So, we just join together all the elements of the set into a string,
which is what we pass to format().

Now the output is much nicer to look at:

    Warning: file "/some/file" is touched by more than one package: busybox, toolchain

and even in the case of an un-decodable string (with a manually tweaked
list, \xbd being œ in iso8859-15, and not a valid UTF-8 encoding):

    Warning: file "/some/file" is touched by more than one package: 'busyb\xbdx', toolchain

Signed-off-by: "Yann E. MORIN" <yann.morin.1998@free.fr>
Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
---
 support/scripts/check-uniq-files | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)
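
For illustration, the reporting path described in the commit message can be sketched as a standalone snippet. This is a minimal sketch under stated assumptions, not the script itself: `str_decode` and `warn` mirror the patch, while `format_warning` and the `sorted()` call are hypothetical additions that make the example reproducible. Note that `decode()` called without arguments uses UTF-8 on Python 3 and the default (ASCII) codec on Python 2.

```python
# Mirrors the helper added by the patch: return the decoded string if
# possible, otherwise the repr() of the un-decoded string.
def str_decode(s):
    try:
        return s.decode()
    except UnicodeDecodeError:
        return repr(s)


warn = 'Warning: {0} file "{1}" is touched by more than one package: {2}\n'


# Hypothetical wrapper, not part of the patch. sorted() is an assumption
# added here so that the order of the package names is deterministic.
def format_warning(file_type, path, pkgs):
    return warn.format(file_type, str_decode(path),
                       ", ".join(sorted(str_decode(p) for p in pkgs)))


msg = format_warning('target', b'/some/file', {b'toolchain', b'busybox'})
```

On Python 3 the fallback branch yields the `b'...'` form of the bytes, whereas Python 2.7 yields the plain quoted form shown in the commit message.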

Comments

Arnout Vandecappelle Feb. 7, 2019, 11:40 p.m. UTC | #1
On 07/01/2019 23:05, Yann E. MORIN wrote:
> +# If possible, try to decode the binary string s with the user's locale.
> +# If s contains characters that can't be decoded with that locale, return
> +# the representation (in the user's locale) of the un-decoded string.
> +def str_decode(s):
> +    try:
> +        return s.decode()
> +    except UnicodeDecodeError:
> +        return repr(s)

 I think s.decode(errors='replace') is exactly what we want: it prints the
question mark character for things that can't be represented, just like ls does.

 Regards,
 Arnout
Yann E. MORIN Feb. 8, 2019, 5:25 p.m. UTC | #2
Arnout, All,

On 2019-02-08 00:40 +0100, Arnout Vandecappelle spake thusly:
> On 07/01/2019 23:05, Yann E. MORIN wrote:
> > +# If possible, try to decode the binary string s with the user's locale.
> > +# If s contains characters that can't be decoded with that locale, return
> > +# the representation (in the user's locale) of the un-decoded string.
> > +def str_decode(s):
> > +    try:
> > +        return s.decode()
> > +    except UnicodeDecodeError:
> > +        return repr(s)
> 
>  I think s.decode(errors='replace') is exactly what we want: it prints the
> question mark character for things that can't be represented, just like ls does.

In the case I used as example, i.e. œ (LATIN SMALL LIGATURE OE) as encoded
in iso8859-15, i.e. \xbd (e.g. stored in a file named 'meh'), with python
2.7:

    >>> with open('meh', 'rb') as f:
    ...    lines = f.readlines()
    ...
    >>> lines
    ['\xbd\n']
    >>> lines[0].decode(errors='replace')
    u'\ufffd\n'
    >>> print('{}'.format(lines[0].decode(errors='replace')))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)
    >>>

The output with python3 is indeed what you believe will happen, but I
don't think it is so nice:

    >>> lines
    [b'\xbd\n']
    >>> lines[0].decode(errors='replace')
    '�\n'
    >>> print('{}'.format(lines[0].decode(errors='replace')))
    �

    >>> 

And anyway, check-uniq-files should work with python 2.7, since it is part
of the build tools, and python 2.7 is what we require.

Regards,
Yann E. MORIN.
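
The two strategies compared in this exchange can be reproduced without the 'meh' file. The following is a minimal sketch, assuming Python 3 and an explicit UTF-8 codec so that the result does not depend on the interpreter defaults; `decode_replace` and `decode_or_repr` are hypothetical names used only for this illustration.

```python
RAW = b'\xbd\n'  # the byte for œ in iso8859-15; not valid UTF-8 on its own


def decode_replace(s):
    # errors='replace' substitutes U+FFFD for each undecodable byte.
    return s.decode('utf-8', errors='replace')


def decode_or_repr(s):
    # Same idea as str_decode() in the patch, with the codec spelled out.
    try:
        return s.decode('utf-8')
    except UnicodeDecodeError:
        return repr(s)


replaced = decode_replace(RAW)
fallback = decode_or_repr(RAW)
```

The `replace` result no longer carries the original byte value, while the `repr()` fallback keeps it visible as an escape sequence made only of ASCII characters.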
Arnout Vandecappelle Feb. 8, 2019, 8:42 p.m. UTC | #3
On 08/02/2019 18:25, Yann E. MORIN wrote:
> Arnout, All,
> 
> On 2019-02-08 00:40 +0100, Arnout Vandecappelle spake thusly:
>> On 07/01/2019 23:05, Yann E. MORIN wrote:
>>> +# If possible, try to decode the binary string s with the user's locale.
>>> +# If s contains characters that can't be decoded with that locale, return
>>> +# the representation (in the user's locale) of the un-decoded string.
>>> +def str_decode(s):
>>> +    try:
>>> +        return s.decode()
>>> +    except UnicodeDecodeError:
>>> +        return repr(s)
>>
>>  I think s.decode(errors='replace') is exactly what we want: it prints the
>> question mark character for things that can't be represented, just like ls does.
> 
> In the case I used as example, i.e. œ (LATIN SMALL LIGATURE OE) as encoded
> in iso8859-15, i.e. \xbd (e.g. stored in a file named 'meh'), with python
> 2.7:
> 
>     >>> with open('meh', 'rb') as f:
>     ...    lines = f.readlines()
>     ...
>     >>> lines
>     ['\xbd\n']
>     >>> lines[0].decode(errors='replace')
>     u'\ufffd\n'
>     >>> print('{}'.format(lines[0].decode(errors='replace')))
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)

 Meh, Python2 unicode handling always confuses the hell out of me...

 So, to do it well, in python3 you need to do:

print(b'\xc5\x93\xff'.decode(sys.getfilesystemencoding(),errors='replace'))

while in python2 the proper thing to do is

print(b'\xc5\x93\xff'.decode(sys.getfilesystemencoding(), \
	errors='replace').encode(sys.getfilesystemencoding(),errors='replace'))

(sys.getfilesystemencoding() makes sure we use the user's encoding so stuff that
can be printed gets properly printed).

 I couldn't find a way to do the right thing both in python2 and python3...

 Regards,
 Arnout

>     >>>
> 
> The output with python3 is indeed what you believe will happen, but I
> don't think it is so nice:
> 
>     >>> lines
>     [b'\xbd\n']
>     >>> lines[0].decode(errors='replace')
>     '�\n'
>     >>> print('{}'.format(lines[0].decode(errors='replace')))
>     �
> 
>     >>> 
> 
> And anyway, check-uniq file should work with python 2.7, since it is part
> of the build tools, and python 2.7 is what we require.
> 
> Regards,
> Yann E. MORIN.
>
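
One possible way to fold the two variants above into a single code path is an explicit interpreter-version check. This is only a sketch, assuming that returning the native `str` type of each interpreter is acceptable; `printable` and its `encoding` parameter are hypothetical and not part of the patch.

```python
import sys


def printable(raw, encoding=None):
    # Hypothetical helper: 'encoding' defaults to the filesystem encoding,
    # as in the one-liners above; 'utf-8' is an assumed last resort.
    enc = encoding or sys.getfilesystemencoding() or 'utf-8'
    text = raw.decode(enc, 'replace')
    if sys.version_info[0] < 3:
        # Python 2: mixing a unicode object into a byte str triggers an
        # implicit ASCII encode, so encode explicitly back to a byte str.
        return text.encode(enc, 'replace')
    return text


result = printable(b'\xc5\x93\xff', 'utf-8')
```

The version check is the price of this approach: the decoding step is shared, but the re-encoding step only makes sense on Python 2.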
Yann E. MORIN Feb. 8, 2019, 9:22 p.m. UTC | #4
Arnout, All,

On 2019-02-08 21:42 +0100, Arnout Vandecappelle spake thusly:
> On 08/02/2019 18:25, Yann E. MORIN wrote:
> > On 2019-02-08 00:40 +0100, Arnout Vandecappelle spake thusly:
> >> On 07/01/2019 23:05, Yann E. MORIN wrote:
> >>> +def str_decode(s):
> >>> +    try:
> >>> +        return s.decode()
> >>> +    except UnicodeDecodeError:
> >>> +        return repr(s)
> >>
> >>  I think s.decode(errors='replace') is exactly what we want: it prints the
> >> question mark character for things that can't be represented, just like ls does.
[--SNIP--]
> >     >>> lines[0].decode(errors='replace')
> >     u'\ufffd\n'
> >     >>> print('{}'.format(lines[0].decode(errors='replace')))
> >     Traceback (most recent call last):
> >       File "<stdin>", line 1, in <module>
> >     UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)
> 
>  Meh, Python2 unicode handling always confuses the hell out of me...
> 
>  So, to do it well, in python3 you need to do:
> print(b'\xc5\x93\xff'.decode(sys.getfilesystemencoding(),errors='replace'))
> 
> while in python2 the proper thing to do is
> 
> print(b'\xc5\x93\xff'.decode(sys.getfilesystemencoding(), \
> 	errors='replace').encode(sys.getfilesystemencoding(),errors='replace'))
> 
> (sys.getfilesystemencoding() makes sure we use the user's encoding so stuff that
> can be printed gets properly printed).
> 
>  I couldn't find a way to do the right thing both in python2 and python3...

At which point, my proposal is much simpler, and more understandable,
don't you think?

Regards,
Yann E. MORIN.
Arnout Vandecappelle Feb. 8, 2019, 10:02 p.m. UTC | #5
On 08/02/2019 22:22, Yann E. MORIN wrote:
> Arnout, All,
> 
> On 2019-02-08 21:42 +0100, Arnout Vandecappelle spake thusly:
>> On 08/02/2019 18:25, Yann E. MORIN wrote:
>>> On 2019-02-08 00:40 +0100, Arnout Vandecappelle spake thusly:
>>>> On 07/01/2019 23:05, Yann E. MORIN wrote:
>>>>> +def str_decode(s):
>>>>> +    try:
>>>>> +        return s.decode()
>>>>> +    except UnicodeDecodeError:
>>>>> +        return repr(s)
>>>>
>>>>  I think s.decode(errors='replace') is exactly what we want: it prints the
>>>> question mark character for things that can't be represented, just like ls does.
> [--SNIP--]
>>>     >>> lines[0].decode(errors='replace')
>>>     u'\ufffd\n'
>>>     >>> print('{}'.format(lines[0].decode(errors='replace')))
>>>     Traceback (most recent call last):
>>>       File "<stdin>", line 1, in <module>
>>>     UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)
>>
>>  Meh, Python2 unicode handling always confuses the hell out of me...
>>
>>  So, to do it well, in python3 you need to do:
>> print(b'\xc5\x93\xff'.decode(sys.getfilesystemencoding(),errors='replace'))
>>
>> while in python2 the proper thing to do is
>>
>> print(b'\xc5\x93\xff'.decode(sys.getfilesystemencoding(), \
>> 	errors='replace').encode(sys.getfilesystemencoding(),errors='replace'))
>>
>> (sys.getfilesystemencoding() makes sure we use the user's encoding so stuff that
>> can be printed gets properly printed).
>>
>>  I couldn't find a way to do the right thing both in python2 and python3...
> 
> At which point, my proposal is much simpler, and more understandable,
> don't you think?

 Absolutely. Well, it's imperfect because it prints the ugly b'....' in case
there is a non-decodable character, but it's good enough.

 Regards,
 Arnout

Patch

diff --git a/support/scripts/check-uniq-files b/support/scripts/check-uniq-files
index eb92724e42..e95a134168 100755
--- a/support/scripts/check-uniq-files
+++ b/support/scripts/check-uniq-files
@@ -7,6 +7,16 @@  from collections import defaultdict
 warn = 'Warning: {0} file "{1}" is touched by more than one package: {2}\n'
 
 
+# If possible, try to decode the binary string s with the user's locale.
+# If s contains characters that can't be decoded with that locale, return
+# the representation (in the user's locale) of the un-decoded string.
+def str_decode(s):
+    try:
+        return s.decode()
+    except UnicodeDecodeError:
+        return repr(s)
+
+
 def main():
     parser = argparse.ArgumentParser()
     parser.add_argument('packages_file_list', nargs='*',
@@ -32,16 +42,9 @@  def main():
 
     for file in file_to_pkg:
         if len(file_to_pkg[file]) > 1:
-            # If possible, try to decode the binary strings with
-            # the default user's locale
-            try:
-                sys.stderr.write(warn.format(args.type, file.decode(),
-                                             [p.decode() for p in file_to_pkg[file]]))
-            except UnicodeDecodeError:
-                # ... but fallback to just dumping them raw if they
-                # contain non-representable chars
-                sys.stderr.write(warn.format(args.type, file,
-                                             file_to_pkg[file]))
+            sys.stderr.write(warn.format(args.type, str_decode(file),
+                                         ", ".join([str_decode(p)
+                                                    for p in file_to_pkg[file]])))
 
 
 if __name__ == "__main__":