diff mbox

pwclient: fix handling of UTF-8 char in submitter name

Message ID 1291736900-66294-1-git-send-email-andreas.devel@googlemail.com
State Accepted
Headers show

Commit Message

Andreas Bießmann Dec. 7, 2010, 3:48 p.m. UTC
This patch fixes following bug in 'list':

---8<---
# pwclient list -p uboot -w andreas.devel | grep New
Traceback (most recent call last):
  File "/Users/andreas/bin/pwclient", line 463, in <module>
    main()
  File "/Users/andreas/bin/pwclient", line 411, in main
    action_list(rpc, filt, submitter_str, delegate_str)
  File "/Users/andreas/bin/pwclient", line 182, in action_list
    (person['name'], person['email'])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in
position 32: ordinal not in range(128)
--->8---

Signed-off-by: Andreas Bießmann <andreas.devel@googlemail.com>
---
One note, I could reproduce this error only when output is piped.

 apps/patchwork/bin/pwclient |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

Comments

Jeremy Kerr Dec. 10, 2010, 4:31 a.m. UTC | #1
Hi Andreas,

> This patch fixes following bug in 'list':
> 
> ---8<---
> # pwclient list -p uboot -w andreas.devel | grep New
> Traceback (most recent call last):
>   File "/Users/andreas/bin/pwclient", line 463, in <module>
>     main()
>   File "/Users/andreas/bin/pwclient", line 411, in main
>     action_list(rpc, filt, submitter_str, delegate_str)
>   File "/Users/andreas/bin/pwclient", line 182, in action_list
>     (person['name'], person['email'])
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in
> position 32: ordinal not in range(128)
> --->8---
> 
> Signed-off-by: Andreas Bießmann <andreas.devel@googlemail.com>

Nice catch, applied.

Thanks for the patch!


Jeremy
Wolfgang Denk Dec. 10, 2010, 5:51 a.m. UTC | #2
Dear Jeremy Kerr,

In message <201012101231.51075.jk@ozlabs.org> you wrote:
> > This patch fixes following bug in 'list':
> > 
> > ---8<---
> > # pwclient list -p uboot -w andreas.devel | grep New
> > Traceback (most recent call last):
> >   File "/Users/andreas/bin/pwclient", line 463, in <module>
> >     main()
> >   File "/Users/andreas/bin/pwclient", line 411, in main
> >     action_list(rpc, filt, submitter_str, delegate_str)
> >   File "/Users/andreas/bin/pwclient", line 182, in action_list
> >     (person['name'], person['email'])
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in
> > position 32: ordinal not in range(128)
> > --->8---
> > 
> > Signed-off-by: Andreas Bießmann <andreas.devel@googlemail.com>
> 
> Nice catch, applied.

Hm...

When running this on the U-Boot list I now get sometimes things like
this:

Traceback (most recent call last):
  File "/home/wd/bin/pwparser", line 241, in <module>
    sys.exit(main(sys.argv))
  File "/home/wd/bin/pwparser", line 226, in main
    content = sys.stdin.read().decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 478: invalid continuation byte


Best regards,

Wolfgang Denk
Jeremy Kerr Dec. 10, 2010, 5:54 a.m. UTC | #3
Hi Wolfgang,

> When running this on the U-Boot list I now get sometimes things like
> this:
> 
> Traceback (most recent call last):
>   File "/home/wd/bin/pwparser", line 241, in <module>
>     sys.exit(main(sys.argv))
>   File "/home/wd/bin/pwparser", line 226, in main
>     content = sys.stdin.read().decode('utf-8')
>   File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 478:
> invalid continuation byte

Sounds like your input isn't valid utf-8? Would an option to select the 
encoding help?

Cheers,


Jeremy
Wolfgang Denk Dec. 10, 2010, 6:40 a.m. UTC | #4
Dear Jeremy Kerr,

In message <201012101354.32301.jk@ozlabs.org> you wrote:
> 
> > When running this on the U-Boot list I now get sometimes things like
> > this:
> > 
> > Traceback (most recent call last):
> >   File "/home/wd/bin/pwparser", line 241, in <module>
> >     sys.exit(main(sys.argv))
> >   File "/home/wd/bin/pwparser", line 226, in main
> >     content = sys.stdin.read().decode('utf-8')
> >   File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
> >     return codecs.utf_8_decode(input, errors, True)
> > UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 478:
> > invalid continuation byte
> 
> Sounds like your input isn't valid utf-8? Would an option to select the 
> encoding help?

I think a number of messages have special characters in iso8859-1 and
iso8859-9.

A n option would not really help. I'm running into this when
auto-updating the status of some patches using a script similat to
what you just posted.

Best regards,

Wolfgang Denk
Jeremy Kerr Dec. 10, 2010, 7:07 a.m. UTC | #5
Hi Wolfgang,

> I think a number of messages have special characters in iso8859-1 and
> iso8859-9.
> 
> A n option would not really help. I'm running into this when
> auto-updating the status of some patches using a script similat to
> what you just posted.

OK, sounds like we need the parser to be able to take an mbox and read the 
encoding from the headers then.

Cheers,


Jeremy
Jeremy Kerr Dec. 10, 2010, 7:10 a.m. UTC | #6
Hi Wolfgang,

> > I think a number of messages have special characters in iso8859-1 and
> > iso8859-9.
> > 
> > A n option would not really help. I'm running into this when
> > auto-updating the status of some patches using a script similat to
> > what you just posted.
> 
> OK, sounds like we need the parser to be able to take an mbox and read the
> encoding from the headers then.

Hm, but these aren't mbox messages, right? (they'd be git commits).

I think in this case, we have to assume some encoding, as it isn't specified 
in any metadata. Autodetection is just going to cause pain.

Cheers,


Jeremy
Mauro Carvalho Chehab Dec. 10, 2010, 10:16 a.m. UTC | #7
Em 10-12-2010 05:10, Jeremy Kerr escreveu:
> Hi Wolfgang,
> 
>>> I think a number of messages have special characters in iso8859-1 and
>>> iso8859-9.
>>>
>>> A n option would not really help. I'm running into this when
>>> auto-updating the status of some patches using a script similat to
>>> what you just posted.
>>
>> OK, sounds like we need the parser to be able to take an mbox and read the
>> encoding from the headers then.
> 
> Hm, but these aren't mbox messages, right? (they'd be git commits).
> 
> I think in this case, we have to assume some encoding, as it isn't specified 
> in any metadata. Autodetection is just going to cause pain.

I never used it, nor I am a python expert, but it sems that django defines a class
of lazy utf decoders that won't cause python to crash due to a string that it is
not following the proper encoding:
	http://docs.djangoproject.com/en/dev/ref/unicode/

I had one interesting case of a patch with a driver from staging being changed/moved
to another place, with a string inside using a non-utf8. Patchwork simply discarded
this patch. I only noticed it because this were patch 6 of a sequence of patches,
so I went to the ML to double check what were missing.

Patchwork should be reliable enough to just import a patch, even if python dislikes
the charset.

Cheers,
Mauro
Jeremy Kerr Dec. 13, 2010, 12:58 a.m. UTC | #8
Hi Mauro,

> I never used it, nor I am a python expert, but it sems that django defines
> a class of lazy utf decoders that won't cause python to crash due to a
> string that it is not following the proper encoding:
> 	http://docs.djangoproject.com/en/dev/ref/unicode/
> 
> I had one interesting case of a patch with a driver from staging being
> changed/moved to another place, with a string inside using a non-utf8.
> Patchwork simply discarded this patch. I only noticed it because this were
> patch 6 of a sequence of patches, so I went to the ML to double check what
> were missing.

The parser (and pwclient) need to be fairly independent of django, as they're 
both intended to be run on machine with a fairly minimal python environment. 

However, the unicode decoder has a 'replace'-mode, where invalid byte 
sequences are replaced with U+FFFD REPLACEMENT CHARACTER:

  '\x80'.decode('utf-8', 'replace') = '\ufffd'

The reason that I don't do this currently is that patchwork would now be 
altering your patches to something that the author didn't write. If you were 
to apply the resulting patch, you would be introducing the U+FFFD character to 
your source tree.

However, dropping patches isn't a great solution either, so other alternatives 
welcome :)

Cheers,


Jeremy
Mauro Carvalho Chehab Dec. 13, 2010, 2:03 a.m. UTC | #9
Em 12-12-2010 22:58, Jeremy Kerr escreveu:
> Hi Mauro,
> 
>> I never used it, nor I am a python expert, but it sems that django defines
>> a class of lazy utf decoders that won't cause python to crash due to a
>> string that it is not following the proper encoding:
>> 	http://docs.djangoproject.com/en/dev/ref/unicode/
>>
>> I had one interesting case of a patch with a driver from staging being
>> changed/moved to another place, with a string inside using a non-utf8.
>> Patchwork simply discarded this patch. I only noticed it because this were
>> patch 6 of a sequence of patches, so I went to the ML to double check what
>> were missing.
> 
> The parser (and pwclient) need to be fairly independent of django, as they're 
> both intended to be run on machine with a fairly minimal python environment.

I don't see much problem for the parser, as it runs on a server, but I agree
that a lighter environment at the client side is interesting. Yet, it is better to
install some additional python packages locally than to loose patches.
 
> 
> However, the unicode decoder has a 'replace'-mode, where invalid byte 
> sequences are replaced with U+FFFD REPLACEMENT CHARACTER:
> 
>   '\x80'.decode('utf-8', 'replace') = '\ufffd'

Interesting.
 
> The reason that I don't do this currently is that patchwork would now be 
> altering your patches to something that the author didn't write. If you were 
> to apply the resulting patch, you would be introducing the U+FFFD character to 
> your source tree.
> 
> However, dropping patches isn't a great solution either, so other alternatives 
> welcome :)

Would it be possible to handle the error at decode with "try"? If so, maybe you could
add some logic there to try to decode first with the email charset. Then, try utf-8. 
If both fails, try to decode with some other protocols, like iso8859-11. This will
likely catch 99% of the issues. If everything fails, it is preferred to use the
replacement character than to loose the patch. 

I would also add a meta-tag to inticate the cases where patchwork is guessing a
type (or using a replacement character). This way, the maintainer may manually 
take care of the fixes.

Cheers,
Mauro
Jeremy Kerr Dec. 13, 2010, 7:17 a.m. UTC | #10
Hi Mauro,

> I don't see much problem for the parser, as it runs on a server, but I
> agree that a lighter environment at the client side is interesting. Yet,
> it is better to install some additional python packages locally than to
> loose patches.

The parser may be run on client machines currently, to generate patch hashes. 

I think the django utils are fairly thin wrappers around the standard unicode 
objects though, so we should be alright with what's in the vanilla python 
install.

> > The reason that I don't do this currently is that patchwork would now be
> > altering your patches to something that the author didn't write. If you
> > were to apply the resulting patch, you would be introducing the U+FFFD
> > character to your source tree.
> > 
> > However, dropping patches isn't a great solution either, so other
> > alternatives welcome :)
> 
> Would it be possible to handle the error at decode with "try"? If so, maybe
> you could add some logic there to try to decode first with the email
> charset. Then, try utf-8. If both fails, try to decode with some other
> protocols, like iso8859-11. This will likely catch 99% of the issues. If
> everything fails, it is preferred to use the replacement character than to
> loose the patch.
> 
> I would also add a meta-tag to inticate the cases where patchwork is
> guessing a type (or using a replacement character). This way, the
> maintainer may manually take care of the fixes.

That sounds pretty reasonable. For cases like these, I'd like to add 
'warnings' to the patch; either a 'had to guess the charset' or 'invalid 
encoding', depending on what we had to do to get a sucessful parse. The 
warnings would then appear in the web UI, or on stderr when running pwclient.

*adds to the TODO list*

Cheers,


Jeremy
Eduardo Habkost Dec. 13, 2010, 12:24 p.m. UTC | #11
On Mon, Dec 13, 2010 at 03:17:24PM +0800, Jeremy Kerr wrote:
> > > The reason that I don't do this currently is that patchwork would now be
> > > altering your patches to something that the author didn't write. If you
> > > were to apply the resulting patch, you would be introducing the U+FFFD
> > > character to your source tree.
> > > 
> > > However, dropping patches isn't a great solution either, so other
> > > alternatives welcome :)
> > 
> > Would it be possible to handle the error at decode with "try"? If so, maybe
> > you could add some logic there to try to decode first with the email
> > charset. Then, try utf-8. If both fails, try to decode with some other
> > protocols, like iso8859-11. This will likely catch 99% of the issues. If
> > everything fails, it is preferred to use the replacement character than to
> > loose the patch.
> > 
> > I would also add a meta-tag to inticate the cases where patchwork is
> > guessing a type (or using a replacement character). This way, the
> > maintainer may manually take care of the fixes.
> 
> That sounds pretty reasonable. For cases like these, I'd like to add 
> 'warnings' to the patch; either a 'had to guess the charset' or 'invalid 
> encoding', depending on what we had to do to get a sucessful parse. The 
> warnings would then appear in the web UI, or on stderr when running pwclient.
> 
> *adds to the TODO list*

Why not just handle and store the patch as an array of bytes (Python
'str' type) instead of a unicode string?

The restriction that every patch should be valid unicode makes it
impossible to patch existing source files that already have non-utf8
data inside them (I suppose this includes source trees where files are
encoded as iso-8859-1, as the unicode diff won't be encoded back to the
original encoding when exporting the patches from Patchwork, will it?).

This would require changing the database model and xmlrpc API to use
binary data (I hope Django support it) instead of a unicode string, but
it sounds better than piling up unicode encoding/decoding hacks.
Jeremy Kerr Dec. 13, 2010, 1:55 p.m. UTC | #12
Hi Eduardo,

> Why not just handle and store the patch as an array of bytes (Python
> 'str' type) instead of a unicode string?

Basically, because we need to process the patch itself; either extracting it 
from the email message, or for finding the hash. Both of these require looking 
into the content of the patch, which means we need to be able to decode it.

> The restriction that every patch should be valid unicode makes it
> impossible to patch existing source files that already have non-utf8
> data inside them (I suppose this includes source trees where files are
> encoded as iso-8859-1, as the unicode diff won't be encoded back to the
> original encoding when exporting the patches from Patchwork, will it?).
> 
> This would require changing the database model and xmlrpc API to use
> binary data (I hope Django support it)

no, django doesn't support it out of the box, I believe this is a django 
design decision.

> instead of a unicode string, but
> it sounds better than piling up unicode encoding/decoding hacks.

Cheers,


Jeremy
Eduardo Habkost Dec. 13, 2010, 2:34 p.m. UTC | #13
On Mon, Dec 13, 2010 at 09:55:18PM +0800, Jeremy Kerr wrote:
> Hi Eduardo,
> 
> > Why not just handle and store the patch as an array of bytes (Python
> > 'str' type) instead of a unicode string?
> 
> Basically, because we need to process the patch itself; either extracting it 
> from the email message, or for finding the hash. Both of these require looking 
> into the content of the patch, which means we need to be able to decode it.

I don't get it. You can process and look into a byte array as easily as
you can process a unicode string.

patch(1) operates at byte level, it doesn't care about unicode and
character encoding. It just get a description of byte-level changes to
source files. So we don't need to pretend that every diff is going to be
valid unicode.

I understand it is hard to change this on Patchwork today, though. It
would affect the database models and the xmlrpc interface.


> 
> > The restriction that every patch should be valid unicode makes it
> > impossible to patch existing source files that already have non-utf8
> > data inside them (I suppose this includes source trees where files are
> > encoded as iso-8859-1, as the unicode diff won't be encoded back to the
> > original encoding when exporting the patches from Patchwork, will it?).
> > 
> > This would require changing the database model and xmlrpc API to use
> > binary data (I hope Django support it)
> 
> no, django doesn't support it out of the box, I believe this is a django 
> design decision.

Ouch. I understand that discouraging storing binary data is a good
thing, but I didn't expect Django to simply not allow it.
Stefan Tauner May 26, 2011, 1:19 p.m. UTC | #14
hey there

i am topposting because this topic is very old and my problem is "only"
strongly related to this discussion.

i tried to "pwclient apply <id>" this patch:
http://patchwork.coreboot.org/patch/2997/

pwclient bails with:
> pwclient apply 2997
> Applying patch #2997 to current directory
> Description: ichspi: fix unused FREG detection
> Traceback (most recent call last):
>   File "/home/ameno/bin/pwclient", line 463, in <module>
>     main()
>   File "/home/ameno/bin/pwclient", line 446, in main
>     action_apply(rpc, patch_id)
>   File "/home/ameno/bin/pwclient", line 263, in action_apply
>     proc.communicate(s)
>   File "/usr/lib/python2.6/subprocess.py", line 680, in communicate
>     self.stdin.write(input)

obviously it does not like the '•' inside the mail... which is
even more unfortunate than the original problem described in this
thread because it is not even part of the patch itself.

changing line 263 by adding
'.encode("utf-8")'
resulting in
'        proc.communicate(s.encode("utf-8"))'
fixes this problem but probably with the side effects mentioned here.

appending the encode call to line 259 would probably "solve" the
problem for the patch name/subject btw.

i have not and will not follow the development of pwclient, but would
be happy to receive and replies via cc, thanks.

> Em 12-12-2010 22:58, Jeremy Kerr escreveu:
> > Hi Mauro,
> > 
> >> I never used it, nor I am a python expert, but it sems that django defines
> >> a class of lazy utf decoders that won't cause python to crash due to a
> >> string that it is not following the proper encoding:
> >> 	http://docs.djangoproject.com/en/dev/ref/unicode/
> >>
> >> I had one interesting case of a patch with a driver from staging being
> >> changed/moved to another place, with a string inside using a non-utf8.
> >> Patchwork simply discarded this patch. I only noticed it because this were
> >> patch 6 of a sequence of patches, so I went to the ML to double check what
> >> were missing.
> > 
> > The parser (and pwclient) need to be fairly independent of django, as they're 
> > both intended to be run on machine with a fairly minimal python environment.
> 
> I don't see much problem for the parser, as it runs on a server, but I agree
> that a lighter environment at the client side is interesting. Yet, it is better to
> install some additional python packages locally than to loose patches.
>  
> > 
> > However, the unicode decoder has a 'replace'-mode, where invalid byte 
> > sequences are replaced with U+FFFD REPLACEMENT CHARACTER:
> > 
> >   '\x80'.decode('utf-8', 'replace') = '\ufffd'
> 
> Interesting.
>  
> > The reason that I don't do this currently is that patchwork would now be 
> > altering your patches to something that the author didn't write. If you were 
> > to apply the resulting patch, you would be introducing the U+FFFD character to 
> > your source tree.
> > 
> > However, dropping patches isn't a great solution either, so other alternatives 
> > welcome :)
> 
> Would it be possible to handle the error at decode with "try"? If so, maybe you could
> add some logic there to try to decode first with the email charset. Then, try utf-8. 
> If both fails, try to decode with some other protocols, like iso8859-11. This will
> likely catch 99% of the issues. If everything fails, it is preferred to use the
> replacement character than to loose the patch. 
> 
> I would also add a meta-tag to inticate the cases where patchwork is guessing a
> type (or using a replacement character). This way, the maintainer may manually 
> take care of the fixes.
> 
> Cheers,
> Mauro
diff mbox

Patch

diff --git a/apps/patchwork/bin/pwclient b/apps/patchwork/bin/pwclient
index dc836e9..dba68fb 100755
--- a/apps/patchwork/bin/pwclient
+++ b/apps/patchwork/bin/pwclient
@@ -179,7 +179,8 @@  def action_list(rpc, filter, submitter_str, delegate_str):
             for id in ids:
                 person = rpc.person_get(id)
                 print "Patches submitted by %s <%s>:" % \
-                        (person['name'], person['email'])
+                        (unicode(person['name']).encode("utf-8"), \
+                         unicode(person['email']).encode("utf-8"))
                 f = filter
                 f.add("submitter_id", id)
                 patches = rpc.patch_list(f.d)