Message ID | 1291736900-66294-1-git-send-email-andreas.devel@googlemail.com |
---|---|
State | Accepted |
Headers | show |
Hi Andreas, > This patch fixes following bug in 'list': > > ---8<--- > # pwclient list -p uboot -w andreas.devel | grep New > Traceback (most recent call last): > File "/Users/andreas/bin/pwclient", line 463, in <module> > main() > File "/Users/andreas/bin/pwclient", line 411, in main > action_list(rpc, filt, submitter_str, delegate_str) > File "/Users/andreas/bin/pwclient", line 182, in action_list > (person['name'], person['email']) > UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in > position 32: ordinal not in range(128) > --->8--- > > Signed-off-by: Andreas Bießmann <andreas.devel@googlemail.com> Nice catch, applied. Thanks for the patch! Jeremy
Dear Jeremy Kerr, In message <201012101231.51075.jk@ozlabs.org> you wrote: > > This patch fixes following bug in 'list': > > > > ---8<--- > > # pwclient list -p uboot -w andreas.devel | grep New > > Traceback (most recent call last): > > File "/Users/andreas/bin/pwclient", line 463, in <module> > > main() > > File "/Users/andreas/bin/pwclient", line 411, in main > > action_list(rpc, filt, submitter_str, delegate_str) > > File "/Users/andreas/bin/pwclient", line 182, in action_list > > (person['name'], person['email']) > > UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in > > position 32: ordinal not in range(128) > > --->8--- > > > > Signed-off-by: Andreas Bießmann <andreas.devel@googlemail.com> > > Nice catch, applied. Hm... When running this on the U-Boot list I now get sometimes things like this: Traceback (most recent call last): File "/home/wd/bin/pwparser", line 241, in <module> sys.exit(main(sys.argv)) File "/home/wd/bin/pwparser", line 226, in main content = sys.stdin.read().decode('utf-8') File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 478: invalid continuation byte Best regards, Wolfgang Denk
Hi Wolfgang, > When running this on the U-Boot list I now get sometimes things like > this: > > Traceback (most recent call last): > File "/home/wd/bin/pwparser", line 241, in <module> > sys.exit(main(sys.argv)) > File "/home/wd/bin/pwparser", line 226, in main > content = sys.stdin.read().decode('utf-8') > File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode > return codecs.utf_8_decode(input, errors, True) > UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 478: > invalid continuation byte Sounds like your input isn't valid utf-8? Would an option to select the encoding help? Cheers, Jeremy
Dear Jeremy Kerr, In message <201012101354.32301.jk@ozlabs.org> you wrote: > > > When running this on the U-Boot list I now get sometimes things like > > this: > > > > Traceback (most recent call last): > > File "/home/wd/bin/pwparser", line 241, in <module> > > sys.exit(main(sys.argv)) > > File "/home/wd/bin/pwparser", line 226, in main > > content = sys.stdin.read().decode('utf-8') > > File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode > > return codecs.utf_8_decode(input, errors, True) > > UnicodeDecodeError: 'utf8' codec can't decode byte 0xdf in position 478: > > invalid continuation byte > > Sounds like your input isn't valid utf-8? Would an option to select the > encoding help? I think a number of messages have special characters in iso8859-1 and iso8859-9. A n option would not really help. I'm running into this when auto-updating the status of some patches using a script similat to what you just posted. Best regards, Wolfgang Denk
Hi Wolfgang, > I think a number of messages have special characters in iso8859-1 and > iso8859-9. > > A n option would not really help. I'm running into this when > auto-updating the status of some patches using a script similat to > what you just posted. OK, sounds like we need the parser to be able to take an mbox and read the encoding from the headers then. Cheers, Jeremy
Hi Wolfgang, > > I think a number of messages have special characters in iso8859-1 and > > iso8859-9. > > > > A n option would not really help. I'm running into this when > > auto-updating the status of some patches using a script similat to > > what you just posted. > > OK, sounds like we need the parser to be able to take an mbox and read the > encoding from the headers then. Hm, but these aren't mbox messages, right? (they'd be git commits). I think in this case, we have to assume some encoding, as it isn't specified in any metadata. Autodetection is just going to cause pain. Cheers, Jeremy
Em 10-12-2010 05:10, Jeremy Kerr escreveu: > Hi Wolfgang, > >>> I think a number of messages have special characters in iso8859-1 and >>> iso8859-9. >>> >>> A n option would not really help. I'm running into this when >>> auto-updating the status of some patches using a script similat to >>> what you just posted. >> >> OK, sounds like we need the parser to be able to take an mbox and read the >> encoding from the headers then. > > Hm, but these aren't mbox messages, right? (they'd be git commits). > > I think in this case, we have to assume some encoding, as it isn't specified > in any metadata. Autodetection is just going to cause pain. I never used it, nor I am a python expert, but it sems that django defines a class of lazy utf decoders that won't cause python to crash due to a string that it is not following the proper encoding: http://docs.djangoproject.com/en/dev/ref/unicode/ I had one interesting case of a patch with a driver from staging being changed/moved to another place, with a string inside using a non-utf8. Patchwork simply discarded this patch. I only noticed it because this were patch 6 of a sequence of patches, so I went to the ML to double check what were missing. Patchwork should be reliable enough to just import a patch, even if python dislikes the charset. Cheers, Mauro
Hi Mauro, > I never used it, nor I am a python expert, but it sems that django defines > a class of lazy utf decoders that won't cause python to crash due to a > string that it is not following the proper encoding: > http://docs.djangoproject.com/en/dev/ref/unicode/ > > I had one interesting case of a patch with a driver from staging being > changed/moved to another place, with a string inside using a non-utf8. > Patchwork simply discarded this patch. I only noticed it because this were > patch 6 of a sequence of patches, so I went to the ML to double check what > were missing. The parser (and pwclient) need to be fairly independent of django, as they're both intended to be run on machine with a fairly minimal python environment. However, the unicode decoder has a 'replace'-mode, where invalid byte sequences are replaced with U+FFFD REPLACEMENT CHARACTER: '\x80'.decode('utf-8', 'replace') = '\ufffd' The reason that I don't do this currently is that patchwork would now be altering your patches to something that the author didn't write. If you were to apply the resulting patch, you would be introducing the U+FFFD character to your source tree. However, dropping patches isn't a great solution either, so other alternatives welcome :) Cheers, Jeremy
Em 12-12-2010 22:58, Jeremy Kerr escreveu: > Hi Mauro, > >> I never used it, nor I am a python expert, but it sems that django defines >> a class of lazy utf decoders that won't cause python to crash due to a >> string that it is not following the proper encoding: >> http://docs.djangoproject.com/en/dev/ref/unicode/ >> >> I had one interesting case of a patch with a driver from staging being >> changed/moved to another place, with a string inside using a non-utf8. >> Patchwork simply discarded this patch. I only noticed it because this were >> patch 6 of a sequence of patches, so I went to the ML to double check what >> were missing. > > The parser (and pwclient) need to be fairly independent of django, as they're > both intended to be run on machine with a fairly minimal python environment. I don't see much problem for the parser, as it runs on a server, but I agree that a lighter environment at the client side is interesting. Yet, it is better to install some additional python packages locally than to loose patches. > > However, the unicode decoder has a 'replace'-mode, where invalid byte > sequences are replaced with U+FFFD REPLACEMENT CHARACTER: > > '\x80'.decode('utf-8', 'replace') = '\ufffd' Interesting. > The reason that I don't do this currently is that patchwork would now be > altering your patches to something that the author didn't write. If you were > to apply the resulting patch, you would be introducing the U+FFFD character to > your source tree. > > However, dropping patches isn't a great solution either, so other alternatives > welcome :) Would it be possible to handle the error at decode with "try"? If so, maybe you could add some logic there to try to decode first with the email charset. Then, try utf-8. If both fails, try to decode with some other protocols, like iso8859-11. This will likely catch 99% of the issues. If everything fails, it is preferred to use the replacement character than to loose the patch. I would also add a meta-tag to inticate the cases where patchwork is guessing a type (or using a replacement character). This way, the maintainer may manually take care of the fixes. Cheers, Mauro
Hi Mauro, > I don't see much problem for the parser, as it runs on a server, but I > agree that a lighter environment at the client side is interesting. Yet, > it is better to install some additional python packages locally than to > loose patches. The parser may be run on client machines currently, to generate patch hashes. I think the django utils are fairly thin wrappers around the standard unicode objects though, so we should be alright with what's in the vanilla python install. > > The reason that I don't do this currently is that patchwork would now be > > altering your patches to something that the author didn't write. If you > > were to apply the resulting patch, you would be introducing the U+FFFD > > character to your source tree. > > > > However, dropping patches isn't a great solution either, so other > > alternatives welcome :) > > Would it be possible to handle the error at decode with "try"? If so, maybe > you could add some logic there to try to decode first with the email > charset. Then, try utf-8. If both fails, try to decode with some other > protocols, like iso8859-11. This will likely catch 99% of the issues. If > everything fails, it is preferred to use the replacement character than to > loose the patch. > > I would also add a meta-tag to inticate the cases where patchwork is > guessing a type (or using a replacement character). This way, the > maintainer may manually take care of the fixes. That sounds pretty reasonable. For cases like these, I'd like to add 'warnings' to the patch; either a 'had to guess the charset' or 'invalid encoding', depending on what we had to do to get a sucessful parse. The warnings would then appear in the web UI, or on stderr when running pwclient. *adds to the TODO list* Cheers, Jeremy
On Mon, Dec 13, 2010 at 03:17:24PM +0800, Jeremy Kerr wrote: > > > The reason that I don't do this currently is that patchwork would now be > > > altering your patches to something that the author didn't write. If you > > > were to apply the resulting patch, you would be introducing the U+FFFD > > > character to your source tree. > > > > > > However, dropping patches isn't a great solution either, so other > > > alternatives welcome :) > > > > Would it be possible to handle the error at decode with "try"? If so, maybe > > you could add some logic there to try to decode first with the email > > charset. Then, try utf-8. If both fails, try to decode with some other > > protocols, like iso8859-11. This will likely catch 99% of the issues. If > > everything fails, it is preferred to use the replacement character than to > > loose the patch. > > > > I would also add a meta-tag to inticate the cases where patchwork is > > guessing a type (or using a replacement character). This way, the > > maintainer may manually take care of the fixes. > > That sounds pretty reasonable. For cases like these, I'd like to add > 'warnings' to the patch; either a 'had to guess the charset' or 'invalid > encoding', depending on what we had to do to get a sucessful parse. The > warnings would then appear in the web UI, or on stderr when running pwclient. > > *adds to the TODO list* Why not just handle and store the patch as an array of bytes (Python 'str' type) instead of a unicode string? The restriction that every patch should be valid unicode makes it impossible to patch existing source files that already have non-utf8 data inside them (I suppose this includes source trees where files are encoded as iso-8859-1, as the unicode diff won't be encoded back to the original encoding when exporting the patches from Patchwork, will it?). This would require changing the database model and xmlrpc API to use binary data (I hope Django support it) instead of a unicode string, but it sounds better than piling up unicode encoding/decoding hacks.
Hi Eduardo, > Why not just handle and store the patch as an array of bytes (Python > 'str' type) instead of a unicode string? Basically, because we need to process the patch itself; either extracting it from the email message, or for finding the hash. Both of these require looking into the content of the patch, which means we need to be able to decode it. > The restriction that every patch should be valid unicode makes it > impossible to patch existing source files that already have non-utf8 > data inside them (I suppose this includes source trees where files are > encoded as iso-8859-1, as the unicode diff won't be encoded back to the > original encoding when exporting the patches from Patchwork, will it?). > > This would require changing the database model and xmlrpc API to use > binary data (I hope Django support it) no, django doesn't support it out of the box, I believe this is a django design decision. > instead of a unicode string, but > it sounds better than piling up unicode encoding/decoding hacks. Cheers, Jeremy
On Mon, Dec 13, 2010 at 09:55:18PM +0800, Jeremy Kerr wrote: > Hi Eduardo, > > > Why not just handle and store the patch as an array of bytes (Python > > 'str' type) instead of a unicode string? > > Basically, because we need to process the patch itself; either extracting it > from the email message, or for finding the hash. Both of these require looking > into the content of the patch, which means we need to be able to decode it. I don't get it. You can process and look into a byte array as easily as you can process a unicode string. patch(1) operates at byte level, it doesn't care about unicode and character encoding. It just get a description of byte-level changes to source files. So we don't need to pretend that every diff is going to be valid unicode. I understand it is hard to change this on Patchwork today, though. It would affect the database models and the xmlrpc interface. > > > The restriction that every patch should be valid unicode makes it > > impossible to patch existing source files that already have non-utf8 > > data inside them (I suppose this includes source trees where files are > > encoded as iso-8859-1, as the unicode diff won't be encoded back to the > > original encoding when exporting the patches from Patchwork, will it?). > > > > This would require changing the database model and xmlrpc API to use > > binary data (I hope Django support it) > > no, django doesn't support it out of the box, I believe this is a django > design decision. Ouch. I understand that discouraging storing binary data is a good thing, but I didn't expect Django to simply not allow it.
hey there i am topposting because this topic is very old and my problem is "only" strongly related to this discussion. i tried to "pwclient apply <id>" this patch: http://patchwork.coreboot.org/patch/2997/ pwclient bails with: > pwclient apply 2997 > Applying patch #2997 to current directory > Description: ichspi: fix unused FREG detection > Traceback (most recent call last): > File "/home/ameno/bin/pwclient", line 463, in <module> > main() > File "/home/ameno/bin/pwclient", line 446, in main > action_apply(rpc, patch_id) > File "/home/ameno/bin/pwclient", line 263, in action_apply > proc.communicate(s) > File "/usr/lib/python2.6/subprocess.py", line 680, in communicate > self.stdin.write(input) obviously it does not like the '•' inside the mail... which is even more unfortunate than the original problem described in this thread because it is not even part of the patch itself. changing line 263 by adding '.encode("utf-8")' resulting in ' proc.communicate(s.encode("utf-8"))' fixes this problem but probably with the side effects mentioned here. appending the encode call to line 259 would probably "solve" the problem for the patch name/subject btw. i have not and will not follow the development of pwclient, but would be happy to receive and replies via cc, thanks. > Em 12-12-2010 22:58, Jeremy Kerr escreveu: > > Hi Mauro, > > > >> I never used it, nor I am a python expert, but it sems that django defines > >> a class of lazy utf decoders that won't cause python to crash due to a > >> string that it is not following the proper encoding: > >> http://docs.djangoproject.com/en/dev/ref/unicode/ > >> > >> I had one interesting case of a patch with a driver from staging being > >> changed/moved to another place, with a string inside using a non-utf8. > >> Patchwork simply discarded this patch. I only noticed it because this were > >> patch 6 of a sequence of patches, so I went to the ML to double check what > >> were missing. > > > > The parser (and pwclient) need to be fairly independent of django, as they're > > both intended to be run on machine with a fairly minimal python environment. > > I don't see much problem for the parser, as it runs on a server, but I agree > that a lighter environment at the client side is interesting. Yet, it is better to > install some additional python packages locally than to loose patches. > > > > > However, the unicode decoder has a 'replace'-mode, where invalid byte > > sequences are replaced with U+FFFD REPLACEMENT CHARACTER: > > > > '\x80'.decode('utf-8', 'replace') = '\ufffd' > > Interesting. > > > The reason that I don't do this currently is that patchwork would now be > > altering your patches to something that the author didn't write. If you were > > to apply the resulting patch, you would be introducing the U+FFFD character to > > your source tree. > > > > However, dropping patches isn't a great solution either, so other alternatives > > welcome :) > > Would it be possible to handle the error at decode with "try"? If so, maybe you could > add some logic there to try to decode first with the email charset. Then, try utf-8. > If both fails, try to decode with some other protocols, like iso8859-11. This will > likely catch 99% of the issues. If everything fails, it is preferred to use the > replacement character than to loose the patch. > > I would also add a meta-tag to inticate the cases where patchwork is guessing a > type (or using a replacement character). This way, the maintainer may manually > take care of the fixes. > > Cheers, > Mauro
diff --git a/apps/patchwork/bin/pwclient b/apps/patchwork/bin/pwclient index dc836e9..dba68fb 100755 --- a/apps/patchwork/bin/pwclient +++ b/apps/patchwork/bin/pwclient @@ -179,7 +179,8 @@ def action_list(rpc, filter, submitter_str, delegate_str): for id in ids: person = rpc.person_get(id) print "Patches submitted by %s <%s>:" % \ - (person['name'], person['email']) + (unicode(person['name']).encode("utf-8"), \ + unicode(person['email']).encode("utf-8")) f = filter f.add("submitter_id", id) patches = rpc.patch_list(f.d)
This patch fixes following bug in 'list': ---8<--- # pwclient list -p uboot -w andreas.devel | grep New Traceback (most recent call last): File "/Users/andreas/bin/pwclient", line 463, in <module> main() File "/Users/andreas/bin/pwclient", line 411, in main action_list(rpc, filt, submitter_str, delegate_str) File "/Users/andreas/bin/pwclient", line 182, in action_list (person['name'], person['email']) UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 32: ordinal not in range(128) --->8--- Signed-off-by: Andreas Bießmann <andreas.devel@googlemail.com> --- One note, I could reproduce this error only when output is piped. apps/patchwork/bin/pwclient | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-)