Message ID | 20110113084942.GB12433@zap |
---|---|
State | Superseded |
Headers | show |
Hi Paul & Martin, > > > I think this patch needs to be reapplied, after the branch change(?) in > > > March 2010 [1]. > > > > Thanks, good catch, I've reapplied it. > > This does now decode twice and raises an exception (on both paths). I > don't see a reason for the second time, but attached is a patch that > makes it conditional. I'm also having trouble finding where parse_patch will be passed anything but unicode. There are two call sites for parse_patch: 1) From find_content(), but we'd explicitly checked for a unicode type here 2) From main() in parser.py, but we're using sys.stdin.read().decode('utf-8'), which should also give us a unicode type. Martin, could you explain what this fix was originally for? If so, I'll write a testcase. However, it seems to no longer be necessary. Paul - I'd rather ensure we're checking passing the correct arguments in, rather than conditionally doing the fixup. Cheers, Jeremy
also sprach Jeremy Kerr <jk@ozlabs.org> [2011.02.10.0418 +0100]: > I'm also having trouble finding where parse_patch will be passed > anything but unicode. Exactly; hence it should treat the text as Unicode and decode it before manipulating it. I do not remember what the actual problem was when I wrote this patch, I am sorry.
On Thu, 2011-02-10 at 06:33 +0100, martin f. krafft wrote: > also sprach Jeremy Kerr <jk@ozlabs.org> [2011.02.10.0418 +0100]: > > I'm also having trouble finding where parse_patch will be passed > > anything but unicode. > > Exactly; hence it should treat the text as Unicode and decode it > before manipulating it. If indeed parse_patch() will only be passed unicode objects, having it call text.decode('utf8') is wrong because that will cause an implicit encoding of text using the default codec (ascii), which will fail if text has any non-ascii characters.
also sprach Guilherme Salgado <guilherme.salgado@linaro.org> [2011.02.10.1302 +0100]: > If indeed parse_patch() will only be passed unicode objects, having it > call text.decode('utf8') is wrong because that will cause an implicit > encoding of text using the default codec (ascii), which will fail if > text has any non-ascii characters. True, but I am sure there was another reason. I cannot remember it though. Has someone tried submitting a patch with Unicode? Maybe it does work now…
On Thu, 2011-02-10 at 15:43 +0100, martin f. krafft wrote: > also sprach Guilherme Salgado <guilherme.salgado@linaro.org> [2011.02.10.1302 +0100]: > > If indeed parse_patch() will only be passed unicode objects, having it > > call text.decode('utf8') is wrong because that will cause an implicit > > encoding of text using the default codec (ascii), which will fail if > > text has any non-ascii characters. > > True, but I am sure there was another reason. I cannot remember it > though. > > Has someone tried submitting a patch with Unicode? Maybe it does > work now… It's actually the other way around -- it works if we remove the .decode('utf-8') from parse_patch(). With the decode() there the tests that try parsing email addresses with non-ascii stuff (e.g. patchwork.tests.patchparser.UTF8InlinePatchTest) fail but if we remove the .decode('utf-8') they pass.
diff --git a/apps/patchwork/parser.py b/apps/patchwork/parser.py index 24631b7..57b25c8 100644 --- a/apps/patchwork/parser.py +++ b/apps/patchwork/parser.py @@ -63,7 +63,10 @@ def parse_patch(text): lc = (0, 0) hunk = 0 - for line in text.decode('utf-8').split('\n'): + if not isinstance(text, unicode): + text = unicode(text, 'utf-8') + + for line in text.split('\n'): line += '\n' if state == 0: