Patchwork Decode patch from UTF-8 while parsing from stdin

login
register
mail settings
Submitter Dirk Wallenstein
Date Jan. 13, 2011, 8:49 a.m.
Message ID <20110113084942.GB12433@zap>
Download mbox | patch
Permalink /patch/78682/
State Superseded
Headers show

Comments

Dirk Wallenstein - Jan. 13, 2011, 8:49 a.m.
On Mon, Dec 13, 2010 at 08:37:25AM +0800, Jeremy Kerr wrote:
> Hi Paul,
> 
> > I think this patch needs to be reapplied, after the branch change(?) in
> > March 2010 [1].
> 
> Thanks, good catch, I've reapplied it.

This does now decode twice and raises an exception (on both paths).  I
don't see a reason for the second time, but attached is a patch that
makes it conditional.

Try it with this (the copyright sign will be decoded twice):
http://patchwork.freedesktop.org/patch/3648/
Jeremy Kerr - Feb. 10, 2011, 3:18 a.m.
Hi Paul & Martin,

> > > I think this patch needs to be reapplied, after the branch change(?) in
> > > March 2010 [1].
> > 
> > Thanks, good catch, I've reapplied it.
> 
> This does now decode twice and raises an exception (on both paths).  I
> don't see a reason for the second time, but attached is a patch that
> makes it conditional.

I'm also having trouble finding where parse_patch will be passed anything but 
unicode. There are two call sites for parse_patch:

1) From find_content(), but we'd explicitly checked for a unicode type here

2) From main() in parser.py, but we're using sys.stdin.read().decode('utf-8'), 
which should also give us a unicode type.

Martin, could you explain what this fix was originally for? If so, I'll write 
a testcase. However, it seems to no longer be necessary.

Paul - I'd rather ensure we're checking passing the correct arguments in, 
rather than conditionally doing the fixup.

Cheers,


Jeremy
Martin Krafft - Feb. 10, 2011, 5:33 a.m.
also sprach Jeremy Kerr <jk@ozlabs.org> [2011.02.10.0418 +0100]:
> I'm also having trouble finding where parse_patch will be passed
> anything but unicode.

Exactly; hence it should treat the text as Unicode and decode it
before manipulating it.

I do not remember what the actual problem was when I wrote this
patch, I am sorry.
Guilherme Salgado - Feb. 10, 2011, 12:02 p.m.
On Thu, 2011-02-10 at 06:33 +0100, martin f. krafft wrote:
> also sprach Jeremy Kerr <jk@ozlabs.org> [2011.02.10.0418 +0100]:
> > I'm also having trouble finding where parse_patch will be passed
> > anything but unicode.
> 
> Exactly; hence it should treat the text as Unicode and decode it
> before manipulating it.

If indeed parse_patch() will only be passed unicode objects, having it
call text.decode('utf8') is wrong because that will cause an implicit
encoding of text using the default codec (ascii), which will fail if
text has any non-ascii characters.
Martin Krafft - Feb. 10, 2011, 2:43 p.m.
also sprach Guilherme Salgado <guilherme.salgado@linaro.org> [2011.02.10.1302 +0100]:
> If indeed parse_patch() will only be passed unicode objects, having it
> call text.decode('utf8') is wrong because that will cause an implicit
> encoding of text using the default codec (ascii), which will fail if
> text has any non-ascii characters.

True, but I am sure there was another reason. I cannot remember it
though.

Has someone tried submitting a patch with Unicode? Maybe it does
work now…
Guilherme Salgado - Feb. 10, 2011, 3:55 p.m.
On Thu, 2011-02-10 at 15:43 +0100, martin f. krafft wrote:
> also sprach Guilherme Salgado <guilherme.salgado@linaro.org> [2011.02.10.1302 +0100]:
> > If indeed parse_patch() will only be passed unicode objects, having it
> > call text.decode('utf8') is wrong because that will cause an implicit
> > encoding of text using the default codec (ascii), which will fail if
> > text has any non-ascii characters.
> 
> True, but I am sure there was another reason. I cannot remember it
> though.
> 
> Has someone tried submitting a patch with Unicode? Maybe it does
> work now…

It's actually the other way around -- it works if we remove
the .decode('utf-8') from parse_patch().  With the decode() there the
tests that try parsing email addresses with non-ascii stuff (e.g.
patchwork.tests.patchparser.UTF8InlinePatchTest) fail but if we remove
the .decode('utf-8') they pass.

Patch

diff --git a/apps/patchwork/parser.py b/apps/patchwork/parser.py
index 24631b7..57b25c8 100644
--- a/apps/patchwork/parser.py
+++ b/apps/patchwork/parser.py
@@ -63,7 +63,10 @@  def parse_patch(text):
     lc = (0, 0)
     hunk = 0
 
-    for line in text.decode('utf-8').split('\n'):
+    if not isinstance(text, unicode):
+        text = unicode(text, 'utf-8')
+
+    for line in text.split('\n'):
         line += '\n'
 
         if state == 0: