diff mbox

Fallback to common charsets when charset is None or x-unknown

Message ID 20140704161322.GA31280@spoyarek.pnq.redhat.com
State Superseded
Headers show

Commit Message

Siddhesh Poyarekar July 4, 2014, 4:13 p.m. UTC
We recently encountered a case in our glibc patchwork instance on
sourceware, where a patch was dropped because it had x-unknown
charset.  I used the following patch to fix this in our instance.  The
fix I used was to fall back on a set of encodings (instead of just
utf-8) when the charset is not mentioned or if it is set as x-unknown.

v2 removes ascii as a fallback since it won't work anyway if utf-8
failed.

Signed-off-by: Siddhesh Poyarekar <siddhesh@redhat.com>
---
 apps/patchwork/bin/parsemail.py | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

Comments

Siddhesh Poyarekar July 11, 2014, 5:09 p.m. UTC | #1
Hi,

Ping!

Siddhesh

On Fri, Jul 04, 2014 at 09:43:23PM +0530, Siddhesh Poyarekar wrote:
> We recently encountered a case in our glibc patchwork instance on
> sourceware, where a patch was dropped because it had x-unknown
> charset.  I used the following patch to fix this in our instance.  The
> fix I used was to fall back on a set of encodings (instead of just
> utf-8) when the charset is not mentioned or if it is set as x-unknown.
> 
> v2 removes ascii as a fallback since it won't work anyway if utf-8
> failed.
> 
> Signed-off-by: Siddhesh Poyarekar <siddhesh@redhat.com>
> ---
>  apps/patchwork/bin/parsemail.py | 31 +++++++++++++++++++++++++------
>  1 file changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/apps/patchwork/bin/parsemail.py b/apps/patchwork/bin/parsemail.py
> index b6eb97a..7c173d9 100755
> --- a/apps/patchwork/bin/parsemail.py
> +++ b/apps/patchwork/bin/parsemail.py
> @@ -147,6 +147,13 @@ def find_pull_request(content):
>          return match.group(1)
>      return None
>  
> +def try_decode(payload, charset):
> +    try:
> +        payload = unicode(payload, charset)
> +    except UnicodeDecodeError:
> +        return None
> +    return payload
> +
>  def find_content(project, mail):
>      patchbuf = None
>      commentbuf = ''
> @@ -157,15 +164,27 @@ def find_content(project, mail):
>              continue
>  
>          payload = part.get_payload(decode=True)
> -        charset = part.get_content_charset()
>          subtype = part.get_content_subtype()
>  
> -        # if we don't have a charset, assume utf-8
> -        if charset is None:
> -            charset = 'utf-8'
> -
>          if not isinstance(payload, unicode):
> -            payload = unicode(payload, charset)
> +            charset = part.get_content_charset()
> +
> +            # If there is no charset or if it is unknown, then try some common
> +            # charsets before we fail.
> +            if charset is None or charset == 'x-unknown':
> +                try_charsets = ['utf-8', 'windows-1252', 'iso-8859-1']
> +            else:
> +                try_charsets = [charset]
> +
> +            for cset in try_charsets:
> +                decoded_payload = try_decode(payload, cset)
> +                if decoded_payload is not None:
> +                    break
> +            payload = decoded_payload
> +
> +            # Could not find a valid decoded payload.  Fail.
> +            if payload is None:
> +                return (None, None)
>  
>          if subtype in ['x-patch', 'x-diff']:
>              patchbuf = payload
> -- 
> 1.9.3
> 



> _______________________________________________
> Patchwork mailing list
> Patchwork@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/patchwork
Jeremy Kerr July 14, 2014, 2:09 a.m. UTC | #2
Hi Siddesh,

> We recently encountered a case in our glibc patchwork instance on
> sourceware, where a patch was dropped because it had x-unknown
> charset.  I used the following patch to fix this in our instance.  The
> fix I used was to fall back on a set of encodings (instead of just
> utf-8) when the charset is not mentioned or if it is set as x-unknown.

I tried this patch with a testcase from the failing patch you forwarded
me, and get the following:

ERROR: testPatch (patchwork.tests.test_patchparser.CharsetFallbackPatchTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File
"/home/jk/devel/patchwork/apps/patchwork/tests/test_patchparser.py",
line 432, in testPatch
    (patch, comment) = find_content(self.project, self.mail)
  File "/home/jk/devel/patchwork/apps/patchwork/bin/parsemail.py", line
180, in find_content
    decoded_payload = try_decode(payload, cset)
  File "/home/jk/devel/patchwork/apps/patchwork/bin/parsemail.py", line
152, in try_decode
    payload = unicode(payload, charset)
LookupError: unknown encoding: none

In this case, it looks like the encoding is actually ascii, but the
"none" charset is causing trouble:

  Content-Type: text/plain; charset="none"
  Content-Transfer-Encoding: QUOTED-PRINTABLE

So the failure there is due to the "none", and not the charset of the
patch content. I've put together an updated change (will send it
shortly), could you let me know if it looks OK to you?

Regards,


Jeremy
diff mbox

Patch

diff --git a/apps/patchwork/bin/parsemail.py b/apps/patchwork/bin/parsemail.py
index b6eb97a..7c173d9 100755
--- a/apps/patchwork/bin/parsemail.py
+++ b/apps/patchwork/bin/parsemail.py
@@ -147,6 +147,13 @@  def find_pull_request(content):
         return match.group(1)
     return None
 
+def try_decode(payload, charset):
+    try:
+        payload = unicode(payload, charset)
+    except UnicodeDecodeError:
+        return None
+    return payload
+
 def find_content(project, mail):
     patchbuf = None
     commentbuf = ''
@@ -157,15 +164,27 @@  def find_content(project, mail):
             continue
 
         payload = part.get_payload(decode=True)
-        charset = part.get_content_charset()
         subtype = part.get_content_subtype()
 
-        # if we don't have a charset, assume utf-8
-        if charset is None:
-            charset = 'utf-8'
-
         if not isinstance(payload, unicode):
-            payload = unicode(payload, charset)
+            charset = part.get_content_charset()
+
+            # If there is no charset or if it is unknown, then try some common
+            # charsets before we fail.
+            if charset is None or charset == 'x-unknown':
+                try_charsets = ['utf-8', 'windows-1252', 'iso-8859-1']
+            else:
+                try_charsets = [charset]
+
+            for cset in try_charsets:
+                decoded_payload = try_decode(payload, cset)
+                if decoded_payload is not None:
+                    break
+            payload = decoded_payload
+
+            # Could not find a valid decoded payload.  Fail.
+            if payload is None:
+                return (None, None)
 
         if subtype in ['x-patch', 'x-diff']:
             patchbuf = payload