diff mbox

Fallback to common charsets when charset is None or x-unknown

Message ID 20140611193445.GG10378@spoyarek.pnq.redhat.com
State Superseded
Headers show

Commit Message

Siddhesh Poyarekar June 11, 2014, 7:34 p.m. UTC
Trying again after signing up to the mailing list (patch is slightly
modified from my first submission, which may either be in moderation
or may have gotten lost somehow):

On Wed, Jun 11, 2014 at 04:09:16PM +0530, Siddhesh Poyarekar wrote:
> Hi,
> 
> We recently encountered a case in our glibc patchwork instance on
> sourceware, where a patch was dropped because it had x-unknown
> charset.  I used the following patch to fix this in our instance.  The
> fix I used was to fall back on a set of encodings (instead of just
> utf-8) when the charset is not mentioned or if it is set as x-unknown.
> 
> I hope this is useful.  I'd love to know if you all think there is a
> better way to fix this so that I can implement that in our instance
> instead of my hack.
> 
> Cheers,
> Siddhesh

Comments

Siddhesh Poyarekar June 30, 2014, 6:03 a.m. UTC | #1
Hi,

Ping!

On Thu, Jun 12, 2014 at 01:04:46AM +0530, Siddhesh Poyarekar wrote:
> Trying again after signing up to the mailing list (patch is slightly
> modified from my first submission, which may either be in moderation
> or may have gotten lost somehow):
> 
> On Wed, Jun 11, 2014 at 04:09:16PM +0530, Siddhesh Poyarekar wrote:
> > Hi,
> > 
> > We recently encountered a case in our glibc patchwork instance on
> > sourceware, where a patch was dropped because it had x-unknown
> > charset.  I used the following patch to fix this in our instance.  The
> > fix I used was to fall back on a set of encodings (instead of just
> > utf-8) when the charset is not mentioned or if it is set as x-unknown.
> > 
> > I hope this is useful.  I'd love to know if you all think there is a
> > better way to fix this so that I can implement that in our instance
> > instead of my hack.
> > 
> > Cheers,
> > Siddhesh
> 
> --- a/apps/patchwork/bin/parsemail.py	2014-06-11 15:53:12.685666812 +0530
> +++ b/apps/patchwork/bin/parsemail.py	2014-06-11 15:53:03.991667186 +0530
> @@ -147,6 +147,13 @@
>          return match.group(1)
>      return None
>  
> +def try_decode(payload, charset):
> +    try:
> +        payload = unicode(payload, charset)
> +    except UnicodeDecodeError:
> +        return None
> +    return payload
> +
>  def find_content(project, mail):
>      patchbuf = None
>      commentbuf = ''
> @@ -157,15 +164,27 @@
>              continue
>  
>          payload = part.get_payload(decode=True)
> -        charset = part.get_content_charset()
>          subtype = part.get_content_subtype()
>  
> -        # if we don't have a charset, assume utf-8
> -        if charset is None:
> -            charset = 'utf-8'
> -
>          if not isinstance(payload, unicode):
> -            payload = unicode(payload, charset)
> +            charset = part.get_content_charset()
> +
> +            # If there is no charset or if it is unknown, then try some common
> +            # charsets before we fail.
> +            if charset is None or charset == 'x-unknown':
> +                try_charsets = ['utf-8', 'windows-1252', 'ascii', 'iso-8859-1']
> +            else:
> +                try_charsets = [charset]
> +
> +            for cset in try_charsets:
> +                decoded_payload = try_decode(payload, cset)
> +                if decoded_payload is not None:
> +                    break
> +            payload = decoded_payload
> +
> +            # Could not find a valid decoded payload.  Fail.
> +            if payload is None:
> +                return (None, None)
>  
>          if subtype in ['x-patch', 'x-diff']:
>              patchbuf = payload



> _______________________________________________
> Patchwork mailing list
> Patchwork@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/patchwork
Jeremy Kerr July 4, 2014, 12:47 a.m. UTC | #2
Hi Siddesh,

>> We recently encountered a case in our glibc patchwork instance on
>> sourceware, where a patch was dropped because it had x-unknown
>> charset.  I used the following patch to fix this in our instance.  The
>> fix I used was to fall back on a set of encodings (instead of just
>> utf-8) when the charset is not mentioned or if it is set as x-unknown.
>>
>> I hope this is useful.  I'd love to know if you all think there is a
>> better way to fix this so that I can implement that in our instance
>> instead of my hack.

Just one thing I noticed when applying this:

> +
> +            # If there is no charset or if it is unknown, then try some common
> +            # charsets before we fail.
> +            if charset is None or charset == 'x-unknown':
> +                try_charsets = ['utf-8', 'windows-1252', 'ascii', 'iso-8859-1']

Is there any point including ascii there? If it didn't parse as utf-8, I 
don't think it'll parse as ascii either.

Also, could you send me a Signed-off-by line for this patch too?

Regards,


Jeremy
Siddhesh Poyarekar July 4, 2014, 2:21 a.m. UTC | #3
On Fri, Jul 04, 2014 at 08:47:01AM +0800, Jeremy Kerr wrote:
> Hi Siddesh,
> 
> >>We recently encountered a case in our glibc patchwork instance on
> >>sourceware, where a patch was dropped because it had x-unknown
> >>charset.  I used the following patch to fix this in our instance.  The
> >>fix I used was to fall back on a set of encodings (instead of just
> >>utf-8) when the charset is not mentioned or if it is set as x-unknown.
> >>
> >>I hope this is useful.  I'd love to know if you all think there is a
> >>better way to fix this so that I can implement that in our instance
> >>instead of my hack.
> 
> Just one thing I noticed when applying this:
> 
> >+
> >+            # If there is no charset or if it is unknown, then try some common
> >+            # charsets before we fail.
> >+            if charset is None or charset == 'x-unknown':
> >+                try_charsets = ['utf-8', 'windows-1252', 'ascii', 'iso-8859-1']
> 
> Is there any point including ascii there? If it didn't parse as utf-8, I
> don't think it'll parse as ascii either.

Right, it won't.  I'll remove it.

> Also, could you send me a Signed-off-by line for this patch too?

Sure, I'll send an updated patch shortly.

Siddhesh
diff mbox

Patch

--- a/apps/patchwork/bin/parsemail.py	2014-06-11 15:53:12.685666812 +0530
+++ b/apps/patchwork/bin/parsemail.py	2014-06-11 15:53:03.991667186 +0530
@@ -147,6 +147,13 @@ 
         return match.group(1)
     return None
 
+def try_decode(payload, charset):
+    try:
+        payload = unicode(payload, charset)
+    except UnicodeDecodeError:
+        return None
+    return payload
+
 def find_content(project, mail):
     patchbuf = None
     commentbuf = ''
@@ -157,15 +164,27 @@ 
             continue
 
         payload = part.get_payload(decode=True)
-        charset = part.get_content_charset()
         subtype = part.get_content_subtype()
 
-        # if we don't have a charset, assume utf-8
-        if charset is None:
-            charset = 'utf-8'
-
         if not isinstance(payload, unicode):
-            payload = unicode(payload, charset)
+            charset = part.get_content_charset()
+
+            # If there is no charset or if it is unknown, then try some common
+            # charsets before we fail.
+            if charset is None or charset == 'x-unknown':
+                try_charsets = ['utf-8', 'windows-1252', 'ascii', 'iso-8859-1']
+            else:
+                try_charsets = [charset]
+
+            for cset in try_charsets:
+                decoded_payload = try_decode(payload, cset)
+                if decoded_payload is not None:
+                    break
+            payload = decoded_payload
+
+            # Could not find a valid decoded payload.  Fail.
+            if payload is None:
+                return (None, None)
 
         if subtype in ['x-patch', 'x-diff']:
             patchbuf = payload