Message ID | 20140611193445.GG10378@spoyarek.pnq.redhat.com |
---|---|
State | Superseded |
Headers | show |
Hi, Ping! On Thu, Jun 12, 2014 at 01:04:46AM +0530, Siddhesh Poyarekar wrote: > Trying again after signing up to the mailing list (patch is slightly > modified from my first submission, which may either be in moderation > or may have gotten lost somehow): > > On Wed, Jun 11, 2014 at 04:09:16PM +0530, Siddhesh Poyarekar wrote: > > Hi, > > > > We recently encountered a case in our glibc patchwork instance on > > sourceware, where a patch was dropped because it had x-unknown > > charset. I used the following patch to fix this in our instance. The > > fix I used was to fall back on a set of encodings (instead of just > > utf-8) when the charset is not mentioned or if it is set as x-unknown. > > > > I hope this is useful. I'd love to know if you all think there is a > > better way to fix this so that I can implement that in our instance > > instead of my hack. > > > > Cheers, > > Siddhesh > > --- a/apps/patchwork/bin/parsemail.py 2014-06-11 15:53:12.685666812 +0530 > +++ b/apps/patchwork/bin/parsemail.py 2014-06-11 15:53:03.991667186 +0530 > @@ -147,6 +147,13 @@ > return match.group(1) > return None > > +def try_decode(payload, charset): > + try: > + payload = unicode(payload, charset) > + except UnicodeDecodeError: > + return None > + return payload > + > def find_content(project, mail): > patchbuf = None > commentbuf = '' > @@ -157,15 +164,27 @@ > continue > > payload = part.get_payload(decode=True) > - charset = part.get_content_charset() > subtype = part.get_content_subtype() > > - # if we don't have a charset, assume utf-8 > - if charset is None: > - charset = 'utf-8' > - > if not isinstance(payload, unicode): > - payload = unicode(payload, charset) > + charset = part.get_content_charset() > + > + # If there is no charset or if it is unknown, then try some common > + # charsets before we fail. > + if charset is None or charset == 'x-unknown': > + try_charsets = ['utf-8', 'windows-1252', 'ascii', 'iso-8859-1'] > + else: > + try_charsets = [charset] > + > + for cset in try_charsets: > + decoded_payload = try_decode(payload, cset) > + if decoded_payload is not None: > + break > + payload = decoded_payload > + > + # Could not find a valid decoded payload. Fail. > + if payload is None: > + return (None, None) > > if subtype in ['x-patch', 'x-diff']: > patchbuf = payload > _______________________________________________ > Patchwork mailing list > Patchwork@lists.ozlabs.org > https://lists.ozlabs.org/listinfo/patchwork
Hi Siddesh, >> We recently encountered a case in our glibc patchwork instance on >> sourceware, where a patch was dropped because it had x-unknown >> charset. I used the following patch to fix this in our instance. The >> fix I used was to fall back on a set of encodings (instead of just >> utf-8) when the charset is not mentioned or if it is set as x-unknown. >> >> I hope this is useful. I'd love to know if you all think there is a >> better way to fix this so that I can implement that in our instance >> instead of my hack. Just one thing I noticed when applying this: > + > + # If there is no charset or if it is unknown, then try some common > + # charsets before we fail. > + if charset is None or charset == 'x-unknown': > + try_charsets = ['utf-8', 'windows-1252', 'ascii', 'iso-8859-1'] Is there any point including ascii there? If it didn't parse as utf-8, I don't think it'll parse as ascii either. Also, could you send me a Signed-off-by line for this patch too? Regards, Jeremy
On Fri, Jul 04, 2014 at 08:47:01AM +0800, Jeremy Kerr wrote: > Hi Siddesh, > > >>We recently encountered a case in our glibc patchwork instance on > >>sourceware, where a patch was dropped because it had x-unknown > >>charset. I used the following patch to fix this in our instance. The > >>fix I used was to fall back on a set of encodings (instead of just > >>utf-8) when the charset is not mentioned or if it is set as x-unknown. > >> > >>I hope this is useful. I'd love to know if you all think there is a > >>better way to fix this so that I can implement that in our instance > >>instead of my hack. > > Just one thing I noticed when applying this: > > >+ > >+ # If there is no charset or if it is unknown, then try some common > >+ # charsets before we fail. > >+ if charset is None or charset == 'x-unknown': > >+ try_charsets = ['utf-8', 'windows-1252', 'ascii', 'iso-8859-1'] > > Is there any point including ascii there? If it didn't parse as utf-8, I > don't think it'll parse as ascii either. Right, it won't. I'll remove it. > Also, could you send me a Signed-off-by line for this patch too? Sure, I'll send an updated patch shortly. Siddhesh
--- a/apps/patchwork/bin/parsemail.py 2014-06-11 15:53:12.685666812 +0530 +++ b/apps/patchwork/bin/parsemail.py 2014-06-11 15:53:03.991667186 +0530 @@ -147,6 +147,13 @@ return match.group(1) return None +def try_decode(payload, charset): + try: + payload = unicode(payload, charset) + except UnicodeDecodeError: + return None + return payload + def find_content(project, mail): patchbuf = None commentbuf = '' @@ -157,15 +164,27 @@ continue payload = part.get_payload(decode=True) - charset = part.get_content_charset() subtype = part.get_content_subtype() - # if we don't have a charset, assume utf-8 - if charset is None: - charset = 'utf-8' - if not isinstance(payload, unicode): - payload = unicode(payload, charset) + charset = part.get_content_charset() + + # If there is no charset or if it is unknown, then try some common + # charsets before we fail. + if charset is None or charset == 'x-unknown': + try_charsets = ['utf-8', 'windows-1252', 'ascii', 'iso-8859-1'] + else: + try_charsets = [charset] + + for cset in try_charsets: + decoded_payload = try_decode(payload, cset) + if decoded_payload is not None: + break + payload = decoded_payload + + # Could not find a valid decoded payload. Fail. + if payload is None: + return (None, None) if subtype in ['x-patch', 'x-diff']: patchbuf = payload