From patchwork Fri Jul 4 16:13:23 2014
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Siddhesh Poyarekar
X-Patchwork-Id: 367140
Date: Fri, 4 Jul 2014 21:43:23 +0530
From: Siddhesh Poyarekar
To: patchwork@lists.ozlabs.org
Cc: Jeremy Kerr
Subject: [PATCH] Fallback to common charsets when charset is None or x-unknown
Message-ID: <20140704161322.GA31280@spoyarek.pnq.redhat.com>
User-Agent: Mutt/1.5.22.1-rc1 (2013-10-16)
List-Id: Patchwork development

We recently hit a case in our glibc patchwork instance on sourceware
where a patch was dropped because its charset was x-unknown.  I used
the following patch to fix this in our instance: when the charset is
missing or set to x-unknown, fall back on a set of encodings instead
of just utf-8.

v2 removes ascii as a fallback, since it cannot succeed if utf-8 has
already failed.

Signed-off-by: Siddhesh Poyarekar
---
 apps/patchwork/bin/parsemail.py | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/apps/patchwork/bin/parsemail.py b/apps/patchwork/bin/parsemail.py
index b6eb97a..7c173d9 100755
--- a/apps/patchwork/bin/parsemail.py
+++ b/apps/patchwork/bin/parsemail.py
@@ -147,6 +147,13 @@ def find_pull_request(content):
         return match.group(1)
     return None
 
+def try_decode(payload, charset):
+    try:
+        payload = unicode(payload, charset)
+    except UnicodeDecodeError:
+        return None
+    return payload
+
 def find_content(project, mail):
     patchbuf = None
     commentbuf = ''
@@ -157,15 +164,27 @@ def find_content(project, mail):
             continue
 
         payload = part.get_payload(decode=True)
-        charset = part.get_content_charset()
         subtype = part.get_content_subtype()
 
-        # if we don't have a charset, assume utf-8
-        if charset is None:
-            charset = 'utf-8'
-
         if not isinstance(payload, unicode):
-            payload = unicode(payload, charset)
+            charset = part.get_content_charset()
+
+            # If there is no charset or if it is unknown, then try some common
+            # charsets before we fail.
+            if charset is None or charset == 'x-unknown':
+                try_charsets = ['utf-8', 'windows-1252', 'iso-8859-1']
+            else:
+                try_charsets = [charset]
+
+            for cset in try_charsets:
+                decoded_payload = try_decode(payload, cset)
+                if decoded_payload is not None:
+                    break
+            payload = decoded_payload
+
+            # Could not find a valid decoded payload. Fail.
+            if payload is None:
+                return (None, None)
 
         if subtype in ['x-patch', 'x-diff']:
             patchbuf = payload
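
For anyone who wants to try the fallback order outside of parsemail.py, here is a minimal standalone sketch of the same idea. It is not part of the patch. It assumes Python 3 (bytes in, str out) rather than the Python 2 unicode() call used above, and the names decode_with_fallback and FALLBACK_CHARSETS are made up for illustration. Unlike try_decode in the patch, it also catches LookupError, so an unrecognised charset name is treated as a decode failure; that is my assumption about the desired behaviour, not something the patch does.

```python
# Illustrative Python 3 sketch; the names below are hypothetical and do
# not exist in patchwork.
FALLBACK_CHARSETS = ('utf-8', 'windows-1252', 'iso-8859-1')


def decode_with_fallback(payload, charset=None):
    """Return (text, charset_used), or (None, None) if nothing decodes."""
    # Already-decoded text is passed through untouched.
    if isinstance(payload, str):
        return payload, charset

    # Same rule as the patch: only guess when the charset is missing
    # or declared as x-unknown; otherwise trust the declared charset.
    if charset is None or charset == 'x-unknown':
        candidates = FALLBACK_CHARSETS
    else:
        candidates = (charset,)

    for cset in candidates:
        try:
            return payload.decode(cset), cset
        except (UnicodeDecodeError, LookupError):
            # LookupError handling is an addition over the patch: it
            # covers charset names that Python has no codec for.
            continue
    return None, None


if __name__ == '__main__':
    print(decode_with_fallback(b'caf\xc3\xa9'))
    print(decode_with_fallback(b'caf\xe9', 'x-unknown'))
    print(decode_with_fallback(b'caf\xe9', 'utf-8'))
```

The order of the fallback list matters. Python's windows-1252 codec rejects a handful of undefined bytes (0x81, for example), whereas iso-8859-1 maps every byte value, so with iso-8859-1 last the guessing path always yields some text. In this sketch the (None, None) result is therefore only reachable when a specific charset was declared and the payload does not decode with it.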