From patchwork Sun Dec 1 17:10:44 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Stephen Finucane X-Patchwork-Id: 1202882 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 47Qvsw0jLcz9sP3 for ; Mon, 2 Dec 2019 04:11:32 +1100 (AEDT) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=that.guru Authentication-Results: ozlabs.org; dkim=fail reason="key not found in DNS" (0-bit key; unprotected) header.d=that.guru header.i=@that.guru header.b="A8Ysgw1z"; dkim-atps=neutral Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3]) by lists.ozlabs.org (Postfix) with ESMTP id 47Qvsv52y0zDqXS for ; Mon, 2 Dec 2019 04:11:31 +1100 (AEDT) X-Original-To: patchwork@lists.ozlabs.org Delivered-To: patchwork@lists.ozlabs.org Authentication-Results: lists.ozlabs.org; spf=none (no SPF record) smtp.mailfrom=that.guru (client-ip=199.181.239.174; helo=relay0174.mxlogin.com; envelope-from=stephen@that.guru; receiver=) Authentication-Results: lists.ozlabs.org; dmarc=none (p=none dis=none) header.from=that.guru Authentication-Results: lists.ozlabs.org; dkim=fail reason="key not found in DNS" (0-bit key; unprotected) header.d=that.guru header.i=@that.guru header.b="A8Ysgw1z"; dkim-atps=neutral Received: from relay0174.mxlogin.com (relay0174.mxlogin.com [199.181.239.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 47QvsK3M0LzDqWR for ; Mon, 2 Dec 2019 04:11:00 +1100 (AEDT) Received: from filter004.mxroute.com (unknown [116.203.155.46]) (using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by relay0174.mxlogin.com (Postfix) with ESMTPS id 8BB1BCC6031C; Sun, 1 Dec 2019 11:10:56 -0600 (CST) Received: from one.mxroute.com (one.mxroute.com [195.201.59.211]) by filter004.mxroute.com (Postfix) with ESMTPS id 8F2793EAE1; Sun, 1 Dec 2019 17:10:53 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=that.guru; s=default; h=Content-Transfer-Encoding:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:Cc:To:From:Sender:Reply-To:Content-Type:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Id:List-Help:List-Unsubscribe:List-Subscribe: List-Post:List-Owner:List-Archive; bh=Hn5Hr86CMkgbs5cYojoV1CZqn9ZqySLhhzqlL+Hl3QI=; b=A8Ysgw1zMtzDRviDTyxKYmtSW7 BNY9ajeIeljG/FeteUwpshLkzwkM5aps8eBS38fpCx8TWgbNocSD2V5hea9yNTdzNrbYzR+k2adPq CfMjUCyzMn451eZznl6qQnXBxkGuVkLh5iaVTFjMASJdRLQDMuc6EY/Xf8rQs2p83C25m87dHnQv0 KPiow4WYQOKadj+T2TY0EhOQThA0rL4CrFJ1KmSCbQfOiHAqrFLwCguLdTuhao8D1TcjDhID9gLoJ PZTYu8gioXyTgqkMWkx7NF14fgZRPBIoUY/waenb33DiJi6N0yqTqxSFRGFOTA17P7qS2JbHkOcKQ wzNX4Mzw==; From: Stephen Finucane To: patchwork@lists.ozlabs.org Subject: [PATCH v4 2/3] parser: Use a second query to weed out duplicate series Date: Sun, 1 Dec 2019 17:10:44 +0000 Message-Id: <20191201171045.97913-2-stephen@that.guru> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20191201171045.97913-1-stephen@that.guru> References: <20191201171045.97913-1-stephen@that.guru> MIME-Version: 1.0 X-AuthUser: stephen@that.guru X-BeenThere: patchwork@lists.ozlabs.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Patchwork development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: patchwork-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org Sender: "Patchwork" Annoyingly, not all email clients properly thread emails using the message ID fields originally specified in RFC 822 [1]. Worse, some MTAs (cough, outlook.com, cough) actually override what the client configures, breaking the world in the process. Realising this is an issue, Patchwork supports threading using arbitrary metadata in addition to the RFC 822 metadata. Specifically, it uses a combination of submitter and list-id extracted from the headers along with the series version and total count metadata extracted from the subject. In addition to this, we timebox things so that two or more series that match on all of this metadata but which are sent some time apart from each other aren't combined by accident. This does leave one edge case - duplicate series received within the timebox will be combined. We've resigned ourselves to this fact on the basis that it's extremely unlikely for all of these things to go wrong at once. Given all the above, there should be no reason that attempting to find series by series markers should return more than one series. The timeboxing will prevent us grouping similar looking series by accident and the only other reason for this to happen is because we lost a race and we should try again. [1] https://tools.ietf.org/html/rfc822 Signed-off-by: Stephen Finucane Cc: Daniel Axtens --- v4: - Handle additional corner cases --- patchwork/parser.py | 45 ++++++++++++++++++++++++++++------ patchwork/tests/test_parser.py | 8 +++--- 2 files changed, 41 insertions(+), 12 deletions(-) diff --git a/patchwork/parser.py b/patchwork/parser.py index c0084cc0..45dd8db8 100644 --- a/patchwork/parser.py +++ b/patchwork/parser.py @@ -61,6 +61,11 @@ class DuplicateMailError(Exception): self.msgid = msgid +class DuplicateSeriesError(Exception): + + pass + + def normalise_space(value): whitespace_re = re.compile(r'\s+') return whitespace_re.sub(' ', value).strip() @@ -256,8 +261,10 @@ def _find_series_by_references(project, mail): refs = [h] + refs for ref in refs: try: - return SeriesReference.objects.get( + series = SeriesReference.objects.get( msgid=ref[:255], project=project).series + # we want to return a queryset like '_find_series_by_markers' + return Series.objects.filter(id=series.id) except SeriesReference.DoesNotExist: continue @@ -294,12 +301,9 @@ def _find_series_by_markers(project, mail, author): start_date = date - delta end_date = date + delta - try: - return Series.objects.get( - submitter=author, project=project, version=version, total=total, - date__range=[start_date, end_date]) - except (Series.DoesNotExist, Series.MultipleObjectsReturned): - return + return Series.objects.filter( + submitter=author, project=project, version=version, total=total, + date__range=[start_date, end_date]) def find_series(project, mail, author): @@ -1094,6 +1098,23 @@ def parse_mail(mail, list_id=None): series = None if n: series = find_series(project, mail, author) + if len(series) > 1: + # if this isn't our final attempt, go again + if attempt != 10: + raise DuplicateSeriesError() + + # if it is and we still haven't been able to save a + # series, it's time to give up and just save yet + # another duplicate - find the best possible match + for series in series.order_by('-date'): + if Patch.objects.filter( + series=series, number=x).count(): + continue + break + else: + series = None + elif len(series) == 1: + series = series.first() else: x = n = 1 @@ -1130,8 +1151,16 @@ def parse_mail(mail, list_id=None): except SeriesReference.DoesNotExist: SeriesReference.objects.create( msgid=ref, project=project, series=series) + + # attempt to pull the series in again, raising an + # exception if we lost the race when creating a series + # and force us to go through this again + if attempt != 10 and find_series( + project, mail, author).count() > 1: + raise DuplicateSeriesError() + break - except IntegrityError: + except (IntegrityError, DuplicateSeriesError): # we lost the race so go again logger.warning('Conflict while saving series. This is ' 'probably because multiple patches belonging ' diff --git a/patchwork/tests/test_parser.py b/patchwork/tests/test_parser.py index 0edbd87a..4a85fd42 100644 --- a/patchwork/tests/test_parser.py +++ b/patchwork/tests/test_parser.py @@ -400,8 +400,8 @@ class SeriesCorrelationTest(TestCase): email = self._create_email(msgid) project = create_project() - self.assertIsNone(find_series(project, email, - get_or_create_author(email))) + self.assertFalse(find_series(project, email, + get_or_create_author(email))) def test_first_reply(self): msgid_a = make_msgid() @@ -413,7 +413,7 @@ class SeriesCorrelationTest(TestCase): series = find_series(ref.series.project, email, get_or_create_author(email)) - self.assertEqual(series, ref.series) + self.assertEqual(series.first(), ref.series) def test_nested_series(self): """Handle a series sent in-reply-to an existing series.""" @@ -443,7 +443,7 @@ class SeriesCorrelationTest(TestCase): # this should link to the second series - not the first self.assertEqual(len(msgids), 4 + 1) # old series + new cover - self.assertEqual(series, ref_v2.series) + self.assertEqual(series.first(), ref_v2.series) class SubjectEncodingTest(TestCase):