[v4,1/3] models: Use database constraints to prevent split Series
diff mbox series

Message ID 20191201171045.97913-1-stephen@that.guru
State Accepted
Headers show
Series
  • [v4,1/3] models: Use database constraints to prevent split Series
Related show

Commit Message

Stephen Finucane Dec. 1, 2019, 5:10 p.m. UTC
Currently, the 'SeriesReference' object has a unique constraint on the
two fields it has, 'series', which is a foreign key to 'Series', and
'msgid'. This is the wrong constraint. What we actually want to enforce
is that a patch, cover letter or comment is referenced by a single
series, or rather a single series per project the submission appears on.
As such, we should be enforcing uniqueness on the msgid and the project
that the patch, cover letter or comment belongs to.

This requires adding a new field to the object, 'project', since it's
not possible to do something like the following:

  unique_together = [('msgid', 'series__project')]

This is detailed here [1]. In addition, the migration needs a precursor
migration step to merge any broken series.

[1] https://stackoverflow.com/a/4440189/613428

Signed-off-by: Stephen Finucane <stephen@that.guru>
Closes: #241
Cc: Daniel Axtens <dja@axtens.net>
Cc: Petr Vorel <petr.vorel@gmail.com>
---
v4:
- Further tweaks to logging
v3:
- Improve logging
---
 .../0038_unique_series_references.py          | 121 +++++++++++++++++
 patchwork/models.py                           |   6 +-
 patchwork/parser.py                           | 123 +++++++++---------
 patchwork/tests/test_parser.py                |   9 +-
 patchwork/tests/utils.py                      |   6 +-
 5 files changed, 199 insertions(+), 66 deletions(-)
 create mode 100644 patchwork/migrations/0038_unique_series_references.py

Comments

Daniel Axtens Dec. 4, 2019, 6:05 a.m. UTC | #1
I only see  patch 1 of apparently 3 patches?

I'm not super keen on the whole retry logic, and I think I have a way to
do this that doesn't require that - give me a few days to try and
polish it up enough to send out.

Regards,
Daniel

Stephen Finucane <stephen@that.guru> writes:

> Currently, the 'SeriesReference' object has a unique constraint on the
> two fields it has, 'series', which is a foreign key to 'Series', and
> 'msgid'. This is the wrong constraint. What we actually want to enforce
> is that a patch, cover letter or comment is referenced by a single
> series, or rather a single series per project the submission appears on.
> As such, we should be enforcing uniqueness on the msgid and the project
> that the patch, cover letter or comment belongs to.
>
> This requires adding a new field to the object, 'project', since it's
> not possible to do something like the following:
>
>   unique_together = [('msgid', 'series__project')]
>
> This is detailed here [1]. In addition, the migration needs a precursor
> migration step to merge any broken series.
>
> [1] https://stackoverflow.com/a/4440189/613428
>
> Signed-off-by: Stephen Finucane <stephen@that.guru>
> Closes: #241
> Cc: Daniel Axtens <dja@axtens.net>
> Cc: Petr Vorel <petr.vorel@gmail.com>
> ---
> v4:
> - Further tweaks to logging
> v3:
> - Improve logging
> ---
>  .../0038_unique_series_references.py          | 121 +++++++++++++++++
>  patchwork/models.py                           |   6 +-
>  patchwork/parser.py                           | 123 +++++++++---------
>  patchwork/tests/test_parser.py                |   9 +-
>  patchwork/tests/utils.py                      |   6 +-
>  5 files changed, 199 insertions(+), 66 deletions(-)
>  create mode 100644 patchwork/migrations/0038_unique_series_references.py
>
> diff --git a/patchwork/migrations/0038_unique_series_references.py b/patchwork/migrations/0038_unique_series_references.py
> new file mode 100644
> index 00000000..91799020
> --- /dev/null
> +++ b/patchwork/migrations/0038_unique_series_references.py
> @@ -0,0 +1,121 @@
> +from django.db import connection, migrations, models
> +from django.db.models import Count
> +import django.db.models.deletion
> +
> +
> +def merge_duplicate_series(apps, schema_editor):
> +    SeriesReference = apps.get_model('patchwork', 'SeriesReference')
> +    Patch = apps.get_model('patchwork', 'Patch')
> +
> +    msgid_seriesrefs = {}
> +
> +    # find all SeriesReference that share a msgid but point to different series
> +    # and decide which of the series is going to be the authoritative one
> +    msgid_counts = (
> +        SeriesReference.objects.values('msgid')
> +        .annotate(count=Count('msgid'))
> +        .filter(count__gt=1)
> +    )
> +    for msgid_count in msgid_counts:
> +        msgid = msgid_count['msgid']
> +        chosen_ref = None
> +        for series_ref in SeriesReference.objects.filter(msgid=msgid):
> +            if series_ref.series.cover_letter:
> +                if chosen_ref:
> +                    # I don't think this can happen, but explode if it does
> +                    raise Exception(
> +                        "Looks like you've got two or more series that share "
> +                        "some patches but do not share a cover letter. Unable "
> +                        "to auto-resolve."
> +                    )
> +
> +                # if a series has a cover letter, that's the one we'll group
> +                # everything under
> +                chosen_ref = series_ref
> +
> +        if not chosen_ref:
> +            # if none of the series have cover letters, simply use the last
> +            # one (hint: this relies on Python's weird scoping for for loops
> +            # where 'series_ref' is still accessible outside the loop)
> +            chosen_ref = series_ref
> +
> +        msgid_seriesrefs[msgid] = chosen_ref
> +
> +    # reassign any patches referring to non-authoritative series to point to
> +    # the authoritative one, and delete the other series; we do this separately
> +    # to allow us a chance to raise the exception above if necessary
> +    for msgid, chosen_ref in msgid_seriesrefs.items():
> +        for series_ref in SeriesReference.objects.filter(msgid=msgid):
> +            if series_ref == chosen_ref:
> +                continue
> +
> +            # update the patches to point to our chosen series instead, on the
> +            # assumption that all other metadata is correct
> +            for patch in Patch.objects.filter(series=series_ref.series):
> +                patch.series = chosen_ref.series
> +                patch.save()
> +
> +            # delete the other series (which will delete the series ref)
> +            series_ref.series.delete()
> +
> +
> +def copy_project_field(apps, schema_editor):
> +    if connection.vendor == 'postgresql':
> +        schema_editor.execute(
> +            """
> +            UPDATE patchwork_seriesreference
> +              SET project_id = patchwork_series.project_id
> +            FROM patchwork_series
> +              WHERE patchwork_seriesreference.series_id = patchwork_series.id
> +        """
> +        )
> +    elif connection.vendor == 'mysql':
> +        schema_editor.execute(
> +            """
> +            UPDATE patchwork_seriesreference, patchwork_series
> +              SET patchwork_seriesreference.project_id = patchwork_series.project_id
> +            WHERE patchwork_seriesreference.series_id = patchwork_series.id
> +        """
> +        )  # noqa
> +    else:
> +        SeriesReference = apps.get_model('patchwork', 'SeriesReference')
> +
> +        for series_ref in SeriesReference.objects.all().select_related(
> +            'series'
> +        ):
> +            series_ref.project = series_ref.series.project
> +            series_ref.save()
> +
> +
> +class Migration(migrations.Migration):
> +
> +    dependencies = [('patchwork', '0037_event_actor')]
> +
> +    operations = [
> +        migrations.RunPython(
> +            merge_duplicate_series, migrations.RunPython.noop, atomic=False
> +        ),
> +        migrations.AddField(
> +            model_name='seriesreference',
> +            name='project',
> +            field=models.ForeignKey(
> +                null=True,
> +                on_delete=django.db.models.deletion.CASCADE,
> +                to='patchwork.Project',
> +            ),
> +        ),
> +        migrations.AlterUniqueTogether(
> +            name='seriesreference', unique_together={('project', 'msgid')}
> +        ),
> +        migrations.RunPython(
> +            copy_project_field, migrations.RunPython.noop, atomic=False
> +        ),
> +        migrations.AlterField(
> +            model_name='seriesreference',
> +            name='project',
> +            field=models.ForeignKey(
> +                on_delete=django.db.models.deletion.CASCADE,
> +                to='patchwork.Project',
> +            ),
> +        ),
> +    ]
> diff --git a/patchwork/models.py b/patchwork/models.py
> index 7f0efd48..dbaff024 100644
> --- a/patchwork/models.py
> +++ b/patchwork/models.py
> @@ -79,7 +79,8 @@ class Project(models.Model):
>      webscm_url = models.CharField(max_length=2000, blank=True)
>      list_archive_url = models.CharField(max_length=2000, blank=True)
>      list_archive_url_format = models.CharField(
> -        max_length=2000, blank=True,
> +        max_length=2000,
> +        blank=True,
>          help_text="URL format for the list archive's Message-ID redirector. "
>          "{} will be replaced by the Message-ID.")
>      commit_url_format = models.CharField(
> @@ -787,6 +788,7 @@ class SeriesReference(models.Model):
>      required to handle the case whereby one or more patches are
>      received before the cover letter.
>      """
> +    project = models.ForeignKey(Project, on_delete=models.CASCADE)
>      series = models.ForeignKey(Series, related_name='references',
>                                 related_query_name='reference',
>                                 on_delete=models.CASCADE)
> @@ -796,7 +798,7 @@ class SeriesReference(models.Model):
>          return self.msgid
>  
>      class Meta:
> -        unique_together = [('series', 'msgid')]
> +        unique_together = [('project', 'msgid')]
>  
>  
>  class Bundle(models.Model):
> diff --git a/patchwork/parser.py b/patchwork/parser.py
> index c424729b..c0084cc0 100644
> --- a/patchwork/parser.py
> +++ b/patchwork/parser.py
> @@ -16,6 +16,7 @@ import re
>  
>  from django.contrib.auth.models import User
>  from django.db.utils import IntegrityError
> +from django.db import transaction
>  from django.utils import six
>  
>  from patchwork.models import Comment
> @@ -256,16 +257,9 @@ def _find_series_by_references(project, mail):
>      for ref in refs:
>          try:
>              return SeriesReference.objects.get(
> -                msgid=ref[:255], series__project=project).series
> +                msgid=ref[:255], project=project).series
>          except SeriesReference.DoesNotExist:
>              continue
> -        except SeriesReference.MultipleObjectsReturned:
> -            # FIXME: Open bug: this can happen when we're processing
> -            # messages in parallel. Pick the first and log.
> -            logger.error("Multiple SeriesReferences for %s in project %s!" %
> -                         (ref[:255], project.name))
> -            return SeriesReference.objects.filter(
> -                msgid=ref[:255], series__project=project).first().series
>  
>  
>  def _find_series_by_markers(project, mail, author):
> @@ -1092,47 +1086,65 @@ def parse_mail(mail, list_id=None):
>          except IntegrityError:
>              raise DuplicateMailError(msgid=msgid)
>  
> -        # if we don't have a series marker, we will never have an existing
> -        # series to match against.
> -        series = None
> -        if n:
> -            series = find_series(project, mail, author)
> +        for attempt in range(1, 11):  # arbitrary retry count
> +            try:
> +                with transaction.atomic():
> +                    # if we don't have a series marker, we will never have an
> +                    # existing series to match against.
> +                    series = None
> +                    if n:
> +                        series = find_series(project, mail, author)
> +                    else:
> +                        x = n = 1
> +
> +                    # We will create a new series if:
> +                    # - there is no existing series to assign this patch to, or
> +                    # - there is an existing series, but it already has a patch
> +                    #   with this number in it
> +                    if not series or Patch.objects.filter(
> +                            series=series, number=x).count():
> +                        series = Series.objects.create(
> +                            project=project,
> +                            date=date,
> +                            submitter=author,
> +                            version=version,
> +                            total=n)
> +
> +                        # NOTE(stephenfin) We must save references for series.
> +                        # We do this to handle the case where a later patch is
> +                        # received first. Without storing references, it would
> +                        # not be possible to identify the relationship between
> +                        # patches as the earlier patch does not reference the
> +                        # later one.
> +                        for ref in refs + [msgid]:
> +                            ref = ref[:255]
> +                            # we don't want duplicates
> +                            try:
> +                                # we could have a ref to a previous series.
> +                                # (For example, a series sent in reply to
> +                                # another series.) That should not create a
> +                                # series ref for this series, so check for the
> +                                # msg-id only, not the msg-id/series pair.
> +                                SeriesReference.objects.get(
> +                                    msgid=ref, project=project)
> +                            except SeriesReference.DoesNotExist:
> +                                SeriesReference.objects.create(
> +                                    msgid=ref, project=project, series=series)
> +                    break
> +            except IntegrityError:
> +                # we lost the race so go again
> +                logger.warning('Conflict while saving series. This is '
> +                               'probably because multiple patches belonging '
> +                               'to the same series have been received at '
> +                               'once. Trying again (attempt %02d/10)',
> +                               attempt)
>          else:
> -            x = n = 1
> -
> -        # We will create a new series if:
> -        # - there is no existing series to assign this patch to, or
> -        # - there is an existing series, but it already has a patch with this
> -        #   number in it
> -        if not series or Patch.objects.filter(series=series, number=x).count():
> -            series = Series.objects.create(
> -                project=project,
> -                date=date,
> -                submitter=author,
> -                version=version,
> -                total=n)
> -
> -            # NOTE(stephenfin) We must save references for series. We
> -            # do this to handle the case where a later patch is
> -            # received first. Without storing references, it would not
> -            # be possible to identify the relationship between patches
> -            # as the earlier patch does not reference the later one.
> -            for ref in refs + [msgid]:
> -                ref = ref[:255]
> -                # we don't want duplicates
> -                try:
> -                    # we could have a ref to a previous series. (For
> -                    # example, a series sent in reply to another
> -                    # series.) That should not create a series ref
> -                    # for this series, so check for the msg-id only,
> -                    # not the msg-id/series pair.
> -                    SeriesReference.objects.get(msgid=ref,
> -                                                series__project=project)
> -                except SeriesReference.DoesNotExist:
> -                    SeriesReference.objects.create(series=series, msgid=ref)
> -                except SeriesReference.MultipleObjectsReturned:
> -                    logger.error("Multiple SeriesReferences for %s"
> -                                 " in project %s!" % (ref, project.name))
> +            # we failed to save the series so return the series-less patch
> +            logger.warning('Failed to save series. Your patch with message ID '
> +                           '%s has been saved but this should not happen. '
> +                           'Please report this as a bug and include logs',
> +                           msgid)
> +            return patch
>  
>          # add to a series if we have found one, and we have a numbered
>          # patch. Don't add unnumbered patches (for example diffs sent
> @@ -1170,14 +1182,9 @@ def parse_mail(mail, list_id=None):
>              # message
>              try:
>                  series = SeriesReference.objects.get(
> -                    msgid=msgid, series__project=project).series
> +                    msgid=msgid, project=project).series
>              except SeriesReference.DoesNotExist:
>                  series = None
> -            except SeriesReference.MultipleObjectsReturned:
> -                logger.error("Multiple SeriesReferences for %s"
> -                             " in project %s!" % (msgid, project.name))
> -                series = SeriesReference.objects.filter(
> -                    msgid=msgid, series__project=project).first().series
>  
>              if not series:
>                  series = Series.objects.create(
> @@ -1190,12 +1197,8 @@ def parse_mail(mail, list_id=None):
>                  # we don't save the in-reply-to or references fields
>                  # for a cover letter, as they can't refer to the same
>                  # series
> -                try:
> -                    SeriesReference.objects.get_or_create(series=series,
> -                                                          msgid=msgid)
> -                except SeriesReference.MultipleObjectsReturned:
> -                    logger.error("Multiple SeriesReferences for %s"
> -                                 " in project %s!" % (msgid, project.name))
> +                SeriesReference.objects.create(
> +                    msgid=msgid, project=project, series=series)
>  
>              try:
>                  cover_letter = CoverLetter.objects.create(
> diff --git a/patchwork/tests/test_parser.py b/patchwork/tests/test_parser.py
> index 0bf71585..0edbd87a 100644
> --- a/patchwork/tests/test_parser.py
> +++ b/patchwork/tests/test_parser.py
> @@ -421,17 +421,20 @@ class SeriesCorrelationTest(TestCase):
>          msgids = [make_msgid()]
>          project = create_project()
>          series_v1 = create_series(project=project)
> -        create_series_reference(msgid=msgids[0], series=series_v1)
> +        create_series_reference(msgid=msgids[0], series=series_v1,
> +                                project=project)
>  
>          # ...and three patches
>          for i in range(3):
>              msgids.append(make_msgid())
> -            create_series_reference(msgid=msgids[-1], series=series_v1)
> +            create_series_reference(msgid=msgids[-1], series=series_v1,
> +                                    project=project)
>  
>          # now create a new series with "cover letter"
>          msgids.append(make_msgid())
>          series_v2 = create_series(project=project)
> -        ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2)
> +        ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2,
> +                                         project=project)
>  
>          # ...and the "first patch" of this new series
>          msgid = make_msgid()
> diff --git a/patchwork/tests/utils.py b/patchwork/tests/utils.py
> index 577183d0..4ead1781 100644
> --- a/patchwork/tests/utils.py
> +++ b/patchwork/tests/utils.py
> @@ -282,8 +282,12 @@ def create_series(**kwargs):
>  
>  def create_series_reference(**kwargs):
>      """Create 'SeriesReference' object."""
> +    project = kwargs.pop('project', create_project())
> +    series = kwargs.pop('series', create_series(project=project))
> +
>      values = {
> -        'series': create_series() if 'series' not in kwargs else None,
> +        'series': series,
> +        'project': project,
>          'msgid': make_msgid(),
>      }
>      values.update(**kwargs)
> -- 
> 2.23.0
Daniel Axtens Dec. 7, 2019, 3:13 p.m. UTC | #2
> Currently, the 'SeriesReference' object has a unique constraint on the
> two fields it has, 'series', which is a foreign key to 'Series', and
> 'msgid'. This is the wrong constraint. What we actually want to enforce
> is that a patch, cover letter or comment is referenced by a single
> series, or rather a single series per project the submission appears on.
> As such, we should be enforcing uniqueness on the msgid and the project
> that the patch, cover letter or comment belongs to.


OK, so I've been working on this and making good progress.

My fundamental issue with the current design, and this extension of it,
is that we're ignoring or working against the tools that databases
provide for concurrency. This patch is an improvement, no doubt, but I'd
really like to solve it properly.

Because we have multiple processes operating on the database - processes
without external syncronisation - we need to rely on the database to
provide the synchronisation primitive. Currently we're doing things like
asking the database to check if something exists and then creating it if
it does not. This is embeds TOCTOU problems, similar to the ones I fixed
for Person creation a while back. What we should be doing is
get_or_create style - basically more aggressively attempting to create
the thing we need and if that fails, then do the lookup for the existing
object.

This design pervades the series parser. It doesn't generally bite us
because we generally, more or less, parse in series.

I'm basically refactoring the parser to be
'try-creating-and-catch-failures' rather than
'lookup-and-create-if-missing'. It is a bit tricky but I think I'm at
least 80% of the way there.

I would be happy to leave my series parsing rework and Mete's work to a
2.3, or I'd be happy to try and grind through, get these done, and make
2.2 the final release in the 2.x series. What do you think?

Regards,
Daniel

Stephen Finucane <stephen@that.guru> writes:

>
> This requires adding a new field to the object, 'project', since it's
> not possible to do something like the following:
>
>   unique_together = [('msgid', 'series__project')]
>
> This is detailed here [1]. In addition, the migration needs a precursor
> migration step to merge any broken series.
>
> [1] https://stackoverflow.com/a/4440189/613428
>
> Signed-off-by: Stephen Finucane <stephen@that.guru>
> Closes: #241
> Cc: Daniel Axtens <dja@axtens.net>
> Cc: Petr Vorel <petr.vorel@gmail.com>
> ---
> v4:
> - Further tweaks to logging
> v3:
> - Improve logging
> ---
>  .../0038_unique_series_references.py          | 121 +++++++++++++++++
>  patchwork/models.py                           |   6 +-
>  patchwork/parser.py                           | 123 +++++++++---------
>  patchwork/tests/test_parser.py                |   9 +-
>  patchwork/tests/utils.py                      |   6 +-
>  5 files changed, 199 insertions(+), 66 deletions(-)
>  create mode 100644 patchwork/migrations/0038_unique_series_references.py
>
> diff --git a/patchwork/migrations/0038_unique_series_references.py b/patchwork/migrations/0038_unique_series_references.py
> new file mode 100644
> index 00000000..91799020
> --- /dev/null
> +++ b/patchwork/migrations/0038_unique_series_references.py
> @@ -0,0 +1,121 @@
> +from django.db import connection, migrations, models
> +from django.db.models import Count
> +import django.db.models.deletion
> +
> +
> +def merge_duplicate_series(apps, schema_editor):
> +    SeriesReference = apps.get_model('patchwork', 'SeriesReference')
> +    Patch = apps.get_model('patchwork', 'Patch')
> +
> +    msgid_seriesrefs = {}
> +
> +    # find all SeriesReference that share a msgid but point to different series
> +    # and decide which of the series is going to be the authoritative one
> +    msgid_counts = (
> +        SeriesReference.objects.values('msgid')
> +        .annotate(count=Count('msgid'))
> +        .filter(count__gt=1)
> +    )
> +    for msgid_count in msgid_counts:
> +        msgid = msgid_count['msgid']
> +        chosen_ref = None
> +        for series_ref in SeriesReference.objects.filter(msgid=msgid):
> +            if series_ref.series.cover_letter:
> +                if chosen_ref:
> +                    # I don't think this can happen, but explode if it does
> +                    raise Exception(
> +                        "Looks like you've got two or more series that share "
> +                        "some patches but do not share a cover letter. Unable "
> +                        "to auto-resolve."
> +                    )
> +
> +                # if a series has a cover letter, that's the one we'll group
> +                # everything under
> +                chosen_ref = series_ref
> +
> +        if not chosen_ref:
> +            # if none of the series have cover letters, simply use the last
> +            # one (hint: this relies on Python's weird scoping for for loops
> +            # where 'series_ref' is still accessible outside the loop)
> +            chosen_ref = series_ref
> +
> +        msgid_seriesrefs[msgid] = chosen_ref
> +
> +    # reassign any patches referring to non-authoritative series to point to
> +    # the authoritative one, and delete the other series; we do this separately
> +    # to allow us a chance to raise the exception above if necessary
> +    for msgid, chosen_ref in msgid_seriesrefs.items():
> +        for series_ref in SeriesReference.objects.filter(msgid=msgid):
> +            if series_ref == chosen_ref:
> +                continue
> +
> +            # update the patches to point to our chosen series instead, on the
> +            # assumption that all other metadata is correct
> +            for patch in Patch.objects.filter(series=series_ref.series):
> +                patch.series = chosen_ref.series
> +                patch.save()
> +
> +            # delete the other series (which will delete the series ref)
> +            series_ref.series.delete()
> +
> +
> +def copy_project_field(apps, schema_editor):
> +    if connection.vendor == 'postgresql':
> +        schema_editor.execute(
> +            """
> +            UPDATE patchwork_seriesreference
> +              SET project_id = patchwork_series.project_id
> +            FROM patchwork_series
> +              WHERE patchwork_seriesreference.series_id = patchwork_series.id
> +        """
> +        )
> +    elif connection.vendor == 'mysql':
> +        schema_editor.execute(
> +            """
> +            UPDATE patchwork_seriesreference, patchwork_series
> +              SET patchwork_seriesreference.project_id = patchwork_series.project_id
> +            WHERE patchwork_seriesreference.series_id = patchwork_series.id
> +        """
> +        )  # noqa
> +    else:
> +        SeriesReference = apps.get_model('patchwork', 'SeriesReference')
> +
> +        for series_ref in SeriesReference.objects.all().select_related(
> +            'series'
> +        ):
> +            series_ref.project = series_ref.series.project
> +            series_ref.save()
> +
> +
> +class Migration(migrations.Migration):
> +
> +    dependencies = [('patchwork', '0037_event_actor')]
> +
> +    operations = [
> +        migrations.RunPython(
> +            merge_duplicate_series, migrations.RunPython.noop, atomic=False
> +        ),
> +        migrations.AddField(
> +            model_name='seriesreference',
> +            name='project',
> +            field=models.ForeignKey(
> +                null=True,
> +                on_delete=django.db.models.deletion.CASCADE,
> +                to='patchwork.Project',
> +            ),
> +        ),
> +        migrations.AlterUniqueTogether(
> +            name='seriesreference', unique_together={('project', 'msgid')}
> +        ),
> +        migrations.RunPython(
> +            copy_project_field, migrations.RunPython.noop, atomic=False
> +        ),
> +        migrations.AlterField(
> +            model_name='seriesreference',
> +            name='project',
> +            field=models.ForeignKey(
> +                on_delete=django.db.models.deletion.CASCADE,
> +                to='patchwork.Project',
> +            ),
> +        ),
> +    ]
> diff --git a/patchwork/models.py b/patchwork/models.py
> index 7f0efd48..dbaff024 100644
> --- a/patchwork/models.py
> +++ b/patchwork/models.py
> @@ -79,7 +79,8 @@ class Project(models.Model):
>      webscm_url = models.CharField(max_length=2000, blank=True)
>      list_archive_url = models.CharField(max_length=2000, blank=True)
>      list_archive_url_format = models.CharField(
> -        max_length=2000, blank=True,
> +        max_length=2000,
> +        blank=True,
>          help_text="URL format for the list archive's Message-ID redirector. "
>          "{} will be replaced by the Message-ID.")
>      commit_url_format = models.CharField(
> @@ -787,6 +788,7 @@ class SeriesReference(models.Model):
>      required to handle the case whereby one or more patches are
>      received before the cover letter.
>      """
> +    project = models.ForeignKey(Project, on_delete=models.CASCADE)
>      series = models.ForeignKey(Series, related_name='references',
>                                 related_query_name='reference',
>                                 on_delete=models.CASCADE)
> @@ -796,7 +798,7 @@ class SeriesReference(models.Model):
>          return self.msgid
>  
>      class Meta:
> -        unique_together = [('series', 'msgid')]
> +        unique_together = [('project', 'msgid')]
>  
>  
>  class Bundle(models.Model):
> diff --git a/patchwork/parser.py b/patchwork/parser.py
> index c424729b..c0084cc0 100644
> --- a/patchwork/parser.py
> +++ b/patchwork/parser.py
> @@ -16,6 +16,7 @@ import re
>  
>  from django.contrib.auth.models import User
>  from django.db.utils import IntegrityError
> +from django.db import transaction
>  from django.utils import six
>  
>  from patchwork.models import Comment
> @@ -256,16 +257,9 @@ def _find_series_by_references(project, mail):
>      for ref in refs:
>          try:
>              return SeriesReference.objects.get(
> -                msgid=ref[:255], series__project=project).series
> +                msgid=ref[:255], project=project).series
>          except SeriesReference.DoesNotExist:
>              continue
> -        except SeriesReference.MultipleObjectsReturned:
> -            # FIXME: Open bug: this can happen when we're processing
> -            # messages in parallel. Pick the first and log.
> -            logger.error("Multiple SeriesReferences for %s in project %s!" %
> -                         (ref[:255], project.name))
> -            return SeriesReference.objects.filter(
> -                msgid=ref[:255], series__project=project).first().series
>  
>  
>  def _find_series_by_markers(project, mail, author):
> @@ -1092,47 +1086,65 @@ def parse_mail(mail, list_id=None):
>          except IntegrityError:
>              raise DuplicateMailError(msgid=msgid)
>  
> -        # if we don't have a series marker, we will never have an existing
> -        # series to match against.
> -        series = None
> -        if n:
> -            series = find_series(project, mail, author)
> +        for attempt in range(1, 11):  # arbitrary retry count
> +            try:
> +                with transaction.atomic():
> +                    # if we don't have a series marker, we will never have an
> +                    # existing series to match against.
> +                    series = None
> +                    if n:
> +                        series = find_series(project, mail, author)
> +                    else:
> +                        x = n = 1
> +
> +                    # We will create a new series if:
> +                    # - there is no existing series to assign this patch to, or
> +                    # - there is an existing series, but it already has a patch
> +                    #   with this number in it
> +                    if not series or Patch.objects.filter(
> +                            series=series, number=x).count():
> +                        series = Series.objects.create(
> +                            project=project,
> +                            date=date,
> +                            submitter=author,
> +                            version=version,
> +                            total=n)
> +
> +                        # NOTE(stephenfin) We must save references for series.
> +                        # We do this to handle the case where a later patch is
> +                        # received first. Without storing references, it would
> +                        # not be possible to identify the relationship between
> +                        # patches as the earlier patch does not reference the
> +                        # later one.
> +                        for ref in refs + [msgid]:
> +                            ref = ref[:255]
> +                            # we don't want duplicates
> +                            try:
> +                                # we could have a ref to a previous series.
> +                                # (For example, a series sent in reply to
> +                                # another series.) That should not create a
> +                                # series ref for this series, so check for the
> +                                # msg-id only, not the msg-id/series pair.
> +                                SeriesReference.objects.get(
> +                                    msgid=ref, project=project)
> +                            except SeriesReference.DoesNotExist:
> +                                SeriesReference.objects.create(
> +                                    msgid=ref, project=project, series=series)
> +                    break
> +            except IntegrityError:
> +                # we lost the race so go again
> +                logger.warning('Conflict while saving series. This is '
> +                               'probably because multiple patches belonging '
> +                               'to the same series have been received at '
> +                               'once. Trying again (attempt %02d/10)',
> +                               attempt)
>          else:
> -            x = n = 1
> -
> -        # We will create a new series if:
> -        # - there is no existing series to assign this patch to, or
> -        # - there is an existing series, but it already has a patch with this
> -        #   number in it
> -        if not series or Patch.objects.filter(series=series, number=x).count():
> -            series = Series.objects.create(
> -                project=project,
> -                date=date,
> -                submitter=author,
> -                version=version,
> -                total=n)
> -
> -            # NOTE(stephenfin) We must save references for series. We
> -            # do this to handle the case where a later patch is
> -            # received first. Without storing references, it would not
> -            # be possible to identify the relationship between patches
> -            # as the earlier patch does not reference the later one.
> -            for ref in refs + [msgid]:
> -                ref = ref[:255]
> -                # we don't want duplicates
> -                try:
> -                    # we could have a ref to a previous series. (For
> -                    # example, a series sent in reply to another
> -                    # series.) That should not create a series ref
> -                    # for this series, so check for the msg-id only,
> -                    # not the msg-id/series pair.
> -                    SeriesReference.objects.get(msgid=ref,
> -                                                series__project=project)
> -                except SeriesReference.DoesNotExist:
> -                    SeriesReference.objects.create(series=series, msgid=ref)
> -                except SeriesReference.MultipleObjectsReturned:
> -                    logger.error("Multiple SeriesReferences for %s"
> -                                 " in project %s!" % (ref, project.name))
> +            # we failed to save the series so return the series-less patch
> +            logger.warning('Failed to save series. Your patch with message ID '
> +                           '%s has been saved but this should not happen. '
> +                           'Please report this as a bug and include logs',
> +                           msgid)
> +            return patch
>  
>          # add to a series if we have found one, and we have a numbered
>          # patch. Don't add unnumbered patches (for example diffs sent
> @@ -1170,14 +1182,9 @@ def parse_mail(mail, list_id=None):
>              # message
>              try:
>                  series = SeriesReference.objects.get(
> -                    msgid=msgid, series__project=project).series
> +                    msgid=msgid, project=project).series
>              except SeriesReference.DoesNotExist:
>                  series = None
> -            except SeriesReference.MultipleObjectsReturned:
> -                logger.error("Multiple SeriesReferences for %s"
> -                             " in project %s!" % (msgid, project.name))
> -                series = SeriesReference.objects.filter(
> -                    msgid=msgid, series__project=project).first().series
>  
>              if not series:
>                  series = Series.objects.create(
> @@ -1190,12 +1197,8 @@ def parse_mail(mail, list_id=None):
>                  # we don't save the in-reply-to or references fields
>                  # for a cover letter, as they can't refer to the same
>                  # series
> -                try:
> -                    SeriesReference.objects.get_or_create(series=series,
> -                                                          msgid=msgid)
> -                except SeriesReference.MultipleObjectsReturned:
> -                    logger.error("Multiple SeriesReferences for %s"
> -                                 " in project %s!" % (msgid, project.name))
> +                SeriesReference.objects.create(
> +                    msgid=msgid, project=project, series=series)
>  
>              try:
>                  cover_letter = CoverLetter.objects.create(
> diff --git a/patchwork/tests/test_parser.py b/patchwork/tests/test_parser.py
> index 0bf71585..0edbd87a 100644
> --- a/patchwork/tests/test_parser.py
> +++ b/patchwork/tests/test_parser.py
> @@ -421,17 +421,20 @@ class SeriesCorrelationTest(TestCase):
>          msgids = [make_msgid()]
>          project = create_project()
>          series_v1 = create_series(project=project)
> -        create_series_reference(msgid=msgids[0], series=series_v1)
> +        create_series_reference(msgid=msgids[0], series=series_v1,
> +                                project=project)
>  
>          # ...and three patches
>          for i in range(3):
>              msgids.append(make_msgid())
> -            create_series_reference(msgid=msgids[-1], series=series_v1)
> +            create_series_reference(msgid=msgids[-1], series=series_v1,
> +                                    project=project)
>  
>          # now create a new series with "cover letter"
>          msgids.append(make_msgid())
>          series_v2 = create_series(project=project)
> -        ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2)
> +        ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2,
> +                                         project=project)
>  
>          # ...and the "first patch" of this new series
>          msgid = make_msgid()
> diff --git a/patchwork/tests/utils.py b/patchwork/tests/utils.py
> index 577183d0..4ead1781 100644
> --- a/patchwork/tests/utils.py
> +++ b/patchwork/tests/utils.py
> @@ -282,8 +282,12 @@ def create_series(**kwargs):
>  
>  def create_series_reference(**kwargs):
>      """Create 'SeriesReference' object."""
> +    project = kwargs.pop('project', create_project())
> +    series = kwargs.pop('series', create_series(project=project))
> +
>      values = {
> -        'series': create_series() if 'series' not in kwargs else None,
> +        'series': series,
> +        'project': project,
>          'msgid': make_msgid(),
>      }
>      values.update(**kwargs)
> -- 
> 2.23.0
Daniel Axtens Dec. 8, 2019, 11:14 a.m. UTC | #3
Daniel Axtens <dja@axtens.net> writes:

>> Currently, the 'SeriesReference' object has a unique constraint on the
>> two fields it has, 'series', which is a foreign key to 'Series', and
>> 'msgid'. This is the wrong constraint. What we actually want to enforce
>> is that a patch, cover letter or comment is referenced by a single
>> series, or rather a single series per project the submission appears on.
>> As such, we should be enforcing uniqueness on the msgid and the project
>> that the patch, cover letter or comment belongs to.
>
>
> OK, so I've been working on this and making good progress.
>
> My fundamental issue with the current design, and this extension of it,
> is that we're ignoring or working against the tools that databases
> provide for concurrency. This patch is an improvement, no doubt, but I'd
> really like to solve it properly.
>
> Because we have multiple processes operating on the database - processes
> without external syncronisation - we need to rely on the database to
> provide the synchronisation primitive. Currently we're doing things like
> asking the database to check if something exists and then creating it if
> it does not. This is embeds TOCTOU problems, similar to the ones I fixed
> for Person creation a while back. What we should be doing is
> get_or_create style - basically more aggressively attempting to create
> the thing we need and if that fails, then do the lookup for the existing
> object.
>
> This design pervades the series parser. It doesn't generally bite us
> because we generally, more or less, parse in series.
>
> I'm basically refactoring the parser to be
> 'try-creating-and-catch-failures' rather than
> 'lookup-and-create-if-missing'. It is a bit tricky but I think I'm at
> least 80% of the way there.

The more I think about this patch, the more I see it does actually move
us in the right direction. Let's run with it for now.

As for my rewrite, well, series parsing is really hard. I'm also working
from a data set that has a whole set of bizzare things including series
of patches in-reply-to a cover letter, a series of patches that's half
threaded and half unthreaded and a patch series that contains - by
design - multiple patches per "number" (i.e. 3 patches each numbered
1/N, 3 patches numbered 2/N...), so that means I'm forever fighting
weird edge-cases of how people used git a decade ago. So I'm going to
pass on trying to rewrite it for now.

I'll go over it with a finer-toothed comb and do a proper review, but I
withdraw my overall objection.

Regards,
Daniel
>
> I would be happy to leave my series parsing rework and Mete's work to a
> 2.3, or I'd be happy to try and grind through, get these done, and make
> 2.2 the final release in the 2.x series. What do you think?
>
> Regards,
> Daniel
>
> Stephen Finucane <stephen@that.guru> writes:
>
>>
>> This requires adding a new field to the object, 'project', since it's
>> not possible to do something like the following:
>>
>>   unique_together = [('msgid', 'series__project')]
>>
>> This is detailed here [1]. In addition, the migration needs a precursor
>> migration step to merge any broken series.
>>
>> [1] https://stackoverflow.com/a/4440189/613428
>>
>> Signed-off-by: Stephen Finucane <stephen@that.guru>
>> Closes: #241
>> Cc: Daniel Axtens <dja@axtens.net>
>> Cc: Petr Vorel <petr.vorel@gmail.com>
>> ---
>> v4:
>> - Further tweaks to logging
>> v3:
>> - Improve logging
>> ---
>>  .../0038_unique_series_references.py          | 121 +++++++++++++++++
>>  patchwork/models.py                           |   6 +-
>>  patchwork/parser.py                           | 123 +++++++++---------
>>  patchwork/tests/test_parser.py                |   9 +-
>>  patchwork/tests/utils.py                      |   6 +-
>>  5 files changed, 199 insertions(+), 66 deletions(-)
>>  create mode 100644 patchwork/migrations/0038_unique_series_references.py
>>
>> diff --git a/patchwork/migrations/0038_unique_series_references.py b/patchwork/migrations/0038_unique_series_references.py
>> new file mode 100644
>> index 00000000..91799020
>> --- /dev/null
>> +++ b/patchwork/migrations/0038_unique_series_references.py
>> @@ -0,0 +1,121 @@
>> +from django.db import connection, migrations, models
>> +from django.db.models import Count
>> +import django.db.models.deletion
>> +
>> +
>> +def merge_duplicate_series(apps, schema_editor):
>> +    SeriesReference = apps.get_model('patchwork', 'SeriesReference')
>> +    Patch = apps.get_model('patchwork', 'Patch')
>> +
>> +    msgid_seriesrefs = {}
>> +
>> +    # find all SeriesReference that share a msgid but point to different series
>> +    # and decide which of the series is going to be the authoritative one
>> +    msgid_counts = (
>> +        SeriesReference.objects.values('msgid')
>> +        .annotate(count=Count('msgid'))
>> +        .filter(count__gt=1)
>> +    )
>> +    for msgid_count in msgid_counts:
>> +        msgid = msgid_count['msgid']
>> +        chosen_ref = None
>> +        for series_ref in SeriesReference.objects.filter(msgid=msgid):
>> +            if series_ref.series.cover_letter:
>> +                if chosen_ref:
>> +                    # I don't think this can happen, but explode if it does
>> +                    raise Exception(
>> +                        "Looks like you've got two or more series that share "
>> +                        "some patches but do not share a cover letter. Unable "
>> +                        "to auto-resolve."
>> +                    )
>> +
>> +                # if a series has a cover letter, that's the one we'll group
>> +                # everything under
>> +                chosen_ref = series_ref
>> +
>> +        if not chosen_ref:
>> +            # if none of the series have cover letters, simply use the last
>> +            # one (hint: this relies on Python's weird scoping for for loops
>> +            # where 'series_ref' is still accessible outside the loop)
>> +            chosen_ref = series_ref
>> +
>> +        msgid_seriesrefs[msgid] = chosen_ref
>> +
>> +    # reassign any patches referring to non-authoritative series to point to
>> +    # the authoritative one, and delete the other series; we do this separately
>> +    # to allow us a chance to raise the exception above if necessary
>> +    for msgid, chosen_ref in msgid_seriesrefs.items():
>> +        for series_ref in SeriesReference.objects.filter(msgid=msgid):
>> +            if series_ref == chosen_ref:
>> +                continue
>> +
>> +            # update the patches to point to our chosen series instead, on the
>> +            # assumption that all other metadata is correct
>> +            for patch in Patch.objects.filter(series=series_ref.series):
>> +                patch.series = chosen_ref.series
>> +                patch.save()
>> +
>> +            # delete the other series (which will delete the series ref)
>> +            series_ref.series.delete()
>> +
>> +
>> +def copy_project_field(apps, schema_editor):
>> +    if connection.vendor == 'postgresql':
>> +        schema_editor.execute(
>> +            """
>> +            UPDATE patchwork_seriesreference
>> +              SET project_id = patchwork_series.project_id
>> +            FROM patchwork_series
>> +              WHERE patchwork_seriesreference.series_id = patchwork_series.id
>> +        """
>> +        )
>> +    elif connection.vendor == 'mysql':
>> +        schema_editor.execute(
>> +            """
>> +            UPDATE patchwork_seriesreference, patchwork_series
>> +              SET patchwork_seriesreference.project_id = patchwork_series.project_id
>> +            WHERE patchwork_seriesreference.series_id = patchwork_series.id
>> +        """
>> +        )  # noqa
>> +    else:
>> +        SeriesReference = apps.get_model('patchwork', 'SeriesReference')
>> +
>> +        for series_ref in SeriesReference.objects.all().select_related(
>> +            'series'
>> +        ):
>> +            series_ref.project = series_ref.series.project
>> +            series_ref.save()
>> +
>> +
>> +class Migration(migrations.Migration):
>> +
>> +    dependencies = [('patchwork', '0037_event_actor')]
>> +
>> +    operations = [
>> +        migrations.RunPython(
>> +            merge_duplicate_series, migrations.RunPython.noop, atomic=False
>> +        ),
>> +        migrations.AddField(
>> +            model_name='seriesreference',
>> +            name='project',
>> +            field=models.ForeignKey(
>> +                null=True,
>> +                on_delete=django.db.models.deletion.CASCADE,
>> +                to='patchwork.Project',
>> +            ),
>> +        ),
>> +        migrations.AlterUniqueTogether(
>> +            name='seriesreference', unique_together={('project', 'msgid')}
>> +        ),
>> +        migrations.RunPython(
>> +            copy_project_field, migrations.RunPython.noop, atomic=False
>> +        ),
>> +        migrations.AlterField(
>> +            model_name='seriesreference',
>> +            name='project',
>> +            field=models.ForeignKey(
>> +                on_delete=django.db.models.deletion.CASCADE,
>> +                to='patchwork.Project',
>> +            ),
>> +        ),
>> +    ]
>> diff --git a/patchwork/models.py b/patchwork/models.py
>> index 7f0efd48..dbaff024 100644
>> --- a/patchwork/models.py
>> +++ b/patchwork/models.py
>> @@ -79,7 +79,8 @@ class Project(models.Model):
>>      webscm_url = models.CharField(max_length=2000, blank=True)
>>      list_archive_url = models.CharField(max_length=2000, blank=True)
>>      list_archive_url_format = models.CharField(
>> -        max_length=2000, blank=True,
>> +        max_length=2000,
>> +        blank=True,
>>          help_text="URL format for the list archive's Message-ID redirector. "
>>          "{} will be replaced by the Message-ID.")
>>      commit_url_format = models.CharField(
>> @@ -787,6 +788,7 @@ class SeriesReference(models.Model):
>>      required to handle the case whereby one or more patches are
>>      received before the cover letter.
>>      """
>> +    project = models.ForeignKey(Project, on_delete=models.CASCADE)
>>      series = models.ForeignKey(Series, related_name='references',
>>                                 related_query_name='reference',
>>                                 on_delete=models.CASCADE)
>> @@ -796,7 +798,7 @@ class SeriesReference(models.Model):
>>          return self.msgid
>>  
>>      class Meta:
>> -        unique_together = [('series', 'msgid')]
>> +        unique_together = [('project', 'msgid')]
>>  
>>  
>>  class Bundle(models.Model):
>> diff --git a/patchwork/parser.py b/patchwork/parser.py
>> index c424729b..c0084cc0 100644
>> --- a/patchwork/parser.py
>> +++ b/patchwork/parser.py
>> @@ -16,6 +16,7 @@ import re
>>  
>>  from django.contrib.auth.models import User
>>  from django.db.utils import IntegrityError
>> +from django.db import transaction
>>  from django.utils import six
>>  
>>  from patchwork.models import Comment
>> @@ -256,16 +257,9 @@ def _find_series_by_references(project, mail):
>>      for ref in refs:
>>          try:
>>              return SeriesReference.objects.get(
>> -                msgid=ref[:255], series__project=project).series
>> +                msgid=ref[:255], project=project).series
>>          except SeriesReference.DoesNotExist:
>>              continue
>> -        except SeriesReference.MultipleObjectsReturned:
>> -            # FIXME: Open bug: this can happen when we're processing
>> -            # messages in parallel. Pick the first and log.
>> -            logger.error("Multiple SeriesReferences for %s in project %s!" %
>> -                         (ref[:255], project.name))
>> -            return SeriesReference.objects.filter(
>> -                msgid=ref[:255], series__project=project).first().series
>>  
>>  
>>  def _find_series_by_markers(project, mail, author):
>> @@ -1092,47 +1086,65 @@ def parse_mail(mail, list_id=None):
>>          except IntegrityError:
>>              raise DuplicateMailError(msgid=msgid)
>>  
>> -        # if we don't have a series marker, we will never have an existing
>> -        # series to match against.
>> -        series = None
>> -        if n:
>> -            series = find_series(project, mail, author)
>> +        for attempt in range(1, 11):  # arbitrary retry count
>> +            try:
>> +                with transaction.atomic():
>> +                    # if we don't have a series marker, we will never have an
>> +                    # existing series to match against.
>> +                    series = None
>> +                    if n:
>> +                        series = find_series(project, mail, author)
>> +                    else:
>> +                        x = n = 1
>> +
>> +                    # We will create a new series if:
>> +                    # - there is no existing series to assign this patch to, or
>> +                    # - there is an existing series, but it already has a patch
>> +                    #   with this number in it
>> +                    if not series or Patch.objects.filter(
>> +                            series=series, number=x).count():
>> +                        series = Series.objects.create(
>> +                            project=project,
>> +                            date=date,
>> +                            submitter=author,
>> +                            version=version,
>> +                            total=n)
>> +
>> +                        # NOTE(stephenfin) We must save references for series.
>> +                        # We do this to handle the case where a later patch is
>> +                        # received first. Without storing references, it would
>> +                        # not be possible to identify the relationship between
>> +                        # patches as the earlier patch does not reference the
>> +                        # later one.
>> +                        for ref in refs + [msgid]:
>> +                            ref = ref[:255]
>> +                            # we don't want duplicates
>> +                            try:
>> +                                # we could have a ref to a previous series.
>> +                                # (For example, a series sent in reply to
>> +                                # another series.) That should not create a
>> +                                # series ref for this series, so check for the
>> +                                # msg-id only, not the msg-id/series pair.
>> +                                SeriesReference.objects.get(
>> +                                    msgid=ref, project=project)
>> +                            except SeriesReference.DoesNotExist:
>> +                                SeriesReference.objects.create(
>> +                                    msgid=ref, project=project, series=series)
>> +                    break
>> +            except IntegrityError:
>> +                # we lost the race so go again
>> +                logger.warning('Conflict while saving series. This is '
>> +                               'probably because multiple patches belonging '
>> +                               'to the same series have been received at '
>> +                               'once. Trying again (attempt %02d/10)',
>> +                               attempt)
>>          else:
>> -            x = n = 1
>> -
>> -        # We will create a new series if:
>> -        # - there is no existing series to assign this patch to, or
>> -        # - there is an existing series, but it already has a patch with this
>> -        #   number in it
>> -        if not series or Patch.objects.filter(series=series, number=x).count():
>> -            series = Series.objects.create(
>> -                project=project,
>> -                date=date,
>> -                submitter=author,
>> -                version=version,
>> -                total=n)
>> -
>> -            # NOTE(stephenfin) We must save references for series. We
>> -            # do this to handle the case where a later patch is
>> -            # received first. Without storing references, it would not
>> -            # be possible to identify the relationship between patches
>> -            # as the earlier patch does not reference the later one.
>> -            for ref in refs + [msgid]:
>> -                ref = ref[:255]
>> -                # we don't want duplicates
>> -                try:
>> -                    # we could have a ref to a previous series. (For
>> -                    # example, a series sent in reply to another
>> -                    # series.) That should not create a series ref
>> -                    # for this series, so check for the msg-id only,
>> -                    # not the msg-id/series pair.
>> -                    SeriesReference.objects.get(msgid=ref,
>> -                                                series__project=project)
>> -                except SeriesReference.DoesNotExist:
>> -                    SeriesReference.objects.create(series=series, msgid=ref)
>> -                except SeriesReference.MultipleObjectsReturned:
>> -                    logger.error("Multiple SeriesReferences for %s"
>> -                                 " in project %s!" % (ref, project.name))
>> +            # we failed to save the series so return the series-less patch
>> +            logger.warning('Failed to save series. Your patch with message ID '
>> +                           '%s has been saved but this should not happen. '
>> +                           'Please report this as a bug and include logs',
>> +                           msgid)
>> +            return patch
>>  
>>          # add to a series if we have found one, and we have a numbered
>>          # patch. Don't add unnumbered patches (for example diffs sent
>> @@ -1170,14 +1182,9 @@ def parse_mail(mail, list_id=None):
>>              # message
>>              try:
>>                  series = SeriesReference.objects.get(
>> -                    msgid=msgid, series__project=project).series
>> +                    msgid=msgid, project=project).series
>>              except SeriesReference.DoesNotExist:
>>                  series = None
>> -            except SeriesReference.MultipleObjectsReturned:
>> -                logger.error("Multiple SeriesReferences for %s"
>> -                             " in project %s!" % (msgid, project.name))
>> -                series = SeriesReference.objects.filter(
>> -                    msgid=msgid, series__project=project).first().series
>>  
>>              if not series:
>>                  series = Series.objects.create(
>> @@ -1190,12 +1197,8 @@ def parse_mail(mail, list_id=None):
>>                  # we don't save the in-reply-to or references fields
>>                  # for a cover letter, as they can't refer to the same
>>                  # series
>> -                try:
>> -                    SeriesReference.objects.get_or_create(series=series,
>> -                                                          msgid=msgid)
>> -                except SeriesReference.MultipleObjectsReturned:
>> -                    logger.error("Multiple SeriesReferences for %s"
>> -                                 " in project %s!" % (msgid, project.name))
>> +                SeriesReference.objects.create(
>> +                    msgid=msgid, project=project, series=series)
>>  
>>              try:
>>                  cover_letter = CoverLetter.objects.create(
>> diff --git a/patchwork/tests/test_parser.py b/patchwork/tests/test_parser.py
>> index 0bf71585..0edbd87a 100644
>> --- a/patchwork/tests/test_parser.py
>> +++ b/patchwork/tests/test_parser.py
>> @@ -421,17 +421,20 @@ class SeriesCorrelationTest(TestCase):
>>          msgids = [make_msgid()]
>>          project = create_project()
>>          series_v1 = create_series(project=project)
>> -        create_series_reference(msgid=msgids[0], series=series_v1)
>> +        create_series_reference(msgid=msgids[0], series=series_v1,
>> +                                project=project)
>>  
>>          # ...and three patches
>>          for i in range(3):
>>              msgids.append(make_msgid())
>> -            create_series_reference(msgid=msgids[-1], series=series_v1)
>> +            create_series_reference(msgid=msgids[-1], series=series_v1,
>> +                                    project=project)
>>  
>>          # now create a new series with "cover letter"
>>          msgids.append(make_msgid())
>>          series_v2 = create_series(project=project)
>> -        ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2)
>> +        ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2,
>> +                                         project=project)
>>  
>>          # ...and the "first patch" of this new series
>>          msgid = make_msgid()
>> diff --git a/patchwork/tests/utils.py b/patchwork/tests/utils.py
>> index 577183d0..4ead1781 100644
>> --- a/patchwork/tests/utils.py
>> +++ b/patchwork/tests/utils.py
>> @@ -282,8 +282,12 @@ def create_series(**kwargs):
>>  
>>  def create_series_reference(**kwargs):
>>      """Create 'SeriesReference' object."""
>> +    project = kwargs.pop('project', create_project())
>> +    series = kwargs.pop('series', create_series(project=project))
>> +
>>      values = {
>> -        'series': create_series() if 'series' not in kwargs else None,
>> +        'series': series,
>> +        'project': project,
>>          'msgid': make_msgid(),
>>      }
>>      values.update(**kwargs)
>> -- 
>> 2.23.0
Stephen Finucane Dec. 8, 2019, 1:33 p.m. UTC | #4
On Sun, 2019-12-08 at 22:14 +1100, Daniel Axtens wrote:
> Daniel Axtens <dja@axtens.net> writes:
> 
> > > Currently, the 'SeriesReference' object has a unique constraint on the
> > > two fields it has, 'series', which is a foreign key to 'Series', and
> > > 'msgid'. This is the wrong constraint. What we actually want to enforce
> > > is that a patch, cover letter or comment is referenced by a single
> > > series, or rather a single series per project the submission appears on.
> > > As such, we should be enforcing uniqueness on the msgid and the project
> > > that the patch, cover letter or comment belongs to.
> > 
> > OK, so I've been working on this and making good progress.
> > 
> > My fundamental issue with the current design, and this extension of it,
> > is that we're ignoring or working against the tools that databases
> > provide for concurrency. This patch is an improvement, no doubt, but I'd
> > really like to solve it properly.
> > 
> > Because we have multiple processes operating on the database - processes
> > without external syncronisation - we need to rely on the database to
> > provide the synchronisation primitive. Currently we're doing things like
> > asking the database to check if something exists and then creating it if
> > it does not. This is embeds TOCTOU problems, similar to the ones I fixed
> > for Person creation a while back. What we should be doing is
> > get_or_create style - basically more aggressively attempting to create
> > the thing we need and if that fails, then do the lookup for the existing
> > object.
> > 
> > This design pervades the series parser. It doesn't generally bite us
> > because we generally, more or less, parse in series.
> > 
> > I'm basically refactoring the parser to be
> > 'try-creating-and-catch-failures' rather than
> > 'lookup-and-create-if-missing'. It is a bit tricky but I think I'm at
> > least 80% of the way there.
> 
> The more I think about this patch, the more I see it does actually move
> us in the right direction. Let's run with it for now.

That was going to be my response to your previous reply: does anything
I'm doing in that series wrt *the models* not make sense, and if not,
should we apply it anyway since the parser is easily reworked again in
the future, but model changes are a little more permanent.

> As for my rewrite, well, series parsing is really hard. I'm also working
> from a data set that has a whole set of bizzare things including series
> of patches in-reply-to a cover letter, a series of patches that's half
> threaded and half unthreaded and a patch series that contains - by
> design - multiple patches per "number" (i.e. 3 patches each numbered
> 1/N, 3 patches numbered 2/N...), so that means I'm forever fighting
> weird edge-cases of how people used git a decade ago. So I'm going to
> pass on trying to rewrite it for now.

Any chance you can share this dataset? The only one I have to work with
is the Patchwork one which isn't that large. I did request the DPDK
archives some time back but I haven't checked whether what I received
even works.

Stephen

> I'll go over it with a finer-toothed comb and do a proper review, but I
> withdraw my overall objection.
> 
> Regards,
> Daniel
> > I would be happy to leave my series parsing rework and Mete's work to a
> > 2.3, or I'd be happy to try and grind through, get these done, and make
> > 2.2 the final release in the 2.x series. What do you think?
> > 
> > Regards,
> > Daniel
> > 
> > Stephen Finucane <stephen@that.guru> writes:
> > 
> > > This requires adding a new field to the object, 'project', since it's
> > > not possible to do something like the following:
> > > 
> > >   unique_together = [('msgid', 'series__project')]
> > > 
> > > This is detailed here [1]. In addition, the migration needs a precursor
> > > migration step to merge any broken series.
> > > 
> > > [1] https://stackoverflow.com/a/4440189/613428
> > > 
> > > Signed-off-by: Stephen Finucane <stephen@that.guru>
> > > Closes: #241
> > > Cc: Daniel Axtens <dja@axtens.net>
> > > Cc: Petr Vorel <petr.vorel@gmail.com>
> > > ---
> > > v4:
> > > - Further tweaks to logging
> > > v3:
> > > - Improve logging
> > > ---
> > >  .../0038_unique_series_references.py          | 121 +++++++++++++++++
> > >  patchwork/models.py                           |   6 +-
> > >  patchwork/parser.py                           | 123 +++++++++---------
> > >  patchwork/tests/test_parser.py                |   9 +-
> > >  patchwork/tests/utils.py                      |   6 +-
> > >  5 files changed, 199 insertions(+), 66 deletions(-)
> > >  create mode 100644 patchwork/migrations/0038_unique_series_references.py
> > > 
> > > diff --git a/patchwork/migrations/0038_unique_series_references.py b/patchwork/migrations/0038_unique_series_references.py
> > > new file mode 100644
> > > index 00000000..91799020
> > > --- /dev/null
> > > +++ b/patchwork/migrations/0038_unique_series_references.py
> > > @@ -0,0 +1,121 @@
> > > +from django.db import connection, migrations, models
> > > +from django.db.models import Count
> > > +import django.db.models.deletion
> > > +
> > > +
> > > +def merge_duplicate_series(apps, schema_editor):
> > > +    SeriesReference = apps.get_model('patchwork', 'SeriesReference')
> > > +    Patch = apps.get_model('patchwork', 'Patch')
> > > +
> > > +    msgid_seriesrefs = {}
> > > +
> > > +    # find all SeriesReference that share a msgid but point to different series
> > > +    # and decide which of the series is going to be the authoritative one
> > > +    msgid_counts = (
> > > +        SeriesReference.objects.values('msgid')
> > > +        .annotate(count=Count('msgid'))
> > > +        .filter(count__gt=1)
> > > +    )
> > > +    for msgid_count in msgid_counts:
> > > +        msgid = msgid_count['msgid']
> > > +        chosen_ref = None
> > > +        for series_ref in SeriesReference.objects.filter(msgid=msgid):
> > > +            if series_ref.series.cover_letter:
> > > +                if chosen_ref:
> > > +                    # I don't think this can happen, but explode if it does
> > > +                    raise Exception(
> > > +                        "Looks like you've got two or more series that share "
> > > +                        "some patches but do not share a cover letter. Unable "
> > > +                        "to auto-resolve."
> > > +                    )
> > > +
> > > +                # if a series has a cover letter, that's the one we'll group
> > > +                # everything under
> > > +                chosen_ref = series_ref
> > > +
> > > +        if not chosen_ref:
> > > +            # if none of the series have cover letters, simply use the last
> > > +            # one (hint: this relies on Python's weird scoping for for loops
> > > +            # where 'series_ref' is still accessible outside the loop)
> > > +            chosen_ref = series_ref
> > > +
> > > +        msgid_seriesrefs[msgid] = chosen_ref
> > > +
> > > +    # reassign any patches referring to non-authoritative series to point to
> > > +    # the authoritative one, and delete the other series; we do this separately
> > > +    # to allow us a chance to raise the exception above if necessary
> > > +    for msgid, chosen_ref in msgid_seriesrefs.items():
> > > +        for series_ref in SeriesReference.objects.filter(msgid=msgid):
> > > +            if series_ref == chosen_ref:
> > > +                continue
> > > +
> > > +            # update the patches to point to our chosen series instead, on the
> > > +            # assumption that all other metadata is correct
> > > +            for patch in Patch.objects.filter(series=series_ref.series):
> > > +                patch.series = chosen_ref.series
> > > +                patch.save()
> > > +
> > > +            # delete the other series (which will delete the series ref)
> > > +            series_ref.series.delete()
> > > +
> > > +
> > > +def copy_project_field(apps, schema_editor):
> > > +    if connection.vendor == 'postgresql':
> > > +        schema_editor.execute(
> > > +            """
> > > +            UPDATE patchwork_seriesreference
> > > +              SET project_id = patchwork_series.project_id
> > > +            FROM patchwork_series
> > > +              WHERE patchwork_seriesreference.series_id = patchwork_series.id
> > > +        """
> > > +        )
> > > +    elif connection.vendor == 'mysql':
> > > +        schema_editor.execute(
> > > +            """
> > > +            UPDATE patchwork_seriesreference, patchwork_series
> > > +              SET patchwork_seriesreference.project_id = patchwork_series.project_id
> > > +            WHERE patchwork_seriesreference.series_id = patchwork_series.id
> > > +        """
> > > +        )  # noqa
> > > +    else:
> > > +        SeriesReference = apps.get_model('patchwork', 'SeriesReference')
> > > +
> > > +        for series_ref in SeriesReference.objects.all().select_related(
> > > +            'series'
> > > +        ):
> > > +            series_ref.project = series_ref.series.project
> > > +            series_ref.save()
> > > +
> > > +
> > > +class Migration(migrations.Migration):
> > > +
> > > +    dependencies = [('patchwork', '0037_event_actor')]
> > > +
> > > +    operations = [
> > > +        migrations.RunPython(
> > > +            merge_duplicate_series, migrations.RunPython.noop, atomic=False
> > > +        ),
> > > +        migrations.AddField(
> > > +            model_name='seriesreference',
> > > +            name='project',
> > > +            field=models.ForeignKey(
> > > +                null=True,
> > > +                on_delete=django.db.models.deletion.CASCADE,
> > > +                to='patchwork.Project',
> > > +            ),
> > > +        ),
> > > +        migrations.AlterUniqueTogether(
> > > +            name='seriesreference', unique_together={('project', 'msgid')}
> > > +        ),
> > > +        migrations.RunPython(
> > > +            copy_project_field, migrations.RunPython.noop, atomic=False
> > > +        ),
> > > +        migrations.AlterField(
> > > +            model_name='seriesreference',
> > > +            name='project',
> > > +            field=models.ForeignKey(
> > > +                on_delete=django.db.models.deletion.CASCADE,
> > > +                to='patchwork.Project',
> > > +            ),
> > > +        ),
> > > +    ]
> > > diff --git a/patchwork/models.py b/patchwork/models.py
> > > index 7f0efd48..dbaff024 100644
> > > --- a/patchwork/models.py
> > > +++ b/patchwork/models.py
> > > @@ -79,7 +79,8 @@ class Project(models.Model):
> > >      webscm_url = models.CharField(max_length=2000, blank=True)
> > >      list_archive_url = models.CharField(max_length=2000, blank=True)
> > >      list_archive_url_format = models.CharField(
> > > -        max_length=2000, blank=True,
> > > +        max_length=2000,
> > > +        blank=True,
> > >          help_text="URL format for the list archive's Message-ID redirector. "
> > >          "{} will be replaced by the Message-ID.")
> > >      commit_url_format = models.CharField(
> > > @@ -787,6 +788,7 @@ class SeriesReference(models.Model):
> > >      required to handle the case whereby one or more patches are
> > >      received before the cover letter.
> > >      """
> > > +    project = models.ForeignKey(Project, on_delete=models.CASCADE)
> > >      series = models.ForeignKey(Series, related_name='references',
> > >                                 related_query_name='reference',
> > >                                 on_delete=models.CASCADE)
> > > @@ -796,7 +798,7 @@ class SeriesReference(models.Model):
> > >          return self.msgid
> > >  
> > >      class Meta:
> > > -        unique_together = [('series', 'msgid')]
> > > +        unique_together = [('project', 'msgid')]
> > >  
> > >  
> > >  class Bundle(models.Model):
> > > diff --git a/patchwork/parser.py b/patchwork/parser.py
> > > index c424729b..c0084cc0 100644
> > > --- a/patchwork/parser.py
> > > +++ b/patchwork/parser.py
> > > @@ -16,6 +16,7 @@ import re
> > >  
> > >  from django.contrib.auth.models import User
> > >  from django.db.utils import IntegrityError
> > > +from django.db import transaction
> > >  from django.utils import six
> > >  
> > >  from patchwork.models import Comment
> > > @@ -256,16 +257,9 @@ def _find_series_by_references(project, mail):
> > >      for ref in refs:
> > >          try:
> > >              return SeriesReference.objects.get(
> > > -                msgid=ref[:255], series__project=project).series
> > > +                msgid=ref[:255], project=project).series
> > >          except SeriesReference.DoesNotExist:
> > >              continue
> > > -        except SeriesReference.MultipleObjectsReturned:
> > > -            # FIXME: Open bug: this can happen when we're processing
> > > -            # messages in parallel. Pick the first and log.
> > > -            logger.error("Multiple SeriesReferences for %s in project %s!" %
> > > -                         (ref[:255], project.name))
> > > -            return SeriesReference.objects.filter(
> > > -                msgid=ref[:255], series__project=project).first().series
> > >  
> > >  
> > >  def _find_series_by_markers(project, mail, author):
> > > @@ -1092,47 +1086,65 @@ def parse_mail(mail, list_id=None):
> > >          except IntegrityError:
> > >              raise DuplicateMailError(msgid=msgid)
> > >  
> > > -        # if we don't have a series marker, we will never have an existing
> > > -        # series to match against.
> > > -        series = None
> > > -        if n:
> > > -            series = find_series(project, mail, author)
> > > +        for attempt in range(1, 11):  # arbitrary retry count
> > > +            try:
> > > +                with transaction.atomic():
> > > +                    # if we don't have a series marker, we will never have an
> > > +                    # existing series to match against.
> > > +                    series = None
> > > +                    if n:
> > > +                        series = find_series(project, mail, author)
> > > +                    else:
> > > +                        x = n = 1
> > > +
> > > +                    # We will create a new series if:
> > > +                    # - there is no existing series to assign this patch to, or
> > > +                    # - there is an existing series, but it already has a patch
> > > +                    #   with this number in it
> > > +                    if not series or Patch.objects.filter(
> > > +                            series=series, number=x).count():
> > > +                        series = Series.objects.create(
> > > +                            project=project,
> > > +                            date=date,
> > > +                            submitter=author,
> > > +                            version=version,
> > > +                            total=n)
> > > +
> > > +                        # NOTE(stephenfin) We must save references for series.
> > > +                        # We do this to handle the case where a later patch is
> > > +                        # received first. Without storing references, it would
> > > +                        # not be possible to identify the relationship between
> > > +                        # patches as the earlier patch does not reference the
> > > +                        # later one.
> > > +                        for ref in refs + [msgid]:
> > > +                            ref = ref[:255]
> > > +                            # we don't want duplicates
> > > +                            try:
> > > +                                # we could have a ref to a previous series.
> > > +                                # (For example, a series sent in reply to
> > > +                                # another series.) That should not create a
> > > +                                # series ref for this series, so check for the
> > > +                                # msg-id only, not the msg-id/series pair.
> > > +                                SeriesReference.objects.get(
> > > +                                    msgid=ref, project=project)
> > > +                            except SeriesReference.DoesNotExist:
> > > +                                SeriesReference.objects.create(
> > > +                                    msgid=ref, project=project, series=series)
> > > +                    break
> > > +            except IntegrityError:
> > > +                # we lost the race so go again
> > > +                logger.warning('Conflict while saving series. This is '
> > > +                               'probably because multiple patches belonging '
> > > +                               'to the same series have been received at '
> > > +                               'once. Trying again (attempt %02d/10)',
> > > +                               attempt)
> > >          else:
> > > -            x = n = 1
> > > -
> > > -        # We will create a new series if:
> > > -        # - there is no existing series to assign this patch to, or
> > > -        # - there is an existing series, but it already has a patch with this
> > > -        #   number in it
> > > -        if not series or Patch.objects.filter(series=series, number=x).count():
> > > -            series = Series.objects.create(
> > > -                project=project,
> > > -                date=date,
> > > -                submitter=author,
> > > -                version=version,
> > > -                total=n)
> > > -
> > > -            # NOTE(stephenfin) We must save references for series. We
> > > -            # do this to handle the case where a later patch is
> > > -            # received first. Without storing references, it would not
> > > -            # be possible to identify the relationship between patches
> > > -            # as the earlier patch does not reference the later one.
> > > -            for ref in refs + [msgid]:
> > > -                ref = ref[:255]
> > > -                # we don't want duplicates
> > > -                try:
> > > -                    # we could have a ref to a previous series. (For
> > > -                    # example, a series sent in reply to another
> > > -                    # series.) That should not create a series ref
> > > -                    # for this series, so check for the msg-id only,
> > > -                    # not the msg-id/series pair.
> > > -                    SeriesReference.objects.get(msgid=ref,
> > > -                                                series__project=project)
> > > -                except SeriesReference.DoesNotExist:
> > > -                    SeriesReference.objects.create(series=series, msgid=ref)
> > > -                except SeriesReference.MultipleObjectsReturned:
> > > -                    logger.error("Multiple SeriesReferences for %s"
> > > -                                 " in project %s!" % (ref, project.name))
> > > +            # we failed to save the series so return the series-less patch
> > > +            logger.warning('Failed to save series. Your patch with message ID '
> > > +                           '%s has been saved but this should not happen. '
> > > +                           'Please report this as a bug and include logs',
> > > +                           msgid)
> > > +            return patch
> > >  
> > >          # add to a series if we have found one, and we have a numbered
> > >          # patch. Don't add unnumbered patches (for example diffs sent
> > > @@ -1170,14 +1182,9 @@ def parse_mail(mail, list_id=None):
> > >              # message
> > >              try:
> > >                  series = SeriesReference.objects.get(
> > > -                    msgid=msgid, series__project=project).series
> > > +                    msgid=msgid, project=project).series
> > >              except SeriesReference.DoesNotExist:
> > >                  series = None
> > > -            except SeriesReference.MultipleObjectsReturned:
> > > -                logger.error("Multiple SeriesReferences for %s"
> > > -                             " in project %s!" % (msgid, project.name))
> > > -                series = SeriesReference.objects.filter(
> > > -                    msgid=msgid, series__project=project).first().series
> > >  
> > >              if not series:
> > >                  series = Series.objects.create(
> > > @@ -1190,12 +1197,8 @@ def parse_mail(mail, list_id=None):
> > >                  # we don't save the in-reply-to or references fields
> > >                  # for a cover letter, as they can't refer to the same
> > >                  # series
> > > -                try:
> > > -                    SeriesReference.objects.get_or_create(series=series,
> > > -                                                          msgid=msgid)
> > > -                except SeriesReference.MultipleObjectsReturned:
> > > -                    logger.error("Multiple SeriesReferences for %s"
> > > -                                 " in project %s!" % (msgid, project.name))
> > > +                SeriesReference.objects.create(
> > > +                    msgid=msgid, project=project, series=series)
> > >  
> > >              try:
> > >                  cover_letter = CoverLetter.objects.create(
> > > diff --git a/patchwork/tests/test_parser.py b/patchwork/tests/test_parser.py
> > > index 0bf71585..0edbd87a 100644
> > > --- a/patchwork/tests/test_parser.py
> > > +++ b/patchwork/tests/test_parser.py
> > > @@ -421,17 +421,20 @@ class SeriesCorrelationTest(TestCase):
> > >          msgids = [make_msgid()]
> > >          project = create_project()
> > >          series_v1 = create_series(project=project)
> > > -        create_series_reference(msgid=msgids[0], series=series_v1)
> > > +        create_series_reference(msgid=msgids[0], series=series_v1,
> > > +                                project=project)
> > >  
> > >          # ...and three patches
> > >          for i in range(3):
> > >              msgids.append(make_msgid())
> > > -            create_series_reference(msgid=msgids[-1], series=series_v1)
> > > +            create_series_reference(msgid=msgids[-1], series=series_v1,
> > > +                                    project=project)
> > >  
> > >          # now create a new series with "cover letter"
> > >          msgids.append(make_msgid())
> > >          series_v2 = create_series(project=project)
> > > -        ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2)
> > > +        ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2,
> > > +                                         project=project)
> > >  
> > >          # ...and the "first patch" of this new series
> > >          msgid = make_msgid()
> > > diff --git a/patchwork/tests/utils.py b/patchwork/tests/utils.py
> > > index 577183d0..4ead1781 100644
> > > --- a/patchwork/tests/utils.py
> > > +++ b/patchwork/tests/utils.py
> > > @@ -282,8 +282,12 @@ def create_series(**kwargs):
> > >  
> > >  def create_series_reference(**kwargs):
> > >      """Create 'SeriesReference' object."""
> > > +    project = kwargs.pop('project', create_project())
> > > +    series = kwargs.pop('series', create_series(project=project))
> > > +
> > >      values = {
> > > -        'series': create_series() if 'series' not in kwargs else None,
> > > +        'series': series,
> > > +        'project': project,
> > >          'msgid': make_msgid(),
> > >      }
> > >      values.update(**kwargs)
> > > -- 
> > > 2.23.0
Daniel Axtens Dec. 8, 2019, 10:27 p.m. UTC | #5
Stephen Finucane <stephen@that.guru> writes:

> On Sun, 2019-12-08 at 22:14 +1100, Daniel Axtens wrote:
>> Daniel Axtens <dja@axtens.net> writes:
>> 
>> > > Currently, the 'SeriesReference' object has a unique constraint on the
>> > > two fields it has, 'series', which is a foreign key to 'Series', and
>> > > 'msgid'. This is the wrong constraint. What we actually want to enforce
>> > > is that a patch, cover letter or comment is referenced by a single
>> > > series, or rather a single series per project the submission appears on.
>> > > As such, we should be enforcing uniqueness on the msgid and the project
>> > > that the patch, cover letter or comment belongs to.
>> > 
>> > OK, so I've been working on this and making good progress.
>> > 
>> > My fundamental issue with the current design, and this extension of it,
>> > is that we're ignoring or working against the tools that databases
>> > provide for concurrency. This patch is an improvement, no doubt, but I'd
>> > really like to solve it properly.
>> > 
>> > Because we have multiple processes operating on the database - processes
>> > without external syncronisation - we need to rely on the database to
>> > provide the synchronisation primitive. Currently we're doing things like
>> > asking the database to check if something exists and then creating it if
>> > it does not. This is embeds TOCTOU problems, similar to the ones I fixed
>> > for Person creation a while back. What we should be doing is
>> > get_or_create style - basically more aggressively attempting to create
>> > the thing we need and if that fails, then do the lookup for the existing
>> > object.
>> > 
>> > This design pervades the series parser. It doesn't generally bite us
>> > because we generally, more or less, parse in series.
>> > 
>> > I'm basically refactoring the parser to be
>> > 'try-creating-and-catch-failures' rather than
>> > 'lookup-and-create-if-missing'. It is a bit tricky but I think I'm at
>> > least 80% of the way there.
>> 
>> The more I think about this patch, the more I see it does actually move
>> us in the right direction. Let's run with it for now.
>
> That was going to be my response to your previous reply: does anything
> I'm doing in that series wrt *the models* not make sense, and if not,
> should we apply it anyway since the parser is easily reworked again in
> the future, but model changes are a little more permanent.

I had wanted to make it so that you couldn't insert multiple series for
the same thread by adding the concept of a 'head message' to a
series. Then you could create a unique_together with (head message id,
project) and that would catch all the problems much earlier. I did end
up coding that, and it mostly works, but there are still edgecases and
bugs that I was working through.

>> As for my rewrite, well, series parsing is really hard. I'm also working
>> from a data set that has a whole set of bizzare things including series
>> of patches in-reply-to a cover letter, a series of patches that's half
>> threaded and half unthreaded and a patch series that contains - by
>> design - multiple patches per "number" (i.e. 3 patches each numbered
>> 1/N, 3 patches numbered 2/N...), so that means I'm forever fighting
>> weird edge-cases of how people used git a decade ago. So I'm going to
>> pass on trying to rewrite it for now.
>
> Any chance you can share this dataset? The only one I have to work with
> is the Patchwork one which isn't that large. I did request the DPDK
> archives some time back but I haven't checked whether what I received
> even works.

I have an mid-2017 copy of Ubuntu kernel-team mailing list archive,
which I probably picked because I worked there at the time. There's
still a link to download it online, and now it's even bigger:
https://lists.ubuntu.com/archives/kernel-team/ - right at the top of the
page there's a link to 'download the full raw archive'.

Some things to note is that the archive does have a bunch of messages
without Message-Ids(!) - looks like some poorly coded bots for regular
messages about kernel builds. There's also the usual browsable mailman
archive which I find very useful. I use it with my parallel parser,
although I've reduce the runahead distance to 20.

Some problems with working on this archive are that it's just _so_ large
it's hard to manually check things, and some of the really broken
behaviours are now over a decade old. (Not all though, check out 'Wily
[PATCH 00/22]- Please enable kconfig X86_LEGACY_VM86 for i386' at
https://lists.ubuntu.com/archives/kernel-team/2015-October/thread.html )

>> I'll go over it with a finer-toothed comb and do a proper review, but I
>> withdraw my overall objection.

Still going to do this, but I was thinking about it this morning and I
think you also need to wrap the series/seriesreference manipulation in
the coverletter path in an atomic context... you've made some changes to
that path, was there a reason not to do the same atomic context wrapping?

Regards,
Daniel


>> Regards,
>> Daniel
>> > I would be happy to leave my series parsing rework and Mete's work to a
>> > 2.3, or I'd be happy to try and grind through, get these done, and make
>> > 2.2 the final release in the 2.x series. What do you think?
>> > 
>> > Regards,
>> > Daniel
>> > 
>> > Stephen Finucane <stephen@that.guru> writes:
>> > 
>> > > This requires adding a new field to the object, 'project', since it's
>> > > not possible to do something like the following:
>> > > 
>> > >   unique_together = [('msgid', 'series__project')]
>> > > 
>> > > This is detailed here [1]. In addition, the migration needs a precursor
>> > > migration step to merge any broken series.
>> > > 
>> > > [1] https://stackoverflow.com/a/4440189/613428
>> > > 
>> > > Signed-off-by: Stephen Finucane <stephen@that.guru>
>> > > Closes: #241
>> > > Cc: Daniel Axtens <dja@axtens.net>
>> > > Cc: Petr Vorel <petr.vorel@gmail.com>
>> > > ---
>> > > v4:
>> > > - Further tweaks to logging
>> > > v3:
>> > > - Improve logging
>> > > ---
>> > >  .../0038_unique_series_references.py          | 121 +++++++++++++++++
>> > >  patchwork/models.py                           |   6 +-
>> > >  patchwork/parser.py                           | 123 +++++++++---------
>> > >  patchwork/tests/test_parser.py                |   9 +-
>> > >  patchwork/tests/utils.py                      |   6 +-
>> > >  5 files changed, 199 insertions(+), 66 deletions(-)
>> > >  create mode 100644 patchwork/migrations/0038_unique_series_references.py
>> > > 
>> > > diff --git a/patchwork/migrations/0038_unique_series_references.py b/patchwork/migrations/0038_unique_series_references.py
>> > > new file mode 100644
>> > > index 00000000..91799020
>> > > --- /dev/null
>> > > +++ b/patchwork/migrations/0038_unique_series_references.py
>> > > @@ -0,0 +1,121 @@
>> > > +from django.db import connection, migrations, models
>> > > +from django.db.models import Count
>> > > +import django.db.models.deletion
>> > > +
>> > > +
>> > > +def merge_duplicate_series(apps, schema_editor):
>> > > +    SeriesReference = apps.get_model('patchwork', 'SeriesReference')
>> > > +    Patch = apps.get_model('patchwork', 'Patch')
>> > > +
>> > > +    msgid_seriesrefs = {}
>> > > +
>> > > +    # find all SeriesReference that share a msgid but point to different series
>> > > +    # and decide which of the series is going to be the authoritative one
>> > > +    msgid_counts = (
>> > > +        SeriesReference.objects.values('msgid')
>> > > +        .annotate(count=Count('msgid'))
>> > > +        .filter(count__gt=1)
>> > > +    )
>> > > +    for msgid_count in msgid_counts:
>> > > +        msgid = msgid_count['msgid']
>> > > +        chosen_ref = None
>> > > +        for series_ref in SeriesReference.objects.filter(msgid=msgid):
>> > > +            if series_ref.series.cover_letter:
>> > > +                if chosen_ref:
>> > > +                    # I don't think this can happen, but explode if it does
>> > > +                    raise Exception(
>> > > +                        "Looks like you've got two or more series that share "
>> > > +                        "some patches but do not share a cover letter. Unable "
>> > > +                        "to auto-resolve."
>> > > +                    )
>> > > +
>> > > +                # if a series has a cover letter, that's the one we'll group
>> > > +                # everything under
>> > > +                chosen_ref = series_ref
>> > > +
>> > > +        if not chosen_ref:
>> > > +            # if none of the series have cover letters, simply use the last
>> > > +            # one (hint: this relies on Python's weird scoping for for loops
>> > > +            # where 'series_ref' is still accessible outside the loop)
>> > > +            chosen_ref = series_ref
>> > > +
>> > > +        msgid_seriesrefs[msgid] = chosen_ref
>> > > +
>> > > +    # reassign any patches referring to non-authoritative series to point to
>> > > +    # the authoritative one, and delete the other series; we do this separately
>> > > +    # to allow us a chance to raise the exception above if necessary
>> > > +    for msgid, chosen_ref in msgid_seriesrefs.items():
>> > > +        for series_ref in SeriesReference.objects.filter(msgid=msgid):
>> > > +            if series_ref == chosen_ref:
>> > > +                continue
>> > > +
>> > > +            # update the patches to point to our chosen series instead, on the
>> > > +            # assumption that all other metadata is correct
>> > > +            for patch in Patch.objects.filter(series=series_ref.series):
>> > > +                patch.series = chosen_ref.series
>> > > +                patch.save()
>> > > +
>> > > +            # delete the other series (which will delete the series ref)
>> > > +            series_ref.series.delete()
>> > > +
>> > > +
>> > > +def copy_project_field(apps, schema_editor):
>> > > +    if connection.vendor == 'postgresql':
>> > > +        schema_editor.execute(
>> > > +            """
>> > > +            UPDATE patchwork_seriesreference
>> > > +              SET project_id = patchwork_series.project_id
>> > > +            FROM patchwork_series
>> > > +              WHERE patchwork_seriesreference.series_id = patchwork_series.id
>> > > +        """
>> > > +        )
>> > > +    elif connection.vendor == 'mysql':
>> > > +        schema_editor.execute(
>> > > +            """
>> > > +            UPDATE patchwork_seriesreference, patchwork_series
>> > > +              SET patchwork_seriesreference.project_id = patchwork_series.project_id
>> > > +            WHERE patchwork_seriesreference.series_id = patchwork_series.id
>> > > +        """
>> > > +        )  # noqa
>> > > +    else:
>> > > +        SeriesReference = apps.get_model('patchwork', 'SeriesReference')
>> > > +
>> > > +        for series_ref in SeriesReference.objects.all().select_related(
>> > > +            'series'
>> > > +        ):
>> > > +            series_ref.project = series_ref.series.project
>> > > +            series_ref.save()
>> > > +
>> > > +
>> > > +class Migration(migrations.Migration):
>> > > +
>> > > +    dependencies = [('patchwork', '0037_event_actor')]
>> > > +
>> > > +    operations = [
>> > > +        migrations.RunPython(
>> > > +            merge_duplicate_series, migrations.RunPython.noop, atomic=False
>> > > +        ),
>> > > +        migrations.AddField(
>> > > +            model_name='seriesreference',
>> > > +            name='project',
>> > > +            field=models.ForeignKey(
>> > > +                null=True,
>> > > +                on_delete=django.db.models.deletion.CASCADE,
>> > > +                to='patchwork.Project',
>> > > +            ),
>> > > +        ),
>> > > +        migrations.AlterUniqueTogether(
>> > > +            name='seriesreference', unique_together={('project', 'msgid')}
>> > > +        ),
>> > > +        migrations.RunPython(
>> > > +            copy_project_field, migrations.RunPython.noop, atomic=False
>> > > +        ),
>> > > +        migrations.AlterField(
>> > > +            model_name='seriesreference',
>> > > +            name='project',
>> > > +            field=models.ForeignKey(
>> > > +                on_delete=django.db.models.deletion.CASCADE,
>> > > +                to='patchwork.Project',
>> > > +            ),
>> > > +        ),
>> > > +    ]
>> > > diff --git a/patchwork/models.py b/patchwork/models.py
>> > > index 7f0efd48..dbaff024 100644
>> > > --- a/patchwork/models.py
>> > > +++ b/patchwork/models.py
>> > > @@ -79,7 +79,8 @@ class Project(models.Model):
>> > >      webscm_url = models.CharField(max_length=2000, blank=True)
>> > >      list_archive_url = models.CharField(max_length=2000, blank=True)
>> > >      list_archive_url_format = models.CharField(
>> > > -        max_length=2000, blank=True,
>> > > +        max_length=2000,
>> > > +        blank=True,
>> > >          help_text="URL format for the list archive's Message-ID redirector. "
>> > >          "{} will be replaced by the Message-ID.")
>> > >      commit_url_format = models.CharField(
>> > > @@ -787,6 +788,7 @@ class SeriesReference(models.Model):
>> > >      required to handle the case whereby one or more patches are
>> > >      received before the cover letter.
>> > >      """
>> > > +    project = models.ForeignKey(Project, on_delete=models.CASCADE)
>> > >      series = models.ForeignKey(Series, related_name='references',
>> > >                                 related_query_name='reference',
>> > >                                 on_delete=models.CASCADE)
>> > > @@ -796,7 +798,7 @@ class SeriesReference(models.Model):
>> > >          return self.msgid
>> > >  
>> > >      class Meta:
>> > > -        unique_together = [('series', 'msgid')]
>> > > +        unique_together = [('project', 'msgid')]
>> > >  
>> > >  
>> > >  class Bundle(models.Model):
>> > > diff --git a/patchwork/parser.py b/patchwork/parser.py
>> > > index c424729b..c0084cc0 100644
>> > > --- a/patchwork/parser.py
>> > > +++ b/patchwork/parser.py
>> > > @@ -16,6 +16,7 @@ import re
>> > >  
>> > >  from django.contrib.auth.models import User
>> > >  from django.db.utils import IntegrityError
>> > > +from django.db import transaction
>> > >  from django.utils import six
>> > >  
>> > >  from patchwork.models import Comment
>> > > @@ -256,16 +257,9 @@ def _find_series_by_references(project, mail):
>> > >      for ref in refs:
>> > >          try:
>> > >              return SeriesReference.objects.get(
>> > > -                msgid=ref[:255], series__project=project).series
>> > > +                msgid=ref[:255], project=project).series
>> > >          except SeriesReference.DoesNotExist:
>> > >              continue
>> > > -        except SeriesReference.MultipleObjectsReturned:
>> > > -            # FIXME: Open bug: this can happen when we're processing
>> > > -            # messages in parallel. Pick the first and log.
>> > > -            logger.error("Multiple SeriesReferences for %s in project %s!" %
>> > > -                         (ref[:255], project.name))
>> > > -            return SeriesReference.objects.filter(
>> > > -                msgid=ref[:255], series__project=project).first().series
>> > >  
>> > >  
>> > >  def _find_series_by_markers(project, mail, author):
>> > > @@ -1092,47 +1086,65 @@ def parse_mail(mail, list_id=None):
>> > >          except IntegrityError:
>> > >              raise DuplicateMailError(msgid=msgid)
>> > >  
>> > > -        # if we don't have a series marker, we will never have an existing
>> > > -        # series to match against.
>> > > -        series = None
>> > > -        if n:
>> > > -            series = find_series(project, mail, author)
>> > > +        for attempt in range(1, 11):  # arbitrary retry count
>> > > +            try:
>> > > +                with transaction.atomic():
>> > > +                    # if we don't have a series marker, we will never have an
>> > > +                    # existing series to match against.
>> > > +                    series = None
>> > > +                    if n:
>> > > +                        series = find_series(project, mail, author)
>> > > +                    else:
>> > > +                        x = n = 1
>> > > +
>> > > +                    # We will create a new series if:
>> > > +                    # - there is no existing series to assign this patch to, or
>> > > +                    # - there is an existing series, but it already has a patch
>> > > +                    #   with this number in it
>> > > +                    if not series or Patch.objects.filter(
>> > > +                            series=series, number=x).count():
>> > > +                        series = Series.objects.create(
>> > > +                            project=project,
>> > > +                            date=date,
>> > > +                            submitter=author,
>> > > +                            version=version,
>> > > +                            total=n)
>> > > +
>> > > +                        # NOTE(stephenfin) We must save references for series.
>> > > +                        # We do this to handle the case where a later patch is
>> > > +                        # received first. Without storing references, it would
>> > > +                        # not be possible to identify the relationship between
>> > > +                        # patches as the earlier patch does not reference the
>> > > +                        # later one.
>> > > +                        for ref in refs + [msgid]:
>> > > +                            ref = ref[:255]
>> > > +                            # we don't want duplicates
>> > > +                            try:
>> > > +                                # we could have a ref to a previous series.
>> > > +                                # (For example, a series sent in reply to
>> > > +                                # another series.) That should not create a
>> > > +                                # series ref for this series, so check for the
>> > > +                                # msg-id only, not the msg-id/series pair.
>> > > +                                SeriesReference.objects.get(
>> > > +                                    msgid=ref, project=project)
>> > > +                            except SeriesReference.DoesNotExist:
>> > > +                                SeriesReference.objects.create(
>> > > +                                    msgid=ref, project=project, series=series)
>> > > +                    break
>> > > +            except IntegrityError:
>> > > +                # we lost the race so go again
>> > > +                logger.warning('Conflict while saving series. This is '
>> > > +                               'probably because multiple patches belonging '
>> > > +                               'to the same series have been received at '
>> > > +                               'once. Trying again (attempt %02d/10)',
>> > > +                               attempt)
>> > >          else:
>> > > -            x = n = 1
>> > > -
>> > > -        # We will create a new series if:
>> > > -        # - there is no existing series to assign this patch to, or
>> > > -        # - there is an existing series, but it already has a patch with this
>> > > -        #   number in it
>> > > -        if not series or Patch.objects.filter(series=series, number=x).count():
>> > > -            series = Series.objects.create(
>> > > -                project=project,
>> > > -                date=date,
>> > > -                submitter=author,
>> > > -                version=version,
>> > > -                total=n)
>> > > -
>> > > -            # NOTE(stephenfin) We must save references for series. We
>> > > -            # do this to handle the case where a later patch is
>> > > -            # received first. Without storing references, it would not
>> > > -            # be possible to identify the relationship between patches
>> > > -            # as the earlier patch does not reference the later one.
>> > > -            for ref in refs + [msgid]:
>> > > -                ref = ref[:255]
>> > > -                # we don't want duplicates
>> > > -                try:
>> > > -                    # we could have a ref to a previous series. (For
>> > > -                    # example, a series sent in reply to another
>> > > -                    # series.) That should not create a series ref
>> > > -                    # for this series, so check for the msg-id only,
>> > > -                    # not the msg-id/series pair.
>> > > -                    SeriesReference.objects.get(msgid=ref,
>> > > -                                                series__project=project)
>> > > -                except SeriesReference.DoesNotExist:
>> > > -                    SeriesReference.objects.create(series=series, msgid=ref)
>> > > -                except SeriesReference.MultipleObjectsReturned:
>> > > -                    logger.error("Multiple SeriesReferences for %s"
>> > > -                                 " in project %s!" % (ref, project.name))
>> > > +            # we failed to save the series so return the series-less patch
>> > > +            logger.warning('Failed to save series. Your patch with message ID '
>> > > +                           '%s has been saved but this should not happen. '
>> > > +                           'Please report this as a bug and include logs',
>> > > +                           msgid)
>> > > +            return patch
>> > >  
>> > >          # add to a series if we have found one, and we have a numbered
>> > >          # patch. Don't add unnumbered patches (for example diffs sent
>> > > @@ -1170,14 +1182,9 @@ def parse_mail(mail, list_id=None):
>> > >              # message
>> > >              try:
>> > >                  series = SeriesReference.objects.get(
>> > > -                    msgid=msgid, series__project=project).series
>> > > +                    msgid=msgid, project=project).series
>> > >              except SeriesReference.DoesNotExist:
>> > >                  series = None
>> > > -            except SeriesReference.MultipleObjectsReturned:
>> > > -                logger.error("Multiple SeriesReferences for %s"
>> > > -                             " in project %s!" % (msgid, project.name))
>> > > -                series = SeriesReference.objects.filter(
>> > > -                    msgid=msgid, series__project=project).first().series
>> > >  
>> > >              if not series:
>> > >                  series = Series.objects.create(
>> > > @@ -1190,12 +1197,8 @@ def parse_mail(mail, list_id=None):
>> > >                  # we don't save the in-reply-to or references fields
>> > >                  # for a cover letter, as they can't refer to the same
>> > >                  # series
>> > > -                try:
>> > > -                    SeriesReference.objects.get_or_create(series=series,
>> > > -                                                          msgid=msgid)
>> > > -                except SeriesReference.MultipleObjectsReturned:
>> > > -                    logger.error("Multiple SeriesReferences for %s"
>> > > -                                 " in project %s!" % (msgid, project.name))
>> > > +                SeriesReference.objects.create(
>> > > +                    msgid=msgid, project=project, series=series)
>> > >  
>> > >              try:
>> > >                  cover_letter = CoverLetter.objects.create(
>> > > diff --git a/patchwork/tests/test_parser.py b/patchwork/tests/test_parser.py
>> > > index 0bf71585..0edbd87a 100644
>> > > --- a/patchwork/tests/test_parser.py
>> > > +++ b/patchwork/tests/test_parser.py
>> > > @@ -421,17 +421,20 @@ class SeriesCorrelationTest(TestCase):
>> > >          msgids = [make_msgid()]
>> > >          project = create_project()
>> > >          series_v1 = create_series(project=project)
>> > > -        create_series_reference(msgid=msgids[0], series=series_v1)
>> > > +        create_series_reference(msgid=msgids[0], series=series_v1,
>> > > +                                project=project)
>> > >  
>> > >          # ...and three patches
>> > >          for i in range(3):
>> > >              msgids.append(make_msgid())
>> > > -            create_series_reference(msgid=msgids[-1], series=series_v1)
>> > > +            create_series_reference(msgid=msgids[-1], series=series_v1,
>> > > +                                    project=project)
>> > >  
>> > >          # now create a new series with "cover letter"
>> > >          msgids.append(make_msgid())
>> > >          series_v2 = create_series(project=project)
>> > > -        ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2)
>> > > +        ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2,
>> > > +                                         project=project)
>> > >  
>> > >          # ...and the "first patch" of this new series
>> > >          msgid = make_msgid()
>> > > diff --git a/patchwork/tests/utils.py b/patchwork/tests/utils.py
>> > > index 577183d0..4ead1781 100644
>> > > --- a/patchwork/tests/utils.py
>> > > +++ b/patchwork/tests/utils.py
>> > > @@ -282,8 +282,12 @@ def create_series(**kwargs):
>> > >  
>> > >  def create_series_reference(**kwargs):
>> > >      """Create 'SeriesReference' object."""
>> > > +    project = kwargs.pop('project', create_project())
>> > > +    series = kwargs.pop('series', create_series(project=project))
>> > > +
>> > >      values = {
>> > > -        'series': create_series() if 'series' not in kwargs else None,
>> > > +        'series': series,
>> > > +        'project': project,
>> > >          'msgid': make_msgid(),
>> > >      }
>> > >      values.update(**kwargs)
>> > > -- 
>> > > 2.23.0
Stephen Finucane Dec. 27, 2019, 2:37 p.m. UTC | #6
So I've gone ahead and merged this because I wanted to close it out
before the end of the year. I think the model changes are correct and
the parser changes can be tweaked (or mostly reverted if they're really
wrong). Answers to your questions below.

On Mon, 2019-12-09 at 09:27 +1100, Daniel Axtens wrote:
> Stephen Finucane <stephen@that.guru> writes:
> 
> > On Sun, 2019-12-08 at 22:14 +1100, Daniel Axtens wrote:
> > > Daniel Axtens <dja@axtens.net> writes:
> > > 
> > > > > Currently, the 'SeriesReference' object has a unique constraint on the
> > > > > two fields it has, 'series', which is a foreign key to 'Series', and
> > > > > 'msgid'. This is the wrong constraint. What we actually want to enforce
> > > > > is that a patch, cover letter or comment is referenced by a single
> > > > > series, or rather a single series per project the submission appears on.
> > > > > As such, we should be enforcing uniqueness on the msgid and the project
> > > > > that the patch, cover letter or comment belongs to.
> > > > 
> > > > OK, so I've been working on this and making good progress.
> > > > 
> > > > My fundamental issue with the current design, and this extension of it,
> > > > is that we're ignoring or working against the tools that databases
> > > > provide for concurrency. This patch is an improvement, no doubt, but I'd
> > > > really like to solve it properly.
> > > > 
> > > > Because we have multiple processes operating on the database - processes
> > > > without external syncronisation - we need to rely on the database to
> > > > provide the synchronisation primitive. Currently we're doing things like
> > > > asking the database to check if something exists and then creating it if
> > > > it does not. This is embeds TOCTOU problems, similar to the ones I fixed
> > > > for Person creation a while back. What we should be doing is
> > > > get_or_create style - basically more aggressively attempting to create
> > > > the thing we need and if that fails, then do the lookup for the existing
> > > > object.
> > > > 
> > > > This design pervades the series parser. It doesn't generally bite us
> > > > because we generally, more or less, parse in series.
> > > > 
> > > > I'm basically refactoring the parser to be
> > > > 'try-creating-and-catch-failures' rather than
> > > > 'lookup-and-create-if-missing'. It is a bit tricky but I think I'm at
> > > > least 80% of the way there.
> > > 
> > > The more I think about this patch, the more I see it does actually move
> > > us in the right direction. Let's run with it for now.
> > 
> > That was going to be my response to your previous reply: does anything
> > I'm doing in that series wrt *the models* not make sense, and if not,
> > should we apply it anyway since the parser is easily reworked again in
> > the future, but model changes are a little more permanent.
> 
> I had wanted to make it so that you couldn't insert multiple series for
> the same thread by adding the concept of a 'head message' to a
> series. Then you could create a unique_together with (head message id,
> project) and that would catch all the problems much earlier. I did end
> up coding that, and it mostly works, but there are still edgecases and
> bugs that I was working through.

The issue I had with this approach (which is what the freedesktop fork
does, iirc) is that it's no always possible to figure out what the head
is, since there's at least three different ways to send a message:

- no threading
- shallow threading (all messages point to the parent message, either a
cover letter or the first patch)
- deep threading (each message points to the previous patch)

I don't think deep threading would be possible with this head ID
approach unless 'References' tracked _every_ predecessor, but maybe I'm
wrong.

> > > As for my rewrite, well, series parsing is really hard. I'm also working
> > > from a data set that has a whole set of bizzare things including series
> > > of patches in-reply-to a cover letter, a series of patches that's half
> > > threaded and half unthreaded and a patch series that contains - by
> > > design - multiple patches per "number" (i.e. 3 patches each numbered
> > > 1/N, 3 patches numbered 2/N...), so that means I'm forever fighting
> > > weird edge-cases of how people used git a decade ago. So I'm going to
> > > pass on trying to rewrite it for now.
> > 
> > Any chance you can share this dataset? The only one I have to work with
> > is the Patchwork one which isn't that large. I did request the DPDK
> > archives some time back but I haven't checked whether what I received
> > even works.
> 
> I have an mid-2017 copy of Ubuntu kernel-team mailing list archive,
> which I probably picked because I worked there at the time. There's
> still a link to download it online, and now it's even bigger:
> https://lists.ubuntu.com/archives/kernel-team/ - right at the top of the
> page there's a link to 'download the full raw archive'.
> 
> Some things to note is that the archive does have a bunch of messages
> without Message-Ids(!) - looks like some poorly coded bots for regular
> messages about kernel builds. There's also the usual browsable mailman
> archive which I find very useful. I use it with my parallel parser,
> although I've reduce the runahead distance to 20.
> 
> Some problems with working on this archive are that it's just _so_ large
> it's hard to manually check things, and some of the really broken
> behaviours are now over a decade old. (Not all though, check out 'Wily
> [PATCH 00/22]- Please enable kconfig X86_LEGACY_VM86 for i386' at
> https://lists.ubuntu.com/archives/kernel-team/2015-October/thread.html )
> 
> > > I'll go over it with a finer-toothed comb and do a proper review, but I
> > > withdraw my overall objection.
> 
> Still going to do this, but I was thinking about it this morning and I
> think you also need to wrap the series/seriesreference manipulation in
> the coverletter path in an atomic context... you've made some changes to
> that path, was there a reason not to do the same atomic context wrapping?

Don't think this is an issue since we never attempt to find its series
by reference, as documented at [1].

[1] https://github.com/getpatchwork/patchwork/blob/350116effb/patchwork/parser.py#L1208-L1211

Stephen

> Regards,
> Daniel
> 
> 
> > > Regards,
> > > Daniel
> > > > I would be happy to leave my series parsing rework and Mete's work to a
> > > > 2.3, or I'd be happy to try and grind through, get these done, and make
> > > > 2.2 the final release in the 2.x series. What do you think?
> > > > 
> > > > Regards,
> > > > Daniel
> > > > 
> > > > Stephen Finucane <stephen@that.guru> writes:
> > > > 
> > > > > This requires adding a new field to the object, 'project', since it's
> > > > > not possible to do something like the following:
> > > > > 
> > > > >   unique_together = [('msgid', 'series__project')]
> > > > > 
> > > > > This is detailed here [1]. In addition, the migration needs a precursor
> > > > > migration step to merge any broken series.
> > > > > 
> > > > > [1] https://stackoverflow.com/a/4440189/613428
> > > > > 
> > > > > Signed-off-by: Stephen Finucane <stephen@that.guru>
> > > > > Closes: #241
> > > > > Cc: Daniel Axtens <dja@axtens.net>
> > > > > Cc: Petr Vorel <petr.vorel@gmail.com>
> > > > > ---
> > > > > v4:
> > > > > - Further tweaks to logging
> > > > > v3:
> > > > > - Improve logging
> > > > > ---
> > > > >  .../0038_unique_series_references.py          | 121 +++++++++++++++++
> > > > >  patchwork/models.py                           |   6 +-
> > > > >  patchwork/parser.py                           | 123 +++++++++---------
> > > > >  patchwork/tests/test_parser.py                |   9 +-
> > > > >  patchwork/tests/utils.py                      |   6 +-
> > > > >  5 files changed, 199 insertions(+), 66 deletions(-)
> > > > >  create mode 100644 patchwork/migrations/0038_unique_series_references.py
> > > > > 
> > > > > diff --git a/patchwork/migrations/0038_unique_series_references.py b/patchwork/migrations/0038_unique_series_references.py
> > > > > new file mode 100644
> > > > > index 00000000..91799020
> > > > > --- /dev/null
> > > > > +++ b/patchwork/migrations/0038_unique_series_references.py
> > > > > @@ -0,0 +1,121 @@
> > > > > +from django.db import connection, migrations, models
> > > > > +from django.db.models import Count
> > > > > +import django.db.models.deletion
> > > > > +
> > > > > +
> > > > > +def merge_duplicate_series(apps, schema_editor):
> > > > > +    SeriesReference = apps.get_model('patchwork', 'SeriesReference')
> > > > > +    Patch = apps.get_model('patchwork', 'Patch')
> > > > > +
> > > > > +    msgid_seriesrefs = {}
> > > > > +
> > > > > +    # find all SeriesReference that share a msgid but point to different series
> > > > > +    # and decide which of the series is going to be the authoritative one
> > > > > +    msgid_counts = (
> > > > > +        SeriesReference.objects.values('msgid')
> > > > > +        .annotate(count=Count('msgid'))
> > > > > +        .filter(count__gt=1)
> > > > > +    )
> > > > > +    for msgid_count in msgid_counts:
> > > > > +        msgid = msgid_count['msgid']
> > > > > +        chosen_ref = None
> > > > > +        for series_ref in SeriesReference.objects.filter(msgid=msgid):
> > > > > +            if series_ref.series.cover_letter:
> > > > > +                if chosen_ref:
> > > > > +                    # I don't think this can happen, but explode if it does
> > > > > +                    raise Exception(
> > > > > +                        "Looks like you've got two or more series that share "
> > > > > +                        "some patches but do not share a cover letter. Unable "
> > > > > +                        "to auto-resolve."
> > > > > +                    )
> > > > > +
> > > > > +                # if a series has a cover letter, that's the one we'll group
> > > > > +                # everything under
> > > > > +                chosen_ref = series_ref
> > > > > +
> > > > > +        if not chosen_ref:
> > > > > +            # if none of the series have cover letters, simply use the last
> > > > > +            # one (hint: this relies on Python's weird scoping for for loops
> > > > > +            # where 'series_ref' is still accessible outside the loop)
> > > > > +            chosen_ref = series_ref
> > > > > +
> > > > > +        msgid_seriesrefs[msgid] = chosen_ref
> > > > > +
> > > > > +    # reassign any patches referring to non-authoritative series to point to
> > > > > +    # the authoritative one, and delete the other series; we do this separately
> > > > > +    # to allow us a chance to raise the exception above if necessary
> > > > > +    for msgid, chosen_ref in msgid_seriesrefs.items():
> > > > > +        for series_ref in SeriesReference.objects.filter(msgid=msgid):
> > > > > +            if series_ref == chosen_ref:
> > > > > +                continue
> > > > > +
> > > > > +            # update the patches to point to our chosen series instead, on the
> > > > > +            # assumption that all other metadata is correct
> > > > > +            for patch in Patch.objects.filter(series=series_ref.series):
> > > > > +                patch.series = chosen_ref.series
> > > > > +                patch.save()
> > > > > +
> > > > > +            # delete the other series (which will delete the series ref)
> > > > > +            series_ref.series.delete()
> > > > > +
> > > > > +
> > > > > +def copy_project_field(apps, schema_editor):
> > > > > +    if connection.vendor == 'postgresql':
> > > > > +        schema_editor.execute(
> > > > > +            """
> > > > > +            UPDATE patchwork_seriesreference
> > > > > +              SET project_id = patchwork_series.project_id
> > > > > +            FROM patchwork_series
> > > > > +              WHERE patchwork_seriesreference.series_id = patchwork_series.id
> > > > > +        """
> > > > > +        )
> > > > > +    elif connection.vendor == 'mysql':
> > > > > +        schema_editor.execute(
> > > > > +            """
> > > > > +            UPDATE patchwork_seriesreference, patchwork_series
> > > > > +              SET patchwork_seriesreference.project_id = patchwork_series.project_id
> > > > > +            WHERE patchwork_seriesreference.series_id = patchwork_series.id
> > > > > +        """
> > > > > +        )  # noqa
> > > > > +    else:
> > > > > +        SeriesReference = apps.get_model('patchwork', 'SeriesReference')
> > > > > +
> > > > > +        for series_ref in SeriesReference.objects.all().select_related(
> > > > > +            'series'
> > > > > +        ):
> > > > > +            series_ref.project = series_ref.series.project
> > > > > +            series_ref.save()
> > > > > +
> > > > > +
> > > > > +class Migration(migrations.Migration):
> > > > > +
> > > > > +    dependencies = [('patchwork', '0037_event_actor')]
> > > > > +
> > > > > +    operations = [
> > > > > +        migrations.RunPython(
> > > > > +            merge_duplicate_series, migrations.RunPython.noop, atomic=False
> > > > > +        ),
> > > > > +        migrations.AddField(
> > > > > +            model_name='seriesreference',
> > > > > +            name='project',
> > > > > +            field=models.ForeignKey(
> > > > > +                null=True,
> > > > > +                on_delete=django.db.models.deletion.CASCADE,
> > > > > +                to='patchwork.Project',
> > > > > +            ),
> > > > > +        ),
> > > > > +        migrations.AlterUniqueTogether(
> > > > > +            name='seriesreference', unique_together={('project', 'msgid')}
> > > > > +        ),
> > > > > +        migrations.RunPython(
> > > > > +            copy_project_field, migrations.RunPython.noop, atomic=False
> > > > > +        ),
> > > > > +        migrations.AlterField(
> > > > > +            model_name='seriesreference',
> > > > > +            name='project',
> > > > > +            field=models.ForeignKey(
> > > > > +                on_delete=django.db.models.deletion.CASCADE,
> > > > > +                to='patchwork.Project',
> > > > > +            ),
> > > > > +        ),
> > > > > +    ]
> > > > > diff --git a/patchwork/models.py b/patchwork/models.py
> > > > > index 7f0efd48..dbaff024 100644
> > > > > --- a/patchwork/models.py
> > > > > +++ b/patchwork/models.py
> > > > > @@ -79,7 +79,8 @@ class Project(models.Model):
> > > > >      webscm_url = models.CharField(max_length=2000, blank=True)
> > > > >      list_archive_url = models.CharField(max_length=2000, blank=True)
> > > > >      list_archive_url_format = models.CharField(
> > > > > -        max_length=2000, blank=True,
> > > > > +        max_length=2000,
> > > > > +        blank=True,
> > > > >          help_text="URL format for the list archive's Message-ID redirector. "
> > > > >          "{} will be replaced by the Message-ID.")
> > > > >      commit_url_format = models.CharField(
> > > > > @@ -787,6 +788,7 @@ class SeriesReference(models.Model):
> > > > >      required to handle the case whereby one or more patches are
> > > > >      received before the cover letter.
> > > > >      """
> > > > > +    project = models.ForeignKey(Project, on_delete=models.CASCADE)
> > > > >      series = models.ForeignKey(Series, related_name='references',
> > > > >                                 related_query_name='reference',
> > > > >                                 on_delete=models.CASCADE)
> > > > > @@ -796,7 +798,7 @@ class SeriesReference(models.Model):
> > > > >          return self.msgid
> > > > >  
> > > > >      class Meta:
> > > > > -        unique_together = [('series', 'msgid')]
> > > > > +        unique_together = [('project', 'msgid')]
> > > > >  
> > > > >  
> > > > >  class Bundle(models.Model):
> > > > > diff --git a/patchwork/parser.py b/patchwork/parser.py
> > > > > index c424729b..c0084cc0 100644
> > > > > --- a/patchwork/parser.py
> > > > > +++ b/patchwork/parser.py
> > > > > @@ -16,6 +16,7 @@ import re
> > > > >  
> > > > >  from django.contrib.auth.models import User
> > > > >  from django.db.utils import IntegrityError
> > > > > +from django.db import transaction
> > > > >  from django.utils import six
> > > > >  
> > > > >  from patchwork.models import Comment
> > > > > @@ -256,16 +257,9 @@ def _find_series_by_references(project, mail):
> > > > >      for ref in refs:
> > > > >          try:
> > > > >              return SeriesReference.objects.get(
> > > > > -                msgid=ref[:255], series__project=project).series
> > > > > +                msgid=ref[:255], project=project).series
> > > > >          except SeriesReference.DoesNotExist:
> > > > >              continue
> > > > > -        except SeriesReference.MultipleObjectsReturned:
> > > > > -            # FIXME: Open bug: this can happen when we're processing
> > > > > -            # messages in parallel. Pick the first and log.
> > > > > -            logger.error("Multiple SeriesReferences for %s in project %s!" %
> > > > > -                         (ref[:255], project.name))
> > > > > -            return SeriesReference.objects.filter(
> > > > > -                msgid=ref[:255], series__project=project).first().series
> > > > >  
> > > > >  
> > > > >  def _find_series_by_markers(project, mail, author):
> > > > > @@ -1092,47 +1086,65 @@ def parse_mail(mail, list_id=None):
> > > > >          except IntegrityError:
> > > > >              raise DuplicateMailError(msgid=msgid)
> > > > >  
> > > > > -        # if we don't have a series marker, we will never have an existing
> > > > > -        # series to match against.
> > > > > -        series = None
> > > > > -        if n:
> > > > > -            series = find_series(project, mail, author)
> > > > > +        for attempt in range(1, 11):  # arbitrary retry count
> > > > > +            try:
> > > > > +                with transaction.atomic():
> > > > > +                    # if we don't have a series marker, we will never have an
> > > > > +                    # existing series to match against.
> > > > > +                    series = None
> > > > > +                    if n:
> > > > > +                        series = find_series(project, mail, author)
> > > > > +                    else:
> > > > > +                        x = n = 1
> > > > > +
> > > > > +                    # We will create a new series if:
> > > > > +                    # - there is no existing series to assign this patch to, or
> > > > > +                    # - there is an existing series, but it already has a patch
> > > > > +                    #   with this number in it
> > > > > +                    if not series or Patch.objects.filter(
> > > > > +                            series=series, number=x).count():
> > > > > +                        series = Series.objects.create(
> > > > > +                            project=project,
> > > > > +                            date=date,
> > > > > +                            submitter=author,
> > > > > +                            version=version,
> > > > > +                            total=n)
> > > > > +
> > > > > +                        # NOTE(stephenfin) We must save references for series.
> > > > > +                        # We do this to handle the case where a later patch is
> > > > > +                        # received first. Without storing references, it would
> > > > > +                        # not be possible to identify the relationship between
> > > > > +                        # patches as the earlier patch does not reference the
> > > > > +                        # later one.
> > > > > +                        for ref in refs + [msgid]:
> > > > > +                            ref = ref[:255]
> > > > > +                            # we don't want duplicates
> > > > > +                            try:
> > > > > +                                # we could have a ref to a previous series.
> > > > > +                                # (For example, a series sent in reply to
> > > > > +                                # another series.) That should not create a
> > > > > +                                # series ref for this series, so check for the
> > > > > +                                # msg-id only, not the msg-id/series pair.
> > > > > +                                SeriesReference.objects.get(
> > > > > +                                    msgid=ref, project=project)
> > > > > +                            except SeriesReference.DoesNotExist:
> > > > > +                                SeriesReference.objects.create(
> > > > > +                                    msgid=ref, project=project, series=series)
> > > > > +                    break
> > > > > +            except IntegrityError:
> > > > > +                # we lost the race so go again
> > > > > +                logger.warning('Conflict while saving series. This is '
> > > > > +                               'probably because multiple patches belonging '
> > > > > +                               'to the same series have been received at '
> > > > > +                               'once. Trying again (attempt %02d/10)',
> > > > > +                               attempt)
> > > > >          else:
> > > > > -            x = n = 1
> > > > > -
> > > > > -        # We will create a new series if:
> > > > > -        # - there is no existing series to assign this patch to, or
> > > > > -        # - there is an existing series, but it already has a patch with this
> > > > > -        #   number in it
> > > > > -        if not series or Patch.objects.filter(series=series, number=x).count():
> > > > > -            series = Series.objects.create(
> > > > > -                project=project,
> > > > > -                date=date,
> > > > > -                submitter=author,
> > > > > -                version=version,
> > > > > -                total=n)
> > > > > -
> > > > > -            # NOTE(stephenfin) We must save references for series. We
> > > > > -            # do this to handle the case where a later patch is
> > > > > -            # received first. Without storing references, it would not
> > > > > -            # be possible to identify the relationship between patches
> > > > > -            # as the earlier patch does not reference the later one.
> > > > > -            for ref in refs + [msgid]:
> > > > > -                ref = ref[:255]
> > > > > -                # we don't want duplicates
> > > > > -                try:
> > > > > -                    # we could have a ref to a previous series. (For
> > > > > -                    # example, a series sent in reply to another
> > > > > -                    # series.) That should not create a series ref
> > > > > -                    # for this series, so check for the msg-id only,
> > > > > -                    # not the msg-id/series pair.
> > > > > -                    SeriesReference.objects.get(msgid=ref,
> > > > > -                                                series__project=project)
> > > > > -                except SeriesReference.DoesNotExist:
> > > > > -                    SeriesReference.objects.create(series=series, msgid=ref)
> > > > > -                except SeriesReference.MultipleObjectsReturned:
> > > > > -                    logger.error("Multiple SeriesReferences for %s"
> > > > > -                                 " in project %s!" % (ref, project.name))
> > > > > +            # we failed to save the series so return the series-less patch
> > > > > +            logger.warning('Failed to save series. Your patch with message ID '
> > > > > +                           '%s has been saved but this should not happen. '
> > > > > +                           'Please report this as a bug and include logs',
> > > > > +                           msgid)
> > > > > +            return patch
> > > > >  
> > > > >          # add to a series if we have found one, and we have a numbered
> > > > >          # patch. Don't add unnumbered patches (for example diffs sent
> > > > > @@ -1170,14 +1182,9 @@ def parse_mail(mail, list_id=None):
> > > > >              # message
> > > > >              try:
> > > > >                  series = SeriesReference.objects.get(
> > > > > -                    msgid=msgid, series__project=project).series
> > > > > +                    msgid=msgid, project=project).series
> > > > >              except SeriesReference.DoesNotExist:
> > > > >                  series = None
> > > > > -            except SeriesReference.MultipleObjectsReturned:
> > > > > -                logger.error("Multiple SeriesReferences for %s"
> > > > > -                             " in project %s!" % (msgid, project.name))
> > > > > -                series = SeriesReference.objects.filter(
> > > > > -                    msgid=msgid, series__project=project).first().series
> > > > >  
> > > > >              if not series:
> > > > >                  series = Series.objects.create(
> > > > > @@ -1190,12 +1197,8 @@ def parse_mail(mail, list_id=None):
> > > > >                  # we don't save the in-reply-to or references fields
> > > > >                  # for a cover letter, as they can't refer to the same
> > > > >                  # series
> > > > > -                try:
> > > > > -                    SeriesReference.objects.get_or_create(series=series,
> > > > > -                                                          msgid=msgid)
> > > > > -                except SeriesReference.MultipleObjectsReturned:
> > > > > -                    logger.error("Multiple SeriesReferences for %s"
> > > > > -                                 " in project %s!" % (msgid, project.name))
> > > > > +                SeriesReference.objects.create(
> > > > > +                    msgid=msgid, project=project, series=series)
> > > > >  
> > > > >              try:
> > > > >                  cover_letter = CoverLetter.objects.create(
> > > > > diff --git a/patchwork/tests/test_parser.py b/patchwork/tests/test_parser.py
> > > > > index 0bf71585..0edbd87a 100644
> > > > > --- a/patchwork/tests/test_parser.py
> > > > > +++ b/patchwork/tests/test_parser.py
> > > > > @@ -421,17 +421,20 @@ class SeriesCorrelationTest(TestCase):
> > > > >          msgids = [make_msgid()]
> > > > >          project = create_project()
> > > > >          series_v1 = create_series(project=project)
> > > > > -        create_series_reference(msgid=msgids[0], series=series_v1)
> > > > > +        create_series_reference(msgid=msgids[0], series=series_v1,
> > > > > +                                project=project)
> > > > >  
> > > > >          # ...and three patches
> > > > >          for i in range(3):
> > > > >              msgids.append(make_msgid())
> > > > > -            create_series_reference(msgid=msgids[-1], series=series_v1)
> > > > > +            create_series_reference(msgid=msgids[-1], series=series_v1,
> > > > > +                                    project=project)
> > > > >  
> > > > >          # now create a new series with "cover letter"
> > > > >          msgids.append(make_msgid())
> > > > >          series_v2 = create_series(project=project)
> > > > > -        ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2)
> > > > > +        ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2,
> > > > > +                                         project=project)
> > > > >  
> > > > >          # ...and the "first patch" of this new series
> > > > >          msgid = make_msgid()
> > > > > diff --git a/patchwork/tests/utils.py b/patchwork/tests/utils.py
> > > > > index 577183d0..4ead1781 100644
> > > > > --- a/patchwork/tests/utils.py
> > > > > +++ b/patchwork/tests/utils.py
> > > > > @@ -282,8 +282,12 @@ def create_series(**kwargs):
> > > > >  
> > > > >  def create_series_reference(**kwargs):
> > > > >      """Create 'SeriesReference' object."""
> > > > > +    project = kwargs.pop('project', create_project())
> > > > > +    series = kwargs.pop('series', create_series(project=project))
> > > > > +
> > > > >      values = {
> > > > > -        'series': create_series() if 'series' not in kwargs else None,
> > > > > +        'series': series,
> > > > > +        'project': project,
> > > > >          'msgid': make_msgid(),
> > > > >      }
> > > > >      values.update(**kwargs)
> > > > > -- 
> > > > > 2.23.0

Patch
diff mbox series

diff --git a/patchwork/migrations/0038_unique_series_references.py b/patchwork/migrations/0038_unique_series_references.py
new file mode 100644
index 00000000..91799020
--- /dev/null
+++ b/patchwork/migrations/0038_unique_series_references.py
@@ -0,0 +1,121 @@ 
+from django.db import connection, migrations, models
+from django.db.models import Count
+import django.db.models.deletion
+
+
+def merge_duplicate_series(apps, schema_editor):
+    SeriesReference = apps.get_model('patchwork', 'SeriesReference')
+    Patch = apps.get_model('patchwork', 'Patch')
+
+    msgid_seriesrefs = {}
+
+    # find all SeriesReference that share a msgid but point to different series
+    # and decide which of the series is going to be the authoritative one
+    msgid_counts = (
+        SeriesReference.objects.values('msgid')
+        .annotate(count=Count('msgid'))
+        .filter(count__gt=1)
+    )
+    for msgid_count in msgid_counts:
+        msgid = msgid_count['msgid']
+        chosen_ref = None
+        for series_ref in SeriesReference.objects.filter(msgid=msgid):
+            if series_ref.series.cover_letter:
+                if chosen_ref:
+                    # I don't think this can happen, but explode if it does
+                    raise Exception(
+                        "Looks like you've got two or more series that share "
+                        "some patches but do not share a cover letter. Unable "
+                        "to auto-resolve."
+                    )
+
+                # if a series has a cover letter, that's the one we'll group
+                # everything under
+                chosen_ref = series_ref
+
+        if not chosen_ref:
+            # if none of the series have cover letters, simply use the last
+            # one (hint: this relies on Python's weird scoping for for loops
+            # where 'series_ref' is still accessible outside the loop)
+            chosen_ref = series_ref
+
+        msgid_seriesrefs[msgid] = chosen_ref
+
+    # reassign any patches referring to non-authoritative series to point to
+    # the authoritative one, and delete the other series; we do this separately
+    # to allow us a chance to raise the exception above if necessary
+    for msgid, chosen_ref in msgid_seriesrefs.items():
+        for series_ref in SeriesReference.objects.filter(msgid=msgid):
+            if series_ref == chosen_ref:
+                continue
+
+            # update the patches to point to our chosen series instead, on the
+            # assumption that all other metadata is correct
+            for patch in Patch.objects.filter(series=series_ref.series):
+                patch.series = chosen_ref.series
+                patch.save()
+
+            # delete the other series (which will delete the series ref)
+            series_ref.series.delete()
+
+
+def copy_project_field(apps, schema_editor):
+    if connection.vendor == 'postgresql':
+        schema_editor.execute(
+            """
+            UPDATE patchwork_seriesreference
+              SET project_id = patchwork_series.project_id
+            FROM patchwork_series
+              WHERE patchwork_seriesreference.series_id = patchwork_series.id
+        """
+        )
+    elif connection.vendor == 'mysql':
+        schema_editor.execute(
+            """
+            UPDATE patchwork_seriesreference, patchwork_series
+              SET patchwork_seriesreference.project_id = patchwork_series.project_id
+            WHERE patchwork_seriesreference.series_id = patchwork_series.id
+        """
+        )  # noqa
+    else:
+        SeriesReference = apps.get_model('patchwork', 'SeriesReference')
+
+        for series_ref in SeriesReference.objects.all().select_related(
+            'series'
+        ):
+            series_ref.project = series_ref.series.project
+            series_ref.save()
+
+
+class Migration(migrations.Migration):
+
+    dependencies = [('patchwork', '0037_event_actor')]
+
+    operations = [
+        migrations.RunPython(
+            merge_duplicate_series, migrations.RunPython.noop, atomic=False
+        ),
+        migrations.AddField(
+            model_name='seriesreference',
+            name='project',
+            field=models.ForeignKey(
+                null=True,
+                on_delete=django.db.models.deletion.CASCADE,
+                to='patchwork.Project',
+            ),
+        ),
+        migrations.AlterUniqueTogether(
+            name='seriesreference', unique_together={('project', 'msgid')}
+        ),
+        migrations.RunPython(
+            copy_project_field, migrations.RunPython.noop, atomic=False
+        ),
+        migrations.AlterField(
+            model_name='seriesreference',
+            name='project',
+            field=models.ForeignKey(
+                on_delete=django.db.models.deletion.CASCADE,
+                to='patchwork.Project',
+            ),
+        ),
+    ]
diff --git a/patchwork/models.py b/patchwork/models.py
index 7f0efd48..dbaff024 100644
--- a/patchwork/models.py
+++ b/patchwork/models.py
@@ -79,7 +79,8 @@  class Project(models.Model):
     webscm_url = models.CharField(max_length=2000, blank=True)
     list_archive_url = models.CharField(max_length=2000, blank=True)
     list_archive_url_format = models.CharField(
-        max_length=2000, blank=True,
+        max_length=2000,
+        blank=True,
         help_text="URL format for the list archive's Message-ID redirector. "
         "{} will be replaced by the Message-ID.")
     commit_url_format = models.CharField(
@@ -787,6 +788,7 @@  class SeriesReference(models.Model):
     required to handle the case whereby one or more patches are
     received before the cover letter.
     """
+    project = models.ForeignKey(Project, on_delete=models.CASCADE)
     series = models.ForeignKey(Series, related_name='references',
                                related_query_name='reference',
                                on_delete=models.CASCADE)
@@ -796,7 +798,7 @@  class SeriesReference(models.Model):
         return self.msgid
 
     class Meta:
-        unique_together = [('series', 'msgid')]
+        unique_together = [('project', 'msgid')]
 
 
 class Bundle(models.Model):
diff --git a/patchwork/parser.py b/patchwork/parser.py
index c424729b..c0084cc0 100644
--- a/patchwork/parser.py
+++ b/patchwork/parser.py
@@ -16,6 +16,7 @@  import re
 
 from django.contrib.auth.models import User
 from django.db.utils import IntegrityError
+from django.db import transaction
 from django.utils import six
 
 from patchwork.models import Comment
@@ -256,16 +257,9 @@  def _find_series_by_references(project, mail):
     for ref in refs:
         try:
             return SeriesReference.objects.get(
-                msgid=ref[:255], series__project=project).series
+                msgid=ref[:255], project=project).series
         except SeriesReference.DoesNotExist:
             continue
-        except SeriesReference.MultipleObjectsReturned:
-            # FIXME: Open bug: this can happen when we're processing
-            # messages in parallel. Pick the first and log.
-            logger.error("Multiple SeriesReferences for %s in project %s!" %
-                         (ref[:255], project.name))
-            return SeriesReference.objects.filter(
-                msgid=ref[:255], series__project=project).first().series
 
 
 def _find_series_by_markers(project, mail, author):
@@ -1092,47 +1086,65 @@  def parse_mail(mail, list_id=None):
         except IntegrityError:
             raise DuplicateMailError(msgid=msgid)
 
-        # if we don't have a series marker, we will never have an existing
-        # series to match against.
-        series = None
-        if n:
-            series = find_series(project, mail, author)
+        for attempt in range(1, 11):  # arbitrary retry count
+            try:
+                with transaction.atomic():
+                    # if we don't have a series marker, we will never have an
+                    # existing series to match against.
+                    series = None
+                    if n:
+                        series = find_series(project, mail, author)
+                    else:
+                        x = n = 1
+
+                    # We will create a new series if:
+                    # - there is no existing series to assign this patch to, or
+                    # - there is an existing series, but it already has a patch
+                    #   with this number in it
+                    if not series or Patch.objects.filter(
+                            series=series, number=x).count():
+                        series = Series.objects.create(
+                            project=project,
+                            date=date,
+                            submitter=author,
+                            version=version,
+                            total=n)
+
+                        # NOTE(stephenfin) We must save references for series.
+                        # We do this to handle the case where a later patch is
+                        # received first. Without storing references, it would
+                        # not be possible to identify the relationship between
+                        # patches as the earlier patch does not reference the
+                        # later one.
+                        for ref in refs + [msgid]:
+                            ref = ref[:255]
+                            # we don't want duplicates
+                            try:
+                                # we could have a ref to a previous series.
+                                # (For example, a series sent in reply to
+                                # another series.) That should not create a
+                                # series ref for this series, so check for the
+                                # msg-id only, not the msg-id/series pair.
+                                SeriesReference.objects.get(
+                                    msgid=ref, project=project)
+                            except SeriesReference.DoesNotExist:
+                                SeriesReference.objects.create(
+                                    msgid=ref, project=project, series=series)
+                    break
+            except IntegrityError:
+                # we lost the race so go again
+                logger.warning('Conflict while saving series. This is '
+                               'probably because multiple patches belonging '
+                               'to the same series have been received at '
+                               'once. Trying again (attempt %02d/10)',
+                               attempt)
         else:
-            x = n = 1
-
-        # We will create a new series if:
-        # - there is no existing series to assign this patch to, or
-        # - there is an existing series, but it already has a patch with this
-        #   number in it
-        if not series or Patch.objects.filter(series=series, number=x).count():
-            series = Series.objects.create(
-                project=project,
-                date=date,
-                submitter=author,
-                version=version,
-                total=n)
-
-            # NOTE(stephenfin) We must save references for series. We
-            # do this to handle the case where a later patch is
-            # received first. Without storing references, it would not
-            # be possible to identify the relationship between patches
-            # as the earlier patch does not reference the later one.
-            for ref in refs + [msgid]:
-                ref = ref[:255]
-                # we don't want duplicates
-                try:
-                    # we could have a ref to a previous series. (For
-                    # example, a series sent in reply to another
-                    # series.) That should not create a series ref
-                    # for this series, so check for the msg-id only,
-                    # not the msg-id/series pair.
-                    SeriesReference.objects.get(msgid=ref,
-                                                series__project=project)
-                except SeriesReference.DoesNotExist:
-                    SeriesReference.objects.create(series=series, msgid=ref)
-                except SeriesReference.MultipleObjectsReturned:
-                    logger.error("Multiple SeriesReferences for %s"
-                                 " in project %s!" % (ref, project.name))
+            # we failed to save the series so return the series-less patch
+            logger.warning('Failed to save series. Your patch with message ID '
+                           '%s has been saved but this should not happen. '
+                           'Please report this as a bug and include logs',
+                           msgid)
+            return patch
 
         # add to a series if we have found one, and we have a numbered
         # patch. Don't add unnumbered patches (for example diffs sent
@@ -1170,14 +1182,9 @@  def parse_mail(mail, list_id=None):
             # message
             try:
                 series = SeriesReference.objects.get(
-                    msgid=msgid, series__project=project).series
+                    msgid=msgid, project=project).series
             except SeriesReference.DoesNotExist:
                 series = None
-            except SeriesReference.MultipleObjectsReturned:
-                logger.error("Multiple SeriesReferences for %s"
-                             " in project %s!" % (msgid, project.name))
-                series = SeriesReference.objects.filter(
-                    msgid=msgid, series__project=project).first().series
 
             if not series:
                 series = Series.objects.create(
@@ -1190,12 +1197,8 @@  def parse_mail(mail, list_id=None):
                 # we don't save the in-reply-to or references fields
                 # for a cover letter, as they can't refer to the same
                 # series
-                try:
-                    SeriesReference.objects.get_or_create(series=series,
-                                                          msgid=msgid)
-                except SeriesReference.MultipleObjectsReturned:
-                    logger.error("Multiple SeriesReferences for %s"
-                                 " in project %s!" % (msgid, project.name))
+                SeriesReference.objects.create(
+                    msgid=msgid, project=project, series=series)
 
             try:
                 cover_letter = CoverLetter.objects.create(
diff --git a/patchwork/tests/test_parser.py b/patchwork/tests/test_parser.py
index 0bf71585..0edbd87a 100644
--- a/patchwork/tests/test_parser.py
+++ b/patchwork/tests/test_parser.py
@@ -421,17 +421,20 @@  class SeriesCorrelationTest(TestCase):
         msgids = [make_msgid()]
         project = create_project()
         series_v1 = create_series(project=project)
-        create_series_reference(msgid=msgids[0], series=series_v1)
+        create_series_reference(msgid=msgids[0], series=series_v1,
+                                project=project)
 
         # ...and three patches
         for i in range(3):
             msgids.append(make_msgid())
-            create_series_reference(msgid=msgids[-1], series=series_v1)
+            create_series_reference(msgid=msgids[-1], series=series_v1,
+                                    project=project)
 
         # now create a new series with "cover letter"
         msgids.append(make_msgid())
         series_v2 = create_series(project=project)
-        ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2)
+        ref_v2 = create_series_reference(msgid=msgids[-1], series=series_v2,
+                                         project=project)
 
         # ...and the "first patch" of this new series
         msgid = make_msgid()
diff --git a/patchwork/tests/utils.py b/patchwork/tests/utils.py
index 577183d0..4ead1781 100644
--- a/patchwork/tests/utils.py
+++ b/patchwork/tests/utils.py
@@ -282,8 +282,12 @@  def create_series(**kwargs):
 
 def create_series_reference(**kwargs):
     """Create 'SeriesReference' object."""
+    project = kwargs.pop('project', create_project())
+    series = kwargs.pop('series', create_series(project=project))
+
     values = {
-        'series': create_series() if 'series' not in kwargs else None,
+        'series': series,
+        'project': project,
         'msgid': make_msgid(),
     }
     values.update(**kwargs)