Message ID | 20180224145020.15181-10-dja@axtens.net |
---|---|
State | Accepted |
Headers | show |
Series | Tools and fixes for parallel parsing | expand |
Context | Check | Description |
---|---|---|
dja/snowpatch-0_1_0 | success | master/apply_patch Successfully applied |
dja/snowpatch-snowpatch_job_snowpatch-patchwork | success | Test snowpatch/job/snowpatch-patchwork on branch master |
On Sun, 2018-02-25 at 01:50 +1100, Daniel Axtens wrote: > Parallel parsing would occasonally fail with: > > patchwork.models.MultipleObjectsReturned: get() returned more than one SeriesReference -- it returned 2! > > I think these are happening if you have different processes parsing > e.g. 1/3 and 2/3 simultaneously: both will have a reference to 1/3, > in the case of 1 it will be the msgid, in the case of 2 it will be > in References. So when we come to parse 3/3, .get() finds 2 and > throws the exception. > > This does not fix the creation of multiple series references; it > just causes them to be ignored. We still have serious race conditions > with series creation, but I don't yet have clear answers for them. > With this patch, they will at least not stop patches from being > processed - they'll just lead to wonky series, which we already have. > > Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> > Signed-off-by: Daniel Axtens <dja@axtens.net> Correct me if I'm wrong, but this seems to highlight two issues: * The race condition between searching for a series reference and using it * The fact we expect a series to be unique for a given msgid and project, yet the 'unique_together' is for a given msgid and *series*. The first of these is caught here (yet not fixed, as it's a non-trivial thing to resolve). The second should still be fixed though. Should we do this here, as part of this series? Stephen > --- > patchwork/parser.py | 23 +++++++++++++++++++++-- > 1 file changed, 21 insertions(+), 2 deletions(-) > > diff --git a/patchwork/parser.py b/patchwork/parser.py > index 56dc7006c811..5a7344cee93c 100644 > --- a/patchwork/parser.py > +++ b/patchwork/parser.py > @@ -240,6 +240,13 @@ def _find_series_by_references(project, mail): > msgid=ref[:255], series__project=project).series > except SeriesReference.DoesNotExist: > continue > + except SeriesReference.MultipleObjectsReturned: > + # FIXME: Open bug: this can happen when we're processing > + # messages in parallel. Pick the first and log. > + logger.error("Multiple SeriesReferences for %s in project %s!" % > + (ref[:255], project.name)) > + return SeriesReference.objects.filter( > + msgid=ref[:255], series__project=project).first().series > > > def _find_series_by_markers(project, mail, author): > @@ -1037,6 +1044,9 @@ def parse_mail(mail, list_id=None): > series__project=project) > except SeriesReference.DoesNotExist: > SeriesReference.objects.create(series=series, msgid=ref) > + except SeriesReference.MultipleObjectsReturned: > + logger.error("Multiple SeriesReferences for %s" > + " in project %s!" % (ref, project.name)) > > # add to a series if we have found one, and we have a numbered > # patch. Don't add unnumbered patches (for example diffs sent > @@ -1075,6 +1085,11 @@ def parse_mail(mail, list_id=None): > msgid=msgid, series__project=project).series > except SeriesReference.DoesNotExist: > series = None > + except SeriesReference.MultipleObjectsReturned: > + logger.error("Multiple SeriesReferences for %s" > + " in project %s!" % (msgid, project.name)) > + series = SeriesReference.objects.filter( > + msgid=msgid, series__project=project).first().series > > if not series: > series = Series(project=project, > @@ -1087,8 +1102,12 @@ def parse_mail(mail, list_id=None): > # we don't save the in-reply-to or references fields > # for a cover letter, as they can't refer to the same > # series > - SeriesReference.objects.get_or_create(series=series, > - msgid=msgid) > + try: > + SeriesReference.objects.get_or_create(series=series, > + msgid=msgid) > + except SeriesReference.MultipleObjectsReturned: > + logger.error("Multiple SeriesReferences for %s" > + " in project %s!" % (msgid, project.name)) > > cover_letter = CoverLetter( > msgid=msgid,
Stephen Finucane <stephen@that.guru> writes: > On Sun, 2018-02-25 at 01:50 +1100, Daniel Axtens wrote: >> Parallel parsing would occasonally fail with: >> >> patchwork.models.MultipleObjectsReturned: get() returned more than one SeriesReference -- it returned 2! >> >> I think these are happening if you have different processes parsing >> e.g. 1/3 and 2/3 simultaneously: both will have a reference to 1/3, >> in the case of 1 it will be the msgid, in the case of 2 it will be >> in References. So when we come to parse 3/3, .get() finds 2 and >> throws the exception. >> >> This does not fix the creation of multiple series references; it >> just causes them to be ignored. We still have serious race conditions >> with series creation, but I don't yet have clear answers for them. >> With this patch, they will at least not stop patches from being >> processed - they'll just lead to wonky series, which we already have. >> >> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> >> Signed-off-by: Daniel Axtens <dja@axtens.net> > > Correct me if I'm wrong, but this seems to highlight two issues: > > * The race condition between searching for a series reference and > using it > * The fact we expect a series to be unique for a given msgid and > project, yet the 'unique_together' is for a given msgid and > *series*. > > The first of these is caught here (yet not fixed, as it's a non-trivial > thing to resolve). The second should still be fixed though. Should we > do this here, as part of this series? My thinking is no, for these reasons: - I want a patch I can backport to stable without requiring a database migration, because: * that seems like a very unusual thing for a stable update * I don't trust my ability to get it right for a stable release * relatedly, it will require a lot of testing for me to trust it, and I want something sooner rather than later so that the Buildroot people can get on with their lives. * The different numbering of migrations will make the different upgrade paths a nightmare to describe, yet alone test - I have a long-term ongoing effort to move 'project' up into the various tables that use it so as to avoid JOINs and this would be a good thing to do in that effort. (It's the next big push on my list, it's just a nightmare to get the compatibility and migration code going.) Regards, Daniel > > Stephen > >> --- >> patchwork/parser.py | 23 +++++++++++++++++++++-- >> 1 file changed, 21 insertions(+), 2 deletions(-) >> >> diff --git a/patchwork/parser.py b/patchwork/parser.py >> index 56dc7006c811..5a7344cee93c 100644 >> --- a/patchwork/parser.py >> +++ b/patchwork/parser.py >> @@ -240,6 +240,13 @@ def _find_series_by_references(project, mail): >> msgid=ref[:255], series__project=project).series >> except SeriesReference.DoesNotExist: >> continue >> + except SeriesReference.MultipleObjectsReturned: >> + # FIXME: Open bug: this can happen when we're processing >> + # messages in parallel. Pick the first and log. >> + logger.error("Multiple SeriesReferences for %s in project %s!" % >> + (ref[:255], project.name)) >> + return SeriesReference.objects.filter( >> + msgid=ref[:255], series__project=project).first().series >> >> >> def _find_series_by_markers(project, mail, author): >> @@ -1037,6 +1044,9 @@ def parse_mail(mail, list_id=None): >> series__project=project) >> except SeriesReference.DoesNotExist: >> SeriesReference.objects.create(series=series, msgid=ref) >> + except SeriesReference.MultipleObjectsReturned: >> + logger.error("Multiple SeriesReferences for %s" >> + " in project %s!" % (ref, project.name)) >> >> # add to a series if we have found one, and we have a numbered >> # patch. Don't add unnumbered patches (for example diffs sent >> @@ -1075,6 +1085,11 @@ def parse_mail(mail, list_id=None): >> msgid=msgid, series__project=project).series >> except SeriesReference.DoesNotExist: >> series = None >> + except SeriesReference.MultipleObjectsReturned: >> + logger.error("Multiple SeriesReferences for %s" >> + " in project %s!" % (msgid, project.name)) >> + series = SeriesReference.objects.filter( >> + msgid=msgid, series__project=project).first().series >> >> if not series: >> series = Series(project=project, >> @@ -1087,8 +1102,12 @@ def parse_mail(mail, list_id=None): >> # we don't save the in-reply-to or references fields >> # for a cover letter, as they can't refer to the same >> # series >> - SeriesReference.objects.get_or_create(series=series, >> - msgid=msgid) >> + try: >> + SeriesReference.objects.get_or_create(series=series, >> + msgid=msgid) >> + except SeriesReference.MultipleObjectsReturned: >> + logger.error("Multiple SeriesReferences for %s" >> + " in project %s!" % (msgid, project.name)) >> >> cover_letter = CoverLetter( >> msgid=msgid,
On Mon, 2018-02-26 at 15:55 +1100, Daniel Axtens wrote: > Stephen Finucane <stephen@that.guru> writes: > > > On Sun, 2018-02-25 at 01:50 +1100, Daniel Axtens wrote: > > > Parallel parsing would occasonally fail with: > > > > > > patchwork.models.MultipleObjectsReturned: get() returned more than one SeriesReference -- it returned 2! > > > > > > I think these are happening if you have different processes parsing > > > e.g. 1/3 and 2/3 simultaneously: both will have a reference to 1/3, > > > in the case of 1 it will be the msgid, in the case of 2 it will be > > > in References. So when we come to parse 3/3, .get() finds 2 and > > > throws the exception. > > > > > > This does not fix the creation of multiple series references; it > > > just causes them to be ignored. We still have serious race conditions > > > with series creation, but I don't yet have clear answers for them. > > > With this patch, they will at least not stop patches from being > > > processed - they'll just lead to wonky series, which we already have. > > > > > > Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> > > > Signed-off-by: Daniel Axtens <dja@axtens.net> > > > > Correct me if I'm wrong, but this seems to highlight two issues: > > > > * The race condition between searching for a series reference and > > using it > > * The fact we expect a series to be unique for a given msgid and > > project, yet the 'unique_together' is for a given msgid and > > *series*. > > > > The first of these is caught here (yet not fixed, as it's a non-trivial > > thing to resolve). The second should still be fixed though. Should we > > do this here, as part of this series? > > My thinking is no, for these reasons: > > - I want a patch I can backport to stable without requiring a database > migration, because: > > * that seems like a very unusual thing for a stable update > > * I don't trust my ability to get it right for a stable release > > * relatedly, it will require a lot of testing for me to trust it, > and I want something sooner rather than later so that the > Buildroot people can get on with their lives. > > * The different numbering of migrations will make the different > upgrade paths a nightmare to describe, yet alone test Yup, I had no considered the backportability aspects of this. That's a very valid point. Let's do this separately. > - I have a long-term ongoing effort to move 'project' up into the > various tables that use it so as to avoid JOINs and this would be a > good thing to do in that effort. (It's the next big push on my list, > it's just a nightmare to get the compatibility and migration code > going.) We should probably discuss this separately, but is this a significant part of the issue? Far as I could see, there were two different issues with the DB causing our pain: * The number of JOINs on Event, which I have an RFC waiting on reviews for. * The need to JOIN for Patch and CoverLetter, which could probably be resolved by flattening this back down into Submission. Have I missed something? Stephen > Regards, > Daniel > > > > > Stephen > > > > > --- > > > patchwork/parser.py | 23 +++++++++++++++++++++-- > > > 1 file changed, 21 insertions(+), 2 deletions(-) > > > > > > diff --git a/patchwork/parser.py b/patchwork/parser.py > > > index 56dc7006c811..5a7344cee93c 100644 > > > --- a/patchwork/parser.py > > > +++ b/patchwork/parser.py > > > @@ -240,6 +240,13 @@ def _find_series_by_references(project, mail): > > > msgid=ref[:255], series__project=project).series > > > except SeriesReference.DoesNotExist: > > > continue > > > + except SeriesReference.MultipleObjectsReturned: > > > + # FIXME: Open bug: this can happen when we're processing > > > + # messages in parallel. Pick the first and log. > > > + logger.error("Multiple SeriesReferences for %s in project %s!" % > > > + (ref[:255], project.name)) > > > + return SeriesReference.objects.filter( > > > + msgid=ref[:255], series__project=project).first().series > > > > > > > > > def _find_series_by_markers(project, mail, author): > > > @@ -1037,6 +1044,9 @@ def parse_mail(mail, list_id=None): > > > series__project=project) > > > except SeriesReference.DoesNotExist: > > > SeriesReference.objects.create(series=series, msgid=ref) > > > + except SeriesReference.MultipleObjectsReturned: > > > + logger.error("Multiple SeriesReferences for %s" > > > + " in project %s!" % (ref, project.name)) > > > > > > # add to a series if we have found one, and we have a numbered > > > # patch. Don't add unnumbered patches (for example diffs sent > > > @@ -1075,6 +1085,11 @@ def parse_mail(mail, list_id=None): > > > msgid=msgid, series__project=project).series > > > except SeriesReference.DoesNotExist: > > > series = None > > > + except SeriesReference.MultipleObjectsReturned: > > > + logger.error("Multiple SeriesReferences for %s" > > > + " in project %s!" % (msgid, project.name)) > > > + series = SeriesReference.objects.filter( > > > + msgid=msgid, series__project=project).first().series > > > > > > if not series: > > > series = Series(project=project, > > > @@ -1087,8 +1102,12 @@ def parse_mail(mail, list_id=None): > > > # we don't save the in-reply-to or references fields > > > # for a cover letter, as they can't refer to the same > > > # series > > > - SeriesReference.objects.get_or_create(series=series, > > > - msgid=msgid) > > > + try: > > > + SeriesReference.objects.get_or_create(series=series, > > > + msgid=msgid) > > > + except SeriesReference.MultipleObjectsReturned: > > > + logger.error("Multiple SeriesReferences for %s" > > > + " in project %s!" % (msgid, project.name)) > > > > > > cover_letter = CoverLetter( > > > msgid=msgid,
Stephen Finucane <stephen@that.guru> writes: > On Mon, 2018-02-26 at 15:55 +1100, Daniel Axtens wrote: >> Stephen Finucane <stephen@that.guru> writes: >> >> > On Sun, 2018-02-25 at 01:50 +1100, Daniel Axtens wrote: >> > > Parallel parsing would occasonally fail with: >> > > >> > > patchwork.models.MultipleObjectsReturned: get() returned more than one SeriesReference -- it returned 2! >> > > >> > > I think these are happening if you have different processes parsing >> > > e.g. 1/3 and 2/3 simultaneously: both will have a reference to 1/3, >> > > in the case of 1 it will be the msgid, in the case of 2 it will be >> > > in References. So when we come to parse 3/3, .get() finds 2 and >> > > throws the exception. >> > > >> > > This does not fix the creation of multiple series references; it >> > > just causes them to be ignored. We still have serious race conditions >> > > with series creation, but I don't yet have clear answers for them. >> > > With this patch, they will at least not stop patches from being >> > > processed - they'll just lead to wonky series, which we already have. >> > > >> > > Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> >> > > Signed-off-by: Daniel Axtens <dja@axtens.net> >> > >> > Correct me if I'm wrong, but this seems to highlight two issues: >> > >> > * The race condition between searching for a series reference and >> > using it >> > * The fact we expect a series to be unique for a given msgid and >> > project, yet the 'unique_together' is for a given msgid and >> > *series*. >> > >> > The first of these is caught here (yet not fixed, as it's a non-trivial >> > thing to resolve). The second should still be fixed though. Should we >> > do this here, as part of this series? >> >> My thinking is no, for these reasons: >> >> - I want a patch I can backport to stable without requiring a database >> migration, because: >> >> * that seems like a very unusual thing for a stable update >> >> * I don't trust my ability to get it right for a stable release >> >> * relatedly, it will require a lot of testing for me to trust it, >> and I want something sooner rather than later so that the >> Buildroot people can get on with their lives. >> >> * The different numbering of migrations will make the different >> upgrade paths a nightmare to describe, yet alone test > > Yup, I had no considered the backportability aspects of this. That's a > very valid point. Let's do this separately. > >> - I have a long-term ongoing effort to move 'project' up into the >> various tables that use it so as to avoid JOINs and this would be a >> good thing to do in that effort. (It's the next big push on my list, >> it's just a nightmare to get the compatibility and migration code >> going.) > > We should probably discuss this separately, but is this a significant > part of the issue? Far as I could see, there were two different issues > with the DB causing our pain: > > * The number of JOINs on Event, which I have an RFC waiting on reviews > for. > * The need to JOIN for Patch and CoverLetter, which could probably be > resolved by flattening this back down into Submission. > > Have I missed something? I think for Patch and CoverLetter, we need to pull project out of the submission table and in to patch/coverletter tables. This is my top priority after merging the various patches that people post, but it's surprisingly tricky. If we put it in both Submission and Patch, we need to migrate the data and then keep the duplicate data in sync. If we tear it out of submission entirely, we end up doing one big (very big!) migration and I haven't been able to figure that out yet. For this issue, we also need to pull project into seriesreference, either in addition to or instead of in Series. I guess that's at least isolated from the change to submission but I haven't tried yet. Regards, Daniel > Stephen > >> Regards, >> Daniel >> >> > >> > Stephen >> > >> > > --- >> > > patchwork/parser.py | 23 +++++++++++++++++++++-- >> > > 1 file changed, 21 insertions(+), 2 deletions(-) >> > > >> > > diff --git a/patchwork/parser.py b/patchwork/parser.py >> > > index 56dc7006c811..5a7344cee93c 100644 >> > > --- a/patchwork/parser.py >> > > +++ b/patchwork/parser.py >> > > @@ -240,6 +240,13 @@ def _find_series_by_references(project, mail): >> > > msgid=ref[:255], series__project=project).series >> > > except SeriesReference.DoesNotExist: >> > > continue >> > > + except SeriesReference.MultipleObjectsReturned: >> > > + # FIXME: Open bug: this can happen when we're processing >> > > + # messages in parallel. Pick the first and log. >> > > + logger.error("Multiple SeriesReferences for %s in project %s!" % >> > > + (ref[:255], project.name)) >> > > + return SeriesReference.objects.filter( >> > > + msgid=ref[:255], series__project=project).first().series >> > > >> > > >> > > def _find_series_by_markers(project, mail, author): >> > > @@ -1037,6 +1044,9 @@ def parse_mail(mail, list_id=None): >> > > series__project=project) >> > > except SeriesReference.DoesNotExist: >> > > SeriesReference.objects.create(series=series, msgid=ref) >> > > + except SeriesReference.MultipleObjectsReturned: >> > > + logger.error("Multiple SeriesReferences for %s" >> > > + " in project %s!" % (ref, project.name)) >> > > >> > > # add to a series if we have found one, and we have a numbered >> > > # patch. Don't add unnumbered patches (for example diffs sent >> > > @@ -1075,6 +1085,11 @@ def parse_mail(mail, list_id=None): >> > > msgid=msgid, series__project=project).series >> > > except SeriesReference.DoesNotExist: >> > > series = None >> > > + except SeriesReference.MultipleObjectsReturned: >> > > + logger.error("Multiple SeriesReferences for %s" >> > > + " in project %s!" % (msgid, project.name)) >> > > + series = SeriesReference.objects.filter( >> > > + msgid=msgid, series__project=project).first().series >> > > >> > > if not series: >> > > series = Series(project=project, >> > > @@ -1087,8 +1102,12 @@ def parse_mail(mail, list_id=None): >> > > # we don't save the in-reply-to or references fields >> > > # for a cover letter, as they can't refer to the same >> > > # series >> > > - SeriesReference.objects.get_or_create(series=series, >> > > - msgid=msgid) >> > > + try: >> > > + SeriesReference.objects.get_or_create(series=series, >> > > + msgid=msgid) >> > > + except SeriesReference.MultipleObjectsReturned: >> > > + logger.error("Multiple SeriesReferences for %s" >> > > + " in project %s!" % (msgid, project.name)) >> > > >> > > cover_letter = CoverLetter( >> > > msgid=msgid,
diff --git a/patchwork/parser.py b/patchwork/parser.py index 56dc7006c811..5a7344cee93c 100644 --- a/patchwork/parser.py +++ b/patchwork/parser.py @@ -240,6 +240,13 @@ def _find_series_by_references(project, mail): msgid=ref[:255], series__project=project).series except SeriesReference.DoesNotExist: continue + except SeriesReference.MultipleObjectsReturned: + # FIXME: Open bug: this can happen when we're processing + # messages in parallel. Pick the first and log. + logger.error("Multiple SeriesReferences for %s in project %s!" % + (ref[:255], project.name)) + return SeriesReference.objects.filter( + msgid=ref[:255], series__project=project).first().series def _find_series_by_markers(project, mail, author): @@ -1037,6 +1044,9 @@ def parse_mail(mail, list_id=None): series__project=project) except SeriesReference.DoesNotExist: SeriesReference.objects.create(series=series, msgid=ref) + except SeriesReference.MultipleObjectsReturned: + logger.error("Multiple SeriesReferences for %s" + " in project %s!" % (ref, project.name)) # add to a series if we have found one, and we have a numbered # patch. Don't add unnumbered patches (for example diffs sent @@ -1075,6 +1085,11 @@ def parse_mail(mail, list_id=None): msgid=msgid, series__project=project).series except SeriesReference.DoesNotExist: series = None + except SeriesReference.MultipleObjectsReturned: + logger.error("Multiple SeriesReferences for %s" + " in project %s!" % (msgid, project.name)) + series = SeriesReference.objects.filter( + msgid=msgid, series__project=project).first().series if not series: series = Series(project=project, @@ -1087,8 +1102,12 @@ def parse_mail(mail, list_id=None): # we don't save the in-reply-to or references fields # for a cover letter, as they can't refer to the same # series - SeriesReference.objects.get_or_create(series=series, - msgid=msgid) + try: + SeriesReference.objects.get_or_create(series=series, + msgid=msgid) + except SeriesReference.MultipleObjectsReturned: + logger.error("Multiple SeriesReferences for %s" + " in project %s!" % (msgid, project.name)) cover_letter = CoverLetter( msgid=msgid,