mbox series

[RFC,0/2,REBASE] Rework tagging infrastructure

Message ID 20180412152413.12849-1-vkabatov@redhat.com
Headers show
Series Rework tagging infrastructure | expand

Message

Veronika Kabatova April 12, 2018, 3:24 p.m. UTC
From: Veronika Kabatova <vkabatov@redhat.com>

(TL;DR at the end)

This RFC describes an approach to rework tagging. It attempts to solve GitHub
issues #57 [1] and #113 [2] as well as some other things we encountered. I'm
sending the incomplete version (eg I haven't fixed the tests) to discuss the
approach first.

Right now, tags are extracted from all the comments on the patch and the patch
itself, and they are reextracted from all the sources every time a comment is
added or removed. It makes saving slower, and might contribute to races with
writes to database when we are parsing multiple emails at the same time. This
gets even more prevalent if we want to solve the issue #113 (tags on cover
letter should increment counters on every patch in series) -- for each added
comment on the cover letter, we would reextract tags from all the other sources,
for each patch in series; and for a change on comments related to the patch
directly, we would need to take the tags on the cover letter and it's comments
into account as well (I implemented this solution in my fork but I really
don't like it).

The current approach has several other issues as well, some of which are
mentioned in the issue #57, eg duplicate tags are counted more times. Taking
into account the tags on cover letters would also be easier if we could store
tags against them and just query them on demand, instead of reextracting
everything all over again.

Solutions for some other things we found missing solve the issues mentioned
above too. If we want to determine if the tag is duplicate, we need to save the
associated value. Having the value would help us to use arbitrary strings as
tags (for example links to issue trackers, like `Bugzilla: <link>` if the patch
solves a known bug). The key-values approach to storing tags is mentioned [3],
this email additionally mentions a comments REST API (currently worked on). For
the comments API we would also find it very useful to have the tags extracted
from the comments available directly so we can query for them, which means we
would either need to reextract the tags on every API call, or we could store
the tags against the comments as they are extracted and only query them as
needed. Keeping tags related to comments where they originated from avoids the
need to reextract tags with addition of new comments, as well as gets rid of
the `patch_responses` property used when converting comments to mbox (we
finally get all the custom tags there instead of only the few hardcoded ones).

TL;DR:
Our goals:
- Avoid tag reextraction with each added comment
- Fix issues #57 and #113
- Prepare tags addition to comments in the API
- Add tags to patch (currently returns {}) and cover letter APIs

If you agree with the general idea, I'd like to get some help with DB query
optimizations where determined needed. Tags are more distributed, so counting
them in PatchQuerySet is harder. I wanted to use annotation with Django's
conditional selection, but it doesn't seem to work very well and generated
invalid SQL in my attempts (hence my solution with counting the tags on the fly
only for the patches that are being displayed). That said, I'm not a DB expert
so any more optimized solutions are more than welcome.


Thoughts? Ideas?


[1] https://github.com/getpatchwork/patchwork/issues/57
[2] https://github.com/getpatchwork/patchwork/issues/113
[3] https://lists.ozlabs.org/pipermail/patchwork/2018-January/004741.html

Veronika Kabatova (2):
  Rework tagging infrastructure
  Add migration for tagging changes

Comments

Stephen Finucane April 16, 2018, 10:49 a.m. UTC | #1
On Thu, 2018-04-12 at 17:24 +0200, vkabatov@redhat.com wrote:
> From: Veronika Kabatova <vkabatov@redhat.com>
> 
> (TL;DR at the end)
> 
> This RFC describes an approach to rework tagging. It attempts to solve GitHub
> issues #57 [1] and #113 [2] as well as some other things we encountered. I'm
> sending the incomplete version (eg I haven't fixed the tests) to discuss the
> approach first.
> 
> Right now, tags are extracted from all the comments on the patch and the patch
> itself, and they are reextracted from all the sources every time a comment is
> added or removed. It makes saving slower, and might contribute to races with
> writes to database when we are parsing multiple emails at the same time. This
> gets even more prevalent if we want to solve the issue #113 (tags on cover
> letter should increment counters on every patch in series) -- for each added
> comment on the cover letter, we would reextract tags from all the other sources,
> for each patch in series; and for a change on comments related to the patch
> directly, we would need to take the tags on the cover letter and it's comments
> into account as well (I implemented this solution in my fork but I really
> don't like it).
> 
> The current approach has several other issues as well, some of which are
> mentioned in the issue #57, eg duplicate tags are counted more times. Taking
> into account the tags on cover letters would also be easier if we could store
> tags against them and just query them on demand, instead of reextracting
> everything all over again.
> 
> Solutions for some other things we found missing solve the issues mentioned
> above too. If we want to determine if the tag is duplicate, we need to save the
> associated value. Having the value would help us to use arbitrary strings as
> tags (for example links to issue trackers, like `Bugzilla: <link>` if the patch
> solves a known bug). The key-values approach to storing tags is mentioned [3],
> this email additionally mentions a comments REST API (currently worked on). For
> the comments API we would also find it very useful to have the tags extracted
> from the comments available directly so we can query for them, which means we
> would either need to reextract the tags on every API call, or we could store
> the tags against the comments as they are extracted and only query them as
> needed. Keeping tags related to comments where they originated from avoids the
> need to reextract tags with addition of new comments, as well as gets rid of
> the `patch_responses` property used when converting comments to mbox (we
> finally get all the custom tags there instead of only the few hardcoded ones).
> 
> TL;DR:
> Our goals:
> - Avoid tag reextraction with each added comment
> - Fix issues #57 and #113
> - Prepare tags addition to comments in the API
> - Add tags to patch (currently returns {}) and cover letter APIs
> 
> If you agree with the general idea, I'd like to get some help with DB query
> optimizations where determined needed. Tags are more distributed, so counting
> them in PatchQuerySet is harder. I wanted to use annotation with Django's
> conditional selection, but it doesn't seem to work very well and generated
> invalid SQL in my attempts (hence my solution with counting the tags on the fly
> only for the patches that are being displayed). That said, I'm not a DB expert
> so any more optimized solutions are more than welcome.
> 
> 
> Thoughts? Ideas?

Took at look at this. As things stand, I'm not sure how we could
improve the performance of this substantially. The problem is that
we're trying to treat these as both a generic thing (cover letters,
patches _and_ comments can all have associated tags) but also somewhat
interrelated (a patch need to know about the tags attached to its
series' cover letter and its comments). This leads to the kind of
convoluted queries which impact your performance.

Let's take a look at the things we need to solve the goals you
highlighted above.

 * Keep track of where we found of the tag (patch, follow-up comment,
   series/cover letter)
    * This would resolve our tag re-extraction issue and potentially
      offer a path to resolving #113 (A/R/T on cover letter should
      increment the counters of every patch)

 * Keep track of the value-side of the equation (i.e. an email for
   something like 'Reviewed-by:')
    * This would solve #57 (Tags should be stored on a per-Person
      basis)

As noted above, we have a clash between specificness and genericness of
this field. I think we should double down on the former, even if it
means some level of duplication. To this end, what about linking 'Tag'
to 'Patch' via a ManyToMany field and use a "through" table? If we do
this, we can track optional 'cover' and 'comment' attributes for
identifying the source of the tag, if it wasn't included in the patch
itself. We could also store a value here. This would result in some
duplication (e.g. for a 10 series patch, a reply to the cover letter
would generate 10 PatchTag entries with series set for each) but it
would be simple to query. Does this make sense to you?

Stephen

> 
> [1] https://github.com/getpatchwork/patchwork/issues/57
> [2] https://github.com/getpatchwork/patchwork/issues/113
> [3] https://lists.ozlabs.org/pipermail/patchwork/2018-January/004741.html
> 
> Veronika Kabatova (2):
>   Rework tagging infrastructure
>   Add migration for tagging changes
Veronika Kabatova April 16, 2018, 1:06 p.m. UTC | #2
----- Original Message -----
> From: "Stephen Finucane" <stephen@that.guru>
> To: vkabatov@redhat.com, patchwork@lists.ozlabs.org
> Sent: Monday, April 16, 2018 12:49:03 PM
> Subject: Re: [RFC 0/2 REBASE] Rework tagging infrastructure
> 
> On Thu, 2018-04-12 at 17:24 +0200, vkabatov@redhat.com wrote:
> > From: Veronika Kabatova <vkabatov@redhat.com>
> > 
> > (TL;DR at the end)
> > 
> > This RFC describes an approach to rework tagging. It attempts to solve
> > GitHub
> > issues #57 [1] and #113 [2] as well as some other things we encountered.
> > I'm
> > sending the incomplete version (eg I haven't fixed the tests) to discuss
> > the
> > approach first.
> > 
> > Right now, tags are extracted from all the comments on the patch and the
> > patch
> > itself, and they are reextracted from all the sources every time a comment
> > is
> > added or removed. It makes saving slower, and might contribute to races
> > with
> > writes to database when we are parsing multiple emails at the same time.
> > This
> > gets even more prevalent if we want to solve the issue #113 (tags on cover
> > letter should increment counters on every patch in series) -- for each
> > added
> > comment on the cover letter, we would reextract tags from all the other
> > sources,
> > for each patch in series; and for a change on comments related to the patch
> > directly, we would need to take the tags on the cover letter and it's
> > comments
> > into account as well (I implemented this solution in my fork but I really
> > don't like it).
> > 
> > The current approach has several other issues as well, some of which are
> > mentioned in the issue #57, eg duplicate tags are counted more times.
> > Taking
> > into account the tags on cover letters would also be easier if we could
> > store
> > tags against them and just query them on demand, instead of reextracting
> > everything all over again.
> > 
> > Solutions for some other things we found missing solve the issues mentioned
> > above too. If we want to determine if the tag is duplicate, we need to save
> > the
> > associated value. Having the value would help us to use arbitrary strings
> > as
> > tags (for example links to issue trackers, like `Bugzilla: <link>` if the
> > patch
> > solves a known bug). The key-values approach to storing tags is mentioned
> > [3],
> > this email additionally mentions a comments REST API (currently worked on).
> > For
> > the comments API we would also find it very useful to have the tags
> > extracted
> > from the comments available directly so we can query for them, which means
> > we
> > would either need to reextract the tags on every API call, or we could
> > store
> > the tags against the comments as they are extracted and only query them as
> > needed. Keeping tags related to comments where they originated from avoids
> > the
> > need to reextract tags with addition of new comments, as well as gets rid
> > of
> > the `patch_responses` property used when converting comments to mbox (we
> > finally get all the custom tags there instead of only the few hardcoded
> > ones).
> > 
> > TL;DR:
> > Our goals:
> > - Avoid tag reextraction with each added comment
> > - Fix issues #57 and #113
> > - Prepare tags addition to comments in the API
> > - Add tags to patch (currently returns {}) and cover letter APIs
> > 
> > If you agree with the general idea, I'd like to get some help with DB query
> > optimizations where determined needed. Tags are more distributed, so
> > counting
> > them in PatchQuerySet is harder. I wanted to use annotation with Django's
> > conditional selection, but it doesn't seem to work very well and generated
> > invalid SQL in my attempts (hence my solution with counting the tags on the
> > fly
> > only for the patches that are being displayed). That said, I'm not a DB
> > expert
> > so any more optimized solutions are more than welcome.
> > 
> > 
> > Thoughts? Ideas?
> 
> Took at look at this. As things stand, I'm not sure how we could
> improve the performance of this substantially. The problem is that
> we're trying to treat these as both a generic thing (cover letters,
> patches _and_ comments can all have associated tags) but also somewhat
> interrelated (a patch need to know about the tags attached to its
> series' cover letter and its comments). This leads to the kind of
> convoluted queries which impact your performance.
> 
> Let's take a look at the things we need to solve the goals you
> highlighted above.
> 
>  * Keep track of where we found of the tag (patch, follow-up comment,
>    series/cover letter)
>     * This would resolve our tag re-extraction issue and potentially
>       offer a path to resolving #113 (A/R/T on cover letter should
>       increment the counters of every patch)
> 
>  * Keep track of the value-side of the equation (i.e. an email for
>    something like 'Reviewed-by:')
>     * This would solve #57 (Tags should be stored on a per-Person
>       basis)
> 
> As noted above, we have a clash between specificness and genericness of
> this field. I think we should double down on the former, even if it
> means some level of duplication. To this end, what about linking 'Tag'
> to 'Patch' via a ManyToMany field and use a "through" table? If we do
> this, we can track optional 'cover' and 'comment' attributes for
> identifying the source of the tag, if it wasn't included in the patch
> itself. We could also store a value here. This would result in some
> duplication (e.g. for a 10 series patch, a reply to the cover letter
> would generate 10 PatchTag entries with series set for each) but it
> would be simple to query. Does this make sense to you?
> 

Hi,

I need to take a proper look later but for the first look, this solution
might work out. I'll reimplement it and see if it works well for us and
either send v2 for the RFC or let you know if I find any caveats.

Thanks,
Veronika

> Stephen
> 
> > 
> > [1] https://github.com/getpatchwork/patchwork/issues/57
> > [2] https://github.com/getpatchwork/patchwork/issues/113
> > [3] https://lists.ozlabs.org/pipermail/patchwork/2018-January/004741.html
> > 
> > Veronika Kabatova (2):
> >   Rework tagging infrastructure
> >   Add migration for tagging changes
>
Veronika Kabatova April 17, 2018, 1:50 p.m. UTC | #3
----- Original Message -----
> From: "Veronika Kabatova" <vkabatov@redhat.com>
> To: "Stephen Finucane" <stephen@that.guru>
> Cc: patchwork@lists.ozlabs.org
> Sent: Monday, April 16, 2018 3:06:34 PM
> Subject: Re: [RFC 0/2 REBASE] Rework tagging infrastructure
> 
> ----- Original Message -----
> > From: "Stephen Finucane" <stephen@that.guru>
> > To: vkabatov@redhat.com, patchwork@lists.ozlabs.org
> > Sent: Monday, April 16, 2018 12:49:03 PM
> > Subject: Re: [RFC 0/2 REBASE] Rework tagging infrastructure
> > 
> > On Thu, 2018-04-12 at 17:24 +0200, vkabatov@redhat.com wrote:
> > > From: Veronika Kabatova <vkabatov@redhat.com>
> > > 
> > > TL;DR:
> > > Our goals:
> > > - Avoid tag reextraction with each added comment
> > > - Fix issues #57 and #113
> > > - Prepare tags addition to comments in the API
> > > - Add tags to patch (currently returns {}) and cover letter APIs
> > > 
> > > If you agree with the general idea, I'd like to get some help with DB
> > > query
> > > optimizations where determined needed. Tags are more distributed, so
> > > counting
> > > them in PatchQuerySet is harder. I wanted to use annotation with Django's
> > > conditional selection, but it doesn't seem to work very well and
> > > generated
> > > invalid SQL in my attempts (hence my solution with counting the tags on
> > > the
> > > fly
> > > only for the patches that are being displayed). That said, I'm not a DB
> > > expert
> > > so any more optimized solutions are more than welcome.
> > > 
> > > 
> > > Thoughts? Ideas?
> > 
> > Took at look at this. As things stand, I'm not sure how we could
> > improve the performance of this substantially. The problem is that
> > we're trying to treat these as both a generic thing (cover letters,
> > patches _and_ comments can all have associated tags) but also somewhat
> > interrelated (a patch need to know about the tags attached to its
> > series' cover letter and its comments). This leads to the kind of
> > convoluted queries which impact your performance.
> > 
> > Let's take a look at the things we need to solve the goals you
> > highlighted above.
> > 
> >  * Keep track of where we found of the tag (patch, follow-up comment,
> >    series/cover letter)
> >     * This would resolve our tag re-extraction issue and potentially
> >       offer a path to resolving #113 (A/R/T on cover letter should
> >       increment the counters of every patch)
> > 
> >  * Keep track of the value-side of the equation (i.e. an email for
> >    something like 'Reviewed-by:')
> >     * This would solve #57 (Tags should be stored on a per-Person
> >       basis)
> > 
> > As noted above, we have a clash between specificness and genericness of
> > this field. I think we should double down on the former, even if it
> > means some level of duplication. To this end, what about linking 'Tag'
> > to 'Patch' via a ManyToMany field and use a "through" table? If we do
> > this, we can track optional 'cover' and 'comment' attributes for
> > identifying the source of the tag, if it wasn't included in the patch
> > itself. We could also store a value here. This would result in some
> > duplication (e.g. for a 10 series patch, a reply to the cover letter
> > would generate 10 PatchTag entries with series set for each) but it
> > would be simple to query. Does this make sense to you?
> > 
> 
> Hi,
> 
> I need to take a proper look later but for the first look, this solution
> might work out. I'll reimplement it and see if it works well for us and
> either send v2 for the RFC or let you know if I find any caveats.
> 

So I finally started digging into this alternative solution and run into
an issue with tags extracted from cover letters (we - and maybe others do
that too - often put eg. links to issue trackers in the cover letter
instead of putting them into each patch).

Usually, the cover letter is received before the patches from series. This
means if we only had the link between Tag and Patch (with the additional
information in the intermediate model), there will be no place to save tags
from cover letter at the time of parsing.

Unless I'm overlooking something, we'd need to have the link from Tag to
both Patch and CoverLetter. This should still have much better performance
than my original solution (and will get rid of the duplication of yours).

Does this proposal make sense, or am I missing something?

Thanks,
Veronika
> 
> > Stephen
> >
Stephen Finucane April 17, 2018, 3:17 p.m. UTC | #4
On Tue, 2018-04-17 at 09:50 -0400, Veronika Kabatova wrote:
> ----- Original Message -----
> > From: "Veronika Kabatova" <vkabatov@redhat.com>
> > To: "Stephen Finucane" <stephen@that.guru>
> > Cc: patchwork@lists.ozlabs.org
> > Sent: Monday, April 16, 2018 3:06:34 PM
> > Subject: Re: [RFC 0/2 REBASE] Rework tagging infrastructure
> > 
> > ----- Original Message -----
> > > From: "Stephen Finucane" <stephen@that.guru>
> > > To: vkabatov@redhat.com, patchwork@lists.ozlabs.org
> > > Sent: Monday, April 16, 2018 12:49:03 PM
> > > Subject: Re: [RFC 0/2 REBASE] Rework tagging infrastructure
> > > 
> > > On Thu, 2018-04-12 at 17:24 +0200, vkabatov@redhat.com wrote:
> > > > From: Veronika Kabatova <vkabatov@redhat.com>
> > > > 
> > > > TL;DR:
> > > > Our goals:
> > > > - Avoid tag reextraction with each added comment
> > > > - Fix issues #57 and #113
> > > > - Prepare tags addition to comments in the API
> > > > - Add tags to patch (currently returns {}) and cover letter APIs
> > > > 
> > > > If you agree with the general idea, I'd like to get some help with DB
> > > > query
> > > > optimizations where determined needed. Tags are more distributed, so
> > > > counting
> > > > them in PatchQuerySet is harder. I wanted to use annotation with Django's
> > > > conditional selection, but it doesn't seem to work very well and
> > > > generated
> > > > invalid SQL in my attempts (hence my solution with counting the tags on
> > > > the
> > > > fly
> > > > only for the patches that are being displayed). That said, I'm not a DB
> > > > expert
> > > > so any more optimized solutions are more than welcome.
> > > > 
> > > > 
> > > > Thoughts? Ideas?
> > > 
> > > Took at look at this. As things stand, I'm not sure how we could
> > > improve the performance of this substantially. The problem is that
> > > we're trying to treat these as both a generic thing (cover letters,
> > > patches _and_ comments can all have associated tags) but also somewhat
> > > interrelated (a patch need to know about the tags attached to its
> > > series' cover letter and its comments). This leads to the kind of
> > > convoluted queries which impact your performance.
> > > 
> > > Let's take a look at the things we need to solve the goals you
> > > highlighted above.
> > > 
> > >  * Keep track of where we found of the tag (patch, follow-up comment,
> > >    series/cover letter)
> > >     * This would resolve our tag re-extraction issue and potentially
> > >       offer a path to resolving #113 (A/R/T on cover letter should
> > >       increment the counters of every patch)
> > > 
> > >  * Keep track of the value-side of the equation (i.e. an email for
> > >    something like 'Reviewed-by:')
> > >     * This would solve #57 (Tags should be stored on a per-Person
> > >       basis)
> > > 
> > > As noted above, we have a clash between specificness and genericness of
> > > this field. I think we should double down on the former, even if it
> > > means some level of duplication. To this end, what about linking 'Tag'
> > > to 'Patch' via a ManyToMany field and use a "through" table? If we do
> > > this, we can track optional 'cover' and 'comment' attributes for
> > > identifying the source of the tag, if it wasn't included in the patch
> > > itself. We could also store a value here. This would result in some
> > > duplication (e.g. for a 10 series patch, a reply to the cover letter
> > > would generate 10 PatchTag entries with series set for each) but it
> > > would be simple to query. Does this make sense to you?
> > > 
> > 
> > Hi,
> > 
> > I need to take a proper look later but for the first look, this solution
> > might work out. I'll reimplement it and see if it works well for us and
> > either send v2 for the RFC or let you know if I find any caveats.
> > 
> 
> So I finally started digging into this alternative solution and run into
> an issue with tags extracted from cover letters (we - and maybe others do
> that too - often put eg. links to issue trackers in the cover letter
> instead of putting them into each patch).
> 
> Usually, the cover letter is received before the patches from series. This
> means if we only had the link between Tag and Patch (with the additional
> information in the intermediate model), there will be no place to save tags
> from cover letter at the time of parsing.

Oh, very good point. I'd missed that.

> Unless I'm overlooking something, we'd need to have the link from Tag to
> both Patch and CoverLetter. This should still have much better performance
> than my original solution (and will get rid of the duplication of yours).
> 
> Does this proposal make sense, or am I missing something?

That mostly makes sense. My main concern is what happens when you want
to show tags for a patch when those tags were created again the cover
letter. If that's the case, are we going to have to query on
'patch.series.cover_letter.tags'? I imagine that's going to be slow
(lots of JOINs). We could store it on the series instead, but I'm not
sure how much that would improve things. Any ideas how to work around
this?

Stephen
Veronika Kabatova April 17, 2018, 3:35 p.m. UTC | #5
----- Original Message -----
> From: "Stephen Finucane" <stephen@that.guru>
> To: "Veronika Kabatova" <vkabatov@redhat.com>
> Cc: patchwork@lists.ozlabs.org
> Sent: Tuesday, April 17, 2018 5:17:19 PM
> Subject: Re: [RFC 0/2 REBASE] Rework tagging infrastructure
> 
> On Tue, 2018-04-17 at 09:50 -0400, Veronika Kabatova wrote:
> > ----- Original Message -----
> > > From: "Veronika Kabatova" <vkabatov@redhat.com>
> > > To: "Stephen Finucane" <stephen@that.guru>
> > > Cc: patchwork@lists.ozlabs.org
> > > Sent: Monday, April 16, 2018 3:06:34 PM
> > > Subject: Re: [RFC 0/2 REBASE] Rework tagging infrastructure
> > > 
> > > ----- Original Message -----
> > > > From: "Stephen Finucane" <stephen@that.guru>
> > > > To: vkabatov@redhat.com, patchwork@lists.ozlabs.org
> > > > Sent: Monday, April 16, 2018 12:49:03 PM
> > > > Subject: Re: [RFC 0/2 REBASE] Rework tagging infrastructure
> > > > 
> > > > On Thu, 2018-04-12 at 17:24 +0200, vkabatov@redhat.com wrote:
> > > > > From: Veronika Kabatova <vkabatov@redhat.com>
> > > > > 
> > > > > TL;DR:
> > > > > Our goals:
> > > > > - Avoid tag reextraction with each added comment
> > > > > - Fix issues #57 and #113
> > > > > - Prepare tags addition to comments in the API
> > > > > - Add tags to patch (currently returns {}) and cover letter APIs
> > > > > 
> > > > > If you agree with the general idea, I'd like to get some help with DB
> > > > > query
> > > > > optimizations where determined needed. Tags are more distributed, so
> > > > > counting
> > > > > them in PatchQuerySet is harder. I wanted to use annotation with
> > > > > Django's
> > > > > conditional selection, but it doesn't seem to work very well and
> > > > > generated
> > > > > invalid SQL in my attempts (hence my solution with counting the tags
> > > > > on
> > > > > the
> > > > > fly
> > > > > only for the patches that are being displayed). That said, I'm not a
> > > > > DB
> > > > > expert
> > > > > so any more optimized solutions are more than welcome.
> > > > > 
> > > > > 
> > > > > Thoughts? Ideas?
> > > > 
> > > > Took at look at this. As things stand, I'm not sure how we could
> > > > improve the performance of this substantially. The problem is that
> > > > we're trying to treat these as both a generic thing (cover letters,
> > > > patches _and_ comments can all have associated tags) but also somewhat
> > > > interrelated (a patch need to know about the tags attached to its
> > > > series' cover letter and its comments). This leads to the kind of
> > > > convoluted queries which impact your performance.
> > > > 
> > > > Let's take a look at the things we need to solve the goals you
> > > > highlighted above.
> > > > 
> > > >  * Keep track of where we found of the tag (patch, follow-up comment,
> > > >    series/cover letter)
> > > >     * This would resolve our tag re-extraction issue and potentially
> > > >       offer a path to resolving #113 (A/R/T on cover letter should
> > > >       increment the counters of every patch)
> > > > 
> > > >  * Keep track of the value-side of the equation (i.e. an email for
> > > >    something like 'Reviewed-by:')
> > > >     * This would solve #57 (Tags should be stored on a per-Person
> > > >       basis)
> > > > 
> > > > As noted above, we have a clash between specificness and genericness of
> > > > this field. I think we should double down on the former, even if it
> > > > means some level of duplication. To this end, what about linking 'Tag'
> > > > to 'Patch' via a ManyToMany field and use a "through" table? If we do
> > > > this, we can track optional 'cover' and 'comment' attributes for
> > > > identifying the source of the tag, if it wasn't included in the patch
> > > > itself. We could also store a value here. This would result in some
> > > > duplication (e.g. for a 10 series patch, a reply to the cover letter
> > > > would generate 10 PatchTag entries with series set for each) but it
> > > > would be simple to query. Does this make sense to you?
> > > > 
> > > 
> > > Hi,
> > > 
> > > I need to take a proper look later but for the first look, this solution
> > > might work out. I'll reimplement it and see if it works well for us and
> > > either send v2 for the RFC or let you know if I find any caveats.
> > > 
> > 
> > So I finally started digging into this alternative solution and run into
> > an issue with tags extracted from cover letters (we - and maybe others do
> > that too - often put eg. links to issue trackers in the cover letter
> > instead of putting them into each patch).
> > 
> > Usually, the cover letter is received before the patches from series. This
> > means if we only had the link between Tag and Patch (with the additional
> > information in the intermediate model), there will be no place to save tags
> > from cover letter at the time of parsing.
> 
> Oh, very good point. I'd missed that.
> 
> > Unless I'm overlooking something, we'd need to have the link from Tag to
> > both Patch and CoverLetter. This should still have much better performance
> > than my original solution (and will get rid of the duplication of yours).
> > 
> > Does this proposal make sense, or am I missing something?
> 
> That mostly makes sense. My main concern is what happens when you want
> to show tags for a patch when those tags were created again the cover
> letter. If that's the case, are we going to have to query on
> 'patch.series.cover_letter.tags'? I imagine that's going to be slow
> (lots of JOINs). We could store it on the series instead, but I'm not
> sure how much that would improve things. Any ideas how to work around
> this?
> 

I was thinking about filtering on the SubmissionTag (or whatever the
intermediate model will be named) based on submission IDs of the patch
and cover (or comment IDs in case of comments API), instead of going
through the relations. That said, my database knowledge is very...
abstract... so I have no idea how much it helps with the underlying
queries.

If you (or whoever else) can offer any insight that would be great!
Veronika


> Stephen
>
Stephen Finucane April 17, 2018, 3:46 p.m. UTC | #6
On Tue, 2018-04-17 at 11:35 -0400, Veronika Kabatova wrote:
> > > Unless I'm overlooking something, we'd need to have the link from Tag to
> > > both Patch and CoverLetter. This should still have much better performance
> > > than my original solution (and will get rid of the duplication of yours).
> > > 
> > > Does this proposal make sense, or am I missing something?
> > 
> > That mostly makes sense. My main concern is what happens when you want
> > to show tags for a patch when those tags were created again the cover
> > letter. If that's the case, are we going to have to query on
> > 'patch.series.cover_letter.tags'? I imagine that's going to be slow
> > (lots of JOINs). We could store it on the series instead, but I'm not
> > sure how much that would improve things. Any ideas how to work around
> > this?
> > 
> 
> I was thinking about filtering on the SubmissionTag (or whatever the
> intermediate model will be named) based on submission IDs of the patch
> and cover (or comment IDs in case of comments API), instead of going
> through the relations. That said, my database knowledge is very...
> abstract... so I have no idea how much it helps with the underlying
> queries.
> 
> If you (or whoever else) can offer any insight that would be great!

We'd still need to get information about the cover letter though, and
that requires going through the series (one join). Maybe we already
have that JOIN though, so this warrants some validation.

Another idea I've had is to store a series attribute in addition to the
cover letter, comment and patch attributes. That way we could do
something like this for patches:

   tags = Tag.objects.filter(series=patch.series,
                             Q(patch=patch) | Q(patch=None))

e.g. if the patch is part of our series and doesn't belong to _another_
patch, it must be a series-wide patch? You'd need to do additional
filtering on this for duplicates, of course, but I imagine that's easy
enough. You'd also want to make liberal use of the 'only' and 'defer'
functions to make sure we avoid as many joins as possible, however, I
don't think this would require a join on the 'patchwork_series' table
as we only use the ID column (which we'd have).

Thoughts?
Stephen
Veronika Kabatova April 17, 2018, 4:09 p.m. UTC | #7
----- Original Message -----
> From: "Stephen Finucane" <stephen@that.guru>
> To: "Veronika Kabatova" <vkabatov@redhat.com>
> Cc: patchwork@lists.ozlabs.org
> Sent: Tuesday, April 17, 2018 5:46:12 PM
> Subject: Re: [RFC 0/2 REBASE] Rework tagging infrastructure
> 
> On Tue, 2018-04-17 at 11:35 -0400, Veronika Kabatova wrote:
> > > > Unless I'm overlooking something, we'd need to have the link from Tag
> > > > to
> > > > both Patch and CoverLetter. This should still have much better
> > > > performance
> > > > than my original solution (and will get rid of the duplication of
> > > > yours).
> > > > 
> > > > Does this proposal make sense, or am I missing something?
> > > 
> > > That mostly makes sense. My main concern is what happens when you want
> > > to show tags for a patch when those tags were created again the cover
> > > letter. If that's the case, are we going to have to query on
> > > 'patch.series.cover_letter.tags'? I imagine that's going to be slow
> > > (lots of JOINs). We could store it on the series instead, but I'm not
> > > sure how much that would improve things. Any ideas how to work around
> > > this?
> > > 
> > 
> > I was thinking about filtering on the SubmissionTag (or whatever the
> > intermediate model will be named) based on submission IDs of the patch
> > and cover (or comment IDs in case of comments API), instead of going
> > through the relations. That said, my database knowledge is very...
> > abstract... so I have no idea how much it helps with the underlying
> > queries.
> > 
> > If you (or whoever else) can offer any insight that would be great!
> 
> We'd still need to get information about the cover letter though, and
> that requires going through the series (one join). Maybe we already
> have that JOIN though, so this warrants some validation.
> 

We already have the series in the API. For the view, we are prefetching
them, but only after annotation with tag counts. Will it help to change
the order there?

> Another idea I've had is to store a series attribute in addition to the
> cover letter, comment and patch attributes. That way we could do
> something like this for patches:
> 
>    tags = Tag.objects.filter(series=patch.series,
>                              Q(patch=patch) | Q(patch=None))
> 
> e.g. if the patch is part of our series and doesn't belong to _another_
> patch, it must be a series-wide patch? You'd need to do additional
> filtering on this for duplicates, of course, but I imagine that's easy
> enough. You'd also want to make liberal use of the 'only' and 'defer'
> functions to make sure we avoid as many joins as possible, however, I
> don't think this would require a join on the 'patchwork_series' table
> as we only use the ID column (which we'd have).
> 

I find having separate tables for cover letter and patch tags easier to
wrap around (and avoiding most of the duplication), but in case you think
the above won't help with performance, I'll go this way and see how it
works out. Yeah, getting out only distinct values is the easiest part :)

Veronika

> Thoughts?
> Stephen
>