mbox series

[RFC,0/2] Rework tagging infrastructure

Message ID 20180316143832.27963-1-vkabatov@redhat.com
Headers show
Series Rework tagging infrastructure | expand

Message

Veronika Kabatova March 16, 2018, 2:38 p.m. UTC
From: Veronika Kabatova <vkabatov@redhat.com>

(TL;DR at the end)

This RFC describes an approach to rework tagging. It attempts to solve GitHub
issues #57 [1] and #113 [2] as well as some other things we encountered. I'm
sending the incomplete version (eg I haven't fixed the tests) to discuss the
approach first.

Right now, tags are extracted from all the comments on the patch and the patch
itself, and they are reextracted from all the sources every time a comment is
added or removed. It makes saving slower, and might contribute to races with
writes to database when we are parsing multiple emails at the same time. This
gets even more prevalent if we want to solve the issue #113 (tags on cover
letter should increment counters on every patch in series) -- for each added
comment on the cover letter, we would reextract tags from all the other sources,
for each patch in series; and for a change on comments related to the patch
directly, we would need to take the tags on the cover letter and it's comments
into account as well (I implemented this solution in my fork but I really
don't like it).

The current approach has several other issues as well, some of which are
mentioned in the issue #57, eg duplicate tags are counted more times. Taking
into account the tags on cover letters would also be easier if we could store
tags against them and just query them on demand, instead of reextracting
everything all over again.

Solutions for some other things we found missing solve the issues mentioned
above too. If we want to determine if the tag is duplicate, we need to save the
associated value. Having the value would help us to use arbitrary strings as
tags (for example links to issue trackers, like `Bugzilla: <link>` if the patch
solves a known bug). The key-values approach to storing tags is mentioned [3],
this email additionally mentions a currently nonexistent /comments REST API
which we would like to implement some time in the future. For the comments API
we would also find it very useful to have the tags extracted from the comments
available directly so we can query for them, which means we would either need to
reextract the tags on every API call, or we could store the tags against the
comments as they are extracted and only query them as needed. Keeping tags
related to comments where they originated from avoids the need to reextract
tags with addition of new comments, as well as gets rid of the `patch_responses`
property used when converting comments to mbox (we finally get all the custom
tags there instead of only the few hardcoded ones).

TL;DR:
Our goals:
- Avoid tag reextraction with each added comment
- Fix issues #57 and #113
- Prepare things for addition of /comments API
- Add tags to patch (currently returns {}) and cover letter APIs

If you agree with the general idea, I'd like to get some help with DB query
optimizations. Tags are more distributed, so counting them in PatchQuerySet is
harder. I wanted to use annotation with Django's conditional selection, but it
doesn't seem to work very well and generated invalid SQL in my attempts (hence
my temporary solution with counting the tags on the fly only for the patches
that are being displayed). That said, I'm not a DB expert so any faster working
solutions are more than welcome.


Thoughts? Ideas?


[1] https://github.com/getpatchwork/patchwork/issues/57
[2] https://github.com/getpatchwork/patchwork/issues/113
[3] https://lists.ozlabs.org/pipermail/patchwork/2018-January/004741.html

Veronika Kabatova (2):
  Rework tagging infrastructure
  Add migration for tagging changes

 docs/deployment/management.rst              |  16 +--
 patchwork/api/cover.py                      |  32 ++++-
 patchwork/api/patch.py                      |  41 +++++-
 patchwork/management/commands/retag.py      |  14 +-
 patchwork/migrations/0024_rework_tagging.py | 137 ++++++++++++++++++
 patchwork/models.py                         | 210 ++++++++++++++--------------
 patchwork/templatetags/patch.py             |   3 +-
 patchwork/views/__init__.py                 |   3 -
 patchwork/views/utils.py                    |   8 +-
 9 files changed, 331 insertions(+), 133 deletions(-)
 create mode 100644 patchwork/migrations/0024_rework_tagging.py

Comments

Stephen Finucane April 11, 2018, 4:43 p.m. UTC | #1
On Fri, 2018-03-16 at 15:38 +0100, vkabatov@redhat.com wrote:
> From: Veronika Kabatova <vkabatov@redhat.com>

I'm going to review this this week. However, this doesn't apply cleanly
to head of master any more (sorry :(). Any chance you could send
updated versions of these?

Stephen

> (TL;DR at the end)
> 
> This RFC describes an approach to rework tagging. It attempts to solve GitHub
> issues #57 [1] and #113 [2] as well as some other things we encountered. I'm
> sending the incomplete version (eg I haven't fixed the tests) to discuss the
> approach first.
> 
> Right now, tags are extracted from all the comments on the patch and the patch
> itself, and they are reextracted from all the sources every time a comment is
> added or removed. It makes saving slower, and might contribute to races with
> writes to database when we are parsing multiple emails at the same time. This
> gets even more prevalent if we want to solve the issue #113 (tags on cover
> letter should increment counters on every patch in series) -- for each added
> comment on the cover letter, we would reextract tags from all the other sources,
> for each patch in series; and for a change on comments related to the patch
> directly, we would need to take the tags on the cover letter and it's comments
> into account as well (I implemented this solution in my fork but I really
> don't like it).
> 
> The current approach has several other issues as well, some of which are
> mentioned in the issue #57, eg duplicate tags are counted more times. Taking
> into account the tags on cover letters would also be easier if we could store
> tags against them and just query them on demand, instead of reextracting
> everything all over again.
> 
> Solutions for some other things we found missing solve the issues mentioned
> above too. If we want to determine if the tag is duplicate, we need to save the
> associated value. Having the value would help us to use arbitrary strings as
> tags (for example links to issue trackers, like `Bugzilla: <link>` if the patch
> solves a known bug). The key-values approach to storing tags is mentioned [3],
> this email additionally mentions a currently nonexistent /comments REST API
> which we would like to implement some time in the future. For the comments API
> we would also find it very useful to have the tags extracted from the comments
> available directly so we can query for them, which means we would either need to
> reextract the tags on every API call, or we could store the tags against the
> comments as they are extracted and only query them as needed. Keeping tags
> related to comments where they originated from avoids the need to reextract
> tags with addition of new comments, as well as gets rid of the `patch_responses`
> property used when converting comments to mbox (we finally get all the custom
> tags there instead of only the few hardcoded ones).
> 
> TL;DR:
> Our goals:
> - Avoid tag reextraction with each added comment
> - Fix issues #57 and #113
> - Prepare things for addition of /comments API
> - Add tags to patch (currently returns {}) and cover letter APIs
> 
> If you agree with the general idea, I'd like to get some help with DB query
> optimizations. Tags are more distributed, so counting them in PatchQuerySet is
> harder. I wanted to use annotation with Django's conditional selection, but it
> doesn't seem to work very well and generated invalid SQL in my attempts (hence
> my temporary solution with counting the tags on the fly only for the patches
> that are being displayed). That said, I'm not a DB expert so any faster working
> solutions are more than welcome.
> 
> 
> Thoughts? Ideas?
> 
> 
> [1] https://github.com/getpatchwork/patchwork/issues/57
> [2] https://github.com/getpatchwork/patchwork/issues/113
> [3] https://lists.ozlabs.org/pipermail/patchwork/2018-January/004741.html
> 
> Veronika Kabatova (2):
>   Rework tagging infrastructure
>   Add migration for tagging changes
> 
>  docs/deployment/management.rst              |  16 +--
>  patchwork/api/cover.py                      |  32 ++++-
>  patchwork/api/patch.py                      |  41 +++++-
>  patchwork/management/commands/retag.py      |  14 +-
>  patchwork/migrations/0024_rework_tagging.py | 137 ++++++++++++++++++
>  patchwork/models.py                         | 210 ++++++++++++++--------------
>  patchwork/templatetags/patch.py             |   3 +-
>  patchwork/views/__init__.py                 |   3 -
>  patchwork/views/utils.py                    |   8 +-
>  9 files changed, 331 insertions(+), 133 deletions(-)
>  create mode 100644 patchwork/migrations/0024_rework_tagging.py
>
Veronika Kabatova April 11, 2018, 5:50 p.m. UTC | #2
----- Original Message -----
> From: "Stephen Finucane" <stephen@that.guru>
> To: vkabatov@redhat.com, patchwork@lists.ozlabs.org
> Sent: Wednesday, April 11, 2018 6:43:25 PM
> Subject: Re: [RFC 0/2] Rework tagging infrastructure
> 
> On Fri, 2018-03-16 at 15:38 +0100, vkabatov@redhat.com wrote:
> > From: Veronika Kabatova <vkabatov@redhat.com>
> 
> I'm going to review this this week. However, this doesn't apply cleanly
> to head of master any more (sorry :(). Any chance you could send
> updated versions of these?
> 

Will rebase and resend shortly.

Veronika

> Stephen
> 
> > (TL;DR at the end)
> > 
> > This RFC describes an approach to rework tagging. It attempts to solve
> > GitHub
> > issues #57 [1] and #113 [2] as well as some other things we encountered.
> > I'm
> > sending the incomplete version (eg I haven't fixed the tests) to discuss
> > the
> > approach first.
> > 
> > Right now, tags are extracted from all the comments on the patch and the
> > patch
> > itself, and they are reextracted from all the sources every time a comment
> > is
> > added or removed. It makes saving slower, and might contribute to races
> > with
> > writes to database when we are parsing multiple emails at the same time.
> > This
> > gets even more prevalent if we want to solve the issue #113 (tags on cover
> > letter should increment counters on every patch in series) -- for each
> > added
> > comment on the cover letter, we would reextract tags from all the other
> > sources,
> > for each patch in series; and for a change on comments related to the patch
> > directly, we would need to take the tags on the cover letter and it's
> > comments
> > into account as well (I implemented this solution in my fork but I really
> > don't like it).
> > 
> > The current approach has several other issues as well, some of which are
> > mentioned in the issue #57, eg duplicate tags are counted more times.
> > Taking
> > into account the tags on cover letters would also be easier if we could
> > store
> > tags against them and just query them on demand, instead of reextracting
> > everything all over again.
> > 
> > Solutions for some other things we found missing solve the issues mentioned
> > above too. If we want to determine if the tag is duplicate, we need to save
> > the
> > associated value. Having the value would help us to use arbitrary strings
> > as
> > tags (for example links to issue trackers, like `Bugzilla: <link>` if the
> > patch
> > solves a known bug). The key-values approach to storing tags is mentioned
> > [3],
> > this email additionally mentions a currently nonexistent /comments REST API
> > which we would like to implement some time in the future. For the comments
> > API
> > we would also find it very useful to have the tags extracted from the
> > comments
> > available directly so we can query for them, which means we would either
> > need to
> > reextract the tags on every API call, or we could store the tags against
> > the
> > comments as they are extracted and only query them as needed. Keeping tags
> > related to comments where they originated from avoids the need to reextract
> > tags with addition of new comments, as well as gets rid of the
> > `patch_responses`
> > property used when converting comments to mbox (we finally get all the
> > custom
> > tags there instead of only the few hardcoded ones).
> > 
> > TL;DR:
> > Our goals:
> > - Avoid tag reextraction with each added comment
> > - Fix issues #57 and #113
> > - Prepare things for addition of /comments API
> > - Add tags to patch (currently returns {}) and cover letter APIs
> > 
> > If you agree with the general idea, I'd like to get some help with DB query
> > optimizations. Tags are more distributed, so counting them in PatchQuerySet
> > is
> > harder. I wanted to use annotation with Django's conditional selection, but
> > it
> > doesn't seem to work very well and generated invalid SQL in my attempts
> > (hence
> > my temporary solution with counting the tags on the fly only for the
> > patches
> > that are being displayed). That said, I'm not a DB expert so any faster
> > working
> > solutions are more than welcome.
> > 
> > 
> > Thoughts? Ideas?
> > 
> > 
> > [1] https://github.com/getpatchwork/patchwork/issues/57
> > [2] https://github.com/getpatchwork/patchwork/issues/113
> > [3] https://lists.ozlabs.org/pipermail/patchwork/2018-January/004741.html
> > 
> > Veronika Kabatova (2):
> >   Rework tagging infrastructure
> >   Add migration for tagging changes
> > 
> >  docs/deployment/management.rst              |  16 +--
> >  patchwork/api/cover.py                      |  32 ++++-
> >  patchwork/api/patch.py                      |  41 +++++-
> >  patchwork/management/commands/retag.py      |  14 +-
> >  patchwork/migrations/0024_rework_tagging.py | 137 ++++++++++++++++++
> >  patchwork/models.py                         | 210
> >  ++++++++++++++--------------
> >  patchwork/templatetags/patch.py             |   3 +-
> >  patchwork/views/__init__.py                 |   3 -
> >  patchwork/views/utils.py                    |   8 +-
> >  9 files changed, 331 insertions(+), 133 deletions(-)
> >  create mode 100644 patchwork/migrations/0024_rework_tagging.py
> > 
> 
>