mbox series

[RFC,v2,0/1] add parserelations management command

Message ID 20200611183455.18240-1-rohitsarkar5398@gmail.com
Headers show
Series add parserelations management command | expand

Message

Rohit Sarkar June 11, 2020, 6:34 p.m. UTC
The parserelations command will be used to bulk import patch relations into
Patchwork using a "patch groups" file.
The patch groups file is structured such that each line contains a
relation.
Eg patch groups file contents.
1 3 4
2
5 9 10

In this case 2 relations will be ingested into Patchwork, (1,3,4) and
(5,9,10). Further group (5,9,10) also points to two upstream commit
references.
Note that before ingesting the relations the existing relations in the
DB are removed.

v1..v2
- remove commit references from the patch groups file.
- Update documentation
- rename some variables to remove the overloading.
- use filter and update instead of get and save to reduce the db calls.
  (Visible performance enhancement)

Leaving the copyright untouched till Ralf and Lukas comment on how to
proceed.

Rohit Sarkar (1):
  management: introduce parserelations command

 docs/deployment/management.rst                | 26 +++++++
 .../management/commands/parserelations.py     | 71 +++++++++++++++++++
 patchwork/tests/test_management.py            |  7 ++
 3 files changed, 104 insertions(+)
 create mode 100644 patchwork/management/commands/parserelations.py

Comments

Rohit Sarkar June 16, 2020, 11:47 a.m. UTC | #1
Hey,
On Fri, Jun 12, 2020 at 12:04:54AM +0530, Rohit Sarkar wrote:
> The parserelations command will be used to bulk import patch relations into
> Patchwork using a "patch groups" file.
> The patch groups file is structured such that each line contains a
> relation.
> Eg patch groups file contents.
> 1 3 4
> 2
> 5 9 10
> 
> In this case 2 relations will be ingested into Patchwork, (1,3,4) and
> (5,9,10). Further group (5,9,10) also points to two upstream commit
> references.
> Note that before ingesting the relations the existing relations in the
> DB are removed.
> 
> v1..v2
> - remove commit references from the patch groups file.
> - Update documentation
> - rename some variables to remove the overloading.
> - use filter and update instead of get and save to reduce the db calls.
>   (Visible performance enhancement)
> 
> Leaving the copyright untouched till Ralf and Lukas comment on how to
> proceed.
> 
> Rohit Sarkar (1):
>   management: introduce parserelations command
> 
>  docs/deployment/management.rst                | 26 +++++++
>  .../management/commands/parserelations.py     | 71 +++++++++++++++++++
>  patchwork/tests/test_management.py            |  7 ++
>  3 files changed, 104 insertions(+)
>  create mode 100644 patchwork/management/commands/parserelations.py
> 
> -- 
> 2.23.0.385.gbc12974a89
> 

Just wanted to follow up on this. Does this look good?

Thanks,
Rohit
Daniel Axtens June 16, 2020, 9:55 p.m. UTC | #2
Rohit Sarkar <rohitsarkar5398@gmail.com> writes:

> Hey,
> On Fri, Jun 12, 2020 at 12:04:54AM +0530, Rohit Sarkar wrote:
>> The parserelations command will be used to bulk import patch relations into
>> Patchwork using a "patch groups" file.
>> The patch groups file is structured such that each line contains a
>> relation.
>> Eg patch groups file contents.
>> 1 3 4
>> 2
>> 5 9 10
>> 
>> In this case 2 relations will be ingested into Patchwork, (1,3,4) and
>> (5,9,10). Further group (5,9,10) also points to two upstream commit
>> references.
>> Note that before ingesting the relations the existing relations in the
>> DB are removed.
>> 
>> v1..v2
>> - remove commit references from the patch groups file.
>> - Update documentation
>> - rename some variables to remove the overloading.
>> - use filter and update instead of get and save to reduce the db calls.
>>   (Visible performance enhancement)
>> 
>> Leaving the copyright untouched till Ralf and Lukas comment on how to
>> proceed.
>> 
>> Rohit Sarkar (1):
>>   management: introduce parserelations command
>> 
>>  docs/deployment/management.rst                | 26 +++++++
>>  .../management/commands/parserelations.py     | 71 +++++++++++++++++++
>>  patchwork/tests/test_management.py            |  7 ++
>>  3 files changed, 104 insertions(+)
>>  create mode 100644 patchwork/management/commands/parserelations.py
>> 
>> -- 
>> 2.23.0.385.gbc12974a89
>> 
>
> Just wanted to follow up on this. Does this look good?

I was thinking about this as I washed the dishes last night.

Purging all relations in the database means that if any API-based user
of relations emerges before the big public servers adopt a release with
this, then they'll never be able to use it.

What's the speed impact of doing two passes through the data, with pass
1 collecting all the projects that this touches and pass 2 actually
installing the relations?

Regards,
Daniel

> Thanks,
> Rohit
Rohit Sarkar June 17, 2020, 3:19 p.m. UTC | #3
Hey Daniel,
> I was thinking about this as I washed the dishes last night.
> 
> Purging all relations in the database means that if any API-based user
> of relations emerges before the big public servers adopt a release with
> this, then they'll never be able to use it.
> 
> What's the speed impact of doing two passes through the data, with pass
> 1 collecting all the projects that this touches and pass 2 actually
> installing the relations?
> 
> Regards,
> Daniel

Dont know if you read my previous email [1], where I had essentially
outlined a non destructive approach to this. This way, I dont think we
will need 2 passes.

Thanks,
Rohit

[1]: https://lists.ozlabs.org/pipermail/patchwork/2020-June/006631.html
Rohit Sarkar June 27, 2020, 12:39 p.m. UTC | #4
Hey,
On Wed, Jun 17, 2020 at 07:55:37AM +1000, Daniel Axtens wrote:
> Rohit Sarkar <rohitsarkar5398@gmail.com> writes:
> 
> > Hey,
> > On Fri, Jun 12, 2020 at 12:04:54AM +0530, Rohit Sarkar wrote:
> >> The parserelations command will be used to bulk import patch relations into
> >> Patchwork using a "patch groups" file.
> >> The patch groups file is structured such that each line contains a
> >> relation.
> >> Eg patch groups file contents.
> >> 1 3 4
> >> 2
> >> 5 9 10
> >> 
> >> In this case 2 relations will be ingested into Patchwork, (1,3,4) and
> >> (5,9,10). Further group (5,9,10) also points to two upstream commit
> >> references.
> >> Note that before ingesting the relations the existing relations in the
> >> DB are removed.
> >> 
> >> v1..v2
> >> - remove commit references from the patch groups file.
> >> - Update documentation
> >> - rename some variables to remove the overloading.
> >> - use filter and update instead of get and save to reduce the db calls.
> >>   (Visible performance enhancement)
> >> 
> >> Leaving the copyright untouched till Ralf and Lukas comment on how to
> >> proceed.
> >> 
> >> Rohit Sarkar (1):
> >>   management: introduce parserelations command
> >> 
> >>  docs/deployment/management.rst                | 26 +++++++
> >>  .../management/commands/parserelations.py     | 71 +++++++++++++++++++
> >>  patchwork/tests/test_management.py            |  7 ++
> >>  3 files changed, 104 insertions(+)
> >>  create mode 100644 patchwork/management/commands/parserelations.py
> >> 
> >> -- 
> >> 2.23.0.385.gbc12974a89
> >> 
> >
> > Just wanted to follow up on this. Does this look good?
> 
> I was thinking about this as I washed the dishes last night.
> 
> Purging all relations in the database means that if any API-based user
> of relations emerges before the big public servers adopt a release with
> this, then they'll never be able to use it.
> 
> What's the speed impact of doing two passes through the data, with pass
> 1 collecting all the projects that this touches and pass 2 actually
> installing the relations?
> 
> Regards,
> Daniel
I wanted to reopen the discussion on this with the aim being to get
everyone on the same page, and either go with this approach or decide on
an alternate approach.

The primary goal is to have a "parserelations" command that can parse a
"patch-groups" file that contains a relation on each line. There are 2
approaches:

a. Refresh the relations table. The parserelations command will remove
the existing relations before inserting the newly parsed relations. This
avoids any conflicts between existing and new relations. However other
users of the API might be affected.

b. Insert relations in a non destructive way. We need a way to handle
conflicts between existing and incoming relations. So the solution is
not really clear in this case. What should have precedence in the case
of conflicts? The parserelations script or the existing relations.


Thanks,
Rohit
> > Thanks,
> > Rohit
Daniel Axtens June 30, 2020, 1:59 a.m. UTC | #5
Rohit Sarkar <rohitsarkar5398@gmail.com> writes:

> Hey,
> On Wed, Jun 17, 2020 at 07:55:37AM +1000, Daniel Axtens wrote:
>> Rohit Sarkar <rohitsarkar5398@gmail.com> writes:
>> 
>> > Hey,
>> > On Fri, Jun 12, 2020 at 12:04:54AM +0530, Rohit Sarkar wrote:
>> >> The parserelations command will be used to bulk import patch relations into
>> >> Patchwork using a "patch groups" file.
>> >> The patch groups file is structured such that each line contains a
>> >> relation.
>> >> Eg patch groups file contents.
>> >> 1 3 4
>> >> 2
>> >> 5 9 10
>> >> 
>> >> In this case 2 relations will be ingested into Patchwork, (1,3,4) and
>> >> (5,9,10). Further group (5,9,10) also points to two upstream commit
>> >> references.
>> >> Note that before ingesting the relations the existing relations in the
>> >> DB are removed.
>> >> 
>> >> v1..v2
>> >> - remove commit references from the patch groups file.
>> >> - Update documentation
>> >> - rename some variables to remove the overloading.
>> >> - use filter and update instead of get and save to reduce the db calls.
>> >>   (Visible performance enhancement)
>> >> 
>> >> Leaving the copyright untouched till Ralf and Lukas comment on how to
>> >> proceed.
>> >> 
>> >> Rohit Sarkar (1):
>> >>   management: introduce parserelations command
>> >> 
>> >>  docs/deployment/management.rst                | 26 +++++++
>> >>  .../management/commands/parserelations.py     | 71 +++++++++++++++++++
>> >>  patchwork/tests/test_management.py            |  7 ++
>> >>  3 files changed, 104 insertions(+)
>> >>  create mode 100644 patchwork/management/commands/parserelations.py
>> >> 
>> >> -- 
>> >> 2.23.0.385.gbc12974a89
>> >> 
>> >
>> > Just wanted to follow up on this. Does this look good?
>> 
>> I was thinking about this as I washed the dishes last night.
>> 
>> Purging all relations in the database means that if any API-based user
>> of relations emerges before the big public servers adopt a release with
>> this, then they'll never be able to use it.
>> 
>> What's the speed impact of doing two passes through the data, with pass
>> 1 collecting all the projects that this touches and pass 2 actually
>> installing the relations?
>> 
>> Regards,
>> Daniel
> I wanted to reopen the discussion on this with the aim being to get
> everyone on the same page, and either go with this approach or decide on
> an alternate approach.
>
> The primary goal is to have a "parserelations" command that can parse a
> "patch-groups" file that contains a relation on each line. There are 2
> approaches:
>
> a. Refresh the relations table. The parserelations command will remove
> the existing relations before inserting the newly parsed relations. This
> avoids any conflicts between existing and new relations. However other
> users of the API might be affected.
>
> b. Insert relations in a non destructive way. We need a way to handle
> conflicts between existing and incoming relations. So the solution is
> not really clear in this case. What should have precedence in the case
> of conflicts? The parserelations script or the existing relations.
>


Right, sorry for dropping this, work got a bit crazy.

It occurs to me that I have an embedded assumption that may not be true
here. Are you trying to build a set of relations that includes
cross-project relations, or are you just trying to build a set of
relations entirely or mostly within a single patchwork project?

I've been assuming that your relations are entirely or predominantly
intra-project, but I'm starting to suspect that maybe my understanding
is not right.

If you are expecting a non-trivial quantity of cross-project relations,
then my suggested approach earlier (caller of the script specifies the
projects that patches may belong to) doesn't make sense. It probably
makes most sense to go with approach (a) in that case - it's really
difficult to figure out how to do (b) in a way that preserves the
meaning that any other API user, or the PaStA tool, is trying to create.

If we do go down route (a), I want to make it really hard for users to
accidentally shoot themselves in the foot here and wipe out data. So
probably I would say that parserelations should only insert relations if
the table is empty when the script begins, and there should be a
separate command that with a suitably scary name that purges all
existing relations.


Have I correctly understood things here?

Regards,
Daniel




>
> Thanks,
> Rohit
>> > Thanks,
>> > Rohit
Rohit Sarkar June 30, 2020, 8:31 a.m. UTC | #6
On Tue, Jun 30, 2020 at 11:59:52AM +1000, Daniel Axtens wrote:
> Rohit Sarkar <rohitsarkar5398@gmail.com> writes:
> 
> > Hey,
> > On Wed, Jun 17, 2020 at 07:55:37AM +1000, Daniel Axtens wrote:
> >> Rohit Sarkar <rohitsarkar5398@gmail.com> writes:
> >> 
> >> > Hey,
> >> > On Fri, Jun 12, 2020 at 12:04:54AM +0530, Rohit Sarkar wrote:
> >> >> The parserelations command will be used to bulk import patch relations into
> >> >> Patchwork using a "patch groups" file.
> >> >> The patch groups file is structured such that each line contains a
> >> >> relation.
> >> >> Eg patch groups file contents.
> >> >> 1 3 4
> >> >> 2
> >> >> 5 9 10
> >> >> 
> >> >> In this case 2 relations will be ingested into Patchwork, (1,3,4) and
> >> >> (5,9,10). Further group (5,9,10) also points to two upstream commit
> >> >> references.
> >> >> Note that before ingesting the relations the existing relations in the
> >> >> DB are removed.
> >> >> 
> >> >> v1..v2
> >> >> - remove commit references from the patch groups file.
> >> >> - Update documentation
> >> >> - rename some variables to remove the overloading.
> >> >> - use filter and update instead of get and save to reduce the db calls.
> >> >>   (Visible performance enhancement)
> >> >> 
> >> >> Leaving the copyright untouched till Ralf and Lukas comment on how to
> >> >> proceed.
> >> >> 
> >> >> Rohit Sarkar (1):
> >> >>   management: introduce parserelations command
> >> >> 
> >> >>  docs/deployment/management.rst                | 26 +++++++
> >> >>  .../management/commands/parserelations.py     | 71 +++++++++++++++++++
> >> >>  patchwork/tests/test_management.py            |  7 ++
> >> >>  3 files changed, 104 insertions(+)
> >> >>  create mode 100644 patchwork/management/commands/parserelations.py
> >> >> 
> >> >> -- 
> >> >> 2.23.0.385.gbc12974a89
> >> >> 
> >> >
> >> > Just wanted to follow up on this. Does this look good?
> >> 
> >> I was thinking about this as I washed the dishes last night.
> >> 
> >> Purging all relations in the database means that if any API-based user
> >> of relations emerges before the big public servers adopt a release with
> >> this, then they'll never be able to use it.
> >> 
> >> What's the speed impact of doing two passes through the data, with pass
> >> 1 collecting all the projects that this touches and pass 2 actually
> >> installing the relations?
> >> 
> >> Regards,
> >> Daniel
> > I wanted to reopen the discussion on this with the aim being to get
> > everyone on the same page, and either go with this approach or decide on
> > an alternate approach.
> >
> > The primary goal is to have a "parserelations" command that can parse a
> > "patch-groups" file that contains a relation on each line. There are 2
> > approaches:
> >
> > a. Refresh the relations table. The parserelations command will remove
> > the existing relations before inserting the newly parsed relations. This
> > avoids any conflicts between existing and new relations. However other
> > users of the API might be affected.
> >
> > b. Insert relations in a non destructive way. We need a way to handle
> > conflicts between existing and incoming relations. So the solution is
> > not really clear in this case. What should have precedence in the case
> > of conflicts? The parserelations script or the existing relations.
> >
> 
> 
> Right, sorry for dropping this, work got a bit crazy.
> 
> It occurs to me that I have an embedded assumption that may not be true
> here. Are you trying to build a set of relations that includes
> cross-project relations, or are you just trying to build a set of
> relations entirely or mostly within a single patchwork project?
> 
> I've been assuming that your relations are entirely or predominantly
> intra-project, but I'm starting to suspect that maybe my understanding
> is not right.
The relations can be inter project too. PaStA is configured with a list
of Patchwork projects for which it will pull all the corresponding
patches and perform it's analyses. PaStA can relate patches across
projects too.

> If you are expecting a non-trivial quantity of cross-project relations,
> then my suggested approach earlier (caller of the script specifies the
> projects that patches may belong to) doesn't make sense. It probably
> makes most sense to go with approach (a) in that case - it's really
> difficult to figure out how to do (b) in a way that preserves the
> meaning that any other API user, or the PaStA tool, is trying to create.
> 
> If we do go down route (a), I want to make it really hard for users to
> accidentally shoot themselves in the foot here and wipe out data. So
> probably I would say that parserelations should only insert relations if
> the table is empty when the script begins, and there should be a
> separate command that with a suitably scary name that purges all
> existing relations.

I agree that we should make parserelations actions very explicit to
prevent any loss of data. However, if we use parserelations to
differentially import relations we will need to be able to insert
relations with a non empty table. Let me explain the complete flow
between PaStA and Patchwork so that things are a bit clear.
We plan to run a script, maybe as a cron job at regular time intervals
which does the following:
1. Exports patches from Patchwork into PaStA
2. Runs PaStA's analyses
3. Runs the parserelations script to parse the patch-groups file
generated by PaStA and inserts the relations into Patchwork. In this
step we refresh the db and insert all the relations afresh. A non
destructive approach would involve thinking of a way to resolve
conflicts between existing and new relations.

> 
> Have I correctly understood things here?
> 
> Regards,
> Daniel
> 
Thanks,
Rohit
Daniel Axtens July 1, 2020, 7:33 a.m. UTC | #7
Rohit Sarkar <rohitsarkar5398@gmail.com> writes:

> On Tue, Jun 30, 2020 at 11:59:52AM +1000, Daniel Axtens wrote:
>> Rohit Sarkar <rohitsarkar5398@gmail.com> writes:
>> 
>> > Hey,
>> > On Wed, Jun 17, 2020 at 07:55:37AM +1000, Daniel Axtens wrote:
>> >> Rohit Sarkar <rohitsarkar5398@gmail.com> writes:
>> >> 
>> >> > Hey,
>> >> > On Fri, Jun 12, 2020 at 12:04:54AM +0530, Rohit Sarkar wrote:
>> >> >> The parserelations command will be used to bulk import patch relations into
>> >> >> Patchwork using a "patch groups" file.
>> >> >> The patch groups file is structured such that each line contains a
>> >> >> relation.
>> >> >> Eg patch groups file contents.
>> >> >> 1 3 4
>> >> >> 2
>> >> >> 5 9 10
>> >> >> 
>> >> >> In this case 2 relations will be ingested into Patchwork, (1,3,4) and
>> >> >> (5,9,10). Further group (5,9,10) also points to two upstream commit
>> >> >> references.
>> >> >> Note that before ingesting the relations the existing relations in the
>> >> >> DB are removed.
>> >> >> 
>> >> >> v1..v2
>> >> >> - remove commit references from the patch groups file.
>> >> >> - Update documentation
>> >> >> - rename some variables to remove the overloading.
>> >> >> - use filter and update instead of get and save to reduce the db calls.
>> >> >>   (Visible performance enhancement)
>> >> >> 
>> >> >> Leaving the copyright untouched till Ralf and Lukas comment on how to
>> >> >> proceed.
>> >> >> 
>> >> >> Rohit Sarkar (1):
>> >> >>   management: introduce parserelations command
>> >> >> 
>> >> >>  docs/deployment/management.rst                | 26 +++++++
>> >> >>  .../management/commands/parserelations.py     | 71 +++++++++++++++++++
>> >> >>  patchwork/tests/test_management.py            |  7 ++
>> >> >>  3 files changed, 104 insertions(+)
>> >> >>  create mode 100644 patchwork/management/commands/parserelations.py
>> >> >> 
>> >> >> -- 
>> >> >> 2.23.0.385.gbc12974a89
>> >> >> 
>> >> >
>> >> > Just wanted to follow up on this. Does this look good?
>> >> 
>> >> I was thinking about this as I washed the dishes last night.
>> >> 
>> >> Purging all relations in the database means that if any API-based user
>> >> of relations emerges before the big public servers adopt a release with
>> >> this, then they'll never be able to use it.
>> >> 
>> >> What's the speed impact of doing two passes through the data, with pass
>> >> 1 collecting all the projects that this touches and pass 2 actually
>> >> installing the relations?
>> >> 
>> >> Regards,
>> >> Daniel
>> > I wanted to reopen the discussion on this with the aim being to get
>> > everyone on the same page, and either go with this approach or decide on
>> > an alternate approach.
>> >
>> > The primary goal is to have a "parserelations" command that can parse a
>> > "patch-groups" file that contains a relation on each line. There are 2
>> > approaches:
>> >
>> > a. Refresh the relations table. The parserelations command will remove
>> > the existing relations before inserting the newly parsed relations. This
>> > avoids any conflicts between existing and new relations. However other
>> > users of the API might be affected.
>> >
>> > b. Insert relations in a non destructive way. We need a way to handle
>> > conflicts between existing and incoming relations. So the solution is
>> > not really clear in this case. What should have precedence in the case
>> > of conflicts? The parserelations script or the existing relations.
>> >
>> 
>> 
>> Right, sorry for dropping this, work got a bit crazy.
>> 
>> It occurs to me that I have an embedded assumption that may not be true
>> here. Are you trying to build a set of relations that includes
>> cross-project relations, or are you just trying to build a set of
>> relations entirely or mostly within a single patchwork project?
>> 
>> I've been assuming that your relations are entirely or predominantly
>> intra-project, but I'm starting to suspect that maybe my understanding
>> is not right.
> The relations can be inter project too. PaStA is configured with a list
> of Patchwork projects for which it will pull all the corresponding
> patches and perform it's analyses. PaStA can relate patches across
> projects too.
>
>> If you are expecting a non-trivial quantity of cross-project relations,
>> then my suggested approach earlier (caller of the script specifies the
>> projects that patches may belong to) doesn't make sense. It probably
>> makes most sense to go with approach (a) in that case - it's really
>> difficult to figure out how to do (b) in a way that preserves the
>> meaning that any other API user, or the PaStA tool, is trying to create.
>> 
>> If we do go down route (a), I want to make it really hard for users to
>> accidentally shoot themselves in the foot here and wipe out data. So
>> probably I would say that parserelations should only insert relations if
>> the table is empty when the script begins, and there should be a
>> separate command that with a suitably scary name that purges all
>> existing relations.
>
> I agree that we should make parserelations actions very explicit to
> prevent any loss of data. However, if we use parserelations to
> differentially import relations we will need to be able to insert
> relations with a non empty table. Let me explain the complete flow
> between PaStA and Patchwork so that things are a bit clear.
> We plan to run a script, maybe as a cron job at regular time intervals
> which does the following:
> 1. Exports patches from Patchwork into PaStA
> 2. Runs PaStA's analyses
> 3. Runs the parserelations script to parse the patch-groups file
> generated by PaStA and inserts the relations into Patchwork. In this
> step we refresh the db and insert all the relations afresh. A non
> destructive approach would involve thinking of a way to resolve
> conflicts between existing and new relations.

OK.

I am happy for you to stick with a destructive approach for the
managment command, so long as it's clear to the user that data is
potentially about to be destroyed. I'm happy for that to be:

 - two commands: a command that purges relations, and then a parse
   command that checks if the relations table is empty and refuses to
   run if it is not empty.

 - a single command that requires you to pass a command line option like
   '--really-delete-old-relations'

 - a single command with a name that captures the idea that data will be
   destroyed. I don't know what that would be, maybe 'replace...'.

I'm not fussy which option you go with, I just don't want a command
named 'parse...' to delete data without the user being properly warned.

(You will at some point need to consider how to do incremental updates
over the API for a publicly usable version, but that can definitely wait
and might be out of scope for your GSoC project.)

Regards,
Daniel

>
>> 
>> Have I correctly understood things here?
>> 
>> Regards,
>> Daniel
>> 
> Thanks,
> Rohit
Rohit Sarkar July 1, 2020, 6:15 p.m. UTC | #8
On Wed, Jul 01, 2020 at 05:33:03PM +1000, Daniel Axtens wrote:
> Rohit Sarkar <rohitsarkar5398@gmail.com> writes:
> 
> > On Tue, Jun 30, 2020 at 11:59:52AM +1000, Daniel Axtens wrote:
> >> Rohit Sarkar <rohitsarkar5398@gmail.com> writes:
> >> 
> >> > Hey,
> >> > On Wed, Jun 17, 2020 at 07:55:37AM +1000, Daniel Axtens wrote:
> >> >> Rohit Sarkar <rohitsarkar5398@gmail.com> writes:
> >> >> 
> >> >> > Hey,
> >> >> > On Fri, Jun 12, 2020 at 12:04:54AM +0530, Rohit Sarkar wrote:
> >> >> >> The parserelations command will be used to bulk import patch relations into
> >> >> >> Patchwork using a "patch groups" file.
> >> >> >> The patch groups file is structured such that each line contains a
> >> >> >> relation.
> >> >> >> Eg patch groups file contents.
> >> >> >> 1 3 4
> >> >> >> 2
> >> >> >> 5 9 10
> >> >> >> 
> >> >> >> In this case 2 relations will be ingested into Patchwork, (1,3,4) and
> >> >> >> (5,9,10). Further group (5,9,10) also points to two upstream commit
> >> >> >> references.
> >> >> >> Note that before ingesting the relations the existing relations in the
> >> >> >> DB are removed.
> >> >> >> 
> >> >> >> v1..v2
> >> >> >> - remove commit references from the patch groups file.
> >> >> >> - Update documentation
> >> >> >> - rename some variables to remove the overloading.
> >> >> >> - use filter and update instead of get and save to reduce the db calls.
> >> >> >>   (Visible performance enhancement)
> >> >> >> 
> >> >> >> Leaving the copyright untouched till Ralf and Lukas comment on how to
> >> >> >> proceed.
> >> >> >> 
> >> >> >> Rohit Sarkar (1):
> >> >> >>   management: introduce parserelations command
> >> >> >> 
> >> >> >>  docs/deployment/management.rst                | 26 +++++++
> >> >> >>  .../management/commands/parserelations.py     | 71 +++++++++++++++++++
> >> >> >>  patchwork/tests/test_management.py            |  7 ++
> >> >> >>  3 files changed, 104 insertions(+)
> >> >> >>  create mode 100644 patchwork/management/commands/parserelations.py
> >> >> >> 
> >> >> >> -- 
> >> >> >> 2.23.0.385.gbc12974a89
> >> >> >> 
> >> >> >
> >> >> > Just wanted to follow up on this. Does this look good?
> >> >> 
> >> >> I was thinking about this as I washed the dishes last night.
> >> >> 
> >> >> Purging all relations in the database means that if any API-based user
> >> >> of relations emerges before the big public servers adopt a release with
> >> >> this, then they'll never be able to use it.
> >> >> 
> >> >> What's the speed impact of doing two passes through the data, with pass
> >> >> 1 collecting all the projects that this touches and pass 2 actually
> >> >> installing the relations?
> >> >> 
> >> >> Regards,
> >> >> Daniel
> >> > I wanted to reopen the discussion on this with the aim being to get
> >> > everyone on the same page, and either go with this approach or decide on
> >> > an alternate approach.
> >> >
> >> > The primary goal is to have a "parserelations" command that can parse a
> >> > "patch-groups" file that contains a relation on each line. There are 2
> >> > approaches:
> >> >
> >> > a. Refresh the relations table. The parserelations command will remove
> >> > the existing relations before inserting the newly parsed relations. This
> >> > avoids any conflicts between existing and new relations. However other
> >> > users of the API might be affected.
> >> >
> >> > b. Insert relations in a non destructive way. We need a way to handle
> >> > conflicts between existing and incoming relations. So the solution is
> >> > not really clear in this case. What should have precedence in the case
> >> > of conflicts? The parserelations script or the existing relations.
> >> >
> >> 
> >> 
> >> Right, sorry for dropping this, work got a bit crazy.
> >> 
> >> It occurs to me that I have an embedded assumption that may not be true
> >> here. Are you trying to build a set of relations that includes
> >> cross-project relations, or are you just trying to build a set of
> >> relations entirely or mostly within a single patchwork project?
> >> 
> >> I've been assuming that your relations are entirely or predominantly
> >> intra-project, but I'm starting to suspect that maybe my understanding
> >> is not right.
> > The relations can be inter project too. PaStA is configured with a list
> > of Patchwork projects for which it will pull all the corresponding
> > patches and perform it's analyses. PaStA can relate patches across
> > projects too.
> >
> >> If you are expecting a non-trivial quantity of cross-project relations,
> >> then my suggested approach earlier (caller of the script specifies the
> >> projects that patches may belong to) doesn't make sense. It probably
> >> makes most sense to go with approach (a) in that case - it's really
> >> difficult to figure out how to do (b) in a way that preserves the
> >> meaning that any other API user, or the PaStA tool, is trying to create.
> >> 
> >> If we do go down route (a), I want to make it really hard for users to
> >> accidentally shoot themselves in the foot here and wipe out data. So
> >> probably I would say that parserelations should only insert relations if
> >> the table is empty when the script begins, and there should be a
> >> separate command that with a suitably scary name that purges all
> >> existing relations.
> >
> > I agree that we should make parserelations actions very explicit to
> > prevent any loss of data. However, if we use parserelations to
> > differentially import relations we will need to be able to insert
> > relations with a non empty table. Let me explain the complete flow
> > between PaStA and Patchwork so that things are a bit clear.
> > We plan to run a script, maybe as a cron job at regular time intervals
> > which does the following:
> > 1. Exports patches from Patchwork into PaStA
> > 2. Runs PaStA's analyses
> > 3. Runs the parserelations script to parse the patch-groups file
> > generated by PaStA and inserts the relations into Patchwork. In this
> > step we refresh the db and insert all the relations afresh. A non
> > destructive approach would involve thinking of a way to resolve
> > conflicts between existing and new relations.
> 
> OK.
> 
> I am happy for you to stick with a destructive approach for the
> managment command, so long as it's clear to the user that data is
> potentially about to be destroyed. I'm happy for that to be:
> 
>  - two commands: a command that purges relations, and then a parse
>    command that checks if the relations table is empty and refuses to
>    run if it is not empty.
> 
>  - a single command that requires you to pass a command line option like
>    '--really-delete-old-relations'
> 
>  - a single command with a name that captures the idea that data will be
>    destroyed. I don't know what that would be, maybe 'replace...'.
> 
> I'm not fussy which option you go with, I just don't want a command
> named 'parse...' to delete data without the user being properly warned.
Yeah, I agree. The name "parserelations" doesnt make it explicit enough.
I am fine with replacerelations.

> (You will at some point need to consider how to do incremental updates
> over the API for a publicly usable version, but that can definitely wait
> and might be out of scope for your GSoC project.)
Agreed.

> Regards,
> Daniel
> 
> >
> >> 
> >> Have I correctly understood things here?
> >> 
> >> Regards,
> >> Daniel
> >> 
> > Thanks,
> > Rohit
Thanks
Rohit
Daniel Axtens July 1, 2020, 10:48 p.m. UTC | #9
Rohit Sarkar <rohitsarkar5398@gmail.com> writes:

> On Wed, Jul 01, 2020 at 05:33:03PM +1000, Daniel Axtens wrote:
>> Rohit Sarkar <rohitsarkar5398@gmail.com> writes:
>> 
>> > On Tue, Jun 30, 2020 at 11:59:52AM +1000, Daniel Axtens wrote:
>> >> Rohit Sarkar <rohitsarkar5398@gmail.com> writes:
>> >> 
>> >> > Hey,
>> >> > On Wed, Jun 17, 2020 at 07:55:37AM +1000, Daniel Axtens wrote:
>> >> >> Rohit Sarkar <rohitsarkar5398@gmail.com> writes:
>> >> >> 
>> >> >> > Hey,
>> >> >> > On Fri, Jun 12, 2020 at 12:04:54AM +0530, Rohit Sarkar wrote:
>> >> >> >> The parserelations command will be used to bulk import patch relations into
>> >> >> >> Patchwork using a "patch groups" file.
>> >> >> >> The patch groups file is structured such that each line contains a
>> >> >> >> relation.
>> >> >> >> Eg patch groups file contents.
>> >> >> >> 1 3 4
>> >> >> >> 2
>> >> >> >> 5 9 10
>> >> >> >> 
>> >> >> >> In this case 2 relations will be ingested into Patchwork, (1,3,4) and
>> >> >> >> (5,9,10). Further group (5,9,10) also points to two upstream commit
>> >> >> >> references.
>> >> >> >> Note that before ingesting the relations the existing relations in the
>> >> >> >> DB are removed.
>> >> >> >> 
>> >> >> >> v1..v2
>> >> >> >> - remove commit references from the patch groups file.
>> >> >> >> - Update documentation
>> >> >> >> - rename some variables to remove the overloading.
>> >> >> >> - use filter and update instead of get and save to reduce the db calls.
>> >> >> >>   (Visible performance enhancement)
>> >> >> >> 
>> >> >> >> Leaving the copyright untouched till Ralf and Lukas comment on how to
>> >> >> >> proceed.
>> >> >> >> 
>> >> >> >> Rohit Sarkar (1):
>> >> >> >>   management: introduce parserelations command
>> >> >> >> 
>> >> >> >>  docs/deployment/management.rst                | 26 +++++++
>> >> >> >>  .../management/commands/parserelations.py     | 71 +++++++++++++++++++
>> >> >> >>  patchwork/tests/test_management.py            |  7 ++
>> >> >> >>  3 files changed, 104 insertions(+)
>> >> >> >>  create mode 100644 patchwork/management/commands/parserelations.py
>> >> >> >> 
>> >> >> >> -- 
>> >> >> >> 2.23.0.385.gbc12974a89
>> >> >> >> 
>> >> >> >
>> >> >> > Just wanted to follow up on this. Does this look good?
>> >> >> 
>> >> >> I was thinking about this as I washed the dishes last night.
>> >> >> 
>> >> >> Purging all relations in the database means that if any API-based user
>> >> >> of relations emerges before the big public servers adopt a release with
>> >> >> this, then they'll never be able to use it.
>> >> >> 
>> >> >> What's the speed impact of doing two passes through the data, with pass
>> >> >> 1 collecting all the projects that this touches and pass 2 actually
>> >> >> installing the relations?
>> >> >> 
>> >> >> Regards,
>> >> >> Daniel
>> >> > I wanted to reopen the discussion on this with the aim being to get
>> >> > everyone on the same page, and either go with this approach or decide on
>> >> > an alternate approach.
>> >> >
>> >> > The primary goal is to have a "parserelations" command that can parse a
>> >> > "patch-groups" file that contains a relation on each line. There are 2
>> >> > approaches:
>> >> >
>> >> > a. Refresh the relations table. The parserelations command will remove
>> >> > the existing relations before inserting the newly parsed relations. This
>> >> > avoids any conflicts between existing and new relations. However other
>> >> > users of the API might be affected.
>> >> >
>> >> > b. Insert relations in a non destructive way. We need a way to handle
>> >> > conflicts between existing and incoming relations. So the solution is
>> >> > not really clear in this case. What should have precedence in the case
>> >> > of conflicts? The parserelations script or the existing relations.
>> >> >
>> >> 
>> >> 
>> >> Right, sorry for dropping this, work got a bit crazy.
>> >> 
>> >> It occurs to me that I have an embedded assumption that may not be true
>> >> here. Are you trying to build a set of relations that includes
>> >> cross-project relations, or are you just trying to build a set of
>> >> relations entirely or mostly within a single patchwork project?
>> >> 
>> >> I've been assuming that your relations are entirely or predominantly
>> >> intra-project, but I'm starting to suspect that maybe my understanding
>> >> is not right.
>> > The relations can be inter project too. PaStA is configured with a list
>> > of Patchwork projects for which it will pull all the corresponding
>> > patches and perform it's analyses. PaStA can relate patches across
>> > projects too.
>> >
>> >> If you are expecting a non-trivial quantity of cross-project relations,
>> >> then my suggested approach earlier (caller of the script specifies the
>> >> projects that patches may belong to) doesn't make sense. It probably
>> >> makes most sense to go with approach (a) in that case - it's really
>> >> difficult to figure out how to do (b) in a way that preserves the
>> >> meaning that any other API user, or the PaStA tool, is trying to create.
>> >> 
>> >> If we do go down route (a), I want to make it really hard for users to
>> >> accidentally shoot themselves in the foot here and wipe out data. So
>> >> probably I would say that parserelations should only insert relations if
>> >> the table is empty when the script begins, and there should be a
>> >> separate command that with a suitably scary name that purges all
>> >> existing relations.
>> >
>> > I agree that we should make parserelations actions very explicit to
>> > prevent any loss of data. However, if we use parserelations to
>> > differentially import relations we will need to be able to insert
>> > relations with a non empty table. Let me explain the complete flow
>> > between PaStA and Patchwork so that things are a bit clear.
>> > We plan to run a script, maybe as a cron job at regular time intervals
>> > which does the following:
>> > 1. Exports patches from Patchwork into PaStA
>> > 2. Runs PaStA's analyses
>> > 3. Runs the parserelations script to parse the patch-groups file
>> > generated by PaStA and inserts the relations into Patchwork. In this
>> > step we refresh the db and insert all the relations afresh. A non
>> > destructive approach would involve thinking of a way to resolve
>> > conflicts between existing and new relations.
>> 
>> OK.
>> 
>> I am happy for you to stick with a destructive approach for the
>> managment command, so long as it's clear to the user that data is
>> potentially about to be destroyed. I'm happy for that to be:
>> 
>>  - two commands: a command that purges relations, and then a parse
>>    command that checks if the relations table is empty and refuses to
>>    run if it is not empty.
>> 
>>  - a single command that requires you to pass a command line option like
>>    '--really-delete-old-relations'
>> 
>>  - a single command with a name that captures the idea that data will be
>>    destroyed. I don't know what that would be, maybe 'replace...'.
>> 
>> I'm not fussy which option you go with, I just don't want a command
>> named 'parse...' to delete data without the user being properly warned.
> Yeah, I agree. The name "parserelations" doesnt make it explicit enough.
> I am fine with replacerelations.
>

Ok cool! If you send a patch with this name and with tests, then -
assuming I don't have any code-quality nits - I'd be happy to take it.

Regards,
Daneil

>> (You will at some point need to consider how to do incremental updates
>> over the API for a publicly usable version, but that can definitely wait
>> and might be out of scope for your GSoC project.)
> Agreed.
>
>> Regards,
>> Daniel
>> 
>> >
>> >> 
>> >> Have I correctly understood things here?
>> >> 
>> >> Regards,
>> >> Daniel
>> >> 
>> > Thanks,
>> > Rohit
> Thanks
> Rohit