[01/11] Improve patch listing performance (~3x)

Message ID 20180810080106.10714-2-stewart@linux.ibm.com
State New
Headers show
Series
  • Performance for ALL THE THINGS!
Related show

Commit Message

Stewart Smith Aug. 10, 2018, 8 a.m.
There's two main bits that are really expensive when composing the list
of patches for a project: the query getting the list, and the query
finding the series for each patch.

If we look at the query getting the list, it gets a lot of unnesseccary
fields such as 'headers' and 'content', even though we tell Django not
to. It turns out that Django seems to ignore the Submission relationship
and I have no idea how to force it to ignore that thing (defer doesn't
work) but if we go only, then it works okay.

From my import of ~8000 messages for a few projects, my laptop query
time (MySQL, as setup by whatever the docker-compose things do) goes
from:

http://localhost:8000/project/linux-kernel/list/
FROM:
342ms SQL queries cold cache, 268ms warm cache
TO:
118ms SQL queries cold cache, 88ms warm cache

Which is... non trivial to say the least.

The big jump is the patches.only change, and the removal of ordering
on the patchseries takes a further 10ms off. For some strange reason, it
seems rather hard to tell Django that you don't care what order the
results come back in for that query (if we do, then the db server has to
do a sort rather than just return each row)

Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
---
 patchwork/models.py         | 1 -
 patchwork/views/__init__.py | 2 ++
 2 files changed, 2 insertions(+), 1 deletion(-)

Comments

Daniel Axtens Aug. 10, 2018, 6:28 p.m. | #1
Stewart Smith <stewart@linux.ibm.com> writes:

> There's two main bits that are really expensive when composing the list
> of patches for a project: the query getting the list, and the query
> finding the series for each patch.
>
> If we look at the query getting the list, it gets a lot of unnesseccary
> fields such as 'headers' and 'content', even though we tell Django not
> to. It turns out that Django seems to ignore the Submission relationship
> and I have no idea how to force it to ignore that thing (defer doesn't
> work) but if we go only, then it works okay.
>
> From my import of ~8000 messages for a few projects, my laptop query
> time (MySQL, as setup by whatever the docker-compose things do) goes
> from:
>
> http://localhost:8000/project/linux-kernel/list/
> FROM:
> 342ms SQL queries cold cache, 268ms warm cache
> TO:
> 118ms SQL queries cold cache, 88ms warm cache
>
> Which is... non trivial to say the least.
>
> The big jump is the patches.only change, and the removal of ordering
> on the patchseries takes a further 10ms off. For some strange reason, it
> seems rather hard to tell Django that you don't care what order the
> results come back in for that query (if we do, then the db server has to
> do a sort rather than just return each row)

Thanks Stewart! It's great to get some real DB experience - feel free to
hang around! :)

So, further to our conversation with Konstantin, I tested this against
Django 2.0. It still saves us some time - it means we no longer load the
following fields:

`patchwork_submission`.`id`, `patchwork_submission`.`msgid`,
`patchwork_patch`.`commit_ref`, `patchwork_patch`.`pull_url`,
`patchwork_patch`.`archived`, `patchwork_patch`.`hash`,
`patchwork_patch`.`patch_project_id`, 

This obviously saves the db some work and communication overhead.

I'm a little nervous that this will slightly complicate some of the
further denormalisation but I think that's probably a price worth paying
unless Stephen objects.

I do still want to test the 'ordering' change a bit more before
committing it though.

Regards,
Daniel
>
> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
> ---
>  patchwork/models.py         | 1 -
>  patchwork/views/__init__.py | 2 ++
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/patchwork/models.py b/patchwork/models.py
> index 6268f5b72e3c..d2b88fc48c91 100644
> --- a/patchwork/models.py
> +++ b/patchwork/models.py
> @@ -747,7 +747,6 @@ class Series(FilenameMixin, models.Model):
>          return self.name if self.name else 'Untitled series #%d' % self.id
>  
>      class Meta:
> -        ordering = ('date',)
>          verbose_name_plural = 'Series'
>  
>  
> diff --git a/patchwork/views/__init__.py b/patchwork/views/__init__.py
> index f8d23a388ac7..96fd0798af5a 100644
> --- a/patchwork/views/__init__.py
> +++ b/patchwork/views/__init__.py
> @@ -287,6 +287,8 @@ def generic_list(request, project, view, view_args=None, filter_settings=None,
>      # rendering the list template
>      patches = patches.select_related('state', 'submitter', 'delegate')
>  
> +    patches = patches.only('state','submitter','delegate','project','name','date')
> +
>      # we also need checks and series
>      patches = patches.prefetch_related('check_set', 'series')
>  
> -- 
> 2.17.1
>
> _______________________________________________
> Patchwork mailing list
> Patchwork@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/patchwork
Stewart Smith Aug. 13, 2018, 2:06 a.m. | #2
Daniel Axtens <dja@axtens.net> writes:
> Stewart Smith <stewart@linux.ibm.com> writes:
>
>> There's two main bits that are really expensive when composing the list
>> of patches for a project: the query getting the list, and the query
>> finding the series for each patch.
>>
>> If we look at the query getting the list, it gets a lot of unnesseccary
>> fields such as 'headers' and 'content', even though we tell Django not
>> to. It turns out that Django seems to ignore the Submission relationship
>> and I have no idea how to force it to ignore that thing (defer doesn't
>> work) but if we go only, then it works okay.
>>
>> From my import of ~8000 messages for a few projects, my laptop query
>> time (MySQL, as setup by whatever the docker-compose things do) goes
>> from:
>>
>> http://localhost:8000/project/linux-kernel/list/
>> FROM:
>> 342ms SQL queries cold cache, 268ms warm cache
>> TO:
>> 118ms SQL queries cold cache, 88ms warm cache
>>
>> Which is... non trivial to say the least.
>>
>> The big jump is the patches.only change, and the removal of ordering
>> on the patchseries takes a further 10ms off. For some strange reason, it
>> seems rather hard to tell Django that you don't care what order the
>> results come back in for that query (if we do, then the db server has to
>> do a sort rather than just return each row)
>
> Thanks Stewart! It's great to get some real DB experience - feel free to
> hang around! :)

I'll try to :)

Or at least pop up to make things faster/nicer to the database.

> So, further to our conversation with Konstantin, I tested this against
> Django 2.0. It still saves us some time - it means we no longer load the
> following fields:
>
> `patchwork_submission`.`id`, `patchwork_submission`.`msgid`,
> `patchwork_patch`.`commit_ref`, `patchwork_patch`.`pull_url`,
> `patchwork_patch`.`archived`, `patchwork_patch`.`hash`,
> `patchwork_patch`.`patch_project_id`, 
>
> This obviously saves the db some work and communication overhead.

I found that 'headers' and 'content' from the patch were also being
fetched, and it saved fetching those (which are really quite expensive
to do, as blobs are typically mostly stored on a separate page than the
row itself).

It's the blobs that were the big expense, as they'd likely generate 2
disk seeks and reads per row.

> I'm a little nervous that this will slightly complicate some of the
> further denormalisation but I think that's probably a price worth paying
> unless Stephen objects.

I don't think it should really be too much of a problem, as any
denormalisation occurs, folk are going to have to look at the queries
being produced and executed anyway otherwise there's always a chance for
regressions.

> I do still want to test the 'ordering' change a bit more before
> committing it though.

Yeah, that struck me as a weird one, but it certainly did have an effect
on some things.

Patch

diff --git a/patchwork/models.py b/patchwork/models.py
index 6268f5b72e3c..d2b88fc48c91 100644
--- a/patchwork/models.py
+++ b/patchwork/models.py
@@ -747,7 +747,6 @@  class Series(FilenameMixin, models.Model):
         return self.name if self.name else 'Untitled series #%d' % self.id
 
     class Meta:
-        ordering = ('date',)
         verbose_name_plural = 'Series'
 
 
diff --git a/patchwork/views/__init__.py b/patchwork/views/__init__.py
index f8d23a388ac7..96fd0798af5a 100644
--- a/patchwork/views/__init__.py
+++ b/patchwork/views/__init__.py
@@ -287,6 +287,8 @@  def generic_list(request, project, view, view_args=None, filter_settings=None,
     # rendering the list template
     patches = patches.select_related('state', 'submitter', 'delegate')
 
+    patches = patches.only('state','submitter','delegate','project','name','date')
+
     # we also need checks and series
     patches = patches.prefetch_related('check_set', 'series')