[RFC] REST: Add new setting for maximum API page size

Message ID 20180724051051.10868-1-andrew.donnellan@au1.ibm.com
State Accepted
Headers show
Series
  • [RFC] REST: Add new setting for maximum API page size
Related show

Commit Message

Andrew Donnellan July 24, 2018, 5:10 a.m.
In 41790caf59ad ("REST: Limit max page size") we limited the maximum page
size to the default page size in the settings.

This turns out to be rather restrictive, as we usually want to keep the
default page size low, but an administrator may want to allow API clients
to fetch more than that per request.

Add a new setting, MAX_REST_RESULTS_PER_PAGE, to set the maximum page size.

Closes: #202 ("Separate max API page size and default API page size into different settings")
Suggested-by: Stewart Smith <stewart@linux.ibm.com>
Suggested-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

---

This is completely untested but should work, sending this as an RFC because
I have no idea what the default should be, thoughts?
---
 docs/deployment/configuration.rst | 8 +++++++-
 patchwork/api/base.py             | 3 ++-
 patchwork/settings/base.py        | 1 +
 3 files changed, 10 insertions(+), 2 deletions(-)

Comments

Andrew Donnellan July 24, 2018, 5:18 a.m. | #1
On 24/07/18 15:10, Andrew Donnellan wrote:
> diff --git a/docs/deployment/configuration.rst b/docs/deployment/configuration.rst
> index 347485636d47..e599522a412b 100644
> --- a/docs/deployment/configuration.rst
> +++ b/docs/deployment/configuration.rst
> @@ -88,7 +88,13 @@ Enable the :doc:`REST API <../api/rest>`.
>   The number of items to include in REST API responses by default. This can be
>   overridden by the ``per_page`` parameter for some endpoints.
>   
> -.. versionadded:: 2.0

This obviously shouldn't have been there, will fix when I send non RFC

> +``MAX_REST_RESULTS_PER_PAGE``
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The maximum number of items that can be requested in a REST API request using
> +the ``per_page`` parameter.
> +
> +.. versionadded:: 2.2
Andrew Donnellan July 25, 2018, 7:01 a.m. | #2
On 24/07/18 15:10, Andrew Donnellan wrote:
> In 41790caf59ad ("REST: Limit max page size") we limited the maximum page
> size to the default page size in the settings.
> 
> This turns out to be rather restrictive, as we usually want to keep the
> default page size low, but an administrator may want to allow API clients
> to fetch more than that per request.
> 
> Add a new setting, MAX_REST_RESULTS_PER_PAGE, to set the maximum page size.
> 
> Closes: #202 ("Separate max API page size and default API page size into different settings")
> Suggested-by: Stewart Smith <stewart@linux.ibm.com>
> Suggested-by: Joel Stanley <joel@jms.id.au>
> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>

FWIW we now have this applied on patchwork.ozlabs.org and it appears to 
be working. Would like some more input as to what an appropriate default 
limit is.
Stewart Smith July 26, 2018, 4:37 a.m. | #3
Andrew Donnellan <andrew.donnellan@au1.ibm.com> writes:
> On 24/07/18 15:10, Andrew Donnellan wrote:
>> In 41790caf59ad ("REST: Limit max page size") we limited the maximum page
>> size to the default page size in the settings.
>> 
>> This turns out to be rather restrictive, as we usually want to keep the
>> default page size low, but an administrator may want to allow API clients
>> to fetch more than that per request.
>> 
>> Add a new setting, MAX_REST_RESULTS_PER_PAGE, to set the maximum page size.
>> 
>> Closes: #202 ("Separate max API page size and default API page size into different settings")
>> Suggested-by: Stewart Smith <stewart@linux.ibm.com>
>> Suggested-by: Joel Stanley <joel@jms.id.au>
>> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
>
> FWIW we now have this applied on patchwork.ozlabs.org and it appears to 
> be working. Would like some more input as to what an appropriate default 
> limit is.

From a *consumer* of the API PoV, 500 at a time rather than 30 is at
least a 6x speedup of my use of it, so that was extremely welcome. I
haven't looked at the size of the responses I'm getting back so have no
idea if 500 is a good one or not (I suspect I'd have to start optimizing
my code around 700-1000 responses/call).

My *guess* is that a fresh SQL query is run for each page retrieved, so
maybe 500 is "good enough" while there isn't some way to just stream
everything and not run the query multiple times.
Daniel Axtens July 26, 2018, 1:24 p.m. | #4
Andrew Donnellan <andrew.donnellan@au1.ibm.com> writes:

> On 24/07/18 15:10, Andrew Donnellan wrote:
>> In 41790caf59ad ("REST: Limit max page size") we limited the maximum page
>> size to the default page size in the settings.
>> 
>> This turns out to be rather restrictive, as we usually want to keep the
>> default page size low, but an administrator may want to allow API clients
>> to fetch more than that per request.
>> 
>> Add a new setting, MAX_REST_RESULTS_PER_PAGE, to set the maximum page size.
>> 
>> Closes: #202 ("Separate max API page size and default API page size into different settings")
>> Suggested-by: Stewart Smith <stewart@linux.ibm.com>
>> Suggested-by: Joel Stanley <joel@jms.id.au>
>> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
>
> FWIW we now have this applied on patchwork.ozlabs.org and it appears to 
> be working. Would like some more input as to what an appropriate default 
> limit is.

My completely fact-free feeling/opinion is that if it takes more than
~1sec to run on Ozlabs, it's probably too high. Apart from that, I'm not
fussed.

Regards,
Daniel

>
> -- 
> Andrew Donnellan              OzLabs, ADL Canberra
> andrew.donnellan@au1.ibm.com  IBM Australia Limited
>
> _______________________________________________
> Patchwork mailing list
> Patchwork@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/patchwork
Andrew Donnellan Aug. 7, 2018, 2:50 a.m. | #5
On 26/07/18 23:24, Daniel Axtens wrote:
> Andrew Donnellan <andrew.donnellan@au1.ibm.com> writes:
> 
>> On 24/07/18 15:10, Andrew Donnellan wrote:
>>> In 41790caf59ad ("REST: Limit max page size") we limited the maximum page
>>> size to the default page size in the settings.
>>>
>>> This turns out to be rather restrictive, as we usually want to keep the
>>> default page size low, but an administrator may want to allow API clients
>>> to fetch more than that per request.
>>>
>>> Add a new setting, MAX_REST_RESULTS_PER_PAGE, to set the maximum page size.
>>>
>>> Closes: #202 ("Separate max API page size and default API page size into different settings")
>>> Suggested-by: Stewart Smith <stewart@linux.ibm.com>
>>> Suggested-by: Joel Stanley <joel@jms.id.au>
>>> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
>>
>> FWIW we now have this applied on patchwork.ozlabs.org and it appears to
>> be working. Would like some more input as to what an appropriate default
>> limit is.
> 
> My completely fact-free feeling/opinion is that if it takes more than
> ~1sec to run on Ozlabs, it's probably too high. Apart from that, I'm not
> fussed.

OzLabs.org is not a fast machine. 1 second round trip time == 100 per 
page. 500 per page == ~2.3 seconds round trip.

A single 500 item load is still under half the time of doing 5 * 100 
item loads...
Stewart Smith Aug. 7, 2018, 6:26 a.m. | #6
Andrew Donnellan <andrew.donnellan@au1.ibm.com> writes:
> On 26/07/18 23:24, Daniel Axtens wrote:
>> Andrew Donnellan <andrew.donnellan@au1.ibm.com> writes:
>> 
>>> On 24/07/18 15:10, Andrew Donnellan wrote:
>>>> In 41790caf59ad ("REST: Limit max page size") we limited the maximum page
>>>> size to the default page size in the settings.
>>>>
>>>> This turns out to be rather restrictive, as we usually want to keep the
>>>> default page size low, but an administrator may want to allow API clients
>>>> to fetch more than that per request.
>>>>
>>>> Add a new setting, MAX_REST_RESULTS_PER_PAGE, to set the maximum page size.
>>>>
>>>> Closes: #202 ("Separate max API page size and default API page size into different settings")
>>>> Suggested-by: Stewart Smith <stewart@linux.ibm.com>
>>>> Suggested-by: Joel Stanley <joel@jms.id.au>
>>>> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
>>>
>>> FWIW we now have this applied on patchwork.ozlabs.org and it appears to
>>> be working. Would like some more input as to what an appropriate default
>>> limit is.
>> 
>> My completely fact-free feeling/opinion is that if it takes more than
>> ~1sec to run on Ozlabs, it's probably too high. Apart from that, I'm not
>> fussed.
>
> OzLabs.org is not a fast machine. 1 second round trip time == 100 per 
> page. 500 per page == ~2.3 seconds round trip.
>
> A single 500 item load is still under half the time of doing 5 * 100 
> item loads...

I bet MySQL would execute the query quicker :)

Anyway, that sounds really quite slow and we should look at why on earth
it's that slow - is there some kind of horrific hidden join or poor
indexing or query plan or something?
Daniel Axtens Aug. 7, 2018, 7:20 a.m. | #7
>>>> FWIW we now have this applied on patchwork.ozlabs.org and it appears to
>>>> be working. Would like some more input as to what an appropriate default
>>>> limit is.
>>> 
>>> My completely fact-free feeling/opinion is that if it takes more than
>>> ~1sec to run on Ozlabs, it's probably too high. Apart from that, I'm not
>>> fussed.
>>
>> OzLabs.org is not a fast machine. 1 second round trip time == 100 per 
>> page. 500 per page == ~2.3 seconds round trip.
>>
>> A single 500 item load is still under half the time of doing 5 * 100 
>> item loads...

I wonder if we could let authenticated users do slower queries. Maybe
something for later. Let's split the difference at 250 then, I'd be
happy to merge that.

> I bet MySQL would execute the query quicker :)

In general (and I have tested this) mysql 5.7 executes patchwork queries
slower than postgres 9.6 - this is on my laptop on an SSD with a couple
100ks of patches across 2 or 3 projects.

> Anyway, that sounds really quite slow and we should look at why on earth
> it's that slow - is there some kind of horrific hidden join or poor
> indexing or query plan or something?

Yes.

So the 1.0 -> 2.0 transition was not good for performance because it
turns out databases do not map well to object-oriented structures.* I
didn't think to check for performance at the time, so we're
progressively improving that by denormalising stuff. (In fact that was a
major driver of the 2.1 release!) There is still further to go.

Regards,
Daniel

* Yes, I probably should have realised this at the time. something
  something unpaid volunteers something something patch review something
  something.

>
> -- 
> Stewart Smith
> OPAL Architect, IBM.
Andrew Donnellan Aug. 7, 2018, 7:29 a.m. | #8
On 07/08/18 17:20, Daniel Axtens wrote:
> I wonder if we could let authenticated users do slower queries. Maybe
> something for later. Let's split the difference at 250 then, I'd be
> happy to merge that.

If you want to take this patch and s/500/250/g during merge then I'd be 
happy with that.
Stewart Smith Aug. 7, 2018, 9:09 a.m. | #9
Daniel Axtens <dja@axtens.net> writes:
>>>>> FWIW we now have this applied on patchwork.ozlabs.org and it appears to
>>>>> be working. Would like some more input as to what an appropriate default
>>>>> limit is.
>>>> 
>>>> My completely fact-free feeling/opinion is that if it takes more than
>>>> ~1sec to run on Ozlabs, it's probably too high. Apart from that, I'm not
>>>> fussed.
>>>
>>> OzLabs.org is not a fast machine. 1 second round trip time == 100 per 
>>> page. 500 per page == ~2.3 seconds round trip.
>>>
>>> A single 500 item load is still under half the time of doing 5 * 100 
>>> item loads...
>
> I wonder if we could let authenticated users do slower queries. Maybe
> something for later. Let's split the difference at 250 then, I'd be
> happy to merge that.
>
>> I bet MySQL would execute the query quicker :)
>
> In general (and I have tested this) mysql 5.7 executes patchwork queries
> slower than postgres 9.6 - this is on my laptop on an SSD with a couple
> 100ks of patches across 2 or 3 projects.

Huh, somewhat surprising. I wonder what the query plan and cache layout
looks like....

>> Anyway, that sounds really quite slow and we should look at why on earth
>> it's that slow - is there some kind of horrific hidden join or poor
>> indexing or query plan or something?
>
> Yes.
>
> So the 1.0 -> 2.0 transition was not good for performance because it
> turns out databases do not map well to object-oriented structures.* I
> didn't think to check for performance at the time, so we're
> progressively improving that by denormalising stuff. (In fact that was a
> major driver of the 2.1 release!) There is still further to go.

So, looking at the schema, if we were going for the state of a bunch of
patches in a project (which I imagine is  fairly common), the current
patch table is:

CREATE TABLE `patchwork_patch` (
  `submission_ptr_id` int(11) NOT NULL,
  `diff` longtext,
  `commit_ref` varchar(255) DEFAULT NULL,
  `pull_url` varchar(255) DEFAULT NULL,
  `delegate_id` int(11) DEFAULT NULL,
  `state_id` int(11) DEFAULT NULL,
  `archived` tinyint(1) NOT NULL,
  `hash` char(40) DEFAULT NULL,
  `patch_project_id` int(11) NOT NULL,
  PRIMARY KEY (`submission_ptr_id`),
  KEY `patchwork_patch_delegate_id_8f7ce805_fk_auth_user_id` (`delegate_id`),
  KEY `patchwork_patch_state_id_3d9fb500_fk_patchwork_state_id` (`state_id`),
  KEY `patchwork_patch_patch_project_id_aff1ad51_fk_patchwork` (`patch_project_id`),
  CONSTRAINT `patchwork_patch_delegate_id_8f7ce805_fk_auth_user_id` FOREIGN KEY (`delegate_id`) REFERENCES `auth_user` (`id`),
  CONSTRAINT `patchwork_patch_patch_project_id_aff1ad51_fk_patchwork` FOREIGN KEY (`patch_project_id`) REFERENCES `patchwork_project` (`id`),
  CONSTRAINT `patchwork_patch_state_id_3d9fb500_fk_patchwork_state_id` FOREIGN KEY (`state_id`) REFERENCES `patchwork_state` (`id`),
  CONSTRAINT `patchwork_patch_submission_ptr_id_03216801_fk_patchwork` FOREIGN KEY (`submission_ptr_id`) REFERENCES `patchwork_submission` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

which puts the submission_ptr_id as the key for the clustered index
(i.e. on disk layout will group by this ID).

What I suspect is occuring is that we end up having teh on disk layout
of patchwork_patch (and patchwork_submission) be ordered by time the
patch comes in across *all* projects of the patchwork instance, which is
typically (never?) what is actually queried.

So that when I do queries over, say 'skiboot', I'm pulling into cache
all the submissions/patches that occured to any project around that
time, rather than only things related to skiboot.

If instead, the primary key was ('patch_project_id','submission_ptr_id')
and there was a UNIQUE KEY ('submission_ptr_id') then this would mean
that we'd be pulling in a database page full of the patches for the
project we're searching on, which I think would be more likely to be
queried in the near future.

The patch_project_id index colud be dropped, as all queries on it would
be satisfied by the primary key.

The same goes for patchwork_submission:

CREATE TABLE `patchwork_submission` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `msgid` varchar(255) NOT NULL,
  `name` varchar(255) NOT NULL,
  `date` datetime(6) NOT NULL,
  `headers` longtext NOT NULL,
  `project_id` int(11) NOT NULL,
  `submitter_id` int(11) NOT NULL,
  `content` longtext,
  PRIMARY KEY (`id`),
  UNIQUE KEY `patchwork_patch_msgid_project_id_2e1fe709_uniq` (`msgid`,`project_id`),
  KEY `patchwork_patch_project_id_162a7a7f_fk_patchwork_project_id` (`project_id`),
  KEY `patchwork_patch_submitter_id_57c66142_fk_patchwork_person_id` (`submitter_id`),
  CONSTRAINT `patchwork_patch_project_id_162a7a7f_fk_patchwork_project_id` FOREIGN KEY (`project_id`) REFERENCES `patchwork_project` (`id`),
  CONSTRAINT `patchwork_patch_submitter_id_57c66142_fk_patchwork_person_id` FOREIGN KEY (`submitter_id`) REFERENCES `patchwork_person` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

simply getting a list of patches for a project here is going to be a
very expensive query, pulling in a lot of database pages containing data
for patches around that time rather than ones relevant to the query.

Again, I'd say go:
PRIMARY KEY ('project_id', 'id')
UNIQUE KEY ('id')

(the project_id index can then be dropped, as the primary key could be
used there).

Of course, things get better if you query everything looking up
project_id=X and id=Y rather than just id=Y.

For PostgreSQL it gets a bit interesting as I'm not as familiar with how
it does data layout. There *appears* to be a CLUSTER *command* that will
re-organise an existing table but *not* make any future inserts/updates
obey it (which is annoying).

This seems to be a bit of a theme throughout the schema too...

Actually... do we have a good test dataset I could load up locally and
experiment with?

> * Yes, I probably should have realised this at the time. something
>   something unpaid volunteers something something patch review something
>   something.

:)

pretty much neglecting some piece of software that ends up being vital
open source infrastructure is a grand tradition, welcome!
Daniel Axtens Aug. 26, 2018, 7:49 a.m. | #10
Andrew Donnellan <andrew.donnellan@au1.ibm.com> writes:

> On 07/08/18 17:20, Daniel Axtens wrote:
>> I wonder if we could let authenticated users do slower queries. Maybe
>> something for later. Let's split the difference at 250 then, I'd be
>> happy to merge that.
>
> If you want to take this patch and s/500/250/g during merge then I'd be 
> happy with that.

Applied based on my trusting that OzLabs testing has been sufficent.

Regards,
Daniel

>
> -- 
> Andrew Donnellan              OzLabs, ADL Canberra
> andrew.donnellan@au1.ibm.com  IBM Australia Limited

Patch

diff --git a/docs/deployment/configuration.rst b/docs/deployment/configuration.rst
index 347485636d47..e599522a412b 100644
--- a/docs/deployment/configuration.rst
+++ b/docs/deployment/configuration.rst
@@ -88,7 +88,13 @@  Enable the :doc:`REST API <../api/rest>`.
 The number of items to include in REST API responses by default. This can be
 overridden by the ``per_page`` parameter for some endpoints.
 
-.. versionadded:: 2.0
+``MAX_REST_RESULTS_PER_PAGE``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The maximum number of items that can be requested in a REST API request using
+the ``per_page`` parameter.
+
+.. versionadded:: 2.2
 
 ``COMPAT_REDIR``
 ~~~~~~~~~~~~~~~~
diff --git a/patchwork/api/base.py b/patchwork/api/base.py
index 8c38d5a1d5f4..bf452f78b390 100644
--- a/patchwork/api/base.py
+++ b/patchwork/api/base.py
@@ -36,7 +36,8 @@  class LinkHeaderPagination(PageNumberPagination):
        https://tools.ietf.org/html/rfc5988#section-5
        https://developer.github.com/guides/traversing-with-pagination
     """
-    page_size = max_page_size = settings.REST_RESULTS_PER_PAGE
+    page_size = settings.REST_RESULTS_PER_PAGE
+    max_page_size = settings.MAX_REST_RESULTS_PER_PAGE
     page_size_query_param = 'per_page'
 
     def get_paginated_response(self, data):
diff --git a/patchwork/settings/base.py b/patchwork/settings/base.py
index 4b0d5513895a..7dc20041ec42 100644
--- a/patchwork/settings/base.py
+++ b/patchwork/settings/base.py
@@ -220,6 +220,7 @@  ENABLE_XMLRPC = False
 ENABLE_REST_API = True
 
 REST_RESULTS_PER_PAGE = 30
+MAX_REST_RESULTS_PER_PAGE = 500
 
 # Set to True to enable redirections or URLs from previous versions
 # of patchwork