diff mbox

[RFC,V3] net: don't wait for order-3 page allocation

Message ID 0099265406c32b9b9057de100404a4148d602cdd.1434066549.git.shli@fb.com
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Shaohua Li June 11, 2015, 11:50 p.m. UTC
We saw excessive direct memory compaction triggered by skb_page_frag_refill.
This causes performance issues and add latency. Commit 5640f7685831e0
introduces the order-3 allocation. According to the changelog, the order-3
allocation isn't a must-have but to improve performance. But direct memory
compaction has high overhead. The benefit of order-3 allocation can't
compensate the overhead of direct memory compaction.

This patch makes the order-3 page allocation atomic. If there is no memory
pressure and memory isn't fragmented, the alloction will still success, so we
don't sacrifice the order-3 benefit here. If the atomic allocation fails,
direct memory compaction will not be triggered, skb_page_frag_refill will
fallback to order-0 immediately, hence the direct memory compaction overhead is
avoided. In the allocation failure case, kswapd is waken up and doing
compaction, so chances are allocation could success next time.

alloc_skb_with_frags is the same.

The mellanox driver does similar thing, if this is accepted, we must fix
the driver too.

V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
V2: make the changelog clearer

Cc: Eric Dumazet <edumazet@google.com>
Cc: Chris Mason <clm@fb.com>
Cc: Debabrata Banerjee <dbavatar@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 net/core/skbuff.c | 2 +-
 net/core/sock.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Comments

Eric Dumazet June 12, 2015, 12:02 a.m. UTC | #1
On Thu, 2015-06-11 at 16:50 -0700, Shaohua Li wrote:
> We saw excessive direct memory compaction triggered by skb_page_frag_refill.
> This causes performance issues and add latency. Commit 5640f7685831e0
> introduces the order-3 allocation. According to the changelog, the order-3
> allocation isn't a must-have but to improve performance. But direct memory
> compaction has high overhead. The benefit of order-3 allocation can't
> compensate the overhead of direct memory compaction.

> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Chris Mason <clm@fb.com>
> Cc: Debabrata Banerjee <dbavatar@gmail.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---

Acked-by: Eric Dumazet <edumazet@google.com>

Thanks.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller June 12, 2015, 12:34 a.m. UTC | #2
From: Shaohua Li <shli@fb.com>
Date: Thu, 11 Jun 2015 16:50:48 -0700

> We saw excessive direct memory compaction triggered by skb_page_frag_refill.
> This causes performance issues and add latency. Commit 5640f7685831e0
> introduces the order-3 allocation. According to the changelog, the order-3
> allocation isn't a must-have but to improve performance. But direct memory
> compaction has high overhead. The benefit of order-3 allocation can't
> compensate the overhead of direct memory compaction.
> 
> This patch makes the order-3 page allocation atomic. If there is no memory
> pressure and memory isn't fragmented, the alloction will still success, so we
> don't sacrifice the order-3 benefit here. If the atomic allocation fails,
> direct memory compaction will not be triggered, skb_page_frag_refill will
> fallback to order-0 immediately, hence the direct memory compaction overhead is
> avoided. In the allocation failure case, kswapd is waken up and doing
> compaction, so chances are allocation could success next time.
> 
> alloc_skb_with_frags is the same.
> 
> The mellanox driver does similar thing, if this is accepted, we must fix
> the driver too.
> 
> V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
> V2: make the changelog clearer
> 
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Chris Mason <clm@fb.com>
> Cc: Debabrata Banerjee <dbavatar@gmail.com>
> Signed-off-by: Shaohua Li <shli@fb.com>

Applied and queued up for -stable, thanks!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Vlastimil Babka June 12, 2015, 9:36 a.m. UTC | #3
On 06/12/2015 01:50 AM, Shaohua Li wrote:
> We saw excessive direct memory compaction triggered by skb_page_frag_refill.
> This causes performance issues and add latency. Commit 5640f7685831e0
> introduces the order-3 allocation. According to the changelog, the order-3
> allocation isn't a must-have but to improve performance. But direct memory
> compaction has high overhead. The benefit of order-3 allocation can't
> compensate the overhead of direct memory compaction.
>
> This patch makes the order-3 page allocation atomic. If there is no memory
> pressure and memory isn't fragmented, the alloction will still success, so we
> don't sacrifice the order-3 benefit here. If the atomic allocation fails,
> direct memory compaction will not be triggered, skb_page_frag_refill will
> fallback to order-0 immediately, hence the direct memory compaction overhead is
> avoided. In the allocation failure case, kswapd is waken up and doing
> compaction, so chances are allocation could success next time.
>
> alloc_skb_with_frags is the same.
>
> The mellanox driver does similar thing, if this is accepted, we must fix
> the driver too.
>
> V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric
> V2: make the changelog clearer
>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Chris Mason <clm@fb.com>
> Cc: Debabrata Banerjee <dbavatar@gmail.com>
> Signed-off-by: Shaohua Li <shli@fb.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   net/core/skbuff.c | 2 +-
>   net/core/sock.c   | 2 +-
>   2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 3cfff2a..41ec022 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4398,7 +4398,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
>
>   		while (order) {
>   			if (npages >= 1 << order) {
> -				page = alloc_pages(gfp_mask |
> +				page = alloc_pages((gfp_mask & ~__GFP_WAIT) |
>   						   __GFP_COMP |
>   						   __GFP_NOWARN |
>   						   __GFP_NORETRY,

Note that __GFP_NORETRY is weaker than ~__GFP_WAIT and thus redundant. 
But it won't hurt anything leaving it there. And you might consider 
__GFP_NO_KSWAPD instead, as I said in the other thread.

> diff --git a/net/core/sock.c b/net/core/sock.c
> index 292f422..e9855a4 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1883,7 +1883,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
>
>   	pfrag->offset = 0;
>   	if (SKB_FRAG_PAGE_ORDER) {
> -		pfrag->page = alloc_pages(gfp | __GFP_COMP |
> +		pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
>   					  __GFP_NOWARN | __GFP_NORETRY,
>   					  SKB_FRAG_PAGE_ORDER);
>   		if (likely(pfrag->page)) {
>

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Rientjes June 17, 2015, 11:02 p.m. UTC | #4
On Fri, 12 Jun 2015, Vlastimil Babka wrote:

> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 3cfff2a..41ec022 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -4398,7 +4398,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long
> > header_len,
> > 
> >   		while (order) {
> >   			if (npages >= 1 << order) {
> > -				page = alloc_pages(gfp_mask |
> > +				page = alloc_pages((gfp_mask & ~__GFP_WAIT) |
> >   						   __GFP_COMP |
> >   						   __GFP_NOWARN |
> >   						   __GFP_NORETRY,
> 
> Note that __GFP_NORETRY is weaker than ~__GFP_WAIT and thus redundant. But it
> won't hurt anything leaving it there. And you might consider __GFP_NO_KSWAPD
> instead, as I said in the other thread.
> 

Yeah, I agreed with __GFP_NO_KSWAPD to avoid utilizing memory reserves for 
this.

> > diff --git a/net/core/sock.c b/net/core/sock.c
> > index 292f422..e9855a4 100644
> > --- a/net/core/sock.c
> > +++ b/net/core/sock.c
> > @@ -1883,7 +1883,7 @@ bool skb_page_frag_refill(unsigned int sz, struct
> > page_frag *pfrag, gfp_t gfp)
> > 
> >   	pfrag->offset = 0;
> >   	if (SKB_FRAG_PAGE_ORDER) {
> > -		pfrag->page = alloc_pages(gfp | __GFP_COMP |
> > +		pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
> >   					  __GFP_NOWARN | __GFP_NORETRY,
> >   					  SKB_FRAG_PAGE_ORDER);
> >   		if (likely(pfrag->page)) {
> > 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Hocko June 18, 2015, 2:30 p.m. UTC | #5
On Wed 17-06-15 16:02:59, David Rientjes wrote:
> On Fri, 12 Jun 2015, Vlastimil Babka wrote:
> 
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index 3cfff2a..41ec022 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -4398,7 +4398,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long
> > > header_len,
> > > 
> > >   		while (order) {
> > >   			if (npages >= 1 << order) {
> > > -				page = alloc_pages(gfp_mask |
> > > +				page = alloc_pages((gfp_mask & ~__GFP_WAIT) |
> > >   						   __GFP_COMP |
> > >   						   __GFP_NOWARN |
> > >   						   __GFP_NORETRY,
> > 
> > Note that __GFP_NORETRY is weaker than ~__GFP_WAIT and thus redundant. But it
> > won't hurt anything leaving it there. And you might consider __GFP_NO_KSWAPD
> > instead, as I said in the other thread.
> > 
> 
> Yeah, I agreed with __GFP_NO_KSWAPD to avoid utilizing memory reserves for 
> this.

Abusing __GFP_NO_KSWAPD is a wrong way to go IMHO. It is true that the
_current_ implementation of the allocator has this nasty and very subtle
side effect but that doesn't mean it should be abused outside of the mm
proper. Why shouldn't this path wake the kswapd and let it compact
memory on the background to increase the success rate for the later
high order allocations?
Eric Dumazet June 18, 2015, 2:35 p.m. UTC | #6
On Thu, Jun 18, 2015 at 7:30 AM, Michal Hocko <mhocko@suse.cz> wrote:

> Abusing __GFP_NO_KSWAPD is a wrong way to go IMHO. It is true that the
> _current_ implementation of the allocator has this nasty and very subtle
> side effect but that doesn't mean it should be abused outside of the mm
> proper. Why shouldn't this path wake the kswapd and let it compact
> memory on the background to increase the success rate for the later
> high order allocations?

I kind of agree.

If kswapd is a problem (is it ???) we should fix it, instead of adding
yet another flag to some random locations attempting
memory allocations.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Hocko June 18, 2015, 2:43 p.m. UTC | #7
On Thu 18-06-15 07:35:53, Eric Dumazet wrote:
> On Thu, Jun 18, 2015 at 7:30 AM, Michal Hocko <mhocko@suse.cz> wrote:
> 
> > Abusing __GFP_NO_KSWAPD is a wrong way to go IMHO. It is true that the
> > _current_ implementation of the allocator has this nasty and very subtle
> > side effect but that doesn't mean it should be abused outside of the mm
> > proper. Why shouldn't this path wake the kswapd and let it compact
> > memory on the background to increase the success rate for the later
> > high order allocations?
> 
> I kind of agree.
> 
> If kswapd is a problem (is it ???) we should fix it, instead of adding
> yet another flag to some random locations attempting
> memory allocations.

No, kswapd is not a problem. The problem is ~__GFP_WAIT allocation can
access some portion of the memory reserves (see gfp_to_alloc_flags resp.
__zone_watermark_ok and ALLOC_HARDER). __GFP_NO_KSWAPD is just a dirty
hack to not give that access which was introduced for THP AFAIR.

The implicit access to memory reserves for non sleeping allocation has
been there for ages and it might be not suitable for this particular
path but that doesn't mean another gfp flag with a different side effect
should be hijacked. We should either stop doing that implicit access to
memory reserves and give __GFP_RESERVE or add the __GFP_NORESERVE. But
that is a problem to be solved in the mm proper. Spreading subtle
dependencies outside of mm will just make situation worse.
Vlastimil Babka June 18, 2015, 3:22 p.m. UTC | #8
On 06/18/2015 04:43 PM, Michal Hocko wrote:
> On Thu 18-06-15 07:35:53, Eric Dumazet wrote:
>> On Thu, Jun 18, 2015 at 7:30 AM, Michal Hocko <mhocko@suse.cz> wrote:
>>
>>> Abusing __GFP_NO_KSWAPD is a wrong way to go IMHO. It is true that the
>>> _current_ implementation of the allocator has this nasty and very subtle
>>> side effect but that doesn't mean it should be abused outside of the mm
>>> proper. Why shouldn't this path wake the kswapd and let it compact
>>> memory on the background to increase the success rate for the later
>>> high order allocations?
>>
>> I kind of agree.
>>
>> If kswapd is a problem (is it ???) we should fix it, instead of adding
>> yet another flag to some random locations attempting
>> memory allocations.
>
> No, kswapd is not a problem. The problem is ~__GFP_WAIT allocation can
> access some portion of the memory reserves (see gfp_to_alloc_flags resp.
> __zone_watermark_ok and ALLOC_HARDER). __GFP_NO_KSWAPD is just a dirty
> hack to not give that access which was introduced for THP AFAIR.
>
> The implicit access to memory reserves for non sleeping allocation has
> been there for ages and it might be not suitable for this particular
> path but that doesn't mean another gfp flag with a different side effect
> should be hijacked. We should either stop doing that implicit access to
> memory reserves and give __GFP_RESERVE or add the __GFP_NORESERVE. But
> that is a problem to be solved in the mm proper. Spreading subtle
> dependencies outside of mm will just make situation worse.

So you are not proposing to use these __GFP_RESERVE/NORESERVE flag 
outside of mm, right? (besides, we distinguish several kinds of 
reserves, so what exactly would the flag do?) As that would be also 
subtle dependency. The general problem I think is that we should want 
the mm users to specify higher-level intentions (such as GFP_KERNEL) 
which would map to specific directions (__GFP_*) for the allocator, and 
currently it's rather a mess of both kinds of flags. Clearly the 
intention here is "opportunistic allocation that should not 
reclaim/compact, use reserves, wake up kswapd (?) because it's better to 
fall back to smaller pages than wait") and we don't seem to have a 
GFP_OPPORTUNISTIC flag for that. The allocation has to then mask out 
__GFP_WAIT which however looks like an atomic allocation to the 
allocator and give access to reserves, etc...
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michal Hocko June 18, 2015, 3:47 p.m. UTC | #9
On Thu 18-06-15 17:22:40, Vlastimil Babka wrote:
> On 06/18/2015 04:43 PM, Michal Hocko wrote:
> >On Thu 18-06-15 07:35:53, Eric Dumazet wrote:
> >>On Thu, Jun 18, 2015 at 7:30 AM, Michal Hocko <mhocko@suse.cz> wrote:
> >>
> >>>Abusing __GFP_NO_KSWAPD is a wrong way to go IMHO. It is true that the
> >>>_current_ implementation of the allocator has this nasty and very subtle
> >>>side effect but that doesn't mean it should be abused outside of the mm
> >>>proper. Why shouldn't this path wake the kswapd and let it compact
> >>>memory on the background to increase the success rate for the later
> >>>high order allocations?
> >>
> >>I kind of agree.
> >>
> >>If kswapd is a problem (is it ???) we should fix it, instead of adding
> >>yet another flag to some random locations attempting
> >>memory allocations.
> >
> >No, kswapd is not a problem. The problem is ~__GFP_WAIT allocation can
> >access some portion of the memory reserves (see gfp_to_alloc_flags resp.
> >__zone_watermark_ok and ALLOC_HARDER). __GFP_NO_KSWAPD is just a dirty
> >hack to not give that access which was introduced for THP AFAIR.
> >
> >The implicit access to memory reserves for non sleeping allocation has
> >been there for ages and it might be not suitable for this particular
> >path but that doesn't mean another gfp flag with a different side effect
> >should be hijacked. We should either stop doing that implicit access to
> >memory reserves and give __GFP_RESERVE or add the __GFP_NORESERVE. But
> >that is a problem to be solved in the mm proper. Spreading subtle
> >dependencies outside of mm will just make situation worse.
> 
> So you are not proposing to use these __GFP_RESERVE/NORESERVE flag outside
> of mm, right? (besides, we distinguish several kinds of reserves, so what
> exactly would the flag do?)

That is to be discussed. Most allocations already express their interest
in memory reserves by __GFP_HIGH directly or by GFP_ATOMIC indirectly.
So maybe we do not need any additional flag here. There are not that
many ~__GFP_WAIT and most of them seem to require it _only_ because the
context doesn't allow for sleeping (e.g. to prevent from deadlocks).

> As that would be also subtle dependency. The
> general problem I think is that we should want the mm users to specify
> higher-level intentions (such as GFP_KERNEL) which would map to specific
> directions (__GFP_*) for the allocator, and currently it's rather a mess of
> both kinds of flags.

I agree. So I think that maybe we should drop that implicit access to
memory reserves for ~__GFP_WAIT allocations and let it do what it is
documented to do.

> Clearly the intention here is "opportunistic allocation that should
> not reclaim/compact, use reserves, wake up kswapd (?) because it's
> better to fall back to smaller pages than wait") and we don't seem to
> have a GFP_OPPORTUNISTIC flag for that. The allocation has to then
> mask out __GFP_WAIT which however looks like an atomic allocation to
> the allocator and give access to reserves, etc...

I think simply dropping GFP_WAIT is a good way to express that. The
fact that the current implementation gives access to memory reserves
implicitly is just a detail and the user of the allocator shouldn't care
about that.
David Rientjes June 30, 2015, 11:49 p.m. UTC | #10
On Thu, 18 Jun 2015, Michal Hocko wrote:

> That is to be discussed. Most allocations already express their interest
> in memory reserves by __GFP_HIGH directly or by GFP_ATOMIC indirectly.
> So maybe we do not need any additional flag here. There are not that
> many ~__GFP_WAIT and most of them seem to require it _only_ because the
> context doesn't allow for sleeping (e.g. to prevent from deadlocks).
> 

We're talking about a patch that is being backported to stable.  
Regardless of what improvements can be made to specify that an allocation 
shouldn't be able to access reserves (and what belongs solely in the page 
allocator proper) independent of __GFP_NO_KSWAPD, that can be cleaned up 
at a later time.  I don't anticipate that cleanup to be backported to 
stable, and my primary concern here is the ability for this allocations to 
now access, and possibly deplete, memory reserves.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 3cfff2a..41ec022 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4398,7 +4398,7 @@  struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 
 		while (order) {
 			if (npages >= 1 << order) {
-				page = alloc_pages(gfp_mask |
+				page = alloc_pages((gfp_mask & ~__GFP_WAIT) |
 						   __GFP_COMP |
 						   __GFP_NOWARN |
 						   __GFP_NORETRY,
diff --git a/net/core/sock.c b/net/core/sock.c
index 292f422..e9855a4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1883,7 +1883,7 @@  bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
 
 	pfrag->offset = 0;
 	if (SKB_FRAG_PAGE_ORDER) {
-		pfrag->page = alloc_pages(gfp | __GFP_COMP |
+		pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
 					  __GFP_NOWARN | __GFP_NORETRY,
 					  SKB_FRAG_PAGE_ORDER);
 		if (likely(pfrag->page)) {