Message ID | 0099265406c32b9b9057de100404a4148d602cdd.1434066549.git.shli@fb.com |
---|---|
State | Accepted, archived |
Delegated to: | David Miller |
Headers | show |
On Thu, 2015-06-11 at 16:50 -0700, Shaohua Li wrote: > We saw excessive direct memory compaction triggered by skb_page_frag_refill. > This causes performance issues and add latency. Commit 5640f7685831e0 > introduces the order-3 allocation. According to the changelog, the order-3 > allocation isn't a must-have but to improve performance. But direct memory > compaction has high overhead. The benefit of order-3 allocation can't > compensate the overhead of direct memory compaction. > Cc: Eric Dumazet <edumazet@google.com> > Cc: Chris Mason <clm@fb.com> > Cc: Debabrata Banerjee <dbavatar@gmail.com> > Signed-off-by: Shaohua Li <shli@fb.com> > --- Acked-by: Eric Dumazet <edumazet@google.com> Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Shaohua Li <shli@fb.com> Date: Thu, 11 Jun 2015 16:50:48 -0700 > We saw excessive direct memory compaction triggered by skb_page_frag_refill. > This causes performance issues and add latency. Commit 5640f7685831e0 > introduces the order-3 allocation. According to the changelog, the order-3 > allocation isn't a must-have but to improve performance. But direct memory > compaction has high overhead. The benefit of order-3 allocation can't > compensate the overhead of direct memory compaction. > > This patch makes the order-3 page allocation atomic. If there is no memory > pressure and memory isn't fragmented, the alloction will still success, so we > don't sacrifice the order-3 benefit here. If the atomic allocation fails, > direct memory compaction will not be triggered, skb_page_frag_refill will > fallback to order-0 immediately, hence the direct memory compaction overhead is > avoided. In the allocation failure case, kswapd is waken up and doing > compaction, so chances are allocation could success next time. > > alloc_skb_with_frags is the same. > > The mellanox driver does similar thing, if this is accepted, we must fix > the driver too. > > V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric > V2: make the changelog clearer > > Cc: Eric Dumazet <edumazet@google.com> > Cc: Chris Mason <clm@fb.com> > Cc: Debabrata Banerjee <dbavatar@gmail.com> > Signed-off-by: Shaohua Li <shli@fb.com> Applied and queued up for -stable, thanks! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 06/12/2015 01:50 AM, Shaohua Li wrote: > We saw excessive direct memory compaction triggered by skb_page_frag_refill. > This causes performance issues and add latency. Commit 5640f7685831e0 > introduces the order-3 allocation. According to the changelog, the order-3 > allocation isn't a must-have but to improve performance. But direct memory > compaction has high overhead. The benefit of order-3 allocation can't > compensate the overhead of direct memory compaction. > > This patch makes the order-3 page allocation atomic. If there is no memory > pressure and memory isn't fragmented, the alloction will still success, so we > don't sacrifice the order-3 benefit here. If the atomic allocation fails, > direct memory compaction will not be triggered, skb_page_frag_refill will > fallback to order-0 immediately, hence the direct memory compaction overhead is > avoided. In the allocation failure case, kswapd is waken up and doing > compaction, so chances are allocation could success next time. > > alloc_skb_with_frags is the same. > > The mellanox driver does similar thing, if this is accepted, we must fix > the driver too. > > V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric > V2: make the changelog clearer > > Cc: Eric Dumazet <edumazet@google.com> > Cc: Chris Mason <clm@fb.com> > Cc: Debabrata Banerjee <dbavatar@gmail.com> > Signed-off-by: Shaohua Li <shli@fb.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> > --- > net/core/skbuff.c | 2 +- > net/core/sock.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 3cfff2a..41ec022 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -4398,7 +4398,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len, > > while (order) { > if (npages >= 1 << order) { > - page = alloc_pages(gfp_mask | > + page = alloc_pages((gfp_mask & ~__GFP_WAIT) | > __GFP_COMP | > __GFP_NOWARN | > __GFP_NORETRY, Note that __GFP_NORETRY is weaker than ~__GFP_WAIT and thus redundant. But it won't hurt anything leaving it there. And you might consider __GFP_NO_KSWAPD instead, as I said in the other thread. > diff --git a/net/core/sock.c b/net/core/sock.c > index 292f422..e9855a4 100644 > --- a/net/core/sock.c > +++ b/net/core/sock.c > @@ -1883,7 +1883,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp) > > pfrag->offset = 0; > if (SKB_FRAG_PAGE_ORDER) { > - pfrag->page = alloc_pages(gfp | __GFP_COMP | > + pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP | > __GFP_NOWARN | __GFP_NORETRY, > SKB_FRAG_PAGE_ORDER); > if (likely(pfrag->page)) { > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 12 Jun 2015, Vlastimil Babka wrote: > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index 3cfff2a..41ec022 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -4398,7 +4398,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long > > header_len, > > > > while (order) { > > if (npages >= 1 << order) { > > - page = alloc_pages(gfp_mask | > > + page = alloc_pages((gfp_mask & ~__GFP_WAIT) | > > __GFP_COMP | > > __GFP_NOWARN | > > __GFP_NORETRY, > > Note that __GFP_NORETRY is weaker than ~__GFP_WAIT and thus redundant. But it > won't hurt anything leaving it there. And you might consider __GFP_NO_KSWAPD > instead, as I said in the other thread. > Yeah, I agreed with __GFP_NO_KSWAPD to avoid utilizing memory reserves for this. > > diff --git a/net/core/sock.c b/net/core/sock.c > > index 292f422..e9855a4 100644 > > --- a/net/core/sock.c > > +++ b/net/core/sock.c > > @@ -1883,7 +1883,7 @@ bool skb_page_frag_refill(unsigned int sz, struct > > page_frag *pfrag, gfp_t gfp) > > > > pfrag->offset = 0; > > if (SKB_FRAG_PAGE_ORDER) { > > - pfrag->page = alloc_pages(gfp | __GFP_COMP | > > + pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP | > > __GFP_NOWARN | __GFP_NORETRY, > > SKB_FRAG_PAGE_ORDER); > > if (likely(pfrag->page)) { > > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed 17-06-15 16:02:59, David Rientjes wrote: > On Fri, 12 Jun 2015, Vlastimil Babka wrote: > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > > index 3cfff2a..41ec022 100644 > > > --- a/net/core/skbuff.c > > > +++ b/net/core/skbuff.c > > > @@ -4398,7 +4398,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long > > > header_len, > > > > > > while (order) { > > > if (npages >= 1 << order) { > > > - page = alloc_pages(gfp_mask | > > > + page = alloc_pages((gfp_mask & ~__GFP_WAIT) | > > > __GFP_COMP | > > > __GFP_NOWARN | > > > __GFP_NORETRY, > > > > Note that __GFP_NORETRY is weaker than ~__GFP_WAIT and thus redundant. But it > > won't hurt anything leaving it there. And you might consider __GFP_NO_KSWAPD > > instead, as I said in the other thread. > > > > Yeah, I agreed with __GFP_NO_KSWAPD to avoid utilizing memory reserves for > this. Abusing __GFP_NO_KSWAPD is a wrong way to go IMHO. It is true that the _current_ implementation of the allocator has this nasty and very subtle side effect but that doesn't mean it should be abused outside of the mm proper. Why shouldn't this path wake the kswapd and let it compact memory on the background to increase the success rate for the later high order allocations?
On Thu, Jun 18, 2015 at 7:30 AM, Michal Hocko <mhocko@suse.cz> wrote: > Abusing __GFP_NO_KSWAPD is a wrong way to go IMHO. It is true that the > _current_ implementation of the allocator has this nasty and very subtle > side effect but that doesn't mean it should be abused outside of the mm > proper. Why shouldn't this path wake the kswapd and let it compact > memory on the background to increase the success rate for the later > high order allocations? I kind of agree. If kswapd is a problem (is it ???) we should fix it, instead of adding yet another flag to some random locations attempting memory allocations. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu 18-06-15 07:35:53, Eric Dumazet wrote: > On Thu, Jun 18, 2015 at 7:30 AM, Michal Hocko <mhocko@suse.cz> wrote: > > > Abusing __GFP_NO_KSWAPD is a wrong way to go IMHO. It is true that the > > _current_ implementation of the allocator has this nasty and very subtle > > side effect but that doesn't mean it should be abused outside of the mm > > proper. Why shouldn't this path wake the kswapd and let it compact > > memory on the background to increase the success rate for the later > > high order allocations? > > I kind of agree. > > If kswapd is a problem (is it ???) we should fix it, instead of adding > yet another flag to some random locations attempting > memory allocations. No, kswapd is not a problem. The problem is ~__GFP_WAIT allocation can access some portion of the memory reserves (see gfp_to_alloc_flags resp. __zone_watermark_ok and ALLOC_HARDER). __GFP_NO_KSWAPD is just a dirty hack to not give that access which was introduced for THP AFAIR. The implicit access to memory reserves for non sleeping allocation has been there for ages and it might be not suitable for this particular path but that doesn't mean another gfp flag with a different side effect should be hijacked. We should either stop doing that implicit access to memory reserves and give __GFP_RESERVE or add the __GFP_NORESERVE. But that is a problem to be solved in the mm proper. Spreading subtle dependencies outside of mm will just make situation worse.
On 06/18/2015 04:43 PM, Michal Hocko wrote: > On Thu 18-06-15 07:35:53, Eric Dumazet wrote: >> On Thu, Jun 18, 2015 at 7:30 AM, Michal Hocko <mhocko@suse.cz> wrote: >> >>> Abusing __GFP_NO_KSWAPD is a wrong way to go IMHO. It is true that the >>> _current_ implementation of the allocator has this nasty and very subtle >>> side effect but that doesn't mean it should be abused outside of the mm >>> proper. Why shouldn't this path wake the kswapd and let it compact >>> memory on the background to increase the success rate for the later >>> high order allocations? >> >> I kind of agree. >> >> If kswapd is a problem (is it ???) we should fix it, instead of adding >> yet another flag to some random locations attempting >> memory allocations. > > No, kswapd is not a problem. The problem is ~__GFP_WAIT allocation can > access some portion of the memory reserves (see gfp_to_alloc_flags resp. > __zone_watermark_ok and ALLOC_HARDER). __GFP_NO_KSWAPD is just a dirty > hack to not give that access which was introduced for THP AFAIR. > > The implicit access to memory reserves for non sleeping allocation has > been there for ages and it might be not suitable for this particular > path but that doesn't mean another gfp flag with a different side effect > should be hijacked. We should either stop doing that implicit access to > memory reserves and give __GFP_RESERVE or add the __GFP_NORESERVE. But > that is a problem to be solved in the mm proper. Spreading subtle > dependencies outside of mm will just make situation worse. So you are not proposing to use these __GFP_RESERVE/NORESERVE flag outside of mm, right? (besides, we distinguish several kinds of reserves, so what exactly would the flag do?) As that would be also subtle dependency. The general problem I think is that we should want the mm users to specify higher-level intentions (such as GFP_KERNEL) which would map to specific directions (__GFP_*) for the allocator, and currently it's rather a mess of both kinds of flags. Clearly the intention here is "opportunistic allocation that should not reclaim/compact, use reserves, wake up kswapd (?) because it's better to fall back to smaller pages than wait") and we don't seem to have a GFP_OPPORTUNISTIC flag for that. The allocation has to then mask out __GFP_WAIT which however looks like an atomic allocation to the allocator and give access to reserves, etc... -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu 18-06-15 17:22:40, Vlastimil Babka wrote: > On 06/18/2015 04:43 PM, Michal Hocko wrote: > >On Thu 18-06-15 07:35:53, Eric Dumazet wrote: > >>On Thu, Jun 18, 2015 at 7:30 AM, Michal Hocko <mhocko@suse.cz> wrote: > >> > >>>Abusing __GFP_NO_KSWAPD is a wrong way to go IMHO. It is true that the > >>>_current_ implementation of the allocator has this nasty and very subtle > >>>side effect but that doesn't mean it should be abused outside of the mm > >>>proper. Why shouldn't this path wake the kswapd and let it compact > >>>memory on the background to increase the success rate for the later > >>>high order allocations? > >> > >>I kind of agree. > >> > >>If kswapd is a problem (is it ???) we should fix it, instead of adding > >>yet another flag to some random locations attempting > >>memory allocations. > > > >No, kswapd is not a problem. The problem is ~__GFP_WAIT allocation can > >access some portion of the memory reserves (see gfp_to_alloc_flags resp. > >__zone_watermark_ok and ALLOC_HARDER). __GFP_NO_KSWAPD is just a dirty > >hack to not give that access which was introduced for THP AFAIR. > > > >The implicit access to memory reserves for non sleeping allocation has > >been there for ages and it might be not suitable for this particular > >path but that doesn't mean another gfp flag with a different side effect > >should be hijacked. We should either stop doing that implicit access to > >memory reserves and give __GFP_RESERVE or add the __GFP_NORESERVE. But > >that is a problem to be solved in the mm proper. Spreading subtle > >dependencies outside of mm will just make situation worse. > > So you are not proposing to use these __GFP_RESERVE/NORESERVE flag outside > of mm, right? (besides, we distinguish several kinds of reserves, so what > exactly would the flag do?) That is to be discussed. Most allocations already express their interest in memory reserves by __GFP_HIGH directly or by GFP_ATOMIC indirectly. So maybe we do not need any additional flag here. There are not that many ~__GFP_WAIT and most of them seem to require it _only_ because the context doesn't allow for sleeping (e.g. to prevent from deadlocks). > As that would be also subtle dependency. The > general problem I think is that we should want the mm users to specify > higher-level intentions (such as GFP_KERNEL) which would map to specific > directions (__GFP_*) for the allocator, and currently it's rather a mess of > both kinds of flags. I agree. So I think that maybe we should drop that implicit access to memory reserves for ~__GFP_WAIT allocations and let it do what it is documented to do. > Clearly the intention here is "opportunistic allocation that should > not reclaim/compact, use reserves, wake up kswapd (?) because it's > better to fall back to smaller pages than wait") and we don't seem to > have a GFP_OPPORTUNISTIC flag for that. The allocation has to then > mask out __GFP_WAIT which however looks like an atomic allocation to > the allocator and give access to reserves, etc... I think simply dropping GFP_WAIT is a good way to express that. The fact that the current implementation gives access to memory reserves implicitly is just a detail and the user of the allocator shouldn't care about that.
On Thu, 18 Jun 2015, Michal Hocko wrote: > That is to be discussed. Most allocations already express their interest > in memory reserves by __GFP_HIGH directly or by GFP_ATOMIC indirectly. > So maybe we do not need any additional flag here. There are not that > many ~__GFP_WAIT and most of them seem to require it _only_ because the > context doesn't allow for sleeping (e.g. to prevent from deadlocks). > We're talking about a patch that is being backported to stable. Regardless of what improvements can be made to specify that an allocation shouldn't be able to access reserves (and what belongs solely in the page allocator proper) independent of __GFP_NO_KSWAPD, that can be cleaned up at a later time. I don't anticipate that cleanup to be backported to stable, and my primary concern here is the ability for this allocations to now access, and possibly deplete, memory reserves. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 3cfff2a..41ec022 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -4398,7 +4398,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len, while (order) { if (npages >= 1 << order) { - page = alloc_pages(gfp_mask | + page = alloc_pages((gfp_mask & ~__GFP_WAIT) | __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY, diff --git a/net/core/sock.c b/net/core/sock.c index 292f422..e9855a4 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1883,7 +1883,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp) pfrag->offset = 0; if (SKB_FRAG_PAGE_ORDER) { - pfrag->page = alloc_pages(gfp | __GFP_COMP | + pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY, SKB_FRAG_PAGE_ORDER); if (likely(pfrag->page)) {
We saw excessive direct memory compaction triggered by skb_page_frag_refill. This causes performance issues and add latency. Commit 5640f7685831e0 introduces the order-3 allocation. According to the changelog, the order-3 allocation isn't a must-have but to improve performance. But direct memory compaction has high overhead. The benefit of order-3 allocation can't compensate the overhead of direct memory compaction. This patch makes the order-3 page allocation atomic. If there is no memory pressure and memory isn't fragmented, the alloction will still success, so we don't sacrifice the order-3 benefit here. If the atomic allocation fails, direct memory compaction will not be triggered, skb_page_frag_refill will fallback to order-0 immediately, hence the direct memory compaction overhead is avoided. In the allocation failure case, kswapd is waken up and doing compaction, so chances are allocation could success next time. alloc_skb_with_frags is the same. The mellanox driver does similar thing, if this is accepted, we must fix the driver too. V3: fix the same issue in alloc_skb_with_frags as pointed out by Eric V2: make the changelog clearer Cc: Eric Dumazet <edumazet@google.com> Cc: Chris Mason <clm@fb.com> Cc: Debabrata Banerjee <dbavatar@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com> --- net/core/skbuff.c | 2 +- net/core/sock.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)