
fdtable: Avoid triggering OOMs from alloc_fdmem

Message ID 1391530721.4301.8.camel@edumazet-glaptop2.roam.corp.google.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Feb. 4, 2014, 4:18 p.m. UTC
On Mon, 2014-02-03 at 21:26 -0800, Eric W. Biederman wrote:
> Recently due to a spike in connections per second memcached on 3
> separate boxes triggered the OOM killer from accept.  At the time the
> OOM killer was triggered there was 4GB out of 36GB free in zone 1. The
> problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
> hold a bitmap, and there was sufficient fragmentation that the largest
> page available was 8KiB.
> 
> I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
> but I do agree that order 3 allocations are very likely to succeed.
> 
> There are always pathologies where order > 0 allocations can fail when
> there are copious amounts of free memory available.  Using the pigeon
> hole principle it is easy to show that it requires 1 page more than 50%
> of the pages being free to guarantee an order 1 (8KiB) allocation will
> succeed, 1 page more than 75% of the pages being free to guarantee an
> order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
> the pages being free to guarantee an order 3 allocation will succeed.
> 
> A server churning memory with a lot of small requests and replies like
> memcached is a common case that, if anything, will skew the odds
> against large pages being available.
> 
> Therefore let's not give external applications a practical way to kill
> linux server applications, and specify __GFP_NORETRY to the kmalloc in
> alloc_fdmem.  Unless I am misreading the code, by the time the code
> reaches should_alloc_retry in __alloc_pages_slowpath (where
> __GFP_NORETRY becomes significant) we have already tried everything
> reasonable to allocate a page and the only thing left to do is wait.  So
> not waiting and falling back to vmalloc immediately seems like the
> reasonable thing to do even if there wasn't a chance of triggering the
> OOM killer.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/file.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/file.c b/fs/file.c
> index 771578b33fb6..db25c2bdfe46 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -34,7 +34,7 @@ static void *alloc_fdmem(size_t size)
>  	 * vmalloc() if the allocation size will be considered "large" by the VM.
>  	 */
>  	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> -		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> +		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY);
>  		if (data != NULL)
>  			return data;
>  	}

Hi Eric

I wrote a similar patch yesterday adding __GFP_NORETRY in the following
paths. I feel that alloc_fdmem() is only a part of the problem ;)

What do you think, should we merge our changes or have distinct
patches?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric W. Biederman Feb. 4, 2014, 5:22 p.m. UTC | #1
Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Mon, 2014-02-03 at 21:26 -0800, Eric W. Biederman wrote:
>> Recently due to a spike in connections per second memcached on 3
>> separate boxes triggered the OOM killer from accept.  At the time the
>> OOM killer was triggered there was 4GB out of 36GB free in zone 1. The
>> problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
>> hold a bitmap, and there was sufficient fragmentation that the largest
>> page available was 8KiB.
>> 
>> I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
>> but I do agree that order 3 allocations are very likely to succeed.
>> 
>> There are always pathologies where order > 0 allocations can fail when
>> there are copious amounts of free memory available.  Using the pigeon
>> hole principle it is easy to show that it requires 1 page more than 50%
>> of the pages being free to guarantee an order 1 (8KiB) allocation will
>> succeed, 1 page more than 75% of the pages being free to guarantee an
>> order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
>> the pages being free to guarantee an order 3 allocation will succeed.
>> 
>> A server churning memory with a lot of small requests and replies like
>> memcached is a common case that, if anything, will skew the odds
>> against large pages being available.
>> 
>> Therefore let's not give external applications a practical way to kill
>> linux server applications, and specify __GFP_NORETRY to the kmalloc in
>> alloc_fdmem.  Unless I am misreading the code, by the time the code
>> reaches should_alloc_retry in __alloc_pages_slowpath (where
>> __GFP_NORETRY becomes significant) we have already tried everything
>> reasonable to allocate a page and the only thing left to do is wait.  So
>> not waiting and falling back to vmalloc immediately seems like the
>> reasonable thing to do even if there wasn't a chance of triggering the
>> OOM killer.
>> 
>> Cc: stable@vger.kernel.org
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/file.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>> 
>> diff --git a/fs/file.c b/fs/file.c
>> index 771578b33fb6..db25c2bdfe46 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -34,7 +34,7 @@ static void *alloc_fdmem(size_t size)
>>  	 * vmalloc() if the allocation size will be considered "large" by the VM.
>>  	 */
>>  	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>> -		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> +		void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY);
>>  		if (data != NULL)
>>  			return data;
>>  	}
>
> Hi Eric
>
> I wrote a similar patch yesterday adding __GFP_NORETRY in the following
> paths. I feel that alloc_fdmem() is only a part of the problem ;)

These code paths below were triggering OOMs for you?

I looked and didn't see an obvious failing path.

> What do you think, should we merge our changes or have distinct
> patches?

I don't know about merging changes but certainly looking at the issue
together sounds good.

My gut feel says if there is a code path that has __GFP_NOWARN and
because of PAGE_ALLOC_COSTLY_ORDER we loop forever then there is
something fishy going on.

I would love to hear some people who are more current on the mm
subsystem than I am chime in.  It might be that the darn fix is going to
be to teach __alloc_pages_slowpath to not loop forever unless order == 0.
I expect the worst offenders need to have __GFP_NORETRY added so that
knowledge of what is going on spreads, and so we can avoid the danger of
having to retune the whole mm subsystem that changing
__alloc_pages_slowpath would entail.

The two code paths below certainly look like good candidates for having
__GFP_NORETRY added to them.  The same issues I ran into with
alloc_fdmem are likely to show up there as well.

Eric



> diff --git a/net/core/sock.c b/net/core/sock.c
> index 0c127dcdf6a8..5b6a9431b017 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1775,7 +1775,9 @@ struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
>  			while (order) {
>  				if (npages >= 1 << order) {
>  					page = alloc_pages(sk->sk_allocation |
> -							   __GFP_COMP | __GFP_NOWARN,
> +							   __GFP_COMP |
> +							   __GFP_NOWARN |
> +							   __GFP_NORETRY,
>  							   order);
>  					if (page)
>  						goto fill_page;
> @@ -1845,7 +1847,7 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
>  		gfp_t gfp = prio;
>  
>  		if (order)
> -			gfp |= __GFP_COMP | __GFP_NOWARN;
> +			gfp |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
>  		pfrag->page = alloc_pages(gfp, order);
>  		if (likely(pfrag->page)) {
>  			pfrag->offset = 0;
>

Eric Dumazet Feb. 4, 2014, 6:44 p.m. UTC | #2
On Tue, 2014-02-04 at 09:22 -0800, Eric W. Biederman wrote:

> The two code paths below certainly look like good candidates for having
> __GFP_NORETRY added to them.  The same issues I ran into with
> alloc_fdmem are likely to show up there as well.

Yes, this is what I thought: a write into a TCP socket should be more
frequent than the alloc_fdmem() case ;)

But then, maybe your workload was only using UDP?


Eric W. Biederman Feb. 4, 2014, 6:57 p.m. UTC | #3
Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Tue, 2014-02-04 at 09:22 -0800, Eric W. Biederman wrote:
>
>> The two code paths below certainly look like good candidates for having
>> __GFP_NORETRY added to them.  The same issues I ran into with
>> alloc_fdmem are likely to show up there as well.
>
> Yes, this is what I thought: a write into a TCP socket should be more
> frequent than the alloc_fdmem() case ;)
>
> But then, maybe your workload was only using UDP?

As I have heard it described, one tcp connection per small request,
and someone goofed and started creating new connections when the server
was bogged down.  But since all of the requests and replies were small I
don't expect even TCP would allocate more than a 4KiB page in that
workload.

I had oodles of 4KiB and 8KiB pages.  What size of memory allocation did
you see failing?  

Eric

David Rientjes Feb. 4, 2014, 9:27 p.m. UTC | #4
On Tue, 4 Feb 2014, Eric W. Biederman wrote:

> My gut feel says if there is a code path that has __GFP_NOWARN and
> because of PAGE_ALLOC_COSTLY_ORDER we loop forever then there is
> something fishy going on.
> 

The __GFP_NOWARN without __GFP_NORETRY in alloc_fdmem() is pointless 
because we already know that the allocation is PAGE_ALLOC_COSTLY_ORDER or 
smaller.  That function encodes specific knowledge of the page allocator's 
implementation so it leads me to believe that __GFP_NOWARN was intended to 
be __GFP_NORETRY from the start.  Otherwise, it's just set pointlessly and 
specifically allows for the oom killing that you're now reporting.  Since 
it can fallback to vmalloc() after exhausting all of the page allocator's 
capabilities, the __GFP_NOWARN|__GFP_NORETRY seems entirely appropriate.

The vmalloc() fallback has never actually been reached in this function,
because kmalloc() loops forever in this allocation context, but it is
definitely better than oom killing something.

Acked-by: David Rientjes <rientjes@google.com>

> I would love to hear some people who are more current on the mm
> subsystem than I am chime in.  It might be that the darn fix is going to
> be to teach __alloc_pages_slowpath to not loop forever, unless order == 0.

It doesn't loop forever, it will either return NULL because of its 
allocation context or the oom killer will kill something, even for order-3 
allocations.  In the case that you've modified, you have sane fallback 
behavior that can be utilized rather than the oom killer and __GFP_NORETRY 
was reasonable from the start.

The question is simple enough: do we want to change 
PAGE_ALLOC_COSTLY_ORDER to be smaller so that order-3 does return NULL 
without oom killing?  Perhaps there's an argument to be made that does 
exactly that, but by not setting __GFP_NORETRY you are really demanding 
order-3 memory at the time you allocate it and are willing to accept the 
consequences to free that memory.  Should we make everything except for 
order-0 inherently __GFP_NORETRY and introduce a replacement __GFP_RETRY?  
That's doable as well, but it would be a massive effort.
Eric Dumazet Feb. 4, 2014, 10:48 p.m. UTC | #5
On Tue, 2014-02-04 at 10:57 -0800, Eric W. Biederman wrote:

> As I have heard it described, one tcp connection per small request,
> and someone goofed and started creating new connections when the server
> was bogged down.  But since all of the requests and replies were small I
> don't expect even TCP would allocate more than a 4KiB page in that
> workload.

Right, small writes use regular skbs (no page fragments).

> 
> I had oodles of 4KiB and 8KiB pages.  What size of memory allocation did
> you see failing?  

We got some reports of order-3 allocations failing.



Patch

diff --git a/net/core/sock.c b/net/core/sock.c
index 0c127dcdf6a8..5b6a9431b017 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1775,7 +1775,9 @@  struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
 			while (order) {
 				if (npages >= 1 << order) {
 					page = alloc_pages(sk->sk_allocation |
-							   __GFP_COMP | __GFP_NOWARN,
+							   __GFP_COMP |
+							   __GFP_NOWARN |
+							   __GFP_NORETRY,
 							   order);
 					if (page)
 						goto fill_page;
@@ -1845,7 +1847,7 @@  bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio)
 		gfp_t gfp = prio;
 
 		if (order)
-			gfp |= __GFP_COMP | __GFP_NOWARN;
+			gfp |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
 		pfrag->page = alloc_pages(gfp, order);
 		if (likely(pfrag->page)) {
 			pfrag->offset = 0;