[mmotm] mm: alloc_large_system_hash check order

Message ID Pine.LNX.4.64.0904292151350.30874@blonde.anvils
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Hugh Dickins April 29, 2009, 9:09 p.m. UTC
On an x86_64 with 4GB ram, tcp_init()'s call to alloc_large_system_hash(),
to allocate tcp_hashinfo.ehash, is now triggering an mmotm WARN_ON_ONCE on
order >= MAX_ORDER - it's hoping for order 11.  alloc_large_system_hash()
had better make its own check on the order.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
Should probably follow
page-allocator-do-not-sanity-check-order-in-the-fast-path-fix.patch

Cc'ed DaveM and netdev, just in case they're surprised it was asking for
so much, or disappointed it's not getting as much as it was asking for.

 mm/page_alloc.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Andrew Morton April 29, 2009, 9:28 p.m. UTC | #1
On Wed, 29 Apr 2009 22:09:48 +0100 (BST)
Hugh Dickins <hugh@veritas.com> wrote:

> On an x86_64 with 4GB ram, tcp_init()'s call to alloc_large_system_hash(),
> to allocate tcp_hashinfo.ehash, is now triggering an mmotm WARN_ON_ONCE on
> order >= MAX_ORDER - it's hoping for order 11.  alloc_large_system_hash()
> had better make its own check on the order.
> 
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> ---
> Should probably follow
> page-allocator-do-not-sanity-check-order-in-the-fast-path-fix.patch
> 
> Cc'ed DaveM and netdev, just in case they're surprised it was asking for
> so much, or disappointed it's not getting as much as it was asking for.
> 
>  mm/page_alloc.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> --- 2.6.30-rc3-mm1/mm/page_alloc.c	2009-04-29 21:01:08.000000000 +0100
> +++ mmotm/mm/page_alloc.c	2009-04-29 21:12:04.000000000 +0100
> @@ -4765,7 +4765,10 @@ void *__init alloc_large_system_hash(con
>  			table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
>  		else {
>  			unsigned long order = get_order(size);
> -			table = (void*) __get_free_pages(GFP_ATOMIC, order);
> +
> +			if (order < MAX_ORDER)
> +				table = (void *)__get_free_pages(GFP_ATOMIC,
> +								order);
>  			/*
>  			 * If bucketsize is not a power-of-two, we may free
>  			 * some pages at the end of hash table.

yes, the code is a bit odd:

:	do {
: 		size = bucketsize << log2qty;
: 		if (flags & HASH_EARLY)
: 			table = alloc_bootmem_nopanic(size);
: 		else if (hashdist)
: 			table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
: 		else {
: 			unsigned long order = get_order(size);
: 			table = (void*) __get_free_pages(GFP_ATOMIC, order);
: 			/*
: 			 * If bucketsize is not a power-of-two, we may free
: 			 * some pages at the end of hash table.
: 			 */
: 			if (table) {
: 				unsigned long alloc_end = (unsigned long)table +
: 						(PAGE_SIZE << order);
: 				unsigned long used = (unsigned long)table +
: 						PAGE_ALIGN(size);
: 				split_page(virt_to_page(table), order);
: 				while (used < alloc_end) {
: 					free_page(used);
: 					used += PAGE_SIZE;
: 				}
: 			}
: 		}
: 	} while (!table && size > PAGE_SIZE && --log2qty);

In the case where it does the __vmalloc(), the order-11 allocation will
succeed.  But in the other cases, the allocation attempt will need to
be shrunk and we end up with a smaller hash table.  Is that sensible?

If we want to regularise all three cases, doing

	size = min(size, MAX_ORDER);

before starting the loop would be suitable, although the huge
__get_free_pages() might still fail.  (But it will then warn, won't it?
 And nobody is reporting that).

I was a bit iffy about adding the warning in the first place, but let it
go through due to its potential to lead us to code which isn't doing what
it thinks it's doing, or is being generally peculiar.

David Miller April 30, 2009, 12:25 a.m. UTC | #2
From: Hugh Dickins <hugh@veritas.com>
Date: Wed, 29 Apr 2009 22:09:48 +0100 (BST)

> Cc'ed DaveM and netdev, just in case they're surprised it was asking for
> so much, or disappointed it's not getting as much as it was asking for.

This is basically what should be happening, thanks for the note.
Hugh Dickins May 1, 2009, 1:40 p.m. UTC | #3
On Wed, 29 Apr 2009, Andrew Morton wrote:
> 
> yes, the code is a bit odd:
> 
> :	do {
> : 		size = bucketsize << log2qty;
> : 		if (flags & HASH_EARLY)
> : 			table = alloc_bootmem_nopanic(size);
> : 		else if (hashdist)
> : 			table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
> : 		else {
> : 			unsigned long order = get_order(size);
> : 			table = (void*) __get_free_pages(GFP_ATOMIC, order);
> : 			/*
> : 			 * If bucketsize is not a power-of-two, we may free
> : 			 * some pages at the end of hash table.
> : 			 */
> : 			if (table) {
> : 				unsigned long alloc_end = (unsigned long)table +
> : 						(PAGE_SIZE << order);
> : 				unsigned long used = (unsigned long)table +
> : 						PAGE_ALIGN(size);
> : 				split_page(virt_to_page(table), order);
> : 				while (used < alloc_end) {
> : 					free_page(used);
> : 					used += PAGE_SIZE;
> : 				}
> : 			}
> : 		}
> : 	} while (!table && size > PAGE_SIZE && --log2qty);
> 
> In the case where it does the __vmalloc(), the order-11 allocation will
> succeed.  But in the other cases, the allocation attempt will need to
> be shrunk and we end up with a smaller hash table.  Is that sensible?

It is a little odd, but the __vmalloc() route is used by default on
64-bit with CONFIG_NUMA, and this route otherwise.  (The hashdist
documentation isn't up-to-date on that; I'll send a patch.)

> 
> If we want to regularise all three cases, doing
> 
> 	size = min(size, MAX_ORDER);

If I take you literally, the resulting hash tables are going to
be rather small ;) but I know what you mean.

> 
> before starting the loop would be suitable, although the huge
> __get_free_pages() might still fail.

Oh, I don't feel a great urge to regularize these cases in such
a way.  I particularly don't feel like limiting 64-bit NUMA to
MAX_ORDER-1 size, if netdev have been happy with more until now.
Could consider a __vmalloc fallback when order is too large,
but let's not do so unless someone actually needs that.
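
(If anyone ever did need it, such a fallback might look like this
untested sketch on top of the patch below -- not something being
proposed here:

			if (order < MAX_ORDER)
				table = (void *)__get_free_pages(GFP_ATOMIC,
								order);
			else
				table = __vmalloc(size, GFP_ATOMIC,
						  PAGE_KERNEL);

keeping the physically contiguous allocation for the common case and
falling back to vmalloc space only when the order is out of range.)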

> (But it will then warn, won't it?
>  And nobody is reporting that).

Well, it was hard to report it while mmotm's WARN_ON_ONCE was itself
oopsing.  With that fixed, I've reported it on x86_64 with 4GB
(without CONFIG_NUMA).

> 
> I was a bit iffy about adding the warning in the first place, but let it
> go through due to its potential to lead us to code which isn't doing what
> it thinks it's doing, or is being generally peculiar.

DaveM has confirmed that the code is doing what they want it to do.
So I think mmotm wants this patch (for alloc_large_system_hash to
keep away from that warning), plus Mel's improvement on top of it.

Hugh
Patch

--- 2.6.30-rc3-mm1/mm/page_alloc.c	2009-04-29 21:01:08.000000000 +0100
+++ mmotm/mm/page_alloc.c	2009-04-29 21:12:04.000000000 +0100
@@ -4765,7 +4765,10 @@  void *__init alloc_large_system_hash(con
 			table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
 		else {
 			unsigned long order = get_order(size);
-			table = (void*) __get_free_pages(GFP_ATOMIC, order);
+
+			if (order < MAX_ORDER)
+				table = (void *)__get_free_pages(GFP_ATOMIC,
+								order);
 			/*
 			 * If bucketsize is not a power-of-two, we may free
 			 * some pages at the end of hash table.