diff mbox

[RFC] mm/init: fix zone boundary creation

Message ID 1462435033-15601-1-git-send-email-oohall@gmail.com (mailing list archive)
State Not Applicable
Headers show

Commit Message

Oliver O'Halloran May 5, 2016, 7:57 a.m. UTC
As a part of memory initialisation the architecture passes an array to
free_area_init_nodes() which specifies the max PFN of each memory zone.
This array is not necessarily monotonic (due to unused zones) so this
array is parsed to build monotonic lists of the min and max PFN for
each zone. ZONE_MOVABLE is special cased here as its limits are managed by
the mm subsystem rather than the architecture. Unfortunately, this special
casing is broken when ZONE_MOVABLE is the not the last zone in the zone
list. The core of the issue is:

	if (i == ZONE_MOVABLE)
		continue;
	arch_zone_lowest_possible_pfn[i] =
		arch_zone_highest_possible_pfn[i-1];

As ZONE_MOVABLE is skipped the lowest_possible_pfn of the next zone
will be set to zero. This patch fixes this bug by adding explicitly
tracking where the next zone should start rather than relying on the
contents arch_zone_highest_possible_pfn[].

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
---
 mm/page_alloc.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

Comments

Andrew Morton May 26, 2016, 9:21 p.m. UTC | #1
On Thu,  5 May 2016 17:57:13 +1000 "Oliver O'Halloran" <oohall@gmail.com> wrote:

> As a part of memory initialisation the architecture passes an array to
> free_area_init_nodes() which specifies the max PFN of each memory zone.
> This array is not necessarily monotonic (due to unused zones) so this
> array is parsed to build monotonic lists of the min and max PFN for
> each zone. ZONE_MOVABLE is special cased here as its limits are managed by
> the mm subsystem rather than the architecture. Unfortunately, this special
> casing is broken when ZONE_MOVABLE is the not the last zone in the zone
> list. The core of the issue is:
> 
> 	if (i == ZONE_MOVABLE)
> 		continue;
> 	arch_zone_lowest_possible_pfn[i] =
> 		arch_zone_highest_possible_pfn[i-1];
> 
> As ZONE_MOVABLE is skipped the lowest_possible_pfn of the next zone
> will be set to zero. This patch fixes this bug by adding explicitly
> tracking where the next zone should start rather than relying on the
> contents arch_zone_highest_possible_pfn[].

hm, this is all ten year old Mel code.

What's the priority on this?  What are the user-visible runtime
effects, how many people are affected, etc?


> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5980,15 +5980,18 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
>  				sizeof(arch_zone_lowest_possible_pfn));
>  	memset(arch_zone_highest_possible_pfn, 0,
>  				sizeof(arch_zone_highest_possible_pfn));
> -	arch_zone_lowest_possible_pfn[0] = find_min_pfn_with_active_regions();
> -	arch_zone_highest_possible_pfn[0] = max_zone_pfn[0];
> -	for (i = 1; i < MAX_NR_ZONES; i++) {
> +
> +	start_pfn = find_min_pfn_with_active_regions();
> +
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
>  		if (i == ZONE_MOVABLE)
>  			continue;
> -		arch_zone_lowest_possible_pfn[i] =
> -			arch_zone_highest_possible_pfn[i-1];
> -		arch_zone_highest_possible_pfn[i] =
> -			max(max_zone_pfn[i], arch_zone_lowest_possible_pfn[i]);
> +
> +		end_pfn = max(max_zone_pfn[i], start_pfn);
> +		arch_zone_lowest_possible_pfn[i] = start_pfn;
> +		arch_zone_highest_possible_pfn[i] = end_pfn;
> +
> +		start_pfn = end_pfn;
>  	}
>  	arch_zone_lowest_possible_pfn[ZONE_MOVABLE] = 0;
>  	arch_zone_highest_possible_pfn[ZONE_MOVABLE] = 0;
Oliver O'Halloran May 27, 2016, 8:03 a.m. UTC | #2
On Fri, May 27, 2016 at 7:21 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> hm, this is all ten year old Mel code.
>
> What's the priority on this?  What are the user-visible runtime
> effects, how many people are affected, etc?

Low priority. To get bitten by this you need to enable a zone that appears
after ZONE_MOVABLE in the zone_type enum. As far as I can tell this means
running a kernel with ZONE_DEVICE or ZONE_CMA enabled, so I can't see this
affecting too many people.

I only noticed this because I've been fiddling with ZONE_DEVICE on powerpc and
4.6 broke my test kernel. This bug, in conjunction with the changes in Taku
Izumi's kernelcore=mirror patch (d91749c1dda71) and powerpc being the odd
architecture which initialises max_zone_pfn[] to ~0ul instead of 0 caused all
of system memory to be placed into ZONE_DEVICE at boot, followed a
panic since device memory cannot be used for kernel allocations. I've already
submitted a patch to fix the powerpc specific bits, but I figured this should
be fixed too.

oliver
Mel Gorman May 30, 2016, 9:15 a.m. UTC | #3
On Thu, May 26, 2016 at 02:21:42PM -0700, Andrew Morton wrote:
> On Thu,  5 May 2016 17:57:13 +1000 "Oliver O'Halloran" <oohall@gmail.com> wrote:
> 
> > As a part of memory initialisation the architecture passes an array to
> > free_area_init_nodes() which specifies the max PFN of each memory zone.
> > This array is not necessarily monotonic (due to unused zones) so this
> > array is parsed to build monotonic lists of the min and max PFN for
> > each zone. ZONE_MOVABLE is special cased here as its limits are managed by
> > the mm subsystem rather than the architecture. Unfortunately, this special
> > casing is broken when ZONE_MOVABLE is the not the last zone in the zone
> > list. The core of the issue is:
> > 
> > 	if (i == ZONE_MOVABLE)
> > 		continue;
> > 	arch_zone_lowest_possible_pfn[i] =
> > 		arch_zone_highest_possible_pfn[i-1];
> > 
> > As ZONE_MOVABLE is skipped the lowest_possible_pfn of the next zone
> > will be set to zero. This patch fixes this bug by adding explicitly
> > tracking where the next zone should start rather than relying on the
> > contents arch_zone_highest_possible_pfn[].
> 
> hm, this is all ten year old Mel code.
> 

ZONE_MOVABLE at the time always existed at the end of a node during
initialisation time. It was allowed because the memory was always "stolen"
from the end of the node where it could have the same limitations as
ZONE_HIGHMEM if necessary. It was also safe to assume that zones never
overlapped as zones were about addressing limitations.  If ZONE_CMA or
ZONE_DEVICE can overlap with other zones during initialisation time then
there may be a few gremlins hiding in there. Unfortunately I have not
done an audit searching for problems with overlapping zones.
Oliver O'Halloran May 30, 2016, 1:18 p.m. UTC | #4
On Mon, May 30, 2016 at 7:15 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> On Thu, May 26, 2016 at 02:21:42PM -0700, Andrew Morton wrote:
>> On Thu,  5 May 2016 17:57:13 +1000 "Oliver O'Halloran" <oohall@gmail.com> wrote:
>>
>> > As a part of memory initialisation the architecture passes an array to
>> > free_area_init_nodes() which specifies the max PFN of each memory zone.
>> > This array is not necessarily monotonic (due to unused zones) so this
>> > array is parsed to build monotonic lists of the min and max PFN for
>> > each zone. ZONE_MOVABLE is special cased here as its limits are managed by
>> > the mm subsystem rather than the architecture. Unfortunately, this special
>> > casing is broken when ZONE_MOVABLE is the not the last zone in the zone
>> > list. The core of the issue is:
>> >
>> >     if (i == ZONE_MOVABLE)
>> >             continue;
>> >     arch_zone_lowest_possible_pfn[i] =
>> >             arch_zone_highest_possible_pfn[i-1];
>> >
>> > As ZONE_MOVABLE is skipped the lowest_possible_pfn of the next zone
>> > will be set to zero. This patch fixes this bug by adding explicitly
>> > tracking where the next zone should start rather than relying on the
>> > contents arch_zone_highest_possible_pfn[].
>>
>> hm, this is all ten year old Mel code.
>>
>
> ZONE_MOVABLE at the time always existed at the end of a node during
> initialisation time. It was allowed because the memory was always "stolen"
> from the end of the node where it could have the same limitations as
> ZONE_HIGHMEM if necessary. It was also safe to assume that zones never
> overlapped as zones were about addressing limitations. If ZONE_CMA or
> ZONE_DEVICE can overlap with other zones during initialisation time then
> there may be a few gremlins hiding in there. Unfortunately I have not
> done an audit searching for problems with overlapping zones.

I think it's still reasonable to assume there is no overlap in early init. The
interface to free_area_init_nodes() ensures that zones are disjoint and as far
as I can tell the only way to get an overlapping zone at that point is to hit
the bug this patch fixes. ZONE_CMA is only populated when core_initcall()s are
processed and ZONE_DEVICE is hotplugged by drivers so it should appear even
later.
diff mbox

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 59de90d5d3a3..fc78306ce087 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5980,15 +5980,18 @@  void __init free_area_init_nodes(unsigned long *max_zone_pfn)
 				sizeof(arch_zone_lowest_possible_pfn));
 	memset(arch_zone_highest_possible_pfn, 0,
 				sizeof(arch_zone_highest_possible_pfn));
-	arch_zone_lowest_possible_pfn[0] = find_min_pfn_with_active_regions();
-	arch_zone_highest_possible_pfn[0] = max_zone_pfn[0];
-	for (i = 1; i < MAX_NR_ZONES; i++) {
+
+	start_pfn = find_min_pfn_with_active_regions();
+
+	for (i = 0; i < MAX_NR_ZONES; i++) {
 		if (i == ZONE_MOVABLE)
 			continue;
-		arch_zone_lowest_possible_pfn[i] =
-			arch_zone_highest_possible_pfn[i-1];
-		arch_zone_highest_possible_pfn[i] =
-			max(max_zone_pfn[i], arch_zone_lowest_possible_pfn[i]);
+
+		end_pfn = max(max_zone_pfn[i], start_pfn);
+		arch_zone_lowest_possible_pfn[i] = start_pfn;
+		arch_zone_highest_possible_pfn[i] = end_pfn;
+
+		start_pfn = end_pfn;
 	}
 	arch_zone_lowest_possible_pfn[ZONE_MOVABLE] = 0;
 	arch_zone_highest_possible_pfn[ZONE_MOVABLE] = 0;