
[v3,2/3] powerpc: get hugetlbpage handling more generic

Message ID 69b1226ce134761fd7ab81a498bdb85cd737280f.1474441302.git.christophe.leroy@c-s.fr (mailing list archive)
State Superseded
Delegated to: Scott Wood

Commit Message

Christophe Leroy Sept. 21, 2016, 8:11 a.m. UTC
Today there are two implementations of hugetlb pages, managed by
mutually exclusive #ifdefs:
* FSL_BOOKE: several directory entries point to the same single hugepage
* BOOK3S: one upper-level directory entry points to a table of hugepages

In preparation for implementing hugepage support on the 8xx, we need a
mix of the two solutions above, because the 8xx needs both cases
depending on the page size:
* In 4k page size mode, each PGD entry covers a 4M area. This means
that 2 PGD entries are needed to cover an 8M hugepage, while a
single PGD entry covers 8x 512k hugepages.
* In 16k page size mode, each PGD entry covers a 64M area. This means
that 8x 8M hugepages are covered by one PGD entry and 64x 512k
hugepages are covered by one PGD entry.
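
(Editorial cross-check of the 4k-mode and 8M arithmetic above: a plain
userspace C sketch, not part of the patch. The shifts are taken from the
coverage figures stated above, i.e. log2(4M) = 22 in 4k mode and
log2(64M) = 26 in 16k mode.)

	#include <stdio.h>

	/* pdshift: log2 of the area covered by one PGD entry;
	 * pshift:  log2 of the hugepage size.
	 */
	static void check(int pdshift, int pshift)
	{
		if (pshift >= pdshift)
			printf("hugepage spans %d PGD entries\n",
			       1 << (pshift - pdshift));
		else
			printf("one PGD entry covers %d hugepages\n",
			       1 << (pdshift - pshift));
	}

	int main(void)
	{
		check(22, 23);	/* 4k mode, 8M pages: spans 2 entries */
		check(22, 19);	/* 4k mode, 512k pages: 8 per entry   */
		check(26, 23);	/* 16k mode, 8M pages: 8 per entry    */
		return 0;
	}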

This patch:
* removes #ifdefs in favor of if/else based on the range sizes
* merges the two huge_pte_alloc() functions as they are pretty similar
* merges the two hugetlbpage_init() functions as they are pretty similar

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
---
v2: This part is new and results from a split of last patch of v1 serie in
two parts

v3:
- Only allocate hugepte_cache on FSL_BOOKE. Not needed on BOOK3S_64
- Removed the BUG in the unused hugepd_free(), made it
static inline {} instead.

 arch/powerpc/mm/hugetlbpage.c | 188 +++++++++++++++++-------------------------
 1 file changed, 76 insertions(+), 112 deletions(-)

Comments

Aneesh Kumar K.V Sept. 21, 2016, 4:58 p.m. UTC | #1
Christophe Leroy <christophe.leroy@c-s.fr> writes:

> Today there are two implementations of hugetlb pages, managed by
> mutually exclusive #ifdefs:
[snip]

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>


>
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> ---

-aneesh
Crystal Wood Nov. 24, 2016, 5:23 a.m. UTC | #2
On Wed, Sep 21, 2016 at 10:11:54AM +0200, Christophe Leroy wrote:
> Today there are two implementations of hugetlb pages, managed by
> mutually exclusive #ifdefs:
[snip]
> 
> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

With this patch on e6500, running the hugetlb testsuite results in the
system hanging in a storm of OOM killer invocations (I'll try to debug
more deeply later).  This patch also changes the default hugepage size on
FSL book3e from 4M to 16M.

-Scott
Christophe Leroy Nov. 25, 2016, 8:14 a.m. UTC | #3
On 24/11/2016 at 06:23, Scott Wood wrote:
> On Wed, Sep 21, 2016 at 10:11:54AM +0200, Christophe Leroy wrote:
>> Today there are two implementations of hugetlb pages, managed by
>> mutually exclusive #ifdefs:
[snip]
>>
>> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
>> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>
> With this patch on e6500, running the hugetlb testsuite results in the
> system hanging in a storm of OOM killer invocations (I'll try to debug
> more deeply later).  This patch also changes the default hugepage size on
> FSL book3e from 4M to 16M.
>

Regarding the default hugepage size, it is a result of the merge of the
two hugetlbpage_init() functions.
Should I add an ifdef to get 4M on FSL book3e by default?
What's the reason for selecting different hugepage sizes depending on
the CPU? I thought the default size was selected based on what already
exists.

What testsuite do you run, exactly?

Christophe
Crystal Wood Dec. 5, 2016, 2:58 a.m. UTC | #4
On Fri, 2016-11-25 at 09:14 +0100, Christophe LEROY wrote:
> On 24/11/2016 at 06:23, Scott Wood wrote:
[snip]
> > With this patch on e6500, running the hugetlb testsuite results in the
> > system hanging in a storm of OOM killer invocations (I'll try to debug
> > more deeply later).  This patch also changes the default hugepage size on
> > FSL book3e from 4M to 16M.
> > 
> Regarding the default hugepage size, it is a result of the merge of the
> two hugetlbpage_init() functions.
> Should I add an ifdef to get 4M on FSL book3e by default?
> What's the reason for selecting different hugepage sizes depending on
> the CPU?

I'm not sure what the original reason was, but a change in defaults could
disturb existing users.

> I thought the default size was selected based on what already exists.

...by code that doesn't expect 4M to ever exist.  Both 4M and 16M (and a bunch
of other sizes) are available on FSL book3e.

What is the reason for preferring 16M over 1M, but preferring 1M over
2M?  Seems arbitrary.

If we want to preserve the existing behavior without an ifdef, we could put a
check for 4M before the other sizes, with a comment explaining why we're
making the selection look even more arbitrary.  Or we could try to figure out
what size actually makes a better default.
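
(Editorial illustration: a minimal sketch of the first option, against the
default-size selection at the end of this patch's hugetlbpage_init(). It
assumes mmu_psize_defs[MMU_PAGE_4M].shift is only ever set on FSL book3e,
so checking 4M first would leave the other platforms unchanged.)

	/* 4M first: preserve the historical FSL book3e default.  Yes,
	 * this makes the preference order look even more arbitrary. */
	if (mmu_psize_defs[MMU_PAGE_4M].shift)
		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_4M].shift;
	else if (mmu_psize_defs[MMU_PAGE_16M].shift)
		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_16M].shift;
	else if (mmu_psize_defs[MMU_PAGE_1M].shift)
		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_1M].shift;
	else if (mmu_psize_defs[MMU_PAGE_2M].shift)
		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_2M].shift;
	else
		panic("%s: Unable to set default huge page size\n", __func__);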

> What testsuite do you run, exactly?

The one that comes with libhugetlbfs (not a particularly recent version, but
not sure exactly how old).

-Scott
Crystal Wood Dec. 6, 2016, 1:18 a.m. UTC | #5
On Wed, 2016-09-21 at 10:11 +0200, Christophe Leroy wrote:
> Today there are two implementations of hugetlb pages, managed by
> mutually exclusive #ifdefs:
[snip]
> @@ -860,16 +803,34 @@ static int __init hugetlbpage_init(void)
>  		 * if we have pdshift and shift value same, we don't
>  		 * use pgt cache for hugepd.
>  		 */
> -		if (pdshift != shift) {
> +		if (pdshift > shift) {
>  			pgtable_cache_add(pdshift - shift, NULL);
>  			if (!PGT_CACHE(pdshift - shift))
>  				panic("hugetlbpage_init(): could not create "
>  				      "pgtable cache for %d bit pagesize\n", shift);
>  		}
> +#ifdef CONFIG_PPC_FSL_BOOK3E
> +		else if (!hugepte_cache) {

This else never triggers on book3e, because the way this function calculates
pdshift is wrong for book3e (it uses PyD_SHIFT instead of HUGEPD_PxD_SHIFT).
We later get OOMs because huge_pte_alloc() calculates pdshift correctly,
tries to use hugepte_cache, and fails.
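
(Editorial illustration: one way to make the two agree would be for
hugetlbpage_init() to pick pdshift the same way huge_pte_alloc() does,
via the HUGEPD_* macros. A sketch only; the init loop itself is outside
the quoted hunk.)

	if (shift >= HUGEPD_PGD_SHIFT)
		pdshift = PGDIR_SHIFT;
	else if (shift >= HUGEPD_PUD_SHIFT)
		pdshift = PUD_SHIFT;
	else
		pdshift = PMD_SHIFT;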

If the point of this patch is to remove the compile-time decision on whether
to do things the book3e way, why are there still ifdefs such as the ones
controlling the definition of HUGEPD_PxD_SHIFT?  How does what you're doing on
8xx (for certain page sizes) differ from book3e?

-Scott
Christophe Leroy Dec. 6, 2016, 6:34 a.m. UTC | #6
On 06/12/2016 at 02:18, Scott Wood wrote:
> On Wed, 2016-09-21 at 10:11 +0200, Christophe Leroy wrote:
>> Today there are two implementations of hugetlb pages, managed by
>> mutually exclusive #ifdefs:
[snip]
>
> This else never triggers on book3e, because the way this function calculates
> pdshift is wrong for book3e (it uses PyD_SHIFT instead of HUGEPD_PxD_SHIFT).
> We later get OOMs because huge_pte_alloc() calculates pdshift correctly,
> tries to use hugepte_cache, and fails.

Ok, I'll check it again; I was expecting it to still work properly on
book3e, because after applying patch 3 it works properly on the 8xx.

>
> If the point of this patch is to remove the compile-time decision on whether
> to do things the book3e way, why are there still ifdefs such as the ones
> controlling the definition of HUGEPD_PxD_SHIFT?  How does what you're doing on
> 8xx (for certain page sizes) differ from book3e?

Some of the things done for book3e are common to the 8xx but differ from
book3s. For that reason, the following patch (3/3) changes, in several
places:
-#ifdef CONFIG_PPC_FSL_BOOK3E
+#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx)

Christophe
Crystal Wood Dec. 7, 2016, 1:06 a.m. UTC | #7
On Tue, 2016-12-06 at 07:34 +0100, Christophe LEROY wrote:
> On 06/12/2016 at 02:18, Scott Wood wrote:
[snip]
> > This else never triggers on book3e, because the way this function
> > calculates pdshift is wrong for book3e (it uses PyD_SHIFT instead of
> > HUGEPD_PxD_SHIFT).  We later get OOMs because huge_pte_alloc()
> > calculates pdshift correctly, tries to use hugepte_cache, and fails.
> Ok, I'll check it again; I was expecting it to still work properly on
> book3e, because after applying patch 3 it works properly on the 8xx.

On 8xx you probably happen to have a page size that yields "pdshift <= shift"
even with the incorrect pdshift calculation, causing hugepte_cache to be
allocated.  The smallest hugepage size on 8xx is 512k, compared to 4M on
fsl-book3e.
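
(Editorial aside: worked numbers for that, assuming PGDIR_SHIFT == 22 on
the 8xx in 4k mode, i.e. 4M per PGD entry as the commit message states:

	8xx, 8M pages:   shift = 23 >= pdshift = 22, so the
	                 hugepte_cache branch runs and the cache is
	                 created; the 8xx happens to work.
	8xx, 512k pages: shift = 19 <  pdshift = 22, PGT_CACHE path.

On e6500 every supported hugepage shift stays below the PyD_SHIFT-based
pdshift, so the hugepte_cache branch never runs at init time even though
huge_pte_alloc() later expects the cache.)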

-Scott
Christophe Leroy Dec. 7, 2016, 6:59 a.m. UTC | #8
On 07/12/2016 at 02:06, Scott Wood wrote:
> On Tue, 2016-12-06 at 07:34 +0100, Christophe LEROY wrote:
[snip]
> On 8xx you probably happen to have a page size that yields "pdshift <= shift"
> even with the incorrect pdshift calculation, causing hugepte_cache to be
> allocated.  The smallest hugepage size on 8xx is 512k, compared to 4M on
> fsl-book3e.

Indeed, it works because on the 8xx, PUD_SHIFT == PMD_SHIFT == PGDIR_SHIFT.

Christophe

Patch

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index a5d3ecd..a0d049a 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -64,14 +64,16 @@  static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 {
 	struct kmem_cache *cachep;
 	pte_t *new;
-
-#ifdef CONFIG_PPC_FSL_BOOK3E
 	int i;
-	int num_hugepd = 1 << (pshift - pdshift);
-	cachep = hugepte_cache;
-#else
-	cachep = PGT_CACHE(pdshift - pshift);
-#endif
+	int num_hugepd;
+
+	if (pshift >= pdshift) {
+		cachep = hugepte_cache;
+		num_hugepd = 1 << (pshift - pdshift);
+	} else {
+		cachep = PGT_CACHE(pdshift - pshift);
+		num_hugepd = 1;
+	}
 
 	new = kmem_cache_zalloc(cachep, GFP_KERNEL);
 
@@ -89,7 +91,7 @@  static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 	smp_wmb();
 
 	spin_lock(&mm->page_table_lock);
-#ifdef CONFIG_PPC_FSL_BOOK3E
+
 	/*
 	 * We have multiple higher-level entries that point to the same
 	 * actual pte location.  Fill in each as we go and backtrack on error.
@@ -100,8 +102,13 @@  static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 		if (unlikely(!hugepd_none(*hpdp)))
 			break;
 		else
+#ifdef CONFIG_PPC_BOOK3S_64
+			hpdp->pd = __pa(new) |
+				   (shift_to_mmu_psize(pshift) << 2);
+#else
 			/* We use the old format for PPC_FSL_BOOK3E */
 			hpdp->pd = ((unsigned long)new & ~PD_HUGE) | pshift;
+#endif
 	}
 	/* If we bailed from the for loop early, an error occurred, clean up */
 	if (i < num_hugepd) {
@@ -109,17 +116,6 @@  static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			hpdp->pd = 0;
 		kmem_cache_free(cachep, new);
 	}
-#else
-	if (!hugepd_none(*hpdp))
-		kmem_cache_free(cachep, new);
-	else {
-#ifdef CONFIG_PPC_BOOK3S_64
-		hpdp->pd = __pa(new) | (shift_to_mmu_psize(pshift) << 2);
-#else
-		hpdp->pd = ((unsigned long)new & ~PD_HUGE) | pshift;
-#endif
-	}
-#endif
 	spin_unlock(&mm->page_table_lock);
 	return 0;
 }
@@ -136,7 +132,6 @@  static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 #define HUGEPD_PUD_SHIFT PMD_SHIFT
 #endif
 
-#ifdef CONFIG_PPC_BOOK3S_64
 /*
  * At this point we do the placement change only for BOOK3S 64. This would
  * possibly work on other subarchs.
@@ -153,6 +148,7 @@  pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 	addr &= ~(sz-1);
 	pg = pgd_offset(mm, addr);
 
+#ifdef CONFIG_PPC_BOOK3S_64
 	if (pshift == PGDIR_SHIFT)
 		/* 16GB huge page */
 		return (pte_t *) pg;
@@ -178,32 +174,7 @@  pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 				hpdp = (hugepd_t *)pm;
 		}
 	}
-	if (!hpdp)
-		return NULL;
-
-	BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
-
-	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, pdshift, pshift))
-		return NULL;
-
-	return hugepte_offset(*hpdp, addr, pdshift);
-}
-
 #else
-
-pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
-{
-	pgd_t *pg;
-	pud_t *pu;
-	pmd_t *pm;
-	hugepd_t *hpdp = NULL;
-	unsigned pshift = __ffs(sz);
-	unsigned pdshift = PGDIR_SHIFT;
-
-	addr &= ~(sz-1);
-
-	pg = pgd_offset(mm, addr);
-
 	if (pshift >= HUGEPD_PGD_SHIFT) {
 		hpdp = (hugepd_t *)pg;
 	} else {
@@ -217,7 +188,7 @@  pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 			hpdp = (hugepd_t *)pm;
 		}
 	}
-
+#endif
 	if (!hpdp)
 		return NULL;
 
@@ -228,7 +199,6 @@  pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 
 	return hugepte_offset(*hpdp, addr, pdshift);
 }
-#endif
 
 #ifdef CONFIG_PPC_FSL_BOOK3E
 /* Build list of addresses of gigantic pages.  This function is used in early
@@ -310,7 +280,11 @@  static int __init do_gpage_early_setup(char *param, char *val,
 				npages = 0;
 			if (npages > MAX_NUMBER_GPAGES) {
 				pr_warn("MMU: %lu pages requested for page "
+#ifdef CONFIG_PHYS_ADDR_T_64BIT
 					"size %llu KB, limiting to "
+#else
+					"size %u KB, limiting to "
+#endif
 					__stringify(MAX_NUMBER_GPAGES) "\n",
 					npages, size / 1024);
 				npages = MAX_NUMBER_GPAGES;
@@ -442,6 +416,8 @@  static void hugepd_free(struct mmu_gather *tlb, void *hugepte)
 	}
 	put_cpu_var(hugepd_freelist_cur);
 }
+#else
+static inline void hugepd_free(struct mmu_gather *tlb, void *hugepte) {}
 #endif
 
 static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshift,
@@ -453,13 +429,11 @@  static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshif
 
 	unsigned long pdmask = ~((1UL << pdshift) - 1);
 	unsigned int num_hugepd = 1;
+	unsigned int shift = hugepd_shift(*hpdp);
 
-#ifdef CONFIG_PPC_FSL_BOOK3E
 	/* Note: On fsl the hpdp may be the first of several */
-	num_hugepd = (1 << (hugepd_shift(*hpdp) - pdshift));
-#else
-	unsigned int shift = hugepd_shift(*hpdp);
-#endif
+	if (shift > pdshift)
+		num_hugepd = 1 << (shift - pdshift);
 
 	start &= pdmask;
 	if (start < floor)
@@ -475,11 +449,10 @@  static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshif
 	for (i = 0; i < num_hugepd; i++, hpdp++)
 		hpdp->pd = 0;
 
-#ifdef CONFIG_PPC_FSL_BOOK3E
-	hugepd_free(tlb, hugepte);
-#else
-	pgtable_free_tlb(tlb, hugepte, pdshift - shift);
-#endif
+	if (shift >= pdshift)
+		hugepd_free(tlb, hugepte);
+	else
+		pgtable_free_tlb(tlb, hugepte, pdshift - shift);
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
@@ -492,6 +465,8 @@  static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 
 	start = addr;
 	do {
+		unsigned long more;
+
 		pmd = pmd_offset(pud, addr);
 		next = pmd_addr_end(addr, end);
 		if (!is_hugepd(__hugepd(pmd_val(*pmd)))) {
@@ -502,15 +477,16 @@  static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 			WARN_ON(!pmd_none_or_clear_bad(pmd));
 			continue;
 		}
-#ifdef CONFIG_PPC_FSL_BOOK3E
 		/*
 		 * Increment next by the size of the huge mapping since
 		 * there may be more than one entry at this level for a
 		 * single hugepage, but all of them point to
 		 * the same kmem cache that holds the hugepte.
 		 */
-		next = addr + (1 << hugepd_shift(*(hugepd_t *)pmd));
-#endif
+		more = addr + (1 << hugepd_shift(*(hugepd_t *)pmd));
+		if (more > next)
+			next = more;
+
 		free_hugepd_range(tlb, (hugepd_t *)pmd, PMD_SHIFT,
 				  addr, next, floor, ceiling);
 	} while (addr = next, addr != end);
@@ -550,15 +526,17 @@  static void hugetlb_free_pud_range(struct mmu_gather *tlb, pgd_t *pgd,
 			hugetlb_free_pmd_range(tlb, pud, addr, next, floor,
 					       ceiling);
 		} else {
-#ifdef CONFIG_PPC_FSL_BOOK3E
+			unsigned long more;
 			/*
 			 * Increment next by the size of the huge mapping since
 			 * there may be more than one entry at this level for a
 			 * single hugepage, but all of them point to
 			 * the same kmem cache that holds the hugepte.
 			 */
-			next = addr + (1 << hugepd_shift(*(hugepd_t *)pud));
-#endif
+			more = addr + (1 << hugepd_shift(*(hugepd_t *)pud));
+			if (more > next)
+				next = more;
+
 			free_hugepd_range(tlb, (hugepd_t *)pud, PUD_SHIFT,
 					  addr, next, floor, ceiling);
 		}
@@ -615,15 +593,17 @@  void hugetlb_free_pgd_range(struct mmu_gather *tlb,
 				continue;
 			hugetlb_free_pud_range(tlb, pgd, addr, next, floor, ceiling);
 		} else {
-#ifdef CONFIG_PPC_FSL_BOOK3E
+			unsigned long more;
 			/*
 			 * Increment next by the size of the huge mapping since
 			 * there may be more than one entry at the pgd level
 			 * for a single hugepage, but all of them point to the
 			 * same kmem cache that holds the hugepte.
 			 */
-			next = addr + (1 << hugepd_shift(*(hugepd_t *)pgd));
-#endif
+			more = addr + (1 << hugepd_shift(*(hugepd_t *)pgd));
+			if (more > next)
+				next = more;
+
 			free_hugepd_range(tlb, (hugepd_t *)pgd, PGDIR_SHIFT,
 					  addr, next, floor, ceiling);
 		}
@@ -753,12 +733,13 @@  static int __init add_huge_page_size(unsigned long long size)
 
 	/* Check that it is a page size supported by the hardware and
 	 * that it fits within pagetable and slice limits. */
+	if (size <= PAGE_SIZE)
+		return -EINVAL;
 #ifdef CONFIG_PPC_FSL_BOOK3E
-	if ((size < PAGE_SIZE) || !is_power_of_4(size))
+	if (!is_power_of_4(size))
 		return -EINVAL;
 #else
-	if (!is_power_of_2(size)
-	    || (shift > SLICE_HIGH_SHIFT) || (shift <= PAGE_SHIFT))
+	if (!is_power_of_2(size) || (shift > SLICE_HIGH_SHIFT))
 		return -EINVAL;
 #endif
 
@@ -791,53 +772,15 @@  static int __init hugepage_setup_sz(char *str)
 }
 __setup("hugepagesz=", hugepage_setup_sz);
 
-#ifdef CONFIG_PPC_FSL_BOOK3E
 struct kmem_cache *hugepte_cache;
 static int __init hugetlbpage_init(void)
 {
 	int psize;
 
-	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
-		unsigned shift;
-
-		if (!mmu_psize_defs[psize].shift)
-			continue;
-
-		shift = mmu_psize_to_shift(psize);
-
-		/* Don't treat normal page sizes as huge... */
-		if (shift != PAGE_SHIFT)
-			if (add_huge_page_size(1ULL << shift) < 0)
-				continue;
-	}
-
-	/*
-	 * Create a kmem cache for hugeptes.  The bottom bits in the pte have
-	 * size information encoded in them, so align them to allow this
-	 */
-	hugepte_cache =  kmem_cache_create("hugepte-cache", sizeof(pte_t),
-					   HUGEPD_SHIFT_MASK + 1, 0, NULL);
-	if (hugepte_cache == NULL)
-		panic("%s: Unable to create kmem cache for hugeptes\n",
-		      __func__);
-
-	/* Default hpage size = 4M */
-	if (mmu_psize_defs[MMU_PAGE_4M].shift)
-		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_4M].shift;
-	else
-		panic("%s: Unable to set default huge page size\n", __func__);
-
-
-	return 0;
-}
-#else
-static int __init hugetlbpage_init(void)
-{
-	int psize;
-
+#if !defined(CONFIG_PPC_FSL_BOOK3E)
 	if (!radix_enabled() && !mmu_has_feature(MMU_FTR_16M_PAGE))
 		return -ENODEV;
-
+#endif
 	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
 		unsigned shift;
 		unsigned pdshift;
@@ -860,16 +803,34 @@  static int __init hugetlbpage_init(void)
 		 * if we have pdshift and shift value same, we don't
 		 * use pgt cache for hugepd.
 		 */
-		if (pdshift != shift) {
+		if (pdshift > shift) {
 			pgtable_cache_add(pdshift - shift, NULL);
 			if (!PGT_CACHE(pdshift - shift))
 				panic("hugetlbpage_init(): could not create "
 				      "pgtable cache for %d bit pagesize\n", shift);
 		}
+#ifdef CONFIG_PPC_FSL_BOOK3E
+		else if (!hugepte_cache) {
+			/*
+			 * Create a kmem cache for hugeptes.  The bottom bits in
+			 * the pte have size information encoded in them, so
+			 * align them to allow this
+			 */
+			hugepte_cache = kmem_cache_create("hugepte-cache",
+							  sizeof(pte_t),
+							  HUGEPD_SHIFT_MASK + 1,
+							  0, NULL);
+			if (hugepte_cache == NULL)
+				panic("%s: Unable to create kmem cache "
+				      "for hugeptes\n", __func__);
+
+		}
+#endif
 	}
 
 	/* Set default large page size. Currently, we pick 16M or 1M
 	 * depending on what is available
+	 * We select 4M on other ones.
 	 */
 	if (mmu_psize_defs[MMU_PAGE_16M].shift)
 		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_16M].shift;
@@ -877,11 +838,14 @@  static int __init hugetlbpage_init(void)
 		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_1M].shift;
 	else if (mmu_psize_defs[MMU_PAGE_2M].shift)
 		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_2M].shift;
-
+	else if (mmu_psize_defs[MMU_PAGE_4M].shift)
+		HPAGE_SHIFT = mmu_psize_defs[MMU_PAGE_4M].shift;
+	else
+		panic("%s: Unable to set default huge page size\n", __func__);
 
 	return 0;
 }
-#endif
+
 arch_initcall(hugetlbpage_init);
 
 void flush_dcache_icache_hugepage(struct page *page)