mbox series

[RFC,0/8] Reimplement huge pages without hugepd on powerpc 8xx

Message ID cover.1711377230.git.christophe.leroy@csgroup.eu (mailing list archive)
Headers show
Series Reimplement huge pages without hugepd on powerpc 8xx | expand

Message

Christophe Leroy March 25, 2024, 2:55 p.m. UTC
This series reimplements hugepages with hugepd on powerpc 8xx.

Unlike most architectures, powerpc 8xx HW requires a two-level
pagetable topology for all page sizes. So a leaf PMD-contig approach
is not feasible as such.

Possible sizes are 4k, 16k, 512k and 8M.

First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
must point to a single entry level-2 page table. Until now that was
done using hugepd. This series changes it to use standard page tables
where the entry is replicated 1024 times on each of the two pagetables
refered by the two associated PMD entries for that 8M page.

At the moment it has to look into each helper to know if the
hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
a lower size. I hope this can me handled by core-mm in the future.

There are probably several ways to implement stuff, so feedback is
very welcome.

Christophe Leroy (8):
  mm: Provide pagesize to pmd_populate()
  mm: Provide page size to pte_alloc_huge()
  mm: Provide pmd to pte_leaf_size()
  mm: Provide mm_struct and address to huge_ptep_get()
  powerpc/mm: Allow hugepages without hugepd
  powerpc/8xx: Fix size given to set_huge_pte_at()
  powerpc/8xx: Remove support for 8M pages
  powerpc/8xx: Add back support for 8M pages using contiguous PTE
    entries

 arch/arm64/include/asm/hugetlb.h              |  2 +-
 arch/arm64/include/asm/pgtable.h              |  2 +-
 arch/arm64/mm/hugetlbpage.c                   |  2 +-
 arch/parisc/mm/hugetlbpage.c                  |  2 +-
 arch/powerpc/Kconfig                          |  1 -
 arch/powerpc/include/asm/hugetlb.h            | 13 +++-
 .../include/asm/nohash/32/hugetlb-8xx.h       | 54 ++++++++---------
 arch/powerpc/include/asm/nohash/32/pgalloc.h  |  2 +
 arch/powerpc/include/asm/nohash/32/pte-8xx.h  | 59 +++++++++++++------
 arch/powerpc/include/asm/nohash/pgtable.h     | 12 ++--
 arch/powerpc/include/asm/page.h               |  5 --
 arch/powerpc/include/asm/pgtable.h            |  1 +
 arch/powerpc/kernel/head_8xx.S                | 10 +---
 arch/powerpc/mm/hugetlbpage.c                 | 23 +++++++-
 arch/powerpc/mm/nohash/8xx.c                  | 46 +++++++--------
 arch/powerpc/mm/pgtable.c                     | 26 +++++---
 arch/powerpc/mm/pgtable_32.c                  |  2 +-
 arch/powerpc/platforms/Kconfig.cputype        |  2 +
 arch/riscv/include/asm/pgtable.h              |  2 +-
 arch/riscv/mm/hugetlbpage.c                   |  2 +-
 arch/sh/mm/hugetlbpage.c                      |  2 +-
 arch/sparc/include/asm/pgtable_64.h           |  2 +-
 arch/sparc/mm/hugetlbpage.c                   |  4 +-
 fs/hugetlbfs/inode.c                          |  2 +-
 fs/proc/task_mmu.c                            |  8 +--
 fs/userfaultfd.c                              |  2 +-
 include/asm-generic/hugetlb.h                 |  2 +-
 include/linux/hugetlb.h                       |  4 +-
 include/linux/mm.h                            | 12 ++--
 include/linux/pgtable.h                       |  2 +-
 include/linux/swapops.h                       |  2 +-
 kernel/events/core.c                          |  2 +-
 mm/damon/vaddr.c                              |  6 +-
 mm/filemap.c                                  |  2 +-
 mm/gup.c                                      |  2 +-
 mm/hmm.c                                      |  2 +-
 mm/hugetlb.c                                  | 46 +++++++--------
 mm/internal.h                                 |  2 +-
 mm/memory-failure.c                           |  2 +-
 mm/memory.c                                   | 19 +++---
 mm/mempolicy.c                                |  2 +-
 mm/migrate.c                                  |  4 +-
 mm/mincore.c                                  |  2 +-
 mm/pgalloc-track.h                            |  2 +-
 mm/userfaultfd.c                              |  6 +-
 45 files changed, 229 insertions(+), 180 deletions(-)

Comments

Jason Gunthorpe March 25, 2024, 4:38 p.m. UTC | #1
On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> This series reimplements hugepages with hugepd on powerpc 8xx.
> 
> Unlike most architectures, powerpc 8xx HW requires a two-level
> pagetable topology for all page sizes. So a leaf PMD-contig approach
> is not feasible as such.
> 
> Possible sizes are 4k, 16k, 512k and 8M.
> 
> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> must point to a single entry level-2 page table. Until now that was
> done using hugepd. This series changes it to use standard page tables
> where the entry is replicated 1024 times on each of the two pagetables
> refered by the two associated PMD entries for that 8M page.
> 
> At the moment it has to look into each helper to know if the
> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
> a lower size. I hope this can me handled by core-mm in the future.
> 
> There are probably several ways to implement stuff, so feedback is
> very welcome.

I thought it looks pretty good!

Thanks,
Jason
Peter Xu April 11, 2024, 4:15 p.m. UTC | #2
On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> > This series reimplements hugepages with hugepd on powerpc 8xx.
> > 
> > Unlike most architectures, powerpc 8xx HW requires a two-level
> > pagetable topology for all page sizes. So a leaf PMD-contig approach
> > is not feasible as such.
> > 
> > Possible sizes are 4k, 16k, 512k and 8M.
> > 
> > First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> > must point to a single entry level-2 page table. Until now that was
> > done using hugepd. This series changes it to use standard page tables
> > where the entry is replicated 1024 times on each of the two pagetables
> > refered by the two associated PMD entries for that 8M page.
> > 
> > At the moment it has to look into each helper to know if the
> > hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
> > a lower size. I hope this can me handled by core-mm in the future.
> > 
> > There are probably several ways to implement stuff, so feedback is
> > very welcome.
> 
> I thought it looks pretty good!

I second it.

I saw the discussions in patch 1.  Christophe, I suppose you're exploring
the big hammer over hugepd, and perhaps went already with the 32bit pmd
solution for nohash/32bit challenge you mentioned?

I'm trying to position my next step; it seems like at least I should not
adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD checks,
or you're going to have an RFC soon then I can base on top?

Thanks,
Christophe Leroy April 12, 2024, 2:08 p.m. UTC | #3
Le 11/04/2024 à 18:15, Peter Xu a écrit :
> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>>> This series reimplements hugepages with hugepd on powerpc 8xx.
>>>
>>> Unlike most architectures, powerpc 8xx HW requires a two-level
>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>>> is not feasible as such.
>>>
>>> Possible sizes are 4k, 16k, 512k and 8M.
>>>
>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
>>> must point to a single entry level-2 page table. Until now that was
>>> done using hugepd. This series changes it to use standard page tables
>>> where the entry is replicated 1024 times on each of the two pagetables
>>> refered by the two associated PMD entries for that 8M page.
>>>
>>> At the moment it has to look into each helper to know if the
>>> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
>>> a lower size. I hope this can me handled by core-mm in the future.
>>>
>>> There are probably several ways to implement stuff, so feedback is
>>> very welcome.
>>
>> I thought it looks pretty good!
> 
> I second it.
> 
> I saw the discussions in patch 1.  Christophe, I suppose you're exploring
> the big hammer over hugepd, and perhaps went already with the 32bit pmd
> solution for nohash/32bit challenge you mentioned?
> 
> I'm trying to position my next step; it seems like at least I should not
> adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD checks,
> or you're going to have an RFC soon then I can base on top?

Depends on what you expect by "soon".

I sure won't be able to send any RFC before end of April.

Should be possible to have something during May.

Christophe
Peter Xu April 12, 2024, 2:30 p.m. UTC | #4
On Fri, Apr 12, 2024 at 02:08:03PM +0000, Christophe Leroy wrote:
> 
> 
> Le 11/04/2024 à 18:15, Peter Xu a écrit :
> > On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
> >> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> >>> This series reimplements hugepages with hugepd on powerpc 8xx.
> >>>
> >>> Unlike most architectures, powerpc 8xx HW requires a two-level
> >>> pagetable topology for all page sizes. So a leaf PMD-contig approach
> >>> is not feasible as such.
> >>>
> >>> Possible sizes are 4k, 16k, 512k and 8M.
> >>>
> >>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> >>> must point to a single entry level-2 page table. Until now that was
> >>> done using hugepd. This series changes it to use standard page tables
> >>> where the entry is replicated 1024 times on each of the two pagetables
> >>> refered by the two associated PMD entries for that 8M page.
> >>>
> >>> At the moment it has to look into each helper to know if the
> >>> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
> >>> a lower size. I hope this can me handled by core-mm in the future.
> >>>
> >>> There are probably several ways to implement stuff, so feedback is
> >>> very welcome.
> >>
> >> I thought it looks pretty good!
> > 
> > I second it.
> > 
> > I saw the discussions in patch 1.  Christophe, I suppose you're exploring
> > the big hammer over hugepd, and perhaps went already with the 32bit pmd
> > solution for nohash/32bit challenge you mentioned?
> > 
> > I'm trying to position my next step; it seems like at least I should not
> > adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD checks,
> > or you're going to have an RFC soon then I can base on top?
> 
> Depends on what you expect by "soon".
> 
> I sure won't be able to send any RFC before end of April.
> 
> Should be possible to have something during May.

That's good enough, thanks.  I'll see what is the best I can do.

Then do you think I can leave p4d/pgd leaves alone?  Please check the other
email where I'm not sure whether pgd leaves ever existed for any of
PowerPC.  That's so far what I plan to do, on teaching pgtable walkers
recognize pud and lower for all leaves.  Then if Power can switch from
hugepd to this it should just work.

Even if pgd exists (then something I overlooked..), I'm wondering whether
we can push that downwards to be either pud/pmd (and looks like we all
agree p4d is never used on Power).  That may involve some pgtable
operations moving from pgd level to lower, e.g. my pure imagination would
look like starting with:

#define PTE_INDEX_SIZE	PTE_SHIFT
#define PMD_INDEX_SIZE	0
#define PUD_INDEX_SIZE	0
#define PGD_INDEX_SIZE	(32 - PGDIR_SHIFT)

To:

#define PTE_INDEX_SIZE	PTE_SHIFT
#define PMD_INDEX_SIZE	(32 - PMD_SHIFT)
#define PUD_INDEX_SIZE	0
#define PGD_INDEX_SIZE	0

And the rest will need care too.  I hope moving downward is easier
(e.g. the walker should always exist for lower levels but not always for
higher levels), but I actually have little idea on whether there's any
other implications, so please bare with me on stupid mistakes.

I just hope pgd leaves don't exist already, then I think it'll be simpler.

Thanks,
Christophe Leroy April 15, 2024, 7:12 p.m. UTC | #5
Le 12/04/2024 à 16:30, Peter Xu a écrit :
> On Fri, Apr 12, 2024 at 02:08:03PM +0000, Christophe Leroy wrote:
>>
>>
>> Le 11/04/2024 à 18:15, Peter Xu a écrit :
>>> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
>>>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>>>>> This series reimplements hugepages with hugepd on powerpc 8xx.
>>>>>
>>>>> Unlike most architectures, powerpc 8xx HW requires a two-level
>>>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>>>>> is not feasible as such.
>>>>>
>>>>> Possible sizes are 4k, 16k, 512k and 8M.
>>>>>
>>>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
>>>>> must point to a single entry level-2 page table. Until now that was
>>>>> done using hugepd. This series changes it to use standard page tables
>>>>> where the entry is replicated 1024 times on each of the two pagetables
>>>>> refered by the two associated PMD entries for that 8M page.
>>>>>
>>>>> At the moment it has to look into each helper to know if the
>>>>> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
>>>>> a lower size. I hope this can me handled by core-mm in the future.
>>>>>
>>>>> There are probably several ways to implement stuff, so feedback is
>>>>> very welcome.
>>>>
>>>> I thought it looks pretty good!
>>>
>>> I second it.
>>>
>>> I saw the discussions in patch 1.  Christophe, I suppose you're exploring
>>> the big hammer over hugepd, and perhaps went already with the 32bit pmd
>>> solution for nohash/32bit challenge you mentioned?
>>>
>>> I'm trying to position my next step; it seems like at least I should not
>>> adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD checks,
>>> or you're going to have an RFC soon then I can base on top?
>>
>> Depends on what you expect by "soon".
>>
>> I sure won't be able to send any RFC before end of April.
>>
>> Should be possible to have something during May.
> 
> That's good enough, thanks.  I'll see what is the best I can do.
> 
> Then do you think I can leave p4d/pgd leaves alone?  Please check the other
> email where I'm not sure whether pgd leaves ever existed for any of
> PowerPC.  That's so far what I plan to do, on teaching pgtable walkers
> recognize pud and lower for all leaves.  Then if Power can switch from
> hugepd to this it should just work.

Well, if I understand correctly, something with no PMD will include 
<asm-generic/pgtable-nopmd.h> and will therefore only have .... pmd 
entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is. 
pgd_populate(), p4d_populate(), pud_populate() are all "do { } while 
(0)" and only pmd_populate exists. So only pmd_leaf() will exist in that 
case.

And therefore including <asm-generic/pgtable-nop4d.h> means .... you 
have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be 
checked.


> 
> Even if pgd exists (then something I overlooked..), I'm wondering whether
> we can push that downwards to be either pud/pmd (and looks like we all
> agree p4d is never used on Power).  That may involve some pgtable
> operations moving from pgd level to lower, e.g. my pure imagination would
> look like starting with:

Yes I think there is no doubt that p4d is never used:

arch/powerpc/include/asm/book3s/32/pgtable.h:#include 
<asm-generic/pgtable-nopmd.h>
arch/powerpc/include/asm/book3s/64/pgtable.h:#include 
<asm-generic/pgtable-nop4d.h>
arch/powerpc/include/asm/nohash/32/pgtable.h:#include 
<asm-generic/pgtable-nopmd.h>
arch/powerpc/include/asm/nohash/64/pgtable-4k.h:#include 
<asm-generic/pgtable-nop4d.h>

But that means that PPC32 have pmd entries and PPC64 have p4d entries ...

> 
> #define PTE_INDEX_SIZE	PTE_SHIFT
> #define PMD_INDEX_SIZE	0
> #define PUD_INDEX_SIZE	0
> #define PGD_INDEX_SIZE	(32 - PGDIR_SHIFT)
> 
> To:
> 
> #define PTE_INDEX_SIZE	PTE_SHIFT
> #define PMD_INDEX_SIZE	(32 - PMD_SHIFT)
> #define PUD_INDEX_SIZE	0
> #define PGD_INDEX_SIZE	0

But then you can't anymore have #define PTRS_PER_PMD 1 from 
<asm-generic/pgtable-nop4d.h>

> 
> And the rest will need care too.  I hope moving downward is easier
> (e.g. the walker should always exist for lower levels but not always for
> higher levels), but I actually have little idea on whether there's any
> other implications, so please bare with me on stupid mistakes.
> 
> I just hope pgd leaves don't exist already, then I think it'll be simpler.
> 
> Thanks,
>
Christophe Leroy April 16, 2024, 10:58 a.m. UTC | #6
Le 15/04/2024 à 21:12, Christophe Leroy a écrit :
> 
> 
> Le 12/04/2024 à 16:30, Peter Xu a écrit :
>> On Fri, Apr 12, 2024 at 02:08:03PM +0000, Christophe Leroy wrote:
>>>
>>>
>>> Le 11/04/2024 à 18:15, Peter Xu a écrit :
>>>> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
>>>>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>>>>>> This series reimplements hugepages with hugepd on powerpc 8xx.
>>>>>>
>>>>>> Unlike most architectures, powerpc 8xx HW requires a two-level
>>>>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>>>>>> is not feasible as such.
>>>>>>
>>>>>> Possible sizes are 4k, 16k, 512k and 8M.
>>>>>>
>>>>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD 
>>>>>> entries
>>>>>> must point to a single entry level-2 page table. Until now that was
>>>>>> done using hugepd. This series changes it to use standard page tables
>>>>>> where the entry is replicated 1024 times on each of the two 
>>>>>> pagetables
>>>>>> refered by the two associated PMD entries for that 8M page.
>>>>>>
>>>>>> At the moment it has to look into each helper to know if the
>>>>>> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
>>>>>> a lower size. I hope this can me handled by core-mm in the future.
>>>>>>
>>>>>> There are probably several ways to implement stuff, so feedback is
>>>>>> very welcome.
>>>>>
>>>>> I thought it looks pretty good!
>>>>
>>>> I second it.
>>>>
>>>> I saw the discussions in patch 1.  Christophe, I suppose you're 
>>>> exploring
>>>> the big hammer over hugepd, and perhaps went already with the 32bit pmd
>>>> solution for nohash/32bit challenge you mentioned?
>>>>
>>>> I'm trying to position my next step; it seems like at least I should 
>>>> not
>>>> adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD 
>>>> checks,
>>>> or you're going to have an RFC soon then I can base on top?
>>>
>>> Depends on what you expect by "soon".
>>>
>>> I sure won't be able to send any RFC before end of April.
>>>
>>> Should be possible to have something during May.
>>
>> That's good enough, thanks.  I'll see what is the best I can do.
>>
>> Then do you think I can leave p4d/pgd leaves alone?  Please check the 
>> other
>> email where I'm not sure whether pgd leaves ever existed for any of
>> PowerPC.  That's so far what I plan to do, on teaching pgtable walkers
>> recognize pud and lower for all leaves.  Then if Power can switch from
>> hugepd to this it should just work.
> 
> Well, if I understand correctly, something with no PMD will include 
> <asm-generic/pgtable-nopmd.h> and will therefore only have .... pmd 
> entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is. 
> pgd_populate(), p4d_populate(), pud_populate() are all "do { } while 
> (0)" and only pmd_populate exists. So only pmd_leaf() will exist in that 
> case.
> 
> And therefore including <asm-generic/pgtable-nop4d.h> means .... you 
> have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be 
> checked.
> 
> 
>>
>> Even if pgd exists (then something I overlooked..), I'm wondering whether
>> we can push that downwards to be either pud/pmd (and looks like we all
>> agree p4d is never used on Power).  That may involve some pgtable
>> operations moving from pgd level to lower, e.g. my pure imagination would
>> look like starting with:
> 
> Yes I think there is no doubt that p4d is never used:
> 
> arch/powerpc/include/asm/book3s/32/pgtable.h:#include 
> <asm-generic/pgtable-nopmd.h>
> arch/powerpc/include/asm/book3s/64/pgtable.h:#include 
> <asm-generic/pgtable-nop4d.h>
> arch/powerpc/include/asm/nohash/32/pgtable.h:#include 
> <asm-generic/pgtable-nopmd.h>
> arch/powerpc/include/asm/nohash/64/pgtable-4k.h:#include 
> <asm-generic/pgtable-nop4d.h>
> 
> But that means that PPC32 have pmd entries and PPC64 have p4d entries ...
> 
>>
>> #define PTE_INDEX_SIZE    PTE_SHIFT
>> #define PMD_INDEX_SIZE    0
>> #define PUD_INDEX_SIZE    0
>> #define PGD_INDEX_SIZE    (32 - PGDIR_SHIFT)
>>
>> To:
>>
>> #define PTE_INDEX_SIZE    PTE_SHIFT
>> #define PMD_INDEX_SIZE    (32 - PMD_SHIFT)
>> #define PUD_INDEX_SIZE    0
>> #define PGD_INDEX_SIZE    0
> 
> But then you can't anymore have #define PTRS_PER_PMD 1 from 
> <asm-generic/pgtable-nop4d.h>
> 
>>
>> And the rest will need care too.  I hope moving downward is easier
>> (e.g. the walker should always exist for lower levels but not always for
>> higher levels), but I actually have little idea on whether there's any
>> other implications, so please bare with me on stupid mistakes.
>>
>> I just hope pgd leaves don't exist already, then I think it'll be 
>> simpler.
>>
>> Thanks,
>>

Digging into asm-generic/pgtable-nopmd.h, I see a definition of 
pud_leaf() always returning 0, introduced by commit 2c8a81dc0cc5 
("riscv/mm: fix two page table check related issues")

So should asm-generic/pgtable-nopud.h contain the same for p4d_leaf() 
and asm-generic/pgtable-nop4d.h contain the same for pgd_leaf() ?

Christophe
Peter Xu April 16, 2024, 7:40 p.m. UTC | #7
On Tue, Apr 16, 2024 at 10:58:33AM +0000, Christophe Leroy wrote:
> 
> 
> Le 15/04/2024 à 21:12, Christophe Leroy a écrit :
> > 
> > 
> > Le 12/04/2024 à 16:30, Peter Xu a écrit :
> >> On Fri, Apr 12, 2024 at 02:08:03PM +0000, Christophe Leroy wrote:
> >>>
> >>>
> >>> Le 11/04/2024 à 18:15, Peter Xu a écrit :
> >>>> On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
> >>>>> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> >>>>>> This series reimplements hugepages with hugepd on powerpc 8xx.
> >>>>>>
> >>>>>> Unlike most architectures, powerpc 8xx HW requires a two-level
> >>>>>> pagetable topology for all page sizes. So a leaf PMD-contig approach
> >>>>>> is not feasible as such.
> >>>>>>
> >>>>>> Possible sizes are 4k, 16k, 512k and 8M.
> >>>>>>
> >>>>>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD 
> >>>>>> entries
> >>>>>> must point to a single entry level-2 page table. Until now that was
> >>>>>> done using hugepd. This series changes it to use standard page tables
> >>>>>> where the entry is replicated 1024 times on each of the two 
> >>>>>> pagetables
> >>>>>> refered by the two associated PMD entries for that 8M page.
> >>>>>>
> >>>>>> At the moment it has to look into each helper to know if the
> >>>>>> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
> >>>>>> a lower size. I hope this can me handled by core-mm in the future.
> >>>>>>
> >>>>>> There are probably several ways to implement stuff, so feedback is
> >>>>>> very welcome.
> >>>>>
> >>>>> I thought it looks pretty good!
> >>>>
> >>>> I second it.
> >>>>
> >>>> I saw the discussions in patch 1.  Christophe, I suppose you're 
> >>>> exploring
> >>>> the big hammer over hugepd, and perhaps went already with the 32bit pmd
> >>>> solution for nohash/32bit challenge you mentioned?
> >>>>
> >>>> I'm trying to position my next step; it seems like at least I should 
> >>>> not
> >>>> adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD 
> >>>> checks,
> >>>> or you're going to have an RFC soon then I can base on top?
> >>>
> >>> Depends on what you expect by "soon".
> >>>
> >>> I sure won't be able to send any RFC before end of April.
> >>>
> >>> Should be possible to have something during May.
> >>
> >> That's good enough, thanks.  I'll see what is the best I can do.
> >>
> >> Then do you think I can leave p4d/pgd leaves alone?  Please check the 
> >> other
> >> email where I'm not sure whether pgd leaves ever existed for any of
> >> PowerPC.  That's so far what I plan to do, on teaching pgtable walkers
> >> recognize pud and lower for all leaves.  Then if Power can switch from
> >> hugepd to this it should just work.
> > 
> > Well, if I understand correctly, something with no PMD will include 
> > <asm-generic/pgtable-nopmd.h> and will therefore only have .... pmd 
> > entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is. 
> > pgd_populate(), p4d_populate(), pud_populate() are all "do { } while 
> > (0)" and only pmd_populate exists. So only pmd_leaf() will exist in that 
> > case.
> > 
> > And therefore including <asm-generic/pgtable-nop4d.h> means .... you 
> > have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be 
> > checked.
> > 
> > 
> >>
> >> Even if pgd exists (then something I overlooked..), I'm wondering whether
> >> we can push that downwards to be either pud/pmd (and looks like we all
> >> agree p4d is never used on Power).  That may involve some pgtable
> >> operations moving from pgd level to lower, e.g. my pure imagination would
> >> look like starting with:
> > 
> > Yes I think there is no doubt that p4d is never used:
> > 
> > arch/powerpc/include/asm/book3s/32/pgtable.h:#include 
> > <asm-generic/pgtable-nopmd.h>
> > arch/powerpc/include/asm/book3s/64/pgtable.h:#include 
> > <asm-generic/pgtable-nop4d.h>
> > arch/powerpc/include/asm/nohash/32/pgtable.h:#include 
> > <asm-generic/pgtable-nopmd.h>
> > arch/powerpc/include/asm/nohash/64/pgtable-4k.h:#include 
> > <asm-generic/pgtable-nop4d.h>
> > 
> > But that means that PPC32 have pmd entries and PPC64 have p4d entries ...
> > 
> >>
> >> #define PTE_INDEX_SIZE    PTE_SHIFT
> >> #define PMD_INDEX_SIZE    0
> >> #define PUD_INDEX_SIZE    0
> >> #define PGD_INDEX_SIZE    (32 - PGDIR_SHIFT)
> >>
> >> To:
> >>
> >> #define PTE_INDEX_SIZE    PTE_SHIFT
> >> #define PMD_INDEX_SIZE    (32 - PMD_SHIFT)
> >> #define PUD_INDEX_SIZE    0
> >> #define PGD_INDEX_SIZE    0
> > 
> > But then you can't anymore have #define PTRS_PER_PMD 1 from 
> > <asm-generic/pgtable-nop4d.h>
> > 
> >>
> >> And the rest will need care too.  I hope moving downward is easier
> >> (e.g. the walker should always exist for lower levels but not always for
> >> higher levels), but I actually have little idea on whether there's any
> >> other implications, so please bare with me on stupid mistakes.
> >>
> >> I just hope pgd leaves don't exist already, then I think it'll be 
> >> simpler.
> >>
> >> Thanks,
> >>
> 
> Digging into asm-generic/pgtable-nopmd.h, I see a definition of 
> pud_leaf() always returning 0, introduced by commit 2c8a81dc0cc5 
> ("riscv/mm: fix two page table check related issues")
> 
> So should asm-generic/pgtable-nopud.h contain the same for p4d_leaf() 

I think should be yes for this, and..

> and asm-generic/pgtable-nop4d.h contain the same for pgd_leaf() ?

.. probably no for this?  It seems pgd is just slightly special.

Firstly, I notice that the -nopmd.h actually includes -nopud.h already, and
further that includes -nop4d.h.  It means we can only have below options:

  a) nopmd + nopud + nop4d
  b) nopud + nop4d
  c) nop4d

It should also mean we can't randomly choose which layer to skip with the
current header arrangements.. at the meantime, all 32bit PowerPC should
perhaps fall into a), while 64 bits fall into c).  That looks all fine for
now.

Above p4d_leaf()==false [1] should be fine when -nopud.h included, because
that already included nop4d.h, it means "p4d level is skipped" in the
pgtable.  Then it doesn't make sense if p4d_leaf() can ever return true.
That's the same when pud_leaf()==false looks sane when an arch included
-nopmd.h as that in turn implies -nopud too.

pgd seems different, because -nop4d.h didn't include anything else like
"-nopgd.h"..  so I don't see further implication on pgd sololy from the
headers. I guess that's also why 32bit Power uses pgd+pte for the two
levels; it looks like pgd is special among the others.

However I think it still didn't mean that we can't push pgd to pmd, making
pgd+pte into pmd+pte.  Though here we may want to move from:

  #include <asm-generic/pgtable-nopmd.h>
  (pmd/pud/p4d not used)

Into:

  #include <asm-generic/pgtable-nopud.h>
  (pud/p4d not used)

Then we may need to provide something similar to what's in -nopXd.h for
pgds.

But let's loop back to the very beginning: I don't think we have either pgd
leaves or p4d leaves for PowerPC.  Note that hugepd looks possible to exist
in pgd entries (per wiki page [1]), however I don't even think it's true in
reality, as I mentioned elsewhere on reading __find_linux_pte() and it
always go directly into p4d.  I highly doubt in reality the "pgd hugepd
entries" are actually processed by the p4d layer here:

	if (is_hugepd(__hugepd(p4d_val(p4d)))) {
		hpdp = (hugepd_t *)&p4d;
		goto out_huge;
	}

Because when with "nop4d" it doesn't mean "there is no p4d" but what it
really meant is "we have only one p4d entry (which is actually exactly the
pgd entry..)".  After all, is_hugepd() works for all levels.

Taking E500 nohash 32 as example, it has these:

  "pgd entry covering 2M" -> "hugepd with one hugepte covering one 4M hugepage"
                                   ^
  "pgd entry covering 2M" ---------+

I suspect this "pgd covering 2M" is processed by the p4d code above, rather
than pgd level.  Then in the future world where there's no hugepd, it'll
naturally become a pgtable page at pte level.

Thanks,

[1] https://github.com/linuxppc/wiki/wiki/Huge-pages
Oscar Salvador May 17, 2024, 2:27 p.m. UTC | #8
On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> This series reimplements hugepages with hugepd on powerpc 8xx.
> 
> Unlike most architectures, powerpc 8xx HW requires a two-level
> pagetable topology for all page sizes. So a leaf PMD-contig approach
> is not feasible as such.
> 
> Possible sizes are 4k, 16k, 512k and 8M.
> 
> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> must point to a single entry level-2 page table. Until now that was
> done using hugepd. This series changes it to use standard page tables
> where the entry is replicated 1024 times on each of the two pagetables
> refered by the two associated PMD entries for that 8M page.
> 
> At the moment it has to look into each helper to know if the
> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
> a lower size. I hope this can me handled by core-mm in the future.
> 
> There are probably several ways to implement stuff, so feedback is
> very welcome.


Hi Christophe,

I have been looking into this because I am interested in the ongoing work of
the hugetlb unification, but my knowledge of ppc pagetables tends to zero,
So be prepared for some stupid questions.

First, let me have a clear picture of the current situation:

power8xx has 4KB, 16KB, 512KB, and 8MB page sizes, and operate on a 2Level 
pagetables. Wiki [1] mentions PGD + PTE, here you seem to be referring them
as PMD + PTE though.

And we can have 1024 PGDs, each of one covers 4MB, so we can cover a total of
of 4GB.

Looking at the page table diagram for power8xx, it seems power8xx has also some
sort of CONTIG_PTE? (same as arm64 does) So we can have contig_ptes representing
bigger page sizes?
I also guess that although power8xx supports all these different sizes, only one
of them can be active at any time, right?

It also seems that this whole hugepd thing is only used when we are using 8MB
PAGE_SIZE, right?
And when that is active, we do have 2 PGDs(4MB each) pointing to the same 8MB
hugepd.
E.g:
                        ____________
 [PGD#0] ------------> |            |
                       | 8MB hugepd |
 [PGD#1] ------------> |____________|

What you want to do with this work is to get rid of that hugepd abstraction
because it is something power8xx/hugetlb specific and cannot be represented
with our "normal" page table layout (PGD,P4D,PUD,PMD,PTE).
I did not check, but I guess we cannot walk the hugepd thing with a normal
page table walker, or can we? (how special is a hugepd? can you describe its
internal layout?)

So, what you proprose, is something like the following?

 [PGD#X] -----------> [PTE#0]
         -----------> [PTE..#1023]
 [PGD#Y] -----------> [PTE#0]
         -----------> [PTE..#1023]

so a 8MB hugepage will be covered by PGD#X and PGD#Y using contiguos PTEs.

The diagram at [1] for 8xx 16K seems a bit misleading to me (or maybe it is
just me). They say that a Level2 table (aka PTE) covers 4KB chunks regardless
of the pagesize, but either I read that wrong or..else.
Because on 16K page size, they show that each pte covers for 16KB memory chunk.
But that would mean 16KB << 10 = 16M each PGD, which is not really that, so what
is the deal there? Or it is just that we always use 4KB PTEs, and use contiguous
PTEs for bigger sizes?

Now, it seems that power8xx has no-p4d, no-pud and no-pmd, right?

Peter mentioned that we should have something like:

X       X
[PGD] - [P4D] - [PUD] - [PMD] - [PTE]

where the PMD and PTE would be the ones we use for representing the 2Level
ptage table, and PGD,P4D and PUD would just be dummies.

But, is not the convention to at least have PGD-PTE always, and have anything
in between as optional? E.g:

  X       ?       ?       ?       X
[PGD] - [P4D] - [PUD] - [PMD] - [PTE]

I mean, are page table walkers ready to deal with non-PGD? I thought they were
not.

Also, in patch#1, you mentioned:

"At the time being, for 512k pages the flag is kept in the PTE and inserted in
the PMD entry at TLB miss exception".

Can you point out where to look for that in the code?

Also, what exactly is the "sz" parameter that gets passed down to pmd_populate_size()?
Is the size of the current mapping we are establishing?
I see that you only make a distinction when the mapping size is 8MB.
So the PMD will have _PMD_PAGE_8MB, so it works that all 1024 PTEs below are contiguous
representing a 4MB chunk?

I will start looking deeper into this series on Monday, but just wanted to have a better
insight of what is going on.

PD: I think we could make the changelog of the coverletter a bit fatter and cover some
details in there, e.g: layout of page-tables for different page sizes, layout of hugepd,
expected layout after the work, etc.
I think it would help in reviewing this series.

Thanks!

[1] https://github.com/linuxppc/wiki/wiki/Huge-pages
Christophe Leroy May 22, 2024, 9:08 a.m. UTC | #9
Le 17/05/2024 à 16:27, Oscar Salvador a écrit :
> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>> This series reimplements hugepages with hugepd on powerpc 8xx.
>>
>> Unlike most architectures, powerpc 8xx HW requires a two-level
>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>> is not feasible as such.
>>
>> Possible sizes are 4k, 16k, 512k and 8M.
>>
>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
>> must point to a single entry level-2 page table. Until now that was
>> done using hugepd. This series changes it to use standard page tables
>> where the entry is replicated 1024 times on each of the two pagetables
>> refered by the two associated PMD entries for that 8M page.
>>
>> At the moment it has to look into each helper to know if the
>> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
>> a lower size. I hope this can me handled by core-mm in the future.
>>
>> There are probably several ways to implement stuff, so feedback is
>> very welcome.
> 
> 
> Hi Christophe,
> 
> I have been looking into this because I am interested in the ongoing work of
> the hugetlb unification, but my knowledge of ppc pagetables tends to zero,
> So be prepared for some stupid questions.
> 
> First, let me have a clear picture of the current situation:
> 
> power8xx has 4KB, 16KB, 512KB, and 8MB page sizes, and operate on a 2Level
> pagetables. Wiki [1] mentions PGD + PTE, here you seem to be referring them
> as PMD + PTE though.
> 
> And we can have 1024 PGDs, each of one covers 4MB, so we can cover a total of
> of 4GB.
> 
> Looking at the page table diagram for power8xx, it seems power8xx has also some
> sort of CONTIG_PTE? (same as arm64 does) So we can have contig_ptes representing
> bigger page sizes?
> I also guess that although power8xx supports all these different sizes, only one
> of them can be active at any time, right?

Don't know what you mean by "active at any time". In a running system 
with PAGE_SIZE defined as 4k, you can at any time have some hugepages of 
size 16K, some 512K and some 8M.

> 
> It also seems that this whole hugepd thing is only used when we are using 8MB
> PAGE_SIZE, right?

Today yes. In the past it was also used for 512K pages, until commit 
b250c8c08c79 ("powerpc/8xx: Manage 512k huge pages as standard pages.")

> And when that is active, we do have 2 PGDs(4MB each) pointing to the same 8MB
> hugepd.
> E.g:
>                          ____________
>   [PGD#0] ------------> |            |
>                         | 8MB hugepd |
>   [PGD#1] ------------> |____________|
> 
> What you want to do with this work is to get rid of that hugepd abstraction
> because it is something power8xx/hugetlb specific and cannot be represented
> with our "normal" page table layout (PGD,P4D,PUD,PMD,PTE).

It is more than 8xx, also used on e500 and book3s/64 but for sure that's 
specific to powerpc and it would help to get rid of it completely.

> I did not check, but I guess we cannot walk the hugepd thing with a normal
> page table walker, or can we? (how special is a hugepd? can you describe its
> internal layout?)

depends on what you mean by "normal". For instance 
walk_page_range_novma() handles hugepd.

> 
> So, what you proprose, is something like the following?
> 
>   [PGD#X] -----------> [PTE#0]
>           -----------> [PTE..#1023]
>   [PGD#Y] -----------> [PTE#0]
>           -----------> [PTE..#1023]
> 
> so a 8MB hugepage will be covered by PGD#X and PGD#Y using contiguos PTEs.
> 
> The diagram at [1] for 8xx 16K seems a bit misleading to me (or maybe it is
> just me). They say that a Level2 table (aka PTE) covers 4KB chunks regardless
> of the pagesize, but either I read that wrong or..else.

What it means is that when PAGE_SIZE is 16k, the pte_t is a table of 4 
long, see 
https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/include/asm/pgtable-types.h#L11

> Because on 16K page size, they show that each pte covers for 16KB memory chunk.
> But that would mean 16KB << 10 = 16M each PGD, which is not really that, so what
> is the deal there? Or it is just that we always use 4KB PTEs, and use contiguous
> PTEs for bigger sizes?

In a way yes, we cheat the HW by defining a PTE as a table of 4 u32 
values to behave like a cont-PTE.

> 
> Now, it seems that power8xx has no-p4d, no-pud and no-pmd, right?
> 
> Peter mentioned that we should have something like:
> 
> X       X
> [PGD] - [P4D] - [PUD] - [PMD] - [PTE]
> 
> where the PMD and PTE would be the ones we use for representing the 2Level
> ptage table, and PGD,P4D and PUD would just be dummies.
> 
> But, is not the convention to at least have PGD-PTE always, and have anything
> in between as optional? E.g:
> 
>    X       ?       ?       ?       X
> [PGD] - [P4D] - [PUD] - [PMD] - [PTE]
> 

That's also my understanding hence the discussion with Peter. The fog is 
that there is a mix between PGD and PMD, for instance to populate a PGD 
entry you use pmd_populate(). To clear a PGD entry you use pmd_clear, 
pgd_clear only exists when you have P4Ds.

> I mean, are page table walkers ready to deal with non-PGD? I thought they were
> not.

Yes that's how it is today but Peter was thinking for the future.

> 
> Also, in patch#1, you mentioned:
> 
> "At the time being, for 512k pages the flag is kept in the PTE and inserted in
> the PMD entry at TLB miss exception".
> 
> Can you point out where to look for that in the code?

https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/kernel/head_8xx.S#L219

and

https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/kernel/head_8xx.S#L277

> 
> Also, what exactly is the "sz" parameter that gets passed down to pmd_populate_size()?
> Is the size of the current mapping we are establishing?
> I see that you only make a distinction when the mapping size is 8MB.
> So the PMD will have _PMD_PAGE_8MB, so it works that all 1024 PTEs below are contiguous
> representing a 4MB chunk?
> 
> I will start looking deeper into this series on Monday, but just wanted to have a better
> insight of what is going on.
> 
> PD: I think we could make the changelog of the coverletter a bit fatter and cover some
> details in there, e.g: layout of page-tables for different page sizes, layout of hugepd,
> expected layout after the work, etc.
> I think it would help in reviewing this series.

I prefer keeping details in individual patches and keep the cover letter 
for a high level summary and only administrative information because 
information in cover letter are usually lost on the long term.

> 
> Thanks!
> 
> [1] https://github.com/linuxppc/wiki/wiki/Huge-pages
> 
>