Patchwork powerpc/mm: Export HPAGE_SHIFT

Submitter Roland Dreier
Date Feb. 4, 2009, 1:08 a.m.
Message ID <ada8wonnh3t.fsf@cisco.com>
Permalink /patch/21822/
State Not Applicable

Comments

Roland Dreier - Feb. 4, 2009, 1:08 a.m.
Forwarding Eli's patch below, since PowerPC guys may have missed it.  I
guess the question for Ben et al is whether there is any issue with
exporting HPAGE_SHIFT for modules (can be EXPORT_SYMBOL_GPL if you feel
it's an internal detail).  It would probably make sense to roll this
change into the mlx4 change that Eli alludes to below and merge through
my tree (with ppc maintainer acks of course), rather than splitting this
patch out and introducing cross-tree dependencies (and also separating
the rationale for the change from the change itself).

Thanks,
  Roland


Drivers may want to take advantage of the large pages used for memory obtained
from hugetlbfs. One example is mlx4_ib, which can use far fewer MTT entries
(fewer by a factor on the order of HPAGE_SIZE / PAGE_SIZE) when registering such
memory, and thus scales significantly better when registering large memory
regions. Other drivers could also benefit from this.

Signed-off-by: Eli Cohen <eli@mellanox.co.il>
---
 arch/powerpc/mm/hash_utils_64.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)
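[Editor's note: the scaling claim above can be made concrete with a small
user-space sketch of the MTT-entry arithmetic. The 4 KB base page and 16 MB
huge page sizes are illustrative assumptions, as is the mtt_entries() helper
name; real code would use PAGE_SIZE and 1UL << HPAGE_SHIFT.]

```c
#include <assert.h>

/* Illustrative sizes only: 4 KB base pages and 16 MB huge pages
 * (one common powerpc huge page size). */
#define EXAMPLE_PAGE_SIZE  (4UL << 10)
#define EXAMPLE_HPAGE_SIZE (16UL << 20)

/* One MTT entry is needed per translation unit covering the region,
 * so fewer, larger units mean proportionally fewer entries. */
static unsigned long mtt_entries(unsigned long region_bytes,
                                 unsigned long unit_bytes)
{
        return (region_bytes + unit_bytes - 1) / unit_bytes;
}
```

For a 1 GB registration this works out to 262144 entries with 4 KB units
versus 64 entries with 16 MB units: exactly the HPAGE_SIZE / PAGE_SIZE
factor the changelog describes.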
Benjamin Herrenschmidt - Feb. 4, 2009, 1:50 a.m.
On Tue, 2009-02-03 at 17:08 -0800, Roland Dreier wrote:
> Forwarding Eli's patch below, since PowerPC guys may have missed it.  I
> guess the question for Ben et al is whether there is any issue with
> exporting HPAGE_SHIFT for modules (can be EXPORT_SYMBOL_GPL if you feel
> it's an internal detail).  It would probably make sense to roll this
> change into the mlx4 change that Eli alludes to below and merge through
> my tree (with ppc maintainer acks of course), rather than splitting this
> patch out and introducing cross-tree dependencies (and also separating
> the rationale for the change from the change itself).
> 
> Thanks,
>   Roland
> 
> 
> Drivers may want to take advantage of the large pages used for memory obtained
> from hugetlbfs. One example is mlx4_ib which can use much less MTT entries (in
> the order of HPAGE_SIZE / PAGE_SIZE) when registering such memory, thus scale
> significantly better when registering larger memory regions. Other drivers
> could also benefit from this.

Except that we support multiple large page sizes nowadays ... I think
the size can be specified per hugetlbfs mountpoint, no? Thus things
like mellanox would have to query the page size used for a given
mapping.

Does the generic hugetlbfs code provide such an API? If not, we may need
to add one.

Cheers,
Ben.

> Signed-off-by: Eli Cohen <eli@mellanox.co.il>
> ---
>  arch/powerpc/mm/hash_utils_64.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
> index 8d5b475..6cff8c7 100644
> --- a/arch/powerpc/mm/hash_utils_64.c
> +++ b/arch/powerpc/mm/hash_utils_64.c
> @@ -104,6 +104,7 @@ int mmu_highuser_ssize = MMU_SEGSIZE_256M;
>  u16 mmu_slb_size = 64;
>  #ifdef CONFIG_HUGETLB_PAGE
>  unsigned int HPAGE_SHIFT;
> +EXPORT_SYMBOL(HPAGE_SHIFT);
>  #endif
>  #ifdef CONFIG_PPC_64K_PAGES
>  int mmu_ci_restrictions;
Andrew Morton - Feb. 4, 2009, 5:13 a.m.
On Wed, 04 Feb 2009 12:50:48 +1100 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Tue, 2009-02-03 at 17:08 -0800, Roland Dreier wrote:
> > Forwarding Eli's patch below, since PowerPC guys may have missed it.  I
> > guess the question for Ben et al is whether there is any issue with
> > exporting HPAGE_SHIFT for modules (can be EXPORT_SYMBOL_GPL if you feel
> > it's an internal detail).  It would probably make sense to roll this
> > change into the mlx4 change that Eli alludes to below and merge through
> > my tree (with ppc maintainer acks of course), rather than splitting this
> > patch out and introducing cross-tree dependencies (and also separating
> > the rationale for the change from the change itself).
> > 
> > Thanks,
> >   Roland
> > 
> > 
> > Drivers may want to take advantage of the large pages used for memory obtained
> > from hugetlbfs. One example is mlx4_ib which can use much less MTT entries (in
> > the order of HPAGE_SIZE / PAGE_SIZE) when registering such memory, thus scale
> > significantly better when registering larger memory regions. Other drivers
> > could also benefit from this.
> 
> Except that we support multiple large page sizes nowadays ... I think
> the size can be specified per mountpoint of hugetlbfs no ? Thus things
> like mellanox would have to query the page size used for a given
> mapping.
> 
> Do the generic hugetlbfs code provides such an API ? If not, we may need
> to add one.
> 

I think it's something like

	huge_page_size(page_hstate(page))
Nick Piggin - Feb. 4, 2009, 5:31 a.m.
On Wednesday 04 February 2009 16:13:29 Andrew Morton wrote:
> On Wed, 04 Feb 2009 12:50:48 +1100 Benjamin Herrenschmidt 
> > Do the generic hugetlbfs code provides such an API ? If not, we may need
> > to add one.
>
> I think it's something like
>
> 	huge_page_size(page_hstate(page))

That would work if you have a page, yes. If you want to query which hugepage
sizes are available, then you probably want for_each_hstate() (which is only
within mm/hugetlb.c at the moment, but I have no objections to exporting it
and symbols it requires).
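[Editor's note: the huge_page_size(page_hstate(page)) suggestion rests on each
hstate recording its compound order. The following is a user-space model of
that relationship, not kernel code; the struct and field names only mimic the
shape of the kernel's hstate, and the 4 KB base page size is an assumption.]

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 12  /* assume 4 KB base pages */
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/* Minimal stand-in for the kernel's struct hstate: one per supported
 * huge page size, characterized by its compound order. */
struct model_hstate {
        unsigned int order;     /* huge page = 2^order base pages */
};

/* Mirrors the shape of the kernel's huge_page_size(): the size is
 * just the base page size shifted by the hstate's order. */
static unsigned long model_huge_page_size(const struct model_hstate *h)
{
        return MODEL_PAGE_SIZE << h->order;
}
```

With 4 KB base pages, a 16 MB huge page corresponds to order 12 and a
2 MB huge page to order 9.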
Roland Dreier - Feb. 4, 2009, 6:16 a.m.
> I think it's something like
 > 
 > 	huge_page_size(page_hstate(page))

That would suit.  I assume the intention is for that to be usable by
driver modules on any architecture?

 - R.
wli@movementarian.org - Feb. 4, 2009, 6:17 a.m.
On Wed, 04 Feb 2009 12:50:48 +1100 Benjamin Herrenschmidt 
>>> Do the generic hugetlbfs code provides such an API ? If not, we may need
>>> to add one.

On Wednesday 04 February 2009 16:13:29 Andrew Morton wrote:
>> I think it's something like
>> 	huge_page_size(page_hstate(page))

On Wed, Feb 04, 2009 at 04:31:38PM +1100, Nick Piggin wrote:
> That would work if you have a page, yes. If you want to query which hugepage
> sizes are available, then you probably want for_each_hstate() (which is only
> within mm/hugetlb.c at the moment, but I have no objections to exporting it
> and symbols it requires).

Exciting things have happened in mm/hugetlb.c while I was out sick.
I've got quite some catching up to do.


-- wli
Andrew Morton - Feb. 4, 2009, 6:26 a.m.
On Tue, 03 Feb 2009 22:16:08 -0800 Roland Dreier <rdreier@cisco.com> wrote:

>  > I think it's something like
>  > 
>  > 	huge_page_size(page_hstate(page))
> 
> That would suit.  I assume the intention is for that to be usable by
> driver modules on any architecture?
> 

erm, you overestimate the amount of planning and forethought which goes
into these things ;)

The lack of any EXPORT_SYMBOL(size_to_hstate) is a broadish hint.
Roland Dreier - Feb. 4, 2009, 7:11 p.m.
> >  > 	huge_page_size(page_hstate(page))

 > > That would suit.  I assume the intention is for that to be usable by
 > > driver modules on any architecture?

 > erm, you overestimate the amount of planning and forethought which goes
 > into these things ;)

 > The lack of any EXPORT_SYMBOL(size_to_hstate) is a broadish hint.

Heh.  Looking into the implementation, it seems that I could actually do

	PAGE_SIZE << compound_order(page)

directly (since there's no reason to go from size to hstate and back to
size).  I don't know all the details of these VM internals, but that
seems to only work on the first (small) page of a giant page?  Which
causes problems for what we're trying to do here...

To summarize the goal, we are mapping user memory to a device that has
its own page tables, where the device's page tables can also use
multiple page sizes.  Using big pages on the device leads to similar
efficiencies as hugetlb pages do on the CPU, and in fact if a user has
used hugetlb pages for the memory they're giving to the device, that's a
very strong hint that the device should use big pages too.

But one valid situation we have to handle in the driver is if, say,
userspace has a hugetlb mapped at virtual address 0x200000, and wants to
map 0x80000 bytes at 0x280000 to the device.  In that case, we're going
to do essentially

	get_user_pages(..., 0x280000, 0x80000 / PAGE_SIZE, ...)

and get_user_pages() is going to give us a bunch of normal PAGE_SIZE
pages starting at offset 0x80000 within the compound page that makes up
the huge page mapped at 0x200000.

get_user_pages() also gives us the vma back, and we can see from
is_vm_hugetlb_page() (-- BTW can I just say that a function
is_xxx_page() that operates on vmas is horribly misnamed --) that these
pages all come from a hugetlb mapping, but figuring out the size of that
mapping is I guess a challenge.

 - R.
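[Editor's note: the address arithmetic in the scenario above can be checked
with a small sketch. The addresses come from the mail; the 4 KB base page
size and the two helper names are assumptions for illustration.]

```c
#include <assert.h>

#define EXAMPLE_PAGE_SIZE 0x1000UL   /* assume 4 KB base pages */

/* Number of base pages get_user_pages() would be asked for when
 * mapping len bytes (len assumed page-aligned, as in the example). */
static unsigned long gup_nr_pages(unsigned long len)
{
        return len / EXAMPLE_PAGE_SIZE;
}

/* Byte offset of the requested range within the compound (huge)
 * page whose mapping starts at huge_va. */
static unsigned long offset_in_hugepage(unsigned long huge_va,
                                        unsigned long req_va)
{
        return req_va - huge_va;
}
```

For the example values, get_user_pages() is asked for 128 base pages,
starting 0x80000 bytes into the compound page mapped at 0x200000.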
wli@movementarian.org - Feb. 4, 2009, 9 p.m.
On Wed, Feb 04, 2009 at 11:11:22AM -0800, Roland Dreier wrote:
> Heh.  Looking into the implementation, it seems that I could actually do
> 	PAGE_SIZE << compound_order(page)
> directly (since there's no reason to go from size to hstate and back to
> size.  I don't know all the details of these VM internals, but that
> seems to only work on the first (small) page of a giant page?  Which
> causes problems for what we're trying to do here...

You should be able to find the head of a compound page using the
compound_head() inline, so try

	PAGE_SIZE << compound_order(compound_head(page))


-- wli
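[Editor's note: the reason compound_head() makes the expression work on any
tail page can be shown with a toy user-space model. The semantics modeled
here — tail pages point at their head, the order lives on the head — follow
the kernel's compound-page design, but the struct below is purely
illustrative, not the kernel's struct page.]

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL  /* assume 4 KB base pages */

/* Toy model of a page within a compound page: tail pages carry a
 * pointer to their head; the order is valid on the head only. */
struct model_page {
        struct model_page *head;   /* NULL if this page IS the head */
        unsigned int order;        /* meaningful on the head page */
};

static struct model_page *model_compound_head(struct model_page *p)
{
        return p->head ? p->head : p;
}

/* PAGE_SIZE << compound_order(compound_head(page)), modeled: gives
 * the full huge page size whether p is the head or any tail page. */
static unsigned long model_region_size(struct model_page *p)
{
        return MODEL_PAGE_SIZE << model_compound_head(p)->order;
}
```

Without the compound_head() step, asking a tail page for its order would
read an unset field, which is the failure mode Roland suspected.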
Andrew Morton - Feb. 4, 2009, 9:23 p.m.
On Wed, 04 Feb 2009 11:11:22 -0800
Roland Dreier <rdreier@cisco.com> wrote:

>  > >  > 	huge_page_size(page_hstate(page))
> 
>  > > That would suit.  I assume the intention is for that to be usable by
>  > > driver modules on any architecture?
> 
>  > erm, you overestimate the amount of planning and forethought which goes
>  > into these things ;)
> 
>  > The lack of any EXPORT_SYMBOL(size_to_hstate) is a broadish hint.
> 
> Heh.  Looking into the implementation, it seems that I could actually do
> 
> 	PAGE_SIZE << compound_order(page)
> 
> directly (since there's no reason to go from size to hstate and back to
> size.  I don't know all the details of these VM internals, but that
> seems to only work on the first (small) page of a giant page?  Which
> causes problems for what we're trying to do here...
> 
> To summarize the goal, we are mapping user memory to a device that has
> its own page tables, where the device's page tables can also use
> multiple page sizes.  Using big pages on the device leads to similar
> efficiencies as hugetlb pages do on the CPU, and in fact if a user has
> used hugetlb pages for the memory they're giving to the device, that's a
> very strong hint that the device should use big pages too.
> 
> But one valid situation we have to handle in the driver is if, say,
> userspace has a hugetlb mapped at virtual address 0x200000, and wants to
> map 0x80000 bytes at 0x280000 to the device.  In that case, we're going
> to do essentially
> 
> 	get_user_pages(..., 0x280000, 0x80000 / PAGE_SIZE, ...)
> 
> and get_user_pages() is going to give us a bunch of normal PAGE_SIZE
> pages starting at offset 0x800000 within the compound page that makes up
> the huge page mapped at 0x200000.
> 
> get_user_pages() also gives us the vma back, and we can see from
> is_vm_hugetlb_page() (-- BTW can I just say that a function
> is_xxx_page() that operates on vmas is horribly misnamed --) that these
> pages all come from a hugetlb mapping, but figuring out the size of that
> mapping is I guess a challenge.

compound_head() will convert any page* inside a hugepage into a pointer
to the head page.  It should work OK for regular pages as well as
CONFIG_HUGETLB=n.

So..

	PAGE_SIZE << compound_order(compound_head(page))

?
Roland Dreier - Feb. 4, 2009, 9:31 p.m.
> You should be able to find the head of a compound page using the
 > compound_head() inline, so try

 > 	PAGE_SIZE << compound_order(compound_head(page))

Thanks!  Looks like that should be exactly what we need.

 - R.
Benjamin Herrenschmidt - Feb. 4, 2009, 11:55 p.m.
> get_user_pages() also gives us the vma back, and we can see from
> is_vm_hugetlb_page() (-- BTW can I just say that a function
> is_xxx_page() that operates on vmas is horribly misnamed --) that these
> pages all come from a hugetlb mapping, but figuring out the size of that
> mapping is I guess a challenge.

Note that g_u_p() has all sorts of shortcomings... we were discussing
some of that recently due to bugs reported from the field.

The problem mostly is that you cannot guarantee that the physical page
will remain mapped to that virtual address in the process. For example,
if your code is part of some library used by an application, and that
application somewhere does a fork/exec (for example, a system() call to
run a shell helper), copy-on-write will hit, and you may end up with
the child process getting the original physical page and the original
process getting the copy...

So your HW will still DMA to a valid page (ie, its count will have
been incremented) but it's not going to be the one the application
uses any more.

There are similar issues that can be caused, afaik, by madvise, etc...

We've been discussing that at KS with various people, Linus says g_u_p()
sucks, don't do that :-) Most of the time, the other approach should be
used, ie, the driver allocates memory, and userspace mmap's it, in which
case you get access to the VMA to set flags such as don't copy on fork.

An option possibly would be to make fork() pre-COW pages with an
elevated count to ensure that at least the original process is the one
to keep the original physical page... but that has other potential side
effects or performance issues.

A can of worms..

Cheers,
Ben.
Roland Dreier - Feb. 5, 2009, 5:10 a.m.
> Note that g_u_p() has all sort of shortcommings... we were discussing
 > some of that recently due to bugs reported from the field.
 > 
 > The problem mostly is that you cannot guarantee that the physical page
 > will remain mapped to that virtual address in the process. For example,
 > if your code is part of some library used by an application, and that
 > application somewhere does a fork/exec (for example, a system() call to
 > run a shell helper), copy-on-write will hit, and you may end up with
 > the child process getting the original physical page and the original
 > process getting the copy...

Believe me, I'm well aware of that.  We added VM_DONTCOPY to allow apps
to fork() without the child triggering COW, but that means only the
parent process can use the mapped memory (and the app has to worry about
page boundaries etc).

 > We've been discussing that at KS with various people, Linus says g_u_p()
 > sucks, don't do that :-) Most of the time, the other approach should be
 > used, ie, the driver allocates memory, and userspace mmap's it, in which
 > case you get access to the VMA to set flags such as don't copy on fork.

Yes, but unfortunately MPI says apps can allocate memory however they
damn well please... in any case these issues are all-too-well-known in
the RDMA world for quite a while.

Thanks,
  Roland
Benjamin Herrenschmidt - Feb. 5, 2009, 5:24 a.m.
On Wed, 2009-02-04 at 21:10 -0800, Roland Dreier wrote:
> > Note that g_u_p() has all sort of shortcommings... we were discussing
>  > some of that recently due to bugs reported from the field.
>  > 
>  > The problem mostly is that you cannot guarantee that the physical page
>  > will remain mapped to that virtual address in the process. For example,
>  > if your code is part of some library used by an application, and that
>  > application somewhere does a fork/exec (for example, a system() call to
>  > run a shell helper), copy-on-write will hit, and you may end up with
>  > the child process getting the original physical page and the original
>  > process getting the copy...
> 
> Believe me, I'm well aware of that.  We added VM_DONTCOPY to allow apps
> to fork() without the child triggering COW, but that means only the
> parent process can use the mapped memory (and the app has to worry about
> page boundaries etc).

Right, but then you need to set that in the VMA's, and thus gone is your
nice fast g_u_p() that doesn't touch VMAs :-)

> Yes, but unfortunately MPI says apps can allocate memory however they
> damn well please... in any case these issues are all-too-well-known in
> the RDMA world for quite a while.

Yup. What do you think of the idea of pre-COWing pages with an elevated
count at fork time ?

Cheers,
Ben.
Roland Dreier - Feb. 5, 2009, 5:33 a.m.
> Right, but then you need to set that in the VMA's, and thus gone is your
 > nice fast g_u_p() that doesn't touch VMAs :-)

Registering memory is a slow path thing in the RDMA world.  Speeding it
up is nice, so we make userspace do the madvise(VM_DONTCOPY) if it
cares; if it doesn't, it can leave it out.

 > > Yes, but unfortunately MPI says apps can allocate memory however they
 > > damn well please... in any case these issues are all-too-well-known in
 > > the RDMA world for quite a while.

 > Yup. What do you think of the idea of pre-COWing pages with an elevated
 > count at fork time ?

Super-duper sucks if the first thing the child does is exec() :)
Also if the parent has registered > half the memory in the system then
it's instant OOM.  So not that useful for the RDMA case :)

The one thing that might make sense is to pre-COW any partial pages that
the parent has registered -- ie if half a page can be used by the child,
at least pre-COW that, but leave all the full pages with VM_DONTCOPY.

 - R.
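[Editor's note: deciding which pages of a registered range are only partially
covered — the candidates for pre-COW under the idea above — is simple
alignment arithmetic. A hypothetical sketch; the 4 KB page size and helper
names are assumptions.]

```c
#include <assert.h>

#define EXAMPLE_PAGE_SIZE 0x1000UL  /* assume 4 KB base pages */

/* A page is "partial" when the registered range [start, end) covers
 * only part of it, so the remainder of that page may still be used
 * by the child and would need the COW copy. */
static int has_partial_head(unsigned long start)
{
        return (start % EXAMPLE_PAGE_SIZE) != 0;
}

static int has_partial_tail(unsigned long end)
{
        return (end % EXAMPLE_PAGE_SIZE) != 0;
}
```

Only these boundary pages would be pre-COWed; all fully covered pages
would keep VM_DONTCOPY behavior as suggested above.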

Patch

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 8d5b475..6cff8c7 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -104,6 +104,7 @@  int mmu_highuser_ssize = MMU_SEGSIZE_256M;
 u16 mmu_slb_size = 64;
 #ifdef CONFIG_HUGETLB_PAGE
 unsigned int HPAGE_SHIFT;
+EXPORT_SYMBOL(HPAGE_SHIFT);
 #endif
 #ifdef CONFIG_PPC_64K_PAGES
 int mmu_ci_restrictions;