diff mbox

[10/17] mm: rmap preparation for remap_anon_pages

Message ID 1412356087-16115-11-git-send-email-aarcange@redhat.com
State New
Headers show

Commit Message

Andrea Arcangeli Oct. 3, 2014, 5:08 p.m. UTC
remap_anon_pages (unlike remap_file_pages) tries to be non intrusive
in the rmap code.

As far as the rmap code is concerned, rmap_anon_pages only alters the
page->mapping and page->index. It does it while holding the page
lock. However there are a few places that in presence of anon pages
are allowed to do rmap walks without the page lock (split_huge_page
and page_referenced_anon). Those places that are doing rmap walks
without taking the page lock first, must be updated to re-check that
the page->mapping didn't change after they obtained the anon_vma
lock. remap_anon_pages takes the anon_vma lock for writing before
altering the page->mapping, so if the page->mapping is still the same
after obtaining the anon_vma lock (without the page lock), the rmap
walks can go ahead safely (and remap_anon_pages will wait them to
complete before proceeding).

remap_anon_pages serializes against itself with the page lock.

All other places taking the anon_vma lock while holding the mmap_sem
for writing, don't need to check if the page->mapping has changed
after taking the anon_vma lock, regardless of the page lock, because
remap_anon_pages holds the mmap_sem for reading.

Overall this looks a fairly small change to the rmap code, notably
less intrusive than the nonlinear vmas created by remap_file_pages.

There's one constraint enforced to allow this simplification: the
source pages passed to remap_anon_pages must be mapped only in one
vma, but this is not a limitation when used to handle userland page
faults with MADV_USERFAULT. The source addresses passed to
remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK to
avoid any risk of the mapcount of the pages increasing, if fork runs
in parallel in another thread, before or while remap_anon_pages runs.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/huge_memory.c | 24 ++++++++++++++++++++----
 mm/rmap.c        |  9 +++++++++
 2 files changed, 29 insertions(+), 4 deletions(-)

Comments

Linus Torvalds Oct. 3, 2014, 6:31 p.m. UTC | #1
On Fri, Oct 3, 2014 at 10:08 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Overall this looks a fairly small change to the rmap code, notably
> less intrusive than the nonlinear vmas created by remap_file_pages.

Considering that remap_file_pages() was an unmitigated disaster, and
-mm has a patch to remove it entirely, I'm not at all convinced this
is a good argument.

We thought remap_file_pages() was a good idea, and it really really
really wasn't. Almost nobody used it, why would the anonymous page
case be any different?

            Linus
Dr. David Alan Gilbert Oct. 6, 2014, 8:55 a.m. UTC | #2
* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Fri, Oct 3, 2014 at 10:08 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Overall this looks a fairly small change to the rmap code, notably
> > less intrusive than the nonlinear vmas created by remap_file_pages.
> 
> Considering that remap_file_pages() was an unmitigated disaster, and
> -mm has a patch to remove it entirely, I'm not at all convinced this
> is a good argument.
> 
> We thought remap_file_pages() was a good idea, and it really really
> really wasn't. Almost nobody used it, why would the anonymous page
> case be any different?

I've posted code that uses this interface to qemu-devel and it works nicely;
so chalk up at least one user.

For the postcopy case I'm using it for, we need to place a page, atomically
  some thread might try and access it, and must either
     1) get caught by userfault etc or
     2) must succeed in it's access

and we'll have that happening somewhere between thousands and millions of times
to pages in no particular order, so we need to avoid creating millions of mappings.

Dave



> 
>             Linus
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Andrea Arcangeli Oct. 6, 2014, 4:41 p.m. UTC | #3
Hello,

On Mon, Oct 06, 2014 at 09:55:41AM +0100, Dr. David Alan Gilbert wrote:
> * Linus Torvalds (torvalds@linux-foundation.org) wrote:
> > On Fri, Oct 3, 2014 at 10:08 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > >
> > > Overall this looks a fairly small change to the rmap code, notably
> > > less intrusive than the nonlinear vmas created by remap_file_pages.
> > 
> > Considering that remap_file_pages() was an unmitigated disaster, and
> > -mm has a patch to remove it entirely, I'm not at all convinced this
> > is a good argument.
> > 
> > We thought remap_file_pages() was a good idea, and it really really
> > really wasn't. Almost nobody used it, why would the anonymous page
> > case be any different?
> 
> I've posted code that uses this interface to qemu-devel and it works nicely;
> so chalk up at least one user.
> 
> For the postcopy case I'm using it for, we need to place a page, atomically
>   some thread might try and access it, and must either
>      1) get caught by userfault etc or
>      2) must succeed in it's access
> 
> and we'll have that happening somewhere between thousands and millions of times
> to pages in no particular order, so we need to avoid creating millions of mappings.

Yes, that's our current use case.

Of course if somebody has better ideas on how to resolve an anonymous
userfault they're welcome.

How to resolve an userfault is orthogonal on how to detect it and to
notify userland about it and to be notified when the userfault has
been resolved. The latter is what the userfault and userfaultfd
do. The former is what remap_anon_pages is used for but we could use
something else too if there are better ways. mremap would clearly work
too, but it would be less strict (it could lead to silent data
corruption if there are bugs in the userland code), it would be slower
and it would eventually a hit a -ENOMEM failure because there would be
too many vmas.

I could in theory drop remap_anon_pages from this patchset, but
without an optimal way to resolve an userfault, the rest isn't so
useful.

We're currently discussing on what would be the best way to resolve a
MAP_SHARED userfault on tmpfs in fact (that's not sorted yet), but so
far, it seems remap_anon_pages fits the bill for anonymous memory.

remap_anon_pages is not as problematic to maintain as remap_file_pages
for the reason explained in the commit header, but there are other
reasons: it doesn't require special pte_file and it changes nothing of
how anonymous page faults works. All it requires is a loop to catch a
changed page->index (previously page->index couldn't change, not it
can, that's the only thing it changes).

remap_file_pages complexity derives from not being allowed to change
page->index during a move because the page_mapping may be bigger than
1, while that is precisely what remap_anon_pages does.

As long as this "rmap preparation" is the only constraints that
remap_anon_pages introduces in terms of rmap, it looks a nice
not-too-intrusive solution to resolve anonymous userfaults
efficiently.

Introducing remap_anon_pages in fact doesn't reduce the
simplification derived from the removal of remap_file_pages.

As opposed removing remap_anon_pages later would only have the benefit
of removing this very patch 10/17 and no other benefit.

In short remap_anon_pages does this (heavily simplified):

   pte = *src_pte;
   *src_pte = 0;
   pte_page(pte)->index = adjusted according to src_vma/dst_vma->vm_pgoff
   *dst_pte = pte;

It guarantees not to modify the vmas and in turn it doesn't require to
take the mmap_sem for writing.

To use remap_anon_pages, each thread has to create its own temporary
vma with MADV_DONTFORK set on it (not formally required by the syscall
strict checks, but then the application must never fork if
MADV_DONTFORK isn't set or remap_anon_pages could return -EBUSY:
there's no risk of silent data corruption even if the thread forks
without setting MADV_DONTFORK) as source region where receive data
through the network. Then after the data is fully received
rmap_anon_pages moves the page from the temporary vma to the address
where the userfault triggered atomically (while other threads may be
attempting to access the userfault address too, thanks to
remap_anon_pages atomic behavior they won't risk to ever see partial
data coming from the network).

remap_anon_pages as side effect creates an hole in the temporary
(source) vma, so the next recv() syscall receiving data from the
network will fault-in a new anonymous page without requiring any
further malloc/free or other kind of vma mangling.

Thanks,
Andrea
Kirill A. Shutemov Oct. 7, 2014, 11:10 a.m. UTC | #4
On Fri, Oct 03, 2014 at 07:08:00PM +0200, Andrea Arcangeli wrote:
> There's one constraint enforced to allow this simplification: the
> source pages passed to remap_anon_pages must be mapped only in one
> vma, but this is not a limitation when used to handle userland page
> faults with MADV_USERFAULT. The source addresses passed to
> remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK to
> avoid any risk of the mapcount of the pages increasing, if fork runs
> in parallel in another thread, before or while remap_anon_pages runs.

Have you considered triggering COW instead of adding limitation on
pages' mapcount? The limitation looks artificial from interface POV.
Linus Torvalds Oct. 7, 2014, 12:47 p.m. UTC | #5
On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> Of course if somebody has better ideas on how to resolve an anonymous
> userfault they're welcome.

So I'd *much* rather have a "write()" style interface (ie _copying_
bytes from user space into a newly allocated page that gets mapped)
than a "remap page" style interface

remapping anonymous pages involves page table games that really aren't
necessarily a good idea, and tlb invalidates for the old page etc.
Just don't do it.

           Linus
Andrea Arcangeli Oct. 7, 2014, 1:37 p.m. UTC | #6
Hi Kirill,

On Tue, Oct 07, 2014 at 02:10:26PM +0300, Kirill A. Shutemov wrote:
> On Fri, Oct 03, 2014 at 07:08:00PM +0200, Andrea Arcangeli wrote:
> > There's one constraint enforced to allow this simplification: the
> > source pages passed to remap_anon_pages must be mapped only in one
> > vma, but this is not a limitation when used to handle userland page
> > faults with MADV_USERFAULT. The source addresses passed to
> > remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK to
> > avoid any risk of the mapcount of the pages increasing, if fork runs
> > in parallel in another thread, before or while remap_anon_pages runs.
> 
> Have you considered triggering COW instead of adding limitation on
> pages' mapcount? The limitation looks artificial from interface POV.

I haven't considered it, mostly because I see it as a feature that it
returns -EBUSY. I prefer to avoid the risk of userland getting a
successful retval but internally the kernel silently behaving
non-zerocopy by mistake because some userland bug forgot to set
MADV_DONTFORK on the src_vma.

COW would be not zerocopy so it's not ok. We get sub 1msec latency for
userfaults through 10gbit and we don't want to risk wasting CPU
caches.

I however considered allowing to extend the strict behavior (i.e. the
feature) later in a backwards compatible way. We could provide a
non-zerocopy beahvior with a RAP_ALLOW_COW flag that would then turn
the -EBUSY error into a copy.

It's also more complex to implement the cow now, so it would make the
code that really matters, harder to review. So it may be preferable to
extend this later in a backwards compatible way with a new
RAP_ALLOW_COW flag.

The current handling the flags is already written in a way that should
allow backwards compatible extension with RAP_ALLOW_*:

#define RAP_ALLOW_SRC_HOLES (1UL<<0)

SYSCALL_DEFINE4(remap_anon_pages,
		unsigned long, dst_start, unsigned long, src_start,
		unsigned long, len, unsigned long, flags)
[..]
	long err = -EINVAL;
[..]
	if (flags & ~RAP_ALLOW_SRC_HOLES)
		return err;
Andrea Arcangeli Oct. 7, 2014, 2:19 p.m. UTC | #7
Hello,

On Tue, Oct 07, 2014 at 08:47:59AM -0400, Linus Torvalds wrote:
> On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Of course if somebody has better ideas on how to resolve an anonymous
> > userfault they're welcome.
> 
> So I'd *much* rather have a "write()" style interface (ie _copying_
> bytes from user space into a newly allocated page that gets mapped)
> than a "remap page" style interface
> 
> remapping anonymous pages involves page table games that really aren't
> necessarily a good idea, and tlb invalidates for the old page etc.
> Just don't do it.

I see what you mean. The only cons I see is that we couldn't use then
recv(tmp_addr, PAGE_SIZE), remap_anon_pages(faultaddr, tmp_addr,
PAGE_SIZE, ..)  and retain the zerocopy behavior. Or how could we?
There's no recvfile(userfaultfd, socketfd, PAGE_SIZE).

Ideally if we could prevent the page data coming from the network to
ever become visible in the kernel we could avoid the TLB flush and
also be zerocopy but I can't see how we could achieve that.

The page data could come through a ssh pipe or anything (qemu supports
all kind of network transports for live migration), this is why
leaving the network protocol into userland is preferable.

As things stands now, I'm afraid with a write() syscall we couldn't do
it zerocopy. We'd still need to receive the memory in a temporary page
and then copy it to a kernel page (invisible to userland while we
write to it) to later map into the userfault address.

If it wasn't for the TLB flush of the old page, the remap_anon_pages
variant would be more optimal than doing a copy through a write
syscall. Is the copy cheaper than a TLB flush? I probably naively
assumed the TLB flush was always cheaper.

Now another idea that comes to mind to be able to add the ability to
switch between copy and TLB flush is using a RAP_FORCE_COPY flag, that
would then do a copy inside remap_anon_pages and leave the original
page mapped in place... (and such flag would also disable the -EBUSY
error if page_mapcount is > 1).

So then if the RAP_FORCE_COPY flag is set remap_anon_pages would
behave like you suggested (but with a mremap-like interface, instead
of a write syscall) and we could benchmark the difference between copy
and TLB flush too. We could even periodically benchmark it at runtime
and switch over the faster method (the more CPUs there are in the host
and the more threads the process has, the faster the copy will be
compared to the TLB flush).

Of course in terms of API I could implement the exact same mechanism
as described above for remap_anon_pages inside a write() to the
userfaultfd (it's a pseudo inode). It'd need two different commands to
prepare for the coming write (with a len multiple of PAGE_SIZE) to
know the address where the page should be mapped into and if to behave
zerocopy or if to skip the TLB flush and copy.

Because the copy vs TLB flush trade off is possible to achieve with
both interfaces, I think it really boils down to choosing between a
mremap like interface, or file+commands protocol interface. I tend to
like mremap more, that's why I opted for a remap_anon_pages syscall
kept orthogonal to the userfaultfd functionality (remap_anon_pages
could be also used standalone as an accelerated mremap in some
circumstances) but nothing prevents to just embed the same mechanism
inside userfaultfd if a file+commands API is preferable. Or we could
add a different syscall (separated from userfaultfd) that creates
another pseudofd to write a command plus the page data into it. Just I
wouldn't see the point of creating a pseudofd just to copy a page
atomically, the write() syscall would look more ideal if the
userfaultfd is already open for other reasons and the pseudofd
overhead is required anyway.

Last thing to keep in mind is that if using userfaults with SIGBUS and
without userfaultfd, remap_anon_pages would have been still useful, so
if we retain the SIGBUS behavior for volatile pages and we don't force
the usage for userfaultfd, it may be cleaner not to use userfaultfd
but a separate pseudofd to do the write() syscall though. Otherwise
the app would need to open the userfaultfd to resolve the fault even
though it's not using the userfaultfd protocol which doesn't look an
intuitive interface to me.

Comments welcome.

Thanks,
Andrea
Andrea Arcangeli Oct. 7, 2014, 3:52 p.m. UTC | #8
On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote:
> mremap like interface, or file+commands protocol interface. I tend to
> like mremap more, that's why I opted for a remap_anon_pages syscall
> kept orthogonal to the userfaultfd functionality (remap_anon_pages
> could be also used standalone as an accelerated mremap in some
> circumstances) but nothing prevents to just embed the same mechanism

Sorry for the self followup, but something else comes to mind to
elaborate this further.

In term of interfaces, the most efficient I could think of to minimize
the enter/exit kernel, would be to append the "source address" of the
data received from the network transport, to the userfaultfd_write()
command (by appending 8 bytes to the wakeup command). Said that,
mixing the mechanism to be notified about userfaults with the
mechanism to resolve an userfault to me looks a complication. I kind
of liked to keep the userfaultfd protocol is very simple and doing
just its thing. The userfaultfd doesn't need to know how the userfault
was resolved, even mremap would work theoretically (until we run out
of vmas). I thought it was simpler to keep it that way. However if we
want to resolve the fault with a "write()" syscall this may be the
most efficient way to do it, as we're already doing a write() into the
pseudofd to wakeup the page fault that contains the destination
address, I just need to append the source address to the wakeup command.

I probably grossly overestimated the benefits of resolving the
userfault with a zerocopy page move, sorry. So if we entirely drop the
zerocopy behavior and the TLB flush of the old page like you
suggested, the way to keep the userfaultfd mechanism decoupled from
the userfault resolution mechanism would be to implement an
atomic-copy syscall. That would work for SIGBUS userfaults too without
requiring a pseudofd then. It would be enough then to call
mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraints
that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic
wouldn't page fault or call GUP into the destination address (it can't
otherwise the in-flight partial copy would be visible to the process,
breaking the atomicity of the copy), but it would fill in the
pte/trans_huge_pmd with the same strict behavior that remap_anon_pages
currently has (in turn it would by design bypass the VM_USERFAULT
check and be ideal for resolving userfaults).

mcopy_atomic could then be also extended to tmpfs and it would work
without requiring the source page to be a tmpfs page too without
having to convert page types on the fly.

If I add mcopy_atomic, the patch in subject (10/17) can be dropped of
course so it'd be even less intrusive than the current
remap_anon_pages and it would require zero TLB flush during its
runtime (it would just require an atomic copy).

So should I try to embed a mcopy_atomic inside userfault_write or can
I expose it to userland as a standalone new syscall? Or should I do
something different? Comments?

Thanks,
Andrea
Andy Lutomirski Oct. 7, 2014, 3:54 p.m. UTC | #9
On Tue, Oct 7, 2014 at 8:52 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote:
>> mremap like interface, or file+commands protocol interface. I tend to
>> like mremap more, that's why I opted for a remap_anon_pages syscall
>> kept orthogonal to the userfaultfd functionality (remap_anon_pages
>> could be also used standalone as an accelerated mremap in some
>> circumstances) but nothing prevents to just embed the same mechanism
>
> Sorry for the self followup, but something else comes to mind to
> elaborate this further.
>
> In term of interfaces, the most efficient I could think of to minimize
> the enter/exit kernel, would be to append the "source address" of the
> data received from the network transport, to the userfaultfd_write()
> command (by appending 8 bytes to the wakeup command). Said that,
> mixing the mechanism to be notified about userfaults with the
> mechanism to resolve an userfault to me looks a complication. I kind
> of liked to keep the userfaultfd protocol is very simple and doing
> just its thing. The userfaultfd doesn't need to know how the userfault
> was resolved, even mremap would work theoretically (until we run out
> of vmas). I thought it was simpler to keep it that way. However if we
> want to resolve the fault with a "write()" syscall this may be the
> most efficient way to do it, as we're already doing a write() into the
> pseudofd to wakeup the page fault that contains the destination
> address, I just need to append the source address to the wakeup command.
>
> I probably grossly overestimated the benefits of resolving the
> userfault with a zerocopy page move, sorry. So if we entirely drop the
> zerocopy behavior and the TLB flush of the old page like you
> suggested, the way to keep the userfaultfd mechanism decoupled from
> the userfault resolution mechanism would be to implement an
> atomic-copy syscall. That would work for SIGBUS userfaults too without
> requiring a pseudofd then. It would be enough then to call
> mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraints
> that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic
> wouldn't page fault or call GUP into the destination address (it can't
> otherwise the in-flight partial copy would be visible to the process,
> breaking the atomicity of the copy), but it would fill in the
> pte/trans_huge_pmd with the same strict behavior that remap_anon_pages
> currently has (in turn it would by design bypass the VM_USERFAULT
> check and be ideal for resolving userfaults).

At the risk of asking a possibly useless question, would it make sense
to splice data into a userfaultfd?

--Andy

>
> mcopy_atomic could then be also extended to tmpfs and it would work
> without requiring the source page to be a tmpfs page too without
> having to convert page types on the fly.
>
> If I add mcopy_atomic, the patch in subject (10/17) can be dropped of
> course so it'd be even less intrusive than the current
> remap_anon_pages and it would require zero TLB flush during its
> runtime (it would just require an atomic copy).
>
> So should I try to embed a mcopy_atomic inside userfault_write or can
> I expose it to userland as a standalone new syscall? Or should I do
> something different? Comments?
>
> Thanks,
> Andrea
Peter Feiner Oct. 7, 2014, 4:13 p.m. UTC | #10
On Tue, Oct 07, 2014 at 05:52:47PM +0200, Andrea Arcangeli wrote:
> I probably grossly overestimated the benefits of resolving the
> userfault with a zerocopy page move, sorry. [...]

For posterity, I think it's worth noting that most expensive aspect of a TLB
shootdown is the interprocessor interrupt necessary to flush other CPUs' TLBs.
On a many-core machine, copying 4K of data looks pretty cheap compared to
taking an interrupt and invalidating TLBs on many cores :-)

> [...] So if we entirely drop the
> zerocopy behavior and the TLB flush of the old page like you
> suggested, the way to keep the userfaultfd mechanism decoupled from
> the userfault resolution mechanism would be to implement an
> atomic-copy syscall. That would work for SIGBUS userfaults too without
> requiring a pseudofd then. It would be enough then to call
> mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraints
> that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic
> wouldn't page fault or call GUP into the destination address (it can't
> otherwise the in-flight partial copy would be visible to the process,
> breaking the atomicity of the copy), but it would fill in the
> pte/trans_huge_pmd with the same strict behavior that remap_anon_pages
> currently has (in turn it would by design bypass the VM_USERFAULT
> check and be ideal for resolving userfaults).
> 
> mcopy_atomic could then be also extended to tmpfs and it would work
> without requiring the source page to be a tmpfs page too without
> having to convert page types on the fly.
> 
> If I add mcopy_atomic, the patch in subject (10/17) can be dropped of
> course so it'd be even less intrusive than the current
> remap_anon_pages and it would require zero TLB flush during its
> runtime (it would just require an atomic copy).

I like this new approach. It will be good to have a single interface for
resolving anon and tmpfs userfaults.

> So should I try to embed a mcopy_atomic inside userfault_write or can
> I expose it to userland as a standalone new syscall? Or should I do
> something different? Comments?

One interesting (ab)use of userfault_write would be that the faulting process
and the fault-handling process could be different, which would be necessary
for post-copy live migration in CRIU (http://criu.org).

Aside from the asthetic difference, I can't think of any advantage in favor of
a syscall.

Peter
Linus Torvalds Oct. 7, 2014, 4:56 p.m. UTC | #11
On Tue, Oct 7, 2014 at 10:19 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> I see what you mean. The only cons I see is that we couldn't use then
> recv(tmp_addr, PAGE_SIZE), remap_anon_pages(faultaddr, tmp_addr,
> PAGE_SIZE, ..)  and retain the zerocopy behavior. Or how could we?
> There's no recvfile(userfaultfd, socketfd, PAGE_SIZE).

You're doing completelt faulty math, and you haven't thought it through.

Your "zero-copy" case is no such thing. Who cares if some packet
receive is zero-copy, when you need to set up the anonymous page to
*receive* the zero copy into, which involves page allocation, page
zeroing, page table setup with VM and page table locking, etc etc.

The thing is, the whole concept of "zero-copy" is pure and utter
bullshit. Sun made a big deal about the whole concept back in the
nineties, and IT DOES NOT WORK. It's a scam. Don't buy into it. It's
crap. It's made-up and not real.

Then, once you've allocated and cleared the page, mapped it in, your
"zero-copy" model involves looking up the page in the page tables
again (locking etc), then doing that zero-copy to the page. Then, when
you remap it, you look it up in the page tables AGAIN, with locking,
move it around, have to clear the old page table entry (which involves
a locked cmpxchg64), a TLB flush with most likely a cross-CPU IPI -
since the people who do this are all threaded and want many CPU's, and
then you insert the page into the new place.

That's *insane*. It's crap. All just to try to avoid one page copy.

Don't do it. remapping games really are complete BS. They never beat
just copying the data. It's that simple.

> As things stands now, I'm afraid with a write() syscall we couldn't do
> it zerocopy.

Really, you need to rethink your whole "zerocopy" model. It's broken.
Nobody sane cares. You've bought into a model that Sun already showed
doesn't work.

The only time zero-copy works is in random benchmarks that are very
careful to not touch the data at all at any point, and also try to
make sure that the benchmark is very much single-threaded so that you
never have the whole cross-CPU IPI issue for the TLB invalidate. Then,
and only then, can zero-copy win. And it's just not a realistic
scenario.

> If it wasn't for the TLB flush of the old page, the remap_anon_pages
> variant would be more optimal than doing a copy through a write
> syscall. Is the copy cheaper than a TLB flush? I probably naively
> assumed the TLB flush was always cheaper.

A local TLB flush is cheap. That's not the problem. The problem is the
setup of the page, and the clearing of the page, and the cross-CPU TLB
flush. And the page table locking, etc etc.

So no, I'm not AT ALL worried about a single "invlpg" instruction.
That's nothing. Local CPU TLB flushes of single pages are basically
free. But that really isn't where the costs are.

Quite frankly, the *only* time page remapping has ever made sense is
when it is used for realloc() kind of purposes, where you need to
remap pages not because of zero-copy, but because you need to change
the virtual address space layout. And then you make sure it's not a
common operation, because you're not doing it as a performance win,
you're doing it because you're changing your virtual layout.

Really. Zerocopy is for benchmarks, and for things like "splice()"
when you can avoid the page tables *entirely*. But the notion that
page remapping of user pages is faster than a copy is pure and utter
garbage. It's simply not true.

So I really think you should aim for a "write()": kind of interface.

With write, you may not get the zero-copy, but on the other hand it
allows you to re-use the source buffer many times without having to
allocate new pages and map it in etc. So a "read()+write()" loop (or,
quite commonly a "generate data computationally from a compressed
source + write()" loop) is actually much more efficient than the
zero-copy remapping, because you don't have all the complexities and
overheads in creating the source page.

It is possible that that could involve "splice()" too, although I
don't really think the source data tends to be in page-aligned chunks.
But hey, splice() at least *can* be faster than copying (and then we
have vmsplice() not because it's magically faster, but because it can
under certain circumstances be worth it, and it kind of made sense to
allow the interface, but I really don't think it's used very much or
very useful).

               Linus
Dr. David Alan Gilbert Oct. 7, 2014, 5:07 p.m. UTC | #12
* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > Of course if somebody has better ideas on how to resolve an anonymous
> > userfault they're welcome.
> 
> So I'd *much* rather have a "write()" style interface (ie _copying_
> bytes from user space into a newly allocated page that gets mapped)
> than a "remap page" style interface

Something like that might work for the postcopy case; it doesn't work
for some of the other uses that need to stop a page being changed by the
guest, but then need to somehow get a copy of that page internally to QEMU,
and perhaps provide it back later.  remap_anon_pages worked for those cases
as well; I can't think of another current way of doing it in userspace.

I'm thinking here of systems for making VMs with memory larger than a single
host; that's something that's not as well thought out.  I've also seen people
writing emulation that want to trap and emulate some page accesses while
still having the original data available to the emulator itself.

So yes, OK for now, but the result is less general.

Dave


> remapping anonymous pages involves page table games that really aren't
> necessarily a good idea, and tlb invalidates for the old page etc.
> Just don't do it.
> 
>            Linus
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Paolo Bonzini Oct. 7, 2014, 5:14 p.m. UTC | #13
Il 07/10/2014 19:07, Dr. David Alan Gilbert ha scritto:
>> > 
>> > So I'd *much* rather have a "write()" style interface (ie _copying_
>> > bytes from user space into a newly allocated page that gets mapped)
>> > than a "remap page" style interface
> Something like that might work for the postcopy case; it doesn't work
> for some of the other uses that need to stop a page being changed by the
> guest, but then need to somehow get a copy of that page internally to QEMU,
> and perhaps provide it back later.

I cannot parse this.  Which uses do you have in mind?  Is it for
QEMU-specific or is it for other applications of userfaults?

As long as the page is atomically mapped, I'm not sure what the
difference from remap_anon_pages are (as far as the destination page is
concerned).  Are you thinking of having userfaults enabled on the source
as well?

Paolo

> remap_anon_pages worked for those cases
> as well; I can't think of another current way of doing it in userspace.
Dr. David Alan Gilbert Oct. 7, 2014, 5:25 p.m. UTC | #14
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> Il 07/10/2014 19:07, Dr. David Alan Gilbert ha scritto:
> >> > 
> >> > So I'd *much* rather have a "write()" style interface (ie _copying_
> >> > bytes from user space into a newly allocated page that gets mapped)
> >> > than a "remap page" style interface
> > Something like that might work for the postcopy case; it doesn't work
> > for some of the other uses that need to stop a page being changed by the
> > guest, but then need to somehow get a copy of that page internally to QEMU,
> > and perhaps provide it back later.
> 
> I cannot parse this.  Which uses do you have in mind?  Is it for
> QEMU-specific or is it for other applications of userfaults?

> As long as the page is atomically mapped, I'm not sure what the
> difference from remap_anon_pages are (as far as the destination page is
> concerned).  Are you thinking of having userfaults enabled on the source
> as well?

What I'm talking about here is when I want to stop a page being accessed by the
guest, do something with the data in qemu, and give it back to the guest sometime
later.

The main example is: Memory pools for guests where you swap RAM between a series of
VM hosts.  You have to take the page out, send it over the wire, sometime later if
the guest tries to access it you userfault and pull it back.
(There's at least one or two companies selling something like this, and at least
one Linux based implementations with their own much more involved kernel hacks)

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
diff mbox

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b402d60..4277ed7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1921,6 +1921,7 @@  int split_huge_page_to_list(struct page *page, struct list_head *list)
 {
 	struct anon_vma *anon_vma;
 	int ret = 1;
+	struct address_space *mapping;
 
 	BUG_ON(is_huge_zero_page(page));
 	BUG_ON(!PageAnon(page));
@@ -1932,10 +1933,24 @@  int split_huge_page_to_list(struct page *page, struct list_head *list)
 	 * page_lock_anon_vma_read except the write lock is taken to serialise
 	 * against parallel split or collapse operations.
 	 */
-	anon_vma = page_get_anon_vma(page);
-	if (!anon_vma)
-		goto out;
-	anon_vma_lock_write(anon_vma);
+	for (;;) {
+		mapping = ACCESS_ONCE(page->mapping);
+		anon_vma = page_get_anon_vma(page);
+		if (!anon_vma)
+			goto out;
+		anon_vma_lock_write(anon_vma);
+		/*
+		 * We don't hold the page lock here so
+		 * remap_anon_pages_huge_pmd can change the anon_vma
+		 * from under us until we obtain the anon_vma
+		 * lock. Verify that we obtained the anon_vma lock
+		 * before remap_anon_pages did.
+		 */
+		if (likely(mapping == ACCESS_ONCE(page->mapping)))
+			break;
+		anon_vma_unlock_write(anon_vma);
+		put_anon_vma(anon_vma);
+	}
 
 	ret = 0;
 	if (!PageCompound(page))
@@ -2460,6 +2475,7 @@  static void collapse_huge_page(struct mm_struct *mm,
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later hanlded by the ptep_clear_flush and the VM
 	 * handled by the anon_vma lock + PG_lock.
+	 * remap_anon_pages is prevented to race as well by the mmap_sem.
 	 */
 	down_write(&mm->mmap_sem);
 	if (unlikely(khugepaged_test_exit(mm)))
diff --git a/mm/rmap.c b/mm/rmap.c
index 3e8491c..6d875eb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -450,6 +450,7 @@  struct anon_vma *page_lock_anon_vma_read(struct page *page)
 	struct anon_vma *root_anon_vma;
 	unsigned long anon_mapping;
 
+repeat:
 	rcu_read_lock();
 	anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
@@ -488,6 +489,14 @@  struct anon_vma *page_lock_anon_vma_read(struct page *page)
 	rcu_read_unlock();
 	anon_vma_lock_read(anon_vma);
 
+	/* check if remap_anon_pages changed the anon_vma */
+	if (unlikely((unsigned long) ACCESS_ONCE(page->mapping) != anon_mapping)) {
+		anon_vma_unlock_read(anon_vma);
+		put_anon_vma(anon_vma);
+		anon_vma = NULL;
+		goto repeat;
+	}
+
 	if (atomic_dec_and_test(&anon_vma->refcount)) {
 		/*
 		 * Oops, we held the last refcount, release the lock