Message ID | 1412356087-16115-11-git-send-email-aarcange@redhat.com |
---|---|
State | New |
Headers | show |
On Fri, Oct 3, 2014 at 10:08 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > Overall this looks a fairly small change to the rmap code, notably > less intrusive than the nonlinear vmas created by remap_file_pages. Considering that remap_file_pages() was an unmitigated disaster, and -mm has a patch to remove it entirely, I'm not at all convinced this is a good argument. We thought remap_file_pages() was a good idea, and it really really really wasn't. Almost nobody used it, why would the anonymous page case be any different? Linus
* Linus Torvalds (torvalds@linux-foundation.org) wrote: > On Fri, Oct 3, 2014 at 10:08 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > Overall this looks a fairly small change to the rmap code, notably > > less intrusive than the nonlinear vmas created by remap_file_pages. > > Considering that remap_file_pages() was an unmitigated disaster, and > -mm has a patch to remove it entirely, I'm not at all convinced this > is a good argument. > > We thought remap_file_pages() was a good idea, and it really really > really wasn't. Almost nobody used it, why would the anonymous page > case be any different? I've posted code that uses this interface to qemu-devel and it works nicely; so chalk up at least one user. For the postcopy case I'm using it for, we need to place a page, atomically some thread might try and access it, and must either 1) get caught by userfault etc or 2) must succeed in it's access and we'll have that happening somewhere between thousands and millions of times to pages in no particular order, so we need to avoid creating millions of mappings. Dave > > Linus -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Hello, On Mon, Oct 06, 2014 at 09:55:41AM +0100, Dr. David Alan Gilbert wrote: > * Linus Torvalds (torvalds@linux-foundation.org) wrote: > > On Fri, Oct 3, 2014 at 10:08 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > > > Overall this looks a fairly small change to the rmap code, notably > > > less intrusive than the nonlinear vmas created by remap_file_pages. > > > > Considering that remap_file_pages() was an unmitigated disaster, and > > -mm has a patch to remove it entirely, I'm not at all convinced this > > is a good argument. > > > > We thought remap_file_pages() was a good idea, and it really really > > really wasn't. Almost nobody used it, why would the anonymous page > > case be any different? > > I've posted code that uses this interface to qemu-devel and it works nicely; > so chalk up at least one user. > > For the postcopy case I'm using it for, we need to place a page, atomically > some thread might try and access it, and must either > 1) get caught by userfault etc or > 2) must succeed in it's access > > and we'll have that happening somewhere between thousands and millions of times > to pages in no particular order, so we need to avoid creating millions of mappings. Yes, that's our current use case. Of course if somebody has better ideas on how to resolve an anonymous userfault they're welcome. How to resolve an userfault is orthogonal on how to detect it and to notify userland about it and to be notified when the userfault has been resolved. The latter is what the userfault and userfaultfd do. The former is what remap_anon_pages is used for but we could use something else too if there are better ways. mremap would clearly work too, but it would be less strict (it could lead to silent data corruption if there are bugs in the userland code), it would be slower and it would eventually a hit a -ENOMEM failure because there would be too many vmas. I could in theory drop remap_anon_pages from this patchset, but without an optimal way to resolve an userfault, the rest isn't so useful. We're currently discussing on what would be the best way to resolve a MAP_SHARED userfault on tmpfs in fact (that's not sorted yet), but so far, it seems remap_anon_pages fits the bill for anonymous memory. remap_anon_pages is not as problematic to maintain as remap_file_pages for the reason explained in the commit header, but there are other reasons: it doesn't require special pte_file and it changes nothing of how anonymous page faults works. All it requires is a loop to catch a changed page->index (previously page->index couldn't change, not it can, that's the only thing it changes). remap_file_pages complexity derives from not being allowed to change page->index during a move because the page_mapping may be bigger than 1, while that is precisely what remap_anon_pages does. As long as this "rmap preparation" is the only constraints that remap_anon_pages introduces in terms of rmap, it looks a nice not-too-intrusive solution to resolve anonymous userfaults efficiently. Introducing remap_anon_pages in fact doesn't reduce the simplification derived from the removal of remap_file_pages. As opposed removing remap_anon_pages later would only have the benefit of removing this very patch 10/17 and no other benefit. In short remap_anon_pages does this (heavily simplified): pte = *src_pte; *src_pte = 0; pte_page(pte)->index = adjusted according to src_vma/dst_vma->vm_pgoff *dst_pte = pte; It guarantees not to modify the vmas and in turn it doesn't require to take the mmap_sem for writing. To use remap_anon_pages, each thread has to create its own temporary vma with MADV_DONTFORK set on it (not formally required by the syscall strict checks, but then the application must never fork if MADV_DONTFORK isn't set or remap_anon_pages could return -EBUSY: there's no risk of silent data corruption even if the thread forks without setting MADV_DONTFORK) as source region where receive data through the network. Then after the data is fully received rmap_anon_pages moves the page from the temporary vma to the address where the userfault triggered atomically (while other threads may be attempting to access the userfault address too, thanks to remap_anon_pages atomic behavior they won't risk to ever see partial data coming from the network). remap_anon_pages as side effect creates an hole in the temporary (source) vma, so the next recv() syscall receiving data from the network will fault-in a new anonymous page without requiring any further malloc/free or other kind of vma mangling. Thanks, Andrea
On Fri, Oct 03, 2014 at 07:08:00PM +0200, Andrea Arcangeli wrote: > There's one constraint enforced to allow this simplification: the > source pages passed to remap_anon_pages must be mapped only in one > vma, but this is not a limitation when used to handle userland page > faults with MADV_USERFAULT. The source addresses passed to > remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK to > avoid any risk of the mapcount of the pages increasing, if fork runs > in parallel in another thread, before or while remap_anon_pages runs. Have you considered triggering COW instead of adding limitation on pages' mapcount? The limitation looks artificial from interface POV.
On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > Of course if somebody has better ideas on how to resolve an anonymous > userfault they're welcome. So I'd *much* rather have a "write()" style interface (ie _copying_ bytes from user space into a newly allocated page that gets mapped) than a "remap page" style interface remapping anonymous pages involves page table games that really aren't necessarily a good idea, and tlb invalidates for the old page etc. Just don't do it. Linus
Hi Kirill, On Tue, Oct 07, 2014 at 02:10:26PM +0300, Kirill A. Shutemov wrote: > On Fri, Oct 03, 2014 at 07:08:00PM +0200, Andrea Arcangeli wrote: > > There's one constraint enforced to allow this simplification: the > > source pages passed to remap_anon_pages must be mapped only in one > > vma, but this is not a limitation when used to handle userland page > > faults with MADV_USERFAULT. The source addresses passed to > > remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK to > > avoid any risk of the mapcount of the pages increasing, if fork runs > > in parallel in another thread, before or while remap_anon_pages runs. > > Have you considered triggering COW instead of adding limitation on > pages' mapcount? The limitation looks artificial from interface POV. I haven't considered it, mostly because I see it as a feature that it returns -EBUSY. I prefer to avoid the risk of userland getting a successful retval but internally the kernel silently behaving non-zerocopy by mistake because some userland bug forgot to set MADV_DONTFORK on the src_vma. COW would be not zerocopy so it's not ok. We get sub 1msec latency for userfaults through 10gbit and we don't want to risk wasting CPU caches. I however considered allowing to extend the strict behavior (i.e. the feature) later in a backwards compatible way. We could provide a non-zerocopy beahvior with a RAP_ALLOW_COW flag that would then turn the -EBUSY error into a copy. It's also more complex to implement the cow now, so it would make the code that really matters, harder to review. So it may be preferable to extend this later in a backwards compatible way with a new RAP_ALLOW_COW flag. The current handling the flags is already written in a way that should allow backwards compatible extension with RAP_ALLOW_*: #define RAP_ALLOW_SRC_HOLES (1UL<<0) SYSCALL_DEFINE4(remap_anon_pages, unsigned long, dst_start, unsigned long, src_start, unsigned long, len, unsigned long, flags) [..] long err = -EINVAL; [..] if (flags & ~RAP_ALLOW_SRC_HOLES) return err;
Hello, On Tue, Oct 07, 2014 at 08:47:59AM -0400, Linus Torvalds wrote: > On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > Of course if somebody has better ideas on how to resolve an anonymous > > userfault they're welcome. > > So I'd *much* rather have a "write()" style interface (ie _copying_ > bytes from user space into a newly allocated page that gets mapped) > than a "remap page" style interface > > remapping anonymous pages involves page table games that really aren't > necessarily a good idea, and tlb invalidates for the old page etc. > Just don't do it. I see what you mean. The only cons I see is that we couldn't use then recv(tmp_addr, PAGE_SIZE), remap_anon_pages(faultaddr, tmp_addr, PAGE_SIZE, ..) and retain the zerocopy behavior. Or how could we? There's no recvfile(userfaultfd, socketfd, PAGE_SIZE). Ideally if we could prevent the page data coming from the network to ever become visible in the kernel we could avoid the TLB flush and also be zerocopy but I can't see how we could achieve that. The page data could come through a ssh pipe or anything (qemu supports all kind of network transports for live migration), this is why leaving the network protocol into userland is preferable. As things stands now, I'm afraid with a write() syscall we couldn't do it zerocopy. We'd still need to receive the memory in a temporary page and then copy it to a kernel page (invisible to userland while we write to it) to later map into the userfault address. If it wasn't for the TLB flush of the old page, the remap_anon_pages variant would be more optimal than doing a copy through a write syscall. Is the copy cheaper than a TLB flush? I probably naively assumed the TLB flush was always cheaper. Now another idea that comes to mind to be able to add the ability to switch between copy and TLB flush is using a RAP_FORCE_COPY flag, that would then do a copy inside remap_anon_pages and leave the original page mapped in place... (and such flag would also disable the -EBUSY error if page_mapcount is > 1). So then if the RAP_FORCE_COPY flag is set remap_anon_pages would behave like you suggested (but with a mremap-like interface, instead of a write syscall) and we could benchmark the difference between copy and TLB flush too. We could even periodically benchmark it at runtime and switch over the faster method (the more CPUs there are in the host and the more threads the process has, the faster the copy will be compared to the TLB flush). Of course in terms of API I could implement the exact same mechanism as described above for remap_anon_pages inside a write() to the userfaultfd (it's a pseudo inode). It'd need two different commands to prepare for the coming write (with a len multiple of PAGE_SIZE) to know the address where the page should be mapped into and if to behave zerocopy or if to skip the TLB flush and copy. Because the copy vs TLB flush trade off is possible to achieve with both interfaces, I think it really boils down to choosing between a mremap like interface, or file+commands protocol interface. I tend to like mremap more, that's why I opted for a remap_anon_pages syscall kept orthogonal to the userfaultfd functionality (remap_anon_pages could be also used standalone as an accelerated mremap in some circumstances) but nothing prevents to just embed the same mechanism inside userfaultfd if a file+commands API is preferable. Or we could add a different syscall (separated from userfaultfd) that creates another pseudofd to write a command plus the page data into it. Just I wouldn't see the point of creating a pseudofd just to copy a page atomically, the write() syscall would look more ideal if the userfaultfd is already open for other reasons and the pseudofd overhead is required anyway. Last thing to keep in mind is that if using userfaults with SIGBUS and without userfaultfd, remap_anon_pages would have been still useful, so if we retain the SIGBUS behavior for volatile pages and we don't force the usage for userfaultfd, it may be cleaner not to use userfaultfd but a separate pseudofd to do the write() syscall though. Otherwise the app would need to open the userfaultfd to resolve the fault even though it's not using the userfaultfd protocol which doesn't look an intuitive interface to me. Comments welcome. Thanks, Andrea
On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote: > mremap like interface, or file+commands protocol interface. I tend to > like mremap more, that's why I opted for a remap_anon_pages syscall > kept orthogonal to the userfaultfd functionality (remap_anon_pages > could be also used standalone as an accelerated mremap in some > circumstances) but nothing prevents to just embed the same mechanism Sorry for the self followup, but something else comes to mind to elaborate this further. In term of interfaces, the most efficient I could think of to minimize the enter/exit kernel, would be to append the "source address" of the data received from the network transport, to the userfaultfd_write() command (by appending 8 bytes to the wakeup command). Said that, mixing the mechanism to be notified about userfaults with the mechanism to resolve an userfault to me looks a complication. I kind of liked to keep the userfaultfd protocol is very simple and doing just its thing. The userfaultfd doesn't need to know how the userfault was resolved, even mremap would work theoretically (until we run out of vmas). I thought it was simpler to keep it that way. However if we want to resolve the fault with a "write()" syscall this may be the most efficient way to do it, as we're already doing a write() into the pseudofd to wakeup the page fault that contains the destination address, I just need to append the source address to the wakeup command. I probably grossly overestimated the benefits of resolving the userfault with a zerocopy page move, sorry. So if we entirely drop the zerocopy behavior and the TLB flush of the old page like you suggested, the way to keep the userfaultfd mechanism decoupled from the userfault resolution mechanism would be to implement an atomic-copy syscall. That would work for SIGBUS userfaults too without requiring a pseudofd then. It would be enough then to call mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraints that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic wouldn't page fault or call GUP into the destination address (it can't otherwise the in-flight partial copy would be visible to the process, breaking the atomicity of the copy), but it would fill in the pte/trans_huge_pmd with the same strict behavior that remap_anon_pages currently has (in turn it would by design bypass the VM_USERFAULT check and be ideal for resolving userfaults). mcopy_atomic could then be also extended to tmpfs and it would work without requiring the source page to be a tmpfs page too without having to convert page types on the fly. If I add mcopy_atomic, the patch in subject (10/17) can be dropped of course so it'd be even less intrusive than the current remap_anon_pages and it would require zero TLB flush during its runtime (it would just require an atomic copy). So should I try to embed a mcopy_atomic inside userfault_write or can I expose it to userland as a standalone new syscall? Or should I do something different? Comments? Thanks, Andrea
On Tue, Oct 7, 2014 at 8:52 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote: >> mremap like interface, or file+commands protocol interface. I tend to >> like mremap more, that's why I opted for a remap_anon_pages syscall >> kept orthogonal to the userfaultfd functionality (remap_anon_pages >> could be also used standalone as an accelerated mremap in some >> circumstances) but nothing prevents to just embed the same mechanism > > Sorry for the self followup, but something else comes to mind to > elaborate this further. > > In term of interfaces, the most efficient I could think of to minimize > the enter/exit kernel, would be to append the "source address" of the > data received from the network transport, to the userfaultfd_write() > command (by appending 8 bytes to the wakeup command). Said that, > mixing the mechanism to be notified about userfaults with the > mechanism to resolve an userfault to me looks a complication. I kind > of liked to keep the userfaultfd protocol is very simple and doing > just its thing. The userfaultfd doesn't need to know how the userfault > was resolved, even mremap would work theoretically (until we run out > of vmas). I thought it was simpler to keep it that way. However if we > want to resolve the fault with a "write()" syscall this may be the > most efficient way to do it, as we're already doing a write() into the > pseudofd to wakeup the page fault that contains the destination > address, I just need to append the source address to the wakeup command. > > I probably grossly overestimated the benefits of resolving the > userfault with a zerocopy page move, sorry. So if we entirely drop the > zerocopy behavior and the TLB flush of the old page like you > suggested, the way to keep the userfaultfd mechanism decoupled from > the userfault resolution mechanism would be to implement an > atomic-copy syscall. That would work for SIGBUS userfaults too without > requiring a pseudofd then. It would be enough then to call > mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraints > that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic > wouldn't page fault or call GUP into the destination address (it can't > otherwise the in-flight partial copy would be visible to the process, > breaking the atomicity of the copy), but it would fill in the > pte/trans_huge_pmd with the same strict behavior that remap_anon_pages > currently has (in turn it would by design bypass the VM_USERFAULT > check and be ideal for resolving userfaults). At the risk of asking a possibly useless question, would it make sense to splice data into a userfaultfd? --Andy > > mcopy_atomic could then be also extended to tmpfs and it would work > without requiring the source page to be a tmpfs page too without > having to convert page types on the fly. > > If I add mcopy_atomic, the patch in subject (10/17) can be dropped of > course so it'd be even less intrusive than the current > remap_anon_pages and it would require zero TLB flush during its > runtime (it would just require an atomic copy). > > So should I try to embed a mcopy_atomic inside userfault_write or can > I expose it to userland as a standalone new syscall? Or should I do > something different? Comments? > > Thanks, > Andrea
On Tue, Oct 07, 2014 at 05:52:47PM +0200, Andrea Arcangeli wrote: > I probably grossly overestimated the benefits of resolving the > userfault with a zerocopy page move, sorry. [...] For posterity, I think it's worth noting that most expensive aspect of a TLB shootdown is the interprocessor interrupt necessary to flush other CPUs' TLBs. On a many-core machine, copying 4K of data looks pretty cheap compared to taking an interrupt and invalidating TLBs on many cores :-) > [...] So if we entirely drop the > zerocopy behavior and the TLB flush of the old page like you > suggested, the way to keep the userfaultfd mechanism decoupled from > the userfault resolution mechanism would be to implement an > atomic-copy syscall. That would work for SIGBUS userfaults too without > requiring a pseudofd then. It would be enough then to call > mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraints > that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic > wouldn't page fault or call GUP into the destination address (it can't > otherwise the in-flight partial copy would be visible to the process, > breaking the atomicity of the copy), but it would fill in the > pte/trans_huge_pmd with the same strict behavior that remap_anon_pages > currently has (in turn it would by design bypass the VM_USERFAULT > check and be ideal for resolving userfaults). > > mcopy_atomic could then be also extended to tmpfs and it would work > without requiring the source page to be a tmpfs page too without > having to convert page types on the fly. > > If I add mcopy_atomic, the patch in subject (10/17) can be dropped of > course so it'd be even less intrusive than the current > remap_anon_pages and it would require zero TLB flush during its > runtime (it would just require an atomic copy). I like this new approach. It will be good to have a single interface for resolving anon and tmpfs userfaults. > So should I try to embed a mcopy_atomic inside userfault_write or can > I expose it to userland as a standalone new syscall? Or should I do > something different? Comments? One interesting (ab)use of userfault_write would be that the faulting process and the fault-handling process could be different, which would be necessary for post-copy live migration in CRIU (http://criu.org). Aside from the asthetic difference, I can't think of any advantage in favor of a syscall. Peter
On Tue, Oct 7, 2014 at 10:19 AM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > I see what you mean. The only cons I see is that we couldn't use then > recv(tmp_addr, PAGE_SIZE), remap_anon_pages(faultaddr, tmp_addr, > PAGE_SIZE, ..) and retain the zerocopy behavior. Or how could we? > There's no recvfile(userfaultfd, socketfd, PAGE_SIZE). You're doing completelt faulty math, and you haven't thought it through. Your "zero-copy" case is no such thing. Who cares if some packet receive is zero-copy, when you need to set up the anonymous page to *receive* the zero copy into, which involves page allocation, page zeroing, page table setup with VM and page table locking, etc etc. The thing is, the whole concept of "zero-copy" is pure and utter bullshit. Sun made a big deal about the whole concept back in the nineties, and IT DOES NOT WORK. It's a scam. Don't buy into it. It's crap. It's made-up and not real. Then, once you've allocated and cleared the page, mapped it in, your "zero-copy" model involves looking up the page in the page tables again (locking etc), then doing that zero-copy to the page. Then, when you remap it, you look it up in the page tables AGAIN, with locking, move it around, have to clear the old page table entry (which involves a locked cmpxchg64), a TLB flush with most likely a cross-CPU IPI - since the people who do this are all threaded and want many CPU's, and then you insert the page into the new place. That's *insane*. It's crap. All just to try to avoid one page copy. Don't do it. remapping games really are complete BS. They never beat just copying the data. It's that simple. > As things stands now, I'm afraid with a write() syscall we couldn't do > it zerocopy. Really, you need to rethink your whole "zerocopy" model. It's broken. Nobody sane cares. You've bought into a model that Sun already showed doesn't work. The only time zero-copy works is in random benchmarks that are very careful to not touch the data at all at any point, and also try to make sure that the benchmark is very much single-threaded so that you never have the whole cross-CPU IPI issue for the TLB invalidate. Then, and only then, can zero-copy win. And it's just not a realistic scenario. > If it wasn't for the TLB flush of the old page, the remap_anon_pages > variant would be more optimal than doing a copy through a write > syscall. Is the copy cheaper than a TLB flush? I probably naively > assumed the TLB flush was always cheaper. A local TLB flush is cheap. That's not the problem. The problem is the setup of the page, and the clearing of the page, and the cross-CPU TLB flush. And the page table locking, etc etc. So no, I'm not AT ALL worried about a single "invlpg" instruction. That's nothing. Local CPU TLB flushes of single pages are basically free. But that really isn't where the costs are. Quite frankly, the *only* time page remapping has ever made sense is when it is used for realloc() kind of purposes, where you need to remap pages not because of zero-copy, but because you need to change the virtual address space layout. And then you make sure it's not a common operation, because you're not doing it as a performance win, you're doing it because you're changing your virtual layout. Really. Zerocopy is for benchmarks, and for things like "splice()" when you can avoid the page tables *entirely*. But the notion that page remapping of user pages is faster than a copy is pure and utter garbage. It's simply not true. So I really think you should aim for a "write()": kind of interface. With write, you may not get the zero-copy, but on the other hand it allows you to re-use the source buffer many times without having to allocate new pages and map it in etc. So a "read()+write()" loop (or, quite commonly a "generate data computationally from a compressed source + write()" loop) is actually much more efficient than the zero-copy remapping, because you don't have all the complexities and overheads in creating the source page. It is possible that that could involve "splice()" too, although I don't really think the source data tends to be in page-aligned chunks. But hey, splice() at least *can* be faster than copying (and then we have vmsplice() not because it's magically faster, but because it can under certain circumstances be worth it, and it kind of made sense to allow the interface, but I really don't think it's used very much or very useful). Linus
* Linus Torvalds (torvalds@linux-foundation.org) wrote: > On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli <aarcange@redhat.com> wrote: > > > > Of course if somebody has better ideas on how to resolve an anonymous > > userfault they're welcome. > > So I'd *much* rather have a "write()" style interface (ie _copying_ > bytes from user space into a newly allocated page that gets mapped) > than a "remap page" style interface Something like that might work for the postcopy case; it doesn't work for some of the other uses that need to stop a page being changed by the guest, but then need to somehow get a copy of that page internally to QEMU, and perhaps provide it back later. remap_anon_pages worked for those cases as well; I can't think of another current way of doing it in userspace. I'm thinking here of systems for making VMs with memory larger than a single host; that's something that's not as well thought out. I've also seen people writing emulation that want to trap and emulate some page accesses while still having the original data available to the emulator itself. So yes, OK for now, but the result is less general. Dave > remapping anonymous pages involves page table games that really aren't > necessarily a good idea, and tlb invalidates for the old page etc. > Just don't do it. > > Linus -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Il 07/10/2014 19:07, Dr. David Alan Gilbert ha scritto: >> > >> > So I'd *much* rather have a "write()" style interface (ie _copying_ >> > bytes from user space into a newly allocated page that gets mapped) >> > than a "remap page" style interface > Something like that might work for the postcopy case; it doesn't work > for some of the other uses that need to stop a page being changed by the > guest, but then need to somehow get a copy of that page internally to QEMU, > and perhaps provide it back later. I cannot parse this. Which uses do you have in mind? Is it for QEMU-specific or is it for other applications of userfaults? As long as the page is atomically mapped, I'm not sure what the difference from remap_anon_pages are (as far as the destination page is concerned). Are you thinking of having userfaults enabled on the source as well? Paolo > remap_anon_pages worked for those cases > as well; I can't think of another current way of doing it in userspace.
* Paolo Bonzini (pbonzini@redhat.com) wrote: > Il 07/10/2014 19:07, Dr. David Alan Gilbert ha scritto: > >> > > >> > So I'd *much* rather have a "write()" style interface (ie _copying_ > >> > bytes from user space into a newly allocated page that gets mapped) > >> > than a "remap page" style interface > > Something like that might work for the postcopy case; it doesn't work > > for some of the other uses that need to stop a page being changed by the > > guest, but then need to somehow get a copy of that page internally to QEMU, > > and perhaps provide it back later. > > I cannot parse this. Which uses do you have in mind? Is it for > QEMU-specific or is it for other applications of userfaults? > As long as the page is atomically mapped, I'm not sure what the > difference from remap_anon_pages are (as far as the destination page is > concerned). Are you thinking of having userfaults enabled on the source > as well? What I'm talking about here is when I want to stop a page being accessed by the guest, do something with the data in qemu, and give it back to the guest sometime later. The main example is: Memory pools for guests where you swap RAM between a series of VM hosts. You have to take the page out, send it over the wire, sometime later if the guest tries to access it you userfault and pull it back. (There's at least one or two companies selling something like this, and at least one Linux based implementations with their own much more involved kernel hacks) Dave -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index b402d60..4277ed7 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1921,6 +1921,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) { struct anon_vma *anon_vma; int ret = 1; + struct address_space *mapping; BUG_ON(is_huge_zero_page(page)); BUG_ON(!PageAnon(page)); @@ -1932,10 +1933,24 @@ int split_huge_page_to_list(struct page *page, struct list_head *list) * page_lock_anon_vma_read except the write lock is taken to serialise * against parallel split or collapse operations. */ - anon_vma = page_get_anon_vma(page); - if (!anon_vma) - goto out; - anon_vma_lock_write(anon_vma); + for (;;) { + mapping = ACCESS_ONCE(page->mapping); + anon_vma = page_get_anon_vma(page); + if (!anon_vma) + goto out; + anon_vma_lock_write(anon_vma); + /* + * We don't hold the page lock here so + * remap_anon_pages_huge_pmd can change the anon_vma + * from under us until we obtain the anon_vma + * lock. Verify that we obtained the anon_vma lock + * before remap_anon_pages did. + */ + if (likely(mapping == ACCESS_ONCE(page->mapping))) + break; + anon_vma_unlock_write(anon_vma); + put_anon_vma(anon_vma); + } ret = 0; if (!PageCompound(page)) @@ -2460,6 +2475,7 @@ static void collapse_huge_page(struct mm_struct *mm, * Prevent all access to pagetables with the exception of * gup_fast later hanlded by the ptep_clear_flush and the VM * handled by the anon_vma lock + PG_lock. + * remap_anon_pages is prevented to race as well by the mmap_sem. */ down_write(&mm->mmap_sem); if (unlikely(khugepaged_test_exit(mm))) diff --git a/mm/rmap.c b/mm/rmap.c index 3e8491c..6d875eb 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -450,6 +450,7 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page) struct anon_vma *root_anon_vma; unsigned long anon_mapping; +repeat: rcu_read_lock(); anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping); if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON) @@ -488,6 +489,14 @@ struct anon_vma *page_lock_anon_vma_read(struct page *page) rcu_read_unlock(); anon_vma_lock_read(anon_vma); + /* check if remap_anon_pages changed the anon_vma */ + if (unlikely((unsigned long) ACCESS_ONCE(page->mapping) != anon_mapping)) { + anon_vma_unlock_read(anon_vma); + put_anon_vma(anon_vma); + anon_vma = NULL; + goto repeat; + } + if (atomic_dec_and_test(&anon_vma->refcount)) { /* * Oops, we held the last refcount, release the lock
remap_anon_pages (unlike remap_file_pages) tries to be non intrusive in the rmap code. As far as the rmap code is concerned, rmap_anon_pages only alters the page->mapping and page->index. It does it while holding the page lock. However there are a few places that in presence of anon pages are allowed to do rmap walks without the page lock (split_huge_page and page_referenced_anon). Those places that are doing rmap walks without taking the page lock first, must be updated to re-check that the page->mapping didn't change after they obtained the anon_vma lock. remap_anon_pages takes the anon_vma lock for writing before altering the page->mapping, so if the page->mapping is still the same after obtaining the anon_vma lock (without the page lock), the rmap walks can go ahead safely (and remap_anon_pages will wait them to complete before proceeding). remap_anon_pages serializes against itself with the page lock. All other places taking the anon_vma lock while holding the mmap_sem for writing, don't need to check if the page->mapping has changed after taking the anon_vma lock, regardless of the page lock, because remap_anon_pages holds the mmap_sem for reading. Overall this looks a fairly small change to the rmap code, notably less intrusive than the nonlinear vmas created by remap_file_pages. There's one constraint enforced to allow this simplification: the source pages passed to remap_anon_pages must be mapped only in one vma, but this is not a limitation when used to handle userland page faults with MADV_USERFAULT. The source addresses passed to remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK to avoid any risk of the mapcount of the pages increasing, if fork runs in parallel in another thread, before or while remap_anon_pages runs. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> --- mm/huge_memory.c | 24 ++++++++++++++++++++---- mm/rmap.c | 9 +++++++++ 2 files changed, 29 insertions(+), 4 deletions(-)