Patchwork rdma: don't make pages writeable if not requiested

login
register
mail settings
Submitter Michael S. Tsirkin
Date March 21, 2013, 6:18 a.m.
Message ID <20130321061838.GA28319@redhat.com>
Download mbox | patch
Permalink /patch/229561/
State New
Headers show

Comments

Michael S. Tsirkin - March 21, 2013, 6:18 a.m.
core/umem.c seems to get the arguments to get_user_pages
in the reverse order: it sets writeable flag and
breaks COW for MAP_SHARED if and only if hardware needs to
write the page.

This breaks memory overcommit for users such as KVM:
each time we try to register a page to send it to remote, this
breaks COW.  It seems that for applications that only have
REMOTE_READ permission, there is no reason to break COW at all.

If the page that is COW has lots of copies, this makes the user process
quickly exceed the cgroups memory limit.  This makes RDMA mostly useless
for virtualization, thus the stable tag.

Reported-by: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

Note: compile-tested only, I don't have RDMA hardware at the moment.
Michael, could you please try this patch (also fixing your
usespace code not to request write access) and report?

Note2: grep for get_user_pages in infiniband drivers turns up
lots of users who set write to 1 unconditionally.
These might be bugs too, should be checked.

 drivers/infiniband/core/umem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Roland Dreier - March 21, 2013, 6:55 a.m.
On Wed, Mar 20, 2013 at 11:18 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> core/umem.c seems to get the arguments to get_user_pages
> in the reverse order: it sets writeable flag and
> breaks COW for MAP_SHARED if and only if hardware needs to
> write the page.
>
> This breaks memory overcommit for users such as KVM:
> each time we try to register a page to send it to remote, this
> breaks COW.  It seems that for applications that only have
> REMOTE_READ permission, there is no reason to break COW at all.

I proposed a similar (but not exactly the same, see below) patch a
while ago: https://lkml.org/lkml/2012/1/26/7 but read the thread,
especially https://lkml.org/lkml/2012/2/6/265

I think this change will break the case where userspace tries to
register an MR with read-only permission, but intends locally through
the CPU to write to the memory.  If the memory registration is done
while the memory is mapped read-only but has VM_MAYWRITE, then
userspace gets into trouble when COW happens.  In the case you're
describing (although I'm not sure where in KVM we're talking about
using RDMA), what happens if you register memory with only REMOTE_READ
and then COW is triggered because of a local write?  (I'm assuming you
don't want remote access to continue to get the old contents of the
page)

I have to confess that I still haven't had a chance to implement the
proposed FOLL_FOLLOW solution to all of this.

> If the page that is COW has lots of copies, this makes the user process
> quickly exceed the cgroups memory limit.  This makes RDMA mostly useless
> for virtualization, thus the stable tag.

The actual problem description here is a bit too terse for me to
understand.  How do we end up with lots of copies of a COW page?  Why
is RDMA registering the memory any more  special than having everyone
who maps this page actually writing to it and triggering COW?

>                 ret = get_user_pages(current, current->mm, cur_base,
>                                      min_t(unsigned long, npages,
>                                            PAGE_SIZE / sizeof (struct page *)),
> -                                    1, !umem->writable, page_list, vma_list);
> +                                    !umem->writable, 1, page_list, vma_list);

The first two parameters in this line being changed are "write" and "force".

I think if we do change this, then we need to pass umem->writable (as
opposed to !umem->writable) for the "write" parameter.  Not sure
whether "force" makes sense or not.

 - R.
Michael S. Tsirkin - March 21, 2013, 7:03 a.m.
On Wed, Mar 20, 2013 at 11:55:54PM -0700, Roland Dreier wrote:
> On Wed, Mar 20, 2013 at 11:18 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > core/umem.c seems to get the arguments to get_user_pages
> > in the reverse order: it sets writeable flag and
> > breaks COW for MAP_SHARED if and only if hardware needs to
> > write the page.
> >
> > This breaks memory overcommit for users such as KVM:
> > each time we try to register a page to send it to remote, this
> > breaks COW.  It seems that for applications that only have
> > REMOTE_READ permission, there is no reason to break COW at all.
> 
> I proposed a similar (but not exactly the same, see below) patch a
> while ago: https://lkml.org/lkml/2012/1/26/7 but read the thread,
> especially https://lkml.org/lkml/2012/2/6/265
> 
> I think this change will break the case where userspace tries to
> register an MR with read-only permission, but intends locally through
> the CPU to write to the memory.

Shouldn't it set LOCAL_WRITE then?
Roland Dreier - March 21, 2013, 7:15 a.m.
>> I think this change will break the case where userspace tries to
>> register an MR with read-only permission, but intends locally through
>> the CPU to write to the memory.

> Shouldn't it set LOCAL_WRITE then?

We're talking about the permissions for the register MR operation,
right?  (That's what the kernel RDMA driver code that does
get_user_pages() sees)

In that case, no, I don't see any reason for LOCAL_WRITE, since the
only RDMA operations that will access this memory are remote reads.
The writing (that triggers COW) is coming from normal process access
triggering a page fault, etc.  This is a pretty standard way of using
RDMA... For example, I allocate some memory and register it for RDMA
read (and pass the R_Key to the remote system) with only REMOTE_READ
permission.  Then I fill in the memory with the results of some
computation and the remote system does an RDMA read to get those
results.

 - R.
Michael S. Tsirkin - March 21, 2013, 8:51 a.m.
On Thu, Mar 21, 2013 at 12:15:33AM -0700, Roland Dreier wrote:
> >> I think this change will break the case where userspace tries to
> >> register an MR with read-only permission, but intends locally through
> >> the CPU to write to the memory.
> 
> > Shouldn't it set LOCAL_WRITE then?
> 
> We're talking about the permissions for the register MR operation,
> right?  (That's what the kernel RDMA driver code that does
> get_user_pages() sees)
> 
> In that case, no, I don't see any reason for LOCAL_WRITE, since the
> only RDMA operations that will access this memory are remote reads.

What is the meaning of LOCAL_WRITE then? There are no local
RDMA writes as far as I can see.

> The writing (that triggers COW) is coming from normal process access
> triggering a page fault, etc.  This is a pretty standard way of using
> RDMA... For example, I allocate some memory and register it for RDMA
> read (and pass the R_Key to the remote system) with only REMOTE_READ
> permission.  Then I fill in the memory with the results of some
> computation and the remote system does an RDMA read to get those
> results.
> 
>  - R.

OK then what we need is a new flag saying "I really do not
intend to write into this memory please do not break
COW or do anything else just in case I do".
Roland Dreier - March 21, 2013, 9:13 a.m.
On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> In that case, no, I don't see any reason for LOCAL_WRITE, since the
>> only RDMA operations that will access this memory are remote reads.
>
> What is the meaning of LOCAL_WRITE then? There are no local
> RDMA writes as far as I can see.

Umm, it means you're giving the local adapter permission to write to
that memory.  So you can use it as a receive buffer or as the target
for remote data from an RDMA read operation.

> OK then what we need is a new flag saying "I really do not
> intend to write into this memory please do not break
> COW or do anything else just in case I do".

Isn't that a shared read-only mapping?

 - R.
Michael S. Tsirkin - March 21, 2013, 9:39 a.m.
On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote:
> On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> In that case, no, I don't see any reason for LOCAL_WRITE, since the
> >> only RDMA operations that will access this memory are remote reads.
> >
> > What is the meaning of LOCAL_WRITE then? There are no local
> > RDMA writes as far as I can see.
> 
> Umm, it means you're giving the local adapter permission to write to
> that memory.  So you can use it as a receive buffer or as the target
> for remote data from an RDMA read operation.

Well RDMA read has it's own flag, IB_ACCESS_REMOTE_READ.
I don't see why do you need to give adapter permission
to use memory as a receive buffer given that
you must pass the address in the post receive verb,
but maybe that's what the IB spec says so that's what
the applications assumed?

> > OK then what we need is a new flag saying "I really do not
> > intend to write into this memory please do not break
> > COW or do anything else just in case I do".
> 
> Isn't that a shared read-only mapping?
> 
>  - R.

Nothing to do with how the page is mapped.  We can and do write the page
before registration. BTW umem.c passes in force so it breaks COW
even for read-only mappings, right?
mrhines@linux.vnet.ibm.com - March 21, 2013, 12:23 p.m.
Yes, I'd be happy to try the patch.

Got meetings all day...... but will dive in soon.

On 03/21/2013 02:18 AM, Michael S. Tsirkin wrote:
> core/umem.c seems to get the arguments to get_user_pages
> in the reverse order: it sets writeable flag and
> breaks COW for MAP_SHARED if and only if hardware needs to
> write the page.
>
> This breaks memory overcommit for users such as KVM:
> each time we try to register a page to send it to remote, this
> breaks COW.  It seems that for applications that only have
> REMOTE_READ permission, there is no reason to break COW at all.
>
> If the page that is COW has lots of copies, this makes the user process
> quickly exceed the cgroups memory limit.  This makes RDMA mostly useless
> for virtualization, thus the stable tag.
>
> Reported-by: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>
> Note: compile-tested only, I don't have RDMA hardware at the moment.
> Michael, could you please try this patch (also fixing your
> usespace code not to request write access) and report?
>
> Note2: grep for get_user_pages in infiniband drivers turns up
> lots of users who set write to 1 unconditionally.
> These might be bugs too, should be checked.
>
>   drivers/infiniband/core/umem.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index a841123..5929598 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -152,7 +152,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
>   		ret = get_user_pages(current, current->mm, cur_base,
>   				     min_t(unsigned long, npages,
>   					   PAGE_SIZE / sizeof (struct page *)),
> -				     1, !umem->writable, page_list, vma_list);
> +				     !umem->writable, 1, page_list, vma_list);
>
>   		if (ret < 0)
>   			goto out;
Michael S. Tsirkin - March 21, 2013, 12:32 p.m.
On Thu, Mar 21, 2013 at 08:23:48AM -0400, Michael R. Hines wrote:
> Yes, I'd be happy to try the patch.
> 
> Got meetings all day...... but will dive in soon.

The patch is unlikely to be the final version. In particular
you need to change !umem->writable to umem->writable.

> On 03/21/2013 02:18 AM, Michael S. Tsirkin wrote:
> >core/umem.c seems to get the arguments to get_user_pages
> >in the reverse order: it sets writeable flag and
> >breaks COW for MAP_SHARED if and only if hardware needs to
> >write the page.
> >
> >This breaks memory overcommit for users such as KVM:
> >each time we try to register a page to send it to remote, this
> >breaks COW.  It seems that for applications that only have
> >REMOTE_READ permission, there is no reason to break COW at all.
> >
> >If the page that is COW has lots of copies, this makes the user process
> >quickly exceed the cgroups memory limit.  This makes RDMA mostly useless
> >for virtualization, thus the stable tag.
> >
> >Reported-by: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
> >Cc: stable@vger.kernel.org
> >Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >---
> >
> >Note: compile-tested only, I don't have RDMA hardware at the moment.
> >Michael, could you please try this patch (also fixing your
> >usespace code not to request write access) and report?
> >
> >Note2: grep for get_user_pages in infiniband drivers turns up
> >lots of users who set write to 1 unconditionally.
> >These might be bugs too, should be checked.
> >
> >  drivers/infiniband/core/umem.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> >index a841123..5929598 100644
> >--- a/drivers/infiniband/core/umem.c
> >+++ b/drivers/infiniband/core/umem.c
> >@@ -152,7 +152,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> >  		ret = get_user_pages(current, current->mm, cur_base,
> >  				     min_t(unsigned long, npages,
> >  					   PAGE_SIZE / sizeof (struct page *)),
> >-				     1, !umem->writable, page_list, vma_list);
> >+				     !umem->writable, 1, page_list, vma_list);
> >
> >  		if (ret < 0)
> >  			goto out;
Jason Gunthorpe - March 21, 2013, 5:11 p.m.
On Thu, Mar 21, 2013 at 11:39:47AM +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote:
> > On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >> In that case, no, I don't see any reason for LOCAL_WRITE, since the
> > >> only RDMA operations that will access this memory are remote reads.
> > >
> > > What is the meaning of LOCAL_WRITE then? There are no local
> > > RDMA writes as far as I can see.
> > 
> > Umm, it means you're giving the local adapter permission to write to
> > that memory.  So you can use it as a receive buffer or as the target
> > for remote data from an RDMA read operation.
> 
> Well RDMA read has it's own flag, IB_ACCESS_REMOTE_READ.
> I don't see why do you need to give adapter permission

The access flags have to do with what independent access remote nodes
get. There are four major cases:

access = IBV_ACCESS_REMOTE_READ says the adaptor will let remote nodes
read the memory.

access = 0 (ie IBV_ACCESS_LOCAL_READ) says that only the adaptor, under
the direct control of the application, can read this memory. Remote
nodes are barred.

access = IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_LOCAL_WRITE says the adaptor
will let remote nodes write the memory

access = IBV_ACCESS_LOCAL_WRITE bars remote nodes from writing to that
memory. Only the adaptor, under the direct control of the application,
can write the memory.

The fact LOCAL_READ/REMOTE_READ exists makes it possible to do what
you want - it guarentees the adaptor will never write to this memory
under any circumstances, so you can leave the page COW'd. If
LOCAL_WRITE was implied then you'd have to COW everything..

Would it be better to drive the COW break decision off the region's MM
flags? Ie if the memory is mapped read only into the process then you
can keep the COW at the RDMA layer, otherwise you should break
it. That seems more natural than a new flag?

Jason
Michael S. Tsirkin - March 21, 2013, 5:15 p.m.
On Thu, Mar 21, 2013 at 11:11:15AM -0600, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2013 at 11:39:47AM +0200, Michael S. Tsirkin wrote:
> > On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote:
> > > On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >> In that case, no, I don't see any reason for LOCAL_WRITE, since the
> > > >> only RDMA operations that will access this memory are remote reads.
> > > >
> > > > What is the meaning of LOCAL_WRITE then? There are no local
> > > > RDMA writes as far as I can see.
> > > 
> > > Umm, it means you're giving the local adapter permission to write to
> > > that memory.  So you can use it as a receive buffer or as the target
> > > for remote data from an RDMA read operation.
> > 
> > Well RDMA read has it's own flag, IB_ACCESS_REMOTE_READ.
> > I don't see why do you need to give adapter permission
> 
> The access flags have to do with what independent access remote nodes
> get. There are four major cases:
> 
> access = IBV_ACCESS_REMOTE_READ says the adaptor will let remote nodes
> read the memory.
> 
> access = 0 (ie IBV_ACCESS_LOCAL_READ) says that only the adaptor, under
> the direct control of the application, can read this memory. Remote
> nodes are barred.
> 
> access = IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_LOCAL_WRITE says the adaptor
> will let remote nodes write the memory
> 
> access = IBV_ACCESS_LOCAL_WRITE bars remote nodes from writing to that
> memory. Only the adaptor, under the direct control of the application,
> can write the memory.
> 
> The fact LOCAL_READ/REMOTE_READ exists makes it possible to do what
> you want - it guarentees the adaptor will never write to this memory
> under any circumstances, so you can leave the page COW'd. If
> LOCAL_WRITE was implied then you'd have to COW everything..
> 
> Would it be better to drive the COW break decision off the region's MM
> flags? Ie if the memory is mapped read only into the process then you
> can keep the COW at the RDMA layer, otherwise you should break
> it. That seems more natural than a new flag?
> 
> Jason

No because application does this:
init page

...

after a lot of time

..

register
send
unregister

so it can not be read only.
Jason Gunthorpe - March 21, 2013, 5:21 p.m.
On Thu, Mar 21, 2013 at 07:15:25PM +0200, Michael S. Tsirkin wrote:

> No because application does this:
> init page
> 
> ...
> 
> after a lot of time
> 
> ..
> 
> register
> send
> unregister
> 
> so it can not be read only.

mprotect(READONLY)
register
send
unregister
mprotect(WRITABLE)

?

With something like GIFT the app already has to give up writing to the
pages while they are moving, so changing the protection seems in line
with that?

Jason
Michael S. Tsirkin - March 21, 2013, 5:42 p.m.
On Thu, Mar 21, 2013 at 11:21:50AM -0600, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2013 at 07:15:25PM +0200, Michael S. Tsirkin wrote:
> 
> > No because application does this:
> > init page
> > 
> > ...
> > 
> > after a lot of time
> > 
> > ..
> > 
> > register
> > send
> > unregister
> > 
> > so it can not be read only.
> 
> mprotect(READONLY)
> register
> send
> unregister
> mprotect(WRITABLE)
> 
> ?
> With something like GIFT the app already has to give up writing to the
> pages while they are moving, so changing the protection seems in line
> with that?
> 
> Jason

It doesn't actually, and our app would sometimes write to these pages.
It simply does not care which version does the remote get in this case
since we track writes and resend later.

Also this is per-page, MRs have byte granularity so easier to use.
Jason Gunthorpe - March 21, 2013, 5:57 p.m.
On Thu, Mar 21, 2013 at 07:42:37PM +0200, Michael S. Tsirkin wrote:

> It doesn't actually, and our app would sometimes write to these pages.
> It simply does not care which version does the remote get in this case
> since we track writes and resend later.

Heh, somehow I thought you might say that :)

A new flag seems like the only way then - maybe:
  IBV_ACCESS_NON_COHERENT - The adaptor and the CPU do not share
     a coherent view of registered memory. Memory writes from the CPU
     after ibv_reg_mr completes may not be reflected in the memory
     viewed by the adaptor.

     Can only be combined with read only access permissions.

Jason
Michael S. Tsirkin - March 21, 2013, 6:03 p.m.
On Thu, Mar 21, 2013 at 11:57:32AM -0600, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2013 at 07:42:37PM +0200, Michael S. Tsirkin wrote:
> 
> > It doesn't actually, and our app would sometimes write to these pages.
> > It simply does not care which version does the remote get in this case
> > since we track writes and resend later.
> 
> Heh, somehow I thought you might say that :)
> 
> A new flag seems like the only way then - maybe:
>   IBV_ACCESS_NON_COHERENT - The adaptor and the CPU do not share
>      a coherent view of registered memory. Memory writes from the CPU
>      after ibv_reg_mr completes may not be reflected in the memory
>      viewed by the adaptor.
> 
>      Can only be combined with read only access permissions.
> 
> Jason

I kind of like _GIFT for name, gifts are nice :) But yes that's exactly
the semantics we need.
Michael S. Tsirkin - March 21, 2013, 6:16 p.m.
On Thu, Mar 21, 2013 at 11:11:15AM -0600, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2013 at 11:39:47AM +0200, Michael S. Tsirkin wrote:
> > On Thu, Mar 21, 2013 at 02:13:38AM -0700, Roland Dreier wrote:
> > > On Thu, Mar 21, 2013 at 1:51 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >> In that case, no, I don't see any reason for LOCAL_WRITE, since the
> > > >> only RDMA operations that will access this memory are remote reads.
> > > >
> > > > What is the meaning of LOCAL_WRITE then? There are no local
> > > > RDMA writes as far as I can see.
> > > 
> > > Umm, it means you're giving the local adapter permission to write to
> > > that memory.  So you can use it as a receive buffer or as the target
> > > for remote data from an RDMA read operation.
> > 
> > Well RDMA read has it's own flag, IB_ACCESS_REMOTE_READ.
> > I don't see why do you need to give adapter permission
> 
> The access flags have to do with what independent access remote nodes
> get. There are four major cases:
> 
> access = IBV_ACCESS_REMOTE_READ says the adaptor will let remote nodes
> read the memory.
> 
> access = 0 (ie IBV_ACCESS_LOCAL_READ) says that only the adaptor, under
> the direct control of the application, can read this memory. Remote
> nodes are barred.
> 
> access = IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_LOCAL_WRITE says the adaptor
> will let remote nodes write the memory
> 
> access = IBV_ACCESS_LOCAL_WRITE bars remote nodes from writing to that
> memory. Only the adaptor, under the direct control of the application,
> can write the memory.

This is the one I find redundant. Since the write will be done
by the adaptor under direct control by the application,
why does it make sense to declare this beforehand?
If you don't want to allow local write access to memory,
just do not post any receive WRs with this address.
If you posted and regret it, reset the QP to cancel.

But IB spec specified LOCAL_WRITE in this redundant way so I guess
applications expect it to have the semantics defined there, I just
didn't remember what they are.

No way around it then, need another flag.
Jason Gunthorpe - March 21, 2013, 6:41 p.m.
On Thu, Mar 21, 2013 at 08:16:33PM +0200, Michael S. Tsirkin wrote:

> This is the one I find redundant. Since the write will be done by
> the adaptor under direct control by the application, why does it
> make sense to declare this beforehand?  If you don't want to allow
> local write access to memory, just do not post any receive WRs with
> this address.  If you posted and regret it, reset the QP to cancel.

This is to support your COW scenario - the app declares before hand to
the kernel that it will write to the memory and the kernel ensures
pages are dedicated to the app at registration time. Or the app says
it will only read and the kernel could leave them shared.

The adaptor enforces the access control to prevent a naughty app from
writing to shared memory - think about mmap'ing libc.so and then using
RDMA to write to the shared pages. It is necessary to ensure that is
impossible.

Jason
Michael S. Tsirkin - March 21, 2013, 7:15 p.m.
On Thu, Mar 21, 2013 at 12:41:35PM -0600, Jason Gunthorpe wrote:
> On Thu, Mar 21, 2013 at 08:16:33PM +0200, Michael S. Tsirkin wrote:
> 
> > This is the one I find redundant. Since the write will be done by
> > the adaptor under direct control by the application, why does it
> > make sense to declare this beforehand?  If you don't want to allow
> > local write access to memory, just do not post any receive WRs with
> > this address.  If you posted and regret it, reset the QP to cancel.
> 
> This is to support your COW scenario - the app declares before hand to
> the kernel that it will write to the memory and the kernel ensures
> pages are dedicated to the app at registration time. Or the app says
> it will only read and the kernel could leave them shared.

Someone here is confused. LOCAL_WRITE/absence of it does not address
COW, it breaks COW anyway.  Are you now saying we should change rdma so
without LOCAL_WRITE it will not break COW?

> The adaptor enforces the access control to prevent a naughty app from
> writing to shared memory - think about mmap'ing libc.so and then using
> RDMA to write to the shared pages. It is necessary to ensure that is
> impossible.
> 
> Jason

That's why it's redundant: we can't trust an application to tell us
'this page is writeable', we must get this info from kernel.  And so
there's apparently no need for application to tell adaptor about
LOCAL_WRITE.
Jason Gunthorpe - March 21, 2013, 8:09 p.m.
On Thu, Mar 21, 2013 at 09:15:41PM +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 21, 2013 at 12:41:35PM -0600, Jason Gunthorpe wrote:
> > On Thu, Mar 21, 2013 at 08:16:33PM +0200, Michael S. Tsirkin wrote:
> > 
> > > This is the one I find redundant. Since the write will be done by
> > > the adaptor under direct control by the application, why does it
> > > make sense to declare this beforehand?  If you don't want to allow
> > > local write access to memory, just do not post any receive WRs with
> > > this address.  If you posted and regret it, reset the QP to cancel.
> > 
> > This is to support your COW scenario - the app declares before hand to
> > the kernel that it will write to the memory and the kernel ensures
> > pages are dedicated to the app at registration time. Or the app says
> > it will only read and the kernel could leave them shared.
> 
> Someone here is confused. LOCAL_WRITE/absence of it does not address
> COW, it breaks COW anyway.  Are you now saying we should change rdma so
> without LOCAL_WRITE it will not break COW?

I am talking about 'from a spec' perspective - not what Linux does
today. The absence of LOCAL_WRITE is part of the specification to
support shared pages.

Pages can only be kept shared if all the ACCESS WRITE bits are clear -
today Linux always breaks the COW, but if you patch in the ability to
keep things shared then it must only happen when *all* the ACCESS
WRITE bits are clear.

> > The adaptor enforces the access control to prevent a naughty app from
> > writing to shared memory - think about mmap'ing libc.so and then using
> > RDMA to write to the shared pages. It is necessary to ensure that is
> > impossible.

> That's why it's redundant: we can't trust an application to tell us
> 'this page is writeable', we must get this info from kernel.  And so
> there's apparently no need for application to tell adaptor about
> LOCAL_WRITE.

The API design gives user space maximum flexibility, if it wants to
create an enforced no-write MR in otherwise writable pages by skipping
LOCAL_WRITE then it can do so.

The kernel's role in this should be to deny ibv_reg_mr with WRITE bits
set if the pages are not writable by the app - I don't know if it does
this today, it isn't critically important as long as the pages are
unshared.

Jason

Patch

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a841123..5929598 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -152,7 +152,7 @@  struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
 		ret = get_user_pages(current, current->mm, cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
-				     1, !umem->writable, page_list, vma_list);
+				     !umem->writable, 1, page_list, vma_list);
 
 		if (ret < 0)
 			goto out;